Kenneth Ward Church, Weizhong Zhu, Jason W. Pelecanos
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods
DOI: 10.18653/v1/W16-6008 (https://doi.org/10.18653/v1/W16-6008)
Citations: 1
C2D2E2: Using Call Centers to Motivate the Use of Dialog and Diarization in Entity Extraction
This paper introduces a deceptively simple entity extraction task intended to encourage more interdisciplinary collaboration between fields that don’t normally work together: diarization, dialog and entity extraction. Given a corpus of 1.4M call center calls, extract mentions of trouble ticket numbers. The task is challenging because first mentions need to be distinguished from confirmations to avoid undesirable repetitions. It is common for agents to say part of the ticket number, and customers confirm with a repetition. There are opportunities for dialog (given/new) and diarization (who said what) to help remove repetitions. New information is spoken slowly by one side of a conversation; confirmations are spoken more quickly by the other side of the conversation.

1 Extracting Ticket Numbers

Much has been written on extracting entities from text (Etzioni et al., 2005), and even speech (Kubala et al., 1998), but less has been written in the context of dialog (Clark and Haviland, 1977) and diarization (Tranter and Reynolds, 2006; Anguera et al., 2012; Shum, 2011). This paper describes a ticket extraction task illustrated in Table 1. The challenge is to extract a 7 byte ticket number, “902MDYK,” from the dialog. Confirmations ought to improve communication, but steps need to be taken to avoid undesirable repetition in extracted entities. Dialog theory suggests it should be possible to distinguish first mentions (bold) from confirmations (italics) based on prosodic cues such as pitch, energy and duration.

t0      t1      S1 / S2
278.16  281.07  I do have the new hardware case number for you when you’re ready
282.60  282.85  okay
284.19  284.80  nine
285.03  285.86  zero
286.22  286.74  two
290.82  291.30  nine
292.87  293.95  zero two
297.87  298.24  okay
299.30  300.49  M. as in Mike
301.97  303.56  D. as in delta
304.89  306.31  Y. as in Yankee
307.50  308.81  K. as in kilo
310.14  310.57  okay
310.77  311.70  nine zero two
311.73  312.49  M. D.
312.53  313.18  Y. T.
313.75  314.21  correct
314.21  317.28  and thank you for calling IBM is there anything else I can assist you with

Table 1: A ticket dialog: 7 bytes (902MDYK) at 1.4 bps. First mentions (bold) are slower than confirmations (italics).

phone matches   calls   ticket matches (edit dist)
66%             238     0
59%             82      1
55%             40      2
4.1%            4033    3+

Table 2: Phone numbers are used to confirm ticket matches. Good ticket matches (top row) are confirmed more often than poor matches (bottom row). Poor matches are more common because ticket numbers are relatively rare, and most calls don’t
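
The quantities in Tables 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on several assumptions: the helper names are hypothetical; the first "nine / zero / two" sequence is taken to be the first mention and the final "nine zero two" to be a confirmation (the bold and italic marking is not available here); the 1.4 bps figure is read as 7 bytes of 8 bits each divided by the duration of the whole excerpt; and "edit dist" is taken to mean plain Levenshtein distance, which the text does not specify.

```python
# Illustrative sketch only; helper names are hypothetical, timings are from Table 1.

def span_seconds(segments):
    """Elapsed time from the start of the first segment to the end of the last."""
    return segments[-1][1] - segments[0][0]


def tokens_per_second(segments):
    """Whitespace-separated tokens spoken per second over the whole span."""
    n_tokens = sum(len(text.split()) for _, _, text in segments)
    return n_tokens / span_seconds(segments)


def edit_distance(a, b):
    """Levenshtein distance (assumed variant): insert, delete, substitute cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Assumed first mention of the digits: three separate, slowly spoken segments.
first_mention = [
    (284.19, 284.80, "nine"),
    (285.03, 285.86, "zero"),
    (286.22, 286.74, "two"),
]
# Assumed confirmation: the same digits read back in one quick segment.
read_back = [(310.77, 311.70, "nine zero two")]

# Assumed reading of "7 bytes at 1.4 bps": 56 bits over the excerpt's duration.
excerpt_seconds = 317.28 - 278.16
bits_per_second = 7 * 8 / excerpt_seconds

print("first mention tokens/s:", round(tokens_per_second(first_mention), 2))
print("read-back tokens/s:", round(tokens_per_second(read_back), 2))
print("bits per second:", round(bits_per_second, 1))
print("edit distance:", edit_distance("902MDYK", "902MDYT"))
```

Under these assumptions the read-back is spoken several times faster than the first mention, which is the duration cue the abstract describes. The last line compares the ticket number with the read-back as transcribed in Table 1 ("Y. T." instead of "Y. K."), the kind of near match that the edit-distance column in Table 2 counts.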