Towards scalable summarization and visualization of large text corpora (abstract only)
Tyler Sliwkanich, Douglas Schneider, Aaron Yong, M. Home, Denilson Barbosa
Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12), May 20, 2012. DOI: 10.1145/2213836.2213970
{"title":"面向大型文本语料库的可扩展摘要和可视化(仅限摘要)","authors":"Tyler Sliwkanich, Douglas Schneider, Aaron Yong, M. Home, Denilson Barbosa","doi":"10.1145/2213836.2213970","DOIUrl":null,"url":null,"abstract":"Society is awash with problems requiring the analysis of vast quantities of text and data. From detecting flu trends out of twitter conversations to finding scholarly works answering specific questions, we rely more and more on computers to process text for us. Text analytics is the application of computational, mathematical, and statistical models to derive information from large quantities of data coming primarily as text. Our project provides fast and effective text-analytics tools for large document collections, such as the blogosphere. We use natural language processing and database techniques to extract, collect, analyze, visualize, and archive information extracted from text. We focus on discovering relationships between entities (people, places, organizations, etc.) mentioned in one or more sources (blog posts or news articles). We built a custom solution using mostly off-the-shelf, open-source tools to provide a scalable platform for users to search and analyze large text corpora. Currently, we provide two main outlets for users to discover these relations: (1) full-text search over the documents and (2) graph visualizations of the entities and their relationships. This provides the user with succinct and easily digestible information gleaned from the corpus as a whole. For example, we can easily pose queries like which companies were bought by Google? as entity:google relation:bought. The extracted data is stored on a combination of the noSQL database CouchDB and Apache's Lucene. This combination is justified as our work-flow consists of offline batch insertions with almost no updates. Because we support specialized queries, we can forgo the flexibility of traditional SQL solutions and materialize all necessary indices, which are used to quickly query large amounts of de-normalized data using MapReduce. Lucene provides a flexible and powerful query syntax to yield relevant ranked results to the user. Moreover, its indices are synchronized by a process subscribed to the list of database changes published by CouchDB. The graph visualizations rely on CouchDB's ability to export the data in any format: we currently use a customized graph visualization relying on XML data. Finally, we use memcached to further improve the performance, especially for queries involving popular entities.","PeriodicalId":212616,"journal":{"name":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Towards scalable summarization and visualization of large text corpora (abstract only)\",\"authors\":\"Tyler Sliwkanich, Douglas Schneider, Aaron Yong, M. Home, Denilson Barbosa\",\"doi\":\"10.1145/2213836.2213970\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Society is awash with problems requiring the analysis of vast quantities of text and data. From detecting flu trends out of twitter conversations to finding scholarly works answering specific questions, we rely more and more on computers to process text for us. 
Text analytics is the application of computational, mathematical, and statistical models to derive information from large quantities of data coming primarily as text. Our project provides fast and effective text-analytics tools for large document collections, such as the blogosphere. We use natural language processing and database techniques to extract, collect, analyze, visualize, and archive information extracted from text. We focus on discovering relationships between entities (people, places, organizations, etc.) mentioned in one or more sources (blog posts or news articles). We built a custom solution using mostly off-the-shelf, open-source tools to provide a scalable platform for users to search and analyze large text corpora. Currently, we provide two main outlets for users to discover these relations: (1) full-text search over the documents and (2) graph visualizations of the entities and their relationships. This provides the user with succinct and easily digestible information gleaned from the corpus as a whole. For example, we can easily pose queries like which companies were bought by Google? as entity:google relation:bought. The extracted data is stored on a combination of the noSQL database CouchDB and Apache's Lucene. This combination is justified as our work-flow consists of offline batch insertions with almost no updates. Because we support specialized queries, we can forgo the flexibility of traditional SQL solutions and materialize all necessary indices, which are used to quickly query large amounts of de-normalized data using MapReduce. Lucene provides a flexible and powerful query syntax to yield relevant ranked results to the user. Moreover, its indices are synchronized by a process subscribed to the list of database changes published by CouchDB. The graph visualizations rely on CouchDB's ability to export the data in any format: we currently use a customized graph visualization relying on XML data. Finally, we use memcached to further improve the performance, especially for queries involving popular entities.\",\"PeriodicalId\":212616,\"journal\":{\"name\":\"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2213836.2213970\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2213836.2213970","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Towards scalable summarization and visualization of large text corpora (abstract only)
Society is awash with problems requiring the analysis of vast quantities of text and data. From detecting flu trends in Twitter conversations to finding scholarly works that answer specific questions, we rely more and more on computers to process text for us. Text analytics is the application of computational, mathematical, and statistical models to derive information from large quantities of data, coming primarily as text. Our project provides fast and effective text-analytics tools for large document collections, such as the blogosphere. We use natural language processing and database techniques to extract, collect, analyze, visualize, and archive information drawn from text. We focus on discovering relationships between entities (people, places, organizations, etc.) mentioned in one or more sources (blog posts or news articles).

We built a custom solution, mostly from off-the-shelf open-source tools, to provide a scalable platform for users to search and analyze large text corpora. Currently, we provide two main outlets for users to discover these relations: (1) full-text search over the documents and (2) graph visualizations of the entities and their relationships. This gives the user succinct, easily digestible information gleaned from the corpus as a whole. For example, we can easily pose the question "Which companies were bought by Google?" as the query entity:google relation:bought.

The extracted data is stored in a combination of the NoSQL database CouchDB and Apache Lucene. This combination is justified because our workflow consists of offline batch insertions with almost no updates. Because we support only specialized queries, we can forgo the flexibility of traditional SQL solutions and materialize all necessary indices, which are used to quickly query large amounts of denormalized data using MapReduce. Lucene provides a flexible and powerful query syntax for returning relevant, ranked results to the user, and its indices are kept synchronized by a process subscribed to the list of database changes published by CouchDB. The graph visualizations rely on CouchDB's ability to export the data in any format: we currently use a customized graph visualization driven by XML data. Finally, we use memcached to further improve performance, especially for queries involving popular entities.
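As a concrete illustration of the materialized indices described above, the following is a minimal sketch of a CouchDB MapReduce view over extracted relation triples, queried for the "Which companies were bought by Google?" example. The database name, the `triples` field, and the document schema are assumptions made for illustration; the abstract does not specify the actual layout. Python with the `requests` library stands in for whatever client code the system actually uses.

```python
import requests

COUCH = "http://localhost:5984"  # assumed local CouchDB instance
DB = "corpus"                    # hypothetical database name

# Design document whose map function emits one row per extracted
# (entity, relation, object) triple stored on a document. The map
# function runs inside CouchDB and is written in JavaScript.
design = {
    "views": {
        "by_entity_relation": {
            "map": """
            function (doc) {
              if (doc.triples) {
                doc.triples.forEach(function (t) {
                  // key: [entity, relation] -> value: related entity
                  emit([t.entity, t.relation], t.object);
                });
              }
            }
            """
        }
    }
}

requests.put(f"{COUCH}/{DB}/_design/relations", json=design).raise_for_status()

# Query the materialized index: which companies were bought by Google?
resp = requests.get(
    f"{COUCH}/{DB}/_design/relations/_view/by_entity_relation",
    params={"key": '["google", "bought"]'},  # CouchDB expects JSON-encoded keys
)
for row in resp.json()["rows"]:
    print(row["value"])
```

Because the view is precomputed and persisted by CouchDB, the specialized entity/relation lookup avoids any query-time joins, which is the trade-off the abstract makes against a traditional SQL schema.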
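The abstract also mentions a process subscribed to the list of database changes published by CouchDB, which keeps the Lucene indices synchronized. Below is a minimal sketch of such a subscriber built on CouchDB's continuous `_changes` feed. The `reindex_document` hook is a hypothetical stand-in for the code that updates the Lucene index; it is not from the paper.

```python
import json
import requests

COUCH = "http://localhost:5984"  # assumed local CouchDB instance
DB = "corpus"                    # hypothetical database name

def reindex_document(doc):
    # Hypothetical stand-in for pushing the document into Lucene;
    # here we just log the id to keep the sketch self-contained.
    print("reindex", doc["_id"])

def follow_changes(since="0"):
    """Follow the continuous _changes feed and hand each updated
    document to the indexer."""
    resp = requests.get(
        f"{COUCH}/{DB}/_changes",
        params={"feed": "continuous", "include_docs": "true", "since": since},
        stream=True,
    )
    for line in resp.iter_lines():
        if not line:  # the continuous feed emits blank heartbeat lines
            continue
        change = json.loads(line)
        if "doc" in change:
            reindex_document(change["doc"])
```

This fits the offline, batch-insert workflow: the indexer simply trails the change log rather than intercepting writes.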
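Finally, a hedged sketch of the memcached layer for queries involving popular entities. The key scheme, the five-minute TTL, and the `fetch` callback are illustrative assumptions, not the authors' implementation; `pymemcache` is one common Python client for memcached.

```python
import hashlib
import json
from pymemcache.client.base import Client  # assumes pymemcache is installed

cache = Client(("localhost", 11211))  # assumed local memcached instance

def cached_entity_query(entity, relation, fetch):
    """Memoize results for popular entities. `fetch` is whatever
    function runs the real CouchDB/Lucene query (hypothetical)."""
    key = hashlib.sha1(f"{entity}:{relation}".encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    result = fetch(entity, relation)
    cache.set(key, json.dumps(result), expire=300)  # 5-minute TTL
    return result
```

Since popular entities such as Google dominate the query workload, even a short TTL should absorb most repeated lookups before they reach CouchDB or Lucene.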