{"title":"Session details: Keynote address","authors":"Miao A. Chen","doi":"10.1145/3250039","DOIUrl":"https://doi.org/10.1145/3250039","url":null,"abstract":"","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124751326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mining semantics for culturomics: towards a knowledge-based approach","authors":"L. Borin, Devdatt P. Dubhashi, Markus Forsberg, Richard Johansson, D. Kokkinakis, P. Nugues","doi":"10.1145/2513549.2513551","DOIUrl":"https://doi.org/10.1145/2513549.2513551","url":null,"abstract":"The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily and older texts being digitized in cultural heritage projects grows at an accelerating rate. These volumes of text being available in digital form have grown far beyond the capacity of human readers, leaving automated semantic processing of the texts as the only realistic option for accessing and using the information contained in them. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of Big Swedish text and focus on the theoretical and methodological advancement of the state of the art in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114789389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing future communities in growing citation networks","authors":"Sukhwan Jung, Aviv Segev","doi":"10.1145/2513549.2513553","DOIUrl":"https://doi.org/10.1145/2513549.2513553","url":null,"abstract":"Citation networks contain temporal information about what researchers are interested in at a certain time. A community in such a network is built around either a renowned researcher or a common research field; either way, analyzing how the community will change in the future will give insight into the research trend in the future. The paper proposes methods to analyze how communities change over time in the citation network graph without additional external information and based on node and link prediction and community detection. Different combinations of the proposed methods are also analyzed. Experiments show that the proposed methods can identify the changes in citation communities multiple years in the future with performance differing according to the analyzed time span. Furthermore, the method is shown to produce higher performance when analyzing communities to be disbanded and to be formed in the future.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116287896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Review rating prediction based on the content and weighting strong social relation of reviewers","authors":"Bing-kun Wang, Yulin Min, Yongfeng Huang, Xing Li, Fangzhao Wu","doi":"10.1145/2513549.2513554","DOIUrl":"https://doi.org/10.1145/2513549.2513554","url":null,"abstract":"Review rating is more helpful than review binary classification for many decision processes such as consumption decision-making, company product quality tracking and public opinion mining. In the review rating, reviewers are influenced not only by their own subjective feelings, but also by others' rating to the same product. Existing review rating prediction methods are mainly based on the content of reviews, which only consider the subjective factors of reviewers, but not consider the impact of other people in the social relations of reviewers. Based on it, we propose a review rating prediction method by incorporating the character of reviewer's social relations, as regularization constraints, into content-based methods. In addition, we further propose a method to classify the social relations of reviewers into strong social relation and ordinary social relation. For strong social relation of reviewers, we give higher weight than ordinary social relation when incorporating the two social relations into content-based methods. Experiments on two real movie review datasets demonstrate that the method of considering different social relations has better performance than the content-based methods and the method of considering social relations as a whole.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116168233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Information fusion in taxonomic descriptions","authors":"Qin Wei","doi":"10.1145/2513549.2513552","DOIUrl":"https://doi.org/10.1145/2513549.2513552","url":null,"abstract":"Providing a single access point to an information system from multiple documents is helpful for biodiversity researchers as it is true in many fields. It not only saves the time for going back and forth from different sources but also provides the opportunity to generate new information out of the complementary information in different sources and levels of description. This paper investigates the potential of information fusion techniques in biodiversity area since the researchers in this domain desperately need information from different sources to verify their decision. In another sense, there are massive amounts of collections in this area. It is not easy or even possible for the researcher to manually collect information from different places. The proposed system contains 4 steps: Text segmentation and Taxonomic Name Identification, Organ-level and Sub-organ level Information Extraction, Relationship Identification, and Information fusion. Information fusion is based on the seven out of the twenty-four relationships in CST (Cross-document Sentence Theory). We argue that this kind of information fusion system might not only save the researchers the time for going back and forth from different sources but also provides the opportunity to generate new information out of the complementary information in different sources and levels.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131535046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sentiment analysis of sentences with modalities","authors":"Yang Liu, Xiaohui Yu, Zhongshuai Chen, Bingbing Liu","doi":"10.1145/2513549.2513556","DOIUrl":"https://doi.org/10.1145/2513549.2513556","url":null,"abstract":"This paper is concerned with sentiment analysis of sentences with modality. Modality is a commonly occurring linguistic phenomenon. Due to its special characteristics, the sentiment borne by modality may be hard to determine by existing methods. We first present a linguistic analysis of modality, and then identify some valuable features to train a support vector machine classifier to determine the sentiment orientation of such sentences. We show experimental results on sentences with modality that are extracted from the reviews of four different products to illustrate the effectiveness of the proposed method.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123003783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting topic tracking in real-time tweet streams","authors":"Yihong Hong, Yue Fei, Jianwu Yang","doi":"10.1145/2513549.2513555","DOIUrl":"https://doi.org/10.1145/2513549.2513555","url":null,"abstract":"Microblogs such as Twitter have become an increasingly popular source of real-time information. Users tend to keep up-to-date with the developments of topics they are interested in. In this paper, we present an effective real-time tweets filtering system to exploit topic tracking in social media streams. We combine background corpus with foreground corpus to handle the cold start problem. Then we build the Content Model to describe the characteristics of tweets, in which we utilize the link information to expand tweets' content aiming at enriching the semantic information of tweets, and we also analyze the influence of tweet's quality measured by a group of well-defined symbols. Moreover, the Pseudo Relevance Feedback approach triggered by a fixed-width temporal sliding window is employed to adapt our system to the alteration of topics over time. Experimental results on Tweet11 corpus indicate that our system achieves good performance in both T11SU and F-0.5 metrics, and the proposed system has better performance than the best one of TREC2012 real-time filtering pilot task.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121387393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big data opportunities and challenges for IR, text mining and NLP","authors":"Beth Plale","doi":"10.1145/2513549.2514739","DOIUrl":"https://doi.org/10.1145/2513549.2514739","url":null,"abstract":"Big Data poses challenges for text analysis and natural language processing due to its characteristics of volume, veracity, and velocity of the data. The sheer volume in terms of numbers of documents challenges traditional local repository and index systems for large-scale analysis and mining. Computation, storage and data representation must work together to provide rapid access, search, and mining of the deep knowledge in the large text collection. Text under copyright poses additional barriers to computational access, where analysis has to be separated from human consumption of the original text. Data preprocessing, in most cases, remains a daunting task for big textual data particularly data veracity is questionable due to age of original materials. Data velocity is rate of change of the data but can also be the rate at which changes and corrections are made. The HathiTrust Research Center (HTRC) provides new opportunities for IR, NLP and text mining research. HTRC is the research arm of HathiTrust, a consortium that stewards the digital library of content from research libraries around the country. With close to 11 million volumes in HathiTrust collection, HTRC aims to provide large-scale computational access and analytics to these text resources. With the goal of facilitating scholar's work, HTRC establishes a cyberinfrastructure of software, staff, and services to assist researchers and developers more easily process and mine large scale textual data effectively and efficiently. The primary users of HTRC are digital humanities, informatics, and librarians. They are of different research backgrounds and expertise and thus a variety of tools are made available to them. In the HTRC model of computing, computation moves to the data, and services grow up around the corpus to serve the research community. In this manner, the architecture is cloud-based. Moving algorithms to the data is important because the copyrighted content must be protected, however, a side benefit is that the paradigm frees scholars from worrying about managing a large corpus of data. The text analytics currently supported in HTRC is the SEASR suite of analytical algorithms (www.seasr.org). SEASR algorithms, which are written as workflows, include entity extraction, tag cloud, topic modeling, NaiveBayes, Date Entities to Similie Timeline. In this talk, I introduce the collections, architecture, and text analytics of HTRC, with a focus on the challenges of a BigData corpus and what that means for data storage, access, and large-scale computation. HTRC is building a user community to better understand and support researcher needs. It opens many exciting possibilities for the NLP, text mining, IR types of research: with so large an amount of textual data and many candidate algorithms, with support for researcher contributed algorithms, many interesting research questions emerge and many interesting results are to follow.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124053620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Are words enough?: a study on text-based representations and retrieval models for linking pins to online shops","authors":"Susana Zoghbi, Ivan Vulic, Marie-Francine Moens","doi":"10.1145/2513549.2513557","DOIUrl":"https://doi.org/10.1145/2513549.2513557","url":null,"abstract":"User-generated content offers opportunities to learn about people's interests and hobbies. We can leverage this information to help users find interesting shops and businesses find interested users. However this content is highly noisy and unstructured as posted on social media sites and blogs. In this work we evaluate different textual representations and retrieval models that aim to make sense of social media data for retail applications. Our task is to link the text of pins (from Pinterest.com) to online shops (formed by clustering Amazon.com's products). Our results show that document representations that combine latent concepts with single words yield the best performance.","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"366 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132948234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Paper session","authors":"Xiaozhong Liu","doi":"10.1145/3250040","DOIUrl":"https://doi.org/10.1145/3250040","url":null,"abstract":"","PeriodicalId":126426,"journal":{"name":"Proceedings of the 2013 international workshop on Mining unstructured big data using natural language processing","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122485499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}