Reliability-Aware and Graph-Based Approach for Rank Aggregation of Biological Data

2019 15th International Conference on eScience (eScience); Pub Date: 2019-09-01; DOI: 10.1109/eScience.2019.00022

Pierre Andrieu, Bryan Brancotte, L. Bulteau, Sarah Cohen-Boulakia, A. Denise, A. Pierrot, Stéphane Vialette


Citations: 2

Abstract


Reliability-Aware and Graph-Based Approach for Rank Aggregation of Biological Data

Massive biological datasets are available in public databases and can be queried through portals using keyword queries, which return ranked lists of answers to users. However, querying such portals properly remains difficult, since the same query can be formulated in various ways (e.g., using synonyms). Consequently, users have to manually combine several lists of hundreds of answers into one list. Rank aggregation techniques are particularly well suited to this context: they take a set of rankings of elements as input and provide a consensus, that is, a single ranking which is the "closest" to the input rankings. However, the rank aggregation problem is NP-hard in most cases, and using an exact algorithm is currently not possible for more than a few dozen elements. A plethora of heuristics have thus been proposed, whose behaviour is, by nature, difficult to anticipate: given a set of input rankings, one cannot guarantee how far the consensus ranking provided by a heuristic will be from an exact solution. The two challenges we want to tackle in this paper are the following: (i) providing an approach based on a pre-processing step that decomposes large data sets into smaller ones on which high-quality algorithms can be run, and (ii) providing information to users on the robustness of the positions of elements in the consensus ranking produced. Our approach not only rests on mathematical foundations, offering guarantees on the computed result, but has also been implemented in a real system available to the life science community and tested on various real use cases.
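To make the notion of a "closest" consensus concrete, the following Python sketch may help. It is not the authors' algorithm. It assumes the Kendall-tau distance (the number of element pairs that two rankings order differently), which the abstract does not name, and it assumes rankings without ties over the same elements. The function names `kendall_tau`, `optimal_consensuses`, and `position_ranges` are hypothetical. The exhaustive search tries all n! permutations, which illustrates why exact methods stop scaling quickly, and the position ranges give one simple, assumed reading of "robustness": an element that holds the same position in every optimal consensus is robustly placed.

```python
from itertools import combinations, permutations


def kendall_tau(r1, r2):
    """Count the pairs of elements that r1 and r2 order differently."""
    pos1 = {e: i for i, e in enumerate(r1)}
    pos2 = {e: i for i, e in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )


def optimal_consensuses(rankings):
    """Return (best_score, all rankings reaching it) by exhaustive search.

    Every permutation is tried, so the cost grows as n!; this is only
    usable for a handful of elements.
    """
    elements = sorted(rankings[0])
    best_score, best = None, []
    for candidate in permutations(elements):
        score = sum(kendall_tau(candidate, r) for r in rankings)
        if best_score is None or score < best_score:
            best_score, best = score, [candidate]
        elif score == best_score:
            best.append(candidate)
    return best_score, best


def position_ranges(consensuses):
    """Map each element to the (min, max) 1-based position it takes."""
    ranges = {}
    for ranking in consensuses:
        for i, e in enumerate(ranking, start=1):
            lo, hi = ranges.get(e, (i, i))
            ranges[e] = (min(lo, i), max(hi, i))
    return ranges


# Toy input (made up for illustration): two rankings that disagree on B vs C.
rankings = [["A", "B", "C", "D"], ["A", "C", "B", "D"]]
score, best = optimal_consensuses(rankings)
print(score)                  # 1
print(position_ranges(best))  # {'A': (1, 1), 'B': (2, 3), 'C': (2, 3), 'D': (4, 4)}
```

In this toy input, A and D keep a single position across both optimal consensus rankings, while B and C can swap, so only the placement of A and D would be reported as robust under this assumed definition.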
