J. Peckham, Andrew Perkins, Tayo Obafemi-Ajayi, Xiuzhen Huang
{"title":"NBT (no-boundary thinking): needed to attend to ethical implications of data and AI","authors":"J. Peckham, Andrew Perkins, Tayo Obafemi-Ajayi, Xiuzhen Huang","doi":"10.1145/3535508.3545595","DOIUrl":"https://doi.org/10.1145/3535508.3545595","url":null,"abstract":"In this era of Big Data and AI, expertise in multiple aspects of data, computing, and the domains of application is needed. This calls for teams of experts with different training and perspectives. Because data analysis can have serious ethical implications, it is important that these teams are well and deeply integrated. No-Boundary Thinking (NBT) teams can provide support for team formation and maintenance, thereby attending to the many dimensions of the ethics of data and analysis. In this NBT workshop session, we discuss the ethical concerns that arise from the use of data and AI and their implications for team building, and we provide and brainstorm suggestions for ethical data-enabled science and AI.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114957405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MoDNA","authors":"Weizhi An, Yuzhi Guo, Yatao Bian, Hehuan Ma, Jinyu Yang, Chunyuan Li, Junzhou Huang","doi":"10.1145/3535508.3545512","DOIUrl":"https://doi.org/10.1145/3535508.3545512","url":null,"abstract":"Obtaining informative representations of gene expression is crucial in predicting various downstream regulatory-related tasks such as promoter prediction and transcription factor binding site prediction. Nevertheless, current supervised learning with insufficient labeled genomes limits the generalization capability of training a robust predictive model. Recently, researchers have modeled DNA sequences with self-supervised training and transferred the pre-trained genome representations to various downstream tasks. Instead of directly transferring masked language learning to DNA sequence learning, we incorporate prior knowledge into genome language modeling representations. We propose a novel Motif-oriented DNA (MoDNA) pre-training framework, which is self-supervised by design and can be fine-tuned for different downstream tasks. MoDNA effectively learns semantic-level genome representations from enormous unlabelled genome data, and is more computationally efficient than previous methods. We pre-train MoDNA on human genome data and fine-tune it on downstream tasks. 
Extensive experimental results on promoter prediction and transcription factor binding sites prediction demonstrate the state-of-the-art performance of MoDNA.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122755612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An evolutionary approach to data valuation","authors":"Natalia Khuri, Sapana Bhandari, Esteban Murillo Burford, Nathan P. Whitener, Konghao Zhao","doi":"10.1145/3535508.3545522","DOIUrl":"https://doi.org/10.1145/3535508.3545522","url":null,"abstract":"Data valuation in machine learning comprises computational methods for the estimation of the importance of individual training instances. It has been used to remove noise, uncover biases, and improve the accuracy of trained models. Current data valuation techniques do not scale to large datasets and do not work for regression tasks, where the objective is to predict a numerical outcome rather than a small number of nominal class labels. In this work, an evolutionary approach for qualitative and quantitative data valuation is presented. The proposed approach is tested on regression and classification benchmarks, and on several bioinformatics and health informatics datasets. In addition, models trained with the most valuable subsets of data are validated on independently acquired tests, demonstrating the generalizability as well as the practical utility of the proposed approach.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128468910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Haplotype-aware variant selection for genome graphs","authors":"Neda Tavakoli, Daniel Gibney, S. Aluru","doi":"10.1145/3535508.3545556","DOIUrl":"https://doi.org/10.1145/3535508.3545556","url":null,"abstract":"Graph-based genome representations have proven to be a powerful tool in genomic analysis due to their ability to encode variations found in multiple haplotypes and capture population genetic diversity. Such graphs also unavoidably contain paths which switch between haplotypes (i.e., recombinant paths) and thus do not fully match any of the constituent haplotypes. The number of such recombinant paths increases combinatorially with path length and causes inefficiencies and false positives when mapping reads. In this paper, we study the problem of finding reduced haplotype-aware genome graphs that incorporate only a selected subset of variants, yet contain paths corresponding to all α-long substrings of the input haplotypes (i.e., non-recombinant paths) with at most δ mismatches. Solving this problem optimally, i.e., minimizing the number of variants selected, was previously shown to be NP-hard [14]. Here, we first establish several inapproximability results regarding finding haplotype-aware reduced variation graphs of optimal size. 
We then present an integer linear programming (ILP) formulation for solving the problem, and experimentally demonstrate this is a computationally feasible approach for real-world problems and provides far superior reduction compared to prior approaches.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125596660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of cohort-based identical-by-descent (IBD) segment finding methods for endogamous populations","authors":"Huyen T. Dang, Shi Jie Samuel Tan, Sara Mathieson","doi":"10.1145/3535508.3545104","DOIUrl":"https://doi.org/10.1145/3535508.3545104","url":null,"abstract":"Segments of DNA that are inherited from a common ancestor are referred to as identical-by-descent (IBD). Because these segments are inherited, they not only allow us to study population characteristics and the sharing of rare variants but also understand the hidden familial relationships within populations. Over the past two decades, various IBD finding algorithms have been developed using hidden Markov model (HMM), hashing and extension, and Burrows-Wheeler Transform (BWT) approaches. In this study, we investigate the utility of pedigree information in enhancing the efficacy of IBD finding methods for endogamous populations. With the increasing prevalence of computationally efficient sequencing technology and proper documentation of pedigree structures, we expect complete pedigree information to become readily available for more populations. While IBD segments have been used to reconstruct pedigrees [1], because we now have access to the pedigree, it is a natural question to ask if including pedigree information would substantially improve IBD segment finding for the purpose of studying inheritance. Our contributions center around the proposition of two types of IBD finding algorithms for reducing the number of false positives in the detected IBD segments. Both methods analyze the familial relationships between cohorts of individuals who are initially hypothesized to share IBD segments. Our first algorithm is inspired by a k-nearest neighbors (KNN) algorithm [2] where we perform outlier detection on the cohort of IBD-sharing individuals. The metric for proximity is determined by the kinship coefficient evaluated from the pairwise relationships between individuals from the cohort. 
Our second algorithm is inspired by the Bonsai algorithm [3] and uses multiple hypothesis tests to evaluate if an individual has much more IBD than is expected by chance. The Bonsai IBD detection algorithm first divides the pedigree into multiple cohorts of family members with no shared individuals, proceeds to pick the two cohorts with the most shared IBD, and performs a hypothesis test between individuals in the first cohort against everyone in the second cohort. If the null hypothesis is rejected, we remove the individual from the cohort, recompute the common ancestor, and recurse on the remaining individual and the new cohort. Essentially, we account for recombination rates on top of Bonsai's hypothesis test computations. Our algorithms are evaluated against simulations of an endogamous Amish population to determine their efficacy in removing false positive IBD segments.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133415110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-omics graph database for data integration and knowledge extraction","authors":"Suyeon Kim, I. Thapa, H. Ali","doi":"10.1145/3535508.3545517","DOIUrl":"https://doi.org/10.1145/3535508.3545517","url":null,"abstract":"Major recent advances in sequencing technologies have created new opportunities for studying the complex microbiome domain. However, microbial communities have many unknown roles and unclear impacts on their host environment. The increased availability of microbial omics data associated with heterogeneous metadata has the potential to revolutionize microbiome research. This study proposes a novel data-integration model and a practical pipeline to explore microbial community functions with the integration of omics data. Three case studies were employed to highlight the advanced abilities and applications of our graph database model. Furthermore, we show that a variety of information can be queried against our model and easily extracted using the proposed analysis pipeline. Our findings suggest that the proposed model is highly queryable and provides a critical analytical platform to extract useful knowledge from multi-omics data. We show that such knowledge extraction can lead to new discoveries, particularly when utilizing all available datasets.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131635590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andrew Hornback, Wenqi Shi, F. Giuste, Yuanda Zhu, A. Carpenter, Coleman Hilton, Vinieth N. Bijanki, Hiram Stahl, G. Gottesman, C. Purnell, H. Iwinski, J. M. Wattenbarger, May D. Wang
{"title":"Development of a generalizable multi-site and multi-modality clinical data cloud infrastructure for pediatric patient care","authors":"Andrew Hornback, Wenqi Shi, F. Giuste, Yuanda Zhu, A. Carpenter, Coleman Hilton, Vinieth N. Bijanki, Hiram Stahl, G. Gottesman, C. Purnell, H. Iwinski, J. M. Wattenbarger, May D. Wang","doi":"10.1145/3535508.3545565","DOIUrl":"https://doi.org/10.1145/3535508.3545565","url":null,"abstract":"World-renowned pediatric patient care in scoliosis, craniofacial, orthopedic, and other life-altering conditions is provided at the international Shriners Children's hospital system. The impact of scoliosis can be extreme with significant curvature of the spine that often progresses during childhood periods of growth and development. Gauging the impact of treatment is vital throughout the diagnostic and treatment process and is achieved using radiographic imaging and patient-reported feedback surveys. Surgeons from multiple clinical centers have amassed a wealth of patient data from more than 1,000 scoliosis patients. However, these data are difficult to access due to data heterogeneity and poor interoperability between complex hospital systems. These barriers significantly decrease the value of these data to improve patient care. To solve these challenges, we create a generalizable multi-site and multi-modality cloud infrastructure for managing the clinical data of multiple diseases. First, we establish a standardized and secure research data repository using the Fast Healthcare Interoperability Resources (FHIR) standard to harmonize multi-modal clinical data from different hospital sites. Additionally, we develop a SMART-on-FHIR application with a user-friendly graphical user interface (GUI) to enable non-technical users to access the harmonized clinical data. We demonstrate the generalizability of our solution by expanding it to also facilitate craniofacial microsomia and pediatric bone disease imaging research. 
Ultimately, we present a generalized framework for multi-site, multimodal data harmonization, which can efficiently organize and store data for clinical research to improve pediatric patient care.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130663449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Sequence analysis","authors":"Ziqi Ke","doi":"10.1145/3552470","DOIUrl":"https://doi.org/10.1145/3552470","url":null,"abstract":"","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131963592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Vizza, Mattia Cannistrà, R. Giancotti, P. Veltri
{"title":"Image processing segmentation algorithms evaluation through implementation choices","authors":"P. Vizza, Mattia Cannistrà, R. Giancotti, P. Veltri","doi":"10.1145/3535508.3545593","DOIUrl":"https://doi.org/10.1145/3535508.3545593","url":null,"abstract":"Medical image processing plays an increasingly important role in enabling accurate diagnosis, which is essential for identifying and treating chronic diseases. We focus on image processing techniques such as segmentation, and we report implementation experiences and tests in different programming languages. Our results concern the use and implementation of the K-means algorithm to analyze T1-weighted MRI images of 233 subjects. The dataset is available online and contains images of three different brain tumors (meningioma, glioma and pituitary tumor). We report the results of implementing the K-means algorithm in two different programming languages, Java and Octave, and measure their respective performance.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134516698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
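The image segmentation entry above applies K-means to MRI images. As a rough illustration only, the following Python sketch clusters grayscale pixel intensities into k groups and maps the labels back onto the image grid. It is not the authors' Java or Octave implementation: the function names `kmeans_1d` and `segment`, the evenly spaced centroid initialization, and the tiny sample image are all assumptions made for this example.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar intensities into k groups (hypothetical helper)."""
    lo, hi = min(values), max(values)
    # Assumption: deterministic start, centroids spread evenly over the range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


def segment(image, k):
    """Segment a 2D list of grayscale pixels into k intensity classes."""
    flat = [p for row in image for p in row]
    labels, centroids = kmeans_1d(flat, k)
    width = len(image[0])
    mask = [labels[i:i + width] for i in range(0, len(flat), width)]
    return mask, centroids


# Made-up 2x3 "image" with dark and bright pixels.
mask, centroids = segment([[10, 12, 200], [11, 210, 205]], k=2)
print(mask)       # [[0, 0, 1], [0, 1, 1]]
print(centroids)  # [11.0, 205.0]
```

Real MRI pipelines would typically operate on array data with a numerical library and may cluster on more than raw intensity; the pure-Python version here only shows the assignment and update steps.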
{"title":"Performance portability study of epistasis detection using SYCL on NVIDIA GPU","authors":"Zheming Jin, J. Vetter","doi":"10.1145/3535508.3545591","DOIUrl":"https://doi.org/10.1145/3535508.3545591","url":null,"abstract":"We describe the experience of converting a CUDA implementation of a high-order epistasis detection algorithm to SYCL. Our goal is to provide application and compiler developers with a detailed description of migration paths between CUDA and SYCL. Evaluating the CUDA and SYCL applications on an NVIDIA V100 GPU, we find that loop unrolling needs to be applied manually to the SYCL kernel to obtain comparable performance. The performance of the SYCL group reduce function, an alternative to the CUDA warp-based reduction, depends on the problem and work group sizes. The 64-bit popcount operation implemented with a tree of adders is slightly faster than the built-in popcount operation. When the number of OpenMP threads is four, the highest performance achieved by the SYCL and CUDA applications is comparable.","PeriodicalId":354504,"journal":{"name":"Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116588102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
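The epistasis detection entry above mentions a 64-bit popcount built from a tree of adders. A minimal sketch of that general bit-counting technique follows, written in Python for readability rather than in SYCL or CUDA. The function name `popcount64_tree` is hypothetical, and nothing here is taken from the authors' kernel; it only shows how pairwise partial sums are combined level by level.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def popcount64_tree(x: int) -> int:
    """Count set bits in a 64-bit word with a six-level adder tree."""
    x &= MASK64  # Python ints are unbounded, so clamp to 64 bits first
    # Level 1: add neighboring bits, giving 32 two-bit sums.
    x = (x & 0x5555555555555555) + ((x >> 1) & 0x5555555555555555)
    # Level 2: add neighboring 2-bit sums, giving 16 four-bit sums.
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    # Level 3: add neighboring 4-bit sums, giving 8 byte-wide sums.
    x = (x & 0x0F0F0F0F0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
    # Levels 4 to 6: keep folding until one sum spans the whole word.
    x = (x & 0x00FF00FF00FF00FF) + ((x >> 8) & 0x00FF00FF00FF00FF)
    x = (x & 0x0000FFFF0000FFFF) + ((x >> 16) & 0x0000FFFF0000FFFF)
    x = (x & 0x00000000FFFFFFFF) + ((x >> 32) & 0x00000000FFFFFFFF)
    return x


print(popcount64_tree(0b1011))  # 3
print(popcount64_tree(MASK64))  # 64
```

On a GPU the same masks and shifts would operate on a fixed-width unsigned 64-bit type, so the initial clamp would not be needed; whether this beats a built-in popcount depends on the hardware and compiler, as the abstract's measurement suggests.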