Automated X-ray diffraction of irradiated materials
J. A. Rodman, Yuewei Lin, D. Sprouster, L. Ecker, Shinjae Yoo
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085053
Abstract: Synchrotron-based X-ray diffraction (XRD) and small-angle X-ray scattering (SAXS) characterization of unirradiated and irradiated reactor pressure vessel steels yields large amounts of data. Machine learning techniques offer a novel way to analyze and visualize these large data sets in order to determine the effects of chemistry and irradiation conditions on the formation of radiation-induced precipitates. Before such analysis can run, the data must be preprocessed to convert it to a usable format and to mask the 2-D detector images to account for experimental variations. Once preprocessed, the data can be organized and visualized using principal component analysis (PCA), multi-dimensional scaling, and k-means clustering. These techniques show that sample chemistry has a notable effect on the formation of radiation-induced precipitates in reactor pressure vessel steels.
A transfer learning approach to parking lot classification in aerial imagery
Daniel Cisek, M. Mahajan, J. Dale, Susan Pepper, Yuewei Lin, Shinjae Yoo
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085049
Abstract: The importance of satellite imagery analysis has increased dramatically over the last several years, keeping pace with rapid improvements in both remote sensing platforms and sensors. As the field expands, so does interest in using machine learning to automate parts of the imagery analyst's workflow. In this paper we address one aspect of this challenge: a method for the automatic extraction of parking lots from aerial imagery. To the best of our knowledge, no prior work has developed an end-to-end pipeline for this particular task. Because our dataset is small, and to accommodate the potentially limited size of future datasets, we propose a deep learning approach based on transfer learning: state-of-the-art convolutional neural networks (CNNs), pretrained on general image classification datasets, are fine-tuned on our custom dataset to establish a comprehensive benchmark for the task. Our method exhibits promising results for automatic parking lot extraction and is generalizable enough to work with different input types, including high-resolution aerial orthoimagery, satellite imagery, full motion video (FMV), and UAV imagery.
A scientific data provenance harvester for distributed applications
E. Stephan, B. Raju, T. Elsethagen, Line C. Pouchard, Carlos Gamboa
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085041
Abstract: Data provenance gives scientists a way to observe how experimental data originates, conveys process history, and explains influential factors such as experimental rationale and the environmental factors reflected in system metrics measured at runtime. The US Department of Energy Office of Science Integrated end-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows (IPPD) project has developed a provenance harvester capable of collecting observations from the file-based evidence typically produced by distributed applications. To achieve this, file-based evidence is extracted and transformed into an intermediate data format, inspired in part by the W3C CSV on the Web recommendations, called the Harvester Provenance Application Interface (HAPI) syntax. This syntax provides a general means to pre-stage provenance into messages that are both human readable and capable of being written to a provenance store, the Provenance Environment (ProvEn). HAPI is being applied to harvest provenance from climate ensemble runs for the Accelerated Climate Modeling for Energy (ACME) project, funded under the U.S. Department of Energy's Office of Biological and Environmental Research (BER) Earth System Modeling (ESM) program. ACME informally provides provenance in native form through configuration files, directory structures, and log files that contain success/failure indicators, code traces, and performance measurements. Because of its generic format, HAPI is also being applied to harvest tabular job-management provenance from Belle II DIRAC scheduler relational database tables, as well as from other scientific applications that log provenance-related information.
{"title":"Statistical data reduction for streaming data","authors":"Kesheng Wu, Dongeun Lee, A. Sim, Jaesik Choi","doi":"10.1109/NYSDS.2017.8085035","DOIUrl":"https://doi.org/10.1109/NYSDS.2017.8085035","url":null,"abstract":"Bulk of the streaming data from scientific simulations and experiments consists of numerical values, and these values often change in unpredictable ways over a short time horizon. Such data values are known to be hard to compress, however, much of the random fluctuation is not essential to the scientific application and could therefore be removed without adverse impact. We have developed a compression technique based on statistical similarity that could reduce the storage requirement by over 100-fold while preserve prominent features in the data stream. We achieve these impressive compression ratios because most data blocks have similar probability distribution and could be reproduced from a small block. The core concept behind this work is the exchangeability in statistics. To create a practical compression algorithm, we choose to work with fixed size blocks and use Kolmogorov-Smirnov test to measure similarity. The resulting technique could be regarded as a dictionary-based compression scheme. In this paper, we describe the method and explore its effectiveness on two sets of application data. We pay particular attention to the Fourier components of the reconstructed data and show that in addition to preserving unique features in data it is also faithfully preserving the Fourier components whose periods extend more than a few blocks.","PeriodicalId":380859,"journal":{"name":"2017 New York Scientific Data Summit (NYSDS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124790411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Progressive clustering of big data with GPU acceleration and visualization
Jun Wang, E. Papenhausen, B. Wang, S. Ha, A. Zelenyuk, K. Mueller
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085036
Abstract: Clustering has become an unavoidable step in big data analysis. It can arrange data into a compact format, making operations on big data manageable. However, clustering big data requires not only the capacity to handle data of large volume and high dimensionality, but also the ability to process streaming data, and both capabilities are underdeveloped in most current algorithms. Furthermore, big data processing is seldom interactive, which conflicts with users who seek answers immediately. The best one can do is process incrementally, so that partial and, hopefully, accurate results become available relatively quickly and are then progressively refined over time. We propose a clustering framework that uses multi-dimensional scaling for layout and GPU acceleration to accomplish these goals. Our domain application is the clustering of mass spectral data of individual aerosol particles, with 8 million data points of 450 dimensions each.
{"title":"Comparative study of deep learning framework in HPC environments","authors":"HamidReza Asaadi, B. Chapman","doi":"10.1109/NYSDS.2017.8085040","DOIUrl":"https://doi.org/10.1109/NYSDS.2017.8085040","url":null,"abstract":"The rise of machine learning and deep learning applications in recent years has resulted in the development of several specialized frameworks to design neural networks, train them and use them in production. The efforts toward scaling and tuning of such frameworks have coincided with the increasing popularity of heterogeneous architectures (e.g. GPUs and accelerators); and developers found that the iterative and highly concurrent nature of machine learning algorithms is a good fit for the offerings of such architectures. As a result, most machine learning and deep learning frameworks now support offloading features and job distribution among heterogeneous processing units. Despite increasing use of deep learning techniques in scientific computing, HPC architectures has not been a first-class requirement for framework designers and is missing in many cases. We have taken a first step toward understanding the behavior of deep learning frameworks in HPC environments by comparing the performance of such frameworks on a regular HPC cluster setup and their compatibility with cluster architecture. We also studied the support for HPC-specific features provided by each of the frameworks. In order to accomplish this, a set of tests to compare deep learning frameworks has been introduced as well. In addition to the performance results, we observed some design conflicts between these frameworks and the traditional HPC tool chain. Launching deep learning framework jobs using common HPC job schedulers is not straightforward. Also, limited HPC-specific hardware support by these frameworks results in scalability issues and high communication overhead when running in multi-node environments. We discuss the idea of adding native support for executing deep learning frameworks to HPC job schedulers as an example of such adjustments in more details.","PeriodicalId":380859,"journal":{"name":"2017 New York Scientific Data Summit (NYSDS)","volume":"10 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114448951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Capturing provenance as a diagnostic tool for workflow performance evaluation and optimization
Line C. Pouchard, A. Malik, H. V. Dam, C. Xie, W. Xu, K. K. van Dam
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085043
Abstract: In extreme-scale computing environments such as the DOE Leadership Computing Facilities, scientific workflows are routinely used to coordinate software processes for the execution of complex computational applications that perform in-silico experiments. Monitoring the performance of workflows without simultaneously tracking provenance is not sufficient to understand variations between runs, configurations, versions of a code, and changes in an implemented stack and systems, i.e., the variability of performance-metrics data in their historical context. We take a provenance-based approach and demonstrate that provenance is useful as a tool for evaluating and optimizing workflow performance in extreme-scale HPC environments. We present Chimbuko, a framework for the analysis and visualization of the provenance of performance. Chimbuko implements a method for evaluating workflow performance from multiple components that enables the exploration of performance-metrics data at scale.
Machine learning aided prediction of family history of depression
Allen Liu, Bryant Liu, Daniel Lee, M. Weissman, J. Posner, Jiook Cha, Shinjae Yoo
2017 New York Scientific Data Summit (NYSDS), August 2017. DOI: 10.1109/NYSDS.2017.8085046
Abstract: The increased risk for psychopathology in the offspring of depressed parents is well documented. The brain may mediate the effects of familial depression risk on the offspring via shared genetic and environmental factors. Conventional brain-imaging studies testing this mediation effect primarily use a priori knowledge to select a subset of imaging-derived features. Despite existing positive results supporting the notion of the brain as an endophenotype for familial depression, no quantitative assessment has been made of the extent to which the complex brain structure contains information about familial depression. To this end, we aim here to predict whether an individual has a history of familial depression. We propose a data-driven, unbiased, and rigorous machine learning approach using multimodal brain features (e.g., grey matter morphometry based on T1-weighted images and structural connectome based on probabilistic diffusion tractography) to capture the complex representations of brain structure. We implemented logistic regression (LR) with regularization, a support vector machine (SVM), and a graph convolutional neural network (GCN). Our models show promising cross-validated classification accuracy: 97.78% (LR), 93.67% (SVM), and 89.58% (GCN). Brain features with the greatest weights in the models include regions previously implicated in the depression literature (e.g., the frontal-limbic emotion regulation circuit) as well as new regions. The results suggest a large impact of familial depression on brain structure and the connectome, and highlight the potential for data-driven prediction of psychopathology risk using cost-effective, simple, ubiquitous brain images.
{"title":"Visualization of Higgs potentials and decays from sources beyond the standard model including dark matter and extra dimensions","authors":"R. Miceli, M. McGuigan","doi":"10.1109/NYSDS.2017.8085051","DOIUrl":"https://doi.org/10.1109/NYSDS.2017.8085051","url":null,"abstract":"Because the Higgs particle interacts with so many different particles, the potential associated with it takes contributions from many different sectors. This makes it very difficult to calculate, even when dealing with a restricted number of components. Another concern is creating useful visualizations of these potentials, as visual inspection is one of the main ways that physicists can gain new insights about them. Our main project involved plotting various Higgs potentials from new physics beyond the Standard Mod-el, in ways that would illustrate their dependence on various parameters such as temperature, energy scale and coupling strength. We also exported these potentials as 3D models with the aim of displaying them in virtual reality. We will calculate and plot new potentials including contributions from sources like dark matter and its interactions which could be visible through astrophysics or LHC experiments. We plotted new types of potentials associated with extra dimensions, dynamical symmetry breaking, and hidden gauge sectors involving undiscovered fermions. Another part of the project concerned the setup and con-figuration of the Visualization Center in the Computational Science Initiative at BNL. The room is equipped with a graphics computer with dual GPUs powering 6 wall mounted televisions and two virtual reality headsets. The televisions are configured to work as a single large unit intended to display large animations and data visualizations. This setup should make it easier for scientists to interact with and draw meaning from data, such as the high energy physics models that we studied.","PeriodicalId":380859,"journal":{"name":"2017 New York Scientific Data Summit (NYSDS)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124935859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keyword extraction for document clustering using submodular optimization","authors":"Xi Zhang, K. Mueller, Shinjae Yoo","doi":"10.1109/NYSDS.2017.8085056","DOIUrl":"https://doi.org/10.1109/NYSDS.2017.8085056","url":null,"abstract":"With the rapid growth of information services, enormous amount of text corpus cannot simply be read and understand. Therefore, text clustering and visualization present a direct way to observe the documents as well as understand the topic by corresponding keywords. However, even a short paragraph contains a variety of words, which makes the keyword or topic extraction difficult to achieve. Therefore, we propose an algorithm to extract keywords efficiently and effectively, which makes use of the latent semantic indexing and submodular optimization. The visual layout allows users to simultaneously visualize (1) the overview of the whole dataset, (2) the detailed information in the specific scope of the collection of documents, and (3) the relationships of documents with their keywords.","PeriodicalId":380859,"journal":{"name":"2017 New York Scientific Data Summit (NYSDS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117043452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}