{"title":"GPU Implementation of Finite Difference Solvers","authors":"M. Giles, E. László, I. Reguly, J. Appleyard, Julien Demouth","doi":"10.1109/WHPCF.2014.10","DOIUrl":"https://doi.org/10.1109/WHPCF.2014.10","url":null,"abstract":"This paper discusses the implementation of one-factor and three-factor PDE models on GPUs. Both explicit and implicit time-marching methods are considered, with the latter requiring the solution of multiple tridiagonal systems of equations. Because of the small amount of data involved, one-factor models are primarily compute-limited, with a very good fraction of the peak compute capability being achieved. The key to the performance lies in the heavy use of registers and shuffle instructions for the explicit method, and a non-standard hybrid Thomas/PCR algorithm for solving the tridiagonal systems for the implicit solver. The three-factor problems involve much more data, and hence their execution is more evenly balanced between computation and data communication to/from the main graphics memory. However, it is again possible to achieve a good fraction of the theoretical peak performance on both measures.
The high performance requires particularly careful attention to coalescence in the data transfers, the use of local shared memory for small array transpositions, and padding to avoid shared-memory bank conflicts. Computational results include comparisons to computations on Sandy Bridge and Haswell Intel Xeon processors, using both multithreading and AVX vectorisation.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116950819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
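The implicit solver in the abstract above rests on tridiagonal solves. As a point of reference, the serial Thomas algorithm (one half of the hybrid Thomas/PCR scheme the paper describes) can be sketched in NumPy; this is an illustrative CPU sketch, not the authors' GPU implementation, and the function name and interface are assumptions:

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system via the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal
    (c[-1] unused), d: right-hand side; all length-n arrays.
    Stable for the diagonally dominant systems that arise from
    implicit finite-difference time-marching.
    """
    n = len(b)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):            # forward elimination sweep
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):   # back-substitution sweep
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The two data-dependent sweeps are what make the plain Thomas algorithm serial per system; the paper's hybrid with parallel cyclic reduction (PCR) is what exposes parallelism within each system on the GPU.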
{"title":"On the Viability of Microservers for Financial Analytics","authors":"C. Gillan, Dimitrios S. Nikolopoulos, G. Georgakoudis, R. Faloon, George Tzenakis, I. Spence","doi":"10.1109/WHPCF.2014.11","DOIUrl":"https://doi.org/10.1109/WHPCF.2014.11","url":null,"abstract":"Energy consumption and total cost of ownership are daunting challenges for datacenters, because they scale disproportionately with performance. Datacenters running financial analytics may incur extremely high operational costs in order to meet the performance and latency requirements of their hosted applications. Recently, ARM-based microservers have emerged as a viable alternative to high-end servers, promising scalable performance via scale-out approaches and low energy consumption. In this paper, we investigate the viability of ARM-based microservers for option pricing, using the Monte Carlo and Binomial Tree kernels. We compare an ARM-based microserver against a state-of-the-art x86 server. We define application-related but platform-independent energy and performance metrics to compare those platforms fairly in the context of datacenters for financial analytics and give insight on the particular requirements of option pricing. Our experiments show that through scaling out energy-efficient compute nodes within a 2U rack-mounted unit, an ARM-based microserver consumes as little as about 60% of the energy per priced option compared to an x86 server, despite having significantly slower cores.
We also find that the ARM microserver scales enough to meet a high fraction of market throughput demand, while consuming up to 30% less energy than an Intel server.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116908631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speeding up Large-Scale Financial Recomputation with Memoization","authors":"Alexander Moreno, T. Balch","doi":"10.1109/WHPCF.2014.9","DOIUrl":"https://doi.org/10.1109/WHPCF.2014.9","url":null,"abstract":"Quantitative financial analysis requires repeated computations of the same functions with the same arguments when prototyping trading strategies; many of these functions involve resource-intensive operations on large matrices. Reducing the number of repeated computations either within a program or across runs of the same program would allow analysts to build and debug trading strategies more quickly. We built a disk memoization library that caches function computations to files to avoid recomputation. Any memoization solution should be easy to use, minimizing the need for users to think about whether caching is appropriate, while at the same time giving them control over speed, accuracy, and space used for caching. Guo and Engler proposed a similar tool that does automatic memoization by modifying the Python interpreter, while the packages Jug and Joblib are distributed computing tools that have memoization options. Our library attempts to maintain the ease of use of the above packages, while giving users who need it finer control over how caching is done. We provide the same basic features as these other libraries, but also allow control over how hashing is done, over space usage both per function and overall, over refreshing the cache for a specific function, and over accuracy checking. This should lead to both increased productivity and faster recomputation. We show that for several financial calculations, including Markowitz Optimization, Fama French, and the Singular Value Decomposition, memoization greatly speeds up recomputation, often by over 99%.
We also show that by using xxhash, a non-cryptographic hash function, instead of MD5, and avoiding equality checks, our package greatly outperforms joblib, the best current package.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131057315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
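The general disk-memoization technique this abstract describes can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' library: the `disk_memoize` name, the `CACHE_DIR` location, and the MD5-keyed pickle-file layout are all assumptions (the paper itself reports better results with xxhash than MD5).

```python
import functools
import hashlib
import os
import pickle

CACHE_DIR = ".memo_cache"  # hypothetical on-disk cache location

def disk_memoize(func):
    """Cache results of func on disk, keyed by a hash of its arguments."""
    os.makedirs(CACHE_DIR, exist_ok=True)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Key the cache entry on the function name and pickled arguments.
        key = hashlib.md5(pickle.dumps((func.__name__, args, kwargs))).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(path):          # cache hit: load, skip recomputation
            with open(path, "rb") as f:
                return pickle.load(f)
        result = func(*args, **kwargs)
        with open(path, "wb") as f:       # cache miss: compute and store
            pickle.dump(result, f)
        return result

    return wrapper
```

Because the cache lives on disk rather than in memory, a second run of the same program also hits the cache, which is the "across runs" scenario the abstract targets.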
{"title":"Many-core Programming with Asian Option Pricing","authors":"S. Li, James Lin","doi":"10.1109/WHPCF.2014.7","DOIUrl":"https://doi.org/10.1109/WHPCF.2014.7","url":null,"abstract":"In this paper, we discuss the problem of pricing one exotic option, the strongly path-dependent Asian option, using the Black-Scholes model, and compare how the pricing algorithm maps onto different many-core architectures to achieve equally impressive performance gains. We show that a 2-year contract with 252 time steps and 1,000,000 samples can be priced in approximately one fifth of a second on two leading many-core architectures. The purpose of this paper is to understand what is required to power the numerically intensive algorithms in quantitative finance, and how to extract and express the parallelism inherent in many similar algorithms.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124930424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
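A scalar reference version of the pricer this abstract describes, an arithmetic-average Asian call under Black-Scholes (GBM) dynamics priced by Monte Carlo, can be sketched with NumPy. The function name, parameters, and monitoring convention (averaging over the time steps after inception) are illustrative assumptions; the paper's many-core implementations are of course far more heavily optimized.

```python
import numpy as np

def asian_call_mc(s0, k, r, sigma, t, steps, paths, seed=0):
    """Monte Carlo price of an arithmetic-average Asian call under GBM,
    vectorized over all sample paths at once."""
    rng = np.random.default_rng(seed)
    dt = t / steps
    # Simulate log-price increments for all paths simultaneously.
    z = rng.standard_normal((paths, steps))
    log_s = np.log(s0) + np.cumsum(
        (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1
    )
    avg = np.exp(log_s).mean(axis=1)        # arithmetic average per path
    payoff = np.maximum(avg - k, 0.0)
    return np.exp(-r * t) * payoff.mean()   # discounted expected payoff
```

The path dependence is why the whole path (not just the terminal price) must be simulated, which is precisely what makes the kernel compute-intensive and embarrassingly parallel across paths.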
{"title":"A Systematic Methodology for Analyzing Closed-Form Heston Pricer Regarding Their Accuracy and Runtime","authors":"Christian Brugger, Gongda Liu, C. D. Schryver, N. Wehn","doi":"10.1109/WHPCF.2014.13","DOIUrl":"https://doi.org/10.1109/WHPCF.2014.13","url":null,"abstract":"Calibration methods are at the heart of modeling any financial process. While for the Heston model (semi) closed-form solutions exist for calibrating to simple products, their evaluation involves complex functions and infinite integrals. So far these integrals can only be solved with time-consuming numerical methods. For that reason, calibration consumes a large portion of available compute power in the daily finance business, and it is worth identifying the best available methods with respect to runtime and accuracy. However, over the years more and more theoretical and practical subtleties have been revealed and today a large number of approaches are available, including different formulations of the closed-form solutions and various integration algorithms such as quadrature or Fourier methods. Currently there is no clear indication which pricing method should be used for a specific calibration purpose with additional speed and accuracy constraints. With this publication we are closing this gap. We derive a novel methodology to systematically find the best methods for a well-defined accuracy target among a huge set of available methods. For a practical setup we study the available popular closed-form solutions and integration algorithms from literature. In total we compare 14 pricing methods, including adaptive quadrature and Fourier methods. For a target accuracy of 10^-3 we show that static Gauss-Legendre quadrature is best on CPUs for the unrestricted parameter set. Further, we show that for the restricted Carr-Madan formulation the methods are 3.6x faster.
We also show that Fourier methods are even better when pricing at least 10 options with the same maturity but different strikes.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"195 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132245943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
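The "static Gauss-Legendre" scheme favoured above amounts to computing a fixed set of quadrature nodes and weights once and reusing them across every pricer evaluation during calibration. A generic sketch of that idea follows; the Heston characteristic-function integrand itself is omitted, and the function name and truncation of the infinite integral to a finite interval are assumptions.

```python
import numpy as np

def gauss_legendre_integral(f, a, b, n):
    """Approximate the integral of f over [a, b] with a static n-point
    Gauss-Legendre rule. In a calibration loop the nodes/weights would be
    computed once and reused for every evaluation of the pricing integral."""
    x, w = np.polynomial.legendre.leggauss(n)  # nodes/weights on [-1, 1]
    xm = 0.5 * (b - a) * x + 0.5 * (b + a)     # affine map to [a, b]
    return 0.5 * (b - a) * np.dot(w, f(xm))
```

For smooth integrands the rule converges very fast in n, which is why a static rule can beat adaptive quadrature once the per-call overhead of adaptivity matters.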
{"title":"STAC-A2 on Intel Architecture: From Scalar Code to Heterogeneous Application","authors":"Evgeny Fiksman, S. Salahuddin","doi":"10.1109/WHPCF.2014.6","DOIUrl":"https://doi.org/10.1109/WHPCF.2014.6","url":null,"abstract":"STAC-A2™ is a compute- and memory-intensive industry benchmark in the field of market risk analysis. The benchmark specifications were created by the Securities Technology Analysis Center (aka STAC®) and are based on inputs collected from the leading trading companies, universities, and high performance computing vendors. The specifications describe the models which represent realistic market risk analysis workloads. In this paper we discuss the development steps that led to competitive performance of the STAC-A2 benchmark executed on systems consisting of Intel® Xeon® processor(s) and an Intel® Xeon Phi™ coprocessor. We show the importance of utilizing all parallel resources available on Intel architectures to achieve maximum performance. We demonstrate that the offload extension supported by Intel® Composer XE minimizes the effort required to create accelerated applications using only the C/C++ language. With Intel's latest implementation of the STAC-A2 benchmark we were able to achieve a significant (800%) performance gain by using a heterogeneous approach running on two Intel Xeon E5-2699 v3 processors and a single Intel® Xeon Phi™ 7120A card, compared to an earlier version running on only two Intel Xeon E5-2697 v2 processors.
This implementation outperforms NVIDIA's implementation running on an Intel Xeon processor-based server with two NVIDIA K20Xm cards.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134367925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Portable and Fast Stochastic Volatility Model Calibration Using Multi and Many-Core Processors","authors":"M. Dixon, Jörg Lotze, M. Zubair","doi":"10.1109/WHPCF.2014.12","DOIUrl":"https://doi.org/10.1109/WHPCF.2014.12","url":null,"abstract":"Financial markets change precipitously, and on-demand pricing and risk models must be constantly recalibrated to reduce risk. However, certain classes of models are computationally intensive to robustly calibrate to intraday prices, stochastic volatility models being an archetypal example due to the non-convexity of the objective function. In order to accelerate this procedure through parallel implementation, financial application developers are faced with an ever-growing plethora of low-level high-performance computing frameworks such as OpenMP, OpenCL, CUDA, or SIMD intrinsics, and forced to make a trade-off between performance and the portability, flexibility, and modularity of the code required to facilitate rapid in-house model development and productionization. This paper describes the acceleration of stochastic volatility model calibration on multi-core CPUs and GPUs using the Xcelerit platform. By adopting a simple dataflow programming model, the Xcelerit platform enables the application developer to write sequential, high-level C++ code, without concern for low-level high-performance computing frameworks. This platform provides the portability, flexibility, and modularity required by application developers. Speedups of up to 30x and 293x are respectively achieved on an Intel Xeon CPU and NVIDIA Tesla K40 GPU, compared to a sequential CPU implementation. The Xcelerit platform implementation is further shown to be equivalent in performance to a low-level CUDA version.
Overall, we are able to reduce the entire calibration process time of the sequential implementation from 6,189 seconds to 183.8 and 17.8 seconds on the CPU and GPU respectively, without requiring the developer to reimplement in low-level high-performance computing frameworks.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128104411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Irregular Time Series through Non-Uniform Fast Fourier Transform","authors":"Jung Heon Song, M. L. Prado, H. Simon, Kesheng Wu","doi":"10.2139/ssrn.2487656","DOIUrl":"https://doi.org/10.2139/ssrn.2487656","url":null,"abstract":"Most popular analysis tools on time series require the data to be taken at uniform time intervals. However, real-world time series, such as those from financial markets, are typically taken at irregular time intervals. It is a common practice to resample or bin the irregular time series into a regular one, but this practice has significant limitations. For example, if one is to resample the trading activities of a stock into hourly series, then the time series can only last through the trading day, because there is usually no trading at night. In this work, we explore the dynamics of irregular time series through a high-performance computing algorithm known as Non-Uniform Fast Fourier Transform (NUFFT). To illustrate its effectiveness, we apply NUFFT on the trading records of natural gas futures contracts for the last seven years. Tests show that NUFFT results accurately capture well-known structural features in the trading records, such as weekly and daily cycles. At the same time the results also reveal unexplored features, such as the presence of multiple power laws. In particular, we observe an emerging power law in the Fourier spectra in recent years.
We also detect a strong Fourier component at precisely once per minute, which suggests that significant automated trading activity may be triggered by the clock.","PeriodicalId":368134,"journal":{"name":"2014 Seventh Workshop on High Performance Computational Finance","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124737876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
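For intuition on the transform behind the abstract above: the quantity NUFFT computes is the non-uniform discrete Fourier transform, and the direct O(N·M) evaluation below is the naive baseline that NUFFT algorithms approximate in roughly O(M log M) time. The function name and interface are illustrative, not the authors' code.

```python
import numpy as np

def ndft(times, values, freqs):
    """Direct non-uniform discrete Fourier transform.

    times:  irregular sample times (length N)
    values: samples at those times (length N)
    freqs:  frequencies at which to evaluate the spectrum (length M)
    Returns the M complex Fourier coefficients, one per frequency.
    """
    return np.array(
        [np.sum(values * np.exp(-2j * np.pi * f * times)) for f in freqs]
    )
```

Because the sample times need not be equally spaced, this applies directly to trade-stamped series; a weekly or daily cycle shows up as a peak at the corresponding frequency without any resampling or binning.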