{"title":"Why Your Experimental Results Might Be Wrong","authors":"F. Schuhknecht, Justus Henneberg","doi":"10.1145/3592980.3595317","DOIUrl":null,"url":null,"abstract":"Research projects in the database community are often evaluated based on experimental results. A typical evaluation setup looks as follows: Multiple methods to compare with each other are embedded in a single shared benchmarking codebase. In this codebase, all methods execute an identical workload to collect the individual execution times. This seems reasonable: Since the only difference between individual test runs are the methods themselves, any observed time difference can be attributed to these methods. Also, such a benchmarking codebase can be used for gradual optimization: If one method runs slowly, its code can be optimized and re-evaluated. If its performance improves, this improvement can be attributed to the particular optimization. Unfortunately, we had to learn the hard way that it is not that simple. The reason for this lies in a component that sits right between our benchmarking codebase and the produced experimental results — the compiler. As we will see in the following case study, this black-box component has the power to completely ruin any meaningful comparison between methods, even if we setup our experiments as equal and fair as possible.","PeriodicalId":400127,"journal":{"name":"Proceedings of the 19th International Workshop on Data Management on New Hardware","volume":"111 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 19th International Workshop on Data Management on New Hardware","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3592980.3595317","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Research projects in the database community are often evaluated based on experimental results. A typical evaluation setup looks as follows: the methods to be compared are embedded in a single shared benchmarking codebase. In this codebase, all methods execute an identical workload so that their individual execution times can be collected. This seems reasonable: since the only difference between individual test runs is the method under test, any observed time difference can be attributed to the methods themselves. Such a benchmarking codebase can also be used for gradual optimization: if one method runs slowly, its code can be optimized and re-evaluated; if its performance improves, the improvement can be attributed to that particular optimization. Unfortunately, we had to learn the hard way that it is not that simple. The reason lies in a component that sits right between our benchmarking codebase and the produced experimental results: the compiler. As the following case study shows, this black-box component has the power to completely ruin any meaningful comparison between methods, even if we set up our experiments as equally and fairly as possible.
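To make the described setup concrete, below is a minimal sketch of such a shared benchmarking codebase. The two methods (two ways of summing an array) and the workload are hypothetical placeholders chosen for illustration; they are not the paper's actual code or workload. Both methods run over the identical data inside one binary, and the wall-clock difference between them is what would typically be reported.

```cpp
// Minimal sketch of a shared benchmarking codebase (illustrative only).
// Two "methods" run the identical workload and their times are compared.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Method A: straightforward summation.
static std::uint64_t sum_simple(const std::vector<std::uint64_t>& data) {
    std::uint64_t acc = 0;
    for (std::uint64_t v : data) acc += v;
    return acc;
}

// Method B: manually unrolled summation (a typical "gradual optimization").
static std::uint64_t sum_unrolled(const std::vector<std::uint64_t>& data) {
    std::uint64_t a0 = 0, a1 = 0;
    std::size_t i = 0;
    for (; i + 1 < data.size(); i += 2) { a0 += data[i]; a1 += data[i + 1]; }
    for (; i < data.size(); ++i) a0 += data[i];
    return a0 + a1;
}

// Shared driver: runs one method on the identical workload and reports its time.
template <typename Method>
static void benchmark(const char* name, Method method,
                      const std::vector<std::uint64_t>& workload) {
    auto start = std::chrono::steady_clock::now();
    std::uint64_t result = method(workload);
    auto stop = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    // Printing the checksum keeps the compiler from discarding the computation.
    std::cout << name << ": " << us << " us (checksum " << result << ")\n";
}

int main() {
    std::vector<std::uint64_t> workload(1 << 24);
    std::iota(workload.begin(), workload.end(), std::uint64_t{0});

    benchmark("sum_simple  ", sum_simple, workload);
    benchmark("sum_unrolled", sum_unrolled, workload);
    // Caveat raised by the paper: because both methods live in the same binary,
    // the optimizer is free to inline, vectorize, or reorder them differently,
    // so the observed gap is not necessarily caused by the source-level
    // difference between the two methods.
    return 0;
}
```

The comment at the end of main marks the pitfall the case study examines: the compiler sits between this source code and the measured numbers, and it may treat the two methods differently even though the harness itself is perfectly symmetric.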