Streaming Sorting Networks
M. Zuluaga, Peter Milder, Markus Püschel
ACM Trans. Design Autom. Electr. Syst., volume 22 1, pages 55:1-55:30
Published 2016-09-22. Journal Article.
DOI: 10.1145/2854150 (https://doi.org/10.1145/2854150)

Sorting is a fundamental problem in computer science and has been studied extensively. Thus, a large variety of sorting methods exists for both software and hardware implementations. For the latter, there is a trade-off between the throughput achieved and the cost (i.e., the logic and storage invested to sort n elements). Two popular solutions are bitonic sorting networks with O(n log² n) logic and storage, which sort n elements per cycle, and linear sorters with O(n) logic and storage, which sort n elements per n cycles. In this article, we present new hardware structures that we call streaming sorting networks, which we derive through a mathematical formalism that we introduce, and an accompanying domain-specific hardware generator that translates our formal mathematical description into synthesizable RTL Verilog. With the new networks, we achieve novel and improved cost-performance trade-offs. For example, assuming that n is a power of two and w is any divisor of n, one class of these networks can sort in n/w cycles with O(w log² n) logic and O(n log² n) storage; the other class that we present sorts in (n log² n)/w cycles with O(w) logic and O(n) storage. We carefully analyze the performance of these networks and their cost at three levels of abstraction: (1) asymptotically, (2) exactly in terms of the number of basic elements needed, and (3) in terms of the resources required by the actual circuit when mapped to a field-programmable gate array. The accompanying hardware generator allows us to explore the entire design space, identify the Pareto-optimal solutions, and show superior cost-performance trade-offs compared to prior work.
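
To make the trade-off above concrete, the following Python sketch models a fully parallel bitonic sorting network in software and tabulates the leading-order cost terms quoted in the abstract. It is an illustration only, not the authors' hardware generator (which emits RTL Verilog). The function names `bitonic_network`, `apply_network`, and `leading_order_costs` are hypothetical, and the cost figures simply drop all constant factors from the O(...) expressions, which is an assumption made here for readability.

```python
def bitonic_network(n):
    """Return the comparator stages of a bitonic sorting network.

    n must be a power of two. Each stage is a list of (lo, hi) index
    pairs; a comparator places the smaller value at lo and the larger
    value at hi. All comparators within one stage are independent.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, n >= 2")
    stages = []
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            stage = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if (i & k) == 0:
                        stage.append((i, partner))   # ascending block
                    else:
                        stage.append((partner, i))   # descending block
            stages.append(stage)
            j //= 2
        k *= 2
    return stages


def apply_network(stages, values):
    """Run the comparator stages over a copy of values and return it."""
    a = list(values)
    for stage in stages:
        for lo, hi in stage:
            if a[lo] > a[hi]:
                a[lo], a[hi] = a[hi], a[lo]
    return a


def leading_order_costs(n, w):
    """Leading-order terms from the abstract with constants dropped.

    Assumption: n is a power of two and w divides n. The numbers are
    only meant for relative comparison, not as exact resource counts.
    """
    lg2 = (n.bit_length() - 1) ** 2  # (log2 n)^2 for a power of two
    return {
        "bitonic network":    {"cycles": 1,            "logic": n * lg2, "storage": n * lg2},
        "linear sorter":      {"cycles": n,            "logic": n,       "storage": n},
        "streaming, class 1": {"cycles": n // w,       "logic": w * lg2, "storage": n * lg2},
        "streaming, class 2": {"cycles": n * lg2 // w, "logic": w,       "storage": n},
    }


if __name__ == "__main__":
    net = bitonic_network(8)
    print(len(net), "stages,", sum(len(s) for s in net), "comparators")
    print(apply_network(net, [5, 2, 7, 1, 8, 3, 6, 4]))
    for name, cost in leading_order_costs(1024, 4).items():
        print(name, cost)
```

For n = 2^k inputs the network has k(k+1)/2 stages of n/2 comparators each, which is where the O(n log² n) logic of the fully parallel design comes from. The streaming designs in the abstract trade that logic for more cycles by processing only w elements per cycle.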