Workload Characterization of a Parallel Video Mining Application on a 16-Way Shared-Memory Multiprocessor System

2006 IEEE International Symposium on Workload Characterization. Pub Date: 2006-10-01. DOI: 10.1109/IISWC.2006.302725

Wenlong Li, E. Li, C. Dulong, Yen-kuang Chen, Tao Wang, Yimin Zhang

Citations: 8

Abstract


As video data become increasingly pervasive, mining information from multimedia sources grows in importance, for example automatically extracting highlights from soccer game video. However, the heavy computational cost of mining data of interest limits its wide use in practice. As computer architecture shifts from uniprocessors to multi-core processors, exploiting the thread-level parallelism in multimedia mining applications is critical to utilizing the hardware resources and accelerating the complex processing of highlight-event detection. In this paper we analyze the view type and playfield detection application, which is widely used in sports video mining systems, and we present several schemes for parallelizing it: a task-level scheme, a data-slicing-level scheme, a hybrid parallel scheme, and variations of the hybrid scheme. The hybrid parallel scheme, which exploits both task-level and data-slicing-level parallelism, outperforms the basic task-level and data-slicing-level schemes in both execution time and speedup. On a 16-way shared-memory multiprocessor system with hardware prefetch enabled, the hybrid scheme achieves a speedup of 10.6x. Detailed performance analysis shows that, because of the large working set, the workload often requires data from off-chip memory. Saturated bus bandwidth is therefore the likely bottleneck preventing perfect scalability. With hardware prefetch enabled, bus utilization on the 16-processor system is about 76% for the hybrid scheme, and the projected bus bandwidth requirement for perfect scalability is about 3.1 GB/s for 16 processors and 6.2 GB/s for 32 processors. In addition, our experiments reveal no other obvious scaling limiters: synchronization overhead and load imbalance remain very low even with up to 16 processors.
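
The scaling figures quoted above can be cross-checked with simple arithmetic. The sketch below is an illustration and is not taken from the paper: it computes parallel efficiency as speedup divided by processor count, and it assumes that bus-bandwidth demand grows linearly with the number of processors, which is consistent with the quoted 3.1 GB/s and 6.2 GB/s projections. The function and variable names are hypothetical.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def projected_bandwidth(per_processor_gbs: float, processors: int) -> float:
    """Bus bandwidth needed, ASSUMING demand scales linearly with processors."""
    return per_processor_gbs * processors


# Figures quoted in the abstract.
SPEEDUP_16 = 10.6   # hybrid scheme, 16 processors, hardware prefetch enabled
BW_16_GBS = 3.1     # projected requirement for perfect scalability, 16 processors

efficiency = parallel_efficiency(SPEEDUP_16, 16)

# Per-processor demand implied by the 16-processor projection (an inference,
# not a number reported in the abstract).
per_proc = BW_16_GBS / 16

# Linear extrapolation to 32 processors, to compare with the quoted 6.2 GB/s.
bw_32 = projected_bandwidth(per_proc, 32)

print(f"parallel efficiency at 16 processors: {efficiency:.1%}")
print(f"implied per-processor demand: {per_proc:.3f} GB/s")
print(f"linear projection for 32 processors: {bw_32:.1f} GB/s")
```

Under these assumptions, a 10.6x speedup on 16 processors corresponds to roughly two-thirds of ideal linear speedup, and doubling the processor count doubles the projected bandwidth requirement.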
