Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

Modeling the Performance of Atomic Primitives on Modern Architectures
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337901
F. Hoseini, A. Atalar, P. Tsigas
Abstract: Utilizing the atomic primitives of a processor to access a memory location atomically is key to the correctness and feasibility of parallel software systems. The performance of atomics plays a significant role in the scalability and overall performance of parallel software systems. In this work, we study the performance of atomic primitives, in terms of latency, throughput, fairness, and energy consumption, in the context of the two common software execution settings that result in high- and low-contention access to shared memory. We perform and present an exhaustive study of the performance of atomics in these two application contexts and propose a performance model that captures their behavior. We consider two state-of-the-art architectures: Intel Xeon E5 and Xeon Phi (KNL). Our model is centered around the bouncing of cache lines between threads that execute atomic primitives on these shared cache lines. It is simple enough to use in practice, captures the behavior of atomics accurately under these execution scenarios, and facilitates algorithmic design decisions in multi-threaded programming.
Citations: 5
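The cache-line-bouncing idea from the abstract can be sketched as a toy latency model. This is illustrative only: the linear form and the constants are assumptions, not the paper's fitted model.

```python
# Toy cache-line-bouncing model (illustrative; not the paper's fitted model).
# Under contention, an atomic op must first acquire the cache line in
# exclusive state, so expected latency grows with the number of contenders.

def atomic_latency(n_threads, t_local, t_transfer):
    """Expected latency of one atomic op when n_threads contend for one line.

    t_local: cost when the line is already held in the local cache.
    t_transfer: cost of pulling the line from another core's cache.
    """
    if n_threads <= 1:
        return t_local
    # With n contending threads, the line is resident locally ~1/n of the time.
    p_local = 1.0 / n_threads
    return p_local * t_local + (1 - p_local) * t_transfer

def throughput(n_threads, t_local, t_transfer):
    # Ops on a contended line complete serially: one op per latency period.
    return 1.0 / atomic_latency(n_threads, t_local, t_transfer)
```

The model predicts the qualitative behavior the paper studies: per-operation latency rises and single-line throughput falls as more threads contend.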
Adaptive Routing Reconfigurations to Minimize Flow Cost in SDN-Based Data Center Networks
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337861
Akbar Majidi, Xiaofeng Gao, S. Zhu, Nazila Jahanbakhsh, Guihai Chen
Abstract: Data center networks have become heavily reliant on software-defined networking to orchestrate data transmission. To maintain optimal network configurations, a controller needs to solve the multi-commodity flow problem and globally update the network under tight time constraints. In this paper, we aim to minimize flow cost (intuitively, average transmission delay) under reconfiguration budget constraints in data centers. We formulate this optimization problem as a constrained Markov Decision Process and propose a set of algorithms to solve it in a scalable manner. We first develop a propagation algorithm to identify the flows that are most affected in terms of latency and will be reconfigured in the next network update. We then set a limitation range for updating them, improving adaptability and scalability by updating fewer flows each time and thereby achieving fast operations. Further, based on the drift-plus-penalty method in Lyapunov theory, we propose a heuristic policy that requires no prior information about flow demand and carries a performance guarantee minimizing the additive optimality gap. To the best of our knowledge, this is the first paper to study the range and frequency of flow reconfigurations, which has both theoretical and practical significance in this area. Extensive emulations and numerical simulations show that our proposed policy performs much better than the estimated theoretical bound and outperforms state-of-the-art algorithms in terms of latency by over 45%, while also improving adaptability and scalability.
Citations: 9
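The budgeted-update step can be sketched as follows. The cost model and data layout here are hypothetical stand-ins for the paper's propagation algorithm: rank flows by how much rerouting would reduce their cost, then update only as many as the reconfiguration budget allows.

```python
# Illustrative sketch (names and cost model are hypothetical): under a
# per-update budget, pick the flows whose cost would improve the most.

def select_flows_to_update(flows, budget):
    """flows: list of (flow_id, current_cost, best_achievable_cost).

    Returns the ids of at most `budget` flows, ranked by potential savings.
    """
    ranked = sorted(flows, key=lambda f: f[1] - f[2], reverse=True)
    return [fid for fid, cur, best in ranked[:budget]]
```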
Unleashing the Scalability Potential of Power-Constrained Data Center in the Microservice Era
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337857
Xiaofeng Hou, Jiacheng Liu, Chao Li, M. Guo
Abstract: Recent scale-out cloud services have undergone a shift from monolithic applications to microservices by putting each functionality into lightweight software containers. Although traditional data center power optimization frameworks excel at per-server or per-rack management, they can hardly make informed decisions when facing microservices that have different QoS requirements on a per-service basis. In a power-constrained data center, blindly budgeting power usage can lead to a power unbalance issue: microservices on the critical path may not receive an adequate power budget. This unavoidably hinders the growth of cloud productivity. To unleash the performance potential of the cloud in the microservice era, this paper investigates microservice-aware data center resource management. We model a microservice application using a bipartite graph and propose a metric called the microservice criticality factor (MCF) to measure the overall impact of performance scaling on a microservice from the whole application's perspective. We further devise ServiceFridge, a novel system framework that leverages MCF to jointly orchestrate software containers and control hardware power demand. Our detailed case study on a practical microservice application demonstrates that ServiceFridge allows a data center to reduce its dynamic power by 25% with slight performance loss. It improves the mean response time by 25.2% and the 90th-percentile tail latency by 18.0% compared with existing schemes.
Citations: 12
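A criticality-style score can be sketched in a few lines. This is a hypothetical simplification, not the paper's MCF formula: it scores a service by the fraction of request paths it participates in, so services on many paths (the critical ones) rank higher for power budgeting.

```python
# Hypothetical sketch: the paper defines MCF on a bipartite graph; here we
# approximate "criticality" as the fraction of request paths a service
# appears on (not the paper's actual formula).

def criticality(service, request_paths):
    """request_paths: list of paths, each a list of service names."""
    if not request_paths:
        return 0.0
    hits = sum(1 for path in request_paths if service in path)
    return hits / len(request_paths)
```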
Cartesian Collective Communication
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337848
J. Träff, S. Hunold
Abstract: We introduce Cartesian collective communication as sparse, collective communication defined on processes (processors) organized into d-dimensional tori or meshes. Processes specify local neighborhoods, e.g., stencil patterns, by lists of relative Cartesian coordinate offsets. The Cartesian collective operations perform data exchanges (and reductions) over the set of all neighborhoods such that each process communicates with the processes in its local neighborhood. The key requirement is that local neighborhoods must be structurally identical (isomorphic). This makes it possible for processes to compute correct, deadlock-free, efficient communication schedules for the collective operations locally, without any interaction with other processes. Cartesian collective communication substantially extends the collective neighborhood communication on Cartesian communicators defined by the MPI standard, and is a restricted form of neighborhood collective communication on general, distributed graph topologies. We show that the restriction to isomorphic neighborhoods permits communication improvements beyond what is possible for unrestricted graph topologies by presenting non-trivial message-combining algorithms that reduce communication latency for Cartesian alltoall and allgather collective operations. For both types of communication, the required communication schedules can be computed in time linear in the size of the input neighborhood. Our benchmarks show that, for small data block sizes, we can substantially outperform the general MPI neighborhood collectives implementing the same communication pattern. We discuss different possibilities for supporting Cartesian collective communication in MPI. Our library is implemented on top of MPI and uses the same signatures for the collective communication operations as the MPI (neighborhood) collectives. Our implementation requires essentially only a single new communicator creation function, and even this might not be needed for an implementation inside an MPI library.
Citations: 8
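The neighborhood specification described in the abstract can be sketched directly: each process names its neighbors by relative Cartesian offsets, and because all neighborhoods are isomorphic, every rank can derive the same neighbor set locally. The row-major rank layout below is an assumption for illustration.

```python
# Sketch of offset-based neighborhoods on a d-dimensional torus (row-major
# rank layout assumed; wraparound is periodic as on a torus).

def rank_of(coords, dims):
    """Row-major rank of a (possibly out-of-range) coordinate on a torus."""
    rank = 0
    for c, d in zip(coords, dims):
        rank = rank * d + (c % d)  # periodic wraparound per dimension
    return rank

def neighbor_ranks(coords, offsets, dims):
    """Ranks of the neighborhood given by relative offsets (e.g. a stencil)."""
    return [rank_of([c + o for c, o in zip(coords, off)], dims)
            for off in offsets]
```

For example, on a 4x4 torus a 4-point stencil at the origin wraps around to ranks 4, 12, 1, and 3.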
Speculative Scheduling for Stochastic HPC Applications
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337890
Ana Gainaru, Guillaume Pallez, Hongyang Sun, P. Raghavan
Abstract: New emerging fields are developing a growing number of large-scale applications with heterogeneous, dynamic, and data-intensive requirements that put a high emphasis on productivity and thus are not tuned to run efficiently on today's high performance computing (HPC) systems. Some of these applications, such as neuroscience workloads and those that use adaptive numerical algorithms, develop modeling and simulation workflows with stochastic execution times and unpredictable resource requirements. When they are deployed on current HPC systems using existing resource management solutions, the result can be a loss of efficiency for users and a decrease in effective system utilization for platform providers. In this paper, we consider the current HPC scheduling model and describe the challenge it poses for stochastic applications due to the strict requirements of its job deployment policies. To address the challenge, we present speculative scheduling techniques that adapt the resource requirements of a stochastic application on the fly, based on its past execution behavior instead of relying on estimates given by the user. We focus on improving overall system utilization and application response time without disrupting the current HPC scheduling model or the application development process. Our solution can operate alongside existing HPC batch schedulers without interfering with their usage modes. We show that speculative scheduling can improve system utilization and average application response time by 25-30% compared to the classical HPC approach.
Citations: 10
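The idea of deriving walltime requests from past behavior instead of user estimates can be sketched as a quantile policy. The specific quantiles and the resubmit-on-overrun strategy below are illustrative assumptions, not the paper's algorithm.

```python
# Illustrative sketch: derive an increasing sequence of walltime reservations
# from observed runtimes; a job that exceeds one reservation is resubmitted
# with the next (the quantile choices are assumptions, not the paper's policy).

def reservation_sequence(past_runtimes, quantiles=(0.5, 0.9, 1.0)):
    """Increasing walltime requests derived from past executions."""
    xs = sorted(past_runtimes)
    return [xs[min(int(q * len(xs)), len(xs) - 1)] for q in quantiles]
```

A job with history [10, 20, 30, 40, 50] would first request 30, then 50, trading a few failed short reservations against large over-reservations.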
Faster parallel collision detection at high resolution for CNC milling applications
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337838
Xin Chen, Dmytro Konobrytskyi, Thomas M. Tucker, T. Kurfess, R. Vuduc
Abstract: This paper presents a new and more work-efficient parallel method to speed up a class of three-dimensional collision detection (CD) problems, which arise, for instance, in computer numerical control (CNC) milling. Given two objects, one enclosed by a bounding volume and the other represented by a voxel model, we wish to determine all possible orientations of the bounded object around a given point that do not cause collisions. Underlying most CD methods are three types of geometrical operations that are bottlenecks: decompositions, rotations, and projections. Our proposed approach, which we call the aggressive inaccessible cone angle (AICA) method, simplifies these operations and, empirically, can prune as much as 99% of the intersection tests that would otherwise be required, while improving load balance. We validate our techniques by implementing a parallel version of AICA in SculptPrint, a state-of-the-art computer-aided manufacturing (CAM) application used for CNC milling, on GPU platforms. Experimental results using four CAM benchmarks show that AICA can be over 23× faster than a baseline method that does not prune projections, and can check collisions for 4096 angular orientations of an object represented by 27 million voxels in under 18 milliseconds on a GPU.
Citations: 1
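The cone-angle pruning idea can be sketched with a simple angle test. This is a hypothetical simplification of AICA's geometry: an orientation is rejected without any intersection test if it falls inside an inaccessible cone around a blocked direction.

```python
# Hypothetical sketch of cone-angle pruning (a stand-in for AICA's geometry):
# reject an orientation cheaply if it lies inside the inaccessible cone.

import math

def in_cone(direction, axis, half_angle):
    """True if unit vector `direction` is within `half_angle` rad of `axis`."""
    dot = sum(a * b for a, b in zip(direction, axis))
    return math.acos(max(-1.0, min(1.0, dot))) <= half_angle

def prune(orientations, axis, half_angle):
    """Keep only orientations outside the inaccessible cone; the survivors
    still need full intersection tests."""
    return [d for d in orientations if not in_cone(d, axis, half_angle)]
```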
DeepHash
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337924
Yuanning Gao, Xiaofeng Gao, Guihai Chen
Abstract: In distributed file systems, distributed metadata management can be considered a mapping problem, i.e., how to effectively map the metadata namespace tree to multiple metadata servers (MDSs). In general, traditional distributed metadata management schemes simply presume a rigid mapping function, thus failing to adaptively meet the requirements of different applications. To better take advantage of the current distribution of the metadata, in this exploratory paper we present the first machine-learning-based model, called DeepHash, which leverages a deep neural network to learn a locality-preserving hashing (LPH) mapping. To help learn a good positional relationship among metadata nodes in the namespace tree, we first present a metadata representation strategy. Due to the absence of training labels, i.e., the hash values of metadata nodes, we design two loss functions with distinct characteristics to train DeepHash, a pair loss and a triplet loss, and introduce sampling strategies for both approaches. We conduct extensive experiments on the Amazon EC2 platform to compare the performance of DeepHash with traditional and state-of-the-art schemes. The results demonstrate that DeepHash preserves metadata locality well while maintaining high load balance, demonstrating its effectiveness and efficiency.
Citations: 2
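The triplet loss mentioned in the abstract can be sketched on scalar hash values. The margin and the one-dimensional hashes are illustrative assumptions; the paper trains a deep network on richer representations.

```python
# Sketch of a triplet loss for locality-preserving hashing: nodes close in
# the namespace tree (anchor/positive) should get closer hash values than
# distant nodes (anchor/negative). Margin and scalar hashes are illustrative.

def triplet_loss(h_anchor, h_positive, h_negative, margin=1.0):
    d_pos = abs(h_anchor - h_positive)  # distance to the nearby node
    d_neg = abs(h_anchor - h_negative)  # distance to the faraway node
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is exactly the locality-preserving property.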
SAFE: Service Availability via Failure Elimination Through VNF Scaling
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337832
Rui Xia, Haipeng Dai, Jiaqi Zheng, Rong Gu, Xiaoyu Wang, Guihai Chen
Abstract: Virtualized network functions (VNFs) enable software applications to replace traditional middleboxes, making network service provision more flexible and scalable. This paper focuses on ensuring Service Availability via Failure Elimination (SAFE) using VNF scaling, that is, given the resource requirements of VNF instances, finding an optimal and robust instance consolidation strategy that can recover quickly from a single instance failure. To address this problem, we present a framework based on rounding and dynamic programming. First, we discretize the range of resource requirements for VNF instance deployment into several sub-ranges, so that the number of instance types becomes a constant. Second, we further reduce the number of instance types by gathering several small instances into a bigger one. Third, we propose an algorithm built on dynamic programming to solve the instance consolidation problem with a limited number of instance types. We set up a testbed to profile the functional relationship between resources and throughput for different types of VNF instances, and conduct simulations to validate our theoretical results against the profiling results. The simulation results show that our algorithm outperforms the standby deployment model by 27.33% on average in terms of the number of servers required. Furthermore, SAFE has marginal overhead, around 7.22%, compared to an instance consolidation strategy without VNF backup consideration.
Citations: 9
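The rounding step from the abstract, plus a packing pass, can be sketched as follows. The first-fit-decreasing heuristic stands in for the paper's dynamic program and is an assumption for illustration.

```python
# Illustrative sketch: round resource demands up to a constant number of
# sub-range breakpoints (instance "types"), then pack instances onto servers.
# First-fit-decreasing here is a stand-in for the paper's dynamic program.

def round_up(demand, breakpoints):
    """Round a demand up to the smallest breakpoint that covers it."""
    return min(b for b in breakpoints if b >= demand)

def pack(demands, capacity):
    """First-fit-decreasing packing; returns one demand list per server."""
    servers = []
    for d in sorted(demands, reverse=True):
        for s in servers:
            if sum(s) + d <= capacity:
                s.append(d)
                break
        else:
            servers.append([d])  # open a new server
    return servers
```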
A Read-leveling Data Distribution Scheme for Promoting Read Performance in SSDs with Deduplication
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337884
Mengting Lu, F. Wang, D. Feng, Yuchong Hu
Abstract: Deduplication, as a space-saving technology, is widely deployed in flash-based storage systems to address the capacity and endurance limitations of flash devices. In this paper, we find that deduplication changes the physical data layout, which raises the chances of an uneven read distribution. This uneven read distribution not only increases access contention but also deteriorates read parallelism, leading to read performance degradation. To solve this issue, we propose an efficient read-leveling data distribution scheme (RLDDS), which scatters highly-duplicated data across different parallel units, to improve the read performance of SSDs with deduplication under access-intensive workloads. RLDDS writes data to the parallel unit with the lowest potential read-hotness to balance the read distribution among all parallel units. Extensive experimental results show that RLDDS effectively improves read performance by up to 21.61% compared to deduplication with the conventional dynamic data allocation scheme. Additional benefits of RLDDS include improved write performance (by up to 23.69%) under access-intensive workloads and an overall system performance improvement (up to 18.22%) with the same write-traffic reduction.
Citations: 6
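The placement rule described in the abstract can be sketched in a few lines. The unit count and hotness bookkeeping below are illustrative assumptions about how "potential read-hotness" might be tracked.

```python
# Simplified sketch of read-leveling placement: write each chunk to the
# parallel unit with the lowest accumulated read-hotness, so highly
# referenced (deduplicated) data spreads across units.

def place(chunks, n_units):
    """chunks: list of (chunk_id, expected_reads). Returns per-unit layouts."""
    hotness = [0] * n_units
    layout = [[] for _ in range(n_units)]
    for cid, reads in chunks:
        u = hotness.index(min(hotness))  # coolest unit so far
        layout[u].append(cid)
        hotness[u] += reads
    return layout
```

A hot chunk claims one unit while cooler chunks fill the others, which is the read-balancing effect RLDDS targets.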
FuncyTuner
Pub Date: 2019-08-05 | DOI: 10.1145/3337821.3337842
Tao Wang, Nikhil Jain, D. Beckingsale, David Boehme, F. Mueller, T. Gamblin
Abstract: The de facto compilation model for production software compiles all modules of a target program with a single set of compilation flags, typically -O2 or -O3. Such a per-program compilation strategy may yield sub-optimal executables, since programs often have multiple hot loops with diverse code structures and may be better optimized with a per-region compilation model that assembles an optimized executable by combining the best per-region code variants. In this paper, we demonstrate that a naïve greedy approach to per-region compilation often degrades performance in comparison to the -O3 baseline. To overcome this problem, we contribute a novel per-loop compilation framework, FuncyTuner, which employs lightweight profiling to collect per-loop timing information, and then utilizes a space-focusing technique to construct a performant executable. Experimental results show that FuncyTuner can reliably improve the performance of modern scientific applications on several multi-core architectures by 9.2% to 12.3% and 4.5% to 10.7% (geometric mean; up to 22% on certain programs) in comparison to the -O3 baseline and prior work, respectively.
Citations: 5
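The per-region selection idea can be sketched as picking, for each profiled loop, the variant with the best measured time. The timing data and flag names are hypothetical; FuncyTuner's actual space-focusing search is more involved than this greedy selection.

```python
# Illustrative sketch of per-region variant selection: given per-loop timings
# of each compiled variant, keep the fastest variant of every loop (the flag
# names and timings are hypothetical).

def best_variants(timings):
    """timings: {loop_name: {flag_set: seconds}}. Returns {loop_name: flags}."""
    return {loop: min(variants, key=variants.get)
            for loop, variants in timings.items()}
```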