{"title":"云中的异常检测:挑战与实践","authors":"Kejiang Ye","doi":"10.1145/3129457.3129497","DOIUrl":null,"url":null,"abstract":"Cloud computing is an important infrastructure for many enterprises. After 10 years of development, cloud computing has achieved a great success, and has greatly changed the economy, society, science and industries. In particular, with the rapid development of mobile Internet and big data technology, almost all of the online services and data services are built on the top of cloud computing, such as the online banking services provided by banks, the electronic services provided by the news media, the government cloud information systems provided by the government departments, the mobile services provided by the communications companies. Besides, tens of thousands of Start-ups rely on the provision of cloud computing services. Therefore, ensuring cloud reliability is very important and essential. However, the reality is that the current cloud systems are not reliable enough. On February 28th 2017, Amazon Web Services, the popular storage and hosting platform used by a huge range of companies, experienced S3 service interruption for 4 hours in the Northern Virginia (US-EAST-1) Region, and then quickly spread other online service providers who rely on the S3 service [2]. This failure caused a huge economic loss. It is because cloud computing service providers typically set a Service Level Agreement (SLA) with customers. For example, when customers require 99.99% availability, it means that 99.99% of the time must meet the requirement for 365 days per year. If the service breaks more than 0.01%, compensation is required. In fact, with the continuous development and maturity of cloud computing, a large number of traditional business systems have been deployed on the cloud platform. Cloud computing integrates existing hardware resources through virtualization technology to create a shared resource pool that enables applications to obtain computing, storage, and network resources on demand, effectively enhancing the scalability and resource utilization of traditional IT infrastructures and significantly reducing the operation cost of the traditional business systems. However, with the growing number of applications running on the cloud, the scale of cloud data center has been expanding, the current cloud computing system has become very complex, mainly reflected in: 1) Large scale. A typical data center involves more than 100,000 servers and 10,000 switches, more nodes usually mean higher probability of failure; 2) Complex application structure. Web search, e-commerce and other typical cloud program has a complex interactive behavior. For example, an Amazon page request involves interaction with hundreds of components [7], error in any one component will lead to the whole application anomalies; 3) Shared resource pattern. One of the basic features of cloud computing is resource sharing, a typical server in Google Cloud data center hosts 5 to 18 applications simultaneously, each server runs about 10.69 applications [5]. Resource competition will interfere with each other and affect application performance. The complexity of these cloud computing systems, the complexity of application interaction structure and the inherent sharing pattern of cloud platforms make cloud systems more prone to performance anomalies than traditional platforms. It can be said that anomaly is a normal state in cloud computing [3]. 
For further analysis, resource competition, resource bottlenecks, misconfiguration, software defects, hardware failures, and external attacks can cause cloud system anomalies or failures. Performance anomaly refers to any sudden degradation of performance that deviates from the normal behavior of the system. Unlike outages that cause the system to stop running immediately, performance anomalies typically result in a decrease in system efficiency. The reasons such as misconfiguration, software defects, hardware failures, can often cause performance anomalies. For cloud computing systems, it is not enough to detect outages or other functional anomalies, because those anomalies often cause service interruption and can be resolved by simply restarting or replacing hardware. While performance anomalies caused by resource sharing and interference are more worthy of attention [4], because the performance anomalies can be eliminated before service interruption to ensure continued services. If the performance anomalies of cloud computing system are not timely handled, it may cause very serious consequences, which not only affect the business system to run normally, but also hinder the enterprise to deploy their services on cloud systems. Especially for the those latency-sensitive cloud applications, it is extremely important to eliminate performance anomalies in a timely manner. For example, Amazon found a 1% decline in sales per 100ms latency, Google found a 20% drop in traffic for every 0.5s latency in search page, and stock traders found that it would cause a loss of 400 Million dollars if their electronic trading platform lagged behind the competitors by 5 ms. Other research also shows that the average maximum time of cloud data center failure is about 100 hours, which seriously affects the experience of cloud service users. In the cloud environment, as a large number of business systems are deployed in the cloud data center, cloud data center failure will affect a large number of users, such as the previously mentioned Amazon S3 failure, resulting in serious economic losses. Thus, timely and accurate detection of the cloud computing anomalies is very important. Anomaly detection is an effective means to help cloud platform administrators monitor and analyze cloud behaviors and improve cloud reliability. It helps to identify unusual behavior of the system so that cloud platform administrators can take proactive operations before a system crash or service failure. However, due to the characteristics such as large-scale, complex and resource sharing, it is very difficult to accurately detect anomalies in cloud computing. If the anomalies can not be accurately detected, the further recovery will be out of the question. Due to the importance of the problem, current mainstream cloud computing service providers usually provide online monitoring services. Amazon developed CloudWatch [1] for its EC2 users to monitor virtual machine instance status, resource usage, network traffic, disk read and write status, etc. Google developed Dapper framework [6] to provide state monitoring for Google search engines. But those monitoring services only provide simple data presentation, lack of in-depth analysis of monitoring data (such as cross-level correlation analysis), and is not intelligent enough for anomaly reasoning (such as cross-node fault source localization). 
In cloud data center, as the size of detected objects is very large and interrelated, the object being detected itself is in a high dynamic environment, it is very challenging to detect anomalies in an accurate, real-time and adaptive way. The existing anomaly detection solution lacks effective discovery and reasoning of the anomaly, which leads to the inability to locate and eliminate the anomaly in time. This is also the main reason that causes the current cloud platform accident frequently. In this talk, we introduce our solution for anomaly detection in clouds. 1) Anomaly detection. In order to efficiently detect the potential anomalies, we perform large-scale offline performance testing and also create an online detection method. i) Offline testing. The purpose is to find the key performance bottleneck and quantify comparison between difference hardware and software. We first propose a three-layer benchmarking methodology to fully evaluates cloud performance and then present a new benchmark suite -- Virt-B [11] -- that measures various scenarios, such as single machine virtualization, server consolidation, VM mapping, VM live migration, HPC virtual clusters and Hadoop virtual cluster. Finally, we introduce a performance testing toolkit to automate the benchmarking process. ii) Online detection. The purpose is to monitor applications in real time and quickly detect potential faults. We propose a quantile regression based online anomaly detection method and did a case study on 67 real Yahoo! anomaly traffic datasets. 2) Anomaly inference. We propose a dependency graph based anomaly inference method. Dependency reflects interaction relationship and execution path, and can be used for fault localization. There are usually three methods that can be used to fetch the dependency graph: instrumentation, extract configuration files and analyze network traffic. We create light-weight agents to monitor the traffic and use sampling technique to reduce the overheads. The advantages of our solution include: supports VMs (Xen/KVM), accuracy guarantee based on probability theory, dynamic dependency construction, focuses on the PM/VM layer, and low overheads. 3) Anomaly recovery. The traditional recovery methods like Checkpoint/Restart has high overheads and are not suitable for latency-sensitive applications. While we propose two solutions: cache-aware fault VM isolation and migration-based fault recovery. i) Cash-Aware Fault Isolation [8]. We first give a quantitative definition of isolation, then we propose a VM-core scheduling method to improve the fault isolation. ii) Fault Recovery based on Migration [9, 10, 12]. We propose a fault recovery method based on live migration. The main advantages include: online service, low overheads, and can be used in large-scale cloud datacenters.","PeriodicalId":345943,"journal":{"name":"Proceedings of the first Workshop on Emerging Technologies for software-defined and reconfigurable hardware-accelerated Cloud Datacenters","volume":"66 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Anomaly Detection in Clouds: Challenges and Practice\",\"authors\":\"Kejiang Ye\",\"doi\":\"10.1145/3129457.3129497\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cloud computing is an important infrastructure for many enterprises. 
After 10 years of development, cloud computing has achieved a great success, and has greatly changed the economy, society, science and industries. In particular, with the rapid development of mobile Internet and big data technology, almost all of the online services and data services are built on the top of cloud computing, such as the online banking services provided by banks, the electronic services provided by the news media, the government cloud information systems provided by the government departments, the mobile services provided by the communications companies. Besides, tens of thousands of Start-ups rely on the provision of cloud computing services. Therefore, ensuring cloud reliability is very important and essential. However, the reality is that the current cloud systems are not reliable enough. On February 28th 2017, Amazon Web Services, the popular storage and hosting platform used by a huge range of companies, experienced S3 service interruption for 4 hours in the Northern Virginia (US-EAST-1) Region, and then quickly spread other online service providers who rely on the S3 service [2]. This failure caused a huge economic loss. It is because cloud computing service providers typically set a Service Level Agreement (SLA) with customers. For example, when customers require 99.99% availability, it means that 99.99% of the time must meet the requirement for 365 days per year. If the service breaks more than 0.01%, compensation is required. In fact, with the continuous development and maturity of cloud computing, a large number of traditional business systems have been deployed on the cloud platform. Cloud computing integrates existing hardware resources through virtualization technology to create a shared resource pool that enables applications to obtain computing, storage, and network resources on demand, effectively enhancing the scalability and resource utilization of traditional IT infrastructures and significantly reducing the operation cost of the traditional business systems. However, with the growing number of applications running on the cloud, the scale of cloud data center has been expanding, the current cloud computing system has become very complex, mainly reflected in: 1) Large scale. A typical data center involves more than 100,000 servers and 10,000 switches, more nodes usually mean higher probability of failure; 2) Complex application structure. Web search, e-commerce and other typical cloud program has a complex interactive behavior. For example, an Amazon page request involves interaction with hundreds of components [7], error in any one component will lead to the whole application anomalies; 3) Shared resource pattern. One of the basic features of cloud computing is resource sharing, a typical server in Google Cloud data center hosts 5 to 18 applications simultaneously, each server runs about 10.69 applications [5]. Resource competition will interfere with each other and affect application performance. The complexity of these cloud computing systems, the complexity of application interaction structure and the inherent sharing pattern of cloud platforms make cloud systems more prone to performance anomalies than traditional platforms. It can be said that anomaly is a normal state in cloud computing [3]. For further analysis, resource competition, resource bottlenecks, misconfiguration, software defects, hardware failures, and external attacks can cause cloud system anomalies or failures. 
Performance anomaly refers to any sudden degradation of performance that deviates from the normal behavior of the system. Unlike outages that cause the system to stop running immediately, performance anomalies typically result in a decrease in system efficiency. The reasons such as misconfiguration, software defects, hardware failures, can often cause performance anomalies. For cloud computing systems, it is not enough to detect outages or other functional anomalies, because those anomalies often cause service interruption and can be resolved by simply restarting or replacing hardware. While performance anomalies caused by resource sharing and interference are more worthy of attention [4], because the performance anomalies can be eliminated before service interruption to ensure continued services. If the performance anomalies of cloud computing system are not timely handled, it may cause very serious consequences, which not only affect the business system to run normally, but also hinder the enterprise to deploy their services on cloud systems. Especially for the those latency-sensitive cloud applications, it is extremely important to eliminate performance anomalies in a timely manner. For example, Amazon found a 1% decline in sales per 100ms latency, Google found a 20% drop in traffic for every 0.5s latency in search page, and stock traders found that it would cause a loss of 400 Million dollars if their electronic trading platform lagged behind the competitors by 5 ms. Other research also shows that the average maximum time of cloud data center failure is about 100 hours, which seriously affects the experience of cloud service users. In the cloud environment, as a large number of business systems are deployed in the cloud data center, cloud data center failure will affect a large number of users, such as the previously mentioned Amazon S3 failure, resulting in serious economic losses. Thus, timely and accurate detection of the cloud computing anomalies is very important. Anomaly detection is an effective means to help cloud platform administrators monitor and analyze cloud behaviors and improve cloud reliability. It helps to identify unusual behavior of the system so that cloud platform administrators can take proactive operations before a system crash or service failure. However, due to the characteristics such as large-scale, complex and resource sharing, it is very difficult to accurately detect anomalies in cloud computing. If the anomalies can not be accurately detected, the further recovery will be out of the question. Due to the importance of the problem, current mainstream cloud computing service providers usually provide online monitoring services. Amazon developed CloudWatch [1] for its EC2 users to monitor virtual machine instance status, resource usage, network traffic, disk read and write status, etc. Google developed Dapper framework [6] to provide state monitoring for Google search engines. But those monitoring services only provide simple data presentation, lack of in-depth analysis of monitoring data (such as cross-level correlation analysis), and is not intelligent enough for anomaly reasoning (such as cross-node fault source localization). In cloud data center, as the size of detected objects is very large and interrelated, the object being detected itself is in a high dynamic environment, it is very challenging to detect anomalies in an accurate, real-time and adaptive way. 
The existing anomaly detection solution lacks effective discovery and reasoning of the anomaly, which leads to the inability to locate and eliminate the anomaly in time. This is also the main reason that causes the current cloud platform accident frequently. In this talk, we introduce our solution for anomaly detection in clouds. 1) Anomaly detection. In order to efficiently detect the potential anomalies, we perform large-scale offline performance testing and also create an online detection method. i) Offline testing. The purpose is to find the key performance bottleneck and quantify comparison between difference hardware and software. We first propose a three-layer benchmarking methodology to fully evaluates cloud performance and then present a new benchmark suite -- Virt-B [11] -- that measures various scenarios, such as single machine virtualization, server consolidation, VM mapping, VM live migration, HPC virtual clusters and Hadoop virtual cluster. Finally, we introduce a performance testing toolkit to automate the benchmarking process. ii) Online detection. The purpose is to monitor applications in real time and quickly detect potential faults. We propose a quantile regression based online anomaly detection method and did a case study on 67 real Yahoo! anomaly traffic datasets. 2) Anomaly inference. We propose a dependency graph based anomaly inference method. Dependency reflects interaction relationship and execution path, and can be used for fault localization. There are usually three methods that can be used to fetch the dependency graph: instrumentation, extract configuration files and analyze network traffic. We create light-weight agents to monitor the traffic and use sampling technique to reduce the overheads. The advantages of our solution include: supports VMs (Xen/KVM), accuracy guarantee based on probability theory, dynamic dependency construction, focuses on the PM/VM layer, and low overheads. 3) Anomaly recovery. The traditional recovery methods like Checkpoint/Restart has high overheads and are not suitable for latency-sensitive applications. While we propose two solutions: cache-aware fault VM isolation and migration-based fault recovery. i) Cash-Aware Fault Isolation [8]. We first give a quantitative definition of isolation, then we propose a VM-core scheduling method to improve the fault isolation. ii) Fault Recovery based on Migration [9, 10, 12]. We propose a fault recovery method based on live migration. 
The main advantages include: online service, low overheads, and can be used in large-scale cloud datacenters.\",\"PeriodicalId\":345943,\"journal\":{\"name\":\"Proceedings of the first Workshop on Emerging Technologies for software-defined and reconfigurable hardware-accelerated Cloud Datacenters\",\"volume\":\"66 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the first Workshop on Emerging Technologies for software-defined and reconfigurable hardware-accelerated Cloud Datacenters\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3129457.3129497\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the first Workshop on Emerging Technologies for software-defined and reconfigurable hardware-accelerated Cloud Datacenters","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3129457.3129497","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cloud computing is an important infrastructure for many enterprises. After ten years of development, cloud computing has achieved great success and has profoundly changed the economy, society, science, and industry. In particular, with the rapid development of the mobile Internet and big data technology, almost all online services and data services are now built on top of cloud computing: the online banking services offered by banks, the electronic services offered by news media, the government cloud information systems run by government departments, and the mobile services offered by telecommunications companies. In addition, tens of thousands of start-ups rely on cloud computing services. Ensuring cloud reliability is therefore essential. The reality, however, is that current cloud systems are not reliable enough. On February 28th, 2017, Amazon Web Services, the popular storage and hosting platform used by a huge range of companies, experienced a four-hour S3 service interruption in the Northern Virginia (US-EAST-1) Region, and the outage quickly spread to other online service providers that rely on the S3 service [2]. This failure caused a huge economic loss, because cloud computing service providers typically sign a Service Level Agreement (SLA) with their customers. For example, a 99.99% availability requirement means that the service must meet its requirements 99.99% of the time, 365 days per year; if downtime exceeds the remaining 0.01%, compensation is required (a short worked example of this downtime budget is sketched below).

In fact, with the continuous development and maturation of cloud computing, a large number of traditional business systems have been deployed on cloud platforms. Cloud computing integrates existing hardware resources through virtualization technology to create a shared resource pool from which applications obtain computing, storage, and network resources on demand, effectively improving the scalability and resource utilization of traditional IT infrastructures and significantly reducing the operating cost of traditional business systems. However, as the number of applications running in the cloud grows and the scale of cloud data centers keeps expanding, current cloud computing systems have become very complex, mainly in three respects. 1) Large scale: a typical data center involves more than 100,000 servers and 10,000 switches, and more nodes usually mean a higher probability of failure. 2) Complex application structure: web search, e-commerce, and other typical cloud applications exhibit complex interactive behavior; an Amazon page request, for example, involves interactions with hundreds of components [7], and an error in any one of them can make the whole application behave anomalously. 3) Shared resources: resource sharing is one of the basic features of cloud computing; a typical server in a Google cloud data center hosts 5 to 18 applications simultaneously, about 10.69 applications per server on average [5], and co-located applications compete for resources, interfering with one another and degrading each other's performance. The scale of these systems, the complexity of their application interaction structures, and the inherent sharing pattern of cloud platforms make cloud systems more prone to performance anomalies than traditional platforms; it can be said that anomaly is a normal state in cloud computing [3]. Looking further, resource competition, resource bottlenecks, misconfiguration, software defects, hardware failures, and external attacks can all cause cloud system anomalies or failures.
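To make the SLA arithmetic mentioned above concrete, the following minimal Python sketch converts an availability target into a yearly downtime budget (at 99.99%, roughly 52.6 minutes per year). The helper function and the extra availability levels are illustrative assumptions, not material from the talk.

```python
# Minimal sketch (not from the talk): convert an SLA availability target into a
# yearly downtime budget. Only the 99.99% / 0.01% figures come from the abstract;
# the helper function and the extra availability levels are illustrative.

def downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Maximum allowed downtime, in minutes, over `days` days."""
    total_minutes = days * 24 * 60
    return (1.0 - availability) * total_minutes

if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        print(f"{target:.3%} availability -> "
              f"{downtime_budget_minutes(target):6.1f} min of downtime per year")
```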
A performance anomaly is any sudden degradation of performance that deviates from the normal behavior of the system. Unlike an outage, which stops the system immediately, a performance anomaly typically shows up as reduced system efficiency (a minimal illustrative detector of this kind of degradation is sketched below). Causes such as misconfiguration, software defects, and hardware failures often lead to performance anomalies. For cloud computing systems, detecting outages or other functional anomalies is not enough, because such anomalies have already interrupted the service and can often be resolved simply by restarting or replacing hardware. Performance anomalies caused by resource sharing and interference deserve more attention [4], because they can be eliminated before the service is interrupted, keeping the service running continuously. If the performance anomalies of a cloud computing system are not handled in time, the consequences can be very serious: they not only prevent business systems from running normally but also discourage enterprises from deploying their services on cloud systems. For latency-sensitive cloud applications in particular, it is extremely important to eliminate performance anomalies promptly. For example, Amazon found a 1% decline in sales for every 100 ms of added latency, Google found a 20% drop in traffic for every 0.5 s of added latency on its search page, and stock traders found that an electronic trading platform lagging 5 ms behind its competitors could cause a loss of 400 million dollars. Other research shows that the average maximum duration of a cloud data center failure is about 100 hours, which seriously affects the experience of cloud service users. Because so many business systems are deployed in cloud data centers, a data center failure affects a large number of users, as the Amazon S3 outage mentioned above did, resulting in serious economic losses.

Timely and accurate detection of cloud anomalies is therefore very important. Anomaly detection is an effective means of helping cloud platform administrators monitor and analyze cloud behavior and improve cloud reliability: it identifies unusual system behavior so that administrators can take proactive action before a system crash or service failure. However, because of the large scale, complexity, and resource sharing of cloud platforms, it is very difficult to detect anomalies accurately, and if anomalies cannot be detected accurately, any further recovery is out of the question. Given the importance of the problem, mainstream cloud providers already offer online monitoring services: Amazon developed CloudWatch [1] for its EC2 users to monitor virtual machine instance status, resource usage, network traffic, disk read/write activity, and so on, and Google developed the Dapper framework [6] to provide state monitoring for the Google search engine. These monitoring services, however, only provide simple data presentation; they lack in-depth analysis of the monitoring data (such as cross-level correlation analysis) and are not intelligent enough for anomaly reasoning (such as cross-node fault-source localization). In a cloud data center, the monitored objects are numerous and interrelated, and they sit in a highly dynamic environment, so detecting anomalies in an accurate, real-time, and adaptive way is very challenging.
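The notion of a performance anomaly as a sudden degradation that deviates from normal behavior can be made concrete with a very small rolling-baseline detector. The sketch below flags latency samples that rise far above a moving mean-plus-k-standard-deviations baseline; the window length, the threshold k, and the synthetic trace are assumptions made for illustration and are not part of the monitoring services or detection methods discussed here.

```python
# Minimal sketch (illustrative only): flag a latency sample as anomalous when it
# deviates strongly from a rolling baseline of recent "normal" behavior.
from collections import deque
from statistics import mean, pstdev

def detect_latency_anomalies(samples, window=60, k=3.0):
    """Yield (index, value) for samples exceeding mean + k * stddev of the last `window` points."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            baseline = mean(history)
            spread = pstdev(history) or 1e-9  # avoid a zero threshold on a flat trace
            if value > baseline + k * spread:
                yield i, value
        history.append(value)

# A mostly flat latency trace (ms) with one sudden degradation at index 100.
trace = [10.0] * 100 + [80.0] + [10.0] * 20
print(list(detect_latency_anomalies(trace)))  # -> [(100, 80.0)]
```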
Existing anomaly detection solutions lack effective discovery of and reasoning about anomalies, so anomalies cannot be located and eliminated in time; this is also the main reason why accidents occur so frequently on current cloud platforms. In this talk, we introduce our solution for anomaly detection in clouds.

1) Anomaly detection. To detect potential anomalies efficiently, we perform large-scale offline performance testing and also develop an online detection method. i) Offline testing. The goal is to find the key performance bottlenecks and to quantitatively compare different hardware and software. We first propose a three-layer benchmarking methodology that fully evaluates cloud performance, then present a new benchmark suite, Virt-B [11], that covers scenarios such as single-machine virtualization, server consolidation, VM mapping, VM live migration, HPC virtual clusters, and Hadoop virtual clusters, and finally introduce a performance testing toolkit that automates the benchmarking process. ii) Online detection. The goal is to monitor applications in real time and quickly detect potential faults. We propose a quantile-regression-based online anomaly detection method and present a case study on 67 real Yahoo! anomaly traffic datasets (an illustrative sketch of this style of detector follows this outline).

2) Anomaly inference. We propose a dependency-graph-based anomaly inference method. Dependencies reflect interaction relationships and execution paths and can be used for fault localization (see the dependency-graph sketch after this outline). There are usually three ways to obtain the dependency graph: instrumentation, extracting configuration files, and analyzing network traffic. We create lightweight agents to monitor traffic and use a sampling technique to reduce the overhead. The advantages of our solution include support for VMs (Xen/KVM), an accuracy guarantee based on probability theory, dynamic dependency construction, a focus on the PM/VM layer, and low overhead.

3) Anomaly recovery. Traditional recovery methods such as checkpoint/restart have high overhead and are not suitable for latency-sensitive applications, so we propose two solutions: cache-aware faulty-VM isolation and migration-based fault recovery. i) Cache-aware fault isolation [8]. We first give a quantitative definition of isolation and then propose a VM-core scheduling method that improves fault isolation. ii) Migration-based fault recovery [9, 10, 12]. We propose a fault recovery method based on live migration; its main advantages are that it works online, has low overhead, and can be used in large-scale cloud data centers.
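As an illustration of the quantile-regression idea behind the online detector, the sketch below fits upper and lower conditional quantiles of a synthetic traffic metric against time (using statsmodels' QuantReg) and flags points that fall outside that band. The quantile levels, the linear-in-time model, and the injected spike are assumptions made for this example; it is not the actual method evaluated on the Yahoo! datasets.

```python
# Illustrative sketch of a quantile-regression-style detector on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
t = np.arange(500)
traffic = 100 + 0.05 * t + rng.normal(0, 5, size=t.size)
traffic[250] += 60  # inject one anomalous spike

X = sm.add_constant(t)                          # design matrix: intercept + linear time trend
upper = sm.QuantReg(traffic, X).fit(q=0.995).predict(X)
lower = sm.QuantReg(traffic, X).fit(q=0.005).predict(X)

anomalies = np.where((traffic > upper) | (traffic < lower))[0]
print("flagged indices:", anomalies)            # the injected spike at index 250 should be flagged
```

The dependency-graph-based inference step can likewise be illustrated with a tiny example: build a directed graph of observed service-to-service interactions and rank the downstream dependencies shared by the services that currently look anomalous. The service names, edges, and ranking heuristic below are invented for illustration; the talk's graphs are built from sampled PM/VM network traffic with probabilistic accuracy guarantees.

```python
# Illustrative sketch of dependency-graph-based fault localization.
from collections import Counter
import networkx as nx

# Observed (caller, callee) interactions (hypothetical service names).
calls = [("frontend", "search"), ("frontend", "cart"),
         ("search", "index"), ("cart", "db"), ("search", "db")]
g = nx.DiGraph(calls)

anomalous = ["search", "cart"]            # services whose metrics currently look anomalous
suspects = Counter()
for svc in anomalous:
    for dep in nx.descendants(g, svc):    # every service that svc (transitively) depends on
        suspects[dep] += 1

# A dependency shared by all anomalous services is the most likely fault source.
print(suspects.most_common())             # "db" should rank first in this toy example
```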