Performance Monitoring and Analysis in Alibaba's Large Data Centers: Challenges and Practices

This article introduces the challenges and practices of **performance** monitoring and analysis in **Alibaba's** large data centers.

Singles' Day Shopping Festival

Alibaba's signature data-crunching event is the 11.11 Global Shopping Festival, a platform-wide sale held every year on China's Singles' Day, November 11.

The figure below uses data as of 2017. To convey the scale of the event, the graph in the upper left shows the sales figure: about US$25.3 billion, more than the combined sales of Thanksgiving, Black Friday, and Cyber Monday in the same year.

[Figure: 2017 Singles' Day sales (upper left) and peak transaction and payment rates (right)]

Technical experts pay more attention to the graph on the right. In 2017, Alibaba's cloud platform handled a peak of 325,000 transactions and 256,000 payments per second.

What does such high peak performance mean for Alibaba? It means cost! The reason we focus so intently on performance is that we aim to reduce costs while continuously improving performance through constant technological innovation.

Alibaba infrastructure

Behind the peak performance figures above is the massive software and hardware infrastructure that underpins them.

[Figure: Alibaba's infrastructure, from top-level applications through the enablement platform down to the data centers]

As shown in the figure above, Alibaba's infrastructure supports a variety of top-level applications, such as the e-commerce apps Taobao and Tmall, the logistics network Cainiao, the business messaging app DingTalk, the payment platform Alipay, and Alibaba Cloud. This diversity of business scenarios is characteristic of Alibaba's economy.

At the bottom are large data centers connecting millions of machines.

The middle tier is what Alibaba calls the "enablement platform". Closest to the upper-layer applications are databases, storage, middleware, and computing platforms. Below them sit resource scheduling, cluster management, and containers. At the bottom is system software such as the operating system, the JVM, and virtualization.

Performance analysis

Benchmark testing is useful because it offers a simple, reproducible way to measure performance. But it has limits: each test depends on its own defined operating environment and its hardware and software settings, and those settings can significantly affect the results. Another issue is whether the chosen hardware and software configurations actually reflect business requirements and therefore produce representative results.

[Figure: benchmark testing versus the real-world production environment]

Can the effects observed in benchmark tests extend to complex real-world production environments?

Alibaba's data centers run tens of thousands of business applications on millions of servers distributed around the world. So when considering software and hardware upgrades, it is crucial to know whether the effects observed in small benchmark tests will actually hold in these complex production environments.

For example, the same JVM feature may affect different Java applications differently, and the same application may behave differently on different hardware. Such cases are not uncommon. Since it is impossible to run tests on every application and every piece of hardware, a systematic approach is needed to estimate the overall performance impact of new features across applications and hardware.

In the data center, it is important to assess the overall performance impact of software and hardware upgrades. Take our 11.11 Global Shopping Festival as an example: we monitor our peak sales and trading indicators very carefully. But what if those business indicators double? Does that mean we must double the number of machines? To answer that, we need an evaluation method that shows the impact of a technology on the business while evaluating the technology itself. We have proposed many innovations and found many opportunities to optimize performance, but we also have to demonstrate their business value.

SPEED platform

To solve the above problems, we developed the System Performance Estimation, Evaluation, and Decision (SPEED) platform.

[Figure: the SPEED platform workflow of estimation, evaluation, and decision]

First, SPEED estimates what is currently happening in the data center: it collects data through global monitoring and analyzes that data to detect optimization potential.

SPEED then evaluates software and hardware upgrades online, in production. For example, before deploying a new hardware product, hardware vendors typically run their own performance tests to demonstrate performance gains. But those gains are specific to the vendors' test scenarios; what we need to know is how well the hardware suits our particular use cases.

This is not easy to answer. For obvious reasons, we cannot let hardware vendors run performance tests inside our business environment. Instead, we have to run a canary release of the new hardware ourselves. Of course, the larger the canary release, the more accurate the evaluation, but because the production environment is directly tied to the business, the canary cannot be very large at first. To minimize risk, a canary typically starts with a handful to a few dozen machines, then gradually scales up to hundreds or even thousands.

Therefore, an important contribution of SPEED is that even small canary releases yield reasonable estimates, which saves costs and mitigates risk in the long run. As a canary release grows, SPEED keeps improving the quality of its performance analysis, helping users make important decisions.
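As a rough illustration of how a small canary can still yield a defensible estimate, the Python sketch below bootstraps a confidence interval for the relative RUE change between canary and baseline machines. This is a generic statistical technique, not necessarily the method SPEED itself uses, and the per-machine RUE samples are made up.

```python
# Sketch: bootstrap a 95% confidence interval for the relative RUE change
# between a small canary group and the baseline fleet. All numbers are
# illustrative; this is not SPEED's actual algorithm.
import random

baseline_rue = [1.02, 0.98, 1.05, 1.01, 0.97, 1.03, 0.99, 1.04]  # per-machine RUE
canary_rue = [0.95, 0.93, 0.97, 0.96]                             # small canary group

def bootstrap_rel_change(a, b, iters=10_000, seed=42):
    """Resample both groups and collect the relative change in mean RUE."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        mean_a = sum(rng.choices(a, k=len(a))) / len(a)
        mean_b = sum(rng.choices(b, k=len(b))) / len(b)
        diffs.append((mean_b - mean_a) / mean_a)
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

lo, hi = bootstrap_rel_change(baseline_rue, canary_rue)
print(f"95% CI for relative RUE change: [{lo:+.1%}, {hi:+.1%}]")
```

If the whole interval stays below zero, the canary's lower RUE is unlikely to be sampling noise, even with only a few machines; as more machines join the canary, the interval tightens.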

The purpose of the decision step is not only to decide whether to perform a given hardware or software upgrade, but also to build a comprehensive understanding of the performance characteristics of the software and hardware stack. Understanding which hardware and software architectures better suit our application scenarios informs the direction of future software and hardware customization.

Understanding performance metrics is not easy

One of the most popular performance indicators in data centers is CPU utilization. But consider this: suppose the average CPU utilization per machine in a data center is 50%, application demand is no longer growing, and pieces of software do not interfere with each other. Can the number of machines in the data center be halved? Probably not.

Given the wide variety of workloads and the sheer amount of hardware, the main challenge of comprehensive performance analysis in the data center is finding an overall indicator that summarizes the performance of different workloads on different hardware. As far as we know, there is currently no unified standard. For SPEED, we proposed a metric called Resource Usage Effectiveness (RUE) to measure resource usage per unit of work.

RUE = Resource Usage / Work Done

Here, Resource Usage is the consumption of resources such as CPU, memory, storage, and network, and Work Done is the amount of work completed, such as queries served by e-commerce applications or tasks processed by big-data platforms.
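As a minimal sketch of the metric, assuming resource usage is measured in CPU-seconds and work done in completed queries (both the units and the numbers below are illustrative):

```python
# Minimal sketch of the RUE metric: resources spent per unit of work.
# Units are illustrative: CPU-seconds for resource usage, queries for work done.

def rue(resource_usage: float, work_done: float) -> float:
    """Resource Usage Effectiveness; lower is better."""
    if work_done <= 0:
        raise ValueError("work_done must be positive")
    return resource_usage / work_done

cpu_seconds = 1800.0   # CPU time consumed across a 5-minute window
queries = 1_200_000    # queries completed in the same window
print(f"RUE = {rue(cpu_seconds, queries):.2e} CPU-seconds per query")
```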

The RUE concept provides a multifaceted, comprehensive way to evaluate performance. Suppose the business side reports that the response time of an application on some machines has gone up, and the technical side confirms that system load and CPU utilization have indeed risen. The natural reaction is to worry that a newly released feature has caused a regression. A better response is to also look at the queries-per-second (QPS) throughput metric: if QPS is also rising, the application may simply be using more resources to complete more work since the release.
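A quick worked example with made-up numbers shows why QPS matters here: CPU utilization rises 20% after the release, but QPS rises 30%, so RUE (resources per unit of work) actually improves:

```python
# Made-up before/after numbers for the scenario above: CPU utilization and
# QPS both rise after the release. RUE tells us whether the extra resources
# are buying proportionally more work.
before = {"cpu_util": 0.40, "qps": 10_000}
after = {"cpu_util": 0.48, "qps": 13_000}

rue_before = before["cpu_util"] / before["qps"]
rue_after = after["cpu_util"] / after["qps"]

# CPU rose 20% but QPS rose 30%, so RUE dropped about 8%:
# an improvement, not a regression.
print(f"RUE: {rue_before:.2e} -> {rue_after:.2e} "
      f"({(rue_after / rue_before - 1) * 100:+.1f}%)")
```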

As you can see, performance must be measured in a multifaceted, comprehensive manner. Otherwise, you can draw unreasonable conclusions and miss real opportunities for performance optimization.

The data collected is not always correct

Performance analysis requires data collection, but how do you know that the data you collect is correct?

Consider the example of Intel's Hyper-Threading technology. Modern laptops typically have two hardware cores. With Hyper-Threading enabled, a dual-core machine presents four hardware threads (logical CPUs).

In the figure below, the top graph shows a machine with two hardware cores and Hyper-Threading disabled. The task manager reports the machine's average CPU utilization as 100% because the CPU resources of both cores are fully used.

The graph in the lower left shows a machine that also has two hardware cores but with Hyper-Threading enabled. Its average CPU utilization is 50% because one hardware thread in each physical core is fully used. The lower-right graph shows another dual-core machine with Hyper-Threading enabled, where both hardware threads of a single core are fully used; its average CPU utilization is also 50%.

[Figure: CPU utilization with Hyper-Threading disabled (top) and enabled (lower left and lower right)]

Here is the problem: in reality, the CPU usage in the lower-left graph and the CPU usage in the lower-right graph are completely different, yet if you collect only the machine-wide average CPU utilization, the two look exactly the same.

As you can see, when analyzing performance data, we cannot think only about data processing and algorithms; we also have to consider how the data is collected.
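For instance, on Linux one way to keep the two 50% cases above distinguishable is to collect utilization per logical CPU and group the threads by physical core, rather than recording only the machine-wide average. The sketch below reads the standard /proc/stat and sysfs topology files; for brevity it reports cumulative since-boot fractions, whereas a real collector would sample twice and take the difference.

```python
# Sketch: per-logical-CPU utilization grouped by physical core on Linux.
from collections import defaultdict
from pathlib import Path

def core_of(cpu_id: int) -> str:
    """Map a logical CPU to its physical core via sysfs topology."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu_id}/topology/core_id")
    return path.read_text().strip()

def busy_fraction(fields) -> float:
    """Fraction of non-idle time from one /proc/stat cpuN line.
    Counters are cumulative since boot; a real collector would
    sample twice and diff the readings."""
    vals = [int(x) for x in fields]
    idle = vals[3] + vals[4]  # idle + iowait columns
    return 1 - idle / sum(vals)

per_core = defaultdict(list)
for line in Path("/proc/stat").read_text().splitlines():
    # Per-CPU lines look like "cpu0 ..."; the machine-wide line is "cpu ...".
    if line.startswith("cpu") and line[3].isdigit():
        name, *fields = line.split()
        per_core[core_of(int(name[3:]))].append(busy_fraction(fields))

# Two busy threads on one core vs. one busy thread on each of two cores
# yield the same machine-wide average but different per-core pictures.
for core, threads in sorted(per_core.items()):
    print(f"core {core}: busy fractions {[f'{t:.0%}' for t in threads]}")
```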

Heterogeneous hardware

Hardware heterogeneity is a major challenge for performance analysis in data centers, but it also points to the direction of performance optimization.

For example, in the figure below, the Broadwell architecture on the left was the mainstream Intel server CPU architecture for the past several years, while Skylake on the right is its successor.

[Figure: Intel Broadwell (left) and Skylake (right) architectures]

There are big differences between the two architectures. For example, in Broadwell, memory access traverses the long-used dual-ring interconnect, while Skylake changes it to a mesh. In addition, Skylake expands the L2 cache fourfold.

Each of these differences has strengths and weaknesses. You should measure their impact on overall performance and weigh that, together with cost, when deciding whether to upgrade your data center servers to Skylake.

It is important to understand such hardware differences because they affect every application running on that hardware, and they strongly influence the direction of customization and optimization.

Complex software architecture

The software architecture of modern Internet services is very complex, and Alibaba's e-commerce architecture is a prime example. Such complex software architectures also pose significant challenges to performance analysis in the data center.

As a simple example, in the figure below, the right side is a coupon application, the upper left is an application for a large promotion venue, and the lower left is a shopping cart application. These are typical business scenarios in e-commerce.

[Figure: the coupon application (right) and its two portals, the promotion venue (upper left) and the shopping cart (lower left)]

From a Java development perspective, each business scenario is an application. E-commerce customers can claim a coupon from the promotion venue or from the shopping cart, whichever they prefer. From a software architecture perspective, the venue and the shopping cart are two separate applications, each serving as an entry point (portal) to the coupon application. Each portal reaches the coupon application via a different call path, and each path affects performance differently.

Therefore, when analyzing the overall performance of the coupon application, you need to account for the complex application dependencies and call paths across the whole e-commerce architecture. These diverse business scenarios and complex call paths are difficult to reproduce fully in benchmark tests, which is why performance must be evaluated in the production environment.

Beware of Simpson's paradox

The figure below shows Simpson's paradox, a well-known pitfall in real-world data analysis.

[Figure: Simpson's paradox in the canary analysis of feature S]

Continuing the earlier example, suppose the app in the figure is the coupon application, a new feature S is released during the promotion, and the canary machines make up 1% of the total. By the RUE metric, feature S appears to improve performance by 8%. However, the coupon application serves three different groups, and as discussed above, each group corresponds to a different portal. From the perspective of every individual group, feature S actually degrades the application's performance.

With the same data set and the same criteria, the overall aggregate analysis yields the exact opposite of the result obtained by analyzing each part individually. This is Simpson's paradox.

You must therefore be careful about whether to trust the overall evaluation or the per-group results. In this example, the correct approach is to look at each group and conclude that feature S actually degrades performance and needs further modification and optimization.
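The paradox is easy to reproduce with a few made-up numbers. In the sketch below, the canary machines happen to sit mostly in the group with the lowest (best) RUE, so the aggregate RUE appears to improve even though every individual group gets worse:

```python
# Made-up numbers reproducing the paradox with RUE (lower is better). The
# canary machines sit mostly in the "cheapest" group, so the aggregate
# looks better even though every group individually got worse.
groups = {
    # group: (baseline_machines, baseline_rue, canary_machines, canary_rue)
    "venue portal": (500, 1.00, 8, 1.05),
    "cart portal":  (300, 1.50, 1, 1.58),
    "direct":       (200, 3.00, 1, 3.15),
}

def weighted_rue(pairs):
    """Machine-weighted average RUE over (count, rue) pairs."""
    total = sum(n for n, _ in pairs)
    return sum(n * r for n, r in pairs) / total

baseline = weighted_rue([(n, r) for n, r, _, _ in groups.values()])
canary = weighted_rue([(n, r) for _, _, n, r in groups.values()])

print(f"aggregate RUE: {baseline:.3f} -> {canary:.3f} (looks like an improvement)")
for g, (_, r0, _, r1) in groups.items():
    # Yet each group's RUE rose, i.e. performance got worse:
    print(f"  {g}: {r0:.2f} -> {r1:.2f} ({(r1 / r0 - 1) * 100:+.1f}%)")
```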

Simpson's paradox reminds us why it is essential to watch for potential pitfalls in performance analysis. Decisions based on poorly chosen metrics or unrepresentative data, or decisions that ignore the hardware and software architecture, can be very costly to the business.
