A detailed example of troubleshooting and locating a MySQL performance bottleneck

  • 2021-01-02 22:00:33
  • OfStack

This article presents an example of troubleshooting and locating a MySQL performance bottleneck, shared for your reference. The details are as follows:

Overview

Starting from a single incident, this post walks through the whole process of locating a performance bottleneck.

The troubleshooting process

We received an alert that the load on the MySQL instance backing an online business was relatively high, so we logged in to the server to check and confirm.

1. First, check and confirm at the OS level

After logging into the server, our goal is to first identify which processes are currently causing the high load, where those processes are stuck, and what the bottleneck is.

In general, the subsystem most likely to become a bottleneck on a server is disk I/O, because it is usually the slowest at reading and writing. Even with today's PCIe SSDs, random reads and writes are still not as fast as memory. Of course, there are many possible reasons for slow disk I/O, and you need to determine which one applies.

As the first step, we usually look at the overall load; if the load is high, every process will run more slowly.

You can run w or sar -q 1 to view the load data, for example:


[yejr@imysql.com:~ ]# w
 11:52:58 up 702 days, 56 min, 1 user, load average: 7.20, 6.70, 6.47
USER   TTY   FROM       LOGIN@  IDLE  JCPU  PCPU WHAT
root   pts/0  1.xx.xx.xx    11:51  0.00s 0.03s 0.00s w

Or observe it with sar -q:


[yejr@imysql.com:~ ]# sar -q 1
Linux 2.6.32-431.el6.x86_64 (yejr.imysql.com)   01/13/2016   _x86_64_  (24 CPU)
02:51:18 PM  runq-sz plist-sz  ldavg-1  ldavg-5 ldavg-15  blocked
02:51:19 PM     4   2305   6.41   6.98   7.12     3
02:51:20 PM     2   2301   6.41   6.98   7.12     4
02:51:21 PM     0   2300   6.41   6.98   7.12     5
02:51:22 PM     6   2301   6.41   6.98   7.12     8
02:51:23 PM     2   2290   6.41   6.98   7.12     8

load average indicates how many tasks are waiting in the CPU run queue; the longer the queue, the higher the load. Generally speaking, a load value above 5 is already fairly high.

High load can be caused by a number of factors:

Some processes/services are consuming too much CPU (the service is handling more requests, or there is an application bottleneck);

Heavy swapping is occurring (available physical memory has run out);

A heavy interrupt load (caused by the SSD or the network, for example);

Disk I/O is slow (causing the CPU to spend a long time waiting for disk I/O requests);
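
Each of these factors can be checked individually with standard tools before drilling into specific processes; a rough sketch, assuming the sysstat package (mpstat, iostat) is installed:

# Swap activity: the si/so columns should stay near zero
vmstat 1
# Per-CPU usage; %irq and %soft reveal interrupt load
mpstat -P ALL 1
# Per-device I/O latency and saturation: watch the await and %util columns
iostat -x 1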

At this point we can execute the following command to determine which subsystem the bottleneck is in:


[yejr@imysql.com:~ ]# top
top - 11:53:04 up 702 days, 56 min, 1 user, load average: 7.18, 6.70, 6.47
Tasks: 576 total,  1 running, 575 sleeping,  0 stopped,  0 zombie
Cpu(s): 7.7%us, 3.4%sy, 0.0%ni, 77.6%id, 11.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 49374024k total, 32018844k used, 17355180k free,  115416k buffers
Swap: 16777208k total,  117612k used, 16659596k free, 5689020k cached
 PID USER   PR NI VIRT RES SHR S %CPU %MEM  TIME+ COMMAND
14165 mysql   20  0 8822m 3.1g 4672 S 162.3 6.6 89839:59 mysqld
40610 mysql   20  0 25.6g 14g 8336 S 121.7 31.5 282809:08 mysqld
49023 mysql   20  0 16.9g 5.1g 4772 S 4.6 10.8  34940:09 mysqld

It is clear that the first two mysqld processes caused the overall load to be higher.

Furthermore, the Cpu(s) line shows relatively high %us and %wa values, which suggests that the current bottleneck lies in user-process CPU consumption and disk I/O waits.

Let's first analyze the disk I/O case.

Run sar -d to verify that disk I/O really is heavy:


[yejr@imysql.com:~ ]# sar -d 1
Linux 2.6.32-431.el6.x86_64 (yejr.imysql.com)   01/13/2016   _x86_64_  (24 CPU)
                  DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz   await   svctm   %util
11:54:32 AM  dev8-0  5338.00 162784.00  1394.00   30.76   5.24   0.98   0.19  100.00
11:54:33 AM  dev8-0  5134.00 148032.00 32365.00   35.14   6.93   1.34   0.19  100.10
11:54:34 AM  dev8-0  5233.00 161376.00  996.00   31.03   9.77   1.88   0.19  100.00
11:54:35 AM  dev8-0  4566.00 139232.00  1166.00   30.75   5.37   1.18   0.22  100.00
11:54:36 AM  dev8-0  4665.00 145920.00  630.00   31.41   5.94   1.27   0.21  100.00
11:54:37 AM  dev8-0  4994.00 156544.00  546.00   31.46   7.07   1.42   0.20  100.00

The %util column stays at 100%, which means the device is saturated. Next, use iotop to identify which processes consume the most disk I/O resources:


[yejr@imysql.com:~ ]# iotop
Total DISK READ: 60.38 M/s | Total DISK WRITE: 640.34 K/s
 TID PRIO USER   DISK READ DISK WRITE SWAPIN   IO>  COMMAND
16397 be/4 mysql    8.92 M/s  0.00 B/s 0.00 % 94.77 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=3320
 7295 be/4 mysql   10.98 M/s  0.00 B/s 0.00 % 93.59 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=3320
14295 be/4 mysql   10.50 M/s  0.00 B/s 0.00 % 93.57 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=3320
14288 be/4 mysql   14.30 M/s  0.00 B/s 0.00 % 91.86 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=3320
14292 be/4 mysql   14.37 M/s  0.00 B/s 0.00 % 91.23 % mysqld --basedir=/usr/local/m~og_3320/mysql.sock --port=3320

As you can see, the instance listening on port 3320 consumes the most disk I/O resources, so let's take a look at what queries are running in that instance.

2. Check and confirm at the MySQL level

First, take a look at what queries are currently running:


[yejr@imysql.com(db)]> mysqladmin pr|grep -v Sleep
+----+----+----------+----+-------+-----+--------------+-----------------------------------------------------------------------------------------------+
| Id |User| Host   | db |Command|Time | State    | Info                                             |
+----+----+----------+----+-------+-----+--------------+-----------------------------------------------------------------------------------------------+
| 25 | x | 10.x:8519 | db | Query | 68 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>404612 order by Fvideoid) t1 |
| 26 | x | 10.x:8520 | db | Query | 65 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>484915 order by Fvideoid) t1 |
| 28 | x | 10.x:8522 | db | Query | 130 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>404641 order by Fvideoid) t1 |
| 27 | x | 10.x:8521 | db | Query | 167 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>324157 order by Fvideoid) t1 |
| 36 | x | 10.x:8727 | db | Query | 174 | Sending data | select max(Fvideoid) from (select Fvideoid from t where Fvideoid>324346 order by Fvideoid) t1 |
+----+----+----------+----+-------+-----+--------------+-----------------------------------------------------------------------------------------------+

We can see that a number of slow queries are still running, and the slow query log also shows that this kind of SQL appears very frequently.
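
One way to confirm how frequently this statement pattern shows up is to aggregate the slow query log with mysqldumpslow, which ships with MySQL. A minimal sketch; the log path below is only an example, check the slow_query_log_file variable for the real one:

# Group similar statements, sort by occurrence count (-s c), show the top 10 (-t 10)
mysqldumpslow -s c -t 10 /data/mysql_3320/slow.log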

This is a very inefficient way to write the SQL: it forces a scan of the entire primary key even though only a single maximum value is needed, as the slow query log confirms:

Rows_sent: 1 Rows_examined: 5502460
Scanning more than 5 million rows just to return one maximum value is extremely inefficient.
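
A quick way to confirm the near-full scan is to run EXPLAIN on the statement and check the rows estimate; a sketch using the query captured in the processlist above:

EXPLAIN SELECT MAX(Fvideoid)
FROM (SELECT Fvideoid FROM t WHERE Fvideoid > 404612 ORDER BY Fvideoid) t1;
-- A rows estimate in the millions on table t confirms the (almost) full primary key scan.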

After analysis, a simple modification lets this SQL finish in single-digit milliseconds instead of the original 150-180 seconds, an improvement of several orders of magnitude.

The transformation: sort the result set in descending order and take the first record. The original approach sorted in ascending order and then took the last record... facepalm.
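
A sketch of that rewrite, using the table and column names seen in the processlist (assuming Fvideoid is the indexed primary key):

-- Original: materializes every Fvideoid above the threshold, sorts it ascending,
-- then takes MAX() over the whole derived table (millions of rows examined).
SELECT MAX(Fvideoid)
FROM (SELECT Fvideoid FROM t WHERE Fvideoid > 404612 ORDER BY Fvideoid) t1;

-- Rewrite: walk the index in descending order and stop at the first row.
SELECT Fvideoid
FROM t
WHERE Fvideoid > 404612
ORDER BY Fvideoid DESC
LIMIT 1;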

Final words and summary

In this example the cause of the bottleneck was relatively easy to locate and the SQL optimization was not difficult. In real production environments, high load is usually caused by the following common reasons:

A single request reads or writes too much data, driving disk I/O very high, for example one SQL statement that reads or updates tens of thousands of rows or more; the best fix is to reduce the amount of data read and written per request;

An SQL query has no suitable index to use for its filter conditions, sorting (ORDER BY), grouping (GROUP BY), or aggregation (MIN/MAX/COUNT/AVG, etc.); the fix is to add an index or rewrite the SQL (see the index sketch after this list);

A sudden burst of requests arrives; generally it is fine as long as the peak can be absorbed, but to be safe it is better to raise the server configuration appropriately, because a peak that cannot be absorbed may trigger an avalanche effect;

Certain scheduled tasks, such as statistical analysis or backups, drive up the load by consuming a lot of CPU, memory, and disk I/O; these are best run on a dedicated slave server;

The server's power-saving policy lowers the CPU frequency when load is low and raises it when load goes up, but usually not quickly enough, so CPU performance falls short and a sudden burst of requests cannot be handled (see the tuning sketch after this list);

When a RAID card is used it usually comes with a BBU (the backup battery for the cache module). Early BBUs used lithium batteries that need periodic charge/discharge learn cycles (roughly every 90 days on DELL servers and every 30 days on IBM); by watching the schedule we can trigger the discharge ahead of time during an off-peak window before the next cycle is due. Newer servers mostly use capacitor-based batteries, which do not have this problem;

The file system is ext4 or even ext3 instead of xfs; under heavy I/O pressure %util is likely to sit at 100% while IOPS cannot be pushed any higher;

The kernel's I/O scheduler is cfq instead of deadline or noop; this can be changed directly online and can bring a big improvement (see the tuning sketch after this list).
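
For the missing-index case above, the fix is usually just an ADD INDEX on the filtered and sorted columns; a hypothetical sketch (the table and column names are made up for illustration only):

-- Suppose the slow statement filters on Fuid and sorts on Fcreate_time:
ALTER TABLE t_video ADD INDEX idx_uid_ctime (Fuid, Fcreate_time);
-- Afterwards, verify with EXPLAIN that the new index is chosen and the rows estimate drops.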
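
For the power-saving and I/O scheduler points, both settings can be inspected and changed online; a sketch of the usual commands (the CPU and disk device paths are examples, adjust them to your hardware):

# Check and pin the CPU frequency governor (cpupower comes with kernel-tools / cpupowerutils)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cpupower frequency-set -g performance

# Check the active I/O scheduler for a disk (shown in brackets) and switch it online
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler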


I hope this article has been helpful to you with the MySQL database.

