Some pits that have to be paid attention to after upgrading to MySQL 5.7

2021-11-01 05:15:13
OfStack

Preface

Some time ago, the online MySQL database was upgraded to 5.7. Considering the possible incompatibility, I was really afraid before upgrading, although the test environment and development environment were upgraded ahead of schedule as early as six months ago.

Based on previous research and feedback from friends, there are two main points related to development:

sql_mode

In MySQL 5.6, its default value is "NO_ENGINE_SU BSTITUTION", which can be understood as a non-strict mode. For example, inserting an empty string ''into the self-increasing primary key, although prompting warning, does not affect the generation of the self-increasing primary key.

However, in MySQL 5.7, it is adjusted to strict mode. For the above one, it will not prompt warning, but report an error directly.

Grouping for maximum value

Some writing methods of finding the maximum value in groups can not get the expected results in MySQL 5.7, which is relatively hidden.

Among them, the first point is controllable, after all, the parameters can be adjusted. The second point, however, is uncontrollable, and there are no parameters related to it, so it is necessary to develop Review code.

Let's take a look at it in detail

Test data


mysql> select * from emp;
+-------+----------+--------+--------+
| empno | ename | sal | deptno |
+-------+----------+--------+--------+
| 1001 | emp_1001 | 100.00 | 10 |
| 1002 | emp_1002 | 200.00 | 10 |
| 1003 | emp_1003 | 300.00 | 20 |
| 1004 | emp_1004 | 400.00 | 20 |
| 1005 | emp_1005 | 500.00 | 30 |
| 1006 | emp_1006 | 600.00 | 30 |
+-------+----------+--------+--------+
rows in set (0.00 sec)

Where empno is the employee number, ename is the employee name, sal is the salary, and deptno is the employee department number.

The need of the business is to find out the relevant information of the highest-paid employees in each department.

In MySQL 5.6, we can do this with the following SQL.


SELECT
 deptno,ename,sal 
FROM
 ( SELECT * FROM emp ORDER BY sal DESC ) t 
GROUP BY
 deptno;

The results are as follows, and it can be seen that it really achieved the expected results.


+--------+----------+--------+
| deptno | ename | sal |
+--------+----------+--------+
| 10 | emp_1002 | 200.00 |
| 20 | emp_1004 | 400.00 |
| 30 | emp_1006 | 600.00 |
+--------+----------+--------+

Let's take a look at the results of MySQL5.7, but they are not the same.


+--------+----------+--------+
| deptno | ename | sal |
+--------+----------+--------+
| 10 | emp_1001 | 100.00 |
| 20 | emp_1003 | 300.00 |
| 30 | emp_1005 | 500.00 |
+--------+----------+--------+

In fact, in MySQL 5.7, the SQL was rewritten, and the rewritten SQL can be viewed through explain (extended) + show warnings.


mysql> explain select deptno,ename,sal from (select * from emp order by sal desc) t group by deptno;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
| 1 | SIMPLE | emp | NULL | ALL | NULL | NULL | NULL | NULL | 6 | 100.00 | Using temporary |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-----------------+
row in set, 1 warning (0.00 sec)

mysql> show warnings\G
*************************** 1. row ***************************
 Level: Note
 Code: 1003
Message: /* select#1 */ select `slowtech`.`emp`.`deptno` AS `deptno`,`slowtech`.`emp`.`ename` AS `ename`,`slowtech`.`emp`.`sal` AS `sal` from `slowtech`.`emp` group by `slowtech`.`emp`.`deptno`
row in set (0.00 sec)

From the rewritten SQL, it eliminates sub-queries, resulting in the failure to achieve the expected results. The official also confirmed this point, https://bugs.mysql.com/bug.php? id=80131

Many people may disagree and think that no one will write this, but in the famous stackoverflow, there are 116 praises for this implementation-thus its audience is wide, second only to the "Method 2" mentioned later (206 praises).

https://stackoverflow.com/questions/12102200/get-records-with-max-value-for-each-group-of-grouped-sql-results

It should be noted that this SQL cannot be run directly in 5.7, and it will prompt the following error:

ERROR 1055 (42000): Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column 't.ename' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

This relates to sql_mode, and in MySQL 5.7, sql_mode is adjusted to


ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION

Among them, ONLY_FULL_GROUP_BY is related to group by statement, which requires that only grouping columns (that is, columns after group by) and aggregate functions (sum, avg, max, etc.) can appear in select list, which is also the standard of SQL92.

However, in my work, I often see the development write the following SQL.


mysql> select deptno,ename,max(sal) from emp group by deptno;
+--------+----------+----------+
| deptno | ename | max(sal) |
+--------+----------+----------+
| 10 | emp_1001 | 200.00 |
| 20 | emp_1003 | 400.00 |
| 30 | emp_1005 | 600.00 |
+--------+----------+----------+
rows in set (0.01 sec)

I really don't understand the significance of ename here at the business level. After all, he is not the highest paid employee.

Grouping to find the maximum value, the implementation of MySQL

In fact, it is a very common requirement to find the maximum value in groups. At work, it is often asked by development colleagues. Let's take a look at the implementation methods in MySQL.

Method 1


SELECT
 e.deptno,
 ename,
 sal 
FROM
 emp e,
 ( SELECT deptno, max( sal ) maxsal FROM emp GROUP BY deptno ) t 
WHERE
 e.deptno = t.deptno 
 AND e.sal = t.maxsal;

Method 2


SELECT
 a.deptno,
 a.ename,
 a.sal 
FROM
 emp a
 LEFT JOIN emp b ON a.deptno = b.deptno 
 AND a.sal < b.sal 
WHERE
 b.sal IS NULL;

These two ways of implementation, in fact, are common, not only applicable to MySQL, but also applicable to other mainstream relational databases.

Method 3

MySQL 8.0 introduces the analysis function, which can also achieve similar functions.


SELECT
 deptno,
 ename,
 sal 
FROM
 (
 SELECT
 deptno,
 ename,
 sal,
 LAST_VALUE ( sal ) OVER ( PARTITION BY deptno ORDER BY sal ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) maxsal 
 FROM
 emp 
 ) a 
WHERE
 sal = maxsal;

Performance comparison of three implementation methods

Because the data volume of the above test cases is too small, the results of the three implementation methods are all out in seconds, and it is difficult to intuitively see the advantages and disadvantages of the implementation methods only by the execution plan.

The test data with larger data volume is replaced below, the table dept_emp in the official sample database employees, https://github.com/datacharmer/test_db

The information in the table is as follows, where emp_no is the employee number, dept_no is the department number, and from_date is the date of entry.


SELECT
 deptno,ename,sal 
FROM
 ( SELECT * FROM emp ORDER BY sal DESC ) t 
GROUP BY
 deptno;

Method 1


SELECT
 deptno,ename,sal 
FROM
 ( SELECT * FROM emp ORDER BY sal DESC ) t 
GROUP BY
 deptno;

Method 2


SELECT
 deptno,ename,sal 
FROM
 ( SELECT * FROM emp ORDER BY sal DESC ) t 
GROUP BY
 deptno;

Method 3


mysql> select dept_no,emp_no,from_date from ( select dept_no,emp_no,from_date,last_value(from_date) over(partition by dept_no order by from_date rows between unbounded preceding and unbounded following) max_hiredate from dept_emp) a where from_date=max_hiredate;
 … 
rows in set (1.57 sec)

mysql> desc select dept_no,emp_no,from_date from ( select dept_no,emp_no,from_date,last_value(from_date) over(partition by dept_no order by from_date rows between unbounded preceding and unbounded following) max_hiredate from dept_emp) a where from_date=max_hiredate;
+----+-------------+------------+------------+------+---------------+------+---------+------+--------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra  |
+----+-------------+------------+------------+------+---------------+------+---------+------+--------+----------+----------------+
| 1 | PRIMARY | <derived2> | NULL | ALL | NULL  | NULL | NULL | NULL | 331008 | 100.00 | Using where |
| 2 | DERIVED | dept_emp | NULL | ALL | NULL  | NULL | NULL | NULL | 331008 | 100.00 | Using filesort |
+----+-------------+------------+------------+------+---------------+------+---------+------+--------+----------+----------------+
rows in set, 2 warnings (0.00 sec)

In terms of execution time,

Method 1 takes the shortest time. In the case of composite indexes (deptno, fromdate), the results come out instantly, and even without indexes, only 0.75 s is consumed.

Method 2 took the longest time, and there was no result after 3 hours. It took 87 minutes and 49 seconds to check the same data and the same SQL in Oracle.

The time of method 3 is relatively fixed, and it is maintained at about 1.5 s whether there is index or not, which takes longer than that of method 1.

Here, the previously mentioned implementation of MySQL 5.7, which is no longer compatible, is also tested. Without any index, it stabilizes at 0.7 s (the performance is not weak, no wonder someone uses it), while in the same case, method 1 stabilizes at 0.5 s (ha, MySQL 5.6 is actually faster than 8.0). However, unlike Method 1, it cannot be optimized by index.

From the implementation plan,

Method 1: First, put the results of group by into the temporary table, and then use the temporary table as the driving table to query with the dept_emp table. The driver table is small (only 9 records), and the associated columns have indexes. No wonder the results can come out in seconds.

Method 2, two tables are associated. It makes two big taboos in SQL optimization.

1. The driver table is too large, with 331,603 records.

2. Although the driven table also has indexes, from the execution plan, it only uses dept_no in the composite indexes (dept_no, from_date), and the selection rate of dept_no is too low, after all, there are only 9 departments.

Method 3: Put the analysis results into a temporary table first, and then process the temporary table. It performs two full table scans, one for the dept_emp table and one for the temporary table.

Therefore, for the requirement of grouping to find the maximum value, it is recommended to use Method 1, which not only conforms to SQL specification, but also has the best query performance, especially in the case of joint index.

Summarize