Analyzing an anomalous MySQL query case

  • 2020-10-07 18:54:49
  • OfStack

The problem

A user ticket reported: the same statement, differing only in the LIMIT value at the end. Oddly, LIMIT 10 runs over 100 times slower than LIMIT 100.

With the user's table information anonymized, the statements and results are shown below:


SELECT f1 , SUM(`f2`) `CNT` FROM T WHERE f1 IS NOT NULL AND f3 = '2014-05-12' GROUP BY f1 ORDER BY `CNT` DESC LIMIT 10;

Execution time: 3 min 3.65 sec


SELECT f1 , SUM(`f2`) `CNT` FROM T WHERE f1 IS NOT NULL AND f3 = '2014-05-12' GROUP BY f1 ORDER BY `CNT` DESC LIMIT 100;

Execution time: 1.24 sec

The performance gap is huge!

Analysis

MySQL Tips: the most common way to trace how a statement executes is to inspect its execution plan with EXPLAIN.

By narrowing down the range, we found an even more striking effect: with this data, the execution plans for LIMIT 67 and LIMIT 68 differ significantly.

Two execution plans:


LIMIT 67
id: 1
select_type: SIMPLE
table: a
type: range
possible_keys: A,B,C
key: B
key_len: 387
ref: NULL
rows: 2555192
Extra: Using where; Using temporary; Using filesort
1 row in set (0.00 sec)

LIMIT 68
id: 1
select_type: SIMPLE
table: a
type: ref
possible_keys: A,B,C
key: A
key_len: 3
ref: const
rows: 67586
Extra: Using where; Using temporary; Using filesort
1 row in set (0.00 sec)

As you can see, the two execution plans differ: they use different indexes.

MySQL Tips: in EXPLAIN output, key is the index actually chosen, and rows is the estimated number of rows that must be scanned using that index.

Index A on the table is defined as (f3, f4, f1, f2, f5); index B is defined as (f1, f2, f3).

Confirmation

Although rows is an estimate, it is the basis for choosing an index. Since the LIMIT 68 plan reaches rows = 67586, that option must also have been available to the optimizer for the first statement — so why did it not select index A?
Let's verify this conclusion.

MySQL Tips: MySQL provides the force index syntax to force the optimizer to use a specified index.


EXPLAIN SELECT f1, SUM(f2) CNT FROM T FORCE INDEX(A) WHERE f1 IS NOT NULL AND f3 = '2014-05-12' GROUP BY f1 ORDER BY CNT DESC LIMIT 67\G

id: 1
select_type: SIMPLE
table: a
type: ref
possible_keys: A
key: A
key_len: 3
ref: const
rows: 67586
Extra: Using where; Using temporary; Using filesort
1 row in set (0.00 sec)

By the way, since we specified force index, the optimizer considers no other indexes, so only A appears in possible_keys. Note rows: 67586 — using index A under LIMIT 67 also reduces the number of rows scanned.

MySQL Tips: the MySQL optimizer computes a query cost for each index in possible_keys and chooses the least costly query plan.

At this point we can reasonably guess that this is a bug in the MySQL implementation: the proper index was not selected, resulting in an obviously wrong execution plan.

MySQL Tips: the MySQL optimizer relies on table statistics, which are estimates and may therefore lead to a non-optimal execution plan.

It is important to note, however, that the Tip above describes an objective (and acceptable) limitation, whereas this case is an exception: the optimizer actually had the data (the rows values) needed to make the right choice, yet still chose wrongly.

Cause analysis

The MySQL optimizer chooses an index based on an estimate of the query cost, and that estimate is driven essentially by the estimated number of rows to be scanned.
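As a toy illustration of this idea (not MySQL's actual cost model — the function is hypothetical, the row counts come from the two EXPLAIN outputs above):

```python
# Toy sketch of cost-based index choice: compare each candidate index's
# estimated rows-to-scan and pick the cheapest plan.
def pick_index(row_estimates):
    """row_estimates: dict mapping index name -> estimated rows scanned."""
    return min(row_estimates, key=row_estimates.get)

# rows estimates from the two EXPLAIN outputs in this article
estimates = {"A": 67_586, "B": 2_555_192}
print(pick_index(estimates))  # prints "A": the cheaper plan by row count
```

With correct estimates, A should always win here — which is exactly why the observed choice of B is suspicious.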

MySQL Tips: in the 5.1 and 5.5 versions currently in mainstream use within our team, MySQL can only use the prefix of a composite index for this estimate.

Therefore, with index A only field f3 can be used, and with index B only field f1. rows is then the estimated number of rows that must be scanned between the upper and lower bounds located through the index.

The statement contains both GROUP BY and ORDER BY, hence the Using temporary; Using filesort in the execution plans.
In the normal flow, the query cost using index A is calculated first.

The query cost of every other index in possible_keys is then calculated in turn. Since the query requires sorting, after a tentative plan is obtained the optimizer also checks whether a cheaper ordering exists (test_if_cheaper_ordering).
This check, much like the earlier steps, relies on the estimated number of scanned rows to compute cost.

In the implementation of this logic there is a bug: prefix indexes are not taken into account when estimating the selectivity of the candidate index.

For example, assume the table has 500k rows and index B is (f1, f2, f3). To compute the index's selectivity, we must go by the prefix that can actually be used. If f1 has 1,000 distinct values, the average number of records per key value is 500. If (f1, f2) has 10,000 distinct values, the average number of records per combined key is 50; and if (f1, f2, f3) has 500k distinct values, the average number of records per combined key is 1.
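The arithmetic above can be sketched directly (the 500k row count and the distinct-value counts are the text's illustrative figures, not real statistics):

```python
# Average records per key value for each prefix of index B (f1, f2, f3),
# using the illustrative figures from the text: a 500k-row table.
TABLE_ROWS = 500_000

def avg_records_per_key(table_rows, distinct_values):
    # the fewer rows per key value, the more selective that index prefix
    return table_rows / distinct_values

print(avg_records_per_key(TABLE_ROWS, 1_000))    # f1 only      -> 500.0
print(avg_records_per_key(TABLE_ROWS, 10_000))   # (f1, f2)     -> 50.0
print(avg_records_per_key(TABLE_ROWS, 500_000))  # (f1, f2, f3) -> 1.0
```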

MySQL Tips: the fewer records per key value, the more efficient queries using that index are — and the larger the Cardinality value shown in the output of SHOW INDEX FROM tbl.

In this case, only the f1 prefix of index B can be used, but when calculating the average rows per key value the code used the full (f1, f2, f3) key. This makes the selectivity estimate for index B too optimistic, yielding a smaller cost — and hence the wrong choice.
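A minimal sketch of the mis-estimate, using the same illustrative figures as above (not real statistics): the buggy path divides by the full key's distinct count instead of the usable prefix's.

```python
# Bug sketch: selectivity for index B should use only the usable prefix
# (f1), but the code divided by the full (f1, f2, f3) key's distinct count.
TABLE_ROWS = 500_000
DISTINCT_F1 = 1_000          # distinct values of the usable prefix f1
DISTINCT_FULL_KEY = 500_000  # distinct values of the full (f1, f2, f3) key

rows_per_key_correct = TABLE_ROWS / DISTINCT_F1        # 500.0 rows per key
rows_per_key_buggy = TABLE_ROWS / DISTINCT_FULL_KEY    # 1.0 row per key

# The buggy figure makes index B look ~500x more selective than it is,
# so its estimated cost comes out far too small and B is wrongly chosen.
print(rows_per_key_correct, rows_per_key_buggy)
```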

Back to the question itself

1. Why is the index chosen correctly when the LIMIT value is large?
When computing B's query cost, the number of rows the query must return (limit_rows) also enters the product. With a large LIMIT, B's computed cost grows until it exceeds A's, at which point the optimizer finally chooses A.
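A hypothetical cost comparison showing the flip: COST_A comes from the EXPLAIN rows value above, while B's per-result-row cost is an invented constant chosen purely to put the crossover at LIMIT 67/68, mirroring the observed boundary.

```python
# Toy model of why a larger LIMIT flips the plan: B's (underestimated)
# cost grows with the number of result rows, while A's is roughly flat.
COST_A = 67_586                         # rows scanned via index A (from EXPLAIN)
UNDERESTIMATED_COST_PER_ROW_B = 1_000   # hypothetical, tuned for illustration

def chosen_index(limit_rows):
    cost_b = UNDERESTIMATED_COST_PER_ROW_B * limit_rows
    return "A" if cost_b > COST_A else "B"

print(chosen_index(67))  # prints "B": B still looks cheaper
print(chosen_index(68))  # prints "A": B's cost now exceeds A's
```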

2. The table has only 500k rows; why does the LIMIT value make such a big difference?
It has to do with the statement itself. Because of the GROUP BY, each extra row of LIMIT actually requires scanning roughly N more rows, where N = (total rows in the table) / (number of distinct f1 values in the table).
In other words, this statement magnifies the bug.

The solution

Once the analysis is clear, the fix is relatively simple: modify the code so that during test_if_cheaper_ordering, the selectivity calculation uses the discrimination of field f1 alone.

