Detailed Explanation of Basic Implementation Principle of MySQL DISTINCT

  • 2021-12-09 10:19:18
  • OfStack

Preface

DISTINCT is actually very similar to the implementation of the GROUP BY operation, except that only one record is fetched from each group after GROUP BY. Therefore, the implementation of DISTINCT is basically the same as that of GROUP and BY, without much difference. It can also be achieved by loose index scanning or compact index scanning. Of course, when DISTINCT cannot be completed by using indexes alone, MySQL can only be completed by temporary tables.

However, there is one difference between GROUP and BY in that DISTINCT does not need to be sorted. That is to say, if Query, which is only an DISTINCT operation, cannot complete the operation only by using the index, MySQL will use the temporary table to "cache" the data once, but will not perform filesort operation on the data in the temporary table.

Of course, if we also use GROUP BY and group with DISTINCT, and use aggregate function operations like MAX, we can't avoid filesort.

Let's show the implementation of DISTINCT under 1 with a few simple Query examples.

1. First, look at the operation of DISTINCT through loose index scan:


sky@localhost : example 11:03:41> EXPLAIN SELECT DISTINCT group_id 
  -> FROM group_messageG
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)

We can clearly see that the Extra message in the execution plan is "Using index for group-by". What does this mean? Why did the execution plan tell me that GROUP BY was done by index when I didn't do GROUP BY?

In fact, this is related to the implementation principle of DISTINCT. In the process of implementing DISTINCT, it is also necessary to group, and then take out one piece from each group of data and return it to the client. The Extra information here tells us that MySQL completes the whole operation by using loose index scanning.

Of course, it would be better and easier for people to understand if MySQL Query Optimizer could do one more humanization and replace the information here with "Using index for distinct", hehe.

2. Let's look at the example of scanning through a compact index:


sky@localhost : example 11:03:53> EXPLAIN SELECT DISTINCT user_id 
  -> FROM group_message
  -> WHERE group_id = 2G
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: ref
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: const
     rows: 4
    Extra: Using WHERE; Using index
1 row in set (0.00 sec)

The display here and the implementation of GROUP BY through compact index scanning are also completely 1-like. In fact, in the implementation of Query, MySQL will let the storage engine scan all the index keys of group_id = 2 to obtain all user_id, and then use the sorted characteristics of the index to keep one piece of information every time the index key value of user_id is changed, so that the whole DISTINCT operation can be completed when all the index keys of gruop_id = 2 are scanned.

3. Let's look at what happens when you can't complete DISTINCT using indexes alone:


sky@localhost : example 11:04:40> EXPLAIN SELECT DISTINCT user_id 
  -> FROM group_message
  -> WHERE group_id > 1 AND group_id < 10G
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 32
    Extra: Using WHERE; Using index; Using temporary
1 row in set (0.00 sec)

When MySQL cannot rely solely on indexes to complete DISTINCT operations, temporary tables have to be used to do the corresponding operations. However, we can see that when MySQL uses temporary tables to complete DISTINCT, there is one difference between GROUP and BY, that is, filesort is missing.

In fact, in the grouping algorithm of MySQL, it is not necessary to sort to complete the grouping operation, which I have already mentioned in the above optimization tips of GROUP BY. In fact, MySQL realizes grouping and finally completes DISTINCT operation without sorting, so the sorting operation of filesort is missing.

4. Finally, try to combine with GROUP BY:


sky@localhost : example 11:05:06> EXPLAIN SELECT DISTINCT max(user_id) 
  -> FROM group_message
  -> WHERE group_id > 1 AND group_id < 10
  -> GROUP BY group_idG
*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 32
    Extra: Using WHERE; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

Finally, let's look at this example with aggregate function from GROUP BY 1. Compared with the third example above, we can see that there are more filesort sorting operations, precisely because we use MAX function. To get the grouped MAX value, you can't use the index to complete the operation, you can only sort it.

Since the implementation of DISTINCT is basically similar to the implementation of GROUP and BY, this article will not draw a diagram to show the implementation process


Related articles: