The basic principle of distinct statement in MySQL and its comparison with group by

  • 2020-12-16 06:09:39
  • OfStack

DISTINCT is actually very similar to the implementation of GROUP BY, except that only one record is pulled from each group after GROUP BY. So, the implementation of DISTINCT is pretty much the same as the implementation of GROUP BY, not much different. This can also be done with either a loose index scan or a compact index scan, of course, MySQL can only be done with temporary tables when DISTINCT cannot be done with indexes alone. However, unlike GROUP BY, DISTINCT does not need to be sorted. That is, MySQL uses temporary tables to "cache" data once if Query, which only operates on DISTINCT, cannot do so using indexes alone, but does not operate on the data in the temporary table. Of course, if we use and group GROUP BY when we do DISTINCT, and use aggregate function operations like MAX, we cannot avoid filesort.

Let's demonstrate the implementation of DISTINCT below with a few simple Query examples.

1. First, let's look at the operation of DISTINCT through loose index scanning:


sky@localhost : example 11:03:41> EXPLAIN SELECT DISTINCT group_id 
  -> FROM group_messageG

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)

We can clearly see that the Extra message in the execution plan is "Using index for ES38en-ES39en". What does this mean? Why didn't the execution plan tell me that GROUP BY was indexed when I did GROUP BY? In fact, this is related to the implementation principle of DISTINCT. In the implementation process of DISTINCT, it is also necessary to group, and then take out one piece of data from each group and return it to the client. The Extra information here tells us that MySQL uses a loose index scan to complete the operation. Of course, if MySQL Query Optimizer can do more humanized 1 point change the information here to "Using index for distinct" it will be better and easier for people to understand, hehe.

2. Let's take another look at an example through compact index scanning:


sky@localhost : example 11:03:53> EXPLAIN SELECT DISTINCT user_id 
  -> FROM group_message
  -> WHERE group_id = 2G

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: ref
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: const
     rows: 4
    Extra: Using WHERE; Using index
1 row in set (0.00 sec)

The display and implementation of GROUP BY via compact index scanning are also exactly the same. In fact, during the implementation of Query, MySQL will ask the storage engine to scan all the index keys of group_id = 2 to get all the user_id. Then, by taking advantage of the sorted feature of the index, one piece of information will be kept when every index key value of user_id is replaced, and the whole operation of DISTINCT can be completed when all the index keys of gruop_id = 2 are scanned.

3. Let's look at what happens when DISTINCT cannot be completed using indexes alone:


sky@localhost : example 11:04:40> EXPLAIN SELECT DISTINCT user_id 
  -> FROM group_message
  -> WHERE group_id > 1 AND group_id < 10G

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 32
    Extra: Using WHERE; Using index; Using temporary
1 row in set (0.00 sec)

When MySQL cannot rely solely on indexes to complete DISTINCT operations, temporary tables have to be used to do so. However, we can see that when MySQL uses temporary tables to complete DISTINCT, there is one difference from dealing with GROUP BY, which is the absence of filesort. In fact, in MySQL's grouping algorithm, no sorting is required to complete the grouping operation. I have mentioned this point in the GROUP BY optimization tip above. In fact, MySQL is actually doing the grouping without sorting and it's doing the DISTINCT operation, so it's missing the sorting operation.

4. Finally, try combining with GROUP BY:


sky@localhost : example 11:05:06> EXPLAIN SELECT DISTINCT max(user_id) 
  -> FROM group_message
  -> WHERE group_id > 1 AND group_id < 10
  -> GROUP BY group_idG

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: idx_gid_uid_gc
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 32
    Extra: Using WHERE; Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

Finally, let's look at this example with an aggregate function compared to GROUP BY 1. Compared to the third example above, you can see that there are more filesort sort operations because we are using the MAX function. To get the grouped MAX value, you cannot use the index to complete the operation, only by sorting.

mysql distinct or group by who is better
1. Preparation before the test


// To prepare 1 A test table  
mysql> CREATE TABLE `test_test` ( 
 ->  `id` int(11) NOT NULL auto_increment, 
 ->  `num` int(11) NOT NULL default '0', 
 ->  PRIMARY KEY (`id`) 
 -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;



Query OK, 0 rows affected (0.05 sec) 



*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
0

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
1



*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
2

Query OK, 1 row affected (5.66 sec) 



*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
4

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
5



*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
6

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
7



mysql> set profiling=1;      // open  

*************************** 1. row ***************************
      id: 1
 SELECT_type: SIMPLE
    table: group_message
     type: range
possible_keys: NULL
     key: idx_gid_uid_gc
   key_len: 4
     ref: NULL
     rows: 10
    Extra: Using index for group-by
1 row in set (0.00 sec)
9

2, test


// Do the 4 Set of test  
mysql> select distinct(num) from test_test; 
mysql> select num from test_test group by num; 
 
mysql> show profiles;  // View the results  

+----------+------------+-------------------------------------------+ 
| Query_ID | Duration  | Query                   | 
+----------+------------+-------------------------------------------+ 
|    1 | 0.07298225 | select distinct(num) from test_test    | 
|    2 | 0.07319975 | select num from test_test group by num  | 
|    3 | 0.07313525 | select num from test_test group by num  | 
|    4 | 0.07317725 | select distinct(num) from test_test    | 
|    5 | 0.07275200 | select distinct(num) from test_test    | 
|    6 | 0.07298600 | select num from test_test group by num  | 
|    7 | 0.07500700 | select num from test_test group by num  | 
|    8 | 0.07331325 | select distinct(num) from test_test    | 
|    9 | 0.57831575 | create index num_index on test_test (num) | // While I was here, I added the index  
|    10 | 0.00243550 | select distinct(num) from test_test    | 
|    11 | 0.00121975 | select num from test_test group by num  | 
|    12 | 0.00116550 | select distinct(num) from test_test    | 
|    13 | 0.00107650 | select num from test_test group by num  | 
+----------+------------+-------------------------------------------+ 
13 rows in set (0.00 sec) 

The above 1-8 is 4 sets of data, and it is not indexed, from which we can see that distinct is 1 dot better than group by
10-13 is two sets of data, after the index is added, from which we can see that group by is 1 dot better than distinct
1. Generally, for tables with large data volume, the associated fields will be indexed, and the retrieval time after indexing is only about 1/6 of that before.


Related articles: