Hash Connection for MySQL 8.0 New Feature (Hash Join)

2021-12-13 10:04:12
OfStack

The MySQL development team officially released MySQL 8.0. 18 GA version on October 14, 2019, bringing a number of new features and enhancements. The most striking thing is that multi-table join query supports hash join mode. Let's take a look at the official description first:

MySQL implements the hash join mode for interjoin queries. For example, starting with MySQL 8.0. 18, the following query can be used for join queries using hash join:


SELECT * 
  FROM t1 
  JOIN t2 
    ON t1.c1=t2.c1;

Hash join does not require index support. In most cases, hash join is more efficient than the previous Block Nested-Loop algorithm for equivalent join without index. Create 3 test tables using the following statement:


CREATE TABLE t1 (c1 INT, c2 INT);
CREATE TABLE t2 (c1 INT, c2 INT);
CREATE TABLE t3 (c1 INT, c2 INT);

Use the command EXPLAIN FORMAT=TREE to see hash join in the execution plan, for example:


mysql> EXPLAIN FORMAT=TREE
  -> SELECT * 
  ->   FROM t1 
  ->   JOIN t2 
  ->     ON t1.c1=t2.c1\G
*************************** 1. row ***************************
EXPLAIN: -> Inner hash join (t2.c1 = t1.c1) (cost=0.70 rows=1)
  -> Table scan on t2 (cost=0.35 rows=1)
  -> Hash
    -> Table scan on t1 (cost=0.35 rows=1)

You must use the FORMAT=TREE option of the EXPLAIN command to see hash join in the node. In addition, the EXPLAIN ANALYZE command can also display the usage information of hash join. This is also a new function in this version.

Queries that use equivalent joins between multiple tables will also undergo this optimization. For example, the following query:


SELECT * 
  FROM t1
  JOIN t2 
    ON (t1.c1 = t2.c1 AND t1.c2 < t2.c2)
  JOIN t3 
    ON (t2.c1 = t3.c1);

In the above example, any other non-equivalent join condition will be used as a filter after the join operation. You can view it from the output of the command EXPLAIN FORMAT=TREE:


mysql> EXPLAIN FORMAT=TREE
  -> SELECT * 
  ->   FROM t1
  ->   JOIN t2 
  ->     ON (t1.c1 = t2.c1 AND t1.c2 < t2.c2)
  ->   JOIN t3 
  ->     ON (t2.c1 = t3.c1)\G
*************************** 1. row ***************************
EXPLAIN: -> Inner hash join (t3.c1 = t1.c1) (cost=1.05 rows=1)
  -> Table scan on t3 (cost=0.35 rows=1)
  -> Hash
    -> Filter: (t1.c2 < t2.c2) (cost=0.70 rows=1)
      -> Inner hash join (t2.c1 = t1.c1) (cost=0.70 rows=1)
        -> Table scan on t2 (cost=0.35 rows=1)
        -> Hash
          -> Table scan on t1 (cost=0.35 rows=1)

As you can also see from the above output, a query that contains multiple equivalent join conditions can (will) use multiple hash join joins.

However, the hash join connection will not be used if an equivalent connection condition is not used in any connection statement (ON). For example:


mysql> EXPLAIN FORMAT=TREE
  ->   SELECT * 
  ->     FROM t1
  ->     JOIN t2 
  ->       ON (t1.c1 = t2.c1)
  ->     JOIN t3 
  ->       ON (t2.c1 < t3.c1)\G
*************************** 1. row ***************************
EXPLAIN: <not executable by iterator executor>

At this point, the slower performance will be adopted block nested loop Join algorithm. This is the same as when there was no index in MySQL before 8.0. 18:


mysql> EXPLAIN
  ->   SELECT * 
  ->     FROM t1
  ->     JOIN t2 
  ->       ON (t1.c1 = t2.c1)
  ->     JOIN t3 
  ->       ON (t2.c1 < t3.c1)\G       
*************************** 1. row ***************************
      id: 1
 select_type: SIMPLE
    table: t1
  partitions: NULL
     type: ALL
possible_keys: NULL
     key: NULL
   key_len: NULL
     ref: NULL
     rows: 1
   filtered: 100.00
    Extra: NULL
*************************** 2. row ***************************
      id: 1
 select_type: SIMPLE
    table: t2
  partitions: NULL
     type: ALL
possible_keys: NULL
     key: NULL
   key_len: NULL
     ref: NULL
     rows: 1
   filtered: 100.00
    Extra: Using where; Using join buffer (Block Nested Loop)
*************************** 3. row ***************************
      id: 1
 select_type: SIMPLE
    table: t3
  partitions: NULL
     type: ALL
possible_keys: NULL
     key: NULL
   key_len: NULL
     ref: NULL
     rows: 1
   filtered: 100.00
    Extra: Using where; Using join buffer (Block Nested Loop)

Hash join joins also apply to Cartesian products (Cartesian product) without specifying query criteria, such as:


mysql> EXPLAIN FORMAT=TREE
  -> SELECT *
  ->   FROM t1
  ->   JOIN t2
  ->   WHERE t1.c2 > 50\G
*************************** 1. row ***************************
EXPLAIN: -> Inner hash join (cost=0.70 rows=1)
  -> Table scan on t2 (cost=0.35 rows=1)
  -> Hash
    -> Filter: (t1.c2 > 50) (cost=0.35 rows=1)
      -> Table scan on t1 (cost=0.35 rows=1)

When configured by default, MySQL uses hash join whenever possible. It also provides two methods to control whether hash join is used:

Set server system variables at the global or session level optimizer_switch In hash_join=on Or hash_join=off Option. Default to hash_join=on .

Specify the optimizer prompt HASH_JOIN or NO_HASH_JOIN at the statement level for a particular connection.

You can use system variables join_buffer_size Control the amount of memory allowed for hash join; hash join does not use more memory than this variable sets. If the memory required by hash join exceeds this threshold, MySQL will perform the operation on disk. It should be noted that if hash join cannot be completed in memory and the number of open files exceeds the system variable open_files_limit The connection operation may fail. To solve this problem, you can use one of the following methods:

Increase join_buffer_size Ensure that the value of the hash join It can be done in memory.

Add ope n_files_limit The value of.

Next, they compare 1 hash join And block nested loop First generate 1,000,000 records for t1, t2, and t3, respectively:


set join_buffer_size=2097152000;
SET @@cte_max_recursion_depth = 99999999;
INSERT INTO t1
-- INSERT INTO t2
-- INSERT INTO t3
WITH RECURSIVE t AS (
 SELECT 1 AS c1, 1 AS c2
 UNION ALL
 SELECT t.c1 + 1, t.c1 * 2
  FROM t
  WHERE t.c1 < 1000000
)
SELECT *
 FROM t;

hash join without index:


mysql> EXPLAIN ANALYZE
  -> SELECT COUNT(*)
  ->  FROM t1
  ->  JOIN t2 
  ->   ON (t1.c1 = t2.c1)
  ->  JOIN t3 
  ->   ON (t2.c1 = t3.c1)\G
*************************** 1. row ***************************
EXPLAIN: -> Aggregate: count(0) (actual time=22993.098..22993.099 rows=1 loops=1)
  -> Inner hash join (t3.c1 = t1.c1) (cost=9952535443663536.00 rows=9952435908880402) (actual time=14489.176..21737.032 rows=1000000 loops=1)
    -> Table scan on t3 (cost=0.00 rows=998412) (actual time=0.103..3973.892 rows=1000000 loops=1)
    -> Hash
      -> Inner hash join (t2.c1 = t1.c1) (cost=99682753413.67 rows=99682653660) (actual time=5663.592..12236.984 rows=1000000 loops=1)
        -> Table scan on t2 (cost=0.01 rows=998412) (actual time=0.067..3364.105 rows=1000000 loops=1)
        -> Hash
          -> Table scan on t1 (cost=100539.40 rows=998412) (actual time=0.133..3395.799 rows=1000000 loops=1)

1 row in set (23.22 sec)

mysql> SELECT COUNT(*)
  ->  FROM t1
  ->  JOIN t2 
  ->   ON (t1.c1 = t2.c1)
  ->  JOIN t3 
  ->   ON (t2.c1 = t3.c1);
+----------+
| COUNT(*) |
+----------+
| 1000000 |
+----------+
1 row in set (12.98 sec)

The actual operation took 12.98 seconds. If you use block nested loop at this time:


CREATE TABLE t1 (c1 INT, c2 INT);
CREATE TABLE t2 (c1 INT, c2 INT);
CREATE TABLE t3 (c1 INT, c2 INT);

EXPLAIN shows that hash join cannot be used. The query ran for several 10 minutes and there was no result, and one CPU utilization rate reached 100%; Because 1 is executing a nested loop (1000000 to the third power).

Look at the block nested loop method when there is an index, and add the index:


CREATE TABLE t1 (c1 INT, c2 INT);
CREATE TABLE t2 (c1 INT, c2 INT);
CREATE TABLE t3 (c1 INT, c2 INT);

View the execution plan and run the same query statement:


CREATE TABLE t1 (c1 INT, c2 INT);
CREATE TABLE t2 (c1 INT, c2 INT);
CREATE TABLE t3 (c1 INT, c2 INT);

The actual operation took 19.56 seconds. So the test results in our scenario are as follows:

Hash Join（无索引）	Block Nested Loop（无索引）	Block Nested Loop（有索引）
12.98 s	未返回	19.56 s

Add another Oracle 12c without index hash join Result: 1.282 s.

Add another hash join without index in PostgreSQL 11.5 Result: 6.234 s.

Add another hash join without index in SQL 2017 Result: 5.207 s.

Summarize