MySQL subquery and join query optimization

  • 2020-05-15 02:25:10
  • OfStack

Poorly optimized subqueries
I mentioned above that, special cases aside, a join is usually more efficient than an equivalent subquery. Even though the two forms express the same thing, and even though in principle you tell the server what you want and let it decide how to get it, sometimes you have to tell it how. Otherwise the optimizer may do something stupid, as I discovered recently. The tables involved form a three-level hierarchy: category, subcategory, and item. There are thousands of records in the category table, hundreds of thousands in the subcategory table, and millions in the item table. You can ignore the category table; I mention it only for background, and none of the queries below refer to it. Here are the statements to create the tables:

[sql]

create table subcategory (  
    id int not null primary key,  
    category int not null,  
    index(category)  
) engine=InnoDB;  

create table item(  
    id int not null auto_increment primary key,  
    subcategory int not null,  
    index(subcategory)  
) engine=InnoDB;  
 

I filled the tables with some sample data:

[sql]

insert into subcategory(id, category)  
    select i, i/100 from number  
    where i <= 300000;  

insert into item(subcategory)  
    select id  
    from (  
        select id, rand() * 20 as num_rows from subcategory  
    ) as x  
        cross join number  
    where i <= num_rows;  

create temporary table t as  
    select subcategory from item  
    group by subcategory  
    having count(*) = 19  
    limit 100;  

insert into item (subcategory)  
    select subcategory  
    from t  
        cross join number  
    where i < 2000;  
 

Again, these statements take a while to complete and are not suitable for running in a production environment. The idea is to insert a random number of rows into item so that each subcategory ends up with between 1 and 2,018 items. This is not the real data, but the effect is the same.
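Note that the inserts above rely on a helper table named number that the article never defines; from the queries it evidently holds consecutive integers in a column i. A minimal sketch of one way to build it (the table and column names come from the queries above; the population method is my own assumption):

```sql
-- Helper table of consecutive integers, assumed by the inserts above.
create table number (
    i int not null primary key
) engine=InnoDB;

-- Populate 1..300000 with a simple stored-procedure loop.
-- (Slow row-by-row; wrapping it in a transaction speeds it up a lot.)
delimiter //
create procedure fill_number()
begin
    declare n int default 1;
    start transaction;
    while n <= 300000 do
        insert into number(i) values (n);
        set n = n + 1;
    end while;
    commit;
end//
delimiter ;

call fill_number();
```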

I want to find all subcategories in a given category that contain more than 2,000 items. First, I find one subcategory with more than 2,000 items and use its category in the queries that follow. Here are the queries:

[sql]

select c.id  
from subcategory as c  
    inner join item as i on i.subcategory = c.id  
group by c.id  
having count(*) > 2000;  

-- choose one of the results, then  
select * from subcategory where id = ????  
-- result: category = 14  
 

I got a suitable value, 14, which I will use below. This query finds all subcategories in category 14 with more than 2,000 items:

[sql]

select c.id  
from subcategory as c  
    inner join item as i on i.subcategory = c.id  
where c.category = 14  
group by c.id  
having count(*) > 2000;  
 

Against my sample data, the query returns 10 rows and completes in a little over a tenth of a second. EXPLAIN shows it makes good use of the indexes; given the size of the data, that is pretty good. The query plan is to traverse the index and count the matching records. So far, so good.

Now suppose I want all the columns from subcategory. I could wrap the query above in a JOIN, or use SELECT MAX() tricks (since each group's value is unique), but surely I could also just write it like this, right?

[sql]

select * from subcategory  
where id in (  
    select c.id  
    from subcategory as c  
        inner join item as i on i.subcategory = c.id  
    where c.category = 14  
    group by c.id  
    having count(*) > 2000  
);  
 

Running this query would take until the end of time. I don't know exactly how long, because I was not willing to let it finish. Just from reading the statement, you might think it would: a) compute the inner query and find the 10 values, then b) look up those 10 records via the primary key, which would be very fast. No. This is the actual query plan:

[sql]

*************************** 1. row ***************************  
           id: 1  
  select_type: PRIMARY  
        table: subcategory  
         type: ALL  
possible_keys: NULL  
          key: NULL  
      key_len: NULL  
          ref: NULL  
         rows: 300783  
        Extra: Using where  
*************************** 2. row ***************************  
           id: 2  
  select_type: DEPENDENT SUBQUERY  
        table: c  
         type: ref  
possible_keys: PRIMARY,category  
          key: category  
      key_len: 4  
          ref: const  
         rows: 100  
        Extra: Using where; Using index; Using temporary; Using filesort  
*************************** 3. row ***************************  
           id: 2  
  select_type: DEPENDENT SUBQUERY  
        table: i  
         type: ref  
possible_keys: subcategory  
          key: subcategory  
      key_len: 4  
          ref: c.id  
         rows: 28  
        Extra: Using index  
 

If you are not familiar with reading MySQL's EXPLAIN output: MySQL plans to execute this query from the outside in, not from the inside out. I'll go through each part of the plan in turn.

The outer query simply becomes SELECT * FROM subcategory. Although the inner query has a restriction on subcategory (WHERE category = 14), for some reason MySQL does not apply it to the outer query. I don't know why. All I know is that it scans the whole table (that's what type: ALL means) without using any index. That is a scan over a table of more than 300,000 rows.

The inner query is executed once for each row of the outer query, because the subquery has been "optimized" into a dependent subquery that references the outer query. Viewed this way, the plan is a nested loop: for every row the outer query examines, the inner query runs once. Here is the query as the optimizer rewrote it:

[sql]

select * from subcategory as s  
where <in_optimizer>(  
   s.id,<exists>(  
   select c.id  
   from subcategory as c  
      join item as i  
   where ((i.subcategory = c.id) and (c.category = 14))  
   group by c.id  
   having ((count(0) > 2000)  
      and (<cache>(s.id) = <ref_null_helper>(c.id))))  
)  
 

You can see the rewritten query by running SHOW WARNINGS immediately after EXPLAIN EXTENDED. Notice the reference to the outer column in the HAVING clause.
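To reproduce this yourself, run something like the following. (EXPLAIN EXTENDED is deprecated in MySQL 5.7 and later, where a plain EXPLAIN followed by SHOW WARNINGS shows the same rewritten query.)

```sql
explain extended
select * from subcategory
where id in (
    select c.id
    from subcategory as c
        inner join item as i on i.subcategory = c.id
    where c.category = 14
    group by c.id
    having count(*) > 2000
);

-- Run immediately afterward, in the same session:
show warnings;
```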

I didn't use this example to criticize MySQL's optimizer. It is well known that MySQL handles some subqueries poorly, and the problem has been widely reported. But I do want to point out how important it is for developers to check their queries and make sure they are not badly optimized. In most cases it is safest to avoid subqueries unless they are truly necessary, especially WHERE ... IN() and WHERE ... NOT IN statements.

My own rule of thumb is "when in doubt, EXPLAIN". And if the query touches a large table, I am automatically in doubt.

How to force the inner query to execute first
The query in the previous section performs disastrously only because MySQL executes it from the outside in, as a correlated subquery, instead of from the inside out as an uncorrelated one. There is a way to make MySQL execute the inner query first and materialize it as a temporary table, avoiding the enormous performance cost.

MySQL materializes subqueries in the FROM clause (somewhat misleadingly called derived tables) as temporary tables. That is, MySQL executes the inner query first, stores the result in a temporary table, and then uses that table in the rest of the query. This is how I expected the query to execute when I wrote it. The modified query looks like this:
[sql]

select * from subcategory  
where id in (  
    select id from (  
        select c.id  
        from subcategory as c  
            inner join item as i on i.subcategory = c.id  
        where c.category = 14  
        group by c.id  
        having count(*) > 2000  
    ) as x  
);  
 

All I did was wrap the original subquery in one more subquery in the FROM clause. MySQL treats the innermost query as an independent derived table and executes it first; all that remains is the outer IN() check against a temporary table holding only a few rows, so it is much faster. Admittedly, this is a rather dumb trick; I would usually rather rewrite the query as a JOIN. If you do use it, comment it well, lest someone later clean it up as "unnecessary" code.
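For reference, a sketch of the JOIN rewrite just mentioned, joining against the derived-table result directly instead of testing membership with IN() (equivalent here because the grouped ids are unique):

```sql
select s.*
from subcategory as s
    inner join (
        -- The inner query runs first and is materialized once.
        select c.id
        from subcategory as c
            inner join item as i on i.subcategory = c.id
        where c.category = 14
        group by c.id
        having count(*) > 2000
    ) as x on x.id = s.id;
```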

There are still cases where this trick is genuinely useful, for example when MySQL raises an error because the subquery selects from the same table the outer statement is modifying. Unfortunately, a temporary table can be referenced only once in a query, so this approach does not help when the derived result is needed more than once.

Source: http://blog.csdn.net/afeiqiang/article/details/8620038
