A preliminary study of query cost calculation in PostgreSQL

  • 2020-05-06 11:55:53
  • OfStack


Query speed is one of the most important aspects of a relational database: it directly affects the quality of the system built on top of it.

To understand query speed, you generally need to look at the execution process through the query plan.

The planner chooses the least expensive path to execute. So how is that cost actually calculated?

The cost model focuses on two inputs: configuration parameters and statistics tables.

Parameters: set in the postgresql.conf file and viewable with SHOW (e.g. SHOW seq_page_cost;):

seq_page_cost = 1.0     # measured on an arbitrary scale
random_page_cost = 4.0     # same scale as above
cpu_tuple_cost = 0.01     # same scale as above
cpu_index_tuple_cost = 0.005   # same scale as above
cpu_operator_cost = 0.0025    # same scale as above
parallel_tuple_cost = 0.1    # same scale as above
parallel_setup_cost = 1000.0   # same scale as above 

Tables (views): pg_class (focus on relpages and reltuples) and pg_stats

Analyzing the costing process for a simple query

Set up test data by inserting 100,000 rows into a table:


create table test(id int, info text);
insert into test(id, info) select i, md5(i::text) from generate_series(1, 100000) t(i); 

no index case

Analyze the full table query costing process


postgres=# analyze test;  -- run first in case statistics have not been gathered
postgres=# explain select * from test;
       QUERY PLAN       
-------------------------------------------------------------
 Seq Scan on test (cost=0.00..1834.00 rows=100000 width=37) 

1. Query the pg_class table for the page count and row count
of the test table


postgres=# select t.relpages, t.reltuples from pg_class t where t.relname = 'test';
 relpages | reltuples 
----------+-----------
      834 |    100000 
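Where do the 834 pages come from? A rough back-of-the-envelope sketch can reproduce the number. The byte sizes below are assumptions based on PostgreSQL's defaults on a 64-bit platform (8 kB pages, MAXALIGN of 8), not values stated in the article:

```python
import math

# Assumed on-disk sizes (PostgreSQL defaults on a 64-bit platform):
PAGE_SIZE = 8192     # default block size
PAGE_HEADER = 24     # page header
LINE_POINTER = 4     # per-row item pointer in the page
TUPLE_HEADER = 24    # 23-byte heap tuple header, MAXALIGN'ed to 24
INT_SIZE = 4         # the id int column
TEXT_SIZE = 1 + 32   # md5 hex string: 1-byte short varlena header + 32 chars

# One row: tuple header + columns, rounded up to 8-byte alignment,
# plus the line pointer that addresses it within the page
tuple_size = TUPLE_HEADER + INT_SIZE + TEXT_SIZE
row_size = LINE_POINTER + math.ceil(tuple_size / 8) * 8

rows_per_page = (PAGE_SIZE - PAGE_HEADER) // row_size
pages = math.ceil(100000 / rows_per_page)
print(rows_per_page, pages)  # → 120 834
```

This matches the relpages value that ANALYZE recorded, which is reassuring but not exact in general: fill factor, dead tuples, and TOAST can all change the page count.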

How was the cost of 1834.00 calculated?

2. In this scan, 834 pages are read sequentially and 100,000 rows
are emitted from the node

3. View the configuration parameters


seq_page_cost = 1.0 
cpu_tuple_cost = 0.01 

4. The result is


postgres=# select 834 * 1.0 + 100000 * 0.01;
 ?column? 
----------
 1834.00 

5. The query cost is 1834.00. It is consistent with the above query plan.
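Steps 1–4 can be condensed into a small function. The formula is exactly the one used above; the function name seq_scan_cost is made up here, while the parameter names and defaults follow postgresql.conf:

```python
def seq_scan_cost(relpages, reltuples,
                  seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Total cost of a plain Seq Scan: pages read sequentially
    plus a per-row CPU cost for emitting each tuple."""
    return relpages * seq_page_cost + reltuples * cpu_tuple_cost

print(seq_scan_cost(834, 100000))  # → 1834.0
```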

Full table scan with a filter condition - the costing process


postgres=# explain select * from test where id = 100;
      QUERY PLAN      
--------------------------------------------------------
 Seq Scan on test (cost=0.00..2084.00 rows=1 width=37)
 Filter: (id = 100) 

How is the cost 2084.00 calculated?

1. Query the pg_class table for pages and tuples, as in the example above

2. This scan reads the test table sequentially, emitting 100,000 rows, and then runs all 100,000 rows
through the filter

3. Look up the per-row cost of the filter operator


cpu_operator_cost = 0.0025 

4. The result is


postgres=# select 834 * 1.0 + 100000 * 0.01 + 100000 * 0.0025;
 ?column? 
-----------
 2084.0000
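The filtered scan can be sketched the same way: every scanned row also pays one cpu_operator_cost for evaluating the filter expression. The function name filtered_seq_scan_cost is made up here; the parameter names and defaults follow postgresql.conf:

```python
def filtered_seq_scan_cost(relpages, reltuples, ops_per_row=1,
                           seq_page_cost=1.0, cpu_tuple_cost=0.01,
                           cpu_operator_cost=0.0025):
    """Seq Scan with a filter: every row is read AND run through the
    filter, so the operator cost applies to all rows, not just matches."""
    return (relpages * seq_page_cost
            + reltuples * cpu_tuple_cost
            + reltuples * ops_per_row * cpu_operator_cost)

print(filtered_seq_scan_cost(834, 100000))  # → 2084.0
```

Note that the filter makes the plan more expensive, not cheaper: the rows=1 estimate only reduces what the node emits, while the cost of reading and testing all 100,000 rows remains.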

Adding an index


create index on test(id);

Compare the following situations:

Index Only Scan

Index Scan

Index Scan with the data inserted out of order:


truncate table test;
insert into test(id, info) select i, md5(i::text) from generate_series(1, 1000000) t(i) order by random();

conclusion

When there is an index, the cost is greatly reduced, and the execution plan depends heavily on the distribution of the data. Index cost estimation is relatively complex; refer to the official source implementation (src/backend/optimizer/path/costsize.c). More details will be added later.
