Preliminary study on query efficiency calculation of Postgresql


**the **

One important aspect of relational databases is query speed. The quality of query speed directly affects the quality of a system.

Query speed generally requires a peek into the execution process through query planning.

The query path selects the least expensive path to execute. And how did we figure out that cost?

focuses on the parameters and the table

parameter: is from the postgresql.conf file,

can be viewed through show

seq_page_cost = 1.0     # measured on an arbitrary scale
random_page_cost = 4.0     # same scale as above
cpu_tuple_cost = 0.01     # same scale as above
cpu_index_tuple_cost = 0.005   # same scale as above
cpu_operator_cost = 0.0025    # same scale as above
parallel_tuple_cost = 0.1    # same scale as above
parallel_setup_cost = 1000.0   # same scale as above

Table (view): pg_class (focus on relpages, reltuples), pg_stats

analyzes the costing process for simple queries

Set up the simulation data and insert 100,000 pieces of data into a table

create table test(id int, info text);
insert into test(id, info) select i, md5(i::text) from generate_series(1, 100000) t(i);

no index case

Analyze the full table query costing process

postgres=# analyze test;  # In case there is no analysis
postgres=# explain select * from test;
       QUERY PLAN
-------------------------------------------------------------
 Seq Scan on test (cost=0.00..1834.00 rows=100000 width=37)

1. Query pg_class table to see page number and row number of test table

postgres=# select t.relpages, t.reltuples from pg_class t where t.relname = 'test';
 relpages | reltuples
----------+-----------
  834 | 100000

How was the cost of 1834.00 calculated?

2. In this process, 834 page were actually scanned sequentially, and 100,000 lines of were emitted from the node

3. View the configuration parameter

seq_page_cost = 1.0
cpu_tuple_cost = 0.01

4. The result is

postgres=# select 834 * 1.0 + 100000 * 0.01;
 ?column?
----------
 1834.00

5. The query cost is 1834.00. It is consistent with the above query plan.

Full list of cost - adding conditions for the costing process

postgres=# explain select * from test where id = 100;
      QUERY PLAN
--------------------------------------------------------
 Seq Scan on test (cost=0.00..2084.00 rows=1 width=37)
 Filter: (id = 100)

How is the cost 2084.00 calculated?

1. Query pg_class table, pages,tuples as in the above example

2. This process is the order test table, which emits 100,000 rows, and then filters 100,000 rows through cloud storage

3. View the cost of the filter operation line

cpu_operator_cost = 0.0025

4. The result is

postgres=# select 834 * 1.0 + 100000 * 0.01 + 100000 * 0.0025;
 ?column?
-----------
 2084.0000

Join index case

```
create index on test(id);
```

compares the following four situations:

Index Only Scan

create table test(id int, info text);
insert into test(id, info) select i, md5(i::text) from generate_series(1, 100000) t(i);

0

Index Scan

create table test(id int, info text);
insert into test(id, info) select i, md5(i::text) from generate_series(1, 100000) t(i);

1

Index Scan

create table test(id int, info text);
insert into test(id, info) select i, md5(i::text) from generate_series(1, 100000) t(i);

2

Insert the data into

out of order

truncate table test;
insert into test(id, info) select i, md5(i::text) from generate_series(1, 1000000) t(i) order by random();
create table test(id int, info text);
insert into test(id, info) select i, md5(i::text) from generate_series(1, 100000) t(i);

4

conclusion

When there is an index, the cost is greatly reduced. The execution plan has a lot to do with the distribution of data. The analysis of the index is relatively complex, you can refer to the official source code implementation. will be added later

summary