PostgreSQL tutorial (part 10): performance improvement tips

  • 2020-05-06 11:53:59
  • OfStack

I. Use EXPLAIN:

      PostgreSQL generates a query plan for every query it receives, and choosing a plan that matches the query structure and the properties of the data is critical for performance. PostgreSQL includes a planner that tries to find the optimal plan, and the EXPLAIN command lets us inspect the plan the planner produces for any given query.
      The plan PostgreSQL generates is a tree made up of one or more plan nodes. The nodes at the bottom of the tree are table scan nodes, which return raw rows from a table. Different scan node types represent different table access methods, such as sequential scans, index scans, and bitmap index scans. If the query also requires joining, aggregation, sorting, or other operations on the raw rows, there will be additional nodes "on top of" the scan nodes, and since there is usually more than one way to perform these operations, different node types can appear at those positions as well. EXPLAIN outputs one line for each node in the plan tree, showing the basic node type and the cost estimates the planner made for executing that node. The first line (the topmost node) contains the estimated total execution cost of the plan, and it is this number that the planner tries to minimize.
Here's a simple example:        


    EXPLAIN SELECT * FROM tenk1;
                             QUERY PLAN
    -------------------------------------------------------------
     Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
    

The values reported by EXPLAIN are:
      1). The estimated start-up cost (time spent before the output phase can begin, e.g. time spent sorting in a sort node).
      2). The estimated total cost.
      3). The estimated number of rows output by this plan node.
      4). The estimated average width of the rows output by this plan node (in bytes).
The cost is measured in units of disk page fetches; for example, 1.0 represents one sequential disk page read. The cost of an upper-level node includes the cost of all of its child nodes. The row count is the number of rows the node is expected to emit, not the number of rows it processes or scans, which is usually larger. The row count of the topmost node is expected to approximate the number of rows the query actually returns.
Now let's run the following query against the pg_class system catalog:
 

    SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
 

From the result we can see that the tenk1 table occupies 358 disk pages and contains 10,000 rows. However, to reproduce the cost value we still need to know one more system parameter.
 

    postgres=# show cpu_tuple_cost;
     cpu_tuple_cost
    ----------------
     0.01
    (1 row)
     cost = 358 (number of disk pages) + 10000 (number of rows) * 0.01 (cpu_tuple_cost) = 458
    
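
As a quick sanity check, the same arithmetic can be done directly in SQL. This is only a sketch: it assumes the default seq_page_cost of 1.0, and relpages/reltuples are themselves just estimates maintained by VACUUM and ANALYZE.

    -- Recompute the sequential-scan estimate from the catalog statistics.
    -- Assumes default planner settings (seq_page_cost = 1.0, cpu_tuple_cost = 0.01).
    SELECT relpages * current_setting('seq_page_cost')::numeric
         + reltuples::numeric * current_setting('cpu_tuple_cost')::numeric AS estimated_cost
      FROM pg_class
     WHERE relname = 'tenk1';
    -- For the values above this yields 358 * 1.0 + 10000 * 0.01 = 458,
    -- matching the Seq Scan total cost of 458.00.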

Next, let's look at a query plan that includes a WHERE condition.
 

    EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 7000;
   
                             QUERY PLAN
    ------------------------------------------------------------
     Seq Scan on tenk1  (cost=0.00..483.00 rows=7033 width=244)
       Filter: (unique1 < 7000)
   

The EXPLAIN output shows that the WHERE clause is applied as a "Filter" condition, which means the plan node scans every row in the table, checks each one against the filter condition, and outputs only the rows that pass. Because of the WHERE clause, the estimated number of output rows is reduced. Even so, the scan still has to visit all 10,000 rows, so the cost is not actually reduced; it even increases slightly to account for the extra CPU time spent checking the filter.
The numbers above are only estimates, and they can change after each run of the ANALYZE command, because the statistics ANALYZE produces are computed from a random sample of the table.
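
For example, the statistics can be refreshed by hand at any time; the row estimates shown by EXPLAIN may shift slightly after each run because ANALYZE samples the table:

    -- Re-sample the statistics for one table (or run plain ANALYZE for all tables);
    -- subsequent EXPLAIN estimates are based on the new sample.
    ANALYZE tenk1;
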
      If we make the condition in the query above more restrictive, we get a different query plan, such as
 

    EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;
   
                                      QUERY PLAN
    ------------------------------------------------------------------------------
     Bitmap Heap Scan on tenk1  (cost=2.37..232.35 rows=106 width=244)
       Recheck Cond: (unique1 < 100)
       ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
             Index Cond: (unique1 < 100)
   

Here the planner has decided on a two-step plan: the lower plan node visits an index to find the locations of the rows matching the index condition, and the upper plan node then fetches those rows from the table itself. Fetching rows individually is much more expensive than reading them sequentially, but because not all of the table's disk pages have to be visited, this is still cheaper than a sequential scan. The reason for using two plan levels is that the upper plan node first sorts the row locations returned by the index into physical order, which minimizes the cost of the separate page fetches. The "bitmap" mentioned in the node names is the mechanism that performs this sorting.
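
One way to see the value of the bitmap machinery is to disable it temporarily and compare the plans. A small experiment using the enable_bitmapscan planner setting (the resulting plan and costs will of course depend on your data):

    -- Disable bitmap scans for this session only, compare the plan and cost,
    -- then restore the default behaviour.
    SET enable_bitmapscan = off;
    EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;
    RESET enable_bitmapscan;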

      Now let's make the WHERE condition even more restrictive, such as
 


    EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 3;
   
                                      QUERY PLAN
    ------------------------------------------------------------------------------
     Index Scan using tenk1_unique1 on tenk1  (cost=0.00..10.00 rows=2 width=244)
       Index Cond: (unique1 < 3)
   

In this case the table rows are fetched in index order, which makes each individual fetch more expensive, but there are so few rows to fetch that sorting them by physical location is not worth it.
Now let's add another condition to the WHERE clause, such as
 

    EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 3 AND stringu1 = 'xxx';
   
                                      QUERY PLAN
    ------------------------------------------------------------------------------
     Index Scan using tenk1_unique1 on tenk1  (cost=0.00..10.01 rows=1 width=244)
       Index Cond: (unique1 < 3)
       Filter: (stringu1 = 'xxx'::name)
   

The new filter condition stringu1 = 'xxx' only reduces the estimated number of output rows; it does not reduce the cost, because the same set of rows still has to be visited. Note that this condition is not treated as an index condition, but as a filter applied to the results of the index scan.
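
If this kind of query were frequent, one could consider an index that also covers the filtered column, giving the planner the option of turning the filter into an index condition; whether it actually does so depends on the selectivity of the conditions. A purely hypothetical sketch (tenk1_unique1_stringu1 is not part of the example database):

    -- Hypothetical multicolumn index; with it, the planner *may* evaluate both
    -- conditions as index conditions instead of filtering index results.
    CREATE INDEX tenk1_unique1_stringu1 ON tenk1 (unique1, stringu1);
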
If the WHERE condition references several indexed columns, the planner may combine the indexes with AND or OR operations, such as
 

    EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000;
    
                                         QUERY PLAN
    -------------------------------------------------------------------------------------
     Bitmap Heap Scan on tenk1  (cost=11.27..49.11 rows=11 width=244)
       Recheck Cond: ((unique1 < 100) AND (unique2 > 9000))
       ->  BitmapAnd  (cost=11.27..11.27 rows=11 width=0)
             ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
                   Index Cond: (unique1 < 100)
             ->  Bitmap Index Scan on tenk1_unique2  (cost=0.00..8.65 rows=1042 width=0)
                   Index Cond: (unique2 > 9000)
   

A plan like this accesses both indexes, which is not necessarily better than using only one index and applying the other condition as a filter.
Now let's look at the query plan for a join on indexed columns, such as
 

    EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
     
                                          QUERY PLAN
    --------------------------------------------------------------------------------------
     Nested Loop  (cost=2.37..553.11 rows=106 width=488)
       ->  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244)
             Recheck Cond: (unique1 < 100)
             ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
                   Index Cond: (unique1 < 100)
       ->  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.00..3.01 rows=1 width=244)
             Index Cond: ("outer".unique2 = t2.unique2)
   

As the plan shows (Nested Loop), this query uses a nested-loop join. The outer scan is the same bitmap index scan as in the earlier query, so its cost and row count are unchanged, because the condition unique1 < 100 is applied there. The t1.unique2 = t2.unique2 clause plays no role yet, so it does not affect the row count of the outer scan. For the inner scan, however, the unique2 value of the current outer row is plugged into the inner index scan, producing an index condition like the one shown. So the inner scan gets the same plan and cost as a standalone EXPLAIN SELECT * FROM tenk2 WHERE unique2 = 42 would. Finally, the cost of the loop node is the cost of the outer scan, plus one repetition of the inner scan for each outer row (here 106 * 3.01), plus a little CPU time for join processing.
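
The standalone query mentioned above can be run directly to confirm how the inner scan is costed (42 simply stands in for whichever unique2 value the current outer row supplies):

    -- Costed the same way as each repetition of the inner index scan above.
    EXPLAIN SELECT * FROM tenk2 WHERE unique2 = 42;
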
If we do not want the planner to use a nested loop for this query, we can disable nested loops with the following setting, such as
 

    SET enable_nestloop = off;
    EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
     
                                            QUERY PLAN
    ------------------------------------------------------------------------------------------
     Hash Join  (cost=232.61..741.67 rows=106 width=488)
       Hash Cond: ("outer".unique2 = "inner".unique2)
       ->  Seq Scan on tenk2 t2  (cost=0.00..458.00 rows=10000 width=244)
       ->  Hash  (cost=232.35..232.35 rows=106 width=244)
             ->  Bitmap Heap Scan on tenk1 t1  (cost=2.37..232.35 rows=106 width=244)
                   Recheck Cond: (unique1 < 100)
                   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..2.37 rows=106 width=0)
                         Index Cond: (unique1 < 100)
   

      This plan still uses the same index scan to pull the required rows of tenk1 (those with unique1 < 100) and stores them in an in-memory hash table; it then does a full sequential scan of tenk2, probing the hash table for each tenk2 row to find possible matches of t1.unique2 = t2.unique2. Reading tenk1 and building the hash table make up the start-up cost of this hash join, because no output rows can be produced before the scan of tenk2 begins.
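
Since enable_nestloop was disabled only for this experiment, remember to restore it afterwards so that later queries are planned normally:

    -- Restore the planner's default behaviour.
    RESET enable_nestloop;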

In addition, we can check the accuracy of the planner's estimates with the EXPLAIN ANALYZE command. EXPLAIN ANALYZE actually executes the query and then shows the true run time accumulated within each plan node, alongside the same estimated costs a plain EXPLAIN would show, such as
 


    EXPLAIN ANALYZE SELECT * FROM tenk1 t1, tenk2 t2 WHERE t1.unique1 < 100 AND t1.unique2 = t2.unique2;
 
Note that the "actual time" values are measured in milliseconds of real time, whereas the "cost" estimates are expressed in arbitrary units (disk page fetches), so the two sets of numbers are unlikely to match. What we should pay attention to is whether the ratios between the estimated and actual values stay roughly consistent.

In some query plans a subplan node may be executed more than once; in the nested-loop plan above, for example, the inner index scan runs once for each outer row. In that case, "loops" reports the total number of executions of the node, and the actual time and rows shown are averages per execution. This is done so that the numbers remain comparable with the cost estimates. To get the total time spent in a node, multiply its actual time by the "loops" value.
The "Total runtime" shown by EXPLAIN ANALYZE includes executor start-up and shut-down time and the time spent processing the result rows, but it does not include parsing, rewriting, or planning time.
Keep in mind that EXPLAIN results measured in a test environment do not necessarily carry over to production data. For example, running EXPLAIN against a table with very little data tells you little about how the same query behaves on large tables, because the planner's cost calculations are not linear, so it may well choose a different plan for a larger or smaller table. An extreme example is a table that occupies only one disk page: you will almost always get a sequential scan plan for it, whether it has indexes or not. The planner knows it has to read one disk page in any case, so spending extra page reads to consult an index would be pointless.

II. Batch data insertion:

      The following methods can help optimize bulk data insertion.

1. Turn off autocommit:

If each row is committed automatically during a bulk insert, a system failure in the middle of the load not only leaves the batch in an inconsistent, partially inserted state, but the many individual commits also hurt the overall insertion speed badly. The solution is to turn off autocommit, issue an explicit BEGIN before the insertion starts, and issue a single COMMIT after all rows have been inserted.
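
A minimal sketch of the idea, assuming a hypothetical target table my_table; the point is that all rows are covered by one explicit transaction instead of one implicit commit per INSERT:

    BEGIN;                                           -- start one explicit transaction
    INSERT INTO my_table (id, val) VALUES (1, 'a');
    INSERT INTO my_table (id, val) VALUES (2, 'b');
    -- ... many more rows ...
    COMMIT;                                          -- commit everything at once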
       
2. Use COPY:

      Use COPY to load all rows in one command instead of a long series of INSERT commands. The COPY command is optimized for loading large numbers of rows; it is less flexible than INSERT, but its per-row overhead is much lower when loading large amounts of data. Because COPY is a single statement, there is no need to turn off autocommit when using it to populate a table.
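
A sketch of the COPY form, with a hypothetical table and file path (COPY ... FROM a file reads the file on the server; psql's \copy reads a client-side file instead):

    -- Load the whole file in a single command instead of many INSERTs.
    COPY my_table (id, val) FROM '/path/to/data.csv' WITH (FORMAT csv);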
       
3. Remove indexes:

If you are loading a freshly created table, the fastest approach is to create the table, bulk-load the data with COPY, and only then create whatever indexes the table needs, because building an index on existing data is faster than updating it incrementally as each row is added. Of course, while the indexes are missing, other queries against the table will perform worse, and uniqueness constraints cannot be enforced.
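
A sketch of the load order under this approach (table, index, and file names are hypothetical):

    CREATE TABLE my_table (id integer, val text);              -- create the table first
    COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv);  -- bulk load without indexes
    CREATE INDEX my_table_id_idx ON my_table (id);             -- build indexes afterwards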
       
4. Remove foreign key constraints:
      As with indexes, checking a foreign key constraint "in bulk" is more efficient than checking it row by row. So we can drop the foreign key constraints, load the data, and then recreate the constraints.
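
A sketch with hypothetical table and constraint names:

    ALTER TABLE my_table DROP CONSTRAINT my_table_ref_fk;      -- drop before loading
    -- ... bulk load the data ...
    ALTER TABLE my_table
      ADD CONSTRAINT my_table_ref_fk FOREIGN KEY (ref_id) REFERENCES ref_table (id);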
       
5. Increase maintenance_work_mem:
Temporarily increasing the maintenance_work_mem system variable can improve performance when loading large amounts of data. It speeds up CREATE INDEX and ALTER TABLE ADD FOREIGN KEY commands, although it has little effect on the COPY operation itself.
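
For example, the setting can be raised just for the loading session (the value is illustrative; choose it according to available memory):

    SET maintenance_work_mem = '512MB';   -- affects CREATE INDEX / ADD FOREIGN KEY in this session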
       
6. Increase checkpoint_segments:
Temporarily increasing the checkpoint_segments system variable can also make large data loads faster. This is because loading large amounts of data into PostgreSQL causes checkpoints to occur more often than the normal checkpoint frequency (set by the checkpoint_timeout variable). At every checkpoint, all dirty data must be flushed to disk; increasing checkpoint_segments effectively reduces the number of checkpoints.
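
checkpoint_segments cannot be changed per session; it is raised in postgresql.conf and picked up by a configuration reload. The value below is illustrative, and note that on PostgreSQL 9.5 and later this parameter was replaced by max_wal_size:

    -- In postgresql.conf (older releases):
    --   checkpoint_segments = 32
    -- then reload the configuration:
    SELECT pg_reload_conf();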
       
7. Run ANALYZE afterwards:
Run the ANALYZE command immediately after adding or updating a large amount of data, so that the planner has up-to-date statistics for the table. If there are no statistics, or the statistics are stale, the planner may choose a poor query plan and query performance will suffer.
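
For example, right after the bulk load (my_table is the hypothetical table from the sketches above):

    ANALYZE my_table;                 -- refresh the statistics the planner relies on
    -- or: VACUUM ANALYZE my_table;   -- also reclaims space while updating statistics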

