PostgreSQL tutorial of viii: index in detail

  • 2020-05-06 11:54:06
  • OfStack

Type of index:

      PostgreSQL provides a variety of index types: B-Tree, Hash, GiST, and GIN. Because they use different algorithms, each index type has its own suitable query type. By default, the CREATE INDEX command creates an B-Tree index.
     
      1. B-Tree:
 


    CREATE TABLE test1 (
        id integer,
        content varchar
    );
    CREATE INDEX test1_id_index ON test1 (id);
 

The       B-Tree index is primarily used for equality and range queries, especially when the index column contains the operator "< , < =, =, > = and > "All query planners at PostgreSQL consider using the B-Tree index as a query condition. In queries using BETWEEN, IN, IS NULL, and IS NOT NULL, PostgreSQL can also use the B-Tree index. However, for queries based on pattern matching operators such as LIKE, ILIKE, ~, and ~*, the index takes effect only if the pattern has a constant at the beginning of the pattern string, such as col LIKE 'foo%' or col ~ '^foo', otherwise a full table scan is performed, such as col LIKE '%bar'.
     
      2. Hash:
 

    CREATE INDEX name ON table USING hash (column);
 

The       hash (Hash) index can only handle simple equals comparisons. When index columns are compared using the equals operator, the query planner considers using hash indexes.
As an additional note, the performance of the PostgreSQL hash index is no better than that of the B-Tree index, but the size and construction time of the hash index are worse. Also, since the hash index operation is not currently logged on WAL, we would have to rebuild the hash index on REINDEX in case of a database crash.
     
      3. GiST:
The       GiST index is not a single index type, but a schema on which many different indexing strategies can be implemented. This enables the GiST index to use specific operator types depending on the index strategy.
     
      4. GIN:
The       GIN index is an inverted index that can handle values containing multiple keys (such as arrays). Similar to GiST, GIN also supports user-defined index policies, allowing the GIN index to use specific operator types depending on the different index policies. As an example, the standard release for PostgreSQL includes the GIN operator type for one-dimensional arrays, such as < @ @ > , =, &&, etc.

Ii. Composite index:

Indexes in       PostgreSQL can be defined on multiple fields in a data table, such as
 


    CREATE TABLE test2 (
        major int,
        minor int,
        name varchar
    }
    CREATE INDEX test2_mm_idx ON test2 (major, minor);
 

In the current version, only B-tree, GiST, and GIN support composite indexes, where up to 32 fields can be declared.
      1. Composite index of B-Tree type:
In a composite index of the B-Tree type, any subset of the index field can be used for query criteria, however, maximum efficiency is achieved only when the first index field (far left) in the composite index is included.
     
      2. Composite index of GiST type:
In an       composite index of type GiST, it is only when the first index field is included in the query criteria that the query can determine how much index data will be scanned, and the conditions on the other index fields simply limit the entries returned by the index. If most of the data on the first index field has the same key value, then applying the GiST index is less efficient.

      3. Composite index of GIN type:
      unlike B-Tree and GiST indexes, GIN composite indexes are not affected by which subsets of index fields are used in the query condition, and either combination will result in the same efficiency.

      should be careful when using composite indexes. In most cases, an index on a single field is sufficient and saves time and space. An index of more than three fields is of little use unless the usage pattern of the table is very fixed.

3. Multiple indexes:

      PostgreSQL can combine multiple indexes (including multiple USES of the same index) at query time to handle situations where a single index scan is not possible. At the same time, the system can also form AND and OR conditions between multiple index scans. For example, a query like WHERE x = 42 OR x = 47 OR x = 53 OR x = 99 can be broken down into four separate scans based on the index of the x field, each scan using a query clause, and then the scan results OR together to produce the final result. Another example is if we have separate indexes on x and y, then a query like WHERE x = 5 AND y = 6 will be scanned separately based on the indexes of these two fields, and then AND operation will be performed on the results of each scan to generate the final result row.

To combine multiple indexes, the system scans each required index and then organizes an BITMAP in memory, which gives the physical location of the data scanned by the index in the data table. Then, according to the needs of the query, these bitmaps to AND or OR operations and the final BITMAP. Finally, the data table is retrieved and rows of data are returned. The rows of the table are accessed in physical order, because this is the layout of the bitmap, which means that any sort of original index will disappear. If there is an ORDER BY clause in the query, there will be an additional sort step. Because of this, and because each additional index scan adds extra time, the planner sometimes chooses to use a simple index scan, even if multiple indexes are available.        

     
4. Unique index:

Currently, only the B-Tree index can be declared unique.
 


    CREATE UNIQUE INDEX name ON table (column [, ...]);
 

      if the index is declared to be unique, multiple rows with the same index value are not allowed. We don't think NULL is equal to each other.
     
5. Expression index:

The       expression index is primarily used in cases where the result of a function or expression based on one field is compared to other values in a query condition, such as
 


    SELECT * FROM test1 WHERE lower(col1) = 'value';
 

At this point, if we were only indexing on the col1 field, the query would definitely not use that index at execution time, but would instead perform a full table scan. If the table has a large amount of data, it will take a long time to execute the query. The solution to this problem is simply to create an expression index on the test1 table based on the col1 field, such as
      CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));
      if we declare this index as UNIQUE, it prohibits the creation of rows whose values col1 differ only in case and whose values col1 are identical. Therefore, indexes on expressions can be used to enforce constraints that cannot be defined as simple unique constraints. Now let's look at another example of applying an expression index.
 

    SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith';
 

      as in the above example, although we may create separate indexes for first_name and last_name, or a composite index based on these two fields, these indexes will not be used when the query is executed, and the only indexes available for the query are the expression indexes we created below.
 

    CREATE INDEX people_names ON people ((first_name || ' ' || last_name));
 

The syntax of the       CREATE INDEX command generally requires that parentheses be written around the index expression, as we showed in the second example. If the expression is just a function call, you can omit it, as we showed in the first example.

From an index maintenance point of view, an index expression is relatively inefficient, because when data is inserted or updated, the result of the expression must be evaluated for that row and stored directly in the index. However, when querying, PostgreSQL treats them as WHERE idxcol = 'constant', so the search speed is equivalent to that of a query based on a simple index. In general, we should only use expression indexes in situations where the retrieval speed is more important than the insertion and update speed.  
     
vi. Partial index:

An       partial index (partial index) is an index built on a subset of a table defined by a conditional expression (a predicate called a partial index). The index contains only those rows in the table that satisfy this predicate.
      partial indexes improve the efficiency of data insertion and data update because they do not need to be updated in all cases. However, because partial indexes are smaller than normal indexes, it is possible to better improve the query efficiency of the index part. See the following three examples:
      1. The index field and the predicate condition field are the same:
 


    CREATE INDEX access_log_client_ip_ix ON access_log(client_ip)
        WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255');

      the following query will use this section index:
 

    SELECT * FROM access_log WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';
 

The following query will not use this part of the index:
A query that cannot use this index may be:
 

    SELECT * FROM access_log WHERE client_ip = inet '192.168.100.23';

      2. Inconsistencies between index field and predicate condition field:
      PostgreSQL supports partial indexes with arbitrary predicates, the only constraint being that the fields of the predicates also come from the same data table. Note that if you want your query to use partial indexes, the conditional part of the query must exactly match the predicate of the partial index. Precisely, this partial index can only be used for the query if PostgreSQL is able to recognize that the WHERE condition for the query mathematically covers the predicates of the index.
 

    CREATE INDEX orders_unbilled_index ON orders(order_nr) WHERE billed is not true;
 

      the following query must use this section index:
 

    SELECT * FROM orders WHERE billed is not true AND order_nr < 10000;
 

      what about the following query?
 

    SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
 

      this query will not be as efficient as the above query because the query condition does not use the index field, whereas the query condition "billed is not true" exactly matches the predicate of the partial index, so PostgreSQL will scan the entire index. This makes the query more efficient only if there is relatively little index data.

Partial indexes will not be used in the following query for      .
 


    SELECT * FROM orders WHERE order_nr = 3501;
 

     
      3. Uniqueness constraint on a subset of a data table:
 

    CREATE TABLE tests (
        subject text,
        target text,
        success boolean,
        ...
    );
    CREATE UNIQUE INDEX tests_success_constraint ON tests(subject, target) WHERE success;
 

The partial index of       will only impose a unique constraint on data whose success field value is true. In practice, this implementation method can be very efficient if less data succeeds and more data fails.
       
Check the use of the index:

See four Suggestions for      :
1. Always run ANALYZE first.
      this command will collect statistics on the distribution of values in the table. This information is needed to estimate the number of rows returned by a query, which the planner needs to assign a true cost value to each possible query plan. In the absence of any real statistics, some default value is used, which is definitely inaccurate. Therefore, checking the usage of an index without running ANALYZE would be a failed check.
      2. Experiment with real data.
      populates the data table with the test data, and the index of that table will only evaluate how the index is used based on the test data, not all of the data. For example, if you choose 1000 rows from 100000 rows, the planner might consider using an index, and if you choose 1 row from 100 rows, it's hard to say that you will use an index. Because 100 rows of data are most likely stored on a disk page, no query planning is more efficient than accessing a disk page sequentially. At the same time, when simulating test data, it is important to note that if the data is very similar, completely random, or inserted in sorted order, the statistics will deviate from the characteristics that the actual data should have.      
      3. If the index is not used, it may be of some value to force its use in the test. There are runtime parameters that can turn off various query plans.
Forcing the use of indexes will lead to two possibilities: the system selection is correct, the use of indexes is not really appropriate, and the cost calculation of the query plan does not reflect the reality. The EXPLAIN ANALYZE command is useful when you should time queries with and without indexes.


Related articles: