ASP.NET: Fast Writing and Reading of Large Amounts of Data in PostgreSQL (with Examples)

  • 2021-10-13 07:05:57
  • OfStack

Recently a project needed to insert a large amount of data, so I looked into how to write and read large amounts of data quickly in PostgreSQL from ASP.NET. These are my notes.

Environment and Testing

The .NET driver Npgsql is used to connect to the PostgreSQL database. Test machine: Win10 x64, i5-4590, 16 GB DDR3, Samsung 850 EVO SSD.

PostgreSQL 9.6.3; the database and data both live on the SSD, with the default configuration and no extensions.


CREATE TABLE public.mesh
(
 x integer NOT NULL,
 y integer NOT NULL,
 z integer,
 CONSTRAINT prim PRIMARY KEY (x, y)
)

1. Import

The data comes from a backup exported in CSV format: a 480 MB file on a mechanical hard disk containing 25 million+ rows.

Using COPY


copy mesh from 'd:/user.csv' csv

Running time 107s
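
For reference, the same COPY import can also be driven from C# through Npgsql, so the CSV file does not need to be readable by the database server. A minimal sketch, assuming the mesh table above; the method name, connection string and file path are placeholders:

static void BulkCopyCsv(string connString, string csvPath)
{
  using (var conn = new NpgsqlConnection(connString))
  {
    conn.Open();
    // BeginTextImport opens a COPY ... FROM STDIN channel and returns a TextWriter.
    using (var writer = conn.BeginTextImport("COPY mesh (x, y, z) FROM STDIN (FORMAT csv)"))
    {
      foreach (var line in File.ReadLines(csvPath))
        writer.WriteLine(line);
    }
  }
}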

Using insert

Single connection, C# Release build (Any CPU), run without the debugger.


class Program
{
  static void Main(string[] args)
  {
    var list = GetData("D:\\user.csv");
    TimeCalc.LogStartTime();
    using (var sm = new SqlManipulation(@"Strings", SqlType.PostgresQL))
    {
      sm.Init();
      foreach (var n in list)
      {
        sm.ExcuteNonQuery($"insert into mesh(x,y,z) values({n.x},{n.y},{n.z})");
      }
    }
    TimeCalc.ShowTotalDuration();

    Console.ReadKey();
  }

  static List<(int x, int y, int z)> GetData(string filepath)
  {
    List<ValueTuple<int, int, int>> list = new List<(int, int, int)>();
    foreach (var n in File.ReadLines(filepath))
    {
      string[] x = n.Split(',');
      list.Add((Convert.ToInt32(x[0]), Convert.ToInt32(x[1]), Convert.ToInt32(x[2])));
    }
    return list;
  }
}

PostgreSQL CPU usage stays very low, and after running for a long while the program still had not finished; I ran out of patience. Single-row inserts are simply not viable at this scale.

Multi-row insert

With multi-row insertion, one statement inserts about 100 rows of data.


var bag = GetData("D:\\user.csv");
// When executing, call ToString() on each StringBuilder and run the result as a single statement.
List<StringBuilder> listbuilder = new List<StringBuilder>();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < bag.Count; i++)
{
  if (i % 100 == 0)
  {
    sb = new StringBuilder();
    listbuilder.Add(sb);
    sb.Append("insert into mesh(x,y,z) values");
    sb.Append($"({bag[i].x}, {bag[i].y}, {bag[i].z})");
  }
  else
    sb.Append($",({bag[i].x}, {bag[i].y}, {bag[i].z})");
}
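
For completeness, each StringBuilder's ToString() result is then executed as one statement. A minimal sketch, reusing the SqlManipulation helper and TimeCalc timing calls from the single-row test (the connection string is again a placeholder):

TimeCalc.LogStartTime();
using (var sm = new SqlManipulation(@"Strings", SqlType.PostgresQL))
{
  sm.Init();
  // Each StringBuilder holds one complete multi-row insert statement.
  foreach (var builder in listbuilder)
  {
    sm.ExcuteNonQuery(builder.ToString());
  }
}
TimeCalc.ShowTotalDuration();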

PostgreSQL CPU usage is around 27% and disk writes run at about 45 MB/s; it finally looks like real work is being done. The final time is 217.36 s.

Changing to 1000 rows per statement raises CPU usage but lowers the average disk write rate; the final time is 160.58 s.

PREPARE syntax

The PREPARE syntax lets PostgreSQL plan a SQL statement ahead of time, which can improve performance.
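
In its simplest single-row form the pattern looks like the sketch below, using the same SqlManipulation helper (the plan name matches the later code; a real run would execute the plan once per row):

// Prepare the plan once per connection, then execute it with concrete values.
sm.ExcuteNonQuery("PREPARE insertplan (int, int, int) AS INSERT INTO mesh(x, y, z) VALUES ($1, $2, $3)");
sm.ExcuteNonQuery("EXECUTE insertplan(1, 2, 3)");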

With single-row prepared inserts, CPU usage stays below 25% and disk writes run at about 63 MB/s, but the overall throughput is essentially unchanged and the run takes too long to wait out.

With multi-row prepared inserts, CPU usage is about 30% and disk writes about 50 MB/s; the final time is 163.02 s. An exception is thrown at the end because the last batch has fewer rows than the prepared plan expects, which is harmless.


static void Main(string[] args)
{
  var bag = GetData("D:\\user.csv");
  List<StringBuilder> listbuilder = new List<StringBuilder>();
  StringBuilder sb = new StringBuilder();
  for (int i = 0; i < bag.Count; i++)
  {
    if (i % 1000 == 0)
    {
      sb = new StringBuilder();
      listbuilder.Add(sb);
      //sb.Append("insert into mesh(x,y,z) values");
      sb.Append($"{bag[i].x}, {bag[i].y}, {bag[i].z}");
    }
    else
      sb.Append($",{bag[i].x}, {bag[i].y}, {bag[i].z}");
  }
  StringBuilder sbp = new StringBuilder();
  sbp.Append("PREPARE insertplan (");
  for (int i = 0; i < 1000; i++)
  {
    sbp.Append("int,int,int,");
  }
  sbp.Remove(sbp.Length - 1, 1);
  sbp.Append(") AS INSERT INTO mesh(x, y, z) values");
  for (int i = 0; i < 1000; i++)
  {
    sbp.Append($"(${i*3 + 1},${i* 3 + 2},${i*3+ 3}),");
  }
  sbp.Remove(sbp.Length - 1, 1);
  TimeCalc.LogStartTime();

  using (var sm = new SqlManipulation(@"string", SqlType.PostgresQL))
  {
    sm.Init();
    sm.ExcuteNonQuery(sbp.ToString());
    foreach (var n in listbuilder)
    {
      sm.ExcuteNonQuery($"EXECUTE insertplan({n.ToString()})");
    }
  }
  TimeCalc.ShowTotalDuration();

  Console.ReadKey();
}

Using Transaction

Building on the previous approach, transactions are added: each statement still inserts 1000 rows, and each such statement is committed as its own transaction. CPU about 30%, disk writes about 34 MB/s, total time 170.16 s.

Grouping 100 statements into one transaction instead takes 167.78 s.
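
The transactional code itself is not shown here; a minimal sketch with plain Npgsql might look like this, committing each batched multi-row insert in its own transaction (connString and listbuilder are as in the earlier snippets):

using (var conn = new NpgsqlConnection(connString))
{
  conn.Open();
  foreach (var builder in listbuilder)
  {
    // One multi-row insert (1000 rows) per transaction.
    using (var tx = conn.BeginTransaction())
    using (var cmd = new NpgsqlCommand(builder.ToString(), conn, tx))
    {
      cmd.ExecuteNonQuery();
      tx.Commit();
    }
  }
}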

Using multithreading

Building on the previous approach, multiple threads are used: each thread opens its own connection, each connection processes 100 SQL statements, and each statement inserts 1000 rows. Note that MaxPoolSize in the connection string should be set to a larger value; on my machine, leaving it unset produced connection timeout errors.

CPU usage reaches about 80%. Note that because many PostgreSQL server processes are spawned, the disk rate is hard to measure; cumulatively it should be a little under 100 MB/s. The final time is 98.18 s.

The TPL is used here; since the result returned by Parallel.ForEach is not checked, the measured time may be slightly inaccurate (on the low side).


var lists = new List<List<string>>();
var listt = new List<string>();
for (int i = 0; i < listbuilder.Count; i++)
{
  if (i % 1000 == 0)
  {
    listt = new List<string>();
    lists.Add(listt);
  }
  listt.Add(listbuilder[i].ToString());
}
TimeCalc.LogStartTime();
Parallel.ForEach(lists, (x) =>
{
  using (var sm = new SqlManipulation(@";string;MaxPoolSize=1000;", SqlType.PostgresQL))
  {
    sm.Init();
    foreach (var n in x)
    {
      sm.ExcuteNonQuery(n);
    }
  }
});
TimeCalc.ShowTotalDuration();

Write method                          Time (1000 rows/statement)
COPY                                  107 s
Single-row insert                     N/A
Multi-row insert                      160.58 s
Prepared multi-row insert             163.02 s
Transactional multi-row insert        170.16 s
Multi-connection multi-row insert     98.18 s

2. Write updates

The data is updated in real time and the volume keeps growing, so a plain INSERT or UPDATE is not enough. The operation uses the upsert syntax introduced in PostgreSQL 9.5.


insert into mesh(x, y, z) values (...) on conflict (x, y) do update set z = excluded.z

A gripe: PostgreSQL added ON CONFLICT rather late, while MySQL has had an equivalent for a long time...

The data is written again on top of the 25 million+ rows already in the table. Only the multi-row insert-update (upsert) case is tested here; the other variants should behave similarly.

Ordinary multi-row upsert takes 272.15 s.
Multi-threaded upsert takes 362.26 s, with CPU usage at one point hitting 100%. My guess is that with multiple connections the updates contend for locks, which degrades performance.
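
The upsert statements can be built the same way as the plain multi-row inserts, with the ON CONFLICT clause appended to each batch. A minimal sketch reusing the bag list from the earlier code (batch size and variable names are illustrative):

var upserts = new List<string>();
var sb = new StringBuilder();
for (int i = 0; i < bag.Count; i++)
{
  if (i % 1000 == 0)
  {
    // Close the previous batch with the ON CONFLICT clause before starting a new one.
    if (sb.Length > 0)
      upserts.Add(sb.Append(" on conflict (x,y) do update set z = excluded.z").ToString());
    sb = new StringBuilder("insert into mesh(x,y,z) values");
    sb.Append($"({bag[i].x},{bag[i].y},{bag[i].z})");
  }
  else
    sb.Append($",({bag[i].x},{bag[i].y},{bag[i].z})");
}
if (sb.Length > 0)
  upserts.Add(sb.Append(" on conflict (x,y) do update set z = excluded.z").ToString());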

3. Read

Select method

Standard reads still go through a SELECT statement, executed directly with ADO.NET.

Filling a DataTable through a DataAdapter takes 135.39 s; reading with a DbDataReader takes 71.62 s.
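
A minimal sketch of the DbDataReader variant, assuming _conn is an already open NpgsqlConnection as in the COPY example below (the method name is illustrative):

public List<(int x, int y, int z)> QueryWithReader()
{
  var list = new List<(int, int, int)>();
  using (var cmd = new NpgsqlCommand("select x, y, z from mesh", (NpgsqlConnection)_conn))
  using (var reader = cmd.ExecuteReader())
  {
    // Read rows directly, without the DataAdapter Fill step.
    while (reader.Read())
      list.Add((reader.GetInt32(0), reader.GetInt32(1), reader.GetInt32(2)));
  }
  return list;
}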

Copy method


PostgreSQL's COPY command provides a binary STDOUT mode and can export the result of an arbitrary query. Reading all the data this way takes 53.20 s.

public List<(int x, int y, int z)> BulkIQueryNpg()
{
  List<(int, int, int)> dict = new List<(int, int, int)>();
  using (var reader = ((NpgsqlConnection)_conn).BeginBinaryExport("COPY (select x,y,z from mesh) TO STDOUT (FORMAT BINARY)"))
  {
    while (reader.StartRow() != -1)
    {
      var x = reader.Read<int>(NpgsqlDbType.Integer);
      var y = reader.Read<int>(NpgsqlDbType.Integer);
      var z = reader.Read<int>(NpgsqlDbType.Integer);
      dict.Add((x, y, z));
    }
  }
  return dict;
}

Conclusion

Summarizing the test results, the following conclusions can be drawn for large data volumes:

• For importing into an empty table, or a table with no conflicting rows, prefer the COPY statement (see the P.S. for why this caveat matters).
• Inserting many rows with a single statement greatly improves insert performance; the optimal batch size can be found experimentally.
• In this scenario, transactions and PREPARE bring no obvious additional benefit.
• Multi-connection/multi-threaded insertion is faster, but it drives resource usage up, and too many connections can affect other applications.
• Upsert (insert ... on conflict do update) is a newer PostgreSQL feature and costs some performance compared with a plain insert.
• For reading, the COPY statement gives the best performance; the ADO.NET DbDataReader is also a good choice, since it needs no Fill step and reads quickly (though not as fast as COPY).

P.S.

Why not use MySQL

There is no "best", only the most suitable; to be fair, I quite like MySQL too. The main reasons for using PostgreSQL here are:

PostgreSQL's COPY command for import and export directly supports binary mode over STDIN and STDOUT, which is very convenient when a program needs to integrate with it directly. By contrast, MySQL's equivalent SQL syntax (LOAD DATA INFILE) supports neither STDIN nor STDOUT; export can be handled with mysqldump.exe, but there is no particularly good way to import from a program (mysqlimport may work).

Disadvantages compared with MySQL

When PostgreSQL imports with COPY into a table that already contains data, a primary key conflict aborts the whole COPY, which can leave the import incomplete; in other words, COPY cannot update existing rows during the import. MySQL's LOAD syntax, by contrast, lets you state explicitly what to do on error (IGNORE/REPLACE) without interrupting the import.

Others

If you need to import data into MySQL from a program, consider having the program write the data to a file first and then importing the file; reportedly this is much faster than INSERT.

