Example of a MySQL method for quickly inserting tens of millions of rows

  • 2021-12-11 19:17:08
  • OfStack

In the field of data analysis, the database is our good helper: it not only answers our queries in acceptable time, but also lets us carry out further analysis on that basis. So we must first get the data into the database. In practice we often face tens of millions of rows or more, and without a fast insertion method you will get half the result with twice the effort and waste a great deal of time.

I ran into exactly this problem in Ali's Tianchi big data algorithm competition (Popular Music Trend Prediction). Before optimizing the database queries and inserts, I wasted a lot of time: inserting 15 million rows took an incredible 12 hours (using the most basic row-by-row insert). That pushed me to think about how to optimize database insertion and query operations and improve efficiency.

Continuous optimization brought large performance gains. Querying and aggregating the downloads, plays and favorites of more than 26,000 songs from the database as time series dropped from an estimated 40 hours to a little over 1 hour. Insertion improved just as much: tested on a new data set, 54.9 million+ rows were inserted in 20 minutes. Here is the experience I'd like to share.

The optimization has two steps. Step 1: a static reader loads data from the CSV file, and each time a fixed number of lines has accumulated, a thread is started to insert that batch into the database. Step 2: use MySQL batch insert operations.

Step 1: Read the file and start the insertion threads

In my experiment I started with 1,000,000 (100w) lines as the batch size, but this caused a Java heap memory overflow, so I finally settled on 100,000 (10w) lines per batch.

Of course other batch sizes are possible; pick whichever suits your machine.


import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import preprocess.ImportDataBase;

public class MuiltThreadImportDB {

  /**
   * Reads a large file and inserts it into the database with multiple threads.
   */
  private static int m_record = 99999; // lines per batch (about 10w)
  private static BufferedReader br = null;
  private ArrayList<String> list;
  private static int m_thread = 0; // batch counter, also used as the thread name

  static {
    try {
      br = new BufferedReader(
          new FileReader("E:/tianci/IJCAI15 Data/data_format1/user_log_format1.csv"), 8192);
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    }
    try {
      br.readLine(); // skip the CSV header
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  public void start() {
    String line;
    int count = 0;
    list = new ArrayList<String>(m_record + 1);
    synchronized (br) {
      try {
        while ((line = br.readLine()) != null) {
          if (count < m_record) {
            list.add(line);
            count++;
          } else {
            // batch is full: hand it to a new insertion thread, start a fresh list
            list.add(line);
            count = 0;
            Thread t1 = new Thread(new MultiThread(list), Integer.toString(m_thread++));
            t1.start();
            list = new ArrayList<String>(m_record + 1);
          }
        }

        // flush the last, possibly partial batch
        if (!list.isEmpty()) {
          Thread t1 = new Thread(new MultiThread(list), Integer.toString(m_thread++));
          t1.start();
        }
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }

  public static void main(String[] args) {
    new MuiltThreadImportDB().start();
  }
}
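
One caveat about the loop above: it starts a brand-new thread for every batch, so 55 million rows with 10w-line batches means roughly 550 threads created over the run. A common alternative (not part of the original experiment) is to hand the batches to a fixed-size thread pool instead. Below is a minimal sketch under that assumption; the PooledImport class, the submitBatch/shutdownAndWait helpers and the pool size of 8 are all illustrative names, not the article's code.

import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledImport {
  // A fixed pool keeps at most 8 insert batches in flight at once (illustrative size).
  private static final ExecutorService pool = Executors.newFixedThreadPool(8);

  // Called wherever the original code did "new Thread(new MultiThread(list)).start()".
  static void submitBatch(ArrayList<String> batch) {
    pool.execute(new MultiThread(batch)); // MultiThread is the Runnable from Step 2
  }

  static void shutdownAndWait() throws InterruptedException {
    pool.shutdown(); // stop accepting new batches, let queued ones finish
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}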

Step 2: Insert data in batches using multithreading


class MultiThread implements Runnable {
  private ArrayList<String> list;

  public MultiThread(ArrayList<String> list) {
    this.list = list;
  }

  public void run() {
    try {
      ImportDataBase insert = new ImportDataBase(list);
      insert.start();
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    }
    display(this.list);
  }

  // Print which thread handled the batch and how many lines it contained.
  public void display(List<String> list) {
    System.out.print(Thread.currentThread().getName() + " :");
    System.out.println(list.size());
  }
}

For the batch operation, MySQL's PreparedStatement is used. Batch operations with the plain Statement class were also tried, but the performance is not as good: PreparedStatement reaches an insert speed of 10,000+ (1w+) rows per second, while Statement only manages 2,000+.


public int insertUserBehaviour(ArrayList<String> sqls) throws SQLException {
  String sql = "insert into user_behaviour_log (user_id,item_id,cat_id,merchant_id,brand_id,time_stamp,action_type)"
      + " values(?,?,?,?,?,?,?)";
  preStmt = conn.prepareStatement(sql);
  for (int i = 0; i < sqls.size(); i++) {
    UserLog log = new UserLog(sqls.get(i));
    preStmt.setString(1, log.getUser_id());
    preStmt.setString(2, log.getItem_id());
    preStmt.setString(3, log.getCat_id());
    preStmt.setString(4, log.getMerchant_id());
    preStmt.setString(5, log.getBrand_id());
    preStmt.setString(6, log.getTimeStamp());
    preStmt.setString(7, log.getActionType());
    preStmt.addBatch();
    // execute and commit every 10,000 rows (assumes autocommit is disabled on conn)
    if ((i + 1) % 10000 == 0) {
      preStmt.executeBatch();
      conn.commit();
      preStmt.clearBatch();
    }
  }
  // flush the remaining rows of the last batch
  preStmt.executeBatch();
  conn.commit();
  return 1;
}
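
The method above assumes that conn was opened with autocommit disabled, otherwise the explicit conn.commit() calls add nothing. The article does not show the ImportDataBase internals, so the following is only a minimal sketch of how such a connection might be set up; the Db class name, host, schema and credentials are placeholders. With MySQL Connector/J, the rewriteBatchedStatements=true URL parameter is also worth enabling, since it lets the driver rewrite a JDBC batch into multi-row INSERT statements.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Db {
  // Illustrative connection setup; URL, user and password are placeholders.
  static Connection open() throws SQLException {
    String url = "jdbc:mysql://localhost:3306/tianchi"
        + "?rewriteBatchedStatements=true"; // let Connector/J send multi-row INSERTs
    Connection conn = DriverManager.getConnection(url, "user", "password");
    conn.setAutoCommit(false); // commit manually every 10,000 rows, as in the code above
    return conn;
  }
}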

Of course, the different MySQL storage engines, InnoDB and MyISAM, were also tested. The experiments showed InnoDB to be faster (about 3 times), which may be related to the newer MySQL version; the version used here is 5.6.
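
The engine is chosen when the table is created. A sketch of the DDL for the table used above follows; the column types are assumptions, since the article only shows the column names in the INSERT statement.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateTable {
  // Column types are illustrative; only the column names come from the article.
  static void create(Connection conn) throws SQLException {
    String ddl = "create table if not exists user_behaviour_log ("
        + " user_id varchar(16), item_id varchar(16), cat_id varchar(16),"
        + " merchant_id varchar(16), brand_id varchar(16),"
        + " time_stamp varchar(8), action_type varchar(2)"
        + ") engine = InnoDB"; // or engine = MyISAM for comparison
    try (Statement stmt = conn.createStatement()) {
      stmt.execute(ddl);
    }
  }
}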

Finally, a summary of the ways to improve insertion speed when the data volume is large.

On the Java side, use multi-threaded inserts and batch commits.

On the database side, create the table without indexes; otherwise every insert also has to maintain the index B+ tree, so build the indexes after the bulk load instead. Also consider the storage engine: the default in newer MySQL versions is InnoDB, which can be kept as-is, while older versions may need the engine set explicitly.
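
As an illustration of "index after loading", one might add the index only once the bulk insert has finished. A minimal sketch; the user_id index and the idx_user_id name are hypothetical, as the article does not say which columns it indexed.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class AddIndexAfterLoad {
  // Adding the index once, after the bulk load, avoids maintaining
  // the B+ tree on every one of the millions of inserts.
  static void addIndex(Connection conn) throws SQLException {
    try (Statement stmt = conn.createStatement()) {
      stmt.execute("alter table user_behaviour_log add index idx_user_id (user_id)");
    }
  }
}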

