Analyzing the problems encountered when importing massive data into MySQL, and their solutions

  • 2021-09-25 00:01:42
  • OfStack

In projects you often need to import a large amount of data into a database so that it can be analyzed with SQL, and during the import you run into problems that have to be solved. Here, based on the experience of importing a txt file of about 4 GB, I describe the problems I encountered and their solutions, partly as a record for myself and partly as a reference for anyone who runs into the same issues.

The data I imported is an encyclopedia txt file of more than 4 GB containing over 65 million records, one per line. Each record has 3 fields separated by tabs. To read the data, I use a TripleData class that stores the three fields as Strings, collect many records into a List<TripleData>, and then write each List<TripleData> to the MySQL database, importing all the data in batches.
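For reference, a minimal sketch of this reading step might look like the following (the TripleData field names and the loadFile method are my assumptions, not the exact code of the project):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Holder for the three tab-separated fields of one record.
class TripleData {
    private final String subject;
    private final String predicate;
    private final String object;

    TripleData(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    public String getSubject() { return subject; }
    public String getPredicate() { return predicate; }
    public String getObject() { return object; }
}

public class TripleFileReader {
    // Read one txt file line by line, split each line on Tab and collect the records.
    public static List<TripleData> loadFile(String path) throws IOException {
        List<TripleData> triples = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length == 3) { // skip malformed lines
                    triples.add(new TripleData(fields[0], fields[1], fields[2]));
                }
            }
        }
        return triples;
    }
}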

That is the general idea; the following are the problems encountered during the actual import.

1 Database connection encoding (garbled text) and compatibility

If the data contains Chinese, you must set the encoding parameters in the database connection URL. The URL takes the following form:


URL="jdbc:mysql://"+IP+":"+PORT+"/"+DB_NAME+"?useSSL=false&useUnicode=true&characterEncoding=utf-8";

Setting the encoding to UTF-8 solves the garbled character problem, and setting useSSL addresses a compatibility issue between the JDBC driver and MySQL. If useSSL is not set, an error similar to the following is reported:


Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

The main reason for this message is that the MySQL server version is relatively new while the JDBC driver version is relatively old, so the compatibility option has to be set explicitly.
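The insert code later in this article refers to constants such as Common.URL, Common.DB_USERNAME and Common.DB_PASSWORD. The original project code for these is not shown, so the following is only an assumed sketch of such a helper class (IP, port, database name and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Common {
    // Placeholder connection settings; adjust to your environment.
    public static final String IP = "127.0.0.1";
    public static final String PORT = "3306";
    public static final String DB_NAME = "test";
    public static final String DB_USERNAME = "root";
    public static final String DB_PASSWORD = "password";

    // useUnicode/characterEncoding prevent garbled Chinese text,
    // useSSL=false avoids the SSL warning from newer MySQL servers.
    public static final String URL = "jdbc:mysql://" + IP + ":" + PORT + "/" + DB_NAME
            + "?useSSL=false&useUnicode=true&characterEncoding=utf-8";

    public static Connection getConnection() throws SQLException {
        return DriverManager.getConnection(URL, DB_USERNAME, DB_PASSWORD);
    }
}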

2 The utf8mb4 encoding problem

During the import you may also encounter an error message similar to:

SQLException: Incorrect string value: '\xF0\xA1\x8B\xBE\xE5\xA2...' for column 'name'

This error occurs because MySQL's default utf8 charset only supports characters up to 3 bytes, which is fine for ordinary data. A large dataset, however, will inevitably contain emoji (WeChat-style emoticons) or other special characters that take 4 bytes, which utf8 cannot store, so an error is thrown. The solution is utf8mb4, the true 4-byte UTF-8 encoding that MySQL introduced in version 5.5.3; the MySQL encoding needs to be changed to it.

You can follow these steps. First, back up the database to be modified: although utf8mb4 is backwards compatible with utf8, it is better to guard against mistakes and have a backup ready. Second, change the character set of the database to utf8mb4 (UTF-8 Unicode) with the collation utf8mb4_general_ci; I made this change with navicat, and a sketch of the equivalent statements for the command line, run through JDBC, appears after the configuration below. Third, edit the configuration file my.ini in the MySQL installation directory and add the following settings:


[client]
default-character-set = utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_general_ci
[mysql]
default-character-set = utf8mb4

After the modification is completed, you need to restart mysql for the modification to take effect.
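If you prefer the command line over navicat for the second step, the change amounts to the standard ALTER statements shown below, here executed through JDBC. This is only a sketch: mydb and mytable are placeholder names, and the final query just prints the character set variables so you can verify the result.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConvertToUtf8mb4 {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://127.0.0.1:3306/mydb?useSSL=false&useUnicode=true&characterEncoding=utf-8";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             Statement stmt = conn.createStatement()) {
            // Change the default character set of the database (affects newly created tables).
            stmt.executeUpdate("ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci");
            // Convert an existing table and its text columns to utf8mb4.
            stmt.executeUpdate("ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci");
            // Check that the character set variables now report utf8mb4.
            try (ResultSet rs = stmt.executeQuery("SHOW VARIABLES LIKE 'character_set%'")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " = " + rs.getString(2));
                }
            }
        }
    }
}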

Then import the data again; this time it should import normally.

3 Time efficiency of the mass import

Because the amount of data is so large, we split it up: the 65 million records were divided into 500 files of roughly 110,000 records each. The records of each file are put into an ArrayList<TripleObject> and then imported as one batch. The general idea is to use a multi-row statement of the form "insert into tb (...) values (...), (...)...;", inserting many rows with a single insert, which saves a great deal of time. An example method is shown below.


import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public static void insertSQL(String sql, List<TripleObject> tripleObjectList) throws SQLException {
    if (tripleObjectList == null || tripleObjectList.isEmpty()) {
        return; // nothing to insert
    }
    Connection conn = null;
    Statement stmt = null;
    try {
        conn = DriverManager.getConnection(Common.URL, Common.DB_USERNAME, Common.DB_PASSWORD);
        conn.setAutoCommit(false); // commit manually so the whole batch is one transaction
        // Build the "(...),(...),..." suffix of the multi-row insert
        StringBuilder suffix = new StringBuilder();
        for (TripleObject triple : tripleObjectList) {
            // Strip characters that would break the hand-built SQL string
            String s = triple.getSubject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
            String p = triple.getPredicate().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
            String o = triple.getObject().replaceAll(",", ".").replaceAll("\\(", "").replaceAll("\\)", "").replaceAll("'", "").replaceAll("\\\\", "");
            suffix.append("('").append(s).append("','").append(p).append("','").append(o).append("'),");
        }
        // Assemble the complete multi-row INSERT (drop the trailing comma)
        String allsql = sql + suffix.substring(0, suffix.length() - 1);
        stmt = conn.createStatement();
        stmt.executeUpdate(allsql); // one statement inserts the whole batch
        conn.commit();
    } catch (Exception e) {
        e.printStackTrace();
        if (conn != null) {
            conn.rollback(); // undo the partial batch on failure
        }
    } finally {
        if (stmt != null) {
            stmt.close();
        }
        if (conn != null) {
            conn.close();
        }
    }
}

The advantage of this method is that the import is very fast: the 65 million records took almost exactly one hour. The disadvantage is that if the data contains long sentences, the commas, brackets, backslashes and so on have to be handled, so you have to weigh whether this method is appropriate.

If you insert the rows one by one, that is, in the form "insert into tb (...) values (...); insert into tb (...) values (...); …", you do not have to deal with special characters, but it takes far longer. In a test, about 110,000 records took roughly 12 minutes, which works out to about 100 hours for all 65 million records.
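A possible middle ground, which was not used in this project but which avoids the manual escaping entirely, is a parameterized batch: a PreparedStatement with ? placeholders plus addBatch(), combined with rewriteBatchedStatements=true in the JDBC URL so that Connector/J rewrites the batch into multi-row inserts on the wire. A minimal sketch, assuming a table tb with columns s, p, o:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ParameterizedBatchInsert {
    // Placeholders do the quoting, so commas, brackets and backslashes need no special handling.
    public static void insertBatch(List<TripleObject> triples) throws SQLException {
        String url = "jdbc:mysql://127.0.0.1:3306/mydb?useSSL=false&useUnicode=true"
                + "&characterEncoding=utf-8&rewriteBatchedStatements=true";
        String sql = "insert into tb (s, p, o) values (?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             PreparedStatement psts = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);
            for (TripleObject t : triples) {
                psts.setString(1, t.getSubject());
                psts.setString(2, t.getPredicate());
                psts.setString(3, t.getObject());
                psts.addBatch();
            }
            psts.executeBatch(); // Connector/J rewrites the batch into multi-row inserts
            conn.commit();
        }
    }
}

How close this comes to the speed of the hand-built multi-row insert depends on the driver and server settings, so it is worth timing it on one of the 110,000-record files before committing to either approach.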

We used the first method: the data only needs to be browsable and the requirements on it are not that strict, so saving time was the priority.

These are the problems I encountered when importing a large amount of data into MySQL and the solutions I came up with. If you have a better solution or run into other problems, I would be happy to discuss them.

