Share the migration steps of massive data under MySQL

  • 2020-06-03 08:36:05
  • OfStack

Corporate data center is planning to do 1 time migration, huge amounts of data at the same time increase a time field (turns out to be datatime type, increased 1 date type), single table data volume reached more than 600 million records, data is based on the time (in) doing partition because a bit busy, 1 straight no summary, so very detail place all don't remember clearly, simply sum up the situation at that time, here to cheat

Random disturb bump

When I received the task at first, without a clear starting point, it was select * from db limit 10000 directly. The number of pages was dynamically modified, and the time was slow to see through the console


SELECT IR_SID,IR_HKEY,IR_GROUPNAME,IR_SITENAME,IR_CHANNEL,IR_MID,IR_URLNAME,IR_STATUS_CONTENT,IR_CREATED_AT,date_format(IR_CREATED_AT,'%Y.%m.%d'),IR_LASTTIME,IR_VIA,IR_THUMBNAIL_PIC,IR_RTTCOUNT,IR_COMMTCOUNT,IR_UID,IR_SCREEN_NAME,IR_RETWEETED_UID,IR_RETWEETED_SCREEN_NAME,IR_RETWEETED_MID,IR_RETWEETED_URL,IR_STATUS_BODY FROM TB_SINA_STATUS WHERE IR_SID>40000 AND IR_SID<50001 INTO OUTFILE '/home/mysql/data/data_outfile.txt'; LOAD DATA INFILE '/home/mysql/data/data_outfile.txt' INTO TABLE NEW_TB_SINA_STATUS; 

Can you read data based on partition

Since the database is partitioned according to partition, is it possible to read data according to partition and what syntax to use? As long as the data is read monthly, mysql will automatically read based on partition. You can use the command: explain partition to see which partition is based on
It takes 10 minutes to read big data. As for the status of the data, we may want to know it by 10 minutes. We can use the command: show status to check.
How does the innodb engine perform compared to the myIsam engine?
In the guide library experiment, the data exported and imported for one month (8G's text volume, 2500w records) was less than 4h under the myisam engine (the test environment was pc machine), but under the innodb engine, it took 32 hours, and 28h was needed to improve the index and so on, which showed an 8-fold difference in performance.

I found out the difference between innodb and myisam on the Internet, and said that innodb_buffer_pool_size, innodb_flush_ES56en_ES58en_ES59en should be modified

Can guarantee not too big difference, tried not to improve obviously, in this machine is ok, why??? This has been bothering me for a long time

innodb_flush_log_at_trx_commit

Is Innodb 1000 times slower than MyISAM and has a larger head? Maybe you forgot to change this parameter. The default value is 1, which means that every committed update transaction (or statement outside of each transaction) is flushed to disk, which can be quite expensive, especially if there is no battery backup cache. Many applications, especially those that have transited from MyISAM, simply set the value to 2, which means not flushing logs to disk, but only to the operating system cache. The logs will still be flushed to disk every second, so the cost of 1-2 updates per second is usually not lost. Setting it to zero is much faster, but it's also less secure -- you lose 1 transaction if the MySQL server crashes. Setting 2 will only lose the portion of the transaction flushed to the operating system cache.

innodb_buffer_pool_size

Innodb behaves like a snail with the default innodb_buffer_pool_size setting. Since Innodb caches data and indexes without leaving much memory for the operating system, you can set it to 70-80% of available memory if you only need Innodb.

Finally, after a long and arduous journey, the other two parameters were found

innodb_log_file_size

This is important in the case of high write loads, especially large data sets. The larger this value is, the better the performance will be, but be aware that it may increase the recovery time. I often set it to 64-512ES90en, depending on the server size.

innodb_log_buffer_size
The default Settings are fine for moderate write loads and short transactions. If there is a peak update operation or if the load is high, you should consider increasing its value. If its value is set too high, memory can be wasted -- it refreshes once per second, so there is no need to set the memory needed for more than a second. Usually 8-16ES95en is enough. The smaller the system, the smaller the value.


Finally, the derivative data of myisam and innodb is basically 1, the data of 2500 needs about 3.5h, and the data of single library reading needs 2h. This is just a figure value for reference only (the test on pc), and the test results on official service are more obvious

Performance tuning statement reference


set profiling = 1;
show profiles\G
SHOW profile CPU,BLOCK IO io FOR query 1;
show status
Show Processlist
explain

Would parallel reads be faster?

If based on partition derivative data, I still can't reach the established goal, so I finally write shell step and multiprocess parallel based on partition derivative data, that is, start multiple ES116en-ES117en-ES118en db < exp201201.sql, ES123en-ES124en-ES126en < exp201202.sql, reading and writing per sql per day (partition per month in event environment)


SELECT IR_SID,IR_HKEY,IR_GROUPNAME,IR_SITENAME,IR_CHANNEL,IR_MID,IR_URLNAME,IR_STATUS_CONTENT,IR_CREATED_AT,date_format(IR_CREATED_AT,'%Y.%m.%d'),IR_LASTTIME,IR_VIA,IR_THUMBNAIL_PIC,IR_RTTCOUNT,IR_COMMTCOUNT,IR_UID,IR_SCREEN_NAME,IR_RETWEETED_UID,IR_RETWEETED_SCREEN_NAME,IR_RETWEETED_MID,IR_RETWEETED_URL,IR_STATUS_BODY 
INTO OUTFILE '/home/mysql/data/sinawb20120724/111101.txt'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' 
LINES TERMINATED BY '\n'
FROM TB_SINA_STATUS  
WHERE ir_created_at >='2011-11-01 00:00:00' and ir_created_at <'2011-11-01 23:59:59'

LOAD DATA  LOCAL INFILE '/home/mysql/data/sinawb20120724/111101.txt' 
IGNORE INTO TABLE `NEW_TB_SINA_STATUS` 
CHARACTER SET UTF8  
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' 
LINES TERMINATED BY '\n'


Related articles: