Why is it said that a MySQL single table should not exceed 5 million rows?

  • 2021-12-05 07:37:23
  • OfStack

Today, let's discuss an interesting topic: at what data volume should a MySQL single table be split across multiple databases and tables? Some say 20 million rows, others say 5 million. So, what do you think is the appropriate value?

There is a saying widely circulated in China's Internet technology circles that once a MySQL single table exceeds 20 million rows, its performance declines noticeably. This claim is said to have originated at Baidu: when its DBAs benchmarked MySQL back then, they reportedly found that once a single table reached the order of 20 million rows, SQL performance dropped sharply, and that became the conclusion. The story goes that Baidu engineers later moved to other companies in the industry and took this figure with them, which is how the claim spread.

Later, Alibaba's Java Development Manual recommended splitting into multiple databases and tables only when a single table exceeds 5 million rows or 2 GB in size. Backed by the authority of Alibaba's coding rules, many people take this as the standard when designing storage for large data volumes and split their tables accordingly.

So, what do you think is the appropriate value? Why not 3 million rows, or 8 million, but exactly 5 million? Perhaps you will say this is simply Alibaba's best value from practical experience. Then the question becomes: how was this value arrived at? Please take a moment to think about it.

In fact, this value has little to do with the absolute number of records and much to do with MySQL's configuration and the machine's hardware. To improve performance, MySQL loads a table's indexes into memory. As long as the InnoDB buffer pool is large enough to hold them, queries are not a problem. But once a single table grows past a certain order of magnitude, its indexes no longer fit in memory, and subsequent SQL queries incur disk I/O, which degrades performance. This also depends on the specific table structure, but ultimately the bottleneck is memory; upgrading the hardware can bring an immediate performance improvement.
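If you want to make this comparison on your own instance, a rough check is to compare the configured buffer pool size with the data and index sizes MySQL reports. A minimal sketch follows; the schema name 'mydb' and table name 'orders' are placeholders for your own objects:

-- Current InnoDB buffer pool size in bytes (the default is often 128 MB)
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Approximate on-disk size of data and indexes for one table
SELECT TABLE_NAME,
       ROUND(DATA_LENGTH  / 1024 / 1024) AS data_mb,
       ROUND(INDEX_LENGTH / 1024 / 1024) AS index_mb
FROM   information_schema.TABLES
WHERE  TABLE_SCHEMA = 'mydb'
  AND  TABLE_NAME   = 'orders';

If the index size of your hot tables is well beyond what the buffer pool can hold, queries will start hitting disk regardless of whether the row count is 5 million or 20 million.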

My view on splitting databases and tables is that it should be driven by actual need and not over-designed. At the start of a project, do not adopt a sharded design; only when the business has grown and further optimization is no longer possible should you consider splitting databases and tables to improve system performance. On this point, Alibaba's Java Development Manual adds: if the data volume is not expected to reach this level within 3 years, please do not split databases and tables when creating them. So, back to the question at the beginning: what do you think is the appropriate value? My suggestion is to evaluate it against your own machine and workload; if you have no benchmark in mind, 5 million rows is a reasonable compromise to use as a uniform rule of thumb for now.

Let's also look at some notes on SQL writing, which should be helpful.

SQL should be written with optimization in mind:

  • Use LIMIT to bound the size of query results.
  • Avoid SELECT *; list only the fields you actually need.
  • Use JOIN instead of subqueries.
  • Split very large DELETE or INSERT statements.
  • Enable the slow query log to find slow SQL.
  • No operations on columns: in SELECT id FROM t WHERE age + 1 = 10, any operation on a column (functions, expression evaluation, and so on) forces a full table scan; when querying, move the operation to the right of the equals sign whenever possible (see the examples after this list).
  • Keep SQL statements as simple as possible: a single SQL statement can only run on one CPU; split large statements into smaller ones to reduce lock time, since one huge statement can block the whole database.
  • Rewrite OR as IN: OR is O(n) while IN is O(log n); keep the number of values in IN under about 200.
  • Implement logic in the application rather than in stored functions and triggers.
  • Avoid LIKE '%xxx' queries.
  • Use JOIN sparingly.
  • Compare values of the same type, e.g. compare '123' with '123' and 123 with 123.
  • Avoid != or <> in the WHERE clause, otherwise the engine abandons the index and scans the whole table.
  • For consecutive values, use BETWEEN instead of IN: SELECT id FROM t WHERE num BETWEEN 1 AND 5.
  • Do not fetch the whole table for list data; paginate with LIMIT and avoid excessively deep pages.
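As a quick illustration of a few of the rules above, here is a minimal sketch against a hypothetical table t(id, num, age); the table and column names are made up for the example:

-- No operations on indexed columns: move the arithmetic to the right side.
-- Worse: the expression on `age` prevents index use
SELECT id FROM t WHERE age + 1 = 10;
-- Better: the column stands alone, so an index on `age` can be used
SELECT id FROM t WHERE age = 10 - 1;

-- Rewrite OR as IN when matching several discrete values
SELECT id FROM t WHERE num = 1 OR num = 3 OR num = 5;   -- worse
SELECT id FROM t WHERE num IN (1, 3, 5);                -- better

-- Use BETWEEN instead of IN for consecutive values
SELECT id FROM t WHERE num BETWEEN 1 AND 5;

-- Paginate with LIMIT instead of pulling the whole table
SELECT id, num FROM t ORDER BY id LIMIT 0, 20;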
