Example code for reading a very large file with PHP

  • 2020-05-16 06:35:02
  • OfStack

At the end of last year, the account databases of various websites were leaked one after another; it was quite a spectacle. I took the opportunity to download a few of them, planning to play data analyst and dig into the account information. The data had already been "collated" elsewhere, but analyzing it myself is still good practice; after all, it is rare to get your hands on such a large volume of data.

The problem with such a large amount of data is that each file is huge, and even opening one is no small feat. Don't bother with Notepad; it simply crashes. The MSSQL client cannot open such a large SQL file either and immediately reports an out-of-memory error. Apparently, when MSSQL reads data it loads everything into memory at once, so if the data is too large to fit, it fails outright.

Navicat Premium
Here I recommend a piece of software: Navicat Premium. It is very powerful: SQL files hundreds of megabytes in size open easily, without the slightest lag. The client also supports connections to MSSQL, MySQL, Oracle, and other databases, and it has many other features worth exploring on your own.

Although Navicat can open the 274 MB CSDN SQL file, the raw content is not very useful on its own; it is inconvenient to query, classify, or compute statistics over the account information. The only real option is to read the records one by one, split each record into its constituent fields, and store those fields in a database, so the data can be used easily later.
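As a sketch of the splitting step, the snippet below breaks one leaked record into fields. The " # " separator and the username/password/email ordering are assumptions about the dump format, not something stated in this article; check a few real lines before importing.

```php
<?php
// Split one record into fields. The " # " separator and the
// username/password/email order are assumed; verify against the dump.
function splitRecord($line, $separator = " # ") {
    $fields = array_map('trim', explode($separator, trim($line)));
    // Expect exactly three fields; reject malformed lines.
    return count($fields) === 3 ? $fields : null;
}

$fields = splitRecord("someuser # 12345678 # someuser@example.com");
// $fields now holds the three parts, ready to be bound to the
// placeholders of a parameterized INSERT statement.
```

Returning null for malformed lines lets the import loop skip junk rows instead of aborting halfway through a multi-hundred-megabyte file.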

Read large files using PHP
PHP offers many ways to read a file, and choosing the method that fits the target file can noticeably improve efficiency. Because the CSDN database file is large, we want to avoid reading it all at once in a short period; after all, every read is followed by splitting and writing to the database. The more suitable approach is therefore to read the file region by region: by combining PHP's fseek and fread, you can read an arbitrary portion of the file. Here is the example code:

function readBigFile($filename, $count = 20, $tag = "\r\n") {
    $content = "";                   // the accumulated result
    $current = "";                   // the data read in the current step
    $step = 1;                       // how many bytes to read at a time
    $tagLen = strlen($tag);
    $start = 0;                      // current read position
    $i = 0;                          // line counter
    $handle = fopen($filename, 'r'); // open read-only; pointer at the start of the file
    while ($i < $count && !feof($handle)) {
        fseek($handle, $start, SEEK_SET);  // position the pointer at $start
        $current = fread($handle, $step);  // read $step bytes
        $content .= $current;              // append to the result
        $start += $step;                   // advance by the step size
        // Compare the last $tagLen characters of the result with the separator
        $substrTag = substr($content, -$tagLen);
        if ($substrTag == $tag) { // a full line (or other delimited record) has been read
            $i++;
            $content .= "<br />";
        }
    }
    // Close the file
    fclose($handle);
    // Return the result
    return $content;
}

$filename = "csdn.sql"; // the file to read
$tag = "\n";            // line separator; note it must be in double quotes
$count = 100;           // number of lines to read
$data = readBigFile($filename, $count, $tag);
echo $data;

The value to pass for the function's $tag parameter depends on the system that produced the file: "\r\n" for Windows, "\n" for Linux/Unix, and "\r" for classic Mac OS.
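If you are unsure which system a dump came from, you can guess the terminator by sampling the start of the file. This helper is our own addition, not part of the article's code:

```php
<?php
// Guess which line terminator a file uses by sampling its first few
// kilobytes, so the right $tag can be passed to readBigFile().
function detectLineEnding($filename, $sample = 4096) {
    $head = file_get_contents($filename, false, null, 0, $sample);
    if (strpos($head, "\r\n") !== false) return "\r\n"; // Windows
    if (strpos($head, "\n") !== false)   return "\n";   // Linux/Unix
    if (strpos($head, "\r") !== false)   return "\r";   // classic Mac OS
    return "\n"; // fall back to the Unix default
}
```

Note that "\r\n" must be tested first, since a Windows file also contains bare "\n" bytes as part of each "\r\n" pair.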

The general flow of the program: first define the basic variables needed to read the file, then open it, position the pointer at the specified location, and read a fixed amount of data. The data read in each step is appended to a variable, until the required number of lines is reached or the end of the file is hit.

Never assume that everything in your program will run as planned.

The code above returns the data at a specified position and of a specified size, but it runs only once, so it cannot fetch all of the data. One way to get everything is to wrap the call in an outer loop that runs until the end of the file is reached, but this wastes system resources and may even cause PHP to time out, because the file is too large to finish reading in one request. A better method is to record the pointer position when each read ends and, on the next iteration, seek back to that position, so that no single run has to read the file from beginning to end.
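A minimal sketch of this "remember the pointer" idea is shown below. The offset is carried between calls via a reference parameter; in a real import it could just as well be stored in a file or a session between requests. The function and variable names are illustrative, not from the article.

```php
<?php
// Read one chunk of the file, resuming from wherever the previous
// call stopped, and record the new position for the next call.
function readChunk($filename, &$offset, $size = 4096) {
    $handle = fopen($filename, 'r');
    fseek($handle, $offset, SEEK_SET); // resume where the last call stopped
    $chunk = fread($handle, $size);
    $offset = ftell($handle);          // remember where this call stopped
    fclose($handle);
    return $chunk;
}

// Demo on a small temporary file; for the real import, point $file
// at csdn.sql and replace the concatenation with record splitting
// and database inserts.
$file = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($file, str_repeat("user # pass # mail\n", 100));

$offset = 0;
$total = '';
while (($chunk = readChunk($file, $offset, 256)) !== '' && $chunk !== false) {
    $total .= $chunk; // in a real import: split into records, insert into DB
}
unlink($file);
```

Because each call only ever reads $size bytes, memory use stays flat no matter how large the file is, and an interrupted import can resume from the saved offset instead of starting over.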

As it happens, I still have not imported the CSDN database, because CNBETA published an analysis just a few days after the leak. That was fast. Seeing someone else do it saps your motivation, but it is still worth taking the time to do it yourself in order to learn.
