Traversing Directory Files and Reading Super-Large Files in PHP with Ultra-Low Memory

  • 2021-12-09 08:40:16
  • OfStack

This is not a tutorial but a note, so I will not discuss the principles and implementation systematically; I will only explain briefly and give examples.

Preface

The reason I am writing this note is that the tutorials and sample code on the Internet about traversing directory files and reading text files in PHP are extremely inefficient, and some of them even claim to be efficient, which is really irritating.

This note mainly solves several problems:

How can PHP traverse tens of thousands of directory files quickly while using ultra-low memory?

How can PHP quickly read files of hundreds of MB, or even several GB, while using ultra-low memory?

It also makes it easier for me to find my own notes later through a search engine. (The need to write these two things in PHP comes up rarely, and my memory is poor, so this saves me from forgetting and taking another detour.)

Traversing directory files

Most of the sample code on the Internet implements this with glob or with an opendir + readdir combination. That is fine when the directory does not contain many files, but it becomes a problem when there are many (this refers to the common pattern of wrapping everything in a function that returns a single array). An oversized array needs an enormous amount of memory, which not only slows things down but also crashes the script outright when memory runs out.

The correct way is to return results with the yield keyword. Here is the code I used recently:


<?php

function glob2foreach($path, $include_dirs=false) {
  $path = rtrim($path, '/*');
  if (is_readable($path)) {
    $dh = opendir($path);
    while (($file = readdir($dh)) !== false) {
      if (substr($file, 0, 1) == '.')
        continue;
      $rfile = "{$path}/{$file}";
      if (is_dir($rfile)) {
        $sub = glob2foreach($rfile, $include_dirs);
        while ($sub->valid()) {
          yield $sub->current();
          $sub->next();
        }
        if ($include_dirs)
          yield $rfile;
      } else {
        yield $rfile;
      }
    }
    closedir($dh);
  }
}

//  Use 
$glob = glob2foreach('/var/www');
while ($glob->valid()) {
  
  //  Current file 
  $filename = $glob->current();
  
  //  This is the full file name including the path 
  // echo $filename;

  //  Advance to the next item; this must not be omitted
  $glob->next();
}

yield returns a generator object (if you are not familiar with it, read up on PHP generators first) and does not build an array up front, so no matter how many files the directory contains there is no gigantic array. Memory consumption stays at a negligible level of a few tens of KB, and the time cost is essentially just that of the loop itself.
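
As an aside, on PHP 7 and later the manual loop over the sub-generator can be delegated with yield from. A minimal variant sketch (not the original code; the function name glob2foreach7 is made up for illustration):


<?php

//  A variant sketch (PHP 7+, not the original code): delegate to the
//  recursive generator with yield from instead of looping over it manually.
function glob2foreach7($path, $include_dirs=false) {
  $path = rtrim($path, '/*');
  if (is_readable($path)) {
    $dh = opendir($path);
    while (($file = readdir($dh)) !== false) {
      if (substr($file, 0, 1) == '.')
        continue;
      $rfile = "{$path}/{$file}";
      if (is_dir($rfile)) {
        //  yield from forwards every value produced by the sub-generator
        yield from glob2foreach7($rfile, $include_dirs);
        if ($include_dirs)
          yield $rfile;
      } else {
        yield $rfile;
      }
    }
    closedir($dh);
  }
}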

Read a text file

Reading text files is similar to traversing directory files: online tutorials basically either read the whole file into memory with file_get_contents, or read it with an fopen + feof + fgetc combination. That is fine for small files, but large files cause problems such as running out of memory; reading a file of several hundred MB with file_get_contents is practically suicide.

The correct way to handle this also involves the yield keyword: process the file line by line with yield, or read from a specified position with SplFileObject.

Read the entire file line by line:


<?php
function read_file($path) {
  if ($handle = fopen($path, 'r')) {
    //  fgets() returns false at EOF, so checking it avoids yielding a bogus empty last line
    while (($line = fgets($handle)) !== false) {
      yield trim($line);
    }
    fclose($handle);
  }
}
//  Use 
$glob = read_file('/var/www/hello.txt');
while ($glob->valid()) {
  
  //  Current line text 
  $line = $glob->current();
  
  //  Process data row by row 
  // $line

  //  Advance to the next item; this must not be omitted
  $glob->next();
}

How much memory reading a file line by line with yield uses depends on how much data each line holds. For a log file with only a few hundred bytes per line, the memory footprint stays at the KB level even if the file exceeds 100 MB.

But often we do not need to read the whole file at once. For example, when reading a 1 GB log file in pages, we may want lines 1 to 1000 on page 1 and lines 1000 to 2000 on page 2. The method above is not suitable here, because although it uses little memory, looping over tens of thousands of lines just to skip them still costs time.

In this case SplFileObject is used instead, since it can start reading from a specified line. The example below returns an array; whether you actually want to build an array depends on your own requirements, but I am too lazy to change it here.


<?php

function read_file2arr($path, $count, $offset=0) {

  $arr = array();
  if (! is_readable($path))
    return $arr;

  $fp = new SplFileObject($path, 'r');
  
  //  Seek to the line where reading should start
  if ($offset)
    $fp->seek($offset); 

  $i = 0;
  
  while (! $fp->eof()) {
    
    //  Increment first so that at most $count lines are read
    $i++;
    
    //  Read only $count lines
    if ($i > $count)
      break;
    
    $line = $fp->current();
    $line = trim($line);

    $arr[] = $line;

    //  Advance to the next line; this must not be omitted
    $fp->next();
  }
  
  return $arr;
}
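
For instance, to fetch the "page 2" described above (roughly lines 1000 to 2000 of a large log), the call might look like the sketch below; the file path is only illustrative:


<?php

//  Illustrative usage of read_file2arr(): read 1000 lines starting at line 1000
$lines = read_file2arr('/var/www/hello.log', 1000, 1000);
foreach ($lines as $line) {
  //  Process each line here
  // echo $line, PHP_EOL;
}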

Everything above assumes that the file is huge but each line holds only a small amount of data. Sometimes that is not the case, and a single line may contain hundreds of MB of data. How should that be handled?

That depends on the specific business, but SplFileObject can jump to a byte position with fseek (note that this positions by bytes, not by line number as seek does) and then read a specified number of bytes with fread.

In other words, fseek and fread make it possible to read an extremely long string piece by piece, again with very low memory usage; what to do with each piece depends on what the business requirements allow.
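
The article does not include code for this case, but a minimal sketch of the idea might look like the following; the function name read_file_chunks and its parameters are made up for illustration:


<?php

//  A minimal sketch (not from the original article): read a huge line in
//  fixed-size chunks using SplFileObject::fseek() and SplFileObject::fread().
function read_file_chunks($path, $start = 0, $chunk_size = 1048576) {
  if (! is_readable($path))
    return;

  $fp = new SplFileObject($path, 'r');

  //  Jump to the byte offset (not a line number)
  $fp->fseek($start);

  while (! $fp->eof()) {
    //  Read at most $chunk_size bytes and hand them to the caller
    $chunk = $fp->fread($chunk_size);
    if ($chunk === false || $chunk === '')
      break;
    yield $chunk;
  }
}

//  Use
// foreach (read_file_chunks('/var/www/huge.log') as $chunk) {
//   //  Process each chunk here, e.g. scan for a delimiter
// }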

Copy large files

One more aside: for copying files in PHP, the copy function is fine for small files, but for large files it is better to use streams. An example:


<?php

function copy_file($path, $to_file) {

  if (! is_readable($path))
    return false;

  if(! is_dir(dirname($to_file)))
    @mkdir(dirname($to_file).'/', 0747, TRUE);
  
  if (
    ($handle1 = fopen($path, 'r')) 
    && ($handle2 = fopen($to_file, 'w'))
  ) {

    //  Copy stream to stream instead of loading the whole file into memory
    stream_copy_to_stream($handle1, $handle2);

    fclose($handle1);
    fclose($handle2);

    return true;
  }

  return false;
}
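
A hypothetical call, with purely illustrative paths:


<?php

//  Illustrative usage: stream-copy a large file to a backup location
copy_file('/var/www/big-archive.zip', '/var/backup/big-archive.zip');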

Finally

I have only stated the conclusions without showing any test data, which may not be very convincing. If you are skeptical and want to verify this yourself, you can use memory_get_peak_usage and microtime to measure the memory consumption and running time of the code.
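
For example, a minimal measurement sketch (not from the article; the path is illustrative) could look like this:


<?php

//  Illustrative benchmark around glob2foreach(): count files while tracking
//  peak memory and elapsed time
$time_start = microtime(true);

$count = 0;
$glob = glob2foreach('/var/www');
while ($glob->valid()) {
  $count++;
  $glob->next();
}

echo 'files: ', $count, PHP_EOL;
echo 'peak memory: ', memory_get_peak_usage(), ' bytes', PHP_EOL;
echo 'elapsed: ', microtime(true) - $time_start, ' s', PHP_EOL;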

