How to use Java for file cutting and simple content filtering

  • 2020-06-01 09:54:51
  • OfStack

1. Origin

A project last year required wrapping an arbitrary file into an XML file, keeping the file's own content unchanged, so that the XML could later be restored to the original file. Small files are easy to handle, but large XML files of several GB basically cause an OOM when parsed whole, and it is hard to extract the data directly from the node. So I took a streaming approach, and that is how this file clipping tool came about.

2. Usage scenarios

Possible usage scenarios of this tool:

1. Cut any file. The section between the start and end positions is clipped out via a byte stream.

2. Prepend/append a specified string to the header/footer of any file. This makes it easy to embed a file inside a single node.

3. Simple text extraction. You can pull out the text you want according to your own rules, and reprocess the extracted text (plain rule-based extraction, of course, not some intelligent, complicated process).

4. Text filtering. Filter out specified text according to your own rules.

The whole tool is a simple use of the plain file I/O API; NIO is not used. Whether to use it in scenarios that demand high-performance file processing is open to consideration. The purpose of this article is just to present one solution of my own; if there is a better one, you are welcome to offer suggestions.

3. How to use

Without further ado, let's see how to use it!

1. Read the specified section of the file

Reads the contents between bytes 0 and 1048.


public byte[] readAsBytes(){
    FileExtractor cuter = new FileExtractor();
    // Read bytes 0 through 1048 of the source file into memory
    return cuter.from("D:\\11.txt").start(0).end(1048).readAsBytes();
  }

2. File cutting

Cut the section between byte 0 and byte 1048 into a new file.


public File splitAsFile(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\11.txt").to("D:\\22.txt").start(0).end(1048).extractAsFile();
  }

3. Splicing the file into one xml node

Write the entire contents of the file into an XML file as the Body node, and return the newly generated XML file object.


  public File appendText(){

    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\11.txt").to("D:\\44.xml").appendAsFile("<Document><Body>", "</Body></Document>");

  }

4. Read and process the specified contents in the file

Suppose the requirement is: read the first 3 lines of 11.txt, remove the word "handsome" from lines 1 and 2, and append the string "I am so handsome!" to the end of line 3.


public String extractText(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\11.txt").extractAsString(new EasyProcesser() {
      @Override
      public String finalStep(String line, int lineNumber, Status status) {

        if(lineNumber==3){
          status.shouldContinue = false;// stop reading the rest of the file
          return line+" I am so handsome!";
        }
        return line.replaceAll("handsome","");
      }
    });

  }

5. Simple text filtering

Remove all occurrences of "bug" from a file and return the processed new file.


  public File killBugs(){
    FileExtractor cuter = new FileExtractor();
    return cuter.from("D:\\bugs.txt").to("D:\\nobug.txt").extractAsFile(new EasyProcesser() {
      @Override
      public String finalStep(String line, int lineNumber, Status status) {
        return line.replaceAll("bug", "");
      }
    }); 
  }

4. Basic process

Interface callbacks are used to separate the file-reading process from the content-processing process. The IteratorFile class is responsible for traversing a file and reading its contents, either byte by byte or line by line. The following sections describe reading and processing separately.

5. File reading

Define the callback interface

Define an interface, Process, which exposes two file-content handling methods: one for reading by byte and one for reading by line.


public interface Process{

  /**
   * @param b  the data read this time
   * @param length  the number of valid bytes in this read
   * @param currentIndex  the current read position
   * @param available  the total length of the file
   * @return true to continue reading the file, false to stop reading
   * @time 2017-01-22 16:56:41
   */
  public boolean doWhat(byte[] b,int length,int currentIndex,int available);

  /**
   * 
   * @param line  the line read this time
   * @param currentIndex  the line number
   * @return true to continue reading the file, false to stop reading
   * @time 2017-01-22 16:59:03
   */
  public boolean doWhat(String line,int currentIndex);
}

Let IteratorFile itself implement this interface, but by default it returns true without doing any processing, as follows:


public class IteratorFile implements Process
{
......
/**
   *  Traverses the file contents by byte; override this method for custom handling
   */
  @Override
  public boolean doWhat(byte[] b, int length,int currentIndex,int available) {
    return true;
  }

  /**
   *  Traverses the file contents by line; override this method for custom handling
   */
  @Override
  public boolean doWhat(String line,int currentIndex) {
    return true;
  }
......
}

Traverse the file contents by byte

This implementation traverses (reads) the file by byte. The skip() method is used to control the byte position from which reading starts; then, while reading through the file stream, the data is passed to the callback interface. Note that each read stores its data in the same byte array, bytes, so the valid length of each read is also passed to the callback. As you can see, as soon as doWhat() returns false, reading of the file stops immediately.


public void iterator2Bytes(){
    init();
    int length = -1;
    FileInputStream fis = null;
    try {
      file = new File(in);
      fis = new FileInputStream(file);
      available = fis.available();
      fis.skip(getStart());          // jump to the start position
      readedIndex = getStart();
      if (!beforeItrator()) return;
      while ((length=fis.read(bytes))!=-1) {
        readedIndex+=length;         // total number of bytes read so far
        if(!doWhat(bytes, length,readedIndex,available)){
          break;                     // callback asked to stop reading
        }
      }
      if(!afterItrator()) return;
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }finally{
      try {
        if (fis != null) fis.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }

Traverse the file contents by line

This is ordinary file reading: inside the while loop, the callback method is invoked and the relevant data is passed to it.


  public void iterator2Line(){
    init();
    BufferedReader reader = null;
    FileReader read = null;
    String line = null;
    try {
      file = new File(in);
      read = new FileReader(file);
      reader = new BufferedReader(read);
      if (!beforeItrator()) return;
      while ( null != (line=reader.readLine())) {
        readedIndex++;               // current line number
        if(!doWhat(line,readedIndex)){
          break;                     // callback asked to stop reading
        }
      }
      if(!afterItrator()) return ;
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }finally{
      try {
        if (reader != null) reader.close();  // also closes the underlying FileReader
        else if (read != null) read.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
  }

Then, you also need to provide a method for setting the path of the source file to be read:


  public IteratorFile from(String in){
    this.in = in;
    return this;
  }

6. File content handling

FileExtractor introduction

The FileExtractor class is defined to encapsulate the processing of file contents. It holds a reference to the IteratorFile class that is used to traverse the file.

The basic methods of FileExtractor


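Its public API, inferred from the usage examples in section 3, looks roughly like the outline below (only the shape, not the actual class definition; the real signatures may differ slightly):


// Outline of FileExtractor's extraction methods, inferred from the usage examples above
interface FileExtractorApi {
  byte[] readAsBytes();                        // return the selected byte range as bytes
  File extractAsFile();                        // cut the selected range into the target file
  File appendAsFile(String head, String tail); // wrap the whole file with head/tail strings
  String extractAsString(EasyProcesser p);     // line-based extraction, result as a String
  File extractAsFile(EasyProcesser p);         // line-based filtering into the target file
}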

All of the methods that return a File produce a new file containing the processed content.

Other methods

Similarly, there are methods for setting the location of the source and target files and the byte positions used for clipping:


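These are presumably plain fluent setters in the same style as IteratorFile.from() shown above; a minimal sketch (the out field name is an assumption, the others match the extractAsFile() listing further below):


  public FileExtractor from(String in){
    this.in = in;             // source file path
    return this;
  }

  public FileExtractor to(String out){
    this.out = out;           // target file path
    return this;
  }

  public FileExtractor start(int startPos){
    this.startPos = startPos; // first byte to keep (0-based, inclusive)
    return this;
  }

  public FileExtractor end(int endPos){
    this.endPos = endPos;     // last byte to keep (0-based, inclusive)
    return this;
  }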

File content processing when a file is read in bytes

There are several key points:

1. Since the file is to be clipped by byte position, it has to be traversed by byte, so the byte-based doWhat() method is overridden. An OutputStream is constructed outside the callback to write the new file.

2. The data obtained by each read is stored in the same byte array b, but not all of the data in b is necessarily valid, so b's valid length, length, must also be passed in.

3. readedIndex records how many bytes have been read so far (including the current read).

4. When traversing the file yourself, how do you decide where to stop reading?

When readedIndex (the total number of bytes read) exceeds endPos (the 0-based index of the last byte to keep), the stopping position has been passed. At that point, part of the data in b was read in excess and should not be saved. The excess is readedIndex-endPos-1 bytes, so only the first length-(readedIndex-endPos-1) bytes of this read are written. For example, with endPos=1048 and a 1024-byte buffer, the second read brings readedIndex to 2048, the excess is 2048-1048-1=999 bytes, and only 1024-999=25 bytes of that read are kept, for a total of 1049 bytes (positions 0 through 1048).

Reads the file contents of the specified fragment:


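The byte[]-returning version presumably collects the selected range in memory, for example into a ByteArrayOutputStream; this is only a sketch, with the overread correction taken from the extractAsFile() listing that follows:


  public byte[] readAsBytes(){
    // Sketch: collect the selected range in memory instead of writing a file
    final ByteArrayOutputStream bos = new ByteArrayOutputStream();
    IteratorFile itFile = new IteratorFile(){
      @Override
      public boolean doWhat(byte[] b, int length,int readedIndex,int available) {
        if(readedIndex>endPos){
          // Past endPos: drop the readedIndex-endPos-1 bytes read in excess
          bos.write(b, 0, length-(readedIndex-endPos-1));
          return false;// stop reading
        }
        bos.write(b, 0, length);
        return true;
      }
    }.from(in).start(startPos);
    itFile.iterator2Bytes();
    return bos.toByteArray();
  }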

When the file is large, generating a new file is more reliable than returning a byte[] directly. The approach is otherwise similar: before reading starts, an OutputStream is set up, and the content is written to the new file as it is read in the loop.


  public File splitAsFile(){
    ......
    final OutputStream os = FileUtils.openOut(file);
    try {
      IteratorFile itFile = new IteratorFile(){
        @Override
        public boolean doWhat(byte[] b, int length,int readedIndex,int available) {
          try {
            if(readedIndex>endPos){
              // We have read past endPos; the last readedIndex-endPos-1 bytes are excess
              os.write(b, 0, length-(readedIndex-endPos-1));
              return false;// stop reading
            }else{
              os.write(b, 0, length);
            }
            return true;
          } catch (IOException e) {
            e.printStackTrace();
            return false;
          }
        }
      }.from(in).start(startPos);

      itFile.iterator2Bytes();

    } catch (Exception e) {
      e.printStackTrace();
      this.tempFile = null;
    }finally{
      try {
        os.flush();
        os.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
    return getTempFile();
  }

Processing of the contents of a file when it is read by line

First, a reminder: traversing a file by line is only suitable for text files, unless you really do not care whether every \r or \n is preserved exactly. Take an exe file: if you traverse it by line and then write a new file, a single wrong line break can turn the exe into an "un-exe" file!

In the process, I used:

A helper class, Status, to assist in controlling the flow of the traversal.

An interface, FlowLineProcesser, which works like a pipeline for text processing.

Status and FlowLineProcesser complement each other: FlowLineProcesser defines the concrete steps of the pipeline, while Status controls how each line is handled as it moves through the process.

I have wondered more than once whether the process needs to be this elaborate, but let's keep it for now...

First, the auxiliary class Status:


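Judging from the usage examples in section 3 (status.shouldContinue = false), it is roughly the following; the real class may carry more flags:


public class Status {
  // Set to false inside a processor step to stop reading the rest of the file
  public boolean shouldContinue = true;
}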

Then the FlowLineProcesser interface:

FlowLineProcesser is an interface that works like a pipeline. Its two methods, firstStep() and finalStep(), define a two-step operation, and both return a String. firstStep() receives line, the actual line read from the file; after doing its own processing, it returns the processed line, which is then handed to finalStep(). So the line seen by finalStep() is really the result of firstStep(), while the line that is finally handed back to the main flow is the return value of finalStep().


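Its shape follows from the description above and from the finalStep() signature used in the earlier examples (the exact firstStep() parameter list here is an assumption):


public interface FlowLineProcesser {

  // First stage: receives the raw line read from the file and returns the
  // (possibly modified) line that will be handed to finalStep()
  String firstStep(String line, int lineNumber, Status status);

  // Second stage: receives firstStep()'s result; its return value is the line
  // that the main flow actually keeps (null means "do not keep this line")
  String finalStep(String line, int lineNumber, Status status);
}

EasyProcesser, which the examples in section 3 extend, is then presumably just an adapter whose firstStep() passes the line through, so callers only need to override finalStep():


public abstract class EasyProcesser implements FlowLineProcesser {
  @Override
  public String firstStep(String line, int lineNumber, Status status) {
    return line;
  }
}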

Now you can see how the text extraction is actually implemented:

Every line read is run through the pipeline and the results are accumulated in a StringBuilder: firstStep() processes the line first, its return value is passed to finalStep(), and the result of that second step is what gets saved. If the final result is null, the line is not saved.


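Putting that together, extractAsString() presumably looks something like the sketch below (re-joining lines with "\n" is an assumption; the real implementation may differ in detail):


  public String extractAsString(final EasyProcesser processer){
    final StringBuilder sb = new StringBuilder();
    final Status status = new Status();
    IteratorFile itFile = new IteratorFile(){
      @Override
      public boolean doWhat(String line, int currentIndex) {
        // Run the two pipeline steps; firstStep()'s result feeds finalStep()
        String result = processer.finalStep(
            processer.firstStep(line, currentIndex, status), currentIndex, status);
        if(result != null){            // null results are not saved
          sb.append(result).append("\n");
        }
        return status.shouldContinue;  // false stops the traversal
      }
    }.from(in);
    itFile.iterator2Line();
    return sb.toString();
  }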

When the text to be extracted is too large, a new file can be generated instead. The processing is basically the same as the String-returning version.


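A sketch of the file-generating version, which writes each surviving line to the target file instead of a StringBuilder (the out field name, the BufferedWriter, and the returned File are assumptions):


  public File extractAsFile(final EasyProcesser processer){
    final Status status = new Status();
    BufferedWriter writer = null;
    try {
      writer = new BufferedWriter(new FileWriter(out));
      final BufferedWriter w = writer;
      IteratorFile itFile = new IteratorFile(){
        @Override
        public boolean doWhat(String line, int currentIndex) {
          try {
            String result = processer.finalStep(
                processer.firstStep(line, currentIndex, status), currentIndex, status);
            if(result != null){
              w.write(result);   // keep the processed line
              w.newLine();
            }
          } catch (IOException e) {
            e.printStackTrace();
            return false;
          }
          return status.shouldContinue;
        }
      }.from(in);
      itFile.iterator2Line();
    } catch (IOException e) {
      e.printStackTrace();
    } finally {
      try {
        if (writer != null) writer.close();
      } catch (IOException e) {
        e.printStackTrace();
      }
    }
    return new File(out);
  }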

Well, that's the end of the introduction. More next time.

The code package is available for download: Code package

