Reading a Word file in Java with Apache POI: a simple example

  • 2020-04-01 03:56:11
  • OfStack

Apache POI is an open-source library from the Apache Software Foundation that provides APIs for Java programs to read and write files in Microsoft Office formats.

1. Jar packages required to read Word 2003 and Word 2007 files

Reading a Word 2003 file (.doc) is relatively simple: it only needs poi-3.5-beta6-20090622.jar and poi-scratchpad-3.5-beta6-20090622.jar. Reading a Word 2007 file (.docx) requires the following seven jars:
  1. openxml4j-bin-beta.jar
  2. poi-3.5-beta6-20090622.jar
  3. poi-ooxml-3.5-beta6-20090622.jar
  4. dom4j-1.6.1.jar
  5. geronimo-stax-api_1.0_spec-1.0.jar
  6. ooxml-schemas-1.0.jar
  7. xmlbeans-2.3.0.jar
Items 4-7 are the jars that poi-ooxml-3.5-beta6-20090622.jar depends on (they are available in the ooxml-lib directory of poi-bin-3.5-beta6-20090622.tar.gz).
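A missing jar usually shows up only at run time as a ClassNotFoundException or NoClassDefFoundError. As a rough sketch, a small check like the one below can report which of the libraries listed above are visible on the classpath before any document is opened. The class PoiClasspathCheck and its isPresent method are hypothetical helpers written for this illustration and are not part of POI. The probed class names are assumed to live in the corresponding jars.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of POI): reports which libraries are on the classpath.
class PoiClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    // The class is not initialized, so no static initializers run.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, PoiClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Label -> a representative class assumed to be packaged in that jar.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("poi-scratchpad (.doc)", "org.apache.poi.hwpf.extractor.WordExtractor");
        probes.put("poi-ooxml (.docx)", "org.apache.poi.xwpf.extractor.XWPFWordExtractor");
        probes.put("xmlbeans", "org.apache.xmlbeans.XmlObject");
        probes.put("dom4j", "org.dom4j.Document");

        for (Map.Entry<String, String> p : probes.entrySet()) {
            String status = isPresent(p.getValue()) ? "found" : "MISSING";
            System.out.println(p.getKey() + ": " + status);
        }
    }
}
```

Run it with the same classpath as the real program. Any line marked MISSING points to a jar that still has to be added.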

2. Line breaks

Hard line break: a line break stored in the file itself, inserted when the user presses Enter.

Soft line break: a line break that exists only on display. When the number of characters on a line exceeds a certain limit, the remainder automatically wraps onto the next line.

For a program, hard line breaks are recognizable and deterministic, whereas soft line breaks depend on font size and indentation.
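The difference can be illustrated with plain strings, without POI. In the sketch below, hardLines splits on the line-break characters actually present in the text, while softWrap only imitates what a viewer does when a line is too long. Both method names are made up for this example, and the fixed character width is an assumption: a real word processor wraps by rendered width, not by character count.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: contrasts hard line breaks (in the data) with soft wrapping (on display).
class LineBreakDemo {

    // Hard breaks: split on the newline characters stored in the text (\r\n, \r or \n).
    static String[] hardLines(String text) {
        return text.split("\\r\\n|\\r|\\n", -1);
    }

    // Soft breaks: cut one hard line into display rows of at most `width` characters.
    // Nothing is stored in the text; the rows change whenever the width changes.
    static List<String> softWrap(String line, int width) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < line.length(); i += width) {
            rows.add(line.substring(i, Math.min(line.length(), i + width)));
        }
        if (rows.isEmpty()) {
            rows.add(""); // an empty hard line is still shown as one empty row
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "first paragraph that is fairly long\nsecond";
        for (String hard : hardLines(text)) {
            System.out.println("hard line: " + hard);
            for (String row : softWrap(hard, 12)) {
                System.out.println("    shown as: " + row);
            }
        }
    }
}
```

Changing the width passed to softWrap changes the displayed rows but never the result of hardLines, which is why only hard line breaks can be relied on when processing extracted text.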

3. Notes for reading

Note that POI does not read image information from the Word file when extracting text. In addition, for Word 2007 (.docx) files that contain a table, all of the table data is placed at the end of the extracted string.

4. Code for reading the text content of a Word file


import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.POIXMLDocument;
import org.apache.poi.POIXMLTextExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

public class Test {
  public static void main(String[] args) {
    // Word 2003 (.doc): WordExtractor reads the binary format from a stream.
    // try-with-resources (Java 7+) closes the stream even if extraction fails.
    try (InputStream is = new FileInputStream(new File("2003.doc"))) {
      WordExtractor ex = new WordExtractor(is);
      String text2003 = ex.getText();
      System.out.println(text2003);
    } catch (Exception e) {
      e.printStackTrace();
    }

    // Word 2007 (.docx): open the OOXML package, then extract its text.
    try {
      OPCPackage opcPackage = POIXMLDocument.openPackage("2007.docx");
      POIXMLTextExtractor extractor = new XWPFWordExtractor(opcPackage);
      String text2007 = extractor.getText();
      System.out.println(text2007);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
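
The example above hard-codes one file per format. When the file name is only known at run time, one possible approach is to choose the extractor from the file extension first. The sketch below shows only that decision and uses no POI classes. WordFormat, Kind and detect are hypothetical names introduced for this illustration, and the extension is only a hint, since a renamed file will be misclassified.

```java
import java.util.Locale;

// Hypothetical helper (not part of POI): picks a Word format from the file extension.
class WordFormat {

    enum Kind { DOC, DOCX, UNKNOWN }

    // Case-insensitive check of the extension; ".docx" is tested first for clarity.
    static Kind detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".docx")) {
            return Kind.DOCX;   // would be read with XWPFWordExtractor
        }
        if (lower.endsWith(".doc")) {
            return Kind.DOC;    // would be read with WordExtractor
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        String[] names = { "2003.doc", "2007.docx", "notes.txt" };
        for (String name : names) {
            System.out.println(name + " -> " + detect(name));
        }
    }
}
```

A caller could switch on the returned Kind and run the matching half of the Test class above, reporting an error for UNKNOWN.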

