How does Java get the file encoding format

  • 2020-05-30 20:04:01
  • OfStack

1: A simple check for whether a file is UTF-8 or not. Anything that is not UTF-8 is generally assumed to be GBK, so GBK is used as the default.

When a file is stored with a given character set, the encoding information may be recorded in the first three bytes of the file (the byte order mark, or BOM). The basic approach, therefore, is to read the first three bytes and examine their values to determine the encoding. In practice, if the project runs on a Chinese operating system and the text files are generated within the project itself, the developer can restrict the text to the two common encodings, GBK and UTF-8. Since the default encoding on Chinese Windows is GBK, it is usually enough to detect whether a file is UTF-8; if not, it is treated as GBK.

For UTF-8 encoded text files with a BOM, the first three bytes (read as signed Java bytes) have the values -17, -69, and -65 (0xEF 0xBB 0xBF), so the code snippet to check for UTF-8 is as follows:


File file = new File(path); 
InputStream in = new java.io.FileInputStream(file); 
byte[] b = new byte[3]; 
in.read(b); 
in.close(); 
// 0xEF 0xBB 0xBF is the UTF-8 byte order mark; as signed Java bytes: -17, -69, -65
if (b[0] == -17 && b[1] == -69 && b[2] == -65) 
 System.out.println(file.getName() + ": encoded as UTF-8"); 
else 
 System.out.println(file.getName() + ": probably GBK, but could be something else"); 
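The snippet above can be wrapped into a small self-contained helper. The sketch below uses only the standard library; the class and method names are ours, and it also recognizes the UTF-16 BOMs and guards against files shorter than three bytes:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomSniffer {
    /**
     * Inspects the first bytes of a file for a byte order mark.
     * Returns "UTF-8", "UTF-16BE", "UTF-16LE", or null if no BOM was found.
     * A missing BOM does NOT prove the file is GBK; it is only the default guess.
     */
    public static String detectByBom(String path) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            byte[] b = new byte[3];
            int n = in.read(b); // may read fewer than 3 bytes for tiny files
            if (n >= 3 && b[0] == (byte) 0xEF && b[1] == (byte) 0xBB && b[2] == (byte) 0xBF)
                return "UTF-8";
            if (n >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF)
                return "UTF-16BE";
            if (n >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE)
                return "UTF-16LE";
            return null;
        }
    }
}
```

Note that a UTF-8 file saved without a BOM (common on Linux and in many editors) will return null here, which is exactly the limitation that motivates the statistical detection in the next section.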

2: For more sophisticated encoding detection, you can use the open-source project cpdetector, available at http://cpdetector.sourceforge.net/. Its class library is small, only about 500 KB. cpdetector works on statistical principles and is not guaranteed to be completely correct. Using this library to determine the encoding of a text file works as follows:

Reading an external file (first use cpdetector to detect the file's encoding, then read the file with the detected encoding):


/** 
 * Uses the third-party open-source package cpdetector to get the encoding of a file.
 * 
 * @param path 
 *    path of the source file whose encoding is to be determined
 * @author huanglei 
 * @version 2012-7-12 14:05 
 */ 
public static String getFileEncode(String path) { 
 /* 
  * detector is the detector proxy; it delegates detection to instances of concrete
  * detector implementation classes. cpdetector ships with several commonly used
  * detector implementations whose instances are registered via the add method,
  * e.g. ParsingDetector, JChardetFacade, ASCIIDetector, UnicodeDetector.
  * detector returns the detected character set encoding following the rule
  * "whichever detector first returns a non-null result wins".
  * Three third-party JARs are required: antlr.jar, chardet.jar and cpdetector.jar.
  * cpdetector is based on statistical principles and is not guaranteed to be
  * completely correct.
  */ 
 CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance(); 
 /* 
  * ParsingDetector can detect the encoding of HTML or XML files and character
  * streams; the boolean constructor parameter indicates whether to print details
  * of the detection process (false = don't print).
  */ 
 detector.add(new ParsingDetector(false)); 
 /* 
  * JChardetFacade wraps JChardet, provided by Mozilla; it can determine the
  * encoding of most files, so this detector alone meets the needs of most
  * projects. To be safe you can add a few more detectors, such as the
  * ASCIIDetector and UnicodeDetector below.
  */ 
 detector.add(JChardetFacade.getInstance()); // requires antlr.jar and chardet.jar
 // ASCIIDetector is used to detect ASCII encoding
 detector.add(ASCIIDetector.getInstance()); 
 // UnicodeDetector is used to detect Unicode-family encodings
 detector.add(UnicodeDetector.getInstance()); 
 java.nio.charset.Charset charset = null; 
 File f = new File(path); 
 try { 
  charset = detector.detectCodepage(f.toURI().toURL()); 
 } catch (Exception ex) { 
  ex.printStackTrace(); 
 } 
 if (charset != null) 
  return charset.name(); 
 else 
  return null; 
} 


String charsetName = getFileEncode(configFilePath); 
System.out.println(charsetName); 
inputStream = new FileInputStream(configFile); 
BufferedReader in = new BufferedReader(new InputStreamReader(inputStream, charsetName));

Reading a resource file inside a jar (first use cpdetector to detect the encoding of the resource inside the jar, then read the file with the detected encoding):


/** 
 * Uses the third-party open-source package cpdetector to get the encoding of the file at a URL.
 * 
 * @param url 
 *    URL of the source file whose encoding is to be determined
 * @author huanglei 
 * @version 2012-7-12 14:05 
 */ 
public static String getFileEncode(URL url) { 
 /* 
  * detector is the detector proxy; it delegates detection to instances of concrete
  * detector implementation classes. cpdetector ships with several commonly used
  * detector implementations whose instances are registered via the add method,
  * e.g. ParsingDetector, JChardetFacade, ASCIIDetector, UnicodeDetector.
  * detector returns the detected character set encoding following the rule
  * "whichever detector first returns a non-null result wins".
  * Three third-party JARs are required: antlr.jar, chardet.jar and cpdetector.jar.
  * cpdetector is based on statistical principles and is not guaranteed to be
  * completely correct.
  */ 
 CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance(); 
 /* 
  * ParsingDetector can detect the encoding of HTML or XML files and character
  * streams; the boolean constructor parameter indicates whether to print details
  * of the detection process (false = don't print).
  */ 
 detector.add(new ParsingDetector(false)); 
 /* 
  * JChardetFacade wraps JChardet, provided by Mozilla; it can determine the
  * encoding of most files, so this detector alone meets the needs of most
  * projects. To be safe you can add a few more detectors, such as the
  * ASCIIDetector and UnicodeDetector below.
  */ 
 detector.add(JChardetFacade.getInstance()); // requires antlr.jar and chardet.jar
 // ASCIIDetector is used to detect ASCII encoding
 detector.add(ASCIIDetector.getInstance()); 
 // UnicodeDetector is used to detect Unicode-family encodings
 detector.add(UnicodeDetector.getInstance()); 
 java.nio.charset.Charset charset = null; 
 try { 
  charset = detector.detectCodepage(url); 
 } catch (Exception ex) { 
  ex.printStackTrace(); 
 } 
 if (charset != null) 
  return charset.name(); 
 else 
  return null; 
} 

URL url = CreateStationTreeModel.class.getResource("/resource/" + "config file name"); 
URLConnection urlConnection = url.openConnection(); 
inputStream = urlConnection.getInputStream(); 
String charsetName = getFileEncode(url); 
System.out.println(charsetName); 
BufferedReader in = new BufferedReader(new InputStreamReader(inputStream, charsetName)); 

3: To detect the encoding of an arbitrary input text stream, call the overloaded form:


charset = detector.detectCodepage(in, length); // in: the text input stream to test; length: number of bytes to read from the stream

The number of bytes is specified by the programmer: the more bytes read, the more accurate the decision, and of course the longer it takes. Note that the specified number of bytes must not exceed the length of the text stream.
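If you cannot add the cpdetector jars, a rough stand-in using only the standard library is a strict trial decode: a CharsetDecoder configured to REPORT malformed input throws if the bytes are not valid in the candidate charset. This is our own heuristic sketch, not part of cpdetector, and it cannot distinguish charsets whose byte sequences overlap:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictDecodeCheck {
    /** Returns true if the given bytes decode without error in the given charset. */
    public static boolean decodesAs(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName)
                   .newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

UTF-8's structure (lead bytes followed by 10xxxxxx continuation bytes) makes GBK-encoded Chinese text very unlikely to pass a strict UTF-8 decode, which is why "valid UTF-8, else assume GBK" works tolerably well in practice.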

4: A concrete application of file-encoding detection:

Properties files (.properties) are a common way to store text in Java programs; the Struts framework, for example, uses properties files to store a program's string resources. Their content looks like this:

# comment line

propertyName=propertyValue

The usual way to read in a properties file is:


FileInputStream ios = new FileInputStream("properties file name"); 
Properties prop = new Properties(); 
prop.load(ios); 
String value = prop.getProperty("property name"); 
ios.close(); 

Although the load method of java.util.Properties makes it convenient to read in a properties file, if the file contains Chinese text you will find garbled characters after loading. This happens because load reads the text as a byte stream and then decodes those bytes into strings using ISO-8859-1 (Latin-1), a character set that does not support Chinese.

Method 1: transcode explicitly:


String value = prop.getProperty("property name"); 
String encValue = new String(value.getBytes("ISO-8859-1"), "actual encoding of the properties file"); 
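The transcoding trick in Method 1 can be verified end to end. The sketch below (our own illustration; the class name is hypothetical) simulates what Properties.load does, decoding UTF-8 bytes as ISO-8859-1, and then repairs the string:

```java
import java.io.UnsupportedEncodingException;

public class PropTranscode {
    /** Re-interprets a string that was wrongly decoded as ISO-8859-1. */
    public static String fix(String mojibake, String actualEncoding)
            throws UnsupportedEncodingException {
        // getBytes("ISO-8859-1") recovers the original raw bytes exactly,
        // because ISO-8859-1 maps every byte 0x00-0xFF to a distinct char.
        return new String(mojibake.getBytes("ISO-8859-1"), actualEncoding);
    }
}
```

This round trip is lossless precisely because ISO-8859-1 is a one-to-one byte-to-char mapping; the same trick would corrupt data if load had decoded with a multi-byte charset instead.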

Method 2: since the properties file is internal to the project, we can control its encoding. For example, if it is saved in Windows' default GBK, transcode directly with "gbk"; if it is saved as UTF-8, transcode directly with "UTF-8".

Method 3: if you want flexibility and automatic detection, you can use the methods described above to determine the encoding of the properties file, which is convenient for developers.

Addendum: the set of encodings supported by Java can be obtained with the following code:


Charset.availableCharsets().keySet();

The system default encoding can be obtained with the following code:


Charset.defaultCharset();
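A minimal sketch combining both calls (the class name is ours):

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class CharsetInfo {
    /** Charset.isSupported checks a name against the installed charsets. */
    public static boolean supports(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // availableCharsets() returns a sorted map keyed by canonical charset name.
        SortedMap<String, Charset> all = Charset.availableCharsets();
        System.out.println("Charsets available: " + all.size());
        System.out.println("Default charset: " + Charset.defaultCharset().name());
    }
}
```

Only a small core set (UTF-8, US-ASCII, ISO-8859-1, UTF-16 and its variants) is guaranteed on every Java platform; GBK is present on typical JREs but is not part of that guaranteed minimum.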
