Removing the BOM Header from Text Files in Node.js

2021-09-24 21:25:48
OfStack

BOM

The byte order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character indicates the byte order. It is also often used as a marker to indicate whether a file is encoded in UTF-8, UTF-16, or UTF-32.

Representation of byte order marks with different encodings:

Encoding	Representation (hexadecimal)	Representation (decimal)
UTF-8	EF BB BF	239 187 191
UTF-16 (big-endian)	FE FF	254 255
UTF-16 (little-endian)	FF FE	255 254
UTF-32 (big-endian)	00 00 FE FF	0 0 254 255
UTF-32 (little-endian)	FF FE 00 00	255 254 0 0

BOM addition

UTF-8 does not require a BOM, but we can manually add a BOM header to a UTF-8 encoded file:


const fs = require('fs');

// The leading '\ufeff' is written to disk as the UTF-8 BOM bytes EF BB BF.
fs.writeFile('./bom.js', '\ufeffThis is an example with accents :  é   è   à  ', 'utf8', function (err) {
  if (err) throw err;
});
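
To see how the single character U+FEFF maps to the byte sequences in the table above, the following minimal sketch encodes it in memory instead of writing a file. The variable names are illustrative only.

```javascript
// Encode a string that starts with U+FEFF and inspect the leading bytes.
const text = '\ufeffhello';

// UTF-8: the BOM becomes the three bytes EF BB BF.
const utf8Prefix = Buffer.from(text, 'utf8').subarray(0, 3).toString('hex');

// UTF-16 little-endian: the BOM becomes the two bytes FF FE.
const utf16lePrefix = Buffer.from(text, 'utf16le').subarray(0, 2).toString('hex');

console.log(utf8Prefix);    // efbbbf
console.log(utf16lePrefix); // fffe
```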

BOM Removal

For UTF-8, a BOM is not necessary: UTF-8 is a byte-oriented encoding with no byte-order ambiguity, so nothing needs to be marked. This means a UTF-8 file may or may not start with a BOM.

Because each encoding has a different BOM, the first few bytes of a file tell us whether it contains a BOM and which Unicode encoding it uses.
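
As a rough sketch of that check, the hypothetical helper below (`detectBOM` and the `BOMS` table are names invented for this example) compares the start of a Buffer against the byte sequences listed earlier. The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks because FF FE 00 00 also begins with FF FE. A UTF-16 little-endian file whose first character after the BOM is U+0000 would be reported as UTF-32, so treat the result as a hint rather than proof.

```javascript
// Hypothetical lookup table: longer BOMs first so UTF-32LE wins over UTF-16LE.
const BOMS = [
  { encoding: 'UTF-32BE', bytes: [0x00, 0x00, 0xFE, 0xFF] },
  { encoding: 'UTF-32LE', bytes: [0xFF, 0xFE, 0x00, 0x00] },
  { encoding: 'UTF-8', bytes: [0xEF, 0xBB, 0xBF] },
  { encoding: 'UTF-16BE', bytes: [0xFE, 0xFF] },
  { encoding: 'UTF-16LE', bytes: [0xFF, 0xFE] },
];

// Returns the encoding name implied by a leading BOM, or null if none matches.
function detectBOM(buf) {
  for (const { encoding, bytes } of BOMS) {
    if (buf.length >= bytes.length && bytes.every((b, i) => buf[i] === b)) {
      return encoding;
    }
  }
  return null;
}

console.log(detectBOM(Buffer.from('\ufeffabc', 'utf8'))); // UTF-8
console.log(detectBOM(Buffer.from('abc', 'utf8')));       // null
```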

Although the BOM character marks the file encoding, it is not part of the file content. If the BOM is not removed when a text file is read, problems arise in some scenarios. For example, after several JS files are merged into one, any BOM characters left in the middle of the merged file can cause JavaScript syntax errors in the browser. Therefore, when reading text files with Node.js, the BOM needs to be removed.


// For string content
function stripBOM(content) {
  // Check whether the first character is the BOM (U+FEFF)
  if (content.charCodeAt(0) === 0xFEFF) {
    content = content.slice(1);
  }
  return content;
}

// For a Buffer (handles the UTF-8 BOM only)
function stripBOMBuffer(buf) {
  // Check whether the first three bytes are EF BB BF
  if (buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) {
    // subarray returns a view that starts after the BOM, without copying
    buf = buf.subarray(3);
  }
  return buf;
}
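
The file-merging scenario described earlier can be sketched end to end as follows. This is only an illustration under stated assumptions: `readTextWithoutBOM` and `mergeFiles` are hypothetical helper names, the files are throwaway ones created in a temporary directory, and the inputs are assumed to be UTF-8. Note that `fs.readFileSync(path, 'utf8')` keeps the BOM as a leading U+FEFF character, which is why the explicit check is needed.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: read a UTF-8 file as text and drop a leading BOM.
function readTextWithoutBOM(filePath) {
  const text = fs.readFileSync(filePath, 'utf8'); // the BOM survives decoding
  return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Hypothetical helper: merge several files into one string, BOM-free.
function mergeFiles(filePaths) {
  return filePaths.map(readTextWithoutBOM).join('\n');
}

// Demo with two throwaway files that both start with a BOM.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'bom-demo-'));
const fileA = path.join(dir, 'a.js');
const fileB = path.join(dir, 'b.js');
fs.writeFileSync(fileA, '\ufeffvar a = 1;', 'utf8');
fs.writeFileSync(fileB, '\ufeffvar b = 2;', 'utf8');

// Naive concatenation leaves a BOM at the start and in the middle.
const naive = fs.readFileSync(fileA, 'utf8') + '\n' + fs.readFileSync(fileB, 'utf8');
const merged = mergeFiles([fileA, fileB]);
fs.rmSync(dir, { recursive: true, force: true }); // clean up the demo files

console.log(naive.includes('\ufeff'));  // true
console.log(merged.includes('\ufeff')); // false
```

Stripping per file before joining is the key design choice here: removing only the first character of the merged result would still leave the second file's BOM in the middle.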

