Node.js Text File BOM Header Removal Method
- 2021-09-24 21:25:48
- OfStack
BOM
The byte order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character indicates the byte order. It is also often used as a marker showing that a file is encoded in UTF-8, UTF-16, or UTF-32.
Byte order mark representations in different encodings:
Encoding | Representation (hexadecimal) | Representation (decimal) |
UTF-8 | EF BB BF | 239 187 191 |
UTF-16 (big-endian) | FE FF | 254 255 |
UTF-16 (little-endian) | FF FE | 255 254 |
UTF-32 (big-endian) | 00 00 FE FF | 0 0 254 255 |
UTF-32 (little-endian) | FF FE 00 00 | 255 254 0 0 |
BOM Addition
UTF-8 does not require a BOM, but we can manually add a BOM header to a UTF-8 encoded file:
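const fs = require('fs');
// With 'utf8' encoding, the leading U+FEFF is written as the bytes EF BB BF.
fs.writeFile('./bom.js', '\ufeffThis is an example with accents : é è à ', 'utf8', function (err) { if (err) throw err; });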
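To see what the BOM looks like at the byte level without touching the file system, the same string can be encoded in memory. This is only a minimal sketch; the variable names `text` and `bytes` are illustrative and do not come from the original article.

```javascript
// The BOM code point U+FEFF followed by one ASCII character.
const text = '\ufeffA';

// Encode the string as UTF-8 and inspect the raw bytes.
const bytes = Buffer.from(text, 'utf8');

console.log(bytes.length);          // 4: three BOM bytes plus one byte for 'A'
console.log(bytes.toString('hex')); // efbbbf41
```

The first three bytes match the UTF-8 row of the table above.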
BOM Removal
For UTF-8 the BOM is not necessary: UTF-8 has no byte-order ambiguity, so nothing needs to be marked. A UTF-8 file may therefore start with a BOM or not.
Because each encoding has a different BOM, the first few bytes of a file tell us whether it contains a BOM and, if so, which Unicode encoding it uses.
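The detection described here could be sketched roughly as follows. The names `BOM_SIGNATURES` and `detectBOM` are hypothetical helpers made up for illustration, not part of Node.js or of the original article. One caveat: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32 by this simple check.

```javascript
// Hypothetical table of BOM signatures. Longer signatures come first because
// the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16
// little-endian BOM (FF FE).
const BOM_SIGNATURES = [
  { encoding: 'UTF-32BE', bytes: [0x00, 0x00, 0xFE, 0xFF] },
  { encoding: 'UTF-32LE', bytes: [0xFF, 0xFE, 0x00, 0x00] },
  { encoding: 'UTF-8', bytes: [0xEF, 0xBB, 0xBF] },
  { encoding: 'UTF-16BE', bytes: [0xFE, 0xFF] },
  { encoding: 'UTF-16LE', bytes: [0xFF, 0xFE] },
];

// Returns { encoding, length } for the first matching BOM, or null if the
// buffer does not start with any known BOM.
function detectBOM(buf) {
  for (const { encoding, bytes } of BOM_SIGNATURES) {
    if (buf.length >= bytes.length && bytes.every((b, i) => buf[i] === b)) {
      return { encoding, length: bytes.length };
    }
  }
  return null;
}

console.log(detectBOM(Buffer.from([0xEF, 0xBB, 0xBF, 0x41]))); // { encoding: 'UTF-8', length: 3 }
console.log(detectBOM(Buffer.from([0xFF, 0xFE, 0x41, 0x00]))); // { encoding: 'UTF-16LE', length: 2 }
console.log(detectBOM(Buffer.from('plain text', 'utf8')));     // null
```

Returning the BOM length alongside the encoding lets a caller skip exactly that many bytes before decoding the rest of the buffer.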
Although the BOM marks the file encoding, it is not part of the file content. If it is not removed when a text file is read, it causes problems in some scenarios. For example, when several JS files are merged into one and a BOM ends up in the middle of the merged file, it can cause JavaScript syntax errors in the browser. Therefore, the BOM should be removed when reading text files with Node.js.
// For string content
function stripBOM(content) {
  // Check whether the first character is the BOM (U+FEFF)
  if (content.charCodeAt(0) === 0xFEFF) {
    content = content.slice(1);
  }
  return content;
}

// For a Buffer holding UTF-8 data
function stripBOMBuffer(buf) {
  // Check whether the first three bytes are the UTF-8 BOM (EF BB BF)
  if (buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) {
    // subarray returns a view without copying; Buffer#slice is deprecated
    buf = buf.subarray(3);
  }
  return buf;
}
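As a usage sketch, the merge scenario mentioned earlier might look like the following. The temporary directory and the file names `a.js` and `b.js` are made up for the demo, and `stripBOM` is repeated in compact form so that the snippet runs on its own.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Same check as stripBOM above, repeated so this sketch is self-contained.
function stripBOM(content) {
  return content.charCodeAt(0) === 0xFEFF ? content.slice(1) : content;
}

// Create two small files that both start with a BOM (hypothetical names).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'bom-demo-'));
const files = ['a.js', 'b.js'].map((name) => path.join(dir, name));
fs.writeFileSync(files[0], '\ufeffvar a = 1;', 'utf8');
fs.writeFileSync(files[1], '\ufeffvar b = 2;', 'utf8');

// Read each file as a string, strip the BOM, then join the contents.
const merged = files
  .map((file) => stripBOM(fs.readFileSync(file, 'utf8')))
  .join('\n');

console.log(merged.includes('\ufeff')); // false
console.log(JSON.stringify(merged));    // "var a = 1;\nvar b = 2;"

// Clean up the temporary directory.
fs.rmSync(dir, { recursive: true, force: true });
```

Without the `stripBOM` call, the second file's BOM would end up in the middle of the merged string, because reading a file with the 'utf8' encoding keeps a leading U+FEFF.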