Handling the encoding of front-end code files with Node.js

  • 2020-12-21 17:57:30
  • OfStack

When writing front-end tools with NodeJS, most of the work operates on text files, so file encodings have to be handled. The common text encodings are UTF8 and GBK, and UTF8 files may also carry a BOM. When reading text files in different encodings, the file content must be decoded into a JS string before it can be processed properly.

Removing the BOM
The BOM marks a text file as using a Unicode encoding. It is itself a Unicode character ("\uFEFF") placed at the head of the text file. Under different Unicode encodings, the BOM character corresponds to the following bytes:


  Bytes   Encoding
----------------------------
  FE FF    UTF16BE
  FF FE    UTF16LE
  EF BB BF  UTF8

Therefore, we can determine whether a file contains a BOM, and which Unicode encoding it uses, by checking the first few bytes of the text file. However, although the BOM character marks a file's encoding, it is not itself part of the file's content, and leaving it in place when reading a text file causes problems in some scenarios. For example, if we combine several JS files into one and a BOM character ends up in the middle of the combined file, the browser reports a JS syntax error. Therefore, the BOM generally needs to be removed when reading text files with NodeJS. For example, the following code recognizes and removes the UTF8 BOM.


var fs = require('fs');

function readText(pathname) {
  var bin = fs.readFileSync(pathname);

  // Drop the three-byte UTF8 BOM (EF BB BF) if the file starts with it.
  if (bin[0] === 0xEF && bin[1] === 0xBB && bin[2] === 0xBF) {
    bin = bin.slice(3);
  }

  return bin.toString('utf-8');
}
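The byte table above also covers the two UTF16 byte orders, so the same check can be extended to report which encoding a BOM indicates. The following is a minimal sketch; the names detectBom and stripBom and the shape of the returned object are hypothetical, chosen only for illustration, and only the three rows of the table are checked.

```javascript
// Hypothetical helper: returns the encoding indicated by a leading BOM and
// the BOM's length in bytes, or null when no BOM from the table is present.
function detectBom(bin) {
  if (bin.length >= 3 && bin[0] === 0xEF && bin[1] === 0xBB && bin[2] === 0xBF) {
    return { encoding: 'utf8', length: 3 };
  }
  if (bin.length >= 2 && bin[0] === 0xFE && bin[1] === 0xFF) {
    return { encoding: 'utf16be', length: 2 };
  }
  if (bin.length >= 2 && bin[0] === 0xFF && bin[1] === 0xFE) {
    return { encoding: 'utf16le', length: 2 };
  }
  return null;
}

// Hypothetical helper: returns the bytes without the leading BOM, if any.
function stripBom(bin) {
  var bom = detectBom(bin);
  return bom ? bin.subarray(bom.length) : bin;
}
```

A caller could pass the result of fs.readFileSync(pathname) to detectBom and fall back to a default encoding when it returns null.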

Converting GBK to UTF8
NodeJS supports specifying the text encoding when reading a text file or when converting a Buffer to a string, but unfortunately GBK is not among the encodings NodeJS itself supports there. Therefore, we usually use iconv-lite, a third-party package, to convert the encoding. After downloading the package with NPM, we can write a function that reads a GBK text file as follows.


var fs = require('fs');
var iconv = require('iconv-lite');

function readGBKText(pathname) {
  var bin = fs.readFileSync(pathname);

  // Decode the raw GBK bytes into a JS string.
  return iconv.decode(bin, 'gbk');
}
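For decoding only, a possible alternative to iconv-lite is the global TextDecoder class, which is a different mechanism from the Buffer encodings discussed above. The sketch below rests on one assumption: the running Node.js build includes full ICU data, because without it the TextDecoder constructor throws for the 'gbk' label. The function name decodeGBK is hypothetical.

```javascript
// Assumption: this Node.js build includes full ICU data; otherwise
// new TextDecoder('gbk') throws a RangeError.
function decodeGBK(bin) {
  return new TextDecoder('gbk').decode(bin);
}

// These four bytes are the GBK encoding of the two characters '中文'.
var sample = Buffer.from([0xD6, 0xD0, 0xCE, 0xC4]);
console.log(decodeGBK(sample));
```

TextDecoder has no encoding counterpart for GBK, so writing GBK files still calls for a package such as iconv-lite.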

Single-byte encoding
Sometimes we cannot predict which encoding the file we want to read uses, so we cannot specify the correct encoding. For example, some of the CSS files we work with may be encoded in GBK and others in UTF8. Although the text encoding can be guessed from a file's byte content to a certain extent, what we introduce here is a somewhat limited but much simpler technique.

First of all, we know that if a text file contains only English characters, such as Hello World, it can be read correctly in either GBK or UTF8 encoding. This is because under both encodings, characters in the ASCII range 0 to 127 use the same single-byte encoding.
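This can be checked directly on a Buffer. The sketch below decodes the same ASCII-only bytes with two of NodeJS's built-in encodings and compares the results; the helper name isAsciiOnly is hypothetical.

```javascript
// Hypothetical helper: true when every byte is below 0x80, i.e. plain ASCII.
function isAsciiOnly(bin) {
  return bin.every(function (byte) {
    return byte < 0x80;
  });
}

var bytes = Buffer.from('Hello World', 'ascii');

// Both decodings map each ASCII byte to the same character.
var asUtf8 = bytes.toString('utf8');
var asLatin1 = bytes.toString('latin1');

console.log(isAsciiOnly(bytes), asUtf8 === asLatin1);
```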

On the other hand, even if a text file contains Chinese or other non-ASCII characters, as long as the characters we need to process all fall within the ASCII range 0 to 127, such as JS code outside comments and strings, we can read the file with a single-byte encoding without caring whether its actual encoding is GBK or UTF8. The following example illustrates this approach.

1. Contents of the GBK-encoded source file:


  var foo = '中文';

2. Corresponding bytes:


  76 61 72 20 66 6F 6F 20 3D 20 27 D6 D0 CE C4 27 3B

3. Contents read with a single-byte encoding:


  var foo = 'ÖÐÎÄ';

4. Contents after replacement:


  var bar = 'ÖÐÎÄ';

5. Bytes saved with the same single-byte encoding:


  76 61 72 20 62 61 72 20 3D 20 27 D6 D0 CE C4 27 3B

6. Contents read with the GBK encoding:


  var bar = '中文';

The trick here is that no matter what garbled characters a single byte greater than 0x7F is parsed into under the single-byte encoding, the bytes behind those garbled characters stay the same when the text is saved with the same single-byte encoding.
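The six steps above can be reproduced in memory without touching a file. This is a minimal sketch that starts from the bytes listed in step 2 and uses 'latin1', NodeJS's built-in encoding that maps each byte to one character; the variable names are illustrative.

```javascript
// Step 2: the bytes of the GBK-encoded source line.
var source = Buffer.from('76617220666F6F203D2027D6D0CEC4273B', 'hex');

// Step 3: decode one character per byte.
var text = source.toString('latin1');

// Step 4: the replacement touches only ASCII characters.
var replaced = text.replace('foo', 'bar');

// Step 5: encode one byte per character again.
var result = Buffer.from(replaced, 'latin1');

console.log(result.toString('hex'));
// prints 76617220626172203d2027d6d0cec4273b
```

The four bytes D6 D0 CE C4 come out unchanged, so reading the result as GBK still yields the original Chinese characters.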

NodeJS ships with a binary encoding that implements this method (in current versions it is an alias of latin1), so in the following example we use it to show how the code for the example above can be written.


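var fs = require('fs');

function replace(pathname) {
  // 'binary' maps every byte to exactly one character and back.
  var str = fs.readFileSync(pathname, 'binary');
  // With a string pattern, only the first occurrence is replaced.
  str = str.replace('foo', 'bar');
  fs.writeFileSync(pathname, str, 'binary');
}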
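When the non-ASCII text itself has to be processed, the single-byte trick is not enough and the encoding has to be guessed from the bytes, as mentioned at the start of this section. One rough heuristic, sketched below, is to test whether the bytes are valid UTF8 and to treat the file as GBK otherwise. The function name looksLikeUTF8 is hypothetical, and the result is only a guess: pure ASCII input passes the test, and some GBK byte sequences happen to be valid UTF8 as well.

```javascript
// Hypothetical helper: true when the bytes survive a UTF8 decode/encode
// round trip. Buffer#toString replaces each invalid sequence with U+FFFD,
// so invalid UTF8 input no longer matches its own bytes afterwards.
function looksLikeUTF8(bin) {
  return Buffer.from(bin.toString('utf8'), 'utf8').equals(bin);
}

// The GBK bytes from the example above are not valid UTF8.
console.log(looksLikeUTF8(Buffer.from([0xD6, 0xD0, 0xCE, 0xC4])));
```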

