The implementation of Huffman encoding algorithm is explained in detail

  • 2020-05-24 05:35:20
  • OfStack

Introduction to Huffman coding

Huffman encodings deal with the encoding pairing of characters and their corresponding binary characters, which are divided into encoding and decoding, in order to compress the length of binary data corresponding to characters. We know that characters are stored and transmitted in base 2 (computers only know 0/1), so there is an mapping relationship between characters and base 2. Characters belong to the character set (Charset), which requires encoding (encode) for binary storage and transmission, and decoding (decode) for echo characters when displayed. The character set and encoding method are 1-to-many (Unicode can be encoded with UTF-8, UTF-16, etc.). Understand the character set, encoding and decoding, flying around the problem of random code will be easy to solve. Take the English letter a as an example. In ASCII code, the base 10 is 97 and the base 2 is 01100001. Each character of ASCII is encoded with 8 Bit(1Byte). If there are 1000 characters to be transmitted, then 8000 Bit must be transmitted. The problem is that the letter e is used 12.702% of the time, while the letter z is used 0.074%. The former is more than 100 times the latter, but it does use base 2 of the same number of digits. You can do it better by using variable length coding, and the guiding principle is to use shorter bits for high frequencies and longer bits for low frequencies. The Huffman encoding algorithm deals with such problems.

Huffman encodes the Java implementation

The main data structures used by the Huffman encoding algorithm are the complete two-fork tree (full binary tree) and the priority queue. The latter USES Java.util.PriorityQueue, which is implemented by itself (all inner classes) with the following code:


static class Tree { 
    private Node root; 
 
    public Node getRoot() { 
      return root; 
    } 
 
    public void setRoot(Node root) { 
      this.root = root; 
    } 
  } 
 
  static class Node implements Comparable<Node> { 
    private String chars = ""; 
    private int frequence = 0; 
    private Node parent; 
    private Node leftNode; 
    private Node rightNode; 
 
    @Override 
    public int compareTo(Node n) { 
      return frequence - n.frequence; 
    } 
 
    public boolean isLeaf() { 
      return chars.length() == 1; 
    } 
 
    public boolean isRoot() { 
      return parent == null; 
    } 
 
    public boolean isLeftChild() { 
      return parent != null && this == parent.leftNode; 
    } 
 
    public int getFrequence() { 
      return frequence; 
    } 
 
    public void setFrequence(int frequence) { 
      this.frequence = frequence; 
    } 
 
    public String getChars() { 
      return chars; 
    } 
 
    public void setChars(String chars) { 
      this.chars = chars; 
    } 
 
    public Node getParent() { 
      return parent; 
    } 
 
    public void setParent(Node parent) { 
      this.parent = parent; 
    } 
 
    public Node getLeftNode() { 
      return leftNode; 
    } 
 
    public void setLeftNode(Node leftNode) { 
      this.leftNode = leftNode; 
    } 
 
    public Node getRightNode() { 
      return rightNode; 
    } 
 
    public void setRightNode(Node rightNode) { 
      this.rightNode = rightNode; 
    } 
  } 

statistics

Since the coding table is to be arranged by frequency, the frequency statistics must be obtained first. I implemented a method to deal with this problem. If you already have statistics, switch to Map < Character,Integer > Can. If you get the information as a percentage, multiply it by 100 or 1,000, or 10,000. It always turns into an integer. For example, 12.702% times 1000 is 12702. Huffman only CARES about size. The statistical method is implemented as follows:


public static Map<Character, Integer> statistics(char[] charArray) { 
    Map<Character, Integer> map = new HashMap<Character, Integer>(); 
    for (char c : charArray) { 
      Character character = new Character(c); 
      if (map.containsKey(character)) { 
        map.put(character, map.get(character) + 1); 
      } else { 
        map.put(character, 1); 
      } 
    } 
 
    return map; 
  } 

Build a tree

Building a tree is the core step of Huffman encoding algorithm. The idea is to hook all the characters to a leaf node of a complete two-fork tree, and the frequency of the left node of any non-page child node should not be greater than the frequency of the right node. Algorithm for the statistics to Node deposit to a priority queue, each one from the queue it pop up two minimum frequency of nodes, build a new father Node (not a leaf node), character content just pop out of the sum of two nodes character content, frequency and them and, most began to play as a left child node, behind one as the right child nodes, and to just build the parent node into the queue. Repeat the above action N-1 time. N is the number of different characters (subtract 1 from the number in each queue). At the end of the above steps, there is only one node left in the queue, which pops up as the root node of the tree. The code is as follows:


private static Tree buildTree(Map<Character, Integer> statistics, 
      List<Node> leafs) { 
    Character[] keys = statistics.keySet().toArray(new Character[0]); 
 
    PriorityQueue<Node> priorityQueue = new PriorityQueue<Node>(); 
    for (Character character : keys) { 
      Node node = new Node(); 
      node.chars = character.toString(); 
      node.frequence = statistics.get(character); 
      priorityQueue.add(node); 
      leafs.add(node); 
    } 
 
    int size = priorityQueue.size(); 
    for (int i = 1; i <= size - 1; i++) { 
      Node node1 = priorityQueue.poll(); 
      Node node2 = priorityQueue.poll(); 
 
      Node sumNode = new Node(); 
      sumNode.chars = node1.chars + node2.chars; 
      sumNode.frequence = node1.frequence + node2.frequence; 
 
      sumNode.leftNode = node1; 
      sumNode.rightNode = node2; 
 
      node1.parent = sumNode; 
      node2.parent = sumNode; 
 
      priorityQueue.add(sumNode); 
    } 
 
    Tree tree = new Tree(); 
    tree.root = priorityQueue.poll(); 
    return tree; 
  } 

coding

If the character node is the left node of the parent node, add 0 before the encoding character; if the character node is the right node, add 1 until it reaches the root node. Once the mapping relationship between the characters and the base 2 code is obtained, the encoding is very simple. The code is as follows:


public static String encode(String originalStr, 
      Map<Character, Integer> statistics) { 
    if (originalStr == null || originalStr.equals("")) { 
      return ""; 
    } 
 
    char[] charArray = originalStr.toCharArray(); 
    List<Node> leafNodes = new ArrayList<Node>(); 
    buildTree(statistics, leafNodes); 
    Map<Character, String> encodInfo = buildEncodingInfo(leafNodes); 
 
    StringBuffer buffer = new StringBuffer(); 
    for (char c : charArray) { 
      Character character = new Character(c); 
      buffer.append(encodInfo.get(character)); 
    } 
 
    return buffer.toString(); 
  } 
private static Map<Character, String> buildEncodingInfo(List<Node> leafNodes) { 
    Map<Character, String> codewords = new HashMap<Character, String>(); 
    for (Node leafNode : leafNodes) { 
      Character character = new Character(leafNode.getChars().charAt(0)); 
      String codeword = ""; 
      Node currentNode = leafNode; 
 
      do { 
        if (currentNode.isLeftChild()) { 
          codeword = "0" + codeword; 
        } else { 
          codeword = "1" + codeword; 
        } 
 
        currentNode = currentNode.parent; 
      } while (currentNode.parent != null); 
 
      codewords.put(character, codeword); 
    } 
 
    return codewords; 
  } 

decoding

Because the Huffman encoding algorithm can guarantee that any binary code will not be the prefix of another code, the decoding is very simple, take out every bit of the binary code in sequence, search from the root, 1 to the right, 0 to the left, to the leaf node (hit), back to the root node and continue to repeat the above action. The code is as follows:


public static String decode(String binaryStr, 
      Map<Character, Integer> statistics) { 
    if (binaryStr == null || binaryStr.equals("")) { 
      return ""; 
    } 
 
    char[] binaryCharArray = binaryStr.toCharArray(); 
    LinkedList<Character> binaryList = new LinkedList<Character>(); 
    int size = binaryCharArray.length; 
    for (int i = 0; i < size; i++) { 
      binaryList.addLast(new Character(binaryCharArray[i])); 
    } 
 
    List<Node> leafNodes = new ArrayList<Node>(); 
    Tree tree = buildTree(statistics, leafNodes); 
 
    StringBuffer buffer = new StringBuffer(); 
 
    while (binaryList.size() > 0) { 
      Node node = tree.root; 
 
      do { 
        Character c = binaryList.removeFirst(); 
        if (c.charValue() == '0') { 
          node = node.leftNode; 
        } else { 
          node = node.rightNode; 
        } 
      } while (!node.isLeaf()); 
 
      buffer.append(node.chars); 
    } 
 
    return buffer.toString(); 
  } 

Testing and comparing

The following tests test the correctness of the Huffman encoding (encoding first, decoding later, including Chinese) and the Huffman encoding compared with the common character encoding of the base 2 string. The code is as follows:


public static void main(String[] args) { 
    String oriStr = "Huffman codes compress data very effectively: savings of 20% to 90% are typical, " 
        + "depending on the characteristics of the data being compressed.  The rise of "; 
    Map<Character, Integer> statistics = statistics(oriStr.toCharArray()); 
    String encodedBinariStr = encode(oriStr, statistics); 
    String decodedStr = decode(encodedBinariStr, statistics); 
 
    System.out.println("Original sstring: " + oriStr); 
    System.out.println("Huffman encoed binary string: " + encodedBinariStr); 
    System.out.println("decoded string from binariy string: " + decodedStr); 
 
    System.out.println("binary string of UTF-8: " 
        + getStringOfByte(oriStr, Charset.forName("UTF-8"))); 
    System.out.println("binary string of UTF-16: " 
        + getStringOfByte(oriStr, Charset.forName("UTF-16"))); 
    System.out.println("binary string of US-ASCII: " 
        + getStringOfByte(oriStr, Charset.forName("US-ASCII"))); 
    System.out.println("binary string of GB2312: " 
        + getStringOfByte(oriStr, Charset.forName("GB2312"))); 
  } 
 
  public static String getStringOfByte(String str, Charset charset) { 
    if (str == null || str.equals("")) { 
      return ""; 
    } 
 
    byte[] byteArray = str.getBytes(charset); 
    int size = byteArray.length; 
    StringBuffer buffer = new StringBuffer(); 
    for (int i = 0; i < size; i++) { 
      byte temp = byteArray[i]; 
      buffer.append(getStringOfByte(temp)); 
    } 
 
    return buffer.toString(); 
  } 
 
  public static String getStringOfByte(byte b) { 
    StringBuffer buffer = new StringBuffer(); 
    for (int i = 7; i >= 0; i--) { 
      byte temp = (byte) ((b >> i) & 0x1); 
      buffer.append(String.valueOf(temp)); 
    } 
 
    return buffer.toString(); 
  } 


Related articles: