Consistent Hash algorithm in Java: learning notes (with code example)

  • 2021-01-22 05:11:13
  • OfStack

This article studies the Consistent Hashing algorithm and its implementation code.

1 Consistent Hashing (Consistent Hash)

Protocol introduction

The consistent hashing algorithm was proposed by the Massachusetts Institute of Technology in 1997. It was designed to address hot spots on the Internet, with an original goal similar to that of CARP. Consistent hashing corrects the problems caused by the simple hashing algorithm used by CARP, making DHT truly useful in P2P environments.

The hash algorithm

Four properties that a hash algorithm should satisfy in a dynamically changing cache environment were proposed:

Balance (Balance)

Balance means that the results of the hash are distributed among all caches as evenly as possible, so that all cache space is utilized. Many hash algorithms satisfy this condition.

Monotonicity (Monotonicity)

Monotonicity means that if some content has already been hashed to a cache, and new caches are then added to the system, the hash result should ensure that the originally allocated content is mapped either to its original cache or to one of the new caches, and never to a different cache in the old cache set.

Simple hash algorithms often fail to satisfy monotonicity; consider the simplest linear hash:

x → (ax + b) mod P

In the formula above, P represents the total number of caches. It is easy to see that when the number of caches changes (from P1 to P2), almost all of the original hash results change, violating monotonicity.

A change in the hash results means that when the cache space changes, all mappings throughout the system must be updated. In a P2P system, a cache change corresponds to a peer joining or leaving the system; this happens frequently, and would therefore impose a heavy computation and transmission load. Monotonicity requires that the hash algorithm avoid this situation.
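To make the monotonicity failure concrete, the following sketch (hypothetical server counts of 4 and 5, with the key itself standing in for its hash) counts how many keys change servers when one server is added under simple modulo hashing:

```java
public class ModuloRemapDemo {
    public static void main(String[] args) {
        int keys = 10_000;
        int moved = 0;
        for (int k = 0; k < keys; k++) {
            int before = k % 4; // server index with 4 servers: hash mod 4
            int after  = k % 5; // server index after a fifth server joins
            if (before != after) {
                moved++;        // this key would have to be migrated
            }
        }
        // With hash mod N, growing from 4 to 5 servers remaps 80% of keys.
        System.out.println(moved + " of " + keys + " keys moved"); // 8000 of 10000 keys moved
    }
}
```

Adding capacity thus invalidates most of the existing cache, which is exactly the situation consistent hashing is designed to avoid.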

Dispersion (Spread)

In a distributed environment, an endpoint may not see all of the caches, but only a part of them. When an endpoint maps content to caches via hashing, different endpoints may see different cache ranges and thus produce different hash results, so the same content may be mapped to different caches by different endpoints. This should clearly be avoided, since it stores the same content in multiple caches and reduces the storage efficiency of the system. Dispersion is defined as the severity of this effect; a good hash algorithm should keep dispersion as low as possible.

Load (Load)

The load problem is the dispersion problem viewed from the other side. Since different endpoints may map the same content to different caches, different users may likewise map different content onto a particular cache. As with dispersion, this situation should be avoided, so a good hash algorithm should keep the load on each cache as low as possible.

On the surface, consistent hashing addresses a distributed caching problem, but if you think of a cache as a peer in a P2P system, and the mapped content as the various shared resources (data, files, media streams, etc.), you will find that both actually describe the same problem.

Routing algorithm

In the consistent hashing algorithm, each node (corresponding to a peer in a P2P system) is assigned a random ID. When content is mapped to a node, the content's keyword is hashed to obtain a key value. Consistent hashing requires that key values and node IDs fall within the same value range. The simplest key values and IDs are one-dimensional, for example the set of integers from 0000 to 9999.

When content is stored by key value, it is placed on the node whose ID is closest to the key value. For example, if the key value is 1001 and the system contains nodes with IDs 1000, 1010, and 1100, this content is mapped to node 1000.

To build the routes required for queries, consistent hashing requires each node to store the location information (such as the IP address) of its upstream node (the node with the smallest ID greater than its own) and its downstream node (the node with the largest ID smaller than its own). When a node needs to find content, it initiates a query based on the content's key value. If the node receiving the query finds that it holds the requested target, it returns a confirmation directly to the node that initiated the query; if the key falls outside its own range, it forwards the request to its upstream or downstream node.

To maintain this routing information, neighboring nodes must update it promptly when a node joins or leaves the system. This requires each node to store not only the location of its directly connected downstream node, but also information about indirect downstream nodes up to a certain depth (n hops), and to maintain this node list dynamically. When a node leaves the system, its upstream node tries to connect directly to the nearest remaining downstream node; after connecting successfully, it obtains the downstream node list from the new downstream node and updates its own list. Similarly, when a new node joins the system, it first finds its downstream node according to its own ID and obtains the downstream node list, then asks the upstream node to update its downstream list, thereby restoring the routing relationship.
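The join/leave bookkeeping described above can be sketched as splicing a circular doubly linked list of node IDs. The names below (Node, join, leave, succ, pred) are illustrative, not from the article, and the sketch uses the article's example IDs 1000, 1010, and 1100:

```java
public class RingDemo {
    static class Node {
        final int id;          // randomly assigned node ID
        Node succ, pred;       // upstream / downstream neighbors on the ring
        Node(int id) { this.id = id; succ = pred = this; }
    }

    // True if `id` belongs between cur and cur.succ; the second clause
    // handles the wrap-around between the largest and smallest IDs.
    static boolean fits(Node cur, int id) {
        if (cur.id < id && id < cur.succ.id) return true;
        return cur.id >= cur.succ.id && (id > cur.id || id < cur.succ.id);
    }

    // Splice a new node into the ring that `any` belongs to, walking
    // successors until the new ID fits between two neighboring IDs.
    static void join(Node any, Node fresh) {
        Node cur = any;
        while (!fits(cur, fresh.id)) {
            cur = cur.succ;
        }
        fresh.succ = cur.succ;
        fresh.pred = cur;
        cur.succ.pred = fresh;
        cur.succ = fresh;
    }

    // When a node leaves, its neighbors reconnect to each other,
    // restoring the routing relationship.
    static void leave(Node n) {
        n.pred.succ = n.succ;
        n.succ.pred = n.pred;
    }

    public static void main(String[] args) {
        Node ring = new Node(1000);
        join(ring, new Node(1100));
        join(ring, new Node(1010));
        // Walk the ring once from node 1000 and print IDs in ring order.
        StringBuilder order = new StringBuilder();
        Node cur = ring;
        do {
            order.append(cur.id).append(' ');
            cur = cur.succ;
        } while (cur != ring);
        System.out.println(order.toString().trim()); // 1000 1010 1100
    }
}
```

A real implementation would do the walk via network messages and also replicate deeper neighbor lists for fault tolerance, but the splice itself is this simple.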

Discussion

Consistent hashing largely solves the most critical problem in a P2P environment: how to distribute storage and routing over a dynamically changing network topology. Each node maintains only a small amount of information about its neighbors, and when a node joins or leaves the system, only a small number of related nodes participate in maintaining the topology. All of this made consistent hashing the first practical DHT algorithm.

However, the routing algorithm of consistent hashing has shortcomings. A query message reaches the queried node only after O(N) steps (where N is the total number of nodes in the system). When the system is very large, with perhaps more than a million nodes, such query efficiency is clearly inadequate. Even if users could tolerate long delays, the large number of messages generated during queries would place an unnecessary load on the network.

The source code:


package heritrix;
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;
// Contract assumed for the hash function used below (the interface is not
// part of the original listing): map an arbitrary key to a ring position.
interface HashFunction {
	int hash(Object key);
}
public class ConsistentHash<T> {
	// The hash function used to place nodes and keys on the ring
	private final HashFunction hashFunction;
	// Number of virtual nodes per physical node
	private final int numberOfReplicas;
	// The hash ring: a sorted map from ring position to node
	private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();
	public ConsistentHash(HashFunction hashFunction, int numberOfReplicas, Collection<T> nodes){
		this.hashFunction = hashFunction;
		this.numberOfReplicas = numberOfReplicas;
		for (T node : nodes){
			add(node);
		}
	}
	public void add(T node){
		// Place numberOfReplicas virtual nodes on the ring for this node
		for (int i = 0; i < numberOfReplicas; i++){
			circle.put(hashFunction.hash(node.toString() + i), node);
		}
	}
	public void remove(T node){
		for (int i = 0; i < numberOfReplicas; i++){
			circle.remove(hashFunction.hash(node.toString() + i));
		}
	}
	// The key lookup: find the first node clockwise from the key's hash
	public T get(Object key){
		if(circle.isEmpty()){
			return null;
		}
		// Compute the key's position on the ring
		int hash = hashFunction.hash(key);
		// If no node sits exactly at this position, take the next node
		// on the ring, wrapping around to the first node if necessary
		if(!circle.containsKey(hash)){
			SortedMap<Integer, T> tailMap = circle.tailMap(hash);
			hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
		}
		return circle.get(hash);
	}
}
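The listing above depends on a HashFunction type it does not show. As a self-contained sketch under that assumption, the demo below reproduces the same TreeMap ring logic with a one-method FNV-1a hash (an assumed choice; String.hashCode() spreads short, similar strings poorly) and checks the monotonicity property in practice: removing a node remaps only the keys that node held.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashDemo {
    static final int REPLICAS = 100;                    // virtual nodes per server
    static final TreeMap<Integer, String> ring = new TreeMap<>();

    // FNV-1a: a simple, well-spread 32-bit hash over the key's string form.
    static int hash(Object key) {
        int h = 0x811c9dc5;
        for (byte b : key.toString().getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    static void add(String node) {
        for (int i = 0; i < REPLICAS; i++) ring.put(hash(node + i), node);
    }

    static void remove(String node) {
        for (int i = 0; i < REPLICAS; i++) ring.remove(hash(node + i));
    }

    // Same lookup rule as the article's get(): first node clockwise
    // from the key's hash, wrapping to the first entry if necessary.
    static String get(Object key) {
        if (ring.isEmpty()) return null;
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return e != null ? e.getValue() : ring.firstEntry().getValue();
    }

    public static void main(String[] args) {
        for (String n : Arrays.asList("cache-a", "cache-b", "cache-c")) add(n);

        // Record where 1000 keys land, remove one node, and verify that
        // only the removed node's keys move to a new home.
        Map<String, String> before = new HashMap<>();
        for (int i = 0; i < 1000; i++) before.put("key" + i, get("key" + i));
        remove("cache-b");
        int moved = 0;
        for (int i = 0; i < 1000; i++) {
            if (!get("key" + i).equals(before.get("key" + i))) moved++;
        }
        System.out.println(moved + " of 1000 keys moved (all formerly on cache-b)");
    }
}
```

Roughly a third of the keys move (cache-b's share), while every key that was on cache-a or cache-c stays where it was; with a modulo hash, most keys would have moved.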

Conclusion

That is all the content of these learning notes on the Consistent Hash algorithm in Java; I hope it helps you. Interested readers can refer to other related topics on this site, and if anything is lacking, please leave a message to point it out. Thank you for your support!
