php kernel parsing: Hash tables in PHP

2020-12-20 03:30:49
OfStack

The most frequently used data types in PHP are strings and arrays, and PHP is easy to get used to because of its flexible array types. It is worth introducing 1 hash table (HashTable) before getting into the details of these data types. Hash tables are a key data structure in PHP implementation.

Hash tables are widely used in practice, such as a symbolic table that compilers typically maintain to hold tags, and are explicitly supported in many high-level languages. Hash tables usually provide lookup (Search), insert (Insert), delete (Delete), and so on, which in the worst case and linked list performance 1 is O(n). But it's usually not that bad, and a well-designed hash algorithm can effectively avoid such situations, and the time complexity of these operations on the hash table is usually O(1). That's why it's so beloved.
Because of the convenience and efficiency of the use of hash tables, most of the current implementation of dynamic languages use hash tables.

To make it easier for the reader to follow, the basic concepts that appear in the HashTable implementation below are listed in advance. A hash table is a data structure that maps a particular key to a particular value through a hash function, and it maintains an 11 correspondence between the key and the value.
Key (key) : An identifier used to manipulate data, such as an index in an PHP array, or a string key, etc.

Slot (slot/bucket) : The 1 unit in a hash table used to hold data, that is, the container where the data is actually stored.

Hash function (hash function) : A function that maps key (map) to the location of slot where the data should be stored.

Hash conflict (hash collision) : A case where the hash function maps two different key to the same index.

The expansion of the hash table can be interpreted as an array or an associative array, use digital array subscript to addressing, if small and the scope of the keyword (key) Numbers, we can use an array to complete the hash table directly, and if the keyword range is too big, if we need directly using the array for all possible key application space. In many cases, this is unrealistic. Even if there is enough space, space utilization will be low, which is not ideal. Also, keys may not be numbers, especially in PHP, so one mapping function (hash function) is used to map key to a specific domain:


h(key) -> index

With a properly designed hash function, we can map key to the appropriate range, because our key space can be large (for example, the string key), and when mapped to a smaller space, there may be two different key maps mapped to the same index, which is what we call a conflict. At present, there are two main methods to resolve hash conflict: linking method and open addressing method.

Conflict resolution

Chaining: The chaining method resolves conflicts by using a linked list to hold slot values, that is, a linked list to hold the values when different key maps to a slot. So using the chaining method is in the worst case where all key are mapped to the same slot and the time complexity of operating the linked list is O(n). So choosing the right hash function is critical. The current implementation of HashTable in PHP takes this approach to resolving conflicts.
Open addressing: There is usually another way to resolve conflicts: open addressing. Using open addressing method is slot itself to store data directly, if key when inserting data have been mapped to the index data, it shows that conflict happens, it's will be looking for a slot, if the tank has been occupied continued to search for a slot, until find the slot without being used, also use the same strategies when looking for law.

The realization of a hash table

It is also easy to implement a hash table once you understand how it works. There are only three things you need to do:
Implement the hash function
Conflict resolution
Implementation of the operation interface
First of all, we need a container to store our hash table. The content of the hash table is mainly the data stored in it. Meanwhile, in order to conveniently know the number of elements stored in the hash table, we need to save a size field, and the second one is the container to store the data. As an example, a simple hash table is implemented below. There are two basic data structures. One is used to store the hash table itself, and the other is a single linked list used to actually store the data. It is defined as follows:


typedef struct _Bucket
{
 char *key;
 void *value;
 struct _Bucket *next;

} Bucket;

typedef struct _HashTable
{
 int size;
 Bucket* buckets;
} HashTable;

The above definition is similar to the implementation in PHP, with most of the extraneous details cropped out to make it easier to understand. For simplicity in this section, the data type of key is a string, and the stored data type can be any type.
The Bucket structure is a single linked list, which is used to solve the problem of multiple key hash conflicts, also known as the linking method mentioned earlier. Links conflicting elements when multiple key are mapped to the same index.
The hash function needs to map different key to different slots (slot or bucket) as much as possible. First, we use one of the simplest hash algorithms: add up all characters of the key string, and then modulus the resulting hash table so that the index falls within the range of the array index.


static int hash_str(char *key)
{
 int hash = 0;

 char *cur = key;

 while(*(cur++) != '\0') {
 hash += *cur;
 }

 return hash;
}

//  Use this macro to figure out key An index in a hash table 
#define HASH_INDEX(ht, key) (hash_str((key)) % (ht)->size)

This hash algorithm is relatively simple, its effect is not good, in the actual scene will not use this hash algorithm, for example, PHP is used in DJBX33A algorithm, Mysql, OpenSSL and other open source software hash algorithm is listed here, interested readers can go to reference.
Implementation of the operation interface
In order to operate the hash table, the following operation functions are implemented:


int hash_init(HashTable *ht); //  Initialize the hash table 
int hash_lookup(HashTable *ht, char *key, void **result); //  According to the key To find the content 
int hash_insert(HashTable *ht, char *key, void *value); //  Inserts the contents into the hash table 
int hash_remove(HashTable *ht, char *key); //  delete key The content to point to 
int hash_destroy(HashTable *ht);

Take the insert and get operation functions as an example:


int hash_insert(HashTable *ht, char *key, void *value)
 {
 // check if we need to resize the hashtable
 resize_hash_table_if_needed(ht); //  The hash table is not fixed in size, when the inserted content quickly occupies the storage space of the table 
 //  We're going to expand the hash table,   In order to hold all the elements 

int index = HASH_INDEX(ht, key); //  find key The index to which to map 

Bucket *org_bucket = ht->buckets[index];
 Bucket *bucket = (Bucket *)malloc(sizeof(Bucket)); //  Request space for new elements 

bucket->key = strdup(key);
 //  Save the value content,   Instead of copying the content, you simply point the pointer to the content to store. 
 bucket->value = value;

LOG_MSG("Insert data p: %p\n", value);

ht->elem_num += 1; //  record 1 Now the number of elements in the hash table 

if(org_bucket != NULL) { //  A collision occurs, placing the new element at the head of the list 
 LOG_MSG("Index collision found with org hashtable: %p\n", org_bucket);
 bucket->next = org_bucket;
 }

ht->buckets[index]= bucket;

LOG_MSG("Element inserted at index %i, now we have: %i elements\n",
 index, ht->elem_num);

return SUCCESS;
 }

The above hashtable insert operation is relatively simple, simple as key hash, find the location where the element should be stored, and check whether there is already content at that location, if a collision occurs, the new element will be linked to the original element list header. When searching, the same strategy is followed to find the location of the element. If there are elements, the key of all elements in the linked list is compared with the key to be searched in turn until the element of 1 is found, otherwise there is no matching content of the value.


int hash_lookup(HashTable *ht, char *key, void **result)
{
 int index = HASH_INDEX(ht, key);
 Bucket *bucket = ht->buckets[index];

 if(bucket == NULL) return FAILED;

 //  Look for the list to find the correct element, usually the list should be only 1 You don't have to do it multiple times 
 //  Cycle. To ensure that the 1 Points need to be 1 For an appropriate hash algorithm, see the link to the relevant hash function above. 
 while(bucket)
 {
 if(strcmp(bucket->key, key) == 0)
 {
 LOG_MSG("HashTable found key in index: %i with key: %s value: %p\n",
 index, key, bucket->value);
 *result = bucket->value;
 return SUCCESS;
 }

 bucket = bucket->next;
 }

 LOG_MSG("HashTable lookup missed the key: %s\n", key);
 return FAILED;
}

PHP arrays are implemented based on hash table, in order to add an array element, between elements is order, and the hash table on the physical location is close to the average distribution obviously, this is not according to the order of insertion of access to these elements, in the implementation of PHP Bucket structure also maintains a pointer to another the field to the relationship between the maintenance elements. This is detailed in HashTable in PHP in the following section. The above example is a stripped-down version implemented in PHP.