Lecture number 07: Hash Tables

Please refer to the following resources for more information:

cover

1. Data Structure#

DBMS mainly includes the following types of data:

Internal Meta-Data: Records information about the database and system.
Core Data Storage: Records information about tuples.
Temporary Data Structures
Table Indices

The main considerations are:

Data organization
Data reliability

2. Hash Table#

hash_table

hash_table_target

Hash Function

Converts a value into a smaller key.
Balances between speed and collision rate.

Hashing Scheme

How to handle collision cases.
Balancing between hash table size and additional maintenance information.

3. Hash Function#

A function that outputs a specific number regardless of the input key.

Hash_Function

Hash_Function bench_mark

4. Static Hashing Schemes#

4.1 Linear Probe Hashing#

A linear probing method is used to resolve hash collisions. If the position returned by the hash function is already occupied, the algorithm continues to search for the next available position until an empty slot is found.

In the case of deletion, there may be values that are not directly produced by the hash function but are obtained by continuously searching for the next position.

In this case, there are two methods to handle it:

Adjust the values affected by the offset by rehashing. This method is less efficient.
Mark the deleted values as "deleted". These values can be reused when new values need to be inserted. However, periodic garbage collection is required.

Storing the same values

NON-UNIQUE-KEY

When storing values with the same key but different values, there are two ways to handle it:

Allocate a separate space for each key and its value.
Hash all keys into one storage space using the hash function.

4.2 Robin Hood Hashing#

This is an upgraded version of linear probing that aims to solve the problem of uneven distribution. In the previous algorithm, there may be some keys that, after hashing, may cluster together and degrade the hash algorithm. Robin Hood hashing resolves collisions by redistributing some "rich" keys to "poor" keys.

The method to avoid hash collisions is to record an offset value for each value. When searching down the table, the current value needs to be compared with the values in the table. If the current offset value is greater than the offset value of the value in the table, the current value will replace the value in the table. This process continues until an empty slot is found.

void Robin_hood_Hashing(int key) {
     int val = hash(key);
     int offset = 0;
     for (int i = val ; ; i ++ , offset ++ ) {
          if(hash_table->is_empty(i)){ 
               hash_table->insert(i , val , offset);
               break;
          }

          if(hash_table->get_offset(i) < offset) {
               auto [n_val , n_offset] = hash_table->get_data(i);
               hash_table->insert(i , val , offset);
               val = n_val , offset = n_offset;
          }
          if(i == MAXSIZE) i = 0;
     }
}

4.3 Cuckoo Hashing#

This method of resolving collisions is achieved by using two hash functions and two hash tables.

For the input key, it calculates val[0] and val[1]. In uncertain situations, one of them is randomly chosen as the final result. When one of them triggers a collision, it uses a hash function and a hash table. If both of them collide, the collision is recalculated until there is no collision.

Using this method, the complexity of querying is always O(1), but the cost of insertion is increased.

Open-source CMU implementation

5 Dynamic Hashing Schemes#

Observation

The above hash tables are based on fixed-size elements, so dynamic hash tables are needed.

5.1 Chained Hashing#

Maintains a linked list for each hash value. All keys with the same hash value are placed in one bucket. Buckets are connected together in the form of linked lists.

Chained Hashing