Hash table implementation in python

Learn hash table implementation in python with practical examples, diagrams, and best practices. Covers python, hash development techniques with visual explanations.

Implementing Hash Tables in Python: A Comprehensive Guide

Abstract representation of data being hashed and stored in buckets, symbolizing a hash table

Explore the fundamentals of hash tables, their underlying principles, and practical implementation techniques in Python. Learn how to build your own hash map from scratch.

Hash tables, also known as hash maps or dictionaries, are one of the most fundamental and widely used data structures in computer science. They provide efficient storage and retrieval of data by mapping keys to values. In Python, the built-in dict type is an excellent example of a hash table implementation. However, understanding how they work under the hood is crucial for optimizing performance and debugging issues. This article will guide you through the core concepts of hash tables and demonstrate how to implement a basic version in Python.

Understanding Hash Table Fundamentals

At its core, a hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found. The key components are:

Hash Function: Takes an input key and returns an integer hash value. A good hash function distributes keys uniformly across the hash table to minimize collisions.
Buckets/Slots: An array where key-value pairs are stored. Each index in this array corresponds to a hash value.
Collision Resolution: Since different keys can produce the same hash value (a collision), a mechanism is needed to handle these situations. Common strategies include separate chaining (using linked lists or arrays in each bucket) or open addressing (probing for the next available slot).

flowchart TD
    A[Input Key] --> B{"Hash Function"}
    B --> C[Hash Value (Integer)]
    C --> D["Modulo Operator (Index Calculation)"]
    D --> E["Array Index (Bucket)"]
    E --> F{"Collision Resolution"}
    F --> G["Store/Retrieve Value"]
    G --> H[Output Value]

Basic workflow of a hash table operation

Implementing a Simple Hash Table in Python

Let's build a basic hash table using separate chaining for collision resolution. We'll define a HashTable class with __init__, _hash, insert, and get methods. For simplicity, our hash function will use Python's built-in hash() function and the modulo operator to determine the bucket index.

class HashTable:
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.table = [[] for _ in range(self.capacity)]

    def _hash(self, key):
        return hash(key) % self.capacity

    def insert(self, key, value):
        index = self._hash(key)
        # Check if key already exists in the bucket (update operation)
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                self.table[index][i] = (key, value) # Update value
                return
        self.table[index].append((key, value)) # Add new key-value pair

    def get(self, key):
        index = self._hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        raise KeyError(f"Key '{key}' not found")

    def delete(self, key):
        index = self._hash(key)
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                del self.table[index][i]
                return
        raise KeyError(f"Key '{key}' not found")

    def __str__(self):
        items = []
        for i, bucket in enumerate(self.table):
            if bucket:
                items.append(f"Bucket {i}: {bucket}")
        return "\n".join(items)

# Example Usage:
hash_map = HashTable(capacity=5)
hash_map.insert("apple", 10)
hash_map.insert("banana", 20)
hash_map.insert("cherry", 30)
hash_map.insert("date", 40)
hash_map.insert("elderberry", 50)

print("Initial Hash Table:")
print(hash_map)

print(f"\nValue for 'banana': {hash_map.get('banana')}")
hash_map.insert("banana", 25) # Update value
print(f"Updated value for 'banana': {hash_map.get('banana')}")

hash_map.insert("fig", 60) # This might cause a collision depending on hash()
print("\nHash Table after adding 'fig':")
print(hash_map)

try:
    print(f"\nValue for 'grape': {hash_map.get('grape')}")
except KeyError as e:
    print(e)

hash_map.delete("cherry")
print("\nHash Table after deleting 'cherry':")
print(hash_map)

A basic Python hash table implementation using separate chaining.

💡

Python's built-in hash() function is designed to be efficient and provide good distribution for various data types. However, for custom objects, you'll need to implement your own __hash__ and __eq__ methods to ensure they can be used as dictionary keys.

Performance Considerations and Load Factor

The efficiency of a hash table heavily depends on its load factor, which is the ratio of the number of stored items to the number of buckets. A high load factor means more items per bucket, leading to more collisions and slower retrieval times (O(N) in the worst case for a single bucket). Conversely, a low load factor means more empty buckets, wasting space.

To maintain good performance (ideally O(1) average time complexity for insertions, deletions, and lookups), hash tables often resize themselves when the load factor exceeds a certain threshold. This involves creating a new, larger array of buckets and re-hashing all existing key-value pairs into the new table.

graph TD
    A[Number of Items] --> B[Number of Buckets]
    B --> C{"Load Factor (Items / Buckets)"}
    C --> D{Load Factor > Threshold?}
    D -- Yes --> E[Resize Table (Increase Buckets)]
    E --> F[Rehash All Items]
    D -- No --> G[Maintain Current Performance]
    F --> G

Decision process for hash table resizing based on load factor

ℹ️

While our simple implementation doesn't include resizing, a production-ready hash table would dynamically adjust its capacity to maintain optimal performance. Python's dict handles this automatically.