Which column for foreign key: id or any other column and why?

Learn which column for foreign key: id or any other column and why? with practical examples, diagrams, and best practices. Covers sql, postgresql, foreign-keys development techniques with visual ex...

Foreign Key Best Practices: 'id' vs. Other Columns

Hero image for Which column for foreign key: id or any other column and why?

Explore the best practices for choosing foreign key columns in relational databases, focusing on why using the primary key 'id' is generally preferred over other columns, and the implications for data integrity and performance.

When designing a relational database, one of the fundamental decisions involves how to establish relationships between tables. Foreign keys are crucial for this, linking records in one table to records in another. A common question arises: should a foreign key always reference the primary key (typically an id column), or can it reference any unique column? This article delves into the reasons why referencing the primary key, especially an auto-incrementing id, is almost always the superior choice, ensuring data integrity, performance, and maintainability.

The Role of Primary Keys and Foreign Keys

A primary key uniquely identifies each record in a table. It must contain unique values and cannot contain NULL values. The most common primary key is an auto-incrementing integer column named id. A foreign key, on the other hand, is a column or set of columns in one table that refers to the primary key (or a unique key) in another table. Its purpose is to enforce referential integrity, ensuring that relationships between tables remain consistent. For example, if you have a users table and an orders table, the orders table would likely have a user_id foreign key referencing the id primary key in the users table.

erDiagram
    USERS ||--o{ ORDERS : places
    USERS { 
        INT id PK
        VARCHAR name
        VARCHAR email
    }
    ORDERS { 
        INT id PK
        INT user_id FK
        DATE order_date
        DECIMAL total_amount
    }

Entity-Relationship Diagram showing USERS and ORDERS tables with a foreign key relationship.

Why 'id' (Primary Key) is the Preferred Foreign Key Target

Referencing the primary key (id) for foreign key constraints offers several significant advantages:

  1. Uniqueness and Immutability: Primary keys are guaranteed to be unique and, ideally, immutable. This ensures that the foreign key always points to a specific, unchanging record. If you were to reference a non-primary key column that might change (e.g., a username or email), updating that column would require cascading updates to all referencing foreign keys, which can be complex, error-prone, and performance-intensive.

  2. Performance: Database systems are highly optimized for primary key lookups. When a foreign key references a primary key, the database can efficiently join tables and enforce constraints. Primary keys are almost always indexed, making lookups extremely fast. Referencing a non-indexed column would lead to slower queries and constraint checks.

  3. Simplicity and Consistency: Using id as the foreign key target simplifies schema design and makes it consistent across your database. Developers can easily understand relationships without needing to remember which specific unique column in a parent table is being referenced.

  4. Data Integrity: The primary key's non-null and unique constraints are the strongest guarantees of data integrity. By referencing it, you leverage these guarantees directly for your foreign key relationships.

  5. Storage Efficiency: Auto-incrementing integer id columns are typically small and efficient to store and index, which translates to better performance and less disk usage compared to larger string-based unique keys.

When Other Columns Might Be Considered (and Why to Be Cautious)

While generally discouraged, a foreign key can technically reference any column or set of columns in the parent table that has a UNIQUE constraint. For example, if a users table has a username column with a UNIQUE constraint, another table could theoretically use username as a foreign key. However, this approach comes with significant drawbacks:

  • Mutability Issues: If the username of a user changes, all records in tables referencing that username would need to be updated. This requires ON UPDATE CASCADE rules, which can be tricky to manage and can lead to performance bottlenecks during updates.
  • Performance Overhead: While a unique index helps, integer-based primary keys are often more efficient for joins and lookups than string-based unique keys.
  • Semantic Meaning vs. Identifier: Primary keys are ideal as surrogate identifiers – they have no inherent meaning beyond identifying a record. Columns like username or email have semantic meaning and might change, making them poor choices for stable identifiers in relationships.
CREATE TABLE users (
    id SERIAL PRIMARY KEY, -- Preferred: Auto-incrementing integer primary key
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL
);

CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(id), -- Correct: References the primary key 'id'
    title VARCHAR(255) NOT NULL,
    content TEXT
);

-- Example of referencing a unique non-PK column (generally discouraged)
CREATE TABLE user_profiles (
    id SERIAL PRIMARY KEY,
    user_email VARCHAR(100) REFERENCES users(email), -- Possible, but problematic if email changes
    bio TEXT
);

SQL examples demonstrating foreign key creation, contrasting referencing 'id' vs. a unique 'email' column.

Conclusion

The best practice for foreign key relationships is to always reference the primary key of the parent table, which is almost universally an auto-incrementing integer id column. This approach guarantees uniqueness, immutability, optimal performance, and simplifies database design and maintenance. While databases allow foreign keys to reference any unique column, deviating from the primary key standard introduces unnecessary risks and complexities that can negatively impact data integrity and application performance in the long run.