Sortrows with multiple sorting keys in numpy


Sorting NumPy Arrays by Multiple Columns (Sortrows Equivalent)


Learn how to sort NumPy arrays based on the values of multiple columns, mimicking the functionality of MATLAB's sortrows for complex data ordering.

Sorting data is a fundamental operation in data analysis and manipulation. While NumPy provides powerful tools for array operations, sorting by multiple keys, similar to MATLAB's sortrows function, requires a specific approach. This article will guide you through various methods to achieve multi-key sorting in NumPy, ensuring your data is ordered precisely according to your criteria.

Understanding Multi-Key Sorting

Multi-key sorting orders a dataset by the values in one column (the primary key) and then, for rows with identical primary key values, by a second column (the secondary key), and so on for any further keys. This matters whenever a single column contains repeated values and so cannot determine the row order on its own. In NumPy, the main tools for this type of sorting are np.lexsort, np.argsort, and structured arrays sorted with np.sort.

flowchart TD
    A[Start with Unsorted Array] --> B{Define Primary Sort Key}
    B --> C{"Define Secondary Sort Key (and others)"}
    C --> D[Create Composite Key/Structured Array]
    D --> E[Apply `argsort` to Composite Key]
    E --> F[Reorder Original Array using Indices]
    F --> G[Sorted Array by Multiple Keys]

Workflow for multi-key sorting in NumPy
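Before turning to NumPy, the ordering rule itself can be illustrated with plain Python. This is only a sketch of the concept, not one of the NumPy methods covered below: the sample rows and the choice of Score as the primary key and Name as the secondary key are assumptions made for the example.

```python
# Sample rows: (ID, Score, Name); the values are illustrative only
rows = [
    (1, 90, "Alice"),
    (2, 85, "Bob"),
    (3, 90, "Charlie"),
    (4, 95, "David"),
    (5, 85, "Eve"),
]

# A tuple key compares element by element: Score first, then Name on ties
ordered = sorted(rows, key=lambda r: (r[1], r[2]))

for row in ordered:
    print(row)
```

Rows with the same Score (for example the two rows scoring 85) end up next to each other, ordered by Name. The NumPy methods below produce the same kind of ordering on arrays.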

Method 1: Using np.lexsort for Multiple Keys

np.lexsort is specifically designed for indirect stable sorting using multiple keys. It takes a sequence of keys (columns) and returns an array of integer indices that would sort the array lexicographically. The last key in the sequence is the primary sort key, the second to last is the secondary, and so on. This is often the most straightforward and efficient method for multi-key sorting.

import numpy as np

# Sample records: (ID, Score, Name)
# Kept as a list of tuples: np.array() on mixed values would coerce every entry to a string
records = [
    (1, 90, 'Alice'),
    (2, 85, 'Bob'),
    (3, 90, 'Charlie'),
    (4, 95, 'David'),
    (5, 85, 'Eve')
]

# Build a structured array so each field keeps its own type
dtype = [('ID', int), ('Score', int), ('Name', 'U10')]
structured_data = np.array(records, dtype=dtype)

# Sort by 'Score' (primary) then by 'Name' (secondary)
# lexsort expects keys in reverse order of precedence
sort_indices = np.lexsort((structured_data['Name'], structured_data['Score']))

sorted_data_lexsort = structured_data[sort_indices]

print("Original Structured Data:\n", structured_data)
print("\nSorted by Score (primary) then Name (secondary) using lexsort:\n", sorted_data_lexsort)

Sorting a structured NumPy array using np.lexsort.
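For a plain numeric 2D array, which is the closest match to MATLAB's sortrows use case, the same idea can be wrapped in a small helper. The function name `sortrows`, its `cols` parameter, and the sample array are hypothetical choices for this sketch and are not part of NumPy. Column indices are 0-based, and only ascending order is handled.

```python
import numpy as np

def sortrows(a, cols):
    """Hypothetical helper: return the rows of 2D array `a` sorted by the
    columns listed in `cols`, where the first entry is the primary key."""
    a = np.asarray(a)
    # lexsort treats the LAST key as the primary one, so reverse the list
    keys = tuple(a[:, c] for c in reversed(cols))
    return a[np.lexsort(keys)]

a = np.array([
    [3, 1, 7],
    [1, 2, 5],
    [3, 0, 9],
    [1, 1, 4],
])

# Sort by column 0 first, then by column 1 within equal column-0 values
result = sortrows(a, [0, 1])
print(result)
```

Reversing `cols` inside the helper keeps the call site in the natural "primary key first" order while still giving np.lexsort the key order it expects.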

Method 2: Using Structured Arrays with np.sort

NumPy structured arrays are well suited to heterogeneous data, similar to tables or records. When you sort a structured array with np.sort or the .sort() method and give no further arguments, the records are compared lexicographically, field by field, in the order the fields appear in the dtype. To sort by a different precedence, pass the field names to the order parameter, for example order=['Score', 'Name']. Fields you do not list are still used, in dtype order, to break any remaining ties.

import numpy as np

# Sample data as a structured array: (ID, Score, Name)
data = np.array([
    (1, 90, 'Alice'),
    (2, 85, 'Bob'),
    (3, 90, 'Charlie'),
    (4, 95, 'David'),
    (5, 85, 'Eve')
], dtype=[('ID', int), ('Score', int), ('Name', 'U10')])

# With no arguments, np.sort follows the dtype field order: ID, then Score, then Name
sorted_by_dtype_order = np.sort(data)

# The 'order' argument sets the key precedence explicitly, so the dtype
# does not have to be rearranged: Score is primary, Name is secondary
sorted_data_dtype = np.sort(data, order=['Score', 'Name'])

print("\nOriginal Structured Data:\n", data)
print("\nSorted by Score (primary) then Name (secondary) using np.sort with 'order':\n", sorted_data_dtype)

Sorting a structured NumPy array using np.sort with the order parameter.
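When the sorting indices are needed rather than a sorted copy, for instance to reorder a second array in step with the records, np.argsort accepts the same order parameter. The sketch below assumes a hypothetical `weights` array that is aligned row by row with the records; it is not part of the original example.

```python
import numpy as np

data = np.array([
    (1, 90, 'Alice'),
    (2, 85, 'Bob'),
    (3, 90, 'Charlie'),
    (4, 95, 'David'),
    (5, 85, 'Eve'),
], dtype=[('ID', int), ('Score', int), ('Name', 'U10')])

# Hypothetical parallel array, one value per record in `data`
weights = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

# Indices that would sort the records by Score, then by Name
idx = np.argsort(data, order=['Score', 'Name'])

# Apply the same permutation to both arrays so they stay aligned
sorted_records = data[idx]
sorted_weights = weights[idx]

print(idx)
print(sorted_records['Name'])
print(sorted_weights)
```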

Method 3: Manual Sorting with argsort (Less Common for Multiple Keys)

While np.lexsort is built for this task, multi-key sorting can also be done manually by applying a stable argsort once per key, starting with the least significant key and finishing with the primary key. This works because a stable sort preserves the existing order of rows whose keys compare equal, so the ordering produced by an earlier pass survives as the tie-breaker in the next one. The approach is more verbose and more error-prone than np.lexsort or structured array sorting, especially for more than two keys. It's included here for completeness and to illustrate the underlying mechanics.

import numpy as np

# Sample data (regular 2D array)
data = np.array([
    [1, 90, 'Alice'],
    [2, 85, 'Bob'],
    [3, 90, 'Charlie'],
    [4, 95, 'David'],
    [5, 85, 'Eve']
], dtype=object) # Use dtype=object for mixed types in a regular array

# Extract the key columns with concrete types so they compare predictably
scores = data[:, 1].astype(int)
names = data[:, 2].astype(str)

# Step 1: stable sort by the secondary key (Name)
order = np.argsort(names, kind='stable')

# Step 2: stable sort by the primary key (Score), applied to the rows in
# their name-sorted order; stability keeps equal scores ordered by name
order = order[np.argsort(scores[order], kind='stable')]

sorted_data_manual = data[order]

print("\nOriginal Data (regular array):\n", data)
print("\nIndices that sort by Score, then Name:\n", order)
print("\nData sorted by Score (primary) then Name (secondary):\n", sorted_data_manual)

Multi-key sorting with repeated stable argsort calls, applied from the least significant key to the most significant.
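One way to gain confidence in a manual argsort chain is to compare it against np.lexsort on data with many repeated key values. The following sketch does that under stated assumptions: the key arrays are random integers generated with a fixed seed, and the array names are arbitrary.

```python
import numpy as np

# Random integer keys with many ties; the seed is fixed for repeatability
rng = np.random.default_rng(0)
primary = rng.integers(0, 5, size=200)
secondary = rng.integers(0, 5, size=200)

# Manual approach: stable sort by the secondary key, then by the primary key
manual = np.argsort(secondary, kind='stable')
manual = manual[np.argsort(primary[manual], kind='stable')]

# Reference: lexsort, with the primary key passed last
reference = np.lexsort((secondary, primary))

# Both approaches are stable, so they should yield the same permutation
same = np.array_equal(manual, reference)
print(same)
```

If the two index arrays ever differ, the usual cause is a non-stable sort kind or the passes being applied in the wrong key order.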

In summary, when you need to sort a NumPy array by multiple columns, np.lexsort is generally the most flexible and recommended approach, especially when dealing with regular (non-structured) arrays or when the sort key order doesn't align with a structured array's dtype definition. For structured arrays, defining the dtype with the desired sort order or using np.sort with the order parameter provides a clean solution. Choose the method that best fits your data structure and specific sorting requirements.