Sort a file with huge volume of data given memory constraint

Learn sort a file with huge volume of data given memory constraint with practical examples, diagrams, and best practices. Covers java, file, sorting development techniques with visual explanations.

Efficiently Sorting Gigantic Files with Limited Memory in Java

Hero image for Sort a file with huge volume of data given memory constraint

Learn robust techniques for sorting files that exceed available RAM, focusing on external merge sort strategies in Java.

Sorting a file that is too large to fit into memory is a common challenge in data processing. Traditional in-memory sorting algorithms like quicksort or mergesort are not suitable for such scenarios. This article explores the external merge sort algorithm, a powerful technique designed to handle massive datasets by leveraging disk I/O and breaking the problem into manageable, memory-friendly chunks. We'll cover the core concepts, implementation details in Java, and best practices for optimizing performance.

Understanding External Merge Sort

External merge sort is a divide-and-conquer algorithm that operates in two main phases: the 'split and sort' phase and the 'merge' phase. The goal is to sort data that resides on disk, using only a limited amount of main memory. This approach minimizes memory usage by processing data in smaller, memory-sized blocks.

flowchart TD
    A[Start: Large Unsorted File] --> B{Split into Chunks}
    B --> C[Chunk 1]
    B --> D[Chunk 2]
    B --> E[...]
    B --> F[Chunk N]
    C --> G[Sort Chunk 1 (in-memory)]
    D --> H[Sort Chunk 2 (in-memory)]
    F --> I[Sort Chunk N (in-memory)]
    G --> J[Sorted Chunk 1]
    H --> K[Sorted Chunk 2]
    I --> L[Sorted Chunk N]
    J & K & L --> M{Merge Sorted Chunks}
    M --> N[Final Sorted File]
    N --> O[End]

Overview of the External Merge Sort Process

Phase 1: Splitting and Sorting

In the first phase, the large input file is read sequentially. As data is read, it's divided into smaller, fixed-size chunks. Each chunk is small enough to fit entirely into available main memory. Once a chunk is loaded, it's sorted using an efficient in-memory sorting algorithm (like Arrays.sort() in Java) and then written back to a temporary file on disk. This process continues until the entire input file has been processed, resulting in multiple sorted temporary files.

import java.io.*;
import java.util.*;

public class ExternalSort {

    public static List<File> splitAndSort(File inputFile, long chunkSize, Comparator<String> comparator) throws IOException {
        List<File> sortedChunks = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
            List<String> currentChunk = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                currentChunk.add(line);
                if (currentChunk.size() * estimateLineSize(line) >= chunkSize) {
                    Collections.sort(currentChunk, comparator);
                    sortedChunks.add(writeChunkToFile(currentChunk));
                    currentChunk.clear();
                }
            }
            // Sort and write any remaining lines
            if (!currentChunk.isEmpty()) {
                Collections.sort(currentChunk, comparator);
                sortedChunks.add(writeChunkToFile(currentChunk));
            }
        }
        return sortedChunks;
    }

    private static File writeChunkToFile(List<String> chunk) throws IOException {
        File tempFile = File.createTempFile("sorted_chunk_", ".tmp");
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile))) {
            for (String line : chunk) {
                writer.write(line);
                writer.newLine();
            }
        }
        return tempFile;
    }

    // Simple estimation, can be improved for accuracy
    private static long estimateLineSize(String line) {
        return line.length() * 2L; // Assuming 2 bytes per char for UTF-16
    }
}

Java code for splitting a large file into sorted temporary chunks.

Phase 2: Merging Sorted Chunks

The second phase involves merging the multiple sorted temporary files into a single, fully sorted output file. This is typically done using a k-way merge algorithm. A small buffer is maintained for each temporary file, reading the smallest element from each buffer, writing the overall smallest to the output file, and then replenishing the buffer from which the element was taken. This process continues until all temporary files are exhausted.

import java.io.*;
import java.util.*;

public class ExternalSortMerge {

    public static void mergeSortedFiles(List<File> sortedFiles, File outputFile, Comparator<String> comparator) throws IOException {
        // Use a PriorityQueue to efficiently find the smallest element across all files
        PriorityQueue<MergeReader> pq = new PriorityQueue<>((mr1, mr2) -> comparator.compare(mr1.peek(), mr2.peek()));

        for (File file : sortedFiles) {
            MergeReader reader = new MergeReader(new BufferedReader(new FileReader(file)));
            if (reader.hasNext()) {
                pq.add(reader);
            }
        }

        try (BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile))) {
            while (!pq.isEmpty()) {
                MergeReader reader = pq.poll();
                writer.write(reader.next());
                writer.newLine();

                if (reader.hasNext()) {
                    pq.add(reader); // Add reader back if it still has lines
                } else {
                    reader.close(); // Close reader if exhausted
                }
            }
        }

        // Clean up temporary files
        for (File file : sortedFiles) {
            file.delete();
        }
    }

    // Helper class to manage reading from a single sorted file
    static class MergeReader implements Closeable {
        private BufferedReader reader;
        private String nextLine;

        public MergeReader(BufferedReader reader) throws IOException {
            this.reader = reader;
            readNextLine();
        }

        public boolean hasNext() {
            return nextLine != null;
        }

        public String peek() {
            return nextLine;
        }

        public String next() throws IOException {
            String currentLine = nextLine;
            readNextLine();
            return currentLine;
        }

        private void readNextLine() throws IOException {
            nextLine = reader.readLine();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}

Java code for merging multiple sorted temporary files using a PriorityQueue.

Putting It All Together: A Complete Example

Combining the splitting, sorting, and merging phases provides a complete solution for sorting large files. The Comparator allows for flexible sorting criteria, whether it's lexicographical, numerical, or based on specific fields within each line.

import java.io.*;
import java.util.*;

public class MainExternalSort {

    public static void main(String[] args) {
        File inputFile = new File("large_unsorted_data.txt");
        File outputFile = new File("sorted_data.txt");
        long chunkSize = 10 * 1024 * 1024; // 10 MB chunk size (adjust based on available RAM)

        // Example comparator: natural string order
        Comparator<String> stringComparator = Comparator.naturalOrder();

        try {
            System.out.println("Starting split and sort phase...");
            List<File> sortedChunks = ExternalSort.splitAndSort(inputFile, chunkSize, stringComparator);
            System.out.println("Split and sort complete. " + sortedChunks.size() + " sorted chunks created.");

            System.out.println("Starting merge phase...");
            ExternalSortMerge.mergeSortedFiles(sortedChunks, outputFile, stringComparator);
            System.out.println("Merge complete. Final sorted file: " + outputFile.getAbsolutePath());

        } catch (IOException e) {
            System.err.println("An error occurred during external sort: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Main class demonstrating the usage of external merge sort.