Efficiently Sorting Gigantic Files with Limited Memory in Java

Learn robust techniques for sorting files that exceed available RAM, focusing on external merge sort strategies in Java.
Sorting a file that is too large to fit into memory is a common challenge in data processing. Standard in-memory implementations of algorithms like quicksort or mergesort cannot be applied directly when the data exceeds available RAM. This article explores the external merge sort algorithm, a technique designed to handle massive datasets by leveraging disk I/O and breaking the problem into manageable, memory-friendly chunks. We'll cover the core concepts, implementation details in Java, and best practices for optimizing performance.
Understanding External Merge Sort
External merge sort is a divide-and-conquer algorithm that operates in two main phases: the 'split and sort' phase and the 'merge' phase. The goal is to sort data that resides on disk, using only a limited amount of main memory. This approach minimizes memory usage by processing data in smaller, memory-sized blocks.
flowchart TD
    A[Start: Large Unsorted File] --> B{Split into Chunks}
    B --> C[Chunk 1]
    B --> D[Chunk 2]
    B --> E["..."]
    B --> F[Chunk N]
    C --> G["Sort Chunk 1 (in-memory)"]
    D --> H["Sort Chunk 2 (in-memory)"]
    F --> I["Sort Chunk N (in-memory)"]
    G --> J[Sorted Chunk 1]
    H --> K[Sorted Chunk 2]
    I --> L[Sorted Chunk N]
    J & K & L --> M{Merge Sorted Chunks}
    M --> N[Final Sorted File]
    N --> O[End]
Overview of the External Merge Sort Process
Phase 1: Splitting and Sorting
In the first phase, the large input file is read sequentially and divided into chunks, each small enough to fit entirely into available main memory. Once a chunk is loaded, it is sorted with an efficient in-memory algorithm (such as Arrays.sort() or Collections.sort() in Java) and written to a temporary file on disk. This continues until the entire input file has been processed, leaving multiple sorted temporary files.
import java.io.*;
import java.util.*;

public class ExternalSort {
    public static List<File> splitAndSort(File inputFile, long chunkSize, Comparator<String> comparator) throws IOException {
        List<File> sortedChunks = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
            List<String> currentChunk = new ArrayList<>();
            long currentChunkBytes = 0; // running memory estimate for the current chunk
            String line;
            while ((line = reader.readLine()) != null) {
                currentChunk.add(line);
                currentChunkBytes += estimateLineSize(line);
                if (currentChunkBytes >= chunkSize) {
                    Collections.sort(currentChunk, comparator);
                    sortedChunks.add(writeChunkToFile(currentChunk));
                    currentChunk.clear();
                    currentChunkBytes = 0;
                }
            }
            // Sort and write any remaining lines
            if (!currentChunk.isEmpty()) {
                Collections.sort(currentChunk, comparator);
                sortedChunks.add(writeChunkToFile(currentChunk));
            }
        }
        return sortedChunks;
    }

    private static File writeChunkToFile(List<String> chunk) throws IOException {
        File tempFile = File.createTempFile("sorted_chunk_", ".tmp");
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile))) {
            for (String line : chunk) {
                writer.write(line);
                writer.newLine();
            }
        }
        return tempFile;
    }

    // Simple estimation, can be improved for accuracy
    private static long estimateLineSize(String line) {
        return line.length() * 2L; // Assuming 2 bytes per char for UTF-16
    }
}
Java code for splitting a large file into sorted temporary chunks.
The chunkSize parameter is crucial and should be chosen based on available memory. A good starting point is to budget 50-75% of the JVM's heap for the currentChunk list, leaving enough headroom to avoid an OutOfMemoryError.
Phase 2: Merging Sorted Chunks
The second phase merges the multiple sorted temporary files into a single, fully sorted output file, typically via a k-way merge. A small buffer is maintained for each temporary file; at each step, the overall smallest element across all buffers is written to the output file, and the buffer it came from is replenished from its file. This repeats until all temporary files are exhausted.
import java.io.*;
import java.util.*;

public class ExternalSortMerge {
    public static void mergeSortedFiles(List<File> sortedFiles, File outputFile, Comparator<String> comparator) throws IOException {
        // Use a PriorityQueue to efficiently find the smallest element across all files
        PriorityQueue<MergeReader> pq = new PriorityQueue<>((mr1, mr2) -> comparator.compare(mr1.peek(), mr2.peek()));
        for (File file : sortedFiles) {
            MergeReader reader = new MergeReader(new BufferedReader(new FileReader(file)));
            if (reader.hasNext()) {
                pq.add(reader);
            } else {
                reader.close(); // Don't leak readers for empty files
            }
        }
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile))) {
            while (!pq.isEmpty()) {
                MergeReader reader = pq.poll();
                writer.write(reader.next());
                writer.newLine();
                if (reader.hasNext()) {
                    pq.add(reader); // Add reader back if it still has lines
                } else {
                    reader.close(); // Close reader if exhausted
                }
            }
        }
        // Clean up temporary files
        for (File file : sortedFiles) {
            file.delete();
        }
    }

    // Helper class to manage reading from a single sorted file
    static class MergeReader implements Closeable {
        private final BufferedReader reader;
        private String nextLine;

        public MergeReader(BufferedReader reader) throws IOException {
            this.reader = reader;
            readNextLine();
        }

        public boolean hasNext() {
            return nextLine != null;
        }

        public String peek() {
            return nextLine;
        }

        public String next() throws IOException {
            String currentLine = nextLine;
            readNextLine();
            return currentLine;
        }

        private void readNextLine() throws IOException {
            nextLine = reader.readLine();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}
Java code for merging multiple sorted temporary files using a PriorityQueue.
Putting It All Together: A Complete Example
Combining the splitting, sorting, and merging phases provides a complete solution for sorting large files. The Comparator allows for flexible sorting criteria, whether lexicographical, numerical, or based on specific fields within each line.
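For instance, a field-based comparator might order CSV-like lines by a numeric column. The "name,score" format and the field index here are illustrative assumptions for this sketch:

```java
import java.util.*;

public class FieldComparatorDemo {
    // Orders lines of the form "name,score" by the numeric score field.
    // The comma delimiter and field position are assumptions for this demo.
    static final Comparator<String> BY_SCORE =
            Comparator.comparingLong(line -> Long.parseLong(line.split(",")[1]));

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(List.of("carol,300", "alice,7", "bob,42"));
        lines.sort(BY_SCORE);
        System.out.println(lines); // [alice,7, bob,42, carol,300]
    }
}
```

Any such comparator can be passed straight into splitAndSort and mergeSortedFiles, since both phases sort with the same criterion.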
import java.io.*;
import java.util.*;

public class MainExternalSort {
    public static void main(String[] args) {
        File inputFile = new File("large_unsorted_data.txt");
        File outputFile = new File("sorted_data.txt");
        long chunkSize = 10 * 1024 * 1024; // 10 MB chunk size (adjust based on available RAM)
        // Example comparator: natural string order
        Comparator<String> stringComparator = Comparator.naturalOrder();
        try {
            System.out.println("Starting split and sort phase...");
            List<File> sortedChunks = ExternalSort.splitAndSort(inputFile, chunkSize, stringComparator);
            System.out.println("Split and sort complete. " + sortedChunks.size() + " sorted chunks created.");
            System.out.println("Starting merge phase...");
            ExternalSortMerge.mergeSortedFiles(sortedChunks, outputFile, stringComparator);
            System.out.println("Merge complete. Final sorted file: " + outputFile.getAbsolutePath());
        } catch (IOException e) {
            System.err.println("An error occurred during external sort: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
Main class demonstrating the usage of external merge sort.
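To try the example end to end, a small generator can produce sample input. The file name matches the demo above; the line count and random-number format are arbitrary choices for this sketch:

```java
import java.io.*;
import java.util.*;

public class TestDataGenerator {
    // Write `count` random numeric lines to the given file.
    public static void generate(File file, int count, long seed) throws IOException {
        Random random = new Random(seed); // fixed seed for reproducible test data
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
            for (int i = 0; i < count; i++) {
                writer.write(Long.toString(Math.abs(random.nextLong())));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        generate(new File("large_unsorted_data.txt"), 1_000_000, 42L);
    }
}
```

Note that with natural string ordering, numbers are sorted lexicographically ("10" before "9"); use a numeric comparator if true numeric order is needed.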
For further performance gains, consider memory-mapped files (java.nio.MappedByteBuffer) for even more efficient I/O, especially during the merging phase, though this adds complexity.
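A minimal sketch of memory-mapping a file with FileChannel.map, shown here reading a small sample file (the temp-file setup is just for the demo; error handling is elided):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class MappedReadDemo {
    // Map a file read-only into memory and return its contents as a UTF-8 string.
    // The OS pages data in on demand; no explicit read() calls are made.
    static String readMapped(Path path) throws IOException {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] bytes = new byte[buffer.remaining()];
            buffer.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mapped_demo_", ".txt");
        Files.write(path, "hello mapped io".getBytes(StandardCharsets.UTF_8));
        System.out.println(readMapped(path)); // prints "hello mapped io"
        Files.deleteIfExists(path);
    }
}
```

For very large files, note that a single mapping is limited to 2 GB (the buffer uses an int position), so production code typically maps the file in windows.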