Efficiently Merge Fastq.gz Files in Unix for Bioinformatics Workflows

Learn how to combine multiple gzipped FASTQ files into a single file using standard Unix commands, optimizing your bioinformatics data processing pipelines.
In bioinformatics, it's common to receive sequencing data split across multiple fastq.gz files. Before downstream analysis, such as alignment or variant calling, these files often need to be merged into a single, consolidated file. This article provides practical Unix commands and strategies to efficiently merge fastq.gz files, ensuring data integrity and preparing your data for the next steps in your pipeline.
Understanding Fastq.gz Files and Merging Requirements
FASTQ files store sequencing reads and their corresponding quality scores. The .gz extension indicates that the files are compressed with gzip. When merging these files, it's crucial to concatenate them in a way that preserves the original data structure and keeps the result valid and readable by bioinformatics tools. Direct concatenation of the gzipped files is usually the most efficient method: the gzip format permits multiple compressed members in a single file, so no decompression and recompression is needed.
flowchart TD
    A[Multiple fastq.gz files] --> B{Concatenate using 'cat'}
    B --> C[Single merged.fastq.gz file]
    C --> D[Downstream Bioinformatics Analysis]
Basic workflow for merging fastq.gz files
Method 1: Simple Concatenation with cat
The most straightforward way to merge fastq.gz files is the cat command. Used on gzipped files, cat concatenates their compressed streams directly, and the resulting file is itself a valid gzip file containing the combined content of the inputs. This method is highly efficient because it requires no explicit decompression or recompression.
cat sample_R1_L001.fastq.gz sample_R1_L002.fastq.gz > merged_sample_R1.fastq.gz
Merging two specific fastq.gz files into one.
cat sample_R1_*.fastq.gz > merged_sample_R1.fastq.gz
Merging all fastq.gz files matching a pattern into one.
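A quick sanity check after merging is that line counts add up, since every FASTQ record is exactly 4 lines. A minimal sketch using tiny illustrative files (names mirror the example above; real reads would, of course, be longer):

```shell
# Two one-read gzipped FASTQ files, standing in for real lane files
printf '@r1\nACGT\n+\nIIII\n' | gzip > sample_R1_L001.fastq.gz
printf '@r2\nTTTT\n+\nIIII\n' | gzip > sample_R1_L002.fastq.gz

cat sample_R1_L001.fastq.gz sample_R1_L002.fastq.gz > merged_sample_R1.fastq.gz

# Each FASTQ record is exactly 4 lines, so reads = lines / 4
lines=$(zcat merged_sample_R1.fastq.gz | wc -l)
echo "reads: $(( lines / 4 ))"    # prints "reads: 2"
```

The merged line count should equal the sum of the input line counts, and must be divisible by 4.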
Method 2: Merging Paired-End Reads and Handling Multiple Samples
For paired-end sequencing, you'll typically have two files per lane/batch: one for forward reads (R1) and one for reverse reads (R2). It's critical to merge R1 files with other R1 files, and R2 files with other R2 files, to maintain read pairing. This often involves using wildcards or loops for automation.
# Example for a single sample with multiple lanes/batches
# Merge all R1 files for SampleA
cat SampleA_L001_R1.fastq.gz SampleA_L002_R1.fastq.gz > SampleA_merged_R1.fastq.gz
# Merge all R2 files for SampleA
cat SampleA_L001_R2.fastq.gz SampleA_L002_R2.fastq.gz > SampleA_merged_R2.fastq.gz
Merging paired-end reads for a single sample across multiple lanes.
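Because read pairing is positional, the merged R1 and R2 files must contain the same number of reads. A cheap post-merge check (the files created here are tiny illustrative stand-ins for real merged outputs):

```shell
# Illustrative paired outputs; in practice these come from the cat commands above
printf '@r1/1\nACGT\n+\nIIII\n' | gzip > SampleA_merged_R1.fastq.gz
printf '@r1/2\nTTTT\n+\nIIII\n' | gzip > SampleA_merged_R2.fastq.gz

n1=$(zcat SampleA_merged_R1.fastq.gz | wc -l)
n2=$(zcat SampleA_merged_R2.fastq.gz | wc -l)

if [ "$n1" -eq "$n2" ]; then
    echo "OK: $(( n1 / 4 )) read pair(s)"
else
    echo "ERROR: R1/R2 read counts differ ($n1 vs $n2 lines)" >&2
fi
```

A mismatch here usually means an R1 or R2 file was skipped or merged in the wrong group.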
# Using a loop for multiple samples
for sample in SampleA SampleB SampleC; do
    echo "Merging R1 files for $sample..."
    cat "${sample}"_*_R1.fastq.gz > "${sample}_merged_R1.fastq.gz"
    echo "Merging R2 files for $sample..."
    cat "${sample}"_*_R2.fastq.gz > "${sample}_merged_R2.fastq.gz"
done
Automating the merging process for multiple samples using a bash loop.
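One weakness of the loop above: if a glob matches nothing, bash passes the literal pattern to cat, which fails, while the redirection still leaves behind an empty merged file. A defensive variant using nullglob, demonstrated on tiny illustrative files (DemoA has two lanes, DemoB deliberately has none):

```shell
#!/usr/bin/env bash
set -euo pipefail
shopt -s nullglob   # an unmatched glob expands to nothing, not to itself

# Illustrative inputs: two lanes for DemoA, none for DemoB
for lane in L001 L002; do
    printf '@r_%s\nACGT\n+\nIIII\n' "$lane" | gzip > "DemoA_${lane}_R1.fastq.gz"
done

for sample in DemoA DemoB; do
    inputs=( "${sample}"_*_R1.fastq.gz )
    if [ "${#inputs[@]}" -eq 0 ]; then
        echo "WARNING: no R1 files for ${sample}, skipping" >&2
        continue
    fi
    echo "Merging ${#inputs[@]} R1 file(s) for ${sample}..."
    cat "${inputs[@]}" > "${sample}_merged_R1.fastq.gz"
done
```

DemoA is merged; DemoB produces a warning instead of a broken empty output.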
After merging, verify the output: run gzip -t merged_file.fastq.gz to check for gzip errors, and zcat merged_file.fastq.gz | head to inspect the file content.
Method 3: Advanced Merging with Parallel Processing (GNU Parallel)
For very large datasets or many samples, sequential merging can be time-consuming. GNU Parallel
can significantly speed up the process by executing merge commands concurrently. This is particularly useful when merging many independent sample sets.
# First, create a list of sample prefixes (e.g., SampleA, SampleB).
# Strip the lane and read suffix so each sample appears once, not once per lane.
ls *_R1.fastq.gz | sed -E 's/_L[0-9]+_R1\.fastq\.gz$//' | sort -u > sample_prefixes.txt
# Use GNU Parallel to merge R1 and R2 files for each sample.
# Matching on _L* also keeps previously merged outputs out of the glob on reruns.
cat sample_prefixes.txt | parallel -j 4 'cat {}_L*_R1.fastq.gz > {}_merged_R1.fastq.gz && cat {}_L*_R2.fastq.gz > {}_merged_R2.fastq.gz'
Using GNU Parallel to merge paired-end reads for multiple samples concurrently. The -j 4 option runs 4 jobs in parallel.
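If GNU Parallel isn't installed, xargs -P gives similar (if less featureful) fan-out using only standard tools. A self-contained sketch with tiny illustrative inputs (R1 only, for brevity):

```shell
# Illustrative inputs: two lanes for each of two samples
for s in SampleA SampleB; do
    for l in L001 L002; do
        printf '@%s_%s\nACGT\n+\nIIII\n' "$s" "$l" | gzip > "${s}_${l}_R1.fastq.gz"
    done
done
printf 'SampleA\nSampleB\n' > sample_prefixes.txt

# Run up to 4 merge jobs concurrently, one sample prefix per job
xargs -P 4 -I {} sh -c 'cat {}_L*_R1.fastq.gz > {}_merged_R1.fastq.gz' \
    < sample_prefixes.txt
```

Unlike GNU Parallel, xargs -P offers no job logging or retries, but it is preinstalled almost everywhere.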
When using GNU Parallel, adjust the -j option (number of parallel jobs) based on your system's CPU cores and I/O capabilities to avoid overloading the system. Merging is largely I/O-bound, so running more jobs than your storage can service may actually slow things down.