How to create a full compressed tar file using Python?
Creating Compressed Tar Archives with Python's tarfile Module

Learn how to efficiently create full compressed tar archives (tar.gz, tar.bz2, tar.xz) using Python's built-in tarfile module, covering various compression methods and best practices.
Archiving and compressing files are fundamental operations in many software applications and system administration tasks. Python's standard library provides the powerful tarfile module, which allows you to read and write tar archives, including those compressed with gzip, bzip2, or lzma. This article will guide you through the process of creating full compressed tar files, explaining the different compression types and providing practical code examples.
Understanding Tar and Compression Formats
Before diving into the code, it's important to understand the distinction between tar and compression. Tar (Tape Archive) is an archiving utility that bundles multiple files and directories into a single file, preserving file system metadata like permissions, timestamps, and directory structures. It does not inherently compress the data. Compression, on the other hand, reduces the size of the data. When you create a 'compressed tar file,' you are essentially creating a tar archive and then compressing that archive using a separate compression algorithm.
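To make the distinction concrete, the sketch below performs the two steps separately: it writes a plain, uncompressed tar archive and then compresses that file with the standard gzip module. The temporary directory and the file names (notes.txt, bundle.tar) are made up for illustration.

```python
import gzip
import os
import shutil
import tarfile
import tempfile

# Work in a throwaway directory; all names here are illustrative.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "notes.txt")
with open(sample, "w") as f:
    f.write("hello tar\n" * 1000)

# Step 1: archive only (mode 'w' applies no compression).
plain_tar = os.path.join(workdir, "bundle.tar")
with tarfile.open(plain_tar, "w") as tar:
    tar.add(sample, arcname="notes.txt")

# Step 2: compress the finished archive with gzip.
gz_tar = os.path.join(workdir, "bundle.tar.gz")
with open(plain_tar, "rb") as src, gzip.open(gz_tar, "wb") as dst:
    shutil.copyfileobj(src, dst)

# The result is an ordinary .tar.gz that tarfile can read back.
with tarfile.open(gz_tar, "r:gz") as tar:
    names = tar.getnames()

plain_size = os.path.getsize(plain_tar)
gz_size = os.path.getsize(gz_tar)
print(names, plain_size, gz_size)
```

In practice you rarely need the intermediate .tar file, because tarfile can archive and compress in a single pass.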
The tarfile module supports several compression formats:
- gzip (.tar.gz or .tgz): A widely used compression format offering a good balance between compression ratio and speed. It's generally faster than bzip2 but achieves slightly less compression.
- bzip2 (.tar.bz2 or .tbz): Provides better compression than gzip, but at the cost of slower compression and decompression times. Ideal for situations where file size is critical and speed is less of a concern.
- lzma (.tar.xz or .txz): Usually offers the highest compression ratios among the three, often significantly reducing file sizes. Compression is typically the slowest of the three, although decompression is generally faster than with bzip2. Best for long-term storage or distribution where maximum compression is desired.
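The trade-offs above depend heavily on the data, so it can help to measure them on a representative sample. The following is a rough sketch that archives the same generated log file with each of the three formats and prints the resulting sizes. The sample content and file names are invented for the example, and the numbers it prints should not be read as a general benchmark.

```python
import os
import tarfile
import tempfile

# Generate an illustrative, fairly repetitive text file to compress.
workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "sample.log")
with open(sample, "w") as f:
    for i in range(20000):
        f.write(f"{i} INFO request handled in {i % 97} ms\n")

# Archive the same file once per compression format.
sizes = {}
for suffix in ("gz", "bz2", "xz"):
    archive = os.path.join(workdir, f"sample.tar.{suffix}")
    with tarfile.open(archive, f"w:{suffix}") as tar:
        tar.add(sample, arcname="sample.log")
    sizes[suffix] = os.path.getsize(archive)

original_size = os.path.getsize(sample)
for suffix, size in sizes.items():
    print(f"tar.{suffix}: {size} bytes ({size / original_size:.1%} of original)")
```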
Figure: Process flow for creating compressed tar archives. Choose a compression type (gzip, bzip2, or lzma), create the matching archive (.tar.gz, .tar.bz2, or .tar.xz), add files and directories, then close the archive.
Creating a Compressed Tar Archive
The tarfile module makes creating compressed archives straightforward. The key is to specify the correct mode when opening the tar file. The mode string determines both the operation (read/write) and the compression type. For writing, you'll typically use modes like 'w:gz', 'w:bz2', or 'w:xz'.
import tarfile
import os

def create_compressed_tar(output_filename, source_paths, compression_mode='gz'):
    """
    Creates a compressed tar archive from a list of source files/directories.

    Args:
        output_filename (str): The name of the output tar file (e.g., 'archive.tar.gz').
        source_paths (list): A list of file or directory paths to add to the archive.
        compression_mode (str): The compression mode ('gz', 'bz2', 'xz', or None for uncompressed).
    """
    mode = f'w:{compression_mode}' if compression_mode else 'w'
    try:
        with tarfile.open(output_filename, mode) as tar:
            for path in source_paths:
                if os.path.exists(path):
                    # arcname is the name of the file/dir inside the tar archive.
                    # By default, it's the same as the path, but you can change it.
                    # normpath() drops a trailing slash so basename() is never empty.
                    tar.add(path, arcname=os.path.basename(os.path.normpath(path)))
                    print(f"Added '{path}' to '{output_filename}'")
                else:
                    print(f"Warning: Path '{path}' not found, skipping.")
        print(f"Successfully created '{output_filename}' with {compression_mode or 'no'} compression.")
    except tarfile.TarError as e:
        print(f"Error creating tar archive: {e}")
    except OSError as e:
        print(f"File system error while creating archive: {e}")

# --- Example Usage ---
# 1. Create some dummy files and a directory for archiving
os.makedirs('my_files', exist_ok=True)
with open('my_files/file1.txt', 'w') as f:
    f.write('This is file 1.')
with open('my_files/file2.log', 'w') as f:
    f.write('Log entry 1\nLog entry 2.')
with open('single_file.txt', 'w') as f:
    f.write('This is a standalone file.')

# Define source paths to be archived
sources = ['my_files', 'single_file.txt']

# Create a .tar.gz archive
create_compressed_tar('my_archive.tar.gz', sources, 'gz')
# Create a .tar.bz2 archive
create_compressed_tar('my_archive.tar.bz2', sources, 'bz2')
# Create a .tar.xz archive
create_compressed_tar('my_archive.tar.xz', sources, 'xz')

# Clean up dummy files (optional)
os.remove('my_files/file1.txt')
os.remove('my_files/file2.log')
os.rmdir('my_files')
os.remove('single_file.txt')
Python function to create compressed tar archives with different compression modes.
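Because the mode string and the file extension have to agree, one option is to derive the mode from the output file name instead of passing it separately. The helper below, write_mode_for, and its suffix table are hypothetical names introduced for this sketch; they are not part of tarfile.

```python
import os
import tarfile
import tempfile

# Hypothetical lookup table: common archive suffixes and their write modes.
SUFFIX_MODES = {
    ".tar.gz": "w:gz",
    ".tgz": "w:gz",
    ".tar.bz2": "w:bz2",
    ".tbz": "w:bz2",
    ".tar.xz": "w:xz",
    ".txz": "w:xz",
    ".tar": "w",
}

def write_mode_for(filename):
    """Return the tarfile write mode implied by the file name's suffix."""
    lowered = filename.lower()
    for suffix, mode in SUFFIX_MODES.items():
        if lowered.endswith(suffix):
            return mode
    raise ValueError(f"Unrecognized archive suffix: {filename!r}")

# Usage: the archive type follows from the chosen file name.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "data.txt")
with open(source, "w") as f:
    f.write("example\n")

target = os.path.join(workdir, "backup.tar.xz")
with tarfile.open(target, write_mode_for(target)) as tar:
    tar.add(source, arcname="data.txt")
```

Raising ValueError for unknown suffixes is a design choice of this sketch; falling back to an uncompressed archive would be another reasonable behavior.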
The arcname parameter in tar.add() is crucial. It specifies the name of the file or directory inside the archive. If you add a path like /home/user/documents/report.pdf, arcname defaults to home/user/documents/report.pdf (the path with its leading slash removed). Often, you only want report.pdf inside the archive, so you'd use arcname=os.path.basename(path).
Adding Specific Files and Directories
The tar.add() method is versatile. You can add individual files, entire directories, or even symbolic links. When adding a directory, tarfile recursively adds all its contents. The arcname argument allows you to control how the path appears within the archive, which is particularly useful for maintaining a clean archive structure.
import tarfile
import os

# Setup: Create a temporary directory structure
os.makedirs('project_data/src', exist_ok=True)
os.makedirs('project_data/docs', exist_ok=True)
with open('project_data/src/main.py', 'w') as f:
    f.write('print("Hello")')
with open('project_data/docs/README.md', 'w') as f:
    f.write('# Project Readme')
with open('project_data/config.ini', 'w') as f:
    f.write('[settings]\nversion=1.0')

output_archive = 'project_archive.tar.gz'
with tarfile.open(output_archive, 'w:gz') as tar:
    # Add the entire 'project_data' directory, but make it appear as 'my_project' inside the archive
    tar.add('project_data', arcname='my_project')
    # Alternatively, add specific files/directories with custom names
    # tar.add('project_data/src/main.py', arcname='code/app.py')
    # tar.add('project_data/docs', arcname='documentation')
print(f"Created '{output_archive}' with 'project_data' archived as 'my_project'.")

# Clean up
os.remove('project_data/src/main.py')
os.rmdir('project_data/src')
os.remove('project_data/docs/README.md')
os.rmdir('project_data/docs')
os.remove('project_data/config.ini')
os.rmdir('project_data')
Example demonstrating the use of arcname to control the internal path within the tar archive.
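The effect of arcname is easiest to see by adding the same file several ways and listing what the archive actually stores. This is a small sketch under assumed names (a temporary directory with a documents/report.txt file). It adds the file by its absolute path with the default name, with only its base name, and with a path relative to a chosen base directory.

```python
import os
import tarfile
import tempfile

# Illustrative layout: <tempdir>/documents/report.txt
workdir = tempfile.mkdtemp()
docs = os.path.join(workdir, "documents")
os.makedirs(docs)
report = os.path.join(docs, "report.txt")  # an absolute path
with open(report, "w") as f:
    f.write("quarterly numbers\n")

archive = os.path.join(workdir, "paths.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    # Default name: the full directory chain of the absolute path.
    tar.add(report)
    # File name only.
    tar.add(report, arcname=os.path.basename(report))
    # Path relative to a base directory of your choosing.
    tar.add(report, arcname=os.path.relpath(report, workdir))

# List the names as stored inside the archive.
with tarfile.open(archive, "r:gz") as tar:
    stored = tar.getnames()
for name in stored:
    print(name)
```

The first printed name carries the whole temporary directory path, while the other two are the short names chosen through arcname.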
When calling tar.add(path) with an absolute path, the archive stores the full directory chain of that path (only the leading slash is stripped) unless arcname is specified. It's generally safer to use relative paths or explicitly set arcname to avoid unintended directory structures within your archive.
Best Practices for Archiving
To ensure your archiving process is robust and efficient, consider these best practices:
- Error Handling: Always wrap your tarfile.open() calls in try...except blocks to catch tarfile.TarError or other potential exceptions (such as OSError) during file operations.
- Resource Management: Use with tarfile.open(...) as tar: to ensure the tar file is properly closed, even if errors occur. This prevents resource leaks.
- Path Management: Use os.path.basename(), os.path.join(), and os.path.relpath() to construct and manage paths effectively, especially when dealing with arcname.
- Compression Choice: Select the compression algorithm based on your needs: gzip for speed, bzip2 for better compression, lzma for maximum compression.
- Large Files: For very large files or directories, consider streaming or chunking if memory becomes an issue, though tarfile generally handles large files well by default because it copies file data in chunks.
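For the streaming suggestion in the last point, tarfile offers stream modes, written with a | instead of a : (for example 'w|gz'). In these modes the archive is written sequentially without seeking, so the output can go to a pipe or socket. The sketch below uses an in-memory io.BytesIO buffer purely as a stand-in for such a destination; the 2 MiB placeholder file and its name are assumptions made for the example.

```python
import io
import os
import tarfile
import tempfile

# Create an illustrative payload file.
workdir = tempfile.mkdtemp()
big_file = os.path.join(workdir, "big.bin")
with open(big_file, "wb") as f:
    f.write(b"\0" * (2 * 1024 * 1024))  # 2 MiB of zero bytes

# BytesIO stands in for a non-seekable destination such as a pipe or socket.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w|gz") as tar:
    tar.add(big_file, arcname="big.bin")

# Read the archive back, again as a sequential stream.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r|gz") as tar:
    members = [(m.name, m.size) for m in tar]
print(members)
```

A stream-mode archive can only be processed front to back, so random access to members is not available when reading it this way.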
1. Import tarfile and os
Begin by importing the necessary modules: tarfile for archive operations and os for path manipulation.
2. Prepare Source Files/Directories
Ensure the files and directories you intend to archive exist and are accessible. Create a list of their paths.
3. Open Tar Archive with Compression Mode
Use tarfile.open(output_filename, mode='w:gz') (or w:bz2, w:xz) within a with statement to create the archive. This handles file closing automatically.
4. Add Files/Directories to Archive
Iterate through your list of source paths and use tar.add(path, arcname=...) to add each item. Customize arcname for cleaner archive structures.
5. Verify Archive (Optional)
After creation, you can optionally open the archive in read mode ('r:gz') and list its contents to verify that all files were added correctly.
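The optional verification step might look like the following sketch. It builds a small archive from a made-up site directory, reopens it in read mode, and compares the stored names against the set the caller expects. The directory layout and the expected set are assumptions of the example.

```python
import os
import tarfile
import tempfile

# Build an illustrative source tree: site/index.html and site/css/main.css
workdir = tempfile.mkdtemp()
source_dir = os.path.join(workdir, "site")
os.makedirs(os.path.join(source_dir, "css"))
for rel in ("index.html", os.path.join("css", "main.css")):
    with open(os.path.join(source_dir, rel), "w") as f:
        f.write("placeholder\n")

archive = os.path.join(workdir, "site.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(source_dir, arcname="site")

# Verification: reopen in read mode and compare against what we expect.
expected = {"site", "site/css", "site/index.html", "site/css/main.css"}
with tarfile.open(archive, "r:gz") as tar:
    found = set(tar.getnames())
    for member in tar.getmembers():
        kind = "dir " if member.isdir() else "file"
        print(f"{kind} {member.size:>6} {member.name}")

missing = expected - found
print("missing:", sorted(missing))
```

Listing names confirms which members are present; it does not compare file contents against the originals.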