Convert XLSX to CSV correctly using python

Learn convert xlsx to csv correctly using python with practical examples, diagrams, and best practices. Covers python, excel, python-2.7 development techniques with visual explanations.

Converting XLSX to CSV Correctly with Python

Hero image for Convert XLSX to CSV correctly using python

Learn how to accurately convert Excel XLSX files to CSV format using Python, addressing common pitfalls like encoding, delimiters, and data types.

Converting data from Excel's proprietary XLSX format to the universally accessible CSV (Comma Separated Values) format is a common task in data processing. While it might seem straightforward, ensuring data integrity, correct encoding, and proper handling of various data types can be challenging. This article will guide you through the process using Python, focusing on robust methods to avoid common conversion errors.

Understanding the Challenges of XLSX to CSV Conversion

Excel files (XLSX) are complex structures that can contain multiple sheets, rich formatting, formulas, and various data types. CSV files, on the other hand, are plain text files where values are separated by a delimiter (commonly a comma). The conversion process needs to address several key challenges:

  1. Multiple Sheets: An XLSX file can have many sheets, but a CSV file typically represents data from a single sheet.
  2. Encoding: Ensuring the output CSV uses the correct character encoding (e.g., UTF-8) to prevent data corruption, especially with non-ASCII characters.
  3. Delimiters: Choosing an appropriate delimiter that doesn't conflict with data within the cells.
  4. Data Types and Formatting: Excel stores numbers, dates, and text with specific formatting. CSVs store everything as text, so proper conversion is crucial.
  5. Formulas: Formulas in Excel are not directly transferable to CSV; only their calculated values can be exported.
  6. Large Files: Efficiently handling large Excel files without running out of memory.
flowchart TD
    A[Start: XLSX File] --> B{Select Sheet}
    B --> C[Read Data from Sheet]
    C --> D{Handle Data Types & Encoding}
    D --> E[Choose Delimiter]
    E --> F[Write to CSV File]
    F --> G[End: CSV File]
    B -- Optional --> H[Iterate through Sheets]
    H --> C

General workflow for converting XLSX to CSV

Using pandas for Robust Conversion

The pandas library is the de facto standard for data manipulation in Python and provides excellent tools for handling Excel and CSV files. It simplifies the conversion process significantly by abstracting away many of the complexities.

import pandas as pd

def convert_xlsx_to_csv_pandas(xlsx_file_path, csv_file_path, sheet_name=0, encoding='utf-8', sep=','):
    """
    Converts a specified sheet from an XLSX file to a CSV file using pandas.

    Args:
        xlsx_file_path (str): Path to the input XLSX file.
        csv_file_path (str): Path for the output CSV file.
        sheet_name (str or int): Name or index of the sheet to convert. Defaults to the first sheet (0).
        encoding (str): Encoding for the output CSV file. Defaults to 'utf-8'.
        sep (str): Delimiter for the output CSV file. Defaults to ','.
    """
    try:
        # Read the specified sheet from the XLSX file
        df = pd.read_excel(xlsx_file_path, sheet_name=sheet_name)

        # Write the DataFrame to a CSV file
        df.to_csv(csv_file_path, index=False, encoding=encoding, sep=sep)
        print(f"Successfully converted '{xlsx_file_path}' (sheet: {sheet_name}) to '{csv_file_path}'")
    except FileNotFoundError:
        print(f"Error: XLSX file not found at '{xlsx_file_path}'")
    except ValueError as e:
        print(f"Error: {e}. Check sheet_name or file format.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

# Example Usage:
# Create a dummy XLSX file for demonstration
# data = {'Col1': [1, 2, 3], 'Col2': ['A', 'B', 'C'], 'Col3': ['你好', '世界', 'Python']}
# df_dummy = pd.DataFrame(data)
# df_dummy.to_excel('example.xlsx', index=False, sheet_name='Sheet1')

# convert_xlsx_to_csv_pandas('example.xlsx', 'output.csv', sheet_name='Sheet1')
# convert_xlsx_to_csv_pandas('example.xlsx', 'output_tab_delimited.csv', sheet_name=0, sep='\t')

Python code using pandas to convert an XLSX sheet to CSV.

Handling Specific Scenarios

Sometimes, you might encounter specific requirements or issues during conversion. Here's how to address some common ones.

Converting All Sheets to Separate CSVs

If your XLSX file contains multiple sheets and you need each sheet to be its own CSV file, you can modify the pandas approach.

import pandas as pd
import os

def convert_all_sheets_to_csv(xlsx_file_path, output_dir, encoding='utf-8', sep=','):
    """
    Converts all sheets from an XLSX file to separate CSV files.

    Args:
        xlsx_file_path (str): Path to the input XLSX file.
        output_dir (str): Directory where the output CSV files will be saved.
        encoding (str): Encoding for the output CSV files. Defaults to 'utf-8'.
        sep (str): Delimiter for the output CSV files. Defaults to ','.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    try:
        xls = pd.ExcelFile(xlsx_file_path)
        for sheet_name in xls.sheet_names:
            df = xls.parse(sheet_name)
            csv_file_path = os.path.join(output_dir, f"{sheet_name}.csv")
            df.to_csv(csv_file_path, index=False, encoding=encoding, sep=sep)
            print(f"Converted sheet '{sheet_name}' to '{csv_file_path}'")
    except FileNotFoundError:
        print(f"Error: XLSX file not found at '{xlsx_file_path}'")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

# Example Usage:
# convert_all_sheets_to_csv('example.xlsx', 'output_csvs')

Python code to convert all sheets from an XLSX file to individual CSVs.

Dealing with Large Files and Memory Efficiency

For very large Excel files, reading the entire file into memory at once can lead to MemoryError. pandas offers ways to read files in chunks, which can be more memory-efficient.

import pandas as pd

def convert_large_xlsx_to_csv_chunked(xlsx_file_path, csv_file_path, sheet_name=0, chunksize=10000, encoding='utf-8', sep=','):
    """
    Converts a specified sheet from a large XLSX file to a CSV file in chunks.

    Args:
        xlsx_file_path (str): Path to the input XLSX file.
        csv_file_path (str): Path for the output CSV file.
        sheet_name (str or int): Name or index of the sheet to convert. Defaults to the first sheet (0).
        chunksize (int): Number of rows to read at a time. Defaults to 10000.
        encoding (str): Encoding for the output CSV file. Defaults to 'utf-8'.
        sep (str): Delimiter for the output CSV file. Defaults to ','.
    """
    try:
        # Use iterator=True and get_chunksize to read in chunks
        reader = pd.read_excel(xlsx_file_path, sheet_name=sheet_name, chunksize=chunksize)

        # Initialize a flag to write header only once
        header_written = False

        for i, chunk in enumerate(reader):
            if not header_written:
                chunk.to_csv(csv_file_path, index=False, encoding=encoding, sep=sep, mode='w')
                header_written = True
            else:
                chunk.to_csv(csv_file_path, index=False, encoding=encoding, sep=sep, mode='a', header=False)
            print(f"Processed chunk {i+1}...")
        print(f"Successfully converted large '{xlsx_file_path}' to '{csv_file_path}'")

    except FileNotFoundError:
        print(f"Error: XLSX file not found at '{xlsx_file_path}'")
    except ValueError as e:
        print(f"Error: {e}. Check sheet_name or file format.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

# Example Usage:
# convert_large_xlsx_to_csv_chunked('large_example.xlsx', 'large_output.csv', sheet_name=0)

Python code for chunked conversion of large XLSX files to CSV.