Specifying package data in pyproject.toml


Mastering Package Data Specification in pyproject.toml


Learn how to effectively define and manage package data, including non-code files and assets, using pyproject.toml for modern Python projects.

The pyproject.toml file has become the central configuration point for modern Python projects, replacing older files like setup.py and setup.cfg for many common tasks. It is best known for build system configuration and dependency declarations, but it also controls package data: the non-code files (images, static assets, configuration files, documentation) that a package needs in order to work once installed. This article explains how package data is declared in pyproject.toml when using setuptools, covering best practices and common pitfalls so that your Python packages ship with all necessary assets.

Understanding Package Data and its Importance

Package data refers to any files that are not Python source code but are essential for your package to function correctly or be fully usable. This can include:

  • Static assets: CSS, JavaScript, images for web frameworks.
  • Templates: HTML templates for web applications.
  • Configuration files: Default settings or examples.
  • Data files: CSVs, JSONs, or other data formats used by your package.
  • Documentation: Markdown or reStructuredText files distributed with the package.

Failing to include these files means your package might not work as expected when installed by others, leading to FileNotFoundError or incomplete functionality. Modern Python packaging, particularly with setuptools, provides robust mechanisms to declare these files within pyproject.toml.

[Figure: Conceptual flow of package data through pyproject.toml into an installed package. Source files (Python code and assets) pass through the pyproject.toml configuration and the build process into the installed package, with code and data files shown separately.]

Specifying Package Data with [tool.setuptools.package-data]

For projects that use setuptools as the build backend, the primary way to specify package data is the [tool.setuptools.package-data] table. This table associates data files with specific Python packages within your project. The keys are importable package names (or "*" to apply the patterns to every package), and the values are lists of glob patterns or file paths relative to that package's directory.

Consider a project structure like this:

my_package/
├── __init__.py
├── core.py
├── data/
│   └── config.json
└── static/
    ├── image.png
    └── style.css
README.md
pyproject.toml

To include config.json, image.png, and style.css in my_package, your pyproject.toml would look like the example below. Note that paths are relative to the package root (e.g., my_package/), not the project root.

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package"
version = "0.1.0"

[tool.setuptools.package-data]
"my_package" = ["data/*.json", "static/*"]

Example pyproject.toml showing package data specification for my_package.
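Before building, it can help to preview which files a set of patterns would pick up. The following is a minimal sketch that expands the patterns with pathlib, relative to the package directory. The helper name preview_package_data is hypothetical, and pathlib's matching only approximates what setuptools does (edge cases such as hidden files may differ), so treat the result as a sanity check rather than a guarantee.

```python
from pathlib import Path


def preview_package_data(package_dir, patterns):
    """Return the files under package_dir matched by the given glob patterns.

    Paths are returned relative to package_dir, using forward slashes.
    This only approximates setuptools' own pattern matching.
    """
    root = Path(package_dir)
    matched = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():  # skip directories; only files are package data
                matched.add(path.relative_to(root).as_posix())
    return sorted(matched)


if __name__ == "__main__":
    # Assumes the script is run from the project root that contains my_package/.
    for name in preview_package_data("my_package", ["data/*.json", "static/*"]):
        print(name)
```

Run against the example layout, this should list data/config.json and the two files under static/. A pattern that matches nothing is a hint that it was written relative to the project root instead of the package directory.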

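Including Top-Level Files in the Source Distribution

Sometimes you need to ship files that are not inside a Python package directory but sit at the project's root, such as README.md, LICENSE, or CHANGELOG.md. The [tool.setuptools] table has no include or exclude option for arbitrary files; the include and exclude keys belong to [tool.setuptools.packages.find], where they filter which packages are discovered. Top-level files are handled in other ways:

  • README.md: declare it with readme = "README.md" in the [project] table. setuptools adds the file to the source distribution (sdist) and uses its contents as the long description in the package metadata.
  • LICENSE: by default, setuptools picks up files matching its standard license patterns (such as LICENSE* and COPYING*), adds them to the sdist, and copies them into the wheel's .dist-info directory.
  • Other files (e.g., CHANGELOG.md): list them in a MANIFEST.in file at the project root, for example with the line include CHANGELOG.md.

These files become part of the sdist but, apart from license files, they are not installed into site-packages, so they cannot be read through importlib.resources after installation. Anything your code needs at runtime should live inside the package directory and be declared as package data.

Here's how you might include README.md and LICENSE:

[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my-package"
version = "0.1.0"
readme = "README.md"

[tool.setuptools.package-data]
"my_package" = ["data/*.json", "static/*"]

Including README.md through the readme field; LICENSE is picked up by the default license file patterns.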
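To confirm which top-level files actually end up in the source distribution, the archive itself can be inspected. The sketch below uses only the standard library's tarfile module; the helper names sdist_files and missing_from_sdist are hypothetical, and it assumes the usual sdist layout in which everything sits under a single name-version/ directory.

```python
import tarfile


def sdist_files(sdist_path):
    """List the files in an sdist, relative to its top-level directory."""
    names = []
    with tarfile.open(sdist_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                # Drop the leading "<name>-<version>/" path component.
                _, _, relative = member.name.partition("/")
                names.append(relative)
    return sorted(names)


def missing_from_sdist(sdist_path, expected):
    """Return the expected relative paths that are absent from the sdist."""
    present = set(sdist_files(sdist_path))
    return [name for name in expected if name not in present]
```

For the example project, a call such as missing_from_sdist("dist/my_package-0.1.0.tar.gz", ["README.md", "LICENSE"]) would return an empty list when both files were included. The archive file name here is an assumption based on the project name and version used above.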

Accessing Package Data at Runtime

Once your package is installed, you can't simply open files using relative paths like open('data/config.json'), because such paths are resolved against the current working directory, not against the location where the package was installed (e.g., site-packages). Python's importlib.resources module, and in particular its files() function (available since Python 3.9), provides a standard, cross-platform way to access data files within installed packages.

This method works reliably whether your package is installed from a wheel, an editable install, or directly from a source distribution.

Here’s an example of how to read the config.json file we specified earlier:

import importlib.resources
import json

def load_config():
    # files() returns a Traversable; the / operator walks into the package
    resource = importlib.resources.files('my_package') / 'data' / 'config.json'
    # Open through the Traversable so this also works for zipped packages
    with resource.open('r', encoding='utf-8') as f:
        return json.load(f)

if __name__ == '__main__':
    config = load_config()
    print(f"Loaded configuration: {config}")

Python code to access config.json using importlib.resources.files.
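Some libraries insist on a real filesystem path rather than an open file object. For that case, importlib.resources.as_file() yields a pathlib.Path for the duration of a with block, extracting the resource to a temporary file if the package is not stored as plain files. The following is a minimal sketch; the helper name asset_size is hypothetical and simply stands in for any code that needs a concrete path.

```python
from importlib.resources import as_file, files


def asset_size(package, *parts):
    """Return the size in bytes of a data file inside an installed package."""
    resource = files(package)
    for part in parts:
        resource = resource / part  # walk into the package one segment at a time
    # as_file() guarantees a real path on disk while the block is active.
    with as_file(resource) as path:
        return path.stat().st_size
```

With the example package installed, asset_size("my_package", "static", "style.css") would return the size of the stylesheet. The path should not be used after the with block ends, since a temporary copy may already have been removed.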

Best Practices for Package Data

Adhering to best practices ensures maintainable and robust package data handling:

  1. Keep Data with Code: Store data files logically alongside the Python modules that use them. This improves readability and makes it easier to manage related assets.
  2. Use Glob Patterns Carefully: While * is convenient, be specific with your glob patterns (e.g., data/*.json instead of data/*) to avoid accidentally including unwanted files.
  3. Test Your Package Installation: Always install your package in a clean virtual environment with pip install . and verify that all data files are present and accessible using importlib.resources. An editable install (pip install -e .) reads files straight from your source tree, so it will not reveal data files that are missing from the built distribution.
  4. Avoid Absolute Paths: Never hardcode absolute paths for data files; always rely on importlib.resources for runtime access.
  5. Document Data Files: Clearly document which data files your package expects and where they should be located within the package structure for users who might want to inspect or modify them.
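The installation check from the list above can be automated with a small smoke test. This sketch asks importlib.resources whether each expected file is present in the installed package; the helper name missing_resources is hypothetical, and the list of expected paths is something you would maintain yourself.

```python
from importlib.resources import files


def missing_resources(package, relative_paths):
    """Return the entries of relative_paths that are not files inside package."""
    root = files(package)
    missing = []
    for relative in relative_paths:
        resource = root
        for part in relative.split("/"):
            resource = resource / part
        if not resource.is_file():
            missing.append(relative)
    return missing
```

In a test suite run against a regular (non-editable) install, an assertion such as assert not missing_resources("my_package", ["data/config.json", "static/style.css"]) would fail as soon as a data file is dropped from the build.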

In summary, the workflow has four steps:

  1. Define Project Structure: Organize your project with a clear separation of Python code and data files. For example, place configuration files in a data/ subdirectory within your package.
  2. Configure pyproject.toml: Use [tool.setuptools.package-data] to specify data files relative to your Python package directory. For top-level files, declare README.md through the readme field in [project] and list any other files in MANIFEST.in.
  3. Access Data at Runtime: Use importlib.resources in your Python code to reliably load package data, ensuring your package works correctly after installation.
  4. Build and Test: Build the distributions (python -m build produces both an sdist and a wheel) and install the result in a clean virtual environment. Test every feature that relies on package data to confirm that everything is included and accessible.
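For the build-and-test step, the built wheel can also be inspected directly, since a wheel is a zip archive and only the files inside it get installed. The sketch below lists the non-Python files stored under a package's directory; the helper name data_files_in_wheel is hypothetical.

```python
import zipfile


def data_files_in_wheel(wheel_path, package):
    """List non-.py files stored under package/ inside a wheel."""
    prefix = package + "/"
    with zipfile.ZipFile(wheel_path) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith(prefix)
            and not name.endswith("/")  # skip directory entries
            and not name.endswith(".py")
        )
```

For the example project, the result should contain my_package/data/config.json and the files under my_package/static/. If the list comes back empty, the package-data patterns did not match anything at build time.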