Creating datasets

Subset vs dataset. A subset is one importable unit (e.g. OSCARNHC): one or more CSV files plus optional structure files, mapped onto molecules, reactions, or fragments. A dataset is a higher-level grouping of subsets that share citations and metadata (e.g. OSCAR groups OSCARNHC + OSCARDHBD + OSCARSEED). In almost all cases you want a subset. This page covers what every subset shares; the per-entity pages have the working examples.

The pipeline

Every subset declares a four-stage pipeline. Each stage is composed from typed steps with the >> operator.

pipeline(
    fetch=...,   # download CSV/archives → ctx.data_dir (cached)
    parse=...,   # read CSV(s) → ctx.rows (Polars frame), attach structure files
    derive=...,  # compute SMILES-derived fields (InChI, formula, MW…)
    load=...,    # write Molecule / Reaction / Fragment rows in one transaction
)

The runner enforces stage ordering at the type level — you cannot wire load before parse.
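To make the idea of typed steps composed with >> concrete, here is a minimal sketch of how such composition can work in plain Python. The Step class and the two example steps are hypothetical and are not the project's actual step types; the sketch only illustrates how a static type checker can reject a chain whose output and input types do not line up.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


@dataclass(frozen=True)
class Step(Generic[A, B]):
    """Hypothetical typed step: a named function from A to B."""

    name: str
    fn: Callable[[A], B]

    def __rshift__(self, other: "Step[B, C]") -> "Step[A, C]":
        # The output type of `self` must match the input type of `other`.
        # A type checker such as mypy or pyright flags mismatched chains.
        return Step(f"{self.name} >> {other.name}", lambda x: other.fn(self.fn(x)))

    def __call__(self, value: A) -> B:
        return self.fn(value)


# Hypothetical steps for a parse-like stage: raw CSV text -> lines -> data lines.
split_lines: Step[str, list[str]] = Step("split_lines", lambda text: text.splitlines())
drop_header: Step[list[str], list[str]] = Step("drop_header", lambda rows: rows[1:])

parse = split_lines >> drop_header
rows = parse("smiles,name\nCCO,ethanol\nC,methane")
print(parse.name)  # split_lines >> drop_header
print(rows)        # ['CCO,ethanol', 'C,methane']
```

Writing drop_header >> split_lines instead would still run until the first call, but a type checker would report it, because drop_header produces a list while split_lines expects a string.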

Register and test

Each subset lives in its own module under apps/backend/lcmd_db/registry/subsets/ and is registered in that package's __init__.py:

apps/backend/lcmd_db/registry/subsets/__init__.py

from .my_subset import my_subset

registry.register("MySubset", my_subset)

The string passed to register(...) is the key --subset matches on, and it is case-sensitive — --subset MySubset works, --subset mysubset does not.
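The case sensitivity follows naturally if the registry is an exact-match mapping from name to subset. The sketch below assumes a dict-backed registry; the Registry class and its get method are hypothetical stand-ins, not the project's actual implementation.

```python
class Registry:
    """Hypothetical stand-in for the subset registry: a dict keyed by name."""

    def __init__(self) -> None:
        self._subsets: dict[str, object] = {}

    def register(self, name: str, subset: object) -> None:
        self._subsets[name] = subset

    def get(self, name: str) -> object:
        # Exact-match lookup with no case folding: "mysubset" != "MySubset".
        try:
            return self._subsets[name]
        except KeyError:
            known = sorted(self._subsets)
            raise KeyError(f"unknown subset {name!r}; known: {known}") from None


registry = Registry()
my_subset = object()  # placeholder for a real subset definition
registry.register("MySubset", my_subset)

found = registry.get("MySubset")  # returns the registered object
try:
    registry.get("mysubset")
    lowercase_found = True
except KeyError:
    lowercase_found = False  # the lowercase spelling is not a registered key
```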

Then run it locally:

Run with a row limit

just manage import_subset --subset MySubset --limit 5

--limit 5 is a good default while iterating: it is fast and still exercises every stage. Drop it for a full import once you're happy.

Try it without writing

just manage import_subset --subset MySubset --dry-run

--dry-run wraps the import in a transaction that is rolled back at the end and turns file storage into a no-op (no XYZ uploads). Use it when you've changed the pipeline's shape and don't want a half-imported subset on disk or in the DB.

Verify the data landed

just manage shell

from lcmd_db.apps.subsets.models import Subset
Subset.objects.get(name="MySubset").molecules.count()

Fetched files are cached under ctx.data_dir, a per-subset directory keyed by the SHA-256 hash of the Source URL (or name). The base path comes from lcmd_app_settings.data_dir and can be overridden with --data-dir. Re-runs reuse the cache, so iteration is fast.
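As an illustration of a cache directory keyed by a SHA-256 hash, the following sketch derives a stable per-source path from a base directory. The function name subset_cache_dir, the example URLs, and the use of the full hex digest as the directory name are assumptions; the real layout under ctx.data_dir may differ.

```python
import hashlib
from pathlib import Path


def subset_cache_dir(base_dir: Path, source_key: str) -> Path:
    """Return a cache directory derived from a source URL or name.

    Hypothetical helper: the actual implementation may truncate the digest
    or add a prefix.
    """
    digest = hashlib.sha256(source_key.encode("utf-8")).hexdigest()
    return base_dir / digest


base = Path("/tmp/lcmd-data")  # placeholder base path
first = subset_cache_dir(base, "https://example.org/my_subset.csv")
again = subset_cache_dir(base, "https://example.org/my_subset.csv")
other = subset_cache_dir(base, "https://example.org/other_subset.csv")

# The same key always maps to the same directory, so a re-run finds the
# files fetched earlier; a different key gets its own directory.
same_dir = first == again
distinct_dir = first != other
```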

Re-importing a subset is idempotent at the schema level but additive at the row level. The Source, Subset, property-definition, and fragment-type rows use update_or_create, so they are updated in place. Molecules, reactions, and fragments go through bulk_create with no deduplication, so each re-run inserts a second copy of every row. Wipe the subset's data via the admin (or start from a fresh DB) before re-running, or stick to --dry-run while iterating.
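The difference between the two write paths can be seen in a small in-memory simulation. The tables and helper functions below are hypothetical stand-ins that mimic the semantics of Django's update_or_create and bulk_create; they are not the project's models or import code.

```python
# In-memory stand-ins for two kinds of tables (not the real models).
subsets: dict[str, dict] = {}   # keyed rows, written with update_or_create semantics
molecules: list[dict] = []      # unkeyed rows, written with bulk_create semantics


def update_or_create(table: dict, key: str, defaults: dict) -> tuple[dict, bool]:
    """Update the row stored under `key`, or create it if it is missing."""
    created = key not in table
    table[key] = {**table.get(key, {}), **defaults}
    return table[key], created


def bulk_create(table: list, rows: list[dict]) -> None:
    """Append every row unconditionally, with no duplicate check."""
    table.extend(rows)


def import_subset(name: str, smiles: list[str]) -> None:
    update_or_create(subsets, name, {"description": "demo"})
    bulk_create(molecules, [{"subset": name, "smiles": s} for s in smiles])


import_subset("MySubset", ["CCO", "C"])
import_subset("MySubset", ["CCO", "C"])  # second run of the same import

print(len(subsets))    # 1: the Subset row was updated in place
print(len(molecules))  # 4: every molecule was inserted twice
```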

import_subset --all runs every registered subset in registry insertion order, i.e. the order of the register(...) calls in subsets/__init__.py. There's no inter-subset dependency resolution; if one subset must be imported before another, register it first.

Tests

Every registered subset is automatically picked up by apps/backend/lcmd_db/registry/subsets/tests/test_registry_slugs.py, which parametrizes over the registry to assert property-slug uniqueness and identifier validity. Your subset will run there as soon as it's registered — no extra wiring needed. Run with:

just manage test apps.subsets

Pushing to production

Local imports hit your dev database. To run a subset against the production DB, a maintainer launches a Kubernetes Job — see Dataset imports. Open a PR with the subset module, the register(...) call, and (if grouping into a dataset) the BaseDataset subclass with its register_dataset(...) call — see Advanced pipelines. Then ping a maintainer.

What to read next

Molecules

Reactions

Fragments

Advanced pipelines
