Creating datasets
Define a subset and import it into the LCMD DB
Subset vs dataset. A subset is one importable unit
(e.g. OSCARNHC) — one CSV (or several) plus optional structure files,
mapped onto molecules, reactions, or fragments. A dataset is a
higher-level grouping of subsets that share citations and metadata (e.g.
OSCAR groups OSCARNHC + OSCARDHBD + OSCARSEED). You almost always
want a subset; this page covers what every subset shares, and the
per-entity pages have the working examples.
The pipeline
Every subset declares a four-stage pipeline. Each stage is composed from typed
steps with the >> operator.
pipeline(
fetch=..., # download CSV/archives → ctx.data_dir (cached)
parse=..., # read CSV(s) → ctx.rows (Polars frame), attach structure files
derive=..., # compute SMILES-derived fields (InChI, formula, MW…)
load=..., # write Molecule / Reaction / Fragment rows in one transaction
)
The runner enforces stage ordering at the type level: you cannot wire load before parse.
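As a rough illustration of how composition with >> can be built in Python, here is a minimal sketch. The Step class and the read_rows / strip_smiles names are hypothetical and are not the LCMD DB API. The real steps are typed and operate on a pipeline context, and this sketch does not attempt the type-level ordering check.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Step:
    """Hypothetical step wrapper: one function from a context dict to a context dict."""

    fn: Callable[[dict[str, Any]], dict[str, Any]]

    def __rshift__(self, other: Step) -> Step:
        # `a >> b` builds a new step that runs a, then feeds its result to b.
        return Step(lambda ctx: other.fn(self.fn(ctx)))

    def __call__(self, ctx: dict[str, Any]) -> dict[str, Any]:
        return self.fn(ctx)


# Two illustrative steps (names are assumptions, not real LCMD DB steps).
read_rows = Step(lambda ctx: {**ctx, "rows": [{"smiles": " CCO "}, {"smiles": "C"}]})
strip_smiles = Step(
    lambda ctx: {**ctx, "rows": [{"smiles": r["smiles"].strip()} for r in ctx["rows"]]}
)

# Compose left to right, then run the combined stage on an empty context.
parse_stage = read_rows >> strip_smiles
result = parse_stage({})
print(result["rows"])
```

Overloading __rshift__ keeps each stage readable as a left-to-right chain while every step stays an independently testable function.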
Register and test
Each subset lives in its own module under apps/backend/lcmd_db/registry/subsets/
and is registered in that package's __init__.py:
from .my_subset import my_subset
registry.register("MySubset", my_subset)
The string passed to register(...) is the key that --subset matches on, and it
is case-sensitive: --subset MySubset works, --subset mysubset does
not.
Then run it locally:
Run with a row limit
just manage import_subset --subset MySubset --limit 5
--limit 5 is the iteration default: it is fast and still exercises every stage. Drop it for a full import once you're happy.
Try it without writing
just manage import_subset --subset MySubset --dry-run
This wraps the import in a transaction that rolls back and turns file storage into a no-op (no XYZ uploads). Use it when you've changed the pipeline shape and don't want a half-imported subset on disk or in the DB.
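The rollback idea behind a dry run can be sketched with the standard-library sqlite3 module. This is only an analogy: the actual command runs against the project's own database layer, and the molecule table and run_import function here are made up for illustration.

```python
import sqlite3


def run_import(conn: sqlite3.Connection, smiles: list[str], dry_run: bool) -> int:
    """Insert rows, then commit or roll back; return the row count seen mid-transaction."""
    try:
        # The first INSERT implicitly opens a transaction on this connection.
        conn.executemany("INSERT INTO molecule (smiles) VALUES (?)", [(s,) for s in smiles])
        seen = conn.execute("SELECT COUNT(*) FROM molecule").fetchone()[0]
    except Exception:
        conn.rollback()
        raise
    if dry_run:
        conn.rollback()  # every stage ran, but nothing is kept
    else:
        conn.commit()
    return seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecule (smiles TEXT)")
conn.commit()

seen = run_import(conn, ["CCO", "C"], dry_run=True)
remaining = conn.execute("SELECT COUNT(*) FROM molecule").fetchone()[0]
print(seen, remaining)  # rows visible inside the transaction, rows left after rollback
```

The import code path is identical in both modes; only the final commit-or-rollback decision differs, which is why a dry run still exercises the full pipeline.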
Verify the data landed
from lcmd_db.apps.subsets.models import Subset
Subset.objects.get(name="MySubset").molecules.count()
Fetched files cache under ctx.data_dir, a per-subset directory keyed by the
SHA-256 of the Source URL (or name). The base path comes from
lcmd_app_settings.data_dir and can be overridden with --data-dir. Re-runs
reuse the cache, so iteration is fast.
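To make the cache keying concrete, here is a minimal sketch of deriving a directory from a SHA-256 digest. The cache_dir_for helper, the base path, and the use of the full hex digest as the directory name are assumptions; the actual layout under data_dir may differ.

```python
import hashlib
from pathlib import Path


def cache_dir_for(base: Path, source_key: str) -> Path:
    """Return a cache directory derived from the source URL (or name)."""
    digest = hashlib.sha256(source_key.encode("utf-8")).hexdigest()
    return base / digest


base = Path("/tmp/lcmd-data")  # placeholder for the configured data_dir
first = cache_dir_for(base, "https://example.org/my_subset.csv")
second = cache_dir_for(base, "https://example.org/my_subset.csv")

# The same key always maps to the same directory, so re-runs find the cached files.
print(first == second)
```

Because the key is derived from the URL, changing the Source URL points the subset at a fresh, empty cache directory.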
Re-importing a subset is idempotent at the schema level but
additive at the row level. The Source / Subset / property-definition /
fragment-type rows use update_or_create, but molecules, reactions, and
fragments go through bulk_create with no dedup. Wipe the subset's data via
the admin (or a fresh DB) before re-running, or stick to --dry-run while
iterating.
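The difference between the two behaviours can be shown with plain in-memory stand-ins. This is a simplified sketch with no Django involved: the dict mimics a keyed upsert like update_or_create, and the list mimics bulk_create without dedup.

```python
# In-memory stand-ins for two tables (illustrative only).
subsets: dict[str, dict] = {}  # keyed by name, like an upsert
molecules: list[dict] = []     # plain append, no dedup


def import_subset(name: str, smiles: list[str]) -> None:
    # Upsert: a second run overwrites the same key instead of adding a row.
    subsets[name] = {"name": name}
    # Append: a second run adds the same molecules again.
    molecules.extend({"subset": name, "smiles": s} for s in smiles)


import_subset("MySubset", ["CCO", "C"])
import_subset("MySubset", ["CCO", "C"])

print(len(subsets), len(molecules))  # 1 subset row, 4 molecule rows
```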
import_subset --all runs every registered subset in registry insertion
order (the order of register(...) calls in
subsets/__init__.py).
There's no inter-subset dependency resolution; if you need a particular order,
register accordingly.
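A dict-backed registry gives both properties described on this page: exact-match (case-sensitive) lookup and iteration in registration order. The Registry class below is a hypothetical sketch, not the project's implementation.

```python
class Registry:
    """Hypothetical dict-backed registry; the real class may differ."""

    def __init__(self) -> None:
        # Python dicts preserve insertion order, so iteration follows register() calls.
        self._subsets: dict[str, object] = {}

    def register(self, name: str, subset: object) -> None:
        self._subsets[name] = subset

    def get(self, name: str) -> object:
        # Exact-match lookup: "zeta" does not find "Zeta".
        return self._subsets[name]

    def names(self) -> list[str]:
        return list(self._subsets)


registry = Registry()
registry.register("Zeta", object())
registry.register("Alpha", object())

print(registry.names())  # registration order, not alphabetical
```

If one subset must run before another under --all, moving its register(...) call earlier is the only lever this design offers.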
What to read next
Pick the entity you're importing.
Molecules
CSV + optional XYZ folder. The common case.
Reactions
Multiple participants per row, optional energy profile, display config.
Fragments
Pools of building blocks plus an assembler function that composes them.
Advanced pipelines
Multi-CSV joins, custom steps, error modes, dataset grouping.
Tests
Every registered subset is automatically picked up by
apps/backend/lcmd_db/registry/subsets/tests/test_registry_slugs.py, which
parametrizes over the registry to assert property-slug uniqueness and
identifier validity. Your subset will run there as soon as it's registered —
no extra wiring needed. Run with:
just manage test apps.subsets
Pushing to production
Local imports hit your dev database. To run a subset against the production DB, a maintainer launches a Kubernetes Job — see Dataset imports. Open a PR with the subset module, the register(...) call, and (if grouping into a dataset) the BaseDataset subclass with its register_dataset(...) call — see Advanced pipelines. Then ping a maintainer.