Molecules

Properties

Each property maps one CSV column to one typed value stored against the molecule.

from lcmd_db.core.lib.importers import (
    bool_prop, float_prop, int_prop, str_prop,
)

molecule_properties = [
    str_prop("Structure ID", col="structure_id", required=True),
    float_prop("HOMO", col="HOMO_eV", units="eV", required=True),
    float_prop("LUMO", col="LUMO_eV", units="eV"),
    int_prop("Num atoms", col="n_atoms"),
    bool_prop("Is stable", col="stable"),
]

| Argument | What it does |
| --- | --- |
| first arg | Display name shown in the UI and the Python client |
| `col` | CSV column to read. Defaults to the display name when omitted |
| `units` | Free-form string surfaced in the UI (e.g. `"eV"`, `"kcal/mol"`) |
| `doc` | Long description for the property page |
| `required` | Fail the import if the column is missing or a row's value is null |
| `default` | Substitute when the column value is null (only with `required=False`) |
| `slug` | Override the auto-generated identifier (defaults to a slugified version of the display name). When you set it explicitly, it must be a valid `snake_case` Python identifier |
| `validation` | `ValidationRules` instance for value bounds (see `lcmd_db.core.properties`) |
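
The constraint on an explicit `slug` can be checked before an import runs. The sketch below is illustrative only: `is_valid_slug` is a hypothetical helper, not part of `lcmd_db`, and the exact rule the library enforces may differ (for example in how it treats digits or leading underscores).

```python
import keyword
import re

# One common reading of "snake_case": lowercase words made of letters and
# digits, joined by single underscores, starting with a letter.
_SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_valid_slug(slug: str) -> bool:
    """Return True if `slug` is snake_case and usable as a Python identifier."""
    return (
        _SNAKE_CASE.fullmatch(slug) is not None
        and slug.isidentifier()
        and not keyword.iskeyword(slug)
    )
```

Under this reading `homo_ev` passes, while `HOMO_eV`, `num atoms` and `class` do not.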

required=True means every row must have a non-null value in that column. Use it sparingly: a single null cell aborts the whole import.
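
The interaction between `required` and `default` can be pictured with a small stand-alone function. This is a sketch of the semantics described in the table, not the library's code: `resolve_cell` is a hypothetical name, and rejecting `default` together with `required=True` is an assumption about how the combination is handled.

```python
from typing import Any, Optional


def resolve_cell(
    name: str,
    value: Optional[Any],
    *,
    required: bool = False,
    default: Optional[Any] = None,
) -> Optional[Any]:
    """Apply required/default semantics to a single cell value."""
    if required and default is not None:
        # Assumption: the table says `default` only applies with required=False.
        raise ValueError(f"{name}: default cannot be combined with required=True")
    if value is not None:
        return value
    if required:
        # One null cell in a required column is enough to fail.
        raise ValueError(f"{name}: required value is null")
    return default
```

With this model, an optional property with `default=0.0` turns a null cell into `0.0`, while the same null cell in a required property raises.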

Fetching the source data

Pick the constructor matching where the data lives. All four return the same shape, so the rest of the pipeline doesn't care which one you use.

from lcmd_db.core.lib.importers.steps import http

fetch = http(
    url="https://example.com/data.csv",
    filename="data.csv",  # name to save as in ctx.data_dir
)

from lcmd_db.core.lib.importers.steps import github

fetch = github(
    repo="lcmd-epfl/my-data",
    ref="main",
    filename="my-data.zip",  # local zip filename — the whole repo is fetched as a zipball
)

from lcmd_db.core.lib.importers.steps import materials_cloud

fetch = materials_cloud(
    record_id="abcde-12345",
    files=["data.csv", "structures.tar.gz"],
)

from lcmd_db.core.lib.importers.steps import local

# Files and directories both work — use directories for vendored XYZ folders.
fetch = local(paths=["/absolute/path/to/data.csv", "/absolute/path/to/xyz/"])

Archives (.tar.gz, .zip, ...) auto-extract into ctx.data_dir for materials_cloud and github. For http, pass auto_unarchive=True explicitly. The fetched-and-extracted state is cached, so re-running an import skips the download.
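
The extract-once behaviour can be approximated with the standard library. The sketch below assumes a marker file next to the extracted data; `unpack_once` and the `.extracted` suffix are hypothetical, and the library's own cache bookkeeping may work differently.

```python
import shutil
from pathlib import Path

# Hypothetical marker suffix used to remember that an archive was extracted.
MARKER_SUFFIX = ".extracted"


def unpack_once(archive: Path, data_dir: Path) -> bool:
    """Extract `archive` into `data_dir` unless a previous run already did.

    Returns True if extraction happened, False if the cached state was reused.
    """
    marker = data_dir / f"{archive.name}{MARKER_SUFFIX}"
    if marker.exists():
        return False  # cached: nothing to do
    data_dir.mkdir(parents=True, exist_ok=True)
    # The archive format (.zip, .tar.gz, ...) is inferred from the extension.
    shutil.unpack_archive(str(archive), str(data_dir))
    marker.touch()
    return True
```

Calling `unpack_once` a second time with the same archive and directory returns `False` without touching the files.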

Attaching structure files

attach_structures resolves a per-row file path from a Python str.format-style pattern, filling each placeholder with the row's value for that column:

from lcmd_db.core.lib.importers.steps import attach_structures, read_csv

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
)

Given a row with structure_id="mol_0042", the step looks for xyz/mol_0042.xyz under ctx.data_dir. The path is stored in a column (_structure by default) consumed by load_molecules.

| Argument | Default | Purpose |
| --- | --- | --- |
| `pattern` | required | Path template. `{col}` placeholders are filled from each row |
| `column` | `_structure` | Column that stores the resolved path. Override when attaching multiple structures |
| `required` | `True` | A missing file is reported as a per-row error: the row is dropped under `on_error="collect"` (default), or the import aborts under `on_error="raise"`. Set `False` for sparsely available structures |

The directory is listed once into a set, and each row is then resolved with a single set lookup. After that initial listing the step is O(rows), regardless of folder size.
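
A rough stand-alone model of this lookup strategy follows. It is a sketch, not the step's implementation: `resolve_structures` is a hypothetical function, rows are plain dicts, and placeholders are assumed to appear only in the file name, not in the directory part of the pattern.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional


def resolve_structures(
    rows: List[Dict[str, str]],
    pattern: str,
    data_dir: Path,
    column: str = "_structure",
) -> List[Dict[str, Optional[str]]]:
    """Fill `column` with the resolved relative path, or None if the file is missing."""
    # List the target directory once; set membership tests are O(1) on average.
    folder = data_dir / Path(pattern).parent
    available = set(os.listdir(folder)) if folder.is_dir() else set()

    resolved = []
    for row in rows:
        rel_path = pattern.format(**row)  # "{structure_id}" -> row["structure_id"]
        found = Path(rel_path).name in available
        resolved.append({**row, column: rel_path if found else None})
    return resolved
```

The alternative, calling `os.path.exists` once per row, issues one filesystem call per row; listing once and checking a set keeps the per-row work in memory.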

Deriving chemistry

derive_chemistry populates the columns load_molecules expects, working from a SMILES column, an XYZ structure column, or both. Run it whenever the row has either: it lets the bulk insert skip Molecule.save(), which makes large imports roughly an order of magnitude faster.

from lcmd_db.core.lib.importers.steps import derive_chemistry

derive = derive_chemistry(smiles_col="SMILES")              # SMILES-driven
derive = derive_chemistry(structure_col="_structure")       # XYZ-driven
derive = derive_chemistry(                                  # both — SMILES wins
    smiles_col="SMILES",
    structure_col="_structure",
)

Produces: standardized_smiles, canonical_smiles, inchi, inchi_key, selfies, molecular_formula, molecular_weight.

XYZ → SMILES inference

When SMILES is missing for a row but structure_col resolves, derive_chemistry runs xyz_to_smiles on the XYZ block first, then feeds the resulting SMILES into the rest of the chain. The default is RDKitXyzToSmiles() (RDKit DetermineBonds, neutral charge). It is best-effort: bond perception recovers connectivity but not stereochemistry. Rows where conversion fails report a per-row error and proceed with smiles=None.
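
The per-row precedence described above can be summarised in a few lines. This is a simplified model, not the library's code: `resolve_smiles` is a hypothetical function, and it assumes the converter signals failure by raising an exception.

```python
from typing import Callable, List, Optional, Tuple

XyzToSmiles = Callable[[str], str]


def resolve_smiles(
    smiles: Optional[str],
    xyz_block: Optional[str],
    xyz_to_smiles: XyzToSmiles,
) -> Tuple[Optional[str], List[str]]:
    """Return (smiles, errors) for one row."""
    if smiles:
        return smiles, []  # an explicit SMILES always wins
    if xyz_block is None:
        return None, []  # nothing to infer from
    try:
        return xyz_to_smiles(xyz_block), []
    except Exception as exc:
        # Conversion failed: record a per-row error and carry on with None.
        return None, [f"xyz_to_smiles failed: {exc}"]
```

The converter is only ever called for rows that have no SMILES of their own, so a source that supplies SMILES for every row pays nothing for the fallback.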

Override xyz_to_smiles to swap the inference strategy. The contract is a single callable: pass a function, or an instance of any class that defines __call__:

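from lcmd_db.apps.molecules.services.conversion import RDKitXyzToSmiles
from lcmd_db.core.lib.importers.steps import derive_chemistry

# Built-in strategy with a non-default charge.
derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=RDKitXyzToSmiles(charge=-1),     # anionic species
)

# Custom strategy: my_inference stands in for your own callable that takes
# the XYZ block and returns a SMILES string.
derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=my_inference,
)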
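
Because the contract is just a callable, strategies can be combined without touching the pipeline. The class below is a hypothetical example of the "class with `__call__`" form: it tries several strategies in order and returns the first result. It assumes each strategy takes the XYZ block as a string and raises on failure, which may not match how the built-in converter reports errors.

```python
from typing import Callable, List, Sequence


class FirstSuccessful:
    """Try several XYZ -> SMILES strategies in order; return the first result."""

    def __init__(self, strategies: Sequence[Callable[[str], str]]) -> None:
        self.strategies = list(strategies)

    def __call__(self, xyz: str) -> str:
        errors: List[str] = []
        for strategy in self.strategies:
            try:
                return strategy(xyz)
            except Exception as exc:
                # Remember why this strategy failed, then try the next one.
                label = getattr(strategy, "__name__", type(strategy).__name__)
                errors.append(f"{label}: {exc}")
        raise ValueError("all strategies failed: " + "; ".join(errors))
```

An instance could then be passed as `xyz_to_smiles=FirstSuccessful([...])`, for example to try a neutral charge before an anionic one, provided the wrapped converters do raise when they fail.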

To run the resolution standalone (e.g. inspect the filled SMILES column before deriving the rest), compose resolve_smiles_from_xyz directly:

from lcmd_db.core.lib.importers.steps import (
    attach_structures,
    read_csv,
    resolve_smiles_from_xyz,
)

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
    >> resolve_smiles_from_xyz(structure_col="_structure")
)
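
The `>>` chaining used in these pipelines can be thought of as left-to-right function composition. The toy `Step` class below is only a mental model: the real steps also receive an import context and handle errors, and nothing here reflects how `lcmd_db` implements the operator.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, object]]


class Step:
    """Wrap a rows -> rows function so steps can be chained with `>>`."""

    def __init__(self, fn: Callable[[Rows], Rows]) -> None:
        self.fn = fn

    def __call__(self, rows: Rows) -> Rows:
        return self.fn(rows)

    def __rshift__(self, other: "Step") -> "Step":
        # `a >> b` builds a new step that runs a first, then feeds its output to b.
        return Step(lambda rows: other(self(rows)))


# Two toy steps: add a path column, then keep only rows that have one.
add_path = Step(
    lambda rows: [{**r, "_structure": f"xyz/{r['structure_id']}.xyz"} for r in rows]
)
keep_with_path = Step(lambda rows: [r for r in rows if r["_structure"]])

pipeline = add_path >> keep_with_path
```

Because `>>` returns another `Step`, a partial pipeline can be stored in a variable, inspected on a few rows, and extended later.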

Loading

load_molecules does the bulk insert. Defaults assume the upstream pipeline ran derive_chemistry, so the only argument you usually pass is smiles_col:

from lcmd_db.core.lib.importers.steps import load_molecules

load = load_molecules(smiles_col="SMILES")

When you need to override more fields:

load = load_molecules(
    smiles_col="SMILES",
    name_col="structure_id",   # use a CSV column as the molecule's display name
    structure_col=None,        # disable structure attachment for this subset
)

Putting it together

Reference examples:

apps/backend/lcmd_db/registry/subsets/oscar/oscar_nhc.py — SMILES-driven: 30 properties, archive fetch, single-CSV parse, full chemistry derive, straightforward load.
apps/backend/lcmd_db/registry/subsets/spahm/spahm_qm7.py — XYZ-only QM7: no SMILES column in the source data, structures come from XYZ files. Pair attach_structures with derive_chemistry(structure_col="_structure") to get the full chemistry suite (SMILES, InChI, formula, …) inferred from the XYZ blocks.