
Molecules

Define a subset that imports molecules from a CSV plus optional XYZ structures.

Properties

Each property maps one CSV column to one typed value stored against the molecule.

from lcmd_db.core.lib.importers import (
    bool_prop, float_prop, int_prop, str_prop,
)

molecule_properties = [
    str_prop("Structure ID", col="structure_id", required=True),
    float_prop("HOMO", col="HOMO_eV", units="eV", required=True),
    float_prop("LUMO", col="LUMO_eV", units="eV"),
    int_prop("Num atoms", col="n_atoms"),
    bool_prop("Is stable", col="stable"),
]
| Argument | What it does |
| --- | --- |
| first arg | Display name shown in the UI and the Python client |
| col | CSV column to read. Defaults to the display name when omitted |
| units | Free-form string surfaced in the UI (e.g. "eV", "kcal/mol") |
| doc | Long description for the property page |
| required | Fail the import if the column is missing or a row's value is null |
| default | Substitute when the column value is null (only with required=False) |
| slug | Override the auto-generated identifier (defaults to a slugified version of the display name). When you set it explicitly, it must be a valid snake_case Python identifier |
| validation | ValidationRules instance for value bounds (see lcmd_db.core.properties) |

required=True means every row must have a value. Use it sparingly — one null cell will abort the whole import.
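
The slug, required, and default rules can be illustrated with a small standalone sketch. It does not use lcmd_db: the helper names (slugify, is_valid_slug, resolve_cell) are hypothetical, the exact slugification rule is an assumption, and a null cell is modelled as None.

```python
import re
from typing import Any


def slugify(display_name: str) -> str:
    """Assumed rule: lowercase, runs of non-alphanumerics become one underscore."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", display_name).strip("_").lower()


def is_valid_slug(slug: str) -> bool:
    """True when slug is a snake_case identifier (lowercase, digits, underscores)."""
    return re.fullmatch(r"[a-z_][a-z0-9_]*", slug) is not None


def resolve_cell(row: dict, col: str, *, required: bool = False, default: Any = None) -> Any:
    """Return the cell value, the default for a null cell, or raise if required."""
    value = row.get(col)  # a missing column and a null cell are both None here
    if value is None:
        if required:
            raise ValueError(f"required column {col!r} is missing or null")
        return default
    return value


slug = slugify("Num atoms")
lumo = resolve_cell({"LUMO_eV": None}, "LUMO_eV", default=0.0)
```

Under these assumptions, a display name such as "Num atoms" yields an identifier that passes is_valid_slug, while a null LUMO_eV cell falls back to the default instead of aborting.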

Fetching the source data

Pick the constructor matching where the data lives. All four return the same shape, so the rest of the pipeline doesn't care.

# Option 1: plain HTTP(S) download
from lcmd_db.core.lib.importers.steps import http

fetch = http(
    url="https://example.com/data.csv",
    filename="data.csv",  # name to save as in ctx.data_dir
)

# Option 2: GitHub repository
from lcmd_db.core.lib.importers.steps import github

fetch = github(
    repo="lcmd-epfl/my-data",
    ref="main",
    filename="my-data.zip",  # local zip filename; the whole repo is fetched as a zipball
)

# Option 3: Materials Cloud record
from lcmd_db.core.lib.importers.steps import materials_cloud

fetch = materials_cloud(
    record_id="abcde-12345",
    files=["data.csv", "structures.tar.gz"],
)

# Option 4: local files
from lcmd_db.core.lib.importers.steps import local

# Files and directories both work; use directories for vendored XYZ folders.
fetch = local(paths=["/absolute/path/to/data.csv", "/absolute/path/to/xyz/"])

Archives (.tar.gz, .zip, ...) auto-extract into ctx.data_dir for materials_cloud and github. For http, pass auto_unarchive=True explicitly. The fetched-and-extracted state is cached, so re-running an import skips the download.
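
To make the extract-once behaviour concrete, here is a rough standard-library sketch. It is not the lcmd_db implementation: the function name unpack_once and the hidden marker file are assumptions chosen for illustration.

```python
import shutil
from pathlib import Path


def unpack_once(archive: Path, data_dir: Path) -> bool:
    """Extract archive into data_dir unless a marker shows it was already done.

    Returns True when extraction ran and False on a cache hit.
    """
    marker = data_dir / f".{archive.name}.extracted"  # hypothetical cache marker
    if marker.exists():
        return False
    data_dir.mkdir(parents=True, exist_ok=True)
    # shutil picks the unpacker from the file extension (.zip, .tar.gz, ...)
    shutil.unpack_archive(str(archive), str(data_dir))
    marker.touch()
    return True
```

A second call with the same archive and directory returns False without touching the files, which is the property that lets a re-run skip the expensive step.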

Attaching structure files

attach_structures resolves a per-row file path from a Python format-style pattern interpolated with row values:

from lcmd_db.core.lib.importers.steps import attach_structures, read_csv

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
)

Given a row with structure_id="mol_0042", the step looks for xyz/mol_0042.xyz under ctx.data_dir. The path is stored in a column (_structure by default) consumed by load_molecules.
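
The interpolation itself is ordinary Python string formatting. The sketch below shows the idea with a hypothetical helper, resolve_structure_path, and assumes the row is available as a plain dict.

```python
from pathlib import Path


def resolve_structure_path(pattern: str, row: dict, data_dir: Path) -> Path:
    """Fill {col} placeholders from the row and anchor the result under data_dir."""
    # format_map raises KeyError when the pattern names a column the row lacks
    return data_dir / pattern.format_map(row)


path = resolve_structure_path(
    "xyz/{structure_id}.xyz",
    {"structure_id": "mol_0042"},
    Path("/data"),
)
```

A placeholder that names a column absent from the row raises KeyError in this sketch; how the real step reports that case is not covered here.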

| Argument | Default | Purpose |
| --- | --- | --- |
| pattern | required | Path template. {col} placeholders are filled from each row |
| column | _structure | Where to store the resolved path. Override when attaching multiple structures |
| required | True | Missing files report an error per row. Drops the row under on_error="collect" (default) or aborts under on_error="raise". Set False for sparsely-available structures |

The directory is listed once and rows look up the prefix in a set — the step is O(rows) regardless of folder size.
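
The list-once, look-up-in-a-set idea might look roughly like the following. This is a standalone sketch, not the library's code: attach_structures_sketch is a hypothetical name, rows are plain dicts, and the error handling imitates the collect behaviour by dropping the row and recording a message.

```python
import os
from pathlib import Path


def attach_structures_sketch(rows, pattern, data_dir, column="_structure", required=True):
    """Resolve one file per row, listing each directory at most once."""
    listings = {}  # directory -> set of file names found in it
    kept, errors = [], []
    for i, row in enumerate(rows):
        rel = Path(pattern.format_map(row))
        folder = Path(data_dir) / rel.parent
        if folder not in listings:
            listings[folder] = set(os.listdir(folder)) if folder.is_dir() else set()
        if rel.name in listings[folder]:  # O(1) membership test per row
            kept.append({**row, column: str(Path(data_dir) / rel)})
        elif required:
            errors.append((i, f"missing structure file: {rel}"))  # row is dropped
        else:
            kept.append({**row, column: None})
    return kept, errors
```

Checking each row with a separate filesystem call would also work, but a single listing followed by set lookups keeps the per-row cost constant even for folders with very many files.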

Deriving chemistry

derive_chemistry populates the columns load_molecules expects, working from a SMILES column, an XYZ structure column, or both. Run it whenever the row has either — it lets the bulk insert skip Molecule.save(), which is roughly an order of magnitude faster on large imports.

from lcmd_db.core.lib.importers.steps import derive_chemistry

derive = derive_chemistry(smiles_col="SMILES")              # SMILES-driven
derive = derive_chemistry(structure_col="_structure")       # XYZ-driven
derive = derive_chemistry(                                  # both — SMILES wins
    smiles_col="SMILES",
    structure_col="_structure",
)

Produces: standardized_smiles, canonical_smiles, inchi, inchi_key, selfies, molecular_formula, molecular_weight.

XYZ → SMILES inference

When SMILES is missing for a row but structure_col resolves, derive_chemistry runs xyz_to_smiles on the XYZ block first, then feeds the resulting SMILES into the rest of the chain. The default is RDKitXyzToSmiles() (RDKit DetermineBonds, neutral charge) — best-effort: bond perception recovers connectivity but not stereochemistry. Rows where conversion fails report a per-row error and proceed with smiles=None.
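
The fallback order described above can be summarised in a few lines. The sketch is an illustration under assumptions: resolve_smiles is a hypothetical helper, the structure column is taken to hold the XYZ text directly, and the converter is assumed to return a SMILES string or None.

```python
from typing import Callable, Optional, Tuple


def resolve_smiles(
    row: dict,
    xyz_to_smiles: Callable[[str], Optional[str]],
    smiles_col: str = "SMILES",
    structure_col: str = "_structure",
) -> Tuple[Optional[str], Optional[str]]:
    """Return (smiles, error) for one row; an existing SMILES always wins."""
    smiles = row.get(smiles_col)
    if smiles:
        return smiles, None
    xyz = row.get(structure_col)
    if not xyz:
        return None, None  # nothing to infer from
    try:
        inferred = xyz_to_smiles(xyz)
    except Exception as exc:  # conversion failure becomes a per-row error
        return None, f"xyz_to_smiles failed: {exc}"
    if not inferred:
        return None, "xyz_to_smiles returned no SMILES"
    return inferred, None
```

A failed conversion yields smiles=None together with an error message, so the row can continue through the rest of the chain.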

Override xyz_to_smiles to swap the inference strategy. The contract is a single callable — pass a function or any class with __call__:

from lcmd_db.apps.molecules.services.conversion import RDKitXyzToSmiles

derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=RDKitXyzToSmiles(charge=-1),     # anionic species
)

derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=lambda xyz: my_inference(xyz),   # custom strategy
)
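
Because the contract is a single callable, a class with __call__ can carry state between rows. As an illustration, the hypothetical wrapper below memoises another converter by hashing the XYZ text. It assumes the callable takes the XYZ block as a string and returns a SMILES string or None, which this page does not spell out.

```python
import hashlib
from typing import Callable, Dict, Optional


class CachedXyzToSmiles:
    """Wrap any xyz -> SMILES callable and reuse results for identical XYZ blocks."""

    def __init__(self, inner: Callable[[str], Optional[str]]):
        self.inner = inner
        self._cache: Dict[str, Optional[str]] = {}

    def __call__(self, xyz: str) -> Optional[str]:
        key = hashlib.sha256(xyz.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.inner(xyz)  # only computed on a cache miss
        return self._cache[key]
```

If the assumed signature holds, an instance such as CachedXyzToSmiles(RDKitXyzToSmiles()) could be passed as xyz_to_smiles in the same way as the lambda above.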

To run the resolution standalone (e.g. inspect the filled SMILES column before deriving the rest), compose resolve_smiles_from_xyz directly:

from lcmd_db.core.lib.importers.steps import resolve_smiles_from_xyz

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
    >> resolve_smiles_from_xyz(structure_col="_structure")
)
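
For readers curious how a >> chain can work in Python, here is a toy model of the operator. The Step class is hypothetical and says nothing about how lcmd_db implements its steps. It only shows that __rshift__ is enough to build left-to-right composition.

```python
from typing import Callable, List

Rows = List[dict]


class Step:
    """A pipeline stage wrapping a function from rows to rows."""

    def __init__(self, fn: Callable[[Rows], Rows]):
        self.fn = fn

    def __call__(self, rows: Rows) -> Rows:
        return self.fn(rows)

    def __rshift__(self, other: "Step") -> "Step":
        # a >> b runs a first, then feeds its output to b
        return Step(lambda rows: other(self(rows)))


keep_named = Step(lambda rows: [r for r in rows if r.get("structure_id")])
mark_seen = Step(lambda rows: [{**r, "seen": True} for r in rows])
pipeline = keep_named >> mark_seen
result = pipeline([{"structure_id": "mol_1"}, {"structure_id": None}])
```

In this model each composed stage only sees what the previous one produced, so a step inserted in the middle of the chain can be inspected by running the chain up to that point.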

Loading

load_molecules does the bulk insert. Defaults assume the upstream pipeline ran derive_chemistry, so the only argument you usually pass is smiles_col:

from lcmd_db.core.lib.importers.steps import load_molecules

load = load_molecules(smiles_col="SMILES")

When you need to override more fields:

load = load_molecules(
    smiles_col="SMILES",
    name_col="structure_id",   # use a CSV column as the molecule's display name
    structure_col=None,        # disable structure attachment for this subset
)

Putting it together

Reference examples:

  • apps/backend/lcmd_db/registry/subsets/oscar/oscar_nhc.py — SMILES-driven: 30 properties, archive fetch, single-CSV parse, full chemistry derive, straightforward load.
  • apps/backend/lcmd_db/registry/subsets/spahm/spahm_qm7.py — XYZ-only QM7: no SMILES column in the source data, structures come from XYZ files. Pair attach_structures with derive_chemistry(structure_col="_structure") to get the full chemistry suite (SMILES, InChI, formula, …) inferred from the XYZ blocks.
