Molecules
Define a subset that imports molecules from a CSV plus optional XYZ structures.
Properties
Each property maps one CSV column to one typed value stored against the molecule.
from lcmd_db.core.lib.importers import (
    bool_prop, float_prop, int_prop, str_prop,
)

molecule_properties = [
    str_prop("Structure ID", col="structure_id", required=True),
    float_prop("HOMO", col="HOMO_eV", units="eV", required=True),
    float_prop("LUMO", col="LUMO_eV", units="eV"),
    int_prop("Num atoms", col="n_atoms"),
    bool_prop("Is stable", col="stable"),
]

| Argument | What it does |
|---|---|
| first arg | Display name shown in the UI and the Python client |
| col | CSV column to read. Defaults to the display name when omitted |
| units | Free-form string surfaced in the UI (e.g. "eV", "kcal/mol") |
| doc | Long description for the property page |
| required | Fail the import if the column is missing or a row's value is null |
| default | Substitute when the column value is null (only with required=False) |
| slug | Override the auto-generated identifier (defaults to a slugified version of the display name). When you set it explicitly, it must be a valid snake_case Python identifier |
| validation | ValidationRules instance for value bounds (see lcmd_db.core.properties) |
required=True means every row must have a value. Use it sparingly — one
null cell will abort the whole import.
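As a rough mental model of how required and default interact, the sketch below resolves a single CSV cell in plain Python. The names resolve_cell and MissingValueError are hypothetical and not part of lcmd_db. Treating an empty string as null is also an assumption; the real importer may define null differently.

```python
from typing import Callable, Optional


class MissingValueError(ValueError):
    """Raised when a required cell is null (hypothetical, for illustration)."""


def resolve_cell(
    row: dict,
    col: str,
    cast: Callable[[str], object],
    required: bool = False,
    default: Optional[object] = None,
) -> object:
    """Return the typed value for one CSV cell, applying required/default."""
    raw = row.get(col)
    # Assumption: a missing key, None, or a blank string all count as null.
    is_null = raw is None or str(raw).strip() == ""
    if is_null:
        if required:
            # One null cell in a required column is enough to fail.
            raise MissingValueError(f"column {col!r} is required but empty")
        # Optional column: fall back to the default (None if unset).
        return default
    return cast(raw)


row = {"HOMO_eV": "-5.31", "LUMO_eV": "", "n_atoms": "12"}
homo = resolve_cell(row, "HOMO_eV", float, required=True)
lumo = resolve_cell(row, "LUMO_eV", float, default=0.0)
n_atoms = resolve_cell(row, "n_atoms", int)
```

In this model the empty LUMO_eV cell falls back to its default, while the same cell under required=True would raise. That is why required columns are best reserved for fields you know are complete.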
Fetching the source data
Pick the constructor matching where the data lives. All of them return the same shape, so the rest of the pipeline doesn't care.
from lcmd_db.core.lib.importers.steps import http

fetch = http(
    url="https://example.com/data.csv",
    filename="data.csv",  # name to save as in ctx.data_dir
)

from lcmd_db.core.lib.importers.steps import github

fetch = github(
    repo="lcmd-epfl/my-data",
    ref="main",
    filename="my-data.zip",  # local zip filename; the whole repo is fetched as a zipball
)

from lcmd_db.core.lib.importers.steps import materials_cloud

fetch = materials_cloud(
    record_id="abcde-12345",
    files=["data.csv", "structures.tar.gz"],
)

from lcmd_db.core.lib.importers.steps import local

# Files and directories both work; use directories for vendored XYZ folders.
fetch = local(paths=["/absolute/path/to/data.csv", "/absolute/path/to/xyz/"])

Archives (.tar.gz, .zip, ...) auto-extract into ctx.data_dir for
materials_cloud and github. For http, pass auto_unarchive=True
explicitly. The fetched-and-extracted state is cached, so re-running an
import skips the download.
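To make the caching behaviour concrete, here is a minimal standalone sketch of an extract-once pattern using only the standard library. It is not how lcmd_db implements its cache. The function extract_once and the marker-file convention are assumptions chosen for illustration.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_once(archive: Path, data_dir: Path) -> bool:
    """Unpack `archive` into `data_dir` unless a marker says it was done.

    Returns True when extraction ran, False when the cached state was reused.
    """
    marker = data_dir / f".{archive.name}.extracted"  # hypothetical marker file
    if marker.exists():
        return False  # cached: skip the work
    data_dir.mkdir(parents=True, exist_ok=True)
    # The archive format is inferred from the file extension.
    shutil.unpack_archive(str(archive), str(data_dir))
    marker.touch()
    return True


# Demo with a throwaway zip archive.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "my-data.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("data.csv", "structure_id,HOMO_eV\nmol_0001,-5.3\n")

    data_dir = tmp_path / "data"
    first_run = extract_once(archive, data_dir)   # extracts
    second_run = extract_once(archive, data_dir)  # reuses the cached state
    extracted_ok = (data_dir / "data.csv").exists()
```

The second call returns without touching the archive, which mirrors the idea that re-running an import should not repeat the fetch-and-extract work.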
Attaching structure files
attach_structures resolves a per-row file path from a Python format-style
pattern interpolated with row values:
from lcmd_db.core.lib.importers.steps import attach_structures, read_csv

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
)

Given a row with structure_id="mol_0042", the step looks for
xyz/mol_0042.xyz under ctx.data_dir. The path is stored in a column
(_structure by default) consumed by load_molecules.
| Argument | Default | Purpose |
|---|---|---|
| pattern | required | Path template. {col} placeholders are filled from each row |
| column | _structure | Where to store the resolved path. Override when attaching multiple structures |
| required | True | Missing files report an error per row. Drops the row under on_error="collect" (default) or aborts under on_error="raise". Set False for sparsely-available structures |
The directory is listed once and rows look up the prefix in a set — the step is O(rows) regardless of folder size.
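The list-once, look-up-per-row idea can be sketched in plain Python as follows. The helper attach_paths is hypothetical and simplified: it matches the full relative path against a set, whereas the real step may use a different lookup (the text above mentions a prefix). It also only records None for missing files rather than reporting errors.

```python
import tempfile
from pathlib import Path


def attach_paths(rows, pattern, data_dir, column="_structure"):
    """Fill `column` with the resolved path, or None when the file is absent."""
    # Walk the tree once; set membership tests are O(1) on average,
    # so the per-row cost does not grow with the number of files.
    existing = {
        p.relative_to(data_dir).as_posix()
        for p in data_dir.rglob("*")
        if p.is_file()
    }
    for row in rows:
        candidate = pattern.format(**row)  # e.g. "xyz/mol_0042.xyz"
        row[column] = candidate if candidate in existing else None
    return rows


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "xyz").mkdir()
    (data_dir / "xyz" / "mol_0042.xyz").write_text("1\n\nH 0.0 0.0 0.0\n")

    rows = attach_paths(
        [{"structure_id": "mol_0042"}, {"structure_id": "mol_0043"}],
        pattern="xyz/{structure_id}.xyz",
        data_dir=data_dir,
    )
```

Here the first row resolves to its XYZ file and the second is left without a structure, which is the case the required argument decides how to handle.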
Deriving chemistry
derive_chemistry populates the columns load_molecules expects, working from
a SMILES column, an XYZ structure column, or both. Run it whenever the row has
either — it lets the bulk insert skip Molecule.save(), which is roughly an
order of magnitude faster on large imports.
from lcmd_db.core.lib.importers.steps import derive_chemistry

derive = derive_chemistry(smiles_col="SMILES")  # SMILES-driven
derive = derive_chemistry(structure_col="_structure")  # XYZ-driven
derive = derive_chemistry(  # both: SMILES wins
    smiles_col="SMILES",
    structure_col="_structure",
)

Produces: standardized_smiles, canonical_smiles, inchi, inchi_key,
selfies, molecular_formula, molecular_weight.
XYZ → SMILES inference
When SMILES is missing for a row but structure_col resolves, derive_chemistry
runs xyz_to_smiles on the XYZ block first, then feeds the resulting SMILES into
the rest of the chain. The default is RDKitXyzToSmiles() (RDKit
DetermineBonds, neutral charge) — best-effort: bond perception recovers
connectivity but not stereochemistry. Rows where conversion fails report a
per-row error and proceed with smiles=None.
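The fallback order described here (SMILES column first, then XYZ inference, then None with a recorded error) can be pictured with the standalone sketch below. Everything in it is illustrative: resolve_smiles, AlwaysWater and broken_converter are hypothetical names. The converter is assumed to be a callable that takes the XYZ text and returns a SMILES string.

```python
from typing import Callable, Optional


def resolve_smiles(
    row: dict,
    smiles_col: str,
    xyz_block: Optional[str],
    xyz_to_smiles: Callable[[str], str],
    errors: list,
) -> Optional[str]:
    """Prefer the SMILES column; fall back to XYZ inference; else None."""
    smiles = row.get(smiles_col)
    if smiles:
        return smiles  # an explicit SMILES wins over the structure
    if xyz_block is None:
        return None
    try:
        return xyz_to_smiles(xyz_block)
    except Exception as exc:
        # A converter failure becomes a per-row error, not a crash.
        errors.append(f"xyz_to_smiles failed: {exc}")
        return None


class AlwaysWater:
    """Toy converter: a class with __call__ works like a function."""

    def __call__(self, xyz: str) -> str:
        return "O"


def broken_converter(xyz: str) -> str:
    raise ValueError("could not perceive bonds")


errors: list = []
xyz = "3\n\nO 0.0 0.0 0.0\nH 0.96 0.0 0.0\nH -0.24 0.93 0.0\n"
from_column = resolve_smiles({"SMILES": "CCO"}, "SMILES", xyz, AlwaysWater(), errors)
from_xyz = resolve_smiles({"SMILES": ""}, "SMILES", xyz, AlwaysWater(), errors)
failed = resolve_smiles({"SMILES": None}, "SMILES", xyz, broken_converter, errors)
```

The three calls cover the three outcomes: the column value is used as-is, the converter fills a blank cell, and a failing converter leaves the row with no SMILES plus one recorded error.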
Override xyz_to_smiles to swap the inference strategy. The contract is a
single callable — pass a function or any class with __call__:
from lcmd_db.apps.molecules.services.conversion import RDKitXyzToSmiles

derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=RDKitXyzToSmiles(charge=-1),  # anionic species
)

derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=lambda xyz: my_inference(xyz),  # custom strategy
)

To run the resolution standalone (e.g. inspect the filled SMILES column before
deriving the rest), compose resolve_smiles_from_xyz directly:
from lcmd_db.core.lib.importers.steps import resolve_smiles_from_xyz

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
    >> resolve_smiles_from_xyz(structure_col="_structure")
)

Loading
load_molecules does the bulk insert. Defaults assume the upstream pipeline ran
derive_chemistry, so the only argument you usually pass is smiles_col:
from lcmd_db.core.lib.importers.steps import load_molecules

load = load_molecules(smiles_col="SMILES")

When you need to override more fields:

load = load_molecules(
    smiles_col="SMILES",
    name_col="structure_id",  # use a CSV column as the molecule's display name
    structure_col=None,  # disable structure attachment for this subset
)

Putting it together
Reference examples:
apps/backend/lcmd_db/registry/subsets/oscar/oscar_nhc.py: SMILES-driven, with 30 properties, archive fetch, single-CSV parse, full chemistry derive, and a straightforward load.

apps/backend/lcmd_db/registry/subsets/spahm/spahm_qm7.py: XYZ-only QM7. The source data has no SMILES column, so structures come from XYZ files. Pair attach_structures with derive_chemistry(structure_col="_structure") to get the full chemistry suite (SMILES, InChI, formula, …) inferred from the XYZ blocks.
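The pipelines in this guide chain steps with the >> operator. As a mental model only, the sketch below shows one way such chaining can be built in plain Python with __rshift__. The Step class and both helper functions are hypothetical and do not reflect how lcmd_db implements its steps.

```python
from typing import Callable, List


class Step:
    """A named transformation over a list of row dicts (illustrative only)."""

    def __init__(self, name: str, fn: Callable[[List[dict]], List[dict]]):
        self.name = name
        self.fn = fn

    def __call__(self, rows: List[dict]) -> List[dict]:
        return self.fn(rows)

    def __rshift__(self, other: "Step") -> "Step":
        # a >> b runs a first, then feeds its output into b.
        return Step(
            f"{self.name} >> {other.name}",
            lambda rows: other(self(rows)),
        )


def add_structure_path(rows):
    # Attach a path derived from each row's structure_id.
    return [{**r, "_structure": f"xyz/{r['structure_id']}.xyz"} for r in rows]


def keep_rows_with_smiles(rows):
    # Drop rows whose SMILES cell is missing or empty.
    return [r for r in rows if r.get("SMILES")]


pipeline = Step("attach", add_structure_path) >> Step("filter", keep_rows_with_smiles)
result = pipeline([
    {"structure_id": "mol_0001", "SMILES": "CCO"},
    {"structure_id": "mol_0002", "SMILES": ""},
])
```

Composing steps this way keeps each one small and testable on its own, and the composed object is itself a step that can be chained further.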