Molecules
Define a subset that imports molecules from a CSV, plus optional XYZ structures.
Properties
Each property maps one CSV column to one typed value stored against the molecule.
from lcmd_db.core.lib.importers import (
bool_prop, float_prop, int_prop, str_prop,
)
molecule_properties = [
str_prop("Structure ID", col="structure_id", required=True),
float_prop("HOMO", col="HOMO_eV", units="eV", required=True),
float_prop("LUMO", col="LUMO_eV", units="eV"),
int_prop("Num atoms", col="n_atoms"),
bool_prop("Is stable", col="stable"),
]

| Argument | What it does |
|---|---|
| first arg | Display name shown in the UI and the Python client |
| col | CSV column to read. Defaults to the display name when omitted |
| units | Free-form string surfaced in the UI (e.g. "eV", "kcal/mol") |
| doc | Long description for the property page |
| required | Fail the import if the column is missing or a row's value is null |
| default | Substitute when the column value is null (only with required=False) |
| slug | Override the auto-generated identifier (defaults to a slugified version of the display name). When you set it explicitly, it must be a valid snake_case Python identifier |
| validation | ValidationRules instance for value bounds (see lcmd_db.core.properties) |
| levels | Level(s) of theory the value was computed at: one level_of_theory(...) spec or a sequence. Optional; see Levels of theory |
required=True means every row must have a value. Use it sparingly — one
null cell will abort the whole import.
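The interplay between required and default in the table above can be pictured with a small stand-alone sketch. The helpers is_null and resolve_cell are hypothetical and not part of lcmd_db, and treating empty strings and NaN as null is an assumption made for illustration only.

```python
import math
from typing import Any, Optional


def is_null(value: Any) -> bool:
    """Assumption: None, blank strings, and NaN all count as null cells."""
    if value is None:
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def resolve_cell(value: Any, *, required: bool = False,
                 default: Optional[Any] = None) -> Any:
    """Hypothetical helper: apply the required/default rules to one cell."""
    if required and default is not None:
        # Mirrors the table: default is only meaningful with required=False.
        raise ValueError("default only makes sense with required=False")
    if is_null(value):
        if required:
            raise ValueError("null value in a required column")
        return default
    return value


print(resolve_cell(-5.2, required=True))  # -5.2
print(resolve_cell(None, default=0))      # 0
```

A required column raises on the first null cell, which is why optional columns with a default are usually the gentler choice.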
Levels of theory
A property can record the computational level of theory its value was
produced at. Define a spec once with level_of_theory(...) and attach it to any
property via levels= — the same spec is meant to be shared across many
properties, and across a whole dataset family.
from lcmd_db.core.lib.importers import level_of_theory
OSCAR_OPT = level_of_theory(
"B97-D/Def2-TZVP",
theory_type="dft",
functional="B97-D",
basis_set="Def2-TZVP",
software="gaussian",
software_version="16",
is_optimization=True,
doc="Geometry optimization",
)
OSCAR_SP = level_of_theory(
"ωB97X-D/Def2-TZVP",
theory_type="td_dft",
functional="ωB97X-D",
basis_set="Def2-TZVP",
software="gaussian",
software_version="16",
is_single_point=True,
doc="Excited-state / single-point calculations",
)
OSCAR_DFT_LEVELS = (OSCAR_OPT, OSCAR_SP)

Pass a single spec or a sequence to levels= on any property factory:
float_prop("HOMO", col="HOMO_eV", units="eV", levels=OSCAR_DFT_LEVELS)
float_prop("pKa", col="pKa", levels=OSCAR_OPT)

level_of_theory() takes a positional name plus keyword-only fields:

| Argument | Default | Purpose |
|---|---|---|
| name | - | Concise method label, e.g. "B3LYP/6-31G*" (positional, required) |
| theory_type | - | Required. One of dft, semiempirical, wft, td_dft, tda_tddft, mrci |
| software | - | Required. One of gaussian, orca, xtb, psi4, qchem, turbomole, molpro, other |
| software_version | "" | Version string, e.g. "16", "6.4.0"; part of the spec's identity |
| doc | "" | Detailed description (stored as the level's description) |
| functional | "" | DFT functional, e.g. B97-D, ωB97X-D |
| basis_set | "" | Basis set, e.g. Def2-TZVP, 6-31G(d) |
| auxiliary_basis | "" | Auxiliary basis for density fitting |
| dispersion_correction | "" | Dispersion correction, e.g. D3, D3BJ, D4 |
| solvation_model | "" | Implicit solvation model, e.g. PCM, SMD, COSMO |
| is_optimization | False | Method was used for geometry optimization |
| is_single_point | False | Method was used for single-point energy calculations |
| metadata | None | Free-form dict of extra method parameters |
Two rules are checked when the spec is defined, so a bad spec fails at
module-import time rather than mid-import: theory_type="dft" requires a
functional, and theory_type="semiempirical" must not set a basis_set.
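To illustrate what "checked when the spec is defined" means in practice, here is a minimal sketch using a frozen dataclass. LevelSpec is a hypothetical stand-in for whatever level_of_theory(...) returns; the real implementation may differ, and only the two rules described above are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelSpec:
    """Hypothetical stand-in for the object level_of_theory(...) returns."""
    name: str
    theory_type: str
    software: str
    software_version: str = ""
    functional: str = ""
    basis_set: str = ""

    def __post_init__(self) -> None:
        # Runs at construction time, i.e. when the defining module is imported.
        if self.theory_type == "dft" and not self.functional:
            raise ValueError(f"{self.name}: theory_type='dft' requires a functional")
        if self.theory_type == "semiempirical" and self.basis_set:
            raise ValueError(f"{self.name}: semiempirical must not set a basis_set")


# A valid semiempirical spec: no basis set.
ok = LevelSpec("GFN2-xTB", theory_type="semiempirical", software="xtb")

# A DFT spec without a functional fails immediately, before any rows are read.
try:
    LevelSpec("B97-D/Def2-TZVP", theory_type="dft", software="gaussian")
except ValueError as exc:
    print(exc)
```

Because the check lives in the constructor, a typo in a shared spec surfaces as soon as the subset module is loaded.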
A spec's identity is its natural key (name, software, software_version).
Specs sharing that key — within a subset or across a dataset family — converge
on a single stored level, so define once and reuse. Re-importing converges:
drop levels= from a property and its link is removed; remove a level from
every property and the assignment is pruned. Declaring two different specs
under the same natural key is a loud error, not a silent first-wins.
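The natural-key behaviour described above can be sketched with a small in-memory registry. LevelRegistry and its dict-based specs are hypothetical; they only illustrate how identical specs could converge on one stored level while conflicting ones raise.

```python
from typing import Dict, Tuple

NaturalKey = Tuple[str, str, str]  # (name, software, software_version)


class LevelRegistry:
    """Hypothetical registry that deduplicates specs by natural key."""

    def __init__(self) -> None:
        self._levels: Dict[NaturalKey, dict] = {}

    def register(self, spec: dict) -> dict:
        key = (spec["name"], spec["software"], spec.get("software_version", ""))
        existing = self._levels.get(key)
        if existing is None:
            self._levels[key] = spec
            return spec
        if existing != spec:
            # Same key, different fields: refuse instead of keeping the first one.
            raise ValueError(f"conflicting specs for natural key {key}")
        return existing


registry = LevelRegistry()
opt = {"name": "B97-D/Def2-TZVP", "software": "gaussian",
       "software_version": "16", "functional": "B97-D"}

first = registry.register(opt)
second = registry.register(dict(opt))  # an equal copy converges on the same level
print(first is second)  # True

try:
    registry.register({**opt, "functional": "B3LYP"})
except ValueError as exc:
    print(exc)
```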
Levels attach identically to molecule, reaction, and fragment properties.
Fetching the source data
Pick the constructor matching where the data lives. All four return the same shape, so the rest of the pipeline doesn't care.
from lcmd_db.core.lib.importers.steps import http
fetch = http(
url="https://example.com/data.csv",
filename="data.csv", # name to save as in ctx.data_dir
)

from lcmd_db.core.lib.importers.steps import github
fetch = github(
repo="lcmd-epfl/my-data",
ref="main",
filename="my-data.zip", # local zip filename — the whole repo is fetched as a zipball
)

from lcmd_db.core.lib.importers.steps import materials_cloud
fetch = materials_cloud(
record_id="abcde-12345",
files=["data.csv", "structures.tar.gz"],
)

from lcmd_db.core.lib.importers.steps import local
# Files and directories both work — use directories for vendored XYZ folders.
fetch = local(paths=["/absolute/path/to/data.csv", "/absolute/path/to/xyz/"])

Archives (.tar.gz, .zip, ...) auto-extract into ctx.data_dir for
materials_cloud and github. For http, pass auto_unarchive=True
explicitly. The fetched-and-extracted state is cached, so re-running an
import skips the download.
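One way to picture the cached fetch-and-extract behaviour is a marker file written after a successful download. The sketch below is an assumption about how such caching could work, not the library's implementation; fetch_cached, the marker naming, and fake_download are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List


def fetch_cached(data_dir: Path, filename: str,
                 download: Callable[[Path], None],
                 auto_unarchive: bool = False) -> Path:
    """Hypothetical helper: download once, optionally unpack, then reuse."""
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / filename
    marker = data_dir / f".{filename}.done"
    if marker.exists():
        return target  # cached: skip both download and extraction
    download(target)
    if auto_unarchive:
        shutil.unpack_archive(str(target), str(data_dir))
    marker.touch()
    return target


calls: List[str] = []


def fake_download(target: Path) -> None:
    """Stand-in for a network fetch: writes a small zip archive at `target`."""
    calls.append(target.name)
    src = Path(tempfile.mkdtemp())
    (src / "data.csv").write_text("structure_id,HOMO_eV\nmol_0001,-5.2\n")
    # make_archive appends ".zip" to the base name, recreating `target`.
    shutil.make_archive(str(target.with_suffix("")), "zip", root_dir=src)


work = Path(tempfile.mkdtemp())
fetch_cached(work, "data.zip", fake_download, auto_unarchive=True)
fetch_cached(work, "data.zip", fake_download, auto_unarchive=True)
print(calls)  # ['data.zip']
```

The second call returns immediately because the marker already exists, which is the behaviour a re-run of an import relies on.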
Attaching structure files
attach_structures resolves a per-row file path from a Python format-style
pattern interpolated with row values:
from lcmd_db.core.lib.importers.steps import attach_structures, read_csv
parse = (
read_csv(path="data.csv")
>> attach_structures(pattern="xyz/{structure_id}.xyz")
)

Given a row with structure_id="mol_0042", the step looks for
xyz/mol_0042.xyz under ctx.data_dir. The path is stored in a column
(_structure by default) consumed by load_molecules.
| Argument | Default | Purpose |
|---|---|---|
| pattern | required | Path template. {col} placeholders are filled from each row |
| column | _structure | Where to store the resolved path. Override when attaching multiple structures |
| required | True | Missing files report an error per row. Drops the row under on_error="collect" (default) or aborts under on_error="raise". Set False for sparsely-available structures |
The directory is listed once and rows look up the prefix in a set — the step is O(rows) regardless of folder size.
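The resolution strategy described above (format the pattern per row, look the result up in a set built from one directory listing) might look roughly like this. resolve_structures is a hypothetical helper, and it assumes the folder part of the pattern contains no placeholders.

```python
import os
import tempfile
from typing import Dict, Iterable, List, Optional


def resolve_structures(rows: Iterable[Dict[str, str]], data_dir: str,
                       pattern: str, column: str = "_structure",
                       ) -> List[Dict[str, Optional[str]]]:
    """Hypothetical helper: fill `column` with each row's structure path."""
    folder = os.path.dirname(pattern)
    # List the directory once; set membership is O(1) per row afterwards.
    available = set(os.listdir(os.path.join(data_dir, folder)))
    resolved = []
    for row in rows:
        relative = pattern.format(**row)  # "{structure_id}" -> row value
        found = os.path.basename(relative) in available
        resolved.append({**row, column: relative if found else None})
    return resolved


# Build a tiny data_dir with a single XYZ file.
data_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, "xyz"))
open(os.path.join(data_dir, "xyz", "mol_0042.xyz"), "w").close()

rows = [{"structure_id": "mol_0042"}, {"structure_id": "mol_0043"}]
result = resolve_structures(rows, data_dir, "xyz/{structure_id}.xyz")
print(result[0]["_structure"])  # xyz/mol_0042.xyz
print(result[1]["_structure"])  # None
```

In this sketch a missing file simply yields None; the real step reports an error per row when required=True.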
Deriving chemistry
derive_chemistry populates the columns load_molecules expects, working from
a SMILES column, an XYZ structure column, or both. Run it whenever the row has
either — it lets the bulk insert skip Molecule.save(), which is roughly an
order of magnitude faster on large imports.
from lcmd_db.core.lib.importers.steps import derive_chemistry
derive = derive_chemistry(smiles_col="SMILES") # SMILES-driven
derive = derive_chemistry(structure_col="_structure") # XYZ-driven
derive = derive_chemistry( # both — SMILES wins
smiles_col="SMILES",
structure_col="_structure",
)

Produces: standardized_smiles, canonical_smiles, inchi, inchi_key,
selfies, molecular_formula, molecular_weight.
XYZ → SMILES inference
When SMILES is missing for a row but structure_col resolves, derive_chemistry
runs xyz_to_smiles on the XYZ block first, then feeds the resulting SMILES into
the rest of the chain. The default is RDKitXyzToSmiles() (RDKit
DetermineBonds, neutral charge) — best-effort: bond perception recovers
connectivity but not stereochemistry. Rows where conversion fails report a
per-row error and proceed with smiles=None.
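The fallback order described above can be sketched without RDKit by plugging in a toy converter. fill_smiles, FixedSmiles, and the "smiles"/"xyz" row keys are hypothetical names chosen for illustration; only the control flow (existing SMILES wins, failures are collected per row, the row continues with None) follows the text.

```python
from typing import Callable, Dict, List, Optional, Tuple


class FixedSmiles:
    """Toy converter: any object with __call__(xyz) -> str fits the contract."""

    def __init__(self, smiles: str) -> None:
        self.smiles = smiles

    def __call__(self, xyz: str) -> str:
        if not xyz.strip():
            raise ValueError("empty XYZ block")
        return self.smiles


def fill_smiles(rows: List[Dict[str, Optional[str]]],
                xyz_to_smiles: Callable[[str], str]) -> List[Tuple[int, str]]:
    """Hypothetical helper: infer missing SMILES, collecting per-row errors."""
    errors: List[Tuple[int, str]] = []
    for index, row in enumerate(rows):
        if row.get("smiles"):
            continue  # an existing SMILES wins over the structure
        xyz = row.get("xyz")
        if xyz is None:
            continue  # nothing to infer from
        try:
            row["smiles"] = xyz_to_smiles(xyz)
        except Exception as exc:
            errors.append((index, str(exc)))
            row["smiles"] = None  # the row proceeds without a SMILES
    return errors


rows = [
    {"smiles": "CCO", "xyz": "ignored because SMILES is present"},
    {"smiles": None, "xyz": "3\nwater\nO 0 0 0\nH 0 0 1\nH 0 1 0\n"},
    {"smiles": None, "xyz": "   "},
]
errors = fill_smiles(rows, FixedSmiles("O"))
print(rows[1]["smiles"])  # O
print(errors)             # [(2, 'empty XYZ block')]
```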
Override xyz_to_smiles to swap the inference strategy. The contract is a
single callable — pass a function or any class with __call__:
from lcmd_db.apps.molecules.services.conversion import RDKitXyzToSmiles
derive_chemistry(
structure_col="_structure",
xyz_to_smiles=RDKitXyzToSmiles(charge=-1), # anionic species
)
derive_chemistry(
structure_col="_structure",
xyz_to_smiles=lambda xyz: my_inference(xyz), # custom strategy
)

To run the resolution standalone (e.g. inspect the filled SMILES column before
deriving the rest), compose resolve_smiles_from_xyz directly:
from lcmd_db.core.lib.importers.steps import resolve_smiles_from_xyz
parse = (
read_csv(path="data.csv")
>> attach_structures(pattern="xyz/{structure_id}.xyz")
>> resolve_smiles_from_xyz(structure_col="_structure")
)

Loading
load_molecules does the bulk insert. Defaults assume the upstream pipeline ran
derive_chemistry, so the only argument you usually pass is smiles_col:
from lcmd_db.core.lib.importers.steps import load_molecules
load = load_molecules(smiles_col="SMILES")

When you need to override more fields:
load = load_molecules(
smiles_col="SMILES",
name_col="structure_id", # use a CSV column as the molecule's display name
structure_col=None, # disable structure attachment for this subset
)

Putting it together
Reference examples:
- apps/backend/lcmd_db/registry/subsets/oscar/oscar_nhc.py: SMILES-driven, with 30 properties, archive fetch, single-CSV parse, full chemistry derive, and a straightforward load.
- apps/backend/lcmd_db/registry/subsets/spahm/spahm_qm7.py: XYZ-only QM7. There is no SMILES column in the source data; structures come from XYZ files. Pair attach_structures with derive_chemistry(structure_col="_structure") to get the full chemistry suite (SMILES, InChI, formula, …) inferred from the XYZ blocks.