LCMD[db]

Molecules

Define a subset that imports molecules from a CSV plus optional XYZ structures.

Properties

Each property maps one CSV column to one typed value stored against the molecule.

from lcmd_db.core.lib.importers import (
    bool_prop, float_prop, int_prop, str_prop,
)

molecule_properties = [
    str_prop("Structure ID", col="structure_id", required=True),
    float_prop("HOMO", col="HOMO_eV", units="eV", required=True),
    float_prop("LUMO", col="LUMO_eV", units="eV"),
    int_prop("Num atoms", col="n_atoms"),
    bool_prop("Is stable", col="stable"),
]
| Argument | What it does |
|---|---|
| first arg | Display name shown in the UI and the Python client |
| col | CSV column to read. Defaults to the display name when omitted |
| units | Free-form string surfaced in the UI (e.g. "eV", "kcal/mol") |
| doc | Long description for the property page |
| required | Fail the import if the column is missing or a row's value is null |
| default | Substitute when the column value is null (only with required=False) |
| slug | Override the auto-generated identifier (defaults to a slugified version of the display name). When you set it explicitly, it must be a valid snake_case Python identifier |
| validation | ValidationRules instance for value bounds (see lcmd_db.core.properties) |
| levels | Level(s) of theory the value was computed at: one level_of_theory(...) spec or a sequence. Optional; see Levels of theory |

required=True means every row must have a value. Use it sparingly — one null cell will abort the whole import.
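
The interplay between required and default can be sketched as follows. This is a stand-alone illustration, not lcmd_db's implementation: resolve_cell and MissingValueError are hypothetical names, and treating an empty cell the same as a missing column is an assumption.

```python
class MissingValueError(ValueError):
    """Raised when a required property has no value for a row (hypothetical)."""


def resolve_cell(row, col, *, required=False, default=None):
    """Return the value to store for one property of one CSV row.

    Assumption: a missing column and an empty cell both count as null.
    """
    value = row.get(col)
    if value is None or value == "":
        if required:
            raise MissingValueError(f"required column {col!r} has no value")
        return default  # only consulted when required=False
    return value


row = {"structure_id": "mol_0042", "HOMO_eV": "-5.1", "LUMO_eV": ""}

homo = resolve_cell(row, "HOMO_eV", required=True)   # value present
lumo = resolve_cell(row, "LUMO_eV")                  # null cell, no default
n_atoms = resolve_cell(row, "n_atoms", default=0)    # missing column, default used
```

Calling resolve_cell(row, "LUMO_eV", required=True) on the same row would raise, which is the failure mode the warning above is about.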

Levels of theory

A property can record the computational level of theory its value was produced at. Define a spec once with level_of_theory(...) and attach it to any property via levels= — the same spec is meant to be shared across many properties, and across a whole dataset family.

apps/backend/lcmd_db/registry/subsets/oscar/shared.py
from lcmd_db.core.lib.importers import level_of_theory

OSCAR_OPT = level_of_theory(
    "B97-D/Def2-TZVP",
    theory_type="dft",
    functional="B97-D",
    basis_set="Def2-TZVP",
    software="gaussian",
    software_version="16",
    is_optimization=True,
    doc="Geometry optimization",
)
OSCAR_SP = level_of_theory(
    "ωB97X-D/Def2-TZVP",
    theory_type="td_dft",
    functional="ωB97X-D",
    basis_set="Def2-TZVP",
    software="gaussian",
    software_version="16",
    is_single_point=True,
    doc="Excited-state / single-point calculations",
)
OSCAR_DFT_LEVELS = (OSCAR_OPT, OSCAR_SP)

Pass a single spec or a sequence to levels= on any property factory:

float_prop("HOMO", col="HOMO_eV", units="eV", levels=OSCAR_DFT_LEVELS)
float_prop("pKa", col="pKa", levels=OSCAR_OPT)

level_of_theory() takes a positional name plus keyword-only fields:

| Argument | Default | Purpose |
|---|---|---|
| name | | Concise method label, e.g. "B3LYP/6-31G*" (positional, required) |
| theory_type | | Required. One of dft, semiempirical, wft, td_dft, tda_tddft, mrci |
| software | | Required. One of gaussian, orca, xtb, psi4, qchem, turbomole, molpro, other |
| software_version | "" | Version string, e.g. "16", "6.4.0"; part of the spec's identity |
| doc | "" | Detailed description (stored as the level's description) |
| functional | "" | DFT functional, e.g. B97-D, ωB97X-D |
| basis_set | "" | Basis set, e.g. Def2-TZVP, 6-31G(d) |
| auxiliary_basis | "" | Auxiliary basis for density fitting |
| dispersion_correction | "" | Dispersion correction, e.g. D3, D3BJ, D4 |
| solvation_model | "" | Implicit solvation model, e.g. PCM, SMD, COSMO |
| is_optimization | False | Method was used for geometry optimization |
| is_single_point | False | Method was used for single-point energy calculations |
| metadata | None | Free-form dict of extra method parameters |

Two rules are checked when the spec is defined, so a bad spec fails at module-import time rather than mid-import: theory_type="dft" requires a functional, and theory_type="semiempirical" must not set a basis_set.
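
Definition-time checking of this kind is commonly done in a constructor hook. The sketch below shows the two rules on a hypothetical stand-in class (LevelSpec is not the object level_of_theory(...) actually returns, and the error messages are invented for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelSpec:
    """Hypothetical stand-in for a level-of-theory spec."""

    name: str
    theory_type: str
    software: str
    functional: str = ""
    basis_set: str = ""

    def __post_init__(self):
        # Rule 1: a DFT level must name its functional.
        if self.theory_type == "dft" and not self.functional:
            raise ValueError(f"{self.name}: theory_type='dft' requires a functional")
        # Rule 2: a semiempirical level must not carry a basis set.
        if self.theory_type == "semiempirical" and self.basis_set:
            raise ValueError(f"{self.name}: semiempirical levels must not set basis_set")


# A valid semiempirical spec constructs without complaint.
ok = LevelSpec("GFN2-xTB", theory_type="semiempirical", software="xtb")

# A DFT spec without a functional fails as soon as it is constructed.
try:
    LevelSpec("B97-D/Def2-TZVP", theory_type="dft", software="gaussian")
except ValueError as exc:
    error = str(exc)
```

Because the check runs in the constructor, a module that defines a bad spec at top level cannot even be imported, which is the behaviour described above.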

A spec's identity is its natural key (name, software, software_version). Specs sharing that key — within a subset or across a dataset family — converge on a single stored level, so define once and reuse. Re-importing converges: drop levels= from a property and its link is removed; remove a level from every property and the assignment is pruned. Declaring two different specs under the same natural key is a loud error, not a silent first-wins.
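
The natural-key behaviour can be pictured as a registry keyed by (name, software, software_version). The following is a sketch under that assumption only; LevelSpec and register are hypothetical names, and the real storage layer is not shown in this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelSpec:
    """Hypothetical stand-in for a level-of-theory spec."""

    name: str
    software: str
    software_version: str = ""
    functional: str = ""

    @property
    def natural_key(self):
        return (self.name, self.software, self.software_version)


def register(registry, spec):
    """Return the stored spec for spec's natural key, refusing conflicts."""
    existing = registry.setdefault(spec.natural_key, spec)
    if existing != spec:
        # Same key, different content: fail loudly instead of first-wins.
        raise ValueError(f"conflicting specs for natural key {spec.natural_key}")
    return existing


registry = {}
a = register(registry, LevelSpec("B97-D/Def2-TZVP", "gaussian", "16", functional="B97-D"))
b = register(registry, LevelSpec("B97-D/Def2-TZVP", "gaussian", "16", functional="B97-D"))

try:
    register(registry, LevelSpec("B97-D/Def2-TZVP", "gaussian", "16", functional="B3LYP"))
    conflict = False
except ValueError:
    conflict = True
```

The two identical declarations resolve to one stored entry, while the third, which reuses the key with a different functional, is rejected.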

Levels attach identically to molecule, reaction, and fragment properties.

Fetching the source data

Pick the constructor matching where the data lives. All four return the same shape, so the rest of the pipeline doesn't care.

# Plain HTTP(S) download
from lcmd_db.core.lib.importers.steps import http

fetch = http(
    url="https://example.com/data.csv",
    filename="data.csv",  # name to save as in ctx.data_dir
)

# GitHub repository
from lcmd_db.core.lib.importers.steps import github

fetch = github(
    repo="lcmd-epfl/my-data",
    ref="main",
    filename="my-data.zip",  # local zip filename; the whole repo is fetched as a zipball
)

# Materials Cloud record
from lcmd_db.core.lib.importers.steps import materials_cloud

fetch = materials_cloud(
    record_id="abcde-12345",
    files=["data.csv", "structures.tar.gz"],
)

# Local files
from lcmd_db.core.lib.importers.steps import local

# Files and directories both work; use directories for vendored XYZ folders.
fetch = local(paths=["/absolute/path/to/data.csv", "/absolute/path/to/xyz/"])

Archives (.tar.gz, .zip, ...) auto-extract into ctx.data_dir for materials_cloud and github. For http, pass auto_unarchive=True explicitly. The fetched-and-extracted state is cached, so re-running an import skips the download.
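
The fetch-extract-cache behaviour can be approximated with the standard library. This is a sketch, not the library's code: fetch_once, the download callable, and the ".fetched" marker file are all assumptions made for illustration, and the download is faked so the example needs no network.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def fetch_once(data_dir, filename, download):
    """Download and extract an archive unless a previous run already did.

    `download` is any callable that writes the archive to the given path.
    The ".fetched" marker file is a hypothetical convention for this sketch.
    """
    data_dir = Path(data_dir)
    marker = data_dir / ".fetched"
    if marker.exists():
        return False  # cached: skip download and extraction
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / filename
    download(archive)
    # unpack_archive picks the format from the file extension (.zip, .tar.gz, ...).
    shutil.unpack_archive(str(archive), str(data_dir))
    marker.touch()
    return True


calls = []


def fake_download(path):
    # Stand-in for a network fetch: writes a one-file zip archive.
    calls.append(path)
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("data.csv", "structure_id,SMILES\nmol_0001,CCO\n")


data_dir = Path(tempfile.mkdtemp())
first = fetch_once(data_dir, "my-data.zip", fake_download)   # downloads and extracts
second = fetch_once(data_dir, "my-data.zip", fake_download)  # served from cache
```

The second call returns without invoking the download at all, which is the effect a re-run of an import relies on.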

Attaching structure files

attach_structures resolves a per-row file path from a Python format-style pattern interpolated with row values:

from lcmd_db.core.lib.importers.steps import attach_structures, read_csv

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
)

Given a row with structure_id="mol_0042", the step looks for xyz/mol_0042.xyz under ctx.data_dir. The path is stored in a column (_structure by default) consumed by load_molecules.

| Argument | Default | Purpose |
|---|---|---|
| pattern | required | Path template. {col} placeholders are filled from each row |
| column | _structure | Where to store the resolved path. Override when attaching multiple structures |
| required | True | Missing files report an error per row. Drops the row under on_error="collect" (default) or aborts under on_error="raise". Set False for sparsely-available structures |

The directory is listed once and rows look up the prefix in a set — the step is O(rows) regardless of folder size.
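
The pattern interpolation and the list-once lookup can be illustrated with a few lines of standard-library Python. The function below is a hypothetical stand-in for attach_structures, assuming the directory part of the pattern contains no placeholders and that rows with a missing file are dropped and reported, as under required=True with on_error="collect".

```python
import os
import tempfile
from pathlib import Path


def attach_paths(rows, data_dir, pattern, column="_structure"):
    """Resolve pattern per row and keep only rows whose file exists.

    Returns (kept_rows, errors).
    """
    folder = os.path.dirname(pattern)
    # One directory listing up front; each row is then a set lookup.
    available = set(os.listdir(Path(data_dir) / folder))
    kept, errors = [], []
    for index, row in enumerate(rows):
        relative = pattern.format(**row)  # fill {col} placeholders from the row
        if os.path.basename(relative) in available:
            kept.append({**row, column: relative})
        else:
            errors.append((index, f"missing structure file: {relative}"))
    return kept, errors


# Build a throwaway data directory holding a single XYZ file.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "xyz").mkdir()
(data_dir / "xyz" / "mol_0042.xyz").write_text("1\n\nH 0.0 0.0 0.0\n")

rows = [{"structure_id": "mol_0042"}, {"structure_id": "mol_0043"}]
kept, errors = attach_paths(rows, data_dir, "xyz/{structure_id}.xyz")
```

Here the first row gains a _structure entry of xyz/mol_0042.xyz, and the second row is dropped with a per-row error because no matching file exists.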

Deriving chemistry

derive_chemistry populates the columns load_molecules expects, working from a SMILES column, an XYZ structure column, or both. Run it whenever the row has either: it lets the bulk insert skip Molecule.save(), which makes large imports roughly an order of magnitude faster.

from lcmd_db.core.lib.importers.steps import derive_chemistry

derive = derive_chemistry(smiles_col="SMILES")              # SMILES-driven
derive = derive_chemistry(structure_col="_structure")       # XYZ-driven
derive = derive_chemistry(                                  # both — SMILES wins
    smiles_col="SMILES",
    structure_col="_structure",
)

Produces: standardized_smiles, canonical_smiles, inchi, inchi_key, selfies, molecular_formula, molecular_weight.

XYZ → SMILES inference

When SMILES is missing for a row but structure_col resolves, derive_chemistry runs xyz_to_smiles on the XYZ block first, then feeds the resulting SMILES into the rest of the chain. The default is RDKitXyzToSmiles() (RDKit DetermineBonds, neutral charge) — best-effort: bond perception recovers connectivity but not stereochemistry. Rows where conversion fails report a per-row error and proceed with smiles=None.

Override xyz_to_smiles to swap the inference strategy. The contract is a single callable that takes the XYZ block and returns a SMILES string: pass a function or an instance of any class that defines __call__:

from lcmd_db.apps.molecules.services.conversion import RDKitXyzToSmiles

derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=RDKitXyzToSmiles(charge=-1),     # anionic species
)

derive_chemistry(
    structure_col="_structure",
    xyz_to_smiles=lambda xyz: my_inference(xyz),   # custom strategy
)
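
How such a callable is consumed can be sketched without RDKit. The code below is an illustration under stated assumptions, not the library's implementation: fill_smiles and LookupXyzToSmiles are hypothetical names, the structure column is assumed to hold the XYZ text itself, and a failed conversion is assumed to surface as an exception.

```python
def fill_smiles(rows, xyz_to_smiles, smiles_col="SMILES", structure_col="_structure"):
    """Fill missing SMILES from XYZ text; an existing SMILES is never overwritten."""
    errors = []
    for index, row in enumerate(rows):
        if row.get(smiles_col):
            continue  # SMILES wins when both are present
        try:
            row[smiles_col] = xyz_to_smiles(row[structure_col])
        except Exception as exc:
            errors.append((index, str(exc)))  # reported per row
            row[smiles_col] = None  # the row proceeds without a SMILES
    return errors


class LookupXyzToSmiles:
    """Toy strategy: any object with __call__ satisfies the contract."""

    def __init__(self, table):
        self.table = table

    def __call__(self, xyz):
        return self.table[xyz]  # KeyError stands in for a failed conversion


rows = [
    {"SMILES": "CCO", "_structure": "ethanol-xyz"},  # has SMILES: left untouched
    {"SMILES": None, "_structure": "water-xyz"},     # inferred from the structure
    {"SMILES": None, "_structure": "unknown-xyz"},   # conversion fails
]
strategy = LookupXyzToSmiles({"water-xyz": "O", "ethanol-xyz": "WRONG"})
errors = fill_smiles(rows, strategy)
```

A plain function or lambda could be passed in place of the class instance; the loop only ever calls the object with one argument.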

To run the resolution standalone (e.g. inspect the filled SMILES column before deriving the rest), compose resolve_smiles_from_xyz directly:

from lcmd_db.core.lib.importers.steps import resolve_smiles_from_xyz

parse = (
    read_csv(path="data.csv")
    >> attach_structures(pattern="xyz/{structure_id}.xyz")
    >> resolve_smiles_from_xyz(structure_col="_structure")
)
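
For readers unfamiliar with the >> chaining used in these pipelines, one common way to implement it is through __rshift__. The sketch below assumes each step is a callable from rows to rows; the Step class is hypothetical and says nothing about how lcmd_db's steps are built internally.

```python
class Step:
    """Hypothetical step wrapper: a function from rows to rows."""

    def __init__(self, func):
        self.func = func

    def __call__(self, rows):
        return self.func(rows)

    def __rshift__(self, other):
        # a >> b builds a new step that runs a, then feeds its output to b.
        return Step(lambda rows: other(self(rows)))


# Toy steps standing in for read_csv and attach_structures.
read = Step(lambda _: [{"structure_id": "mol_0042"}])
attach = Step(lambda rows: [{**r, "_structure": f"xyz/{r['structure_id']}.xyz"} for r in rows])
count = Step(len)

parse = read >> attach
result = parse(None)
total = (parse >> count)(None)
```

Since a composed pipeline is itself a Step, it can be extended with further steps, which is what makes inserting an inspection step in the middle of a chain cheap.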

Loading

load_molecules does the bulk insert. Defaults assume the upstream pipeline ran derive_chemistry, so the only argument you usually pass is smiles_col:

from lcmd_db.core.lib.importers.steps import load_molecules

load = load_molecules(smiles_col="SMILES")

When you need to override more fields:

load = load_molecules(
    smiles_col="SMILES",
    name_col="structure_id",   # use a CSV column as the molecule's display name
    structure_col=None,        # disable structure attachment for this subset
)

Putting it together

Reference examples:

  • apps/backend/lcmd_db/registry/subsets/oscar/oscar_nhc.py — SMILES-driven: 30 properties, archive fetch, single-CSV parse, full chemistry derive, straightforward load.
  • apps/backend/lcmd_db/registry/subsets/spahm/spahm_qm7.py — XYZ-only QM7: no SMILES column in the source data, structures come from XYZ files. Pair attach_structures with derive_chemistry(structure_col="_structure") to get the full chemistry suite (SMILES, InChI, formula, …) inferred from the XYZ blocks.
