Molecules

The OSCAR!(NHC) subset contains N-heterocyclic carbenes with 30+ computed stereoelectronic properties.

from lcmd_db import load_dataset
import polars as pl

data = load_dataset("oscar_nhc")
molecules = data.as_dataset("molecules")

mol = molecules[0]
mol.properties["smiles"]           # str
mol.properties["energy"]           # float
mol.properties["cation_energy"]    # float

# Filter and split
heavy = molecules.filter(pl.col("molecular_weight") > 300)
train, test = molecules.train_test_split(test_size=0.2)
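To make the `test_size=0.2` argument concrete, here is a minimal sketch of a 20% hold-out on plain Python records. It is an illustration only: `split_records` and its `seed` parameter are hypothetical names, and whether the library's `train_test_split` shuffles or accepts a seed is not stated here.

```python
import random


def split_records(records, test_size=0.2, seed=0):
    """Shuffle a copy of `records` and hold out a `test_size` fraction.

    Hypothetical helper for illustration; not part of lcmd_db.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records; the "smiles" values are not real SMILES strings.
records = [{"smiles": f"mol-{i}"} for i in range(10)]
train, test = split_records(records, test_size=0.2, seed=42)
```

With ten records and `test_size=0.2`, two records go to the test set and eight stay in the training set.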

Restrict columns to speed up downloads:

data = load_dataset(
    "oscar_nhc",
    molecule_properties=["smiles", "energy", "homo", "lumo"],
)

Tip

Restricting molecule_properties to only the columns you need significantly reduces download size.
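Once only the needed columns are loaded, derived quantities can be computed directly from the property mapping. The sketch below computes a HOMO-LUMO gap from a plain dict shaped like `mol.properties`. It assumes that `homo` and `lumo` are orbital energies stored as floats in the same unit, which this page does not confirm; `homo_lumo_gap` is a hypothetical helper and the numbers are made up.

```python
def homo_lumo_gap(properties):
    """Return LUMO minus HOMO, in whatever unit the two values share.

    Hypothetical helper; assumes both keys hold orbital energies as floats.
    """
    return properties["lumo"] - properties["homo"]


# Made-up values for illustration only; not taken from the dataset.
example = {"smiles": "C", "energy": -1.0, "homo": -0.25, "lumo": 0.125}
gap = homo_lumo_gap(example)
```

With a real dataset, the same function could be called as `homo_lumo_gap(mol.properties)`, provided both keys were requested in `molecule_properties`.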

Export

To a Polars DataFrame:

df = molecules.to_polars()
df.filter(pl.col("energy") < -100).select("smiles", "energy")

To a pandas DataFrame:

df = molecules.to_pandas()
df[df["energy"] < -100][["smiles", "energy"]]

To a list of ASE Atoms objects:

# Requires: uv add ase
# Include structures in the download
data = load_dataset("oscar_nhc", include=["molecules", "structures"])
molecules = data.as_dataset("molecules")
atoms_list = molecules.to_ase()

Structures

Include XYZ structure files in the download:

data = load_dataset("oscar_nhc", include=["molecules", "structures"])
mol = data.as_dataset("molecules")[0]
mol.structure_path  # Path to .xyz file
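If ASE is not installed, the file behind `mol.structure_path` can still be read with the standard library. The sketch below assumes the common single-frame XYZ layout (atom count on the first line, a comment on the second, then one `symbol x y z` line per atom). `read_xyz` is a hypothetical helper, and the water geometry is only there to make the example runnable.

```python
import tempfile
from pathlib import Path


def read_xyz(path):
    """Parse a single-frame XYZ file into (comment, [(symbol, x, y, z), ...]).

    Hypothetical helper; assumes the conventional XYZ layout.
    """
    lines = Path(path).read_text().splitlines()
    n_atoms = int(lines[0].split()[0])          # line 1: number of atoms
    comment = lines[1] if len(lines) > 1 else ""  # line 2: free-form comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]      # ignore any extra columns
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


# Demo on a small illustrative file, not a structure from the dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "water.xyz"
    demo.write_text(
        "3\n"
        "water (illustrative geometry)\n"
        "O 0.000 0.000 0.117\n"
        "H 0.000 0.757 -0.467\n"
        "H 0.000 -0.757 -0.467\n"
    )
    comment, atoms = read_xyz(demo)
```

For a dataset molecule, the call would be `read_xyz(mol.structure_path)`.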

See also

MoleculeDataset: full API reference
load_dataset(): all loading options
Typed Stubs: IDE autocomplete for property keys