Molecules
The OSCAR!(NHC) subset contains N-heterocyclic carbenes with 30+ computed stereoelectronic properties.
from lcmd_db import load_dataset
import polars as pl
data = load_dataset("oscar_nhc")
molecules = data.as_dataset("molecules")
mol = molecules[0]
mol.properties["smiles"] # str
mol.properties["energy"] # float
mol.properties["cation_energy"] # float
# Filter and split
heavy = molecules.filter(pl.col("molecular_weight") > 300)
train, test = molecules.train_test_split(test_size=0.2)
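To illustrate what `test_size=0.2` means, the following is a minimal sketch of a seeded, shuffled split over row indices in plain Python. It is not the library's implementation: `split_indices` and its `seed` argument are hypothetical, and `train_test_split` may shuffle or round differently.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) for a shuffled split of range(n_rows).

    Hypothetical helper for illustration; not part of lcmd_db.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # A seeded generator makes the split repeatable across runs.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.2)
print(len(train_idx), len(test_idx))
```

With 10 rows and `test_size=0.2`, two indices land in the test set and the remaining eight in the training set.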
Restrict columns to speed up downloads:
data = load_dataset(
"oscar_nhc",
molecule_properties=["smiles", "energy", "homo", "lumo"],
)
Tip
Restricting molecule_properties to only the columns you need
significantly reduces download size.
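Once only the needed columns are loaded, derived quantities can be computed from them. As a minimal sketch, the HOMO-LUMO gap is the LUMO energy minus the HOMO energy. This assumes `mol.properties` behaves like a mapping and that `homo` and `lumo` share the same unit; `homo_lumo_gap` is a hypothetical helper, not part of lcmd_db, and the values below are placeholders rather than dataset entries.

```python
def homo_lumo_gap(properties):
    """Return LUMO minus HOMO, in whatever energy unit the inputs use."""
    return properties["lumo"] - properties["homo"]


# Placeholder values for illustration only; not taken from the dataset.
example_properties = {"homo": -0.5, "lumo": 0.25}
gap = homo_lumo_gap(example_properties)
print(gap)
```

With a loaded molecule, the call would look like `homo_lumo_gap(mol.properties)`, provided both columns were requested in `molecule_properties`.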
Export
# Polars DataFrame
df = molecules.to_polars()
df.filter(pl.col("energy") < -100).select("smiles", "energy")
# pandas DataFrame
df = molecules.to_pandas()
df.loc[df["energy"] < -100, ["smiles", "energy"]]
# Requires: uv add ase
# Include structures in the download
data = load_dataset("oscar_nhc", include=["molecules", "structures"])
molecules = data.as_dataset("molecules")
atoms_list = molecules.to_ase()
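Because ASE is an optional dependency, it can help to check for it before calling `to_ase()`. The sketch below uses only the standard library; `require_module` is a hypothetical helper, not part of lcmd_db, and how the library itself reports a missing ASE install is not covered here.

```python
import importlib.util


def require_module(name, install_hint):
    """Raise ImportError with an install hint if `name` cannot be found."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(f"{name!r} is not installed; try: {install_hint}")


# Usage before calling molecules.to_ase():
# require_module("ase", "uv add ase")
```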
Structures
Include XYZ structure files in the download:
data = load_dataset("oscar_nhc", include=["molecules", "structures"])
mol = data.as_dataset("molecules")[0]
mol.structure_path # Path to .xyz file
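The file behind `structure_path` can be read without extra dependencies. The sketch below assumes the standard single-frame XYZ layout (atom count, comment line, then one `symbol x y z` row per atom); `read_xyz` is a hypothetical helper, not part of lcmd_db, and the dataset's files may carry extra columns or metadata that this ignores.

```python
from pathlib import Path


def read_xyz(path):
    """Parse a single-frame XYZ file into (symbols, coordinates, comment)."""
    lines = Path(path).read_text().splitlines()
    n_atoms = int(lines[0].strip())  # first line: number of atoms
    comment = lines[1] if len(lines) > 1 else ""  # second line: free text
    symbols, coords = [], []
    for line in lines[2:2 + n_atoms]:
        parts = line.split()
        symbols.append(parts[0])
        # Only the first three numeric columns are read as x, y, z.
        coords.append(tuple(float(value) for value in parts[1:4]))
    if len(symbols) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(symbols)}")
    return symbols, coords, comment


# Usage with a loaded molecule:
# symbols, coords, comment = read_xyz(mol.structure_path)
```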
See also
MoleculeDataset — full API reference,
load_dataset() — all loading options,
Typed Stubs — IDE autocomplete for property keys