Molecules
=========

The **OSCAR!(NHC)** subset contains N-heterocyclic carbenes (NHCs), each with
more than 30 computed stereoelectronic properties. Load it, then read
per-molecule properties by key:

.. code-block:: python

   from lcmd_db import load_dataset
   import polars as pl

   data = load_dataset("oscar_nhc")
   molecules = data.as_dataset("molecules")

   mol = molecules[0]
   mol.properties["smiles"]           # str
   mol.properties["energy"]           # float
   mol.properties["cation_energy"]    # float

   # Filter and split
   heavy = molecules.filter(pl.col("molecular_weight") > 300)
   train, test = molecules.train_test_split(test_size=0.2)
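
As a rough mental model for the split on the last line above, a fractional
split can be pictured as shuffling the row indices and cutting off the last
20 %. The sketch below uses only the standard library. ``split_indices`` and
its ``seed`` parameter are hypothetical illustrations, not part of ``lcmd_db``,
and whether ``train_test_split`` shuffles or accepts a seed is not stated on
this page.

.. code-block:: python

   import random


   def split_indices(n_items, test_size=0.2, seed=0):
       """Hypothetical helper: shuffle 0..n_items-1 and split off a test fraction."""
       indices = list(range(n_items))
       # A dedicated Random instance keeps the shuffle reproducible
       # without touching the global random state.
       random.Random(seed).shuffle(indices)
       n_test = round(n_items * test_size)
       cut = n_items - n_test
       return indices[:cut], indices[cut:]


   train_idx, test_idx = split_indices(10, test_size=0.2, seed=0)
   print(len(train_idx), len(test_idx))

Every index lands in exactly one of the two lists, and the same seed always
gives the same partition.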

To load only selected property columns, pass ``molecule_properties``:

.. code-block:: python

   data = load_dataset(
       "oscar_nhc",
       molecule_properties=["smiles", "energy", "homo", "lumo"],
   )

.. tip::

   Requesting only the columns you need through ``molecule_properties``
   reduces the download size.

Export
------

.. tab-set::

   .. tab-item:: Polars
      :sync: polars

      .. code-block:: python

         df = molecules.to_polars()
         df.filter(pl.col("energy") < -100).select("smiles", "energy")

   .. tab-item:: Pandas
      :sync: pandas

      .. code-block:: python

         df = molecules.to_pandas()
         df[df["energy"] < -100][["smiles", "energy"]]

   .. tab-item:: ASE
      :sync: ase

      .. code-block:: python

         # Requires: uv add ase
         # Include structures in the download
         data = load_dataset("oscar_nhc", include=["molecules", "structures"])
         molecules = data.as_dataset("molecules")
         atoms_list = molecules.to_ase()

Structures
----------

To download the XYZ structure files as well, add ``"structures"`` to
``include``:

.. code-block:: python

   data = load_dataset("oscar_nhc", include=["molecules", "structures"])
   mol = data.as_dataset("molecules")[0]
   mol.structure_path  # Path to .xyz file
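
Once a structure file is on disk, it can be read with any XYZ-aware tool. The
sketch below parses a single-frame file with the standard library only,
assuming the conventional XYZ layout (atom count, comment line, then one
``symbol x y z`` line per atom). ``read_xyz`` is a hypothetical helper, not
part of ``lcmd_db``, and the water geometry is a hand-written stand-in for a
real ``mol.structure_path``.

.. code-block:: python

   import tempfile
   from pathlib import Path


   def read_xyz(path):
       """Hypothetical helper: parse a single-frame XYZ file."""
       lines = Path(path).read_text().splitlines()
       n_atoms = int(lines[0])      # line 1: number of atoms
       comment = lines[1]           # line 2: free-form comment
       symbols, coords = [], []
       for line in lines[2:2 + n_atoms]:
           symbol, x, y, z = line.split()[:4]
           symbols.append(symbol)
           coords.append((float(x), float(y), float(z)))
       return symbols, coords, comment


   # Stand-in file; with the dataset you would pass mol.structure_path instead.
   demo = Path(tempfile.mkdtemp()) / "water.xyz"
   demo.write_text(
       "3\n"
       "water\n"
       "O 0.000 0.000 0.117\n"
       "H 0.000 0.757 -0.469\n"
       "H 0.000 -0.757 -0.469\n"
   )
   symbols, coords, comment = read_xyz(demo)
   print(symbols, comment)

For anything beyond a quick inspection, ``ase.io.read`` handles XYZ files and
returns an ``Atoms`` object directly.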

.. seealso::

   - :class:`~lcmd_db.MoleculeDataset`: full API reference
   - :func:`~lcmd_db.load_dataset`: all loading options
   - :doc:`typed-stubs`: IDE autocomplete for property keys