Dataset

Base Class

class Dataset

Bases: Generic[~E]

Generic lazy dataset backed by a tabular data source.

Provides lazy loading, integer/slice indexing, polars-based filtering, column selection, train/test splitting, and export to polars, pandas, or ASE formats.

__init__(source, *, metadata=None)

property df: polars.DataFrame

Materialised polars DataFrame (collected on first access).

property lazy: polars.LazyFrame

Underlying polars LazyFrame for deferred computation.

property columns: list[str]

Column names available in the dataset.

property properties: list[PropertyInfo] | None

Property metadata from the API, if available.

select(*columns)

Return a new dataset restricted to the given columns. The id column is always kept, even if it is not listed.

Parameters:

*columns (str) – Column names to keep.

Return type:

Dataset[E]

filter(expr)

Return a new dataset containing only rows matching the expression.

Parameters:

expr (polars.Expr) – A polars expression, e.g. pl.col("weight") > 100.

Return type:

Dataset[E]

train_test_split(test_size=0.2, *, seed=42)

Randomly shuffle the rows and split them into a train dataset and a test dataset.

Parameters:
  • test_size (float (default: 0.2)) – Fraction of data to use for the test set.

  • seed (int (default: 42)) – Random seed for reproducibility.

Return type:

tuple[Dataset[E], Dataset[E]]

to_polars()

Return the dataset as a polars DataFrame.

Return type:

polars.DataFrame

to_pandas()

Return the dataset as a pandas DataFrame.

Return type:

pandas.DataFrame

to_ase()

Convert each molecule entry to an ase.Atoms object.

Requires ase to be installed and structure files to be available.

Return type:

list[ase.Atoms]

Specialized Datasets

class MoleculeDataset

Bases: Dataset[Molecule[~Properties]], Generic[~Properties]

class ReactionDataset

Bases: Dataset[Reaction[~Properties]], Generic[~Properties]

class FragmentDataset

Bases: Dataset[Fragment[~Properties, ~FType]], Generic[~Properties, ~FType]
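The "Bases:" lines above describe a common typing pattern: a generic base class is specialised to a fixed entry type while staying generic over that entry's property type. The sketch below reproduces the pattern with the standard typing module only. Every class in it is a simplified, hypothetical stand-in that shares a name with the library's classes for readability; the fields, the list-backed storage, and BasicProps are assumptions.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

E = TypeVar("E")                    # entry type held by a dataset
Properties = TypeVar("Properties")  # property schema attached to an entry


@dataclass
class Molecule(Generic[Properties]):
    # Stand-in entry type; the real Molecule class is not documented here.
    id: int
    properties: Properties


class Dataset(Generic[E]):
    # Stand-in base class backed by a plain list instead of a tabular source.
    def __init__(self, entries: list[E]) -> None:
        self._entries = entries

    def __len__(self) -> int:
        return len(self._entries)

    def __getitem__(self, index: int) -> E:
        return self._entries[index]


class MoleculeDataset(Dataset[Molecule[Properties]], Generic[Properties]):
    # Fixes E to Molecule[Properties] but remains generic over Properties.
    pass


@dataclass
class BasicProps:
    # Hypothetical property schema used only for this example.
    weight: float


ds: MoleculeDataset[BasicProps] = MoleculeDataset([Molecule(1, BasicProps(16.04))])
first = ds[0]  # a type checker infers Molecule[BasicProps]
```

With this layout a type checker can infer that indexing a MoleculeDataset[BasicProps] yields a Molecule[BasicProps], so attribute access such as first.properties.weight is checked statically.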