Dataset

Base Class

class Dataset

Bases: Generic[~E]

Generic lazy dataset backed by a tabular data source.

Provides lazy loading, integer/slice indexing, polars-based filtering, column selection, train/test splitting, and export to polars, pandas, or ASE formats.

__init__(source, *, metadata=None)

property df: polars.DataFrame

Materialised polars DataFrame (collected on first access).

property lazy: polars.LazyFrame

Underlying polars LazyFrame for deferred computation.

property columns: list[str]

Column names available in the dataset.

property properties: list[PropertyInfo] | None

Property metadata from the API, if available.

select(*columns)

Return a new dataset restricted to the given columns (the id column is always kept).

Parameters: *columns (str) – Column names to keep.
Return type: Dataset[E]

filter(expr)

Return a new dataset containing only rows matching the expression.

Parameters: expr (polars.Expr) – A polars expression, e.g. pl.col("weight") > 100.
Return type: Dataset[E]

train_test_split(test_size=0.2, *, seed=42)

Split into train and test datasets by random shuffling.

Parameters:

test_size (float (default: 0.2)) – Fraction of data to use for the test set.
seed (int (default: 42)) – Random seed for reproducibility.

Return type:

tuple[Dataset[E], Dataset[E]]
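
A seeded shuffle split of this kind can be sketched with the standard library alone. This page does not say how the library shuffles or rounds the test-set size, so the helper split_indices and its rounding rule are assumptions for illustration.

```python
import random


def split_indices(
    n_rows: int, test_size: float = 0.2, seed: int = 42
) -> tuple[list[int], list[int]]:
    """Hypothetical helper: return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    # A dedicated Random instance makes the shuffle reproducible for a given
    # seed without touching the global random state.
    random.Random(seed).shuffle(indices)
    # Assumption: the test-set size is rounded to the nearest whole row.
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
```

Calling the helper again with the same seed yields the same partition, which is the purpose of the seed parameter documented above.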

to_polars()

Return the dataset as a polars DataFrame.

Return type: polars.DataFrame

to_pandas()

Return the dataset as a pandas DataFrame.

Return type: pandas.DataFrame

to_ase()

Convert each molecule entry to an ase.Atoms object.

Requires ase to be installed and structure files to be available.

Return type: list[ase.Atoms]

Specialized Datasets

class MoleculeDataset

Bases: Dataset[Molecule[~Properties]], Generic[~Properties]

class ReactionDataset

Bases: Dataset[Reaction[~Properties]], Generic[~Properties]

class FragmentDataset

Bases: Dataset[Fragment[~Properties, ~FType]], Generic[~Properties, ~FType]
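
The Bases lines describe a common typing pattern: a generic container specialised to a generic entry type, with the entry's type parameter passed through. The sketch below reproduces that pattern with stand-in classes. They share names with the documented classes only for readability; the list-backed constructor and the fields on Molecule are assumptions, not the library's definitions.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

E = TypeVar("E")                    # entry type held by a dataset
Properties = TypeVar("Properties")  # type of an entry's property payload


@dataclass
class Molecule(Generic[Properties]):
    """Stand-in entry type; the real Molecule has its own fields."""
    id: int
    properties: Properties


class Dataset(Generic[E]):
    """Stand-in container backed by a list instead of a tabular source."""

    def __init__(self, entries: list[E]) -> None:
        self._entries = entries

    def __getitem__(self, index: int) -> E:
        return self._entries[index]

    def __len__(self) -> int:
        return len(self._entries)


class MoleculeDataset(Dataset[Molecule[Properties]], Generic[Properties]):
    """Fixes the entry type to Molecule while staying generic in Properties."""


ds: MoleculeDataset[dict[str, float]] = MoleculeDataset(
    [Molecule(id=1, properties={"weight": 18.0})]
)
first = ds[0]  # a type checker infers Molecule[dict[str, float]]
```

ReactionDataset follows the same shape with Reaction as the entry type, and FragmentDataset adds a second type parameter, FType.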
