Fragments

A fragment subset has four moving pieces:

Fragment types — the named pools (e.g. "Lewis Acid", "Backbone")
Slots — what gets filled at assembly time, each typed by one fragment type
An assembler — a Python function (FragmentsModel) -> str returning the assembled SMILES
A pipeline that loads the pool contents

Fragment types

from lcmd_db.core.lib.importers import FragmentTypeDefinition

fragment_types = [
    FragmentTypeDefinition(
        name="Lewis Acid",
        slug="lewis_acid",                  # identifier used in slots and load_fragments
        description="Boron-attaching fragments",
        color="#EF4444",                    # used by the UI
    ),
    FragmentTypeDefinition(
        name="Lewis Base",
        slug="lewis_base",
        description="Nitrogen-attaching fragments",
        color="#3B82F6",
    ),
]

Slots

Slots describe the assembly recipe — how many of each type the assembler takes, and which are optional.

from lcmd_db.apps.fragments.assembly_types import Slot

slots = [
    Slot(id="LA1", fragment_type="lewis_acid", description="First Lewis acid"),
    Slot(id="LA2", fragment_type="lewis_acid", description="Second Lewis acid"),
    Slot(id="LB",  fragment_type="lewis_base", description="Lewis base"),
    Slot(
        id="extra",
        fragment_type="lewis_base",
        description="Optional second base",
        required=False,
        default=None,
    ),
]

fragment_type must match a FragmentTypeDefinition.slug.
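The slug rule above can be checked by hand before a subset is registered. The following is a minimal sketch, not the project's own validator: FragmentTypeStub, SlotStub and unknown_slot_types are hypothetical stand-ins that carry only the fields the check needs, so the snippet runs without lcmd_db installed.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for FragmentTypeDefinition and Slot, reduced to
# the fields this check needs; the real classes live in lcmd_db.
@dataclass(frozen=True)
class FragmentTypeStub:
    name: str
    slug: str


@dataclass(frozen=True)
class SlotStub:
    id: str
    fragment_type: str
    required: bool = True


def unknown_slot_types(
    fragment_types: list[FragmentTypeStub], slots: list[SlotStub]
) -> dict[str, str]:
    """Map slot id -> fragment_type for every slot whose type matches no slug."""
    slugs = {ft.slug for ft in fragment_types}
    return {s.id: s.fragment_type for s in slots if s.fragment_type not in slugs}


fragment_types = [
    FragmentTypeStub(name="Lewis Acid", slug="lewis_acid"),
    FragmentTypeStub(name="Lewis Base", slug="lewis_base"),
]
slots = [
    SlotStub(id="LA1", fragment_type="lewis_acid"),
    SlotStub(id="LB", fragment_type="lewis_base"),
    # Common mistake: the display name is used instead of the slug.
    SlotStub(id="extra", fragment_type="Lewis Base", required=False),
]

bad = unknown_slot_types(fragment_types, slots)
print(bad)
```

Here the third slot is reported because it refers to the display name rather than the slug.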

The assembler

The assembler runs in an isolated subprocess via uv run --script, not inside the Django process. That's how the backend invokes user-provided chemistry without polluting its own environment. The assembler module must therefore be a PEP 723 inline script with its own shebang and dependency block:

apps/backend/lcmd_db/registry/subsets/my_fragments/assembler/default.py

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = ["pydantic>=2.0", "rdkit"]
# ///

from pydantic import BaseModel


class MyFragments(BaseModel):
    LA1: str
    LA2: str
    LB: str
    extra: str | None = None

    model_config = {"frozen": True}


def my_assembler(fragments: MyFragments) -> str:
    """Assemble a SMILES string from the fragments."""
    # Minimal example: the optional `extra` slot is accepted but not used here.
    return f"B({fragments.LA1})({fragments.LA2}).N({fragments.LB})"

The first parameter must be type-annotated with a Pydantic model. The backend reads that model from the function signature and uses it for validation.

Field names in the Pydantic model must equal the slot ids. Optional slots (required=False) become str | None fields with a default. A mismatch raises an error at subset construction time.

Declare every third-party import in the dependencies list. The first run populates the uv cache; thereafter each invocation reuses it. On a fresh checkout, pre-warm the cache for every registered assembler — otherwise the first import-time invocation pays the install cost in the request path:

just manage cache_assembler_deps

Loading the pools

load_fragments writes one pool per call. When several pools share the same source CSV (one column per pool), run read_csv once per column. A small per-subset helper deduplicates the column and renames it to smiles:

import polars as pl

from lcmd_db.core.lib.importers import HasRows, ImportContext, ParseError, Step
from lcmd_db.core.lib.importers.steps import (
    http, load_fragments, read_csv,
)

def select_unique(column: str) -> Step[HasRows, HasRows]:
    """Replace ctx.rows with the unique non-empty values of `column`,
    renamed to 'smiles'. Honors ctx.limit *after* dedup so --limit N
    yields N fragments per pool (not N raw rows)."""
    async def _step(ctx: ImportContext) -> ImportContext:
        assert ctx.rows is not None
        if column not in ctx.rows.columns:
            raise ParseError(f"Column {column!r} not found")
        unique = (
            ctx.rows.select(pl.col(column))
            .drop_nulls()
            .filter(pl.col(column).str.len_chars() > 0)
            # Keep first-seen order so head(limit) picks the same rows on every run.
            .unique(maintain_order=True)
            .rename({column: "smiles"})
        )
        if ctx.limit is not None:
            unique = unique.head(ctx.limit)
        ctx.rows = unique
        return ctx
    return Step(_step)


def pool(column: str, slug: str):
    return (
        read_csv(path="fragments.csv", apply_limit=False)
        >> select_unique(column)
        >> load_fragments(type_slug=slug, smiles_col="smiles")
    )

pipeline = (
    http(url="...", filename="fragments.csv")
    >> pool("LA", "lewis_acid")
    >> pool("LB", "lewis_base")
)

Note apply_limit=False on read_csv: --limit N should yield N fragments per pool, not N raw rows, so the limit is applied inside select_unique, after deduplication. Reference: flp/__init__.py defines the same helper (named _select_unique_column) and uses it across four pools.

Wiring the subset

apps/backend/lcmd_db/registry/subsets/my_fragments/__init__.py

from lcmd_db.core.lib.importers import Subset, Source, SourceType

# fragment_types, slots, my_assembler and pipeline are the objects built in the sections above.
my_fragments = Subset(
    name="MyFragments",
    description="...",
    source=Source(name="...", type=SourceType.GITHUB, url="..."),
    fragment_types=fragment_types,
    slots=slots,
    assembler=my_assembler,
    pipeline=pipeline,
)

The subset constructor validates that:

fragment_types is non-empty when slots is set
assembler is set whenever slots is set
The first parameter of assembler is annotated with a BaseModel subclass
The assembler returns str (when an explicit return annotation is present)
Every Pydantic field name matches a slot.id, with required/optional matching slot.required
Every slot.fragment_type matches a FragmentTypeDefinition.slug
The assembler script file exists on disk (it's resolved at subset construction)
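The two signature checks in the list above can be approximated with the standard library. The sketch below is an assumption about how such a check might look, not the constructor's actual code: ModelBase stands in for pydantic.BaseModel so the snippet has no dependencies, and signature_errors is a hypothetical helper.

```python
import inspect
import typing


class ModelBase:
    """Stand-in for pydantic.BaseModel, to keep the sketch dependency-free."""


class MyFragmentsStub(ModelBase):
    LA1: str
    LB: str


def signature_errors(fn: typing.Callable, base: type) -> list[str]:
    """Check the first parameter and the return annotation of an assembler."""
    hints = typing.get_type_hints(fn)
    params = list(inspect.signature(fn).parameters)
    if not params:
        return ["assembler takes no parameters"]
    errors = []
    first = hints.get(params[0])
    if not (isinstance(first, type) and issubclass(first, base)):
        errors.append(
            f"first parameter must be annotated with a {base.__name__} subclass"
        )
    # The return type is only checked when an annotation is present.
    if "return" in hints and hints["return"] is not str:
        errors.append("return annotation must be str")
    return errors


def good(fragments: MyFragmentsStub) -> str:
    return f"B({fragments.LA1}).N({fragments.LB})"


def unannotated_return(fragments: MyFragmentsStub):
    return "B.N"


def bad(fragments: dict) -> bytes:
    return b""


for fn in (good, unannotated_return, bad):
    print(fn.__name__, signature_errors(fn, ModelBase))
```

The function without a return annotation passes, mirroring the rule that the return type is validated only when it is annotated explicitly.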

Putting it together

Reference example: apps/backend/lcmd_db/registry/subsets/flp/__init__.py — four pools from a single CSV, eight slots (some optional), and a multi-step assembler that implements the chromosome_to_smiles rule from the source paper.