Fragments
Define a subset of building-block fragments and an assembler that composes them
A fragment subset has four moving pieces:
- Fragment types: the named pools (e.g. "Lewis Acid", "Backbone")
- Slots: what gets filled at assembly time, each typed by one fragment type
- An assembler: a Python function (FragmentsModel) -> str returning the assembled SMILES
- A pipeline that loads the pool contents
Fragment types
from lcmd_db.core.lib.importers import FragmentTypeDefinition

fragment_types = [
    FragmentTypeDefinition(
        name="Lewis Acid",
        slug="lewis_acid",  # identifier used in slots and load_fragments
        description="Boron-attaching fragments",
        color="#EF4444",  # used by the UI
    ),
    FragmentTypeDefinition(
        name="Lewis Base",
        slug="lewis_base",
        description="Nitrogen-attaching fragments",
        color="#3B82F6",
    ),
]

Slots
Slots describe the assembly recipe — how many of each type the assembler takes, and which are optional.
from lcmd_db.apps.fragments.assembly_types import Slot

slots = [
    Slot(id="LA1", fragment_type="lewis_acid", description="First Lewis acid"),
    Slot(id="LA2", fragment_type="lewis_acid", description="Second Lewis acid"),
    Slot(id="LB", fragment_type="lewis_base", description="Lewis base"),
    Slot(
        id="extra",
        fragment_type="lewis_base",
        description="Optional second base",
        required=False,
        default=None,
    ),
]

fragment_type must match a FragmentTypeDefinition.slug.
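The slug rule can be checked mechanically. The sketch below is illustrative only: FragmentTypeStub, SlotStub and unknown_slot_types are hypothetical stand-ins written for this example, not the real FragmentTypeDefinition and Slot classes from lcmd_db, which may carry more fields and validate differently.

```python
from dataclasses import dataclass


# Stand-ins for illustration only; the real classes live in lcmd_db
# and may carry more fields than shown here.
@dataclass(frozen=True)
class FragmentTypeStub:
    name: str
    slug: str


@dataclass(frozen=True)
class SlotStub:
    id: str
    fragment_type: str
    required: bool = True


def unknown_slot_types(fragment_types, slots):
    """Return {slot id: fragment_type} for slots whose type matches no slug."""
    known = {ft.slug for ft in fragment_types}
    return {s.id: s.fragment_type for s in slots if s.fragment_type not in known}


types = [
    FragmentTypeStub(name="Lewis Acid", slug="lewis_acid"),
    FragmentTypeStub(name="Lewis Base", slug="lewis_base"),
]
good = [SlotStub("LA1", "lewis_acid"), SlotStub("LB", "lewis_base")]
# "Lewis Base" is the display name, not the slug, so it does not match.
bad = good + [SlotStub("extra", "Lewis Base", required=False)]

print(unknown_slot_types(types, good))  # {}
print(unknown_slot_types(types, bad))   # {'extra': 'Lewis Base'}
```

The second call shows the most likely slip: passing the display name where the slug is expected.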
Assembler
The assembler runs in an isolated subprocess via uv run --script, not
inside the Django process. That's how the backend invokes user-provided
chemistry without polluting its own environment. The assembler module must
therefore be a PEP 723 inline script
with its own shebang and dependency block:
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
# dependencies = ["pydantic>=2.0", "rdkit"]
# ///
from pydantic import BaseModel


class MyFragments(BaseModel):
    LA1: str
    LA2: str
    LB: str
    extra: str | None = None
    model_config = {"frozen": True}


def my_assembler(fragments: MyFragments) -> str:
    """Assemble a SMILES string from the fragments."""
    return f"B({fragments.LA1})({fragments.LA2}).N({fragments.LB})"

The first parameter must be type-annotated with a Pydantic model whose fields mirror the slot ids; that model is auto-derived from the function signature for validation.
Field names in the Pydantic model must equal the slot ids. Optional
slots (required=False) become str | None fields with a default. A mismatch
raises an error at subset construction time.
Declare every third-party import in the dependencies list. The first run
populates the uv cache; thereafter each invocation reuses it. On a fresh
checkout, pre-warm the cache for every registered assembler — otherwise the
first import-time invocation pays the install cost in the request path:
just manage cache_assembler_deps

Loading the pools
load_fragments writes one pool. When several pools share the same source CSV
(one column per pool), re-run read_csv per column. A small per-subset helper
deduplicates the column and renames it to smiles:
import polars as pl

from lcmd_db.core.lib.importers import HasRows, ImportContext, ParseError, Step
from lcmd_db.core.lib.importers.steps import (
    http, load_fragments, read_csv,
)


def select_unique(column: str) -> Step[HasRows, HasRows]:
    """Replace ctx.rows with the unique non-empty values of `column`,
    renamed to 'smiles'. Honors ctx.limit *after* dedup so --limit N
    yields N fragments per pool (not N raw rows)."""

    async def _step(ctx: ImportContext) -> ImportContext:
        assert ctx.rows is not None
        if column not in ctx.rows.columns:
            raise ParseError(f"Column {column!r} not found")
        unique = (
            ctx.rows.select(pl.col(column))
            .drop_nulls()
            .filter(pl.col(column).str.len_chars() > 0)
            .unique()
            .rename({column: "smiles"})
        )
        if ctx.limit is not None:
            unique = unique.head(ctx.limit)
        ctx.rows = unique
        return ctx

    return Step(_step)


def pool(column: str, slug: str):
    return (
        read_csv(path="fragments.csv", apply_limit=False)
        >> select_unique(column)
        >> load_fragments(type_slug=slug, smiles_col="smiles")
    )


pipeline = (
    http(url="...", filename="fragments.csv")
    >> pool("LA", "lewis_acid")
    >> pool("LB", "lewis_base")
)

Note apply_limit=False on read_csv: --limit N should yield N
fragments per pool, not N raw rows. The limit moves inside select_unique
and applies after deduplication. Reference: flp/__init__.py defines the
same helper (named _select_unique_column) and uses it across four pools.
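The effect of moving the limit is easiest to see on a small column. The following is a plain-Python sketch with made-up SMILES values; it uses neither Polars nor the importer framework. Both function names are hypothetical, and dict.fromkeys serves as an order-preserving dedup, which is an assumption of the sketch (Polars' unique() only keeps input order when maintain_order=True is passed).

```python
def limit_then_dedup(values, limit):
    """What happens if the reader truncates to `limit` raw rows first."""
    head = values[:limit]
    return list(dict.fromkeys(v for v in head if v))


def dedup_then_limit(values, limit):
    """The order described for select_unique: drop empties, dedup, truncate."""
    unique = list(dict.fromkeys(v for v in values if v))
    return unique[:limit]


# Made-up column with repeated, empty and missing entries.
column = ["CC", "CC", "", "CCC", None, "CC", "c1ccccc1", "CCO"]

print(limit_then_dedup(column, 3))  # ['CC']
print(dedup_then_limit(column, 3))  # ['CC', 'CCC', 'c1ccccc1']
```

Truncating first leaves a single fragment out of three requested, because the first three raw rows are duplicates or empty; deduplicating first fills the limit.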
Wiring the subset
from lcmd_db.core.lib.importers import Subset, Source, SourceType

my_fragments = Subset(
    name="MyFragments",
    description="...",
    source=Source(name="...", type=SourceType.GITHUB, url="..."),
    fragment_types=fragment_types,
    slots=slots,
    assembler=my_assembler,
    pipeline=pipeline,
)

The subset constructor validates that:
- fragment_types is non-empty when slots is set
- assembler is set whenever slots is set
- The first parameter of assembler is annotated with a BaseModel subclass
- The assembler returns str (when an explicit return annotation is present)
- Every Pydantic field name matches a slot.id, with required/optional matching slot.required
- Every slot.fragment_type matches a FragmentTypeDefinition.slug
- The assembler script file exists on disk (it's resolved at subset construction)
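Two of the checks above concern only the assembler's signature, and they can be approximated with the standard library. The sketch below is a guess at the shape of such a check, not the constructor's actual code: signature_problems is a hypothetical helper, and ModelBase stands in for pydantic.BaseModel so the example has no dependencies.

```python
import inspect
import typing


class ModelBase:
    """Stand-in for pydantic.BaseModel so the sketch has no dependencies."""


class MyFragments(ModelBase):
    LA1: str
    LB: str


def signature_problems(func, base=ModelBase):
    """Return a list of problems with an assembler's signature (empty if fine)."""
    params = list(inspect.signature(func).parameters.values())
    if not params:
        return ["assembler takes no parameters"]
    hints = typing.get_type_hints(func)
    problems = []
    first = hints.get(params[0].name)
    if not (inspect.isclass(first) and issubclass(first, base)):
        problems.append(
            f"first parameter must be annotated with a {base.__name__} subclass"
        )
    # The return type is only checked when the author wrote one.
    if "return" in hints and hints["return"] is not str:
        problems.append("return annotation must be str")
    return problems


def good(fragments: MyFragments) -> str:
    return f"B({fragments.LA1}).N({fragments.LB})"


def unannotated(fragments):
    return ""


def wrong_return(fragments: MyFragments) -> int:
    return 0


print(signature_problems(good))          # []
print(signature_problems(unannotated))   # ['first parameter must be annotated with a ModelBase subclass']
print(signature_problems(wrong_return))  # ['return annotation must be str']
```

Note that unannotated passes the return check because it has no return annotation, mirroring the "when an explicit return annotation is present" condition.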
Putting it together
Reference example:
apps/backend/lcmd_db/registry/subsets/flp/__init__.py — four pools from a
single CSV, eight slots (some optional), and a multi-step assembler that
implements the chromosome_to_smiles rule from the source paper.