Dataset imports

The launch_import_job management command submits a Kubernetes Job that runs manage.py import_subset against the prod database. The Job reuses the backend image and the secrets the backend Deployment already consumes, so there's no separate credential plumbing on the developer laptop. Jobs always land in the prod namespace.

Prerequisites

kubectl context pointing at the target cluster
Membership in the lcmd:dataset-importers group (talk to a cluster admin)
The lcmd-import-secrets Secret exists in prod (one-time setup, see below)

Usage

# Import OSCARDHBD with a 5-row limit, tail the logs, exit when the Job finishes
manage.py launch_import_job --subset OSCARDHBD --limit 5

# Submit and exit immediately
manage.py launch_import_job --all --no-wait

# Re-attach to a previously submitted Job
manage.py launch_import_job --attach import-oscardhbd-20260429-091500-abcd

Ctrl-C disconnects on the next log line, so quiet pods may take a moment to release the terminal. The Job keeps running in the cluster regardless. Re-attach with --attach <name> or kubectl logs -f -n prod job/<name>.

Validating the manifest before submission

--print-manifest emits the Job manifest to stdout and exits without submitting. Pipe to kubectl apply --dry-run=server -f - for apiserver-level validation (catches malformed fields, RBAC failures, admission-controller rejections):

manage.py launch_import_job --subset OSCARDHBD --print-manifest \
  | kubectl apply --dry-run=server -f -

Useful when iterating on the manifest, after a Kubernetes upgrade, or to debug a Job that fails on submission.

Collision protection

If another Job for the same subset is already active, submission is refused:

CommandError: Active Job(s) for subset='oscardhbd' already running:
import-oscardhbd-20260429-091500-abcd. Wait for them to finish or pass --force.

--force overrides the check. Use it only when you know the prior Job is wedged and will not write to the same rows.
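The check amounts to a filter over the Jobs currently in the namespace. The sketch below is illustrative only, not the command's actual code. It assumes Jobs carry a hypothetical subset label, and it takes already-fetched Job objects (dicts shaped like the items of kubectl get jobs -n prod -o json) so it runs without a cluster. CommandError here is a local stand-in for Django's class of the same name.

```python
class CommandError(Exception):
    """Local stand-in for django.core.management.base.CommandError."""


def check_collision(jobs, subset, force=False):
    """Raise CommandError if a Job for `subset` is still active.

    `jobs` is a list of dicts shaped like `kubectl get jobs -o json` items.
    The `subset` label key is an assumption made for this sketch.
    Returns the names of the active Jobs that were found.
    """
    wanted = subset.lower()
    active = [
        job["metadata"]["name"]
        for job in jobs
        if job["metadata"].get("labels", {}).get("subset") == wanted
        # status.active counts the Job's pending and running pods
        and job.get("status", {}).get("active", 0) > 0
    ]
    if active and not force:
        raise CommandError(
            f"Active Job(s) for subset={wanted!r} already running: "
            f"{', '.join(active)}. Wait for them to finish or pass --force."
        )
    return active
```

Finished Jobs report no active pods, so they never block a new submission in this sketch.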

Image resolution

Picked in this order, first match wins:

--image <ref> — explicit override
The container image of the running Deployment/backend in prod — keeps the Job in lockstep with what's deployed
Hard error — fail loud rather than guess

Default behavior on a developer laptop: nothing to configure; the Job runs the same image as production.
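The resolution order can be read as a short fallback chain. This is a sketch under assumptions, not the command's real code: resolve_image and its parameters are hypothetical names, and the deployed image is passed in as a plain string (or None when Deployment/backend could not be read) so the example runs without cluster access.

```python
class CommandError(Exception):
    """Local stand-in for django.core.management.base.CommandError."""


def resolve_image(explicit_image, deployed_image):
    """Return the image the Job should run; the first match wins.

    explicit_image: value of --image, or None when the flag was not passed.
    deployed_image: image read from Deployment/backend in prod, or None
                    when the Deployment could not be read.
    """
    if explicit_image:
        return explicit_image  # 1. explicit override
    if deployed_image:
        return deployed_image  # 2. stay in lockstep with what is deployed
    # 3. fail loud rather than guess
    raise CommandError("Cannot resolve import-job image")
```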

What happens in the cluster

The Job pod runs two containers, one after the other:

Init container (alpine/git) does a sparse clone of the repo and git lfs pull --include="apps/backend/data/**", dropping the data into a shared volume. Authenticates via SSH using a deploy key mounted from lcmd-import-secrets/ssh-private-key — repo-owned, not tied to any user, survives personnel changes.
Import container runs manage.py import_subset <args> with the same env as the backend Deployment — a single envFrom on backend-secrets, which carries both the database and the S3 object-storage credentials.

The init container takes ~10 minutes and leaves a ~2.4G emptyDir on the node until the Job is deleted. Subsets whose importers only download remote sources don't need it — skip it with --no-pull-lfs:

uv run manage.py launch_import_job --subset FORMED --reload --no-pull-lfs

Subsets registered with local(...) fetchers do need the pull (currently Hydroformylation, Pincer, TMGSspinPlus and the SPAHM family — grep the registry for local( for the live list). This applies to --all too: combine it with --no-pull-lfs only when no registered subset reads local data.
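The rule for when --no-pull-lfs is safe reduces to a set intersection. The sketch below is illustrative: can_skip_lfs_pull is a hypothetical helper, and the hardcoded set is only a snapshot of the names listed above (the SPAHM family members are left out because they are not enumerated here). The registry remains the authoritative source.

```python
# Snapshot of subsets named above as using local(...) fetchers.
# Assumption: incomplete on purpose; the SPAHM family is not enumerated here.
LOCAL_FETCHER_SUBSETS = {"hydroformylation", "pincer", "tmgsspinplus"}


def can_skip_lfs_pull(requested, local_subsets=LOCAL_FETCHER_SUBSETS):
    """Return True when none of the requested subsets read local LFS data.

    For --all, pass every registered subset name as `requested`.
    Names are compared case-insensitively.
    """
    requested_names = {name.lower() for name in requested}
    local_names = {name.lower() for name in local_subsets}
    return requested_names.isdisjoint(local_names)
```

With this snapshot, can_skip_lfs_pull(["FORMED"]) is true, while any request that includes Pincer is not.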

backoffLimit: 0 — partial imports are surprising; let the operator decide whether to re-run.

ttlSecondsAfterFinished: 86400 — completed Jobs auto-clean after a day.
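Put together, the settings above give a manifest of roughly the following shape. This is a hedged sketch, not the manifest the command actually emits (use --print-manifest for that). The function name, the volume name, the mount path, the import container's name and command line, and restartPolicy are assumptions. The init container's clone script and SSH key mount are omitted for brevity.

```python
def build_job_manifest(name, image, import_args, pull_lfs=True):
    """Build a Job manifest dict mirroring the settings described above."""
    # Assumed names: "lfs-data" and "/data" are placeholders for this sketch.
    mount = {"name": "lfs-data", "mountPath": "/data"}
    import_container = {
        "name": "import",  # assumed container name
        "image": image,
        # assumed entrypoint; import_args is passed through untouched
        "command": ["python", "manage.py", "import_subset", *import_args],
        # same env as the backend Deployment: one envFrom on backend-secrets
        "envFrom": [{"secretRef": {"name": "backend-secrets"}}],
        "resources": {
            "requests": {"cpu": "500m", "memory": "1Gi"},
            "limits": {"cpu": "2", "memory": "4Gi"},
        },
    }
    pod_spec = {"restartPolicy": "Never", "containers": [import_container]}
    if pull_lfs:
        # The init container fills a shared emptyDir that the import reads.
        import_container["volumeMounts"] = [mount]
        pod_spec["volumes"] = [{"name": "lfs-data", "emptyDir": {}}]
        pod_spec["initContainers"] = [
            {"name": "lfs-pull", "image": "alpine/git", "volumeMounts": [mount]}
        ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": "prod"},
        "spec": {
            "backoffLimit": 0,  # no automatic retries
            "ttlSecondsAfterFinished": 86400,  # cleaned up after one day
            "template": {"spec": pod_spec},
        },
    }
```

Passing pull_lfs=False drops the init container and the shared volume together, which is the effect --no-pull-lfs is described as having.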

One-time setup

`lcmd-import-secrets`

The init container authenticates via an SSH deploy key — uploaded to lcmd-epfl/db as a repo-scoped read-only key, owned by the repo rather than any user. The Secret manifest lives encrypted in the prod overlay alongside the other SOPS-managed secrets, so ArgoCD reconciles it like everything else.

The kustomize plumbing ships with the repo: infrastructure/kubernetes/app/overlays/prod/import-secrets.enc.yaml is a placeholder file with ssh-private-key: REPLACE_VIA_SOPS_BEFORE_FIRST_RUN. Populate it once before the first real run:

# 1. Generate a fresh ed25519 keypair (no passphrase — read by automation):
ssh-keygen -t ed25519 -N "" -C "lcmd-import-job" -f /tmp/lcmd-import-key

# 2. Upload the public half to GitHub:
#    lcmd-epfl/db → Settings → Deploy keys → Add deploy key
#    Title: lcmd-import-job
#    Key:   contents of /tmp/lcmd-import-key.pub
#    Allow write access: NO (leave unchecked — read-only)

# 3. Paste the private half into the encrypted Secret:
sops infrastructure/kubernetes/app/overlays/prod/import-secrets.enc.yaml
# Replace the placeholder with the contents of /tmp/lcmd-import-key
# (full PEM block, starting with -----BEGIN OPENSSH PRIVATE KEY-----)

# 4. Commit and push. ArgoCD picks up the new value on the next sync.
git add infrastructure/kubernetes/app/overlays/prod/import-secrets.enc.yaml
git commit -m "chore(secrets): populate lcmd-import-secrets deploy key"
git push

# 5. Securely delete the local plaintext:
shred -u /tmp/lcmd-import-key /tmp/lcmd-import-key.pub

To rotate later: regenerate the keypair, replace the deploy key on github.com, repeat steps 3–5.

Group membership

infrastructure/kubernetes/app/base/import-job-rbac.yaml binds the dataset-importer Role to the lcmd:dataset-importers Group. Grant a developer access by adding them to that group through the cluster's user mapping (OIDC claim, kubectl user, etc. — depends on the cluster setup).

Troubleshooting

"Cannot resolve import-job image"

The CLI couldn't read Deployment/backend and you didn't pass --image. Try the following:

Check your kubectl context: kubectl config current-context
Confirm group membership: kubectl auth can-i get deployment/backend -n prod
Pass --image ghcr.io/lcmd-epfl/db/backend:<tag> explicitly

Init container fails

If the Job fails immediately and the import-container logs are empty, the init container most likely failed (deploy key revoked, repo unreachable, LFS storage down). The CLI prints a hint pointing at:

kubectl describe pod -n prod <pod-name>
kubectl logs -n prod <pod-name> -c lfs-pull

The events section of describe pod carries the actual error.
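When triaging from a script, the same information is available in the pod's status. The helper below is a hypothetical sketch, not part of the CLI. It expects a dict shaped like the output of kubectl get pod <pod-name> -n prod -o json and lists the init containers that terminated with a non-zero exit code.

```python
def failed_init_containers(pod):
    """Return (name, reason, exit_code) for each failed init container.

    `pod` is a dict shaped like `kubectl get pod <pod-name> -o json`.
    An init container counts as failed when its state is `terminated`
    with a non-zero exit code.
    """
    failures = []
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if not terminated:
            continue  # still waiting or running
        exit_code = terminated.get("exitCode", 0)
        if exit_code != 0:
            failures.append((status["name"], terminated.get("reason"), exit_code))
    return failures
```

An empty result with an empty import log suggests looking at the pod's events instead, since the container may never have started.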

Job pod stuck in `Pending`

Likely a resource shortage. Check kubectl describe pod -n prod <name> for the scheduler's reason. Resource requests are hardcoded (500m CPU / 1Gi memory), with limits of 2 CPU / 4Gi. Open an issue if you see real OOMs and we'll bump them upstream.
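The scheduler's reason is also recorded on the pod as a PodScheduled condition. As a minimal sketch (scheduling_blocker is a hypothetical helper, not something the CLI ships), the function below pulls that message out of a dict shaped like kubectl get pod <name> -n prod -o json.

```python
def scheduling_blocker(pod):
    """Return the scheduler's message if the pod could not be scheduled.

    `pod` is a dict shaped like `kubectl get pod <name> -o json`.
    Returns None when no PodScheduled condition with status "False" exists.
    """
    for condition in pod.get("status", {}).get("conditions", []):
        if condition.get("type") == "PodScheduled" and condition.get("status") == "False":
            return condition.get("message")
    return None
```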

Import takes forever

Large subsets (FORMED, ~117k rows) take ~30 min on the bulk path. kubectl logs -f -n prod job/<name> (or --attach) shows progress. If it's actually stuck, check the Job pod's events.

The underlying manage.py import_subset command is documented in the import-pipeline architecture page.
Performance work that made one-shot Jobs practical: closed in #72.
