Prepare Inputs
This page defines the current input contract for the runner and the core analysis workflow.
Input root layout
Inputs are discovered from:
<input-root>/<project_id>/01_sources/
With the published chart defaults, that usually maps to:
/data/input/<project_id>/01_sources/
The required entry point is:
<input-root>/<project_id>/01_sources/sources.json
Minimum required files
At minimum, provide:
sources.jsonportfolio_selected.csv
The portfolio_selected.csv file is resolved through sources.json, so both must be present and consistent.
High-level intake checklist
Before you worry about the runtime folder layout, make sure the submission itself is intelligible. A strong package usually starts with:
- target identifiers such as gene symbols, protein names, aliases, and stable IDs
- compound identifiers or chemical structures such as SMILES
- indication and context-of-use framing
- modality, delivery, or exclusion constraints
- citations or source links when they are available
This is the practical handoff checklist teams use before they normalize the materials into the runner-facing file contract.
Submission hygiene
The intake material should also follow a few simple packaging rules:
- provide one target per request unless a multi-target workflow has been explicitly approved
- bundle supporting context in a short, structured memo rather than scattering it across unrelated files
- keep filenames stable and self-explanatory
- include known liabilities, historical failure modes, or important exclusions when they are already known
- include specific questions the run should help answer when the submission is exploratory rather than purely operational
Optional files
Add these when they are available for your project:
assays.csvcompounds.csvtargets.csvstructures/*.pdb
Missing optional files do not necessarily prevent a run, but they can reduce downstream module coverage and interpretation confidence.
Inputs that change category routing
Some optional-looking files have policy significance because they change category resolution rather than just adding extra context.
Use this rough mapping:
- no structure files and no assay table usually keeps the run in a ligand-only category
- staged receptor structure can move the run into a structure-backed category
- staged labeled assay data can move the run into an assay-aware category
- physics-audit paths require more than a structure file; they also depend on admissibility and preflight constraints
That is why input preparation should be reviewed against Category Policy and Routing, not just against the file checklist.
sources.json expectations
sources.json should reference files located under the same 01_sources/ directory. Typical keys include:
portfolio_selected_csvprimary_candidate_idassays_csvcompounds_csvtargets_csvpdbs
The exact key names can vary slightly by pipeline generation path, but the operational rule is stable: the manifest must point to real files under 01_sources/ that the runner can resolve at runtime.
For customer-facing intake flows, sources.json can also carry higher-level context such as:
indicationcontext_of_useevidence_positiveevidence_negative
That intake-style context is useful for onboarding and interpretation, but it does not replace the runner-facing file references required at execution time.
targets.csv in onboarding flows
targets.csv is optional in the deployed runner contract, but it is commonly used in onboarding and target-diligence intake flows.
Typical columns include:
target_symboluniprot_idtarget_namehgnc_idorganismmodalitymechanism_of_action
Use it to make the target context explicit early, then keep the staged runtime bundle aligned with the manifest-driven contract.
portfolio_selected.csv expectations
The selected portfolio file must include one of the following columns:
smilescanonical_smiles
Recommended additional columns:
candidate_idcompound_idname- Any customer metadata columns you want carried forward into downstream context
If you are using multiple candidates, make sure the row identifiers and primary_candidate_id in sources.json do not conflict.
Input provenance expectations
Prepare inputs so an operator or auditor can answer these questions without guesswork:
- Which files were supplied by the customer
- Which file is the authoritative selection file
- Which structure files belong to which candidate or target
- Which identifiers should be preserved in exported reports
Keeping that mapping clean improves reproducibility and reduces manual support work later in the run.
Common failure mode
If the runner fails with an error similar to:
Project directory not found: <input-root>/<project_id>
the usual causes are:
config.projectIddoes not match the staged folder name- Files were copied into the wrong directory level
- The runner does not have read access to the staged path
Quick check:
ls -la /data/input/<project_id>/01_sources/
Output scoping note
Input lookup is project-scoped, but outputs are run-scoped:
<output-root>/<run_id>/
This is expected behavior and prevents collisions when the same project is run multiple times.
Recommended validation path
If PreFlight UI is part of your workflow, validate the package before submission:
Then confirm the staged package is compatible with the intended category path in Category Policy and Routing.