Configuration reference¶
Every key of the config.toml, by section. For a task-oriented walkthrough of
how to customize a config, start with Writing configs;
this page is the exhaustive listing.
A config is one TOML file. Each stage reads its parameters through a typed
accessor whose defaults are the single source of truth (the frozen *Params
dataclasses in src/deeperfly/config.py); the packaged
default_config.toml mirrors them exactly. An unknown key in a stage table is a
hard error that names the allowed keys. Performance-only knobs (batch_size,
decode_buffer, [io.image]) never invalidate a stage's cache; everything else
that affects a result does.
The top-level layout:
[[sources]] # footage globs (shared input)
[io.image] # image-sequence decode
[skeleton] # tracked points and limbs
[cameras.defaults] # rig geometry: shared defaults
[cameras.<name>] # rig geometry: per-view overrides
[pipeline] # which stages run
[pose2d] # 2D detection: knobs + detection plan sub-tables
[bundle_adjustment] # camera refinement
[pictorial_structures] # opt-in peak recovery
[triangulation] # 2D -> 3D
[visualization] # output videos
[[sources]] — footage¶
An array of tables; each names a footage glob matched inside the recording
directory. A source can feed several pathways and a visualization imshow panel.
| Key | Type | Default | Description |
|---|---|---|---|
name |
str | required | Source identifier (referenced by pathways and views). |
filename |
str | the name |
Glob inside the recording dir: a named file (camera_0.mp4), a bare prefix (camera_1 → camera_1*, a video or image sequence), or a wildcard. |
A source's footage is one video file or a naturally-sorted image sequence. A directory is a valid recording only when every source matches footage with the same file/frame count.
[io.image] — image decode¶
Video files use PyAV; image sequences use OpenCV. The only knob:
| Key | Type | Default | Description |
|---|---|---|---|
workers |
int | 0 |
Image-decode threads. 0 = auto (one per CPU). |
[skeleton] — tracked points¶
The tracked points and their structure. Omit the section entirely to use the default 38-point fly skeleton.
| Key | Type | Default | Description |
|---|---|---|---|
name |
str | "skeleton" |
Skeleton identifier (e.g. "fly38"). |
point_names |
list[str] | required | Ordered tracked-point names; the length is P. |
limb_points |
table | {} |
[skeleton.limb_points]: each limb name → its points in kinematic-chain order. |
limb_palette |
table | {} |
[skeleton.limb_palette]: each limb name → a hex plotting color. Limbs without an entry fall back to a default colormap. |
[skeleton]
name = "fly38"
point_names = ["lf_thorax_coxa", "lf_coxa_trochanter", "..."]
[skeleton.limb_points]
lf_leg = ["lf_thorax_coxa", "lf_coxa_trochanter", "lf_femur_tibia", "lf_tibia_tarsus", "lf_claw"]
[skeleton.limb_palette]
lf_leg = "#0f7399"
Which view sees which point is not set here — it is the union of the
[pose2d.output_points] tables.
[cameras.*] — rig geometry¶
Each [cameras.<name>] is a geometric view: pure intrinsics + extrinsics, no
footage. [cameras.defaults] is merged into every view; per-view tables override
it (the default rig sets just azimuth_deg per view). A view's intrinsics
describe the raw frame of the source feeding it.
Intrinsics:
| Key | Type | Default | Description |
|---|---|---|---|
focal_length_px |
float or [float, float] | required | [fx, fy] in raw-frame pixels (a scalar is allowed when fx == fy). |
principal_point_px |
[float, float] | image center ((w-1)/2, (h-1)/2) |
Principal point [cx, cy]. Omit to use each view's image center. |
distortion_coefficients |
list[float] | [] |
OpenCV-ordered distortion coefficients; empty means no distortion. |
Extrinsics (orbit / look-at): the cameras orbit a target near the world
origin. World up is +z.
| Key | Type | Default | Description |
|---|---|---|---|
look_at |
[float, float, float] | [0, 0, 0] |
World point the camera looks at. |
distance |
float | required | Distance from look_at to the camera center. |
azimuth_deg |
float | 0.0 |
Longitude around look_at. |
elevation_deg |
float | 0.0 |
Latitude above the horizon (±90 is undefined — the roll becomes ambiguous). |
roll_deg |
float | 0.0 |
Rotation about the optical axis. |
Explicit rvec / tvec / rotation_matrix / position keys are not
accepted in the config (they are rejected with a pointer to the orbit keys); use
the orbit parameters. The internal CameraGroup still uses rvec / tvec.
[cameras.defaults]
focal_length_px = [22388.125, 22388.125]
distortion_coefficients = []
look_at = [0.0, 0.0, 0.0]
distance = 107.463
elevation_deg = 0.0
roll_deg = 0.0
[cameras.rh]
azimuth_deg = -120
[pipeline] — which stages run¶
One do_<stage> boolean per stage. Each enabled stage reads its own [<stage>]
table.
| Key | Type | Default | Description |
|---|---|---|---|
do_pose2d |
bool | true |
Detect 2D pose in every view. |
do_bundle_adjustment |
bool | true |
Refine the cameras. |
do_pictorial_structures |
bool | false |
DeepFly3D-style peak recovery (opt-in). |
do_triangulation |
bool | true |
Triangulate 2D → 3D. |
do_visualization |
bool | true |
Render the videos. |
[pose2d] — 2D detection¶
The [pose2d] table holds the detector's performance knobs and (as sub-tables)
the detection plan — what to detect and how.
Performance knobs:
| Key | Type | Default | Description |
|---|---|---|---|
precision |
str | "bfloat16" |
Forward precision: "float32" (reference), "float16" (CUDA autocast, ~1.5–2× faster), "bfloat16" (default, wider range). Ignored on CPU/MPS. |
batch_size |
int | 16 |
GPU forward batch (images per forward). Clamped to ≥ 1; throughput plateaus by ~16 on a fast GPU. |
decode_buffer |
int | 4 |
Decode queue depth, in multiples of batch_size. Clamped to ≥ 1. Peak frames/camera ≈ (decode_buffer + 2) * batch_size. |
[[pose2d.preprocessors]]¶
Named, reusable frame-op pipelines, referenced by a pathway's preprocessor.
| Key | Type | Description |
|---|---|---|
name |
str | Preprocessor identifier. |
ops |
list[table] | Ordered frame ops (below); [] = identity. |
Ops (run in written order; flips/rotations do not commute):
| Op | Fields | Effect |
|---|---|---|
fliplr |
— | Left–right flip. |
flipud |
— | Up–down flip. |
rot90 |
k (int) |
k counter-clockwise quarter-turns (any sign). |
crop |
x, y, width, height |
Keep a window. |
resize |
scale, or width/height; optional interpolation ("bilinear"/"nearest") |
Rescale. |
Detections are mapped back into the raw frame by inverting these ops, so a preprocessor never moves the stored 2D or the reconstructed 3D.
[[pose2d.models]]¶
A detector network and its input contract.
| Key | Type | Description |
|---|---|---|
name |
str | Model identifier (referenced by pathways). |
class |
str | Network registry key ("hourglass" = DeepFly2D). |
weights |
str | Checkpoint path; "" / omitted uses the auto-provisioned cache. |
input_size |
[int, int] | (height, width) the network expects; frames are resized to it and peaks scaled back. |
mean |
float | Scalar subtracted after /255 normalization. |
n_out_channels |
int | Output heatmap count (validated against the weights). |
[[pose2d.pathways]]¶
A named source → preprocessor → model inference run. Says what to detect on.
| Key | Type | Required | Description |
|---|---|---|---|
name |
str | yes | Unique pathway identifier (referenced by output_points). |
source |
str | yes | The [[sources]] name to detect on. |
model |
str | yes | The [[pose2d.models]] name to use. |
preprocessor |
str | no | A [[pose2d.preprocessors]] name; omit for identity. |
[pose2d.output_points.<view>]¶
For each view, where every tracked point's data comes from. A table keyed by point name:
point = { pathway, out_channel } fills that point of the view from output
channel out_channel of the named pathway. Keying by (view, point) means each
point has exactly one source (a repeat is an error); a (view, point) left out
stays unobserved (NaN). That union is the visibility.
[bundle_adjustment] — camera refinement¶
Fly-as-target bundle adjustment over scipy.optimize.least_squares.
| Key | Type | Default | Description |
|---|---|---|---|
points_to_use |
list[str] or omitted | the 30 leg points | Skeleton point names that drive BA. Omit the key to use all keypoints. |
fixed |
list[str] | [] |
Parameters held constant (grammar below); anchors the world gauge. |
shared |
list[list[str]] | [] |
Groups of parameters tied together, e.g. [["lf.tvec[2]", "rf.tvec[2]"]]. |
weigh_by_confidence |
bool | true |
Scale each reprojection residual by sqrt(confidence); zero/non-finite confidences drop the observation (all-zero falls back to uniform). |
max_frames |
int or omitted | 100 |
Bundle-adjust on at most this many frames (subsampled). Omit / null for all. |
frame_sampling |
str | "even" |
Which frames to keep (below). |
| other keys | — | — | Any remaining flat key (max_nfev, loss, f_scale, tr_solver, …) is forwarded to scipy.optimize.least_squares. |
fixed / shared grammar — a reference is <camera>.<param> with optional
indexing, and * wildcards the camera:
"*.intr"— every camera's intrinsics."f.rvec","f.tvec"— the front camera's orientation / position."rm.tvec[2]"— one component (the z distance) of a camera's translation.
frame_sampling strategies:
| Value | Keeps |
|---|---|
"even" |
Evenly spaced over the recording (temporal spread). |
"confidence" |
The highest-confidence frame in each time bin. |
"coverage" |
The frame in each bin with the most points seen by ≥ 2 cameras. |
"diversity" |
Frames whose postures are most spread apart. |
[pictorial_structures] — peak recovery¶
Runs only when do_pictorial_structures = true. Operates on the detector's top-K
candidates (extracted and cached during detection).
| Key | Type | Default | Description |
|---|---|---|---|
k |
int | 5 |
Candidate peaks per joint. |
temporal |
bool | false |
Add a temporal-consistency term. |
lam |
float | 1.0 |
Bone-length prior weight. |
[triangulation] — 2D → 3D¶
How the per-view 2D points become one 3D point.
| Key | Type | Default | Description |
|---|---|---|---|
method |
str | "ransac" |
"ransac" (largest multi-view consensus, robust), "greedy" (drop the worst-reprojecting view), or "dlt" (plain least-squares). |
ransac_threshold |
float | 15.0 |
Inlier reprojection cutoff (px) for method = "ransac". |
min_inliers |
int | 2 |
Minimum agreeing views to accept a point (ransac). |
reproj_threshold |
float | 40.0 |
Per-view reprojection cutoff (px) for method = "greedy". |
max_drops |
int | 5 |
Max views dropped per offending point (greedy). |
weigh_by_confidence |
bool | false |
Scale the DLT by sqrt(confidence) (the mirror of the BA knob, which defaults true). |
[visualization] — output videos¶
Global settings plus one [[visualization.videos]] per output MP4.
Global ([visualization]):
| Key | Type | Default | Description |
|---|---|---|---|
background |
str or [r, g, b] | "black" |
Canvas fill (overridable per video / per panel). |
output_fps |
float | input fps | Explicit output frame rate for every video. |
speed |
float | 1.0 |
Scale the input fps instead (0.5 = slow motion). output_fps wins if both are set. |
[visualization.kwargs] — draw-op defaults shared by every video, keyed by
the plot op name (imshow, skeleton_2d, skeleton_3d). Kwargs merge across
three levels — global → per-video kwargs → per-panel extra keys — most specific
winning.
[[visualization.videos]]:
| Key | Type | Default | Description |
|---|---|---|---|
video_name |
str | required | Output filename (<video_name>.mp4). |
panels |
list[table] | required | Ordered panels (below); they draw in order, so a skeleton panel over an imshow at the same offset overlays it. |
width, height |
int | auto-size | Canvas size in pixels; omit to fit all panels. |
background |
str or [r, g, b] | inherits global | Per-video canvas fill. |
kwargs |
table | {} |
Per-video draw-op kwargs (merges over the global). |
Panel — one draw op for one view at a pixel offset:
| Key | Type | Default | Description |
|---|---|---|---|
plot |
str | required | "imshow" (the view's frame), "skeleton_2d" (its 2D detections), or "skeleton_3d" (the 3D skeleton reprojected into the view). |
view |
str | required | Camera/view name. |
x0, y0 |
int | 0 |
Top-left pixel of the panel. |
scale |
float | 1.0 |
Uniform scale. |
width, height |
int | from scale |
Target box (priority over scale); one given → the other follows to keep aspect. |
background |
str or [r, g, b] | inherits | Per-panel fill. |
| extra keys | — | — | Forwarded as draw-op kwargs (point_radius, line_thickness, palette, …). |
A skeleton_3d panel needs a 3D pose; a video that requires 3D is skipped (with
a logged reason) when the result has none. Videos are encoded H.264 / libx264 via
PyAV on the CPU.