Writing configs

A run is driven by a single self-contained config.toml. deeperfly init config.toml writes a fully commented copy to edit in place; deeperfly run recording/ with no -c falls back to the packaged defaults. A single file carries everything a run needs — the camera rig, which file belongs to which camera, the detector, the pipeline and the visualization.

This guide walks through customizing a config, ordered roughly by how often you'll touch each section: the first few you'll set for almost every recording, the last few you can usually leave at their defaults. For an exhaustive parameter-by-parameter listing (every key, its type and default), see the configuration reference.

The detection plan

2D detection is described by the top-level [[sources]] footage list plus the detector's own machinery under [pose2d]: [[pose2d.preprocessors]], [[pose2d.models]], [[pose2d.pathways]], and a [pose2d.output_points] mapping table. A neural network turns a preprocessed image into output channels; the plan says which footage feeds which model (the pathways) and where each output channel lands in the skeleton ([pose2d.output_points]). The default fly rig is 7 sources → 8 pathways → 7 views (the front camera is read twice, once mirrored).

Sources name the footage, the one setting almost every recording needs. Each filename is a glob matched inside the recording directory:

[[sources]]
name     = "vid_rh"
filename = "camera_0.mp4"   # a named file, used as-is
[[sources]]
name     = "vid_rm"
filename = "camera_1"       # a bare prefix -> "camera_1*": a video or an image sequence

A source's footage is one video file or a naturally-sorted image sequence (camera_1_000123.jpg ...). A directory is a valid recording only when every source matches footage with the same number of files and the same frame count. A source with no filename defaults to its own name.
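To illustrate what "naturally sorted" means for an image sequence, here is a minimal sketch. The helper natural_key is hypothetical and written for this guide; it is not deeperfly's own implementation, which may differ in detail.

```python
import re

def natural_key(filename):
    """Split a name into text and integer chunks so digit runs compare numerically."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", filename)]

# Unpadded frame numbers are where plain and natural sorting disagree.
frames = ["camera_1_10.jpg", "camera_1_2.jpg", "camera_1_1.jpg"]

plain = sorted(frames)                      # string order: _1, _10, _2
natural = sorted(frames, key=natural_key)   # numeric order: _1, _2, _10
print(plain)
print(natural)
```

Zero-padded names such as camera_1_000123.jpg sort identically either way; natural sorting only matters when the frame numbers are not padded.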

Preprocessors are named, reusable frame-op pipelines that a pathway references by name (full op grammar in the Preprocessor op grammar section below):

[[pose2d.preprocessors]]
name = "plain"
ops  = []
[[pose2d.preprocessors]]
name = "mirror"
ops  = [{ op = "fliplr" }]

Models select a detector network: class is the registry key ("hourglass" = DeepFly2D), weights is a checkpoint ("" or omitted uses the cached download), input_size is the (height, width) the network expects, mean is the scalar subtracted after pixel values are divided by 255, and n_out_channels is the number of output heatmaps.

[[pose2d.models]]
name = "deepfly2d"
class = "hourglass"
weights = ""
input_size = [256, 512]
mean = 0.22
n_out_channels = 19

Pathways are named source -> preprocessor -> model inference runs. A pathway only says what to detect on; each needs a unique name:

[[pose2d.pathways]]            # no preprocessor = identity
name   = "rh"
source = "vid_rh"
model  = "deepfly2d"
[[pose2d.pathways]]            # the front source, mirrored pass
name         = "f_flip"
source       = "vid_f"
preprocessor = "mirror"        # the preprocessor defined above
model        = "deepfly2d"

[pose2d.output_points.<view>] says where the outputs land: for each view, a table keyed by point name, where point = { pathway, out_channel } fills that point from output channel out_channel of the named pathway. Keying on (view, point) makes every point's data come from exactly one place (a duplicate is a config error). Any (view, point) that no entry names is left unobserved (NaN), so the set of entries is itself the visibility map; there is no separate visibility table.

[pose2d.output_points.rh]                 # right-side view: 19 channels of one pathway
rf_thorax_coxa = { pathway = "rh", out_channel = 0 }
# ... through ...
r_abdomen2     = { pathway = "rh", out_channel = 18 }

[pose2d.output_points.f]                  # one view fed by two pathways, disjoint points
rf_femur_tibia = { pathway = "f",      out_channel = 2 }   # right, un-flipped
lf_femur_tibia = { pathway = "f_flip", out_channel = 2 }   # left, mirrored

This modularity supports a range of setups: a single front model predicting both legs, per-view or per-side specialized models, or a different model per pathway.
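The relationship between pathways, output channels and visibility can be sketched in a few lines. This is an illustration only: the dictionaries mirror the TOML tables above, and check_and_collect is a hypothetical helper, not part of deeperfly.

```python
# Hypothetical in-memory form of two [pose2d.output_points.<view>] tables.
output_points = {
    "rh": {
        "rf_thorax_coxa": {"pathway": "rh", "out_channel": 0},
    },
    "f": {
        "rf_femur_tibia": {"pathway": "f", "out_channel": 2},
        "lf_femur_tibia": {"pathway": "f_flip", "out_channel": 2},
    },
}
pathways = {"rh", "f", "f_flip"}   # names declared in [[pose2d.pathways]]
n_out_channels = 19                # from the model entry

def check_and_collect(output_points, pathways, n_out_channels):
    """Validate each entry and return the set of observed (view, point) pairs."""
    visible = set()
    for view, points in output_points.items():
        for point, entry in points.items():
            if entry["pathway"] not in pathways:
                raise ValueError(f"{view}.{point}: unknown pathway {entry['pathway']!r}")
            if not 0 <= entry["out_channel"] < n_out_channels:
                raise ValueError(f"{view}.{point}: out_channel out of range")
            visible.add((view, point))
    return visible

visible = check_and_collect(output_points, pathways, n_out_channels)
# Any (view, point) pair missing from `visible` would stay NaN.
print(("f", "lf_femur_tibia") in visible, ("rh", "lf_femur_tibia") in visible)
```

Because the tables are keyed by (view, point), a duplicate assignment cannot even be expressed without repeating a key, which is why no separate visibility table is needed.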

Choose which stages run — [pipeline]

The pipeline is a linear sequence of stages, each an on/off do_<stage> switch:

[pipeline]
do_pose2d               = true   # detect 2D pose in every camera view
do_bundle_adjustment    = true   # refine the cameras (bundle adjustment)
do_pictorial_structures = false  # DeepFly3D-style peak recovery (opt-in)
do_triangulation        = true   # triangulate 2D -> 3D
do_visualization        = true   # render the videos

Each enabled stage has its own top-level [<stage>] parameter table (below). Pictorial structures is the opt-in stage most commonly flipped on.

Editing the config and re-running recomputes exactly the stages you changed (and the ones after them); the slow pose2d cache is reused untouched. That resume/recompute behavior — and --overwrite — is covered in the CLI guide.

Tune the opt-in stage — pictorial structures

This runs only when its do_pictorial_structures switch is on.

[pictorial_structures]   # peak recovery before triangulation
k        = 5       # candidate peaks per joint
temporal = false   # add a temporal-consistency term
lam      = 1.0     # bone-length prior weight

Candidate peaks are extracted during detection and cached in results.h5 when this stage is enabled. Enabling it on an existing output directory therefore re-runs pose2d once (announced loudly); after that, tweaking temporal / lam re-runs only the recovery from the cached candidates. Resuming with do_pose2d = false from a 2D result that stored no candidates skips the stage with a notice.

Output videos — [visualization]

Each [[visualization.videos]] is one output MP4, composited from an ordered list of panels; each panel draws one op (imshow, skeleton_2d, skeleton_3d) for one camera view at a pixel offset. Common edits:

[visualization]
background  = "black"
# output_fps = 30    # explicit output fps for every video
# speed      = 0.5   # or scale the input fps instead (0.5 = slow motion)

[visualization.kwargs]   # draw-op defaults shared by every video
imshow      = { width = 480, height = 240 }
skeleton_2d = { line_thickness = 2, width = 480, height = 240 }
skeleton_3d = { line_thickness = 2, width = 480, height = 240 }

The generated config ships two montage videos (pose2d, pose3d) wired to the 7-camera rig; reorder, drop, or add panels to change the layout. Draw-op kwargs merge across three levels (global → per-video → per-panel), most specific winning. Video frames are read and written with PyAV. See the configuration reference for the full panel and kwargs schema.
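The three-level kwargs merge described above behaves like layered dictionary updates. The sketch below assumes a shallow, per-op merge; the per-video and per-panel override values are invented for illustration, and merge_kwargs is a hypothetical helper.

```python
def merge_kwargs(global_kw, video_kw, panel_kw):
    """Later (more specific) levels override earlier ones, key by key."""
    return {**global_kw, **video_kw, **panel_kw}

# Global defaults for skeleton_2d, as in [visualization.kwargs] above.
global_kw = {"line_thickness": 2, "width": 480, "height": 240}
video_kw = {"width": 960}            # hypothetical per-video override
panel_kw = {"line_thickness": 4}     # hypothetical per-panel override

effective = merge_kwargs(global_kw, video_kw, panel_kw)
print(effective)
```

Keys that no level overrides (height here) fall through from the global defaults.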

Triangulation — [triangulation]

How the per-view 2D points become one 3D point:

[triangulation]
method              = "ransac"   # ransac (default, robust) | greedy | dlt
ransac_threshold    = 15.0       # inlier reprojection cutoff (px), method = ransac
min_inliers         = 2          # min agreeing views to accept a point (ransac)
# reproj_threshold  = 40.0       # method = greedy: per-view reprojection cutoff (px)
# max_drops         = 5          # method = greedy: max views dropped per point
weigh_by_confidence = false      # weight the DLT by detector confidence

ransac keeps the largest multi-view consensus; greedy drops the worst-reprojecting view; dlt is plain least-squares with no outlier handling (the pipeline explainer compares them).

weigh_by_confidence scales each view's contribution to the DLT by sqrt(confidence), so surer detections pull the 3D point harder (non-positive or non-finite confidences drop the view). For ransac it weights the candidate fits and the final refit but not the inlier vote, which stays a geometric reprojection test so a confidently-wrong detection cannot vote itself into the consensus.
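As a rough illustration of confidence weighting, the sketch below triangulates one point with a weighted linear least-squares DLT in pure Python. It assumes the inhomogeneous DLT form (solving for x, y, z directly) and 3x4 projection matrices; deeperfly's actual solver may be formulated differently. The names solve3 and triangulate are hypothetical.

```python
import math

def solve3(M, y):
    """Solve the 3x3 system M x = y by Gaussian elimination with partial pivoting."""
    A = [row[:] + [rhs] for row, rhs in zip(M, y)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = A[i][3] - sum(A[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / A[i][i]
    return x

def triangulate(views):
    """views: list of (P, (u, v), confidence), with P a 3x4 projection matrix."""
    M = [[0.0] * 3 for _ in range(3)]
    y = [0.0] * 3
    for P, (u, v), conf in views:
        if not math.isfinite(conf) or conf <= 0:
            continue  # non-positive or non-finite confidence drops the view
        w = math.sqrt(conf)  # each DLT row is scaled by sqrt(confidence)
        for obs, row in ((u, P[0]), (v, P[1])):
            r = [w * (obs * P[2][k] - row[k]) for k in range(4)]
            # Accumulate the normal equations for r[:3] . X = -r[3].
            for i in range(3):
                y[i] += r[i] * -r[3]
                for j in range(3):
                    M[i][j] += r[i] * r[j]
    return solve3(M, y)

# Two toy cameras one unit apart along x, both looking down +z.
P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
P2 = [[1, 0, 0, -1], [0, 1, 0, 0], [0, 0, 1, 0]]
views = [
    (P1, (0.25, 0.1), 0.9),
    (P2, (-0.25, 0.1), 0.4),
    (P1, (9.0, 9.0), 0.0),   # a wrong detection with zero confidence is ignored
]
point = triangulate(views)
print(point)
```

Scaling a row by sqrt(confidence) scales its squared residual by the confidence itself, which is why surer views pull the solution harder.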

Detector precision and memory — [pose2d]

[pose2d]
precision     = "bfloat16"  # the default: as fast as float16 under CUDA autocast
                            # (~1.5-2x over float32) with a wider range (no overflow).
                            # "float32" is the reference; ignored on CPU/MPS
batch_size    = 16          # GPU forward batch (images/forward); throughput plateaus
                            # by ~16 on a fast GPU
decode_buffer = 4           # decode queue depth, in multiples of batch_size

These are the [pose2d] table's performance knobs. What to detect (sources, models, pathways, including per-model weights) is the detection plan, which shares the same [pose2d] table (and the top-level [[sources]]) and is documented above. batch_size is the number of images per GPU forward pass. decode_buffer is a memory knob: the peak number of frames held per camera is roughly (decode_buffer + 2) * batch_size, so raise it to keep the GPU fed when decode is jittery and lower it to save memory. These knobs never invalidate a cache.
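The memory estimate above is easy to work through. The frame count follows the formula in this guide; the byte estimate additionally assumes 960x480 single-channel uint8 frames, which is an assumption made here for illustration and not something the config states. Both function names are hypothetical.

```python
def peak_frames_per_camera(batch_size, decode_buffer):
    """Approximate peak frames held per camera: (decode_buffer + 2) * batch_size."""
    return (decode_buffer + 2) * batch_size

def approx_peak_bytes(batch_size, decode_buffer, width, height, channels=1):
    """Rough per-camera byte estimate, assuming one uint8 per pixel channel."""
    return peak_frames_per_camera(batch_size, decode_buffer) * width * height * channels

frames = peak_frames_per_camera(batch_size=16, decode_buffer=4)
nbytes = approx_peak_bytes(16, 4, width=960, height=480)  # assumed frame size
print(frames, nbytes / 2**20)  # frames per camera, and MiB per camera
```

With the defaults (batch_size = 16, decode_buffer = 4) this gives 96 frames per camera; halving decode_buffer to 2 brings that down to 64.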

Frame I/O — [io]

Video files are read and written with PyAV (in-process FFmpeg, on the CPU); image sequences are decoded with OpenCV. The only knob is the image-decode thread count:

[io.image]
workers = 0   # decode threads (0 = one per CPU)

The reader/writer API is in the library guide.

Preprocessor op grammar — [[pose2d.preprocessors]] ops

A preprocessor is an ordered list of frame ops applied to a pathway's frames before the model, for example to feed the detector a mirrored, cropped, or rotated view. Steps run in the order written; flips and rotations do not commute, so the order you write is the order you get:

[[pose2d.preprocessors]]
name = "corrected"
ops  = [
    { op = "rot90", k = 1 },                                  # k CCW quarter-turns (any sign)
    { op = "fliplr" },                                         # left-right flip; also: flipud
    { op = "crop", x = 10, y = 10, width = 80, height = 80 },  # keep a window
    { op = "resize", scale = 0.5 },                            # or width/height; optional
]                                                              # interpolation = "bilinear"|"nearest"

A pathway's detections are mapped back into its view frame by inverting these ops (plus the model's resize to its input_size), so the points always land in the raw source frame the view's intrinsics describe. The flip is therefore a detector-input concern only — it never reflects the reconstructed 3D skeleton.
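The idea of mapping detections back through the inverted ops can be sketched for a subset of the grammar. This is a simplified illustration: it covers only fliplr, flipud, crop and resize by scale, ignores rot90 and the model's own resize, and treats resize as plain coordinate scaling (real pixel-center conventions may differ). apply_ops and invert_ops are hypothetical helpers.

```python
def apply_ops(point, ops, size):
    """Forward-map a raw-frame (x, y) through the ops; size is the raw (width, height)."""
    x, y = point
    w, h = size
    for op in ops:
        kind = op["op"]
        if kind == "fliplr":
            x = (w - 1) - x
        elif kind == "flipud":
            y = (h - 1) - y
        elif kind == "crop":
            x, y, w, h = x - op["x"], y - op["y"], op["width"], op["height"]
        elif kind == "resize":
            s = op["scale"]
            x, y, w, h = x * s, y * s, w * s, h * s
        else:
            raise ValueError(f"unsupported op in this sketch: {kind}")
    return x, y

def invert_ops(point, ops, size):
    """Map (x, y) in the preprocessed frame back into the raw frame."""
    # Forward pass over sizes only: record the frame size each op receives.
    sizes = []
    w, h = size
    for op in ops:
        sizes.append((w, h))
        if op["op"] == "crop":
            w, h = op["width"], op["height"]
        elif op["op"] == "resize":
            w, h = w * op["scale"], h * op["scale"]
    # Undo the ops in reverse order.
    x, y = point
    for op, (w, h) in zip(reversed(ops), reversed(sizes)):
        kind = op["op"]
        if kind == "fliplr":
            x = (w - 1) - x
        elif kind == "flipud":
            y = (h - 1) - y
        elif kind == "crop":
            x, y = x + op["x"], y + op["y"]
        elif kind == "resize":
            x, y = x / op["scale"], y / op["scale"]
        else:
            raise ValueError(f"unsupported op in this sketch: {kind}")
    return x, y

ops = [
    {"op": "fliplr"},
    {"op": "crop", "x": 10, "y": 20, "width": 60, "height": 50},
    {"op": "resize", "scale": 0.5},
]
raw_size = (100, 80)
detected = apply_ops((20, 30), ops, raw_size)   # where the detector would see the point
restored = invert_ops(detected, ops, raw_size)  # back in raw-frame coordinates
print(detected, restored)
```

The inverse needs the frame size each op saw, because a flip is defined relative to the width or height at that step, not the raw frame's.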

Bundle adjustment — [bundle_adjustment]

Bundle adjustment uses the fly itself as the target, solved with scipy.optimize.least_squares — its kwargs (max_nfev, loss, ...) sit directly in the table. The defaults suit the standard rig; you rarely need to change them.

[bundle_adjustment]
points_to_use       = [ "..." ]   # skeleton point names that drive bundle adjustment (default: the 30 leg points)
fixed               = ["*.intr", "f.rvec", "f.tvec", "rm.tvec[2]"]   # held constant; fixes the world gauge
shared              = []          # e.g. [["lf.tvec[2]", "rf.tvec[2]"]] to tie cameras' z distances
weigh_by_confidence = false       # scale each reprojection residual by sqrt(confidence)
max_frames          = 200         # bundle-adjust on at most this many frames (subsampled)
frame_sampling      = "even"      # even | confidence | coverage | diversity
max_nfev            = 2000        # forwarded to scipy.optimize.least_squares
loss                = "linear"

The fixed / shared grammar ("*.intr", "f.rvec", "rm.tvec[2]", tying [["lf.tvec[2]", "rf.tvec[2]"]]) anchors the world gauge and ties parameters between cameras; the reference gives the full grammar and the frame_sampling strategies. See the library guide for calling the bundle adjuster directly.
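For intuition about max_frames with frame_sampling = "even", the sketch below shows one plausible reading: pick at most max_frames indices spread evenly across the recording. This is an assumption made for illustration; the reference describes what each strategy actually does, and even_subsample is a hypothetical helper.

```python
def even_subsample(n_frames, max_frames):
    """Return at most max_frames frame indices spread evenly over range(n_frames)."""
    if n_frames <= max_frames:
        return list(range(n_frames))
    if max_frames == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids rounding ties.
    return [(i * (n_frames - 1)) // (max_frames - 1) for i in range(max_frames)]

indices = even_subsample(n_frames=1000, max_frames=200)
print(len(indices), indices[:3], indices[-1])
```

A short recording (fewer frames than max_frames) is used whole under this reading.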

Camera rig geometry — [cameras.defaults] and [cameras.*]

A [cameras.<name>] is a geometric view that a pathway maps its points back into — pure geometry (intrinsics + orbit extrinsics), no footage or preprocessing. The cameras orbit an object near the world origin; [cameras.defaults] is merged into every view, and each [cameras.<name>] overrides it (the default rig sets just azimuth_deg per view). A view's intrinsics describe the raw frame of the source feeding it. The shipped values describe the standard DeepFly3D 7-camera rig — leave them unless your rig differs.

[cameras.defaults]
focal_length_px = [22388.125, 22388.125]
distance        = 107.463
elevation_deg   = 0.0
# principal_point_px = [479.5, 239.5]   # omit to use each view's image center

[cameras.f]
azimuth_deg = 0

The orbit parameters (look_at, distance, azimuth_deg, elevation_deg, roll_deg) and intrinsics are detailed in the reference.
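To make the intrinsics concrete, here is a minimal pinhole sketch built from the defaults above. It assumes a 960x480 frame and an image center of ((width - 1) / 2, (height - 1) / 2), which is consistent with the commented principal_point_px value but is an inference rather than something this guide states. The functions intrinsics and project are hypothetical.

```python
def intrinsics(focal_length_px, image_size, principal_point_px=None):
    """Build a 3x3 pinhole matrix K; the principal point defaults to the image center."""
    fx, fy = focal_length_px
    width, height = image_size
    if principal_point_px is None:
        cx, cy = (width - 1) / 2, (height - 1) / 2   # assumed center convention
    else:
        cx, cy = principal_point_px
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]

def project(K, point_cam):
    """Project a point given in camera coordinates (z forward) to pixel coordinates."""
    X, Y, Z = point_cam
    return (K[0][0] * X / Z + K[0][2], K[1][1] * Y / Z + K[1][2])

K = intrinsics([22388.125, 22388.125], image_size=(960, 480))
center_px = project(K, (0.0, 0.0, 107.463))   # a point on the optical axis
print(K[0][2], K[1][2], center_px)
```

A point on the optical axis lands on the principal point regardless of distance, which is a quick sanity check when entering intrinsics for a different rig.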

Skeleton — [skeleton]

The tracked points and their structure (38-point, 7-camera Drosophila rig): point_names, limb_points kinematic chains (each a list of point names), and the plotting limb_palette. Which view sees which point is not set here — it is the union of the [pose2d.output_points] tables. Edit this only to track a different animal — see the reference and the pipeline explainer.