Conventions & glossary¶
The conventions every part of deeperfly shares — array layouts, the missing-data encoding, and the coordinate frames — plus a glossary of the terms used across the docs and the config.
Array layouts¶
Arrays are view-leading: the camera/view axis comes first.
| Array | Shape | Meaning |
|---|---|---|
pts2d |
(V, T, P, 2) |
2D keypoints: per view, per frame, per point, (x, y) in raw-frame pixels. |
conf |
(V, T, P) |
Detector confidence for each 2D observation. |
pts3d |
(T, P, 3) |
3D keypoints: per frame, per point, (x, y, z) in world units. |
reproj_error |
(V, T, P) |
Per-view reprojection error of the 3D point, in pixels. |
The axes are referred to throughout by these letters:
V— camera views (7 in the default rig).T— frames (time).P— skeleton points / keypoints (38 in the default skeleton).
Single-image helpers (e.g. CameraGroup.project) drop the T axis and use
(V, N, 2) / (N, 3), where N is the number of points.
NaN means missing¶
There is no separate visibility mask. A keypoint that a view does not observe is
stored as NaN, and the same convention carries through:
- The detector's
[pose2d.output_points]scatter leaves an unfilled(view, point)asNaN— the union of the per-view tables is the visibility. - Triangulation ignores
NaNviews and returnsNaNfor a point seen by fewer thanmin_inliersviews. - The float64 HDF5 datasets preserve
NaN, so it round-trips throughresults.h5.
When you read pts3d, treat NaN as "not reconstructed for this frame/point".
Use np.nanmedian / np.nanmax and friends, as deeperfly inspect does.
Coordinate frames¶
- Pixels are in the raw source frame that a view's intrinsics describe. Any per-pathway preprocessing (flip, crop, resize) is inverted before the points are stored, so a mirror fed to the detector never moves the stored 2D or the reconstructed 3D.
- World units are whatever the rig's
distance/ intrinsics imply (the default rig is metric-like but unitless). World up is+z. - Cameras use the orbit (look-at) parameterization in the config:
look_at,distance,azimuth_deg,elevation_deg,roll_deg. Internally a camera is the usualrvec(Rodrigues rotation),tvec, intrinsics[fx, fy, cx, cy], and OpenCV-ordered distortion coefficients.
Numerics¶
The geometry core — projection, triangulation, and the bundle-adjustment
residual and Jacobian — is JAX in float64 on the CPU; the arrays are tiny, so
a GPU never helps. The 2D detector is PyTorch and uses the GPU (CUDA or
Metal/MPS) automatically. Detector forward precision is configurable
([pose2d].precision), but everything geometric stays float64.
Confidence¶
conf is the detector's heatmap-peak confidence for each 2D observation.
weigh_by_confidence (in [bundle_adjustment] and [triangulation]) scales each
observation's least-squares contribution by sqrt(confidence), so surer
detections pull harder; non-positive or non-finite confidences drop the
observation. For RANSAC the weighting affects the candidate fits and the final
refit but not the inlier vote, which stays a pure geometric reprojection test.
Glossary¶
Source — a named footage glob ([[sources]]), decoded once. Decoupled from
cameras and pathways, which reference it by name, so one source can feed several
pathways.
Pathway — one source → preprocessor → model inference run
([[pose2d.pathways]]). It says what to detect on; where its outputs land is in
[pose2d.output_points].
Preprocessor — a named, reusable list of frame ops (flip/crop/rotate/resize)
applied to a pathway's frames before the model ([[pose2d.preprocessors]]).
Model — a detector network plus its weights and input contract
([[pose2d.models]]); class = "hourglass" is the DeepFly2D stacked hourglass.
Detection plan — the parsed whole of [[sources]] + the [pose2d]
sub-tables: the mapping of footage through pathways into the skeleton's per-view
2D points.
View / camera — a geometric camera in the rig ([cameras.<name>]): pure
intrinsics + extrinsics. A pathway maps its 2D points back into a view's raw
frame.
Rig / CameraGroup — the set of named cameras as one object.
Skeleton — the tracked points and their structure ([skeleton]):
point_names, the limb_points kinematic chains, and the plotting palette.
Limb — a named chain of points (e.g. a 5-joint leg) used for the bone-length prior and for drawing.
Candidates — the detector's top-k heatmap peaks per joint, cached by
pose2d when pictorial_structures is enabled; the input the peak-recovery stage
reconsiders.
Stage — one step of the linear pipeline (pose2d, bundle_adjustment,
pictorial_structures, triangulation, visualization), toggled by
[pipeline].do_<stage> and configured by its [<stage>] table.
Fingerprint — the result-affecting config subset recorded per stage in
run.json; a stage's cache is reused only while its fingerprint still matches
(see caching).