
Keypoint Detection

Detecting specific points of interest such as joints, landmarks, and structural features in images

Keypoint detection is a specialized computer vision task that identifies and localizes specific points of interest in images. Unlike object detection which uses bounding boxes, keypoint detection pinpoints exact pixel locations of meaningful features—most commonly applied to human pose estimation where it detects body joints to understand posture and movement.

📚 Training Keypoint Detection Models

Looking to train keypoint detection models? Check out our comprehensive Keypoint Detection Training Guide with detailed parameter documentation for all available models and training techniques.

What is Keypoint Detection?

Keypoint detection identifies the precise locations of semantically meaningful points in an image. For each detected keypoint, the model outputs:

  • Coordinates: (x, y) pixel location in the image
  • Visibility flag: Whether the keypoint is visible, occluded, or not present
  • Confidence score: The model's certainty in the detection (0-1)

The primary application is human pose estimation, which detects body joints (shoulders, elbows, wrists, hips, knees, ankles, etc.) to understand human posture, movement, and activities.

Key differences from other tasks:

  • vs. Object Detection: Keypoints are precise points, not bounding boxes; represent structural relationships
  • vs. Image Segmentation: Sparse point locations, not dense pixel-level masks
  • vs. Landmark Detection: More general; includes body pose, not just facial features

Examples:

  • Sports video → detect athlete joints to analyze form and technique
  • Fitness app → track exercise movements and count repetitions
  • Animation → capture human motion for character animation
  • Medical → assess patient mobility and rehabilitation progress

Key Concepts

Keypoints

Specific points of interest with semantic meaning:

Human pose keypoints (most common):

  • COCO format (17 keypoints): Nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles
  • Body25 format (25 keypoints): COCO + neck, mid-hip, and six foot keypoints
  • Face keypoints (68-106 points): Facial landmarks for expression and alignment
  • Hand keypoints (21 points per hand): Finger joints for gesture recognition

Other applications:

  • Animal pose: Similar to human, adapted for different species
  • Object keypoints: Corners, handles, specific parts of objects
  • Document keypoints: Corners for perspective correction
  • Anatomical landmarks: Medical imaging feature points

Visibility states:

  • Visible (v=2): Keypoint is clearly visible
  • Occluded (v=1): Present but hidden by another object
  • Not present (v=0): Not in the image (e.g., person turned away)

Skeleton Structure

Keypoints are connected in meaningful ways to form a skeleton:

Purpose:

  • Encodes anatomical relationships (which joints connect)
  • Enables pose reasoning (arm angles, leg positions)
  • Helps with consistency (connected keypoints should be near each other)

Example (COCO skeleton):

Connections:
- Nose → Left Eye, Right Eye
- Left Shoulder → Left Elbow → Left Wrist
- Right Shoulder → Right Elbow → Right Wrist
- Left Hip → Left Knee → Left Ankle
- Right Hip → Right Knee → Right Ankle
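The connection list above can be written down as index pairs over the COCO keypoint ordering (a minimal sketch; the names and indices follow the standard COCO convention):

```python
# COCO 17-keypoint ordering (indices follow the standard COCO convention)
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Skeleton as (parent, child) index pairs matching the connections above
COCO_SKELETON = [
    (0, 1), (0, 2),        # nose -> eyes
    (5, 7), (7, 9),        # left shoulder -> elbow -> wrist
    (6, 8), (8, 10),       # right shoulder -> elbow -> wrist
    (11, 13), (13, 15),    # left hip -> knee -> ankle
    (12, 14), (14, 16),    # right hip -> knee -> ankle
]

def limb_names(skeleton, names):
    """Map skeleton index pairs to readable limb names."""
    return [(names[a], names[b]) for a, b in skeleton]
```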

Use in models:

  • Structural losses enforce skeleton constraints
  • Graph neural networks leverage connectivity
  • Post-processing validates anatomical plausibility

Detection Paradigms

Two fundamentally different approaches to multi-person keypoint detection:

Top-Down approach:

  1. First, detect all people in the image (using object detector)
  2. Then, detect keypoints for each person separately
  3. Each person gets independent keypoint predictions
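The three steps can be sketched as a generic loop (the detector, pose model, and crop function here are hypothetical stand-ins for caller-supplied components, not a specific library API):

```python
def top_down_pose(image, detect_people, estimate_pose, crop):
    """Top-down pipeline sketch: detect people, then estimate pose per person.

    Runtime scales linearly with the number of detected people, which is
    the main cost of this paradigm.
    """
    results = []
    for box in detect_people(image):       # step 1: person bounding boxes
        person = crop(image, box)          # isolate one person
        keypoints = estimate_pose(person)  # step 2: per-person keypoints
        results.append((box, keypoints))   # step 3: independent predictions
    return results
```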

Advantages:

  • Higher accuracy per person
  • Well-isolated keypoint predictions
  • Easier to train and understand
  • Benefits from advances in object detection

Disadvantages:

  • Slower: runtime scales with number of people
  • Requires good person detector
  • Redundant computation for crowded scenes

Bottom-Up approach:

  1. Detect all keypoints in the image simultaneously
  2. Then, group keypoints that belong to the same person
  3. Single forward pass regardless of number of people

Advantages:

  • Faster: constant time regardless of crowd size
  • Efficient for crowded scenes
  • No dependency on person detector

Disadvantages:

  • More complex grouping problem
  • Can struggle with overlapping people
  • Generally lower accuracy than top-down

Heatmaps vs. Direct Regression

Two main methods for predicting keypoint locations:

Heatmap-based detection (most common):

  • Output: Spatial heatmap for each keypoint type (e.g., 17 heatmaps for COCO)
  • Heatmap values: Probability/confidence at each pixel location
  • Peak detection: Maximum value indicates keypoint location
  • Gaussian encoding: Ground-truth keypoints → Gaussian blobs

Formula for the ground-truth heatmap at location $(x, y)$:

$$H_{x,y} = \exp\left(-\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma^2}\right)$$

where $(x_k, y_k)$ is the keypoint location and $\sigma$ controls the spread.
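A minimal NumPy rendering of this formula, for generating training targets:

```python
import numpy as np

def gaussian_heatmap(height, width, xk, yk, sigma=2.0):
    """Render a ground-truth heatmap: a Gaussian blob centred on (xk, yk)."""
    xs = np.arange(width)[None, :]   # shape (1, W)
    ys = np.arange(height)[:, None]  # shape (H, 1)
    d2 = (xs - xk) ** 2 + (ys - yk) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

The maximum of the returned map sits at the keypoint pixel, with value 1.0 falling off according to `sigma`.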

Advantages:

  • Spatially aware: convolutional structure natural for this representation
  • Implicit uncertainty: heatmap spread indicates confidence
  • Sub-pixel accuracy: interpolate around peak
  • Handles ambiguity: multiple peaks for uncertain cases
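One common sub-pixel heuristic (used in several heatmap-based codebases) is to shift the integer peak a quarter pixel toward the larger neighbour; a sketch:

```python
import numpy as np

def refine_peak(heatmap):
    """Locate the heatmap maximum, then nudge it a quarter pixel toward
    the larger neighbour in each axis (a common post-processing heuristic)."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    fx, fy = float(x), float(y)
    if 0 < x < heatmap.shape[1] - 1:
        fx += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < heatmap.shape[0] - 1:
        fy += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return float(fx), float(fy), float(heatmap[y, x])  # sub-pixel (x, y), confidence
```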

Direct regression:

  • Output: Direct (x, y) coordinates
  • Loss: L1 or L2 distance to ground truth
  • Simpler: Fewer parameters, easier to understand

Advantages:

  • Compact output
  • No heatmap resolution limitations
  • Faster inference (no peak detection)

Disadvantages:

  • Harder to train: regression more difficult than classification
  • No spatial structure
  • Single point prediction (no uncertainty)

Part Affinity Fields (PAFs)

Used in bottom-up methods for grouping keypoints:

Concept: For each limb (connection between keypoints), predict a 2D vector field that points along the limb direction at each pixel.

Example: For the connection between left shoulder and left elbow:

  • Vector field points from shoulder toward elbow
  • Strong vectors exist along the actual limb
  • Weak or zero vectors elsewhere

Grouping process:

  1. Detect all keypoints (regardless of which person)
  2. For each pair of keypoint candidates, integrate PAF along the line between them
  3. Strong line integral → likely same person
  4. Weak integral → different people
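The line-integral step can be sketched as follows (assuming a dense PAF array with one unit 2D vector per pixel):

```python
import numpy as np

def paf_score(paf, p1, p2, num_samples=10):
    """Integrate a part affinity field along the segment p1 -> p2.

    `paf` has shape (H, W, 2): a 2D vector per pixel. A high score means
    the field points along the candidate limb, i.e. the two keypoint
    candidates likely belong to the same person.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    norm = np.linalg.norm(direction)
    if norm == 0:
        return 0.0
    unit = direction / norm
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        x, y = (p1 + t * direction).round().astype(int)
        score += float(np.dot(paf[y, x], unit))  # alignment with the limb axis
    return score / num_samples
```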

Advantages:

  • Handles overlapping people better than simple distance
  • Encodes both location and orientation
  • Robust to partial occlusion

Approaches and Architectures

Top-Down Methods

Detect people first, then estimate pose for each person:

Convolutional Pose Machines (CPM, the single-person predecessor of OpenPose):

  • CMU's influential early work
  • Used top-down for multi-person scenes: detect each person, then estimate pose per crop
  • VGG-style backbone
  • Multi-stage refinement for keypoints

ViTPose (Vision Transformer for Pose):

  • Transformer-based backbone (ViT)
  • Simple deconvolution decoder
  • Excellent performance with large pretrained models
  • Scales well with model size
  • State-of-the-art on multiple benchmarks

HRNet (High-Resolution Net):

  • Maintains high-resolution representations throughout
  • Multi-scale parallel branches with repeated fusion
  • Excellent for precise localization
  • Strong baseline for top-down pose estimation
  • Widely used in practice

SimpleBaseline:

  • ResNet backbone
  • Simple deconvolution layers for upsampling
  • Surprisingly effective despite simplicity
  • Good starting point for custom applications

Key components:

  • Backbone: Extracts features (ResNet, HRNet, ViT)
  • Neck: Upsamples and refines features
  • Head: Predicts keypoint heatmaps
  • Person detector: Often Faster R-CNN or YOLO

Bottom-Up Methods

Detect all keypoints first, then group by person:

OpenPose (original):

  • Pioneering bottom-up approach
  • Part Affinity Fields (PAFs) for grouping
  • Multi-stage architecture with intermediate supervision
  • Real-time performance
  • Handles variable number of people efficiently

PersonLab:

  • Hough voting for keypoint detection
  • Short-range and mid-range offsets for grouping
  • ResNet backbone
  • Single-stage detection and grouping

HigherHRNet:

  • Based on HRNet
  • Multi-resolution supervision
  • Associative embedding for grouping
  • Scale-aware keypoint detection
  • Strong performance on crowded scenes

Key components:

  • Keypoint detection: Heatmaps for all keypoints in image
  • Grouping mechanism: PAFs, embeddings, or offset fields
  • Association: Algorithm to assign keypoints to people

3D Pose Estimation

Extending 2D keypoints to 3D space:

Approaches:

  • 2D → 3D lifting: First detect 2D keypoints, then lift to 3D
  • Direct 3D prediction: Directly predict 3D coordinates from image
  • Multi-view fusion: Use multiple camera views for triangulation

Output formats:

  • Camera coordinates: (X, Y, Z) relative to camera
  • Root-relative: Normalized around pelvis or similar root joint
  • Absolute positions: World coordinates (requires calibration)

Challenges:

  • Depth ambiguity: 2D image doesn't directly provide depth
  • Scale ambiguity: Distance to person affects apparent size
  • Occlusion: More severe in 3D reasoning
  • Limited training data: 3D ground truth harder to obtain

Applications:

  • Sports analysis and biomechanics
  • VR/AR avatar animation
  • Medical gait analysis
  • 3D human reconstruction

Transformer-Based Approaches

Modern attention mechanisms for pose estimation:

ViTPose family:

  • Vision Transformer (ViT) backbone
  • Leverages large-scale pretrained models
  • Simple architecture, strong performance
  • Scales with model capacity

PRTR (Pose Recognition Transformer):

  • End-to-end transformer for pose
  • Keypoint queries with cross-attention
  • No heatmap intermediate representation
  • Direct coordinate prediction

TokenPose:

  • Token-based representation of keypoints
  • Transformer refines keypoint tokens
  • Combines heatmap and token approaches

Advantages of transformers:

  • Global context understanding
  • Long-range dependencies (useful for full-body reasoning)
  • Benefits from large-scale pretraining
  • Strong performance on benchmarks

Disadvantages:

  • Higher computational cost
  • More data hungry
  • Less interpretable than heatmap-based methods

Evaluation Metrics

Object Keypoint Similarity (OKS)

The primary metric for keypoint detection, analogous to IoU for object detection:

Formula:

$$\text{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2s^2k_i^2}\right) \cdot \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

where:

  • $d_i$ = Euclidean distance between predicted and ground-truth keypoint $i$
  • $s$ = square root of the object area (person bounding-box area)
  • $k_i$ = per-keypoint constant that controls falloff (larger for harder keypoints)
  • $v_i$ = visibility flag for keypoint $i$
  • $\delta(\cdot)$ = indicator function
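A simplified NumPy implementation of the formula (the per-keypoint sigmas below are the commonly quoted COCO values, included here as an assumption; the official evaluation code adds further details):

```python
import numpy as np

# Per-keypoint sigmas commonly used by COCO keypoint evaluation (k_i = 2 * sigma_i),
# in the standard 17-keypoint order.
COCO_SIGMAS = np.array([
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
])

def oks(pred, gt, vis, area, sigmas=COCO_SIGMAS):
    """Object Keypoint Similarity between one prediction and one ground truth.

    pred, gt: (17, 2) arrays of (x, y); vis: (17,) visibility flags;
    area: ground-truth person area in pixels (so s^2 = area).
    """
    k = 2.0 * sigmas
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = np.exp(-d2 / (2.0 * float(area) * k ** 2))
    mask = vis > 0
    if not mask.any():
        return 0.0
    return float(e[mask].mean())
```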

Interpretation:

  • OKS = 1.0: Perfect prediction (all keypoints exactly correct)
  • OKS = 0.0: No correct keypoints
  • Scale normalization: Distances are normalized by the person scale $s$, so accuracy is relative to person size
  • Visibility aware: Only evaluates labeled keypoints ($v_i > 0$), whether visible or occluded
  • Keypoint weights: $k_i$ values account for keypoint difficulty (e.g., eyes easier than elbows)

COCO keypoint constants ($k_i$):

  • Facial keypoints (eyes, ears): smaller $k_i$ (easier, require tighter accuracy)
  • Torso keypoints (shoulders, hips): medium $k_i$
  • Limb endpoints (wrists, ankles): larger $k_i$ (harder, more tolerance)

Average Precision (AP) at OKS Thresholds

Similar to object detection mAP, but using OKS instead of IoU:

Common variants:

  • AP^50 (AP@OKS=0.5): Lenient threshold, easier to achieve
  • AP^75 (AP@OKS=0.75): Stricter threshold, requires accurate keypoints
  • AP (average over 0.5:0.05:0.95): Primary COCO metric, most comprehensive

Calculation:

  1. For each person, compute OKS between prediction and ground truth
  2. Match predictions to ground truth using OKS threshold
  3. Compute precision-recall curve
  4. Calculate area under curve (AP)
  5. Report at different OKS thresholds
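Step 2, the greedy matching, can be sketched as follows (predictions assumed pre-sorted by confidence; this simplifies the full COCO protocol):

```python
def match_by_oks(oks_matrix, threshold=0.5):
    """Greedily match predictions (rows, sorted by descending score) to
    ground truths (columns) at one OKS threshold.

    Returns one True (true positive) / False (false positive) per prediction;
    each ground truth can be matched at most once.
    """
    matched_gt, tp = set(), []
    for row in oks_matrix:
        best, best_oks = None, threshold
        for g, val in enumerate(row):
            if g not in matched_gt and val >= best_oks:
                best, best_oks = g, val
        if best is None:
            tp.append(False)
        else:
            matched_gt.add(best)
            tp.append(True)
    return tp
```

The resulting true/false-positive flags feed directly into the precision-recall curve of steps 3-4.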

Additional COCO metrics:

  • AP^M (medium persons): 32² < area < 96² pixels
  • AP^L (large persons): area > 96² pixels
  • AR (Average Recall): Maximum recall given detections per image

Percentage of Correct Keypoints (PCK)

Simpler metric: percentage of keypoints within a threshold distance:

Variants:

  • PCK@0.2: Distance < 20% of torso diameter
  • PCKh@0.5: Distance < 50% of head size (head-normalized)
  • PCK@α: Distance < α × scale factor

Formula:

$$\text{PCK@}\alpha = \frac{1}{N} \sum_{i=1}^{N} \delta(d_i \leq \alpha \cdot s)$$

where:

  • $N$ = total number of keypoints
  • $d_i$ = distance between predicted and ground-truth keypoint $i$
  • $s$ = scale factor (torso diameter, head size, or bounding-box size)
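A direct NumPy translation of the formula:

```python
import numpy as np

def pck(pred, gt, scale, alpha=0.2):
    """Fraction of keypoints whose prediction lies within alpha * scale
    of the ground truth (PCK@alpha)."""
    d = np.linalg.norm(pred - gt, axis=1)  # per-keypoint distances d_i
    return float(np.mean(d <= alpha * scale))
```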

Interpretation:

  • Higher is better (1.0 = 100% of keypoints correct)
  • Threshold-based: Binary correct/incorrect decision
  • Simpler than OKS: Single value, easier to understand
  • Less common now: OKS-based metrics preferred for standardized evaluation

3D Pose Metrics

For 3D keypoint detection:

MPJPE (Mean Per-Joint Position Error):

$$\text{MPJPE} = \frac{1}{N} \sum_{i=1}^{N} \|p_i - \hat{p}_i\|_2$$

  • Average Euclidean distance between predicted and ground-truth 3D keypoints
  • Measured in millimeters or centimeters
  • Lower is better

PA-MPJPE (Procrustes-Aligned MPJPE):

  • First align predicted pose to ground truth using Procrustes analysis
  • Removes global rotation and translation differences
  • Focuses on pose structure, not absolute position
  • More robust metric for many applications
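Both metrics can be sketched in a few lines of NumPy; the Procrustes step here is the standard Kabsch alignment (rotation and translation only, omitting the optional scale term):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average 3D Euclidean distance."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1)))

def pa_mpjpe(pred, gt):
    """MPJPE after rigid Procrustes (Kabsch) alignment of pred onto gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g          # centre both poses
    u, _, vt = np.linalg.svd(p.T @ g)      # SVD of the covariance
    d = np.eye(3)
    d[2, 2] = np.sign(np.linalg.det(vt.T @ u.T))  # avoid reflections
    r = vt.T @ d @ u.T                     # optimal rotation p -> g
    aligned = p @ r.T + mu_g
    return mpjpe(aligned, gt)
```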

PCK3D:

  • 3D variant of PCK
  • Percentage of keypoints within threshold distance in 3D space

Data Requirements

Annotation Format

Keypoint annotations typically follow COCO format:

{
  "annotations": [
    {
      "image_id": 1,
      "category_id": 1,
      "keypoints": [x1, y1, v1, x2, y2, v2, ..., x17, y17, v17],
      "bbox": [x, y, width, height],
      "area": width * height,
      "id": 1,
      "num_keypoints": 15
    }
  ]
}

Keypoint format:

  • Flat array: [x₁, y₁, v₁, x₂, y₂, v₂, ...]
  • v (visibility):
    • 0 = not labeled (not visible, not annotated)
    • 1 = labeled but not visible (occluded)
    • 2 = labeled and visible
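Parsing the flat array back into triplets is straightforward; a sketch:

```python
def parse_keypoints(flat):
    """Split COCO's flat [x1, y1, v1, x2, y2, v2, ...] list into
    (x, y, visibility) triplets."""
    assert len(flat) % 3 == 0
    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]

def num_labeled(flat):
    """COCO's num_keypoints field counts keypoints with v > 0."""
    return sum(1 for _, _, v in parse_keypoints(flat) if v > 0)
```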

Dataset Size

Keypoint detection is challenging and data-hungry:

Recommended sizes:

  • Fine-tuning pretrained model: 500-2000 annotated person instances
  • Training from scratch: 10,000+ person instances
  • High accuracy: 50,000+ instances (like COCO Keypoints)
  • Novel keypoint types: More data needed for new skeleton structures

Person instance = one person in one image with keypoint annotations.

Data Diversity

Critical for robust keypoint detection:

Pose diversity:

  • Standing, sitting, lying, crouching
  • Various arm and leg positions
  • Natural and extreme poses
  • Action-specific poses (sports, dance, yoga)

Viewpoint diversity:

  • Front, side, back, diagonal views
  • Various camera angles (low, high, eye-level)
  • Close-up and distant views
  • Multiple people with different orientations

Occlusion scenarios:

  • Self-occlusion (arm behind body)
  • Object occlusion (person behind furniture)
  • Person-person occlusion (crowds)
  • Partial visibility (only upper body in frame)

Environmental diversity:

  • Indoor and outdoor scenes
  • Various lighting conditions
  • Different backgrounds and clutter levels
  • Multiple clothing styles and body types

Annotation Quality

High-quality keypoint annotations are crucial:

Accuracy requirements:

  • Pixel-level precision: Keypoints should be exact joint locations
  • Consistency: Same rules across all annotations
  • Visibility flags: Correctly mark visible, occluded, and not present
  • Complete skeletons: Annotate all visible keypoints, not just easy ones

Common issues:

  • Inconsistent joint definitions (e.g., shoulder location)
  • Missing occluded keypoints
  • Incorrect visibility flags
  • Annotation drift over time (multiple annotators)

Best practices:

  • Clear annotation guidelines with examples
  • Multiple annotator review for quality
  • Validation checks for anatomical plausibility
  • Regular calibration sessions for annotators

Common Challenges

Occlusion and Self-Occlusion

Keypoints hidden by objects or by the person's own body:

Problems:

  • Partial visibility of joints
  • Ambiguous keypoint locations
  • Missing visual information
  • Difficult to distinguish between "not present" and "occluded"

Solutions:

  • Train on heavily occluded examples
  • Use context from visible keypoints (skeleton constraints)
  • Multi-stage refinement to infer occluded keypoints
  • Attention mechanisms to focus on visible parts
  • Temporal consistency in video (use previous frames)

Evaluation considerations:

  • OKS only evaluates visible keypoints
  • Report per-keypoint accuracy to identify weak points
  • Separate metrics for different occlusion levels

Overlapping People

Multiple people in close proximity:

Problems:

  • Keypoint assignment ambiguity
  • Similar appearances confuse the model
  • Difficult to separate overlapping limbs
  • Crowded scenes with many people

Solutions (Top-down):

  • Accurate person detection to separate individuals
  • Large enough person boxes to include full pose
  • Process each person independently

Solutions (Bottom-up):

  • Strong grouping mechanisms (PAFs, embeddings)
  • Context from skeleton structure
  • Handle overlapping predictions explicitly
  • Multi-person parsing networks

Trade-offs:

  • Top-down: Better accuracy but slower in crowds
  • Bottom-up: Faster but more grouping errors

Small Persons

People occupying few pixels in the image:

Problems:

  • Low resolution keypoint locations
  • Fewer visual features per keypoint
  • Heatmap peaks less distinct
  • Higher relative error impact

Solutions:

  • Multi-scale feature extraction
  • Higher input resolution
  • Specialized small-person detectors (for top-down)
  • Attention to high-resolution features
  • Training with small-person oversampling

Evaluation:

  • COCO provides separate metrics for medium and large persons
  • Small persons (area < 32²) often excluded from standard evaluation
  • Report performance by person scale

Extreme Poses

Unusual or difficult body configurations:

Problems:

  • Rare in training data (long-tail distribution)
  • Unusual limb angles and relationships
  • Self-occlusion more common
  • May violate typical pose priors

Examples:

  • Gymnastics, yoga, martial arts
  • Crouching or crawling
  • Reaching or stretching
  • Dancing or athletic movements

Solutions:

  • Augment data with extreme pose examples
  • Relaxed skeleton constraints to allow flexibility
  • Domain-specific training for applications (e.g., sports)
  • Synthetic data generation with varied poses
  • Test-time augmentation for robustness

Computational Efficiency for Real-Time

Speed requirements for interactive applications:

Real-time targets:

  • Video analysis: 30 FPS
  • Live fitness apps: 30+ FPS
  • AR applications: 60+ FPS
  • Offline analysis: Speed less critical

Optimization strategies:

  • Model selection: Lightweight architectures (MobileNet backbone)
  • Input resolution: Lower resolution with accuracy trade-off
  • Top-down vs. bottom-up: Bottom-up faster for crowds
  • Model compression: Quantization, pruning, distillation
  • Hardware acceleration: TensorRT, ONNX Runtime, CoreML
  • Pipeline optimization: Efficient preprocessing and postprocessing

Trade-offs:

  • Smaller models: Faster but lower accuracy
  • Lower resolution: Faster but less precise keypoints
  • Single-scale inference: Faster but struggles with scale variation

2D vs. 3D Ambiguity

Depth information loss from 3D world to 2D image:

Problems:

  • Same 2D pose can correspond to different 3D poses
  • Forward-facing vs. backward-facing ambiguity
  • Limb orientation ambiguity (arm extended forward vs. sideways)
  • Scale and distance confusion

Approaches:

  • Accept 2D limitation for applications where sufficient
  • Use multi-view systems for 3D reconstruction
  • Learn 3D priors from data (2D→3D lifting networks)
  • Temporal consistency in video for depth reasoning
  • Additional sensors (depth cameras, IMUs)

Applications requiring 3D:

  • Biomechanics and sports science
  • Medical rehabilitation assessment
  • 3D animation and motion capture
  • VR/AR avatar control

Practical Applications

Sports Analytics and Performance Tracking

Analyze athlete movements and technique:

Use cases:

  • Form analysis: Compare athlete pose to ideal technique
  • Injury prevention: Detect risky movement patterns
  • Performance metrics: Measure joint angles, stride length, etc.
  • Tactical analysis: Track player positions and movements
  • Automated highlight generation: Detect key moments (goals, serves)
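Joint angles such as elbow or knee flexion can be computed directly from three detected keypoints; a minimal sketch:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b in degrees, e.g. shoulder-elbow-wrist for the elbow.

    Works for 2D or 3D keypoints; a, b, c are coordinate tuples or arrays.
    """
    a, b, c = (np.asarray(p, float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))
```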

Requirements:

  • High accuracy for precise measurements
  • Real-time or near-real-time for live feedback
  • Robust to fast motion and occlusion
  • Multi-person tracking for team sports

Examples: Tennis serve analysis, running gait assessment, basketball shot form, gymnastics scoring.

Fitness and Exercise Applications

Interactive fitness experiences:

Use cases:

  • Exercise counting: Automatic rep counting for squats, push-ups, etc.
  • Form correction: Real-time feedback on posture and alignment
  • Virtual trainers: AI-guided workout sessions
  • Progress tracking: Monitor range of motion and improvements
  • Gamification: Motion-controlled fitness games
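Rep counting typically reduces to a small state machine over a joint angle; a sketch with illustrative thresholds (tune per exercise):

```python
class RepCounter:
    """Count repetitions from a per-frame joint angle, e.g. the knee
    angle for squats. Thresholds here are illustrative, not calibrated."""

    def __init__(self, down_below=90.0, up_above=160.0):
        self.down_below = down_below  # angle at the bottom of the movement
        self.up_above = up_above      # angle when fully extended again
        self.phase = "up"
        self.reps = 0

    def update(self, angle):
        """Feed one frame's angle; returns the running rep count."""
        if self.phase == "up" and angle < self.down_below:
            self.phase = "down"
        elif self.phase == "down" and angle > self.up_above:
            self.phase = "up"
            self.reps += 1            # one full down -> up cycle completed
        return self.reps
```

The hysteresis between the two thresholds prevents jittery keypoints from double-counting a single rep.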

Requirements:

  • Real-time inference (30+ FPS)
  • Works with consumer cameras (webcams, smartphones)
  • Robust to varied body types and clothing
  • Single-person focus (user in front of camera)

Examples: Mirror fitness, Peloton Guide, AI yoga apps, VR fitness games.

Animation and Motion Capture

Capture human motion for digital content:

Use cases:

  • Character animation: Drive 3D characters with human motion
  • Video game development: Motion capture without expensive suits
  • Film and VFX: Reference motion for animators
  • Virtual avatars: Real-time avatar control for VR/AR
  • Dance and performance capture: Preserve choreography digitally

Requirements:

  • High temporal consistency across frames
  • 3D pose estimation often needed
  • Accurate fingertip and facial tracking for detailed capture
  • Multi-person support for group performances

Transition from traditional mocap: Markerless pose estimation reduces setup cost and allows outdoor capture.

Medical Rehabilitation and Physical Therapy

Objective movement assessment for patients:

Use cases:

  • Gait analysis: Evaluate walking patterns after injury or surgery
  • Range of motion measurement: Track joint flexibility over time
  • Exercise compliance: Ensure patients perform exercises correctly
  • Fall risk assessment: Analyze balance and stability
  • Remote monitoring: Telehealth physical therapy sessions
  • Progress documentation: Objective metrics for treatment effectiveness

Requirements:

  • High accuracy for medical-grade measurements
  • Privacy-preserving (on-device processing preferred)
  • Intuitive interface for patients and therapists
  • Integration with medical records systems

Benefits: Objective data vs. subjective assessment, scalable remote care, consistent measurements.

Sign Language Recognition

Understand sign language communication:

Use cases:

  • Automatic translation: Sign language to text or speech
  • Learning tools: Feedback for sign language learners
  • Accessibility: Enable sign language interfaces for technology
  • Video captioning: Add captions to sign language videos

Requirements:

  • Hand keypoint detection (21 points per hand) in addition to body
  • Facial expression tracking (important for grammar)
  • High temporal resolution for rapid hand movements
  • Real-time performance for interactive communication

Challenges: Sign language syntax differs from spoken language, requires context understanding beyond just pose.

Human-Computer Interaction

Gesture-based interfaces and controls:

Use cases:

  • Touchless interfaces: Control devices without physical contact
  • Gaming: Motion-controlled gameplay
  • Smart home control: Gesture commands for IoT devices
  • Presentation tools: Control slides with hand gestures
  • Accessibility: Alternative input methods for users with disabilities

Requirements:

  • Low-latency detection (< 100ms for responsive interaction)
  • Clear gesture vocabulary (discrete, distinguishable poses)
  • Robust to varied users and environments
  • Works at typical interaction distances

Examples: Virtual reality controllers, smart TV gesture control, touchless bathroom fixtures.

Security and Surveillance

Behavioral analysis and anomaly detection:

Use cases:

  • Fall detection: Elderly care and safety monitoring
  • Suspicious behavior: Unusual poses or movements
  • Crowd monitoring: Density and flow analysis
  • Perimeter security: Detect climbing or jumping
  • Retail analytics: Customer behavior and attention

Requirements:

  • Multi-person detection in crowded scenes
  • Privacy considerations (avoid identifying individuals)
  • Robust to varied viewpoints (e.g., overhead cameras)
  • Real-time alerting for security events

Privacy approach: Pose-based analysis can provide behavioral insights without storing identifiable imagery.

Choosing an Approach

Consider these factors when selecting a keypoint detection method:

Top-Down vs. Bottom-Up

Choose top-down if:

  • Accuracy is paramount
  • Number of people per image is typically small (1-5)
  • You have a good person detector
  • Per-person analysis is needed (identify individuals)
  • Inference time can scale with number of people

Recommended: HRNet, ViTPose

Choose bottom-up if:

  • Speed is critical, especially for crowded scenes
  • Many people per image (crowds, sports teams)
  • Constant-time inference desired regardless of people count
  • Good grouping mechanism available
  • Slight accuracy trade-off acceptable

Recommended: OpenPose, HigherHRNet

2D vs. 3D

Choose 2D if:

  • Application doesn't require depth information
  • Faster inference and simpler models desired
  • More training data available (2D annotations easier)
  • Photo/video analysis sufficient

Use cases: Exercise counting, action recognition, photo effects

Choose 3D if:

  • Depth and spatial relationships essential
  • Biomechanics and precise measurements needed
  • VR/AR avatar control
  • Willing to invest in more complex pipeline

Use cases: Sports science, medical analysis, motion capture

Model Size and Speed

For real-time interactive applications (fitness apps, AR):

  • Lightweight models with efficient backbones
  • Lower input resolution (256×256 or 384×384)
  • Top-down for single person, bottom-up for groups
  • Hardware acceleration (TensorRT, CoreML)

For offline analysis (medical, research):

  • Large models for highest accuracy (ViTPose-H, HRNet-W48)
  • High input resolution (512×512 or higher)
  • Multi-scale testing and ensemble methods
  • No speed constraints

For edge devices (mobile, embedded):

  • Quantized lightweight models
  • MobileNet or EfficientNet backbones
  • Optimize entire pipeline including preprocessing
  • Profile on target hardware extensively

Domain-Specific Considerations

General human pose (photos, videos):

  • Pretrained COCO keypoint models
  • Fine-tune on domain if available
  • Standard 17-keypoint skeleton

Sports-specific:

  • Fine-tune on sport-specific poses
  • May need custom keypoint definitions
  • Temporal models for motion analysis

Hand pose:

  • Specialized hand models (21 keypoints per hand)
  • Higher resolution for finger details
  • Often combined with hand detection

Animal pose:

  • Transfer learning from human pose
  • Adapt skeleton structure for animal anatomy
  • Limited training data availability

Next Steps

Ready to train your own keypoint detection models? Our Keypoint Detection Training Guide provides comprehensive documentation on:

  • Available architectures (HRNet, ViTPose, OpenPose)
  • Top-down vs. bottom-up training strategies
  • Data preparation and annotation formats
  • Hyperparameter configuration and optimization
  • Fine-tuning on custom keypoint definitions
  • Evaluation and performance analysis

For related computer vision tasks, see the documentation on object detection and image segmentation.

