Keypoint Detection

Train models to detect and localize body keypoints for human pose estimation

Keypoint detection, also known as pose estimation, identifies the spatial locations of key body joints in images. This task goes beyond simple object detection by predicting specific anatomical landmarks like shoulders, elbows, wrists, hips, knees, and ankles. Keypoint detection is fundamental to applications in sports analytics, fitness tracking, motion capture, healthcare, human-computer interaction, and augmented reality.

Learn About Keypoint Detection

New to keypoint detection? Visit our Keypoint Detection Concepts Guide to learn about pose estimation fundamentals, keypoint formats, skeleton connectivity, and evaluation metrics like OKS and PCK.

Available Models

Vision Transformer-Based Models

State-of-the-art pose estimation using transformer architectures for superior performance.

  • ViTPose - Vision Transformer for human pose estimation with multiple model sizes

Common Configuration

Data Requirements

Training Images: Directory containing images of people in various poses

Keypoint Annotations: JSON file in COCO keypoint format containing:

  • Image information (filename, dimensions)
  • Person bounding boxes
  • Keypoint coordinates (x, y) with visibility flags
  • Keypoint connections defining skeleton structure

COCO Keypoint Format Example:

{
  "images": [
    {"id": 1, "file_name": "person1.jpg", "height": 480, "width": 640}
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 50, 200, 400],
      "keypoints": [
        120, 80, 2,   // nose (x, y, visibility)
        110, 70, 2,   // left_eye
        130, 70, 2,   // right_eye
        // ... 17 keypoints total for COCO format
      ],
      "num_keypoints": 17
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "keypoints": ["nose", "left_eye", "right_eye", ...],
      "skeleton": [[16, 14], [14, 12], ...]  // connections
    }
  ]
}

Visibility Flags:

  • 0: Not labeled (keypoint not annotated; x and y are set to 0)
  • 1: Labeled but not visible (occluded)
  • 2: Labeled and visible
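The flat `keypoints` array in the annotation above can be split into (x, y, visibility) triples. A minimal Python sketch of that parsing (field layout follows the COCO format shown above):

```python
# Split a COCO-style flat keypoint list into (x, y, visibility) triples.
def parse_keypoints(flat):
    """flat: [x1, y1, v1, x2, y2, v2, ...] as in a COCO 'keypoints' field."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoint list length must be a multiple of 3")
    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]

def count_visible(triples):
    """Count keypoints with visibility flag 2 (labeled and visible)."""
    return sum(1 for _, _, v in triples if v == 2)

triples = parse_keypoints([120, 80, 2, 110, 70, 2, 0, 0, 0])
# -> [(120, 80, 2), (110, 70, 2), (0, 0, 0)]
```

Note that unlabeled keypoints (visibility 0) keep their zeroed coordinates so that list indices still line up with the `keypoints` names in `categories`.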

Key Training Parameters

Model Variant: Size of the pose estimation model

  • Small: Fast, lightweight, lower accuracy
  • Base: Balanced performance and speed
  • Large: Higher accuracy, slower inference
  • Huge: Maximum accuracy, highest computational cost

Number of Keypoints: Total keypoints to detect

  • 17 for COCO format (standard human pose)
  • Custom numbers for specialized applications
  • Face keypoints: 68+ points
  • Hand keypoints: 21 points per hand

Batch Size: Number of images processed together

  • 4-8 typical for ViTPose base
  • 2-4 for larger variants
  • 8-16 for small variant
  • Reduce if out-of-memory errors occur

Epochs: Complete passes through training data

  • 1-5 epochs typical for fine-tuning
  • More epochs for training from scratch
  • Pose estimation benefits more from annotation quality than raw quantity

Understanding Metrics

OKS (Object Keypoint Similarity): Primary metric for keypoint detection

  • Similar to IoU but for keypoints
  • Accounts for keypoint distance from ground truth
  • Normalized by person scale (larger people = more tolerance)
  • Range: 0 to 1, higher is better
  • OKS > 0.5: Generally acceptable
  • OKS > 0.75: High-quality pose estimation
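The OKS score described above can be computed directly for a single person. A sketch following the COCO definition, where each labeled keypoint contributes exp(-d²/(2·s²·k²)) with d the pixel distance to ground truth, s² the person's area, and k a per-keypoint constant:

```python
import math

# Object Keypoint Similarity for one person (COCO definition):
# each labeled keypoint contributes exp(-d^2 / (2 * area * k^2)),
# and the contributions are averaged over labeled keypoints.
def oks(pred, gt, vis, area, kappas):
    """pred, gt: lists of (x, y); vis: ground-truth visibility flags;
    area: person scale in pixels^2; kappas: per-keypoint constants."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, vis, kappas):
        if v == 0:  # unlabeled keypoints are excluded from the score
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2 * area * k ** 2))
        count += 1
    return total / count if count else 0.0
```

A perfect prediction scores 1.0; because the distance is divided by the person's area, the same pixel error is penalized less for larger people, which is the scale normalization mentioned above.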

mAP (mean Average Precision): Average precision across OKS thresholds

  • mAP@0.5: Average Precision at OKS threshold 0.5
  • mAP@0.5:0.95: COCO standard, averaged over thresholds
  • Higher is better, ranges from 0 to 1 (or 0% to 100%)

PCK (Percentage of Correct Keypoints): Simpler evaluation metric

  • Percentage of keypoints within threshold distance
  • PCKh@0.5: Within 50% of head segment length
  • Easier to interpret than OKS
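PCK is simple enough to sketch in a few lines. In this illustrative version the reference length is passed in directly; for PCKh@0.5 it would be the head segment length with alpha = 0.5:

```python
import math

# PCK: fraction of labeled keypoints whose prediction falls within
# alpha * ref_len pixels of the ground truth.
def pck(pred, gt, vis, ref_len, alpha=0.5):
    """pred, gt: lists of (x, y); vis: visibility flags;
    ref_len: reference length (e.g. head segment for PCKh)."""
    hits, count = 0, 0
    for (px, py), (gx, gy), v in zip(pred, gt, vis):
        if v == 0:
            continue
        dist = math.hypot(px - gx, py - gy)
        hits += dist <= alpha * ref_len
        count += 1
    return hits / count if count else 0.0
```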

Per-Keypoint Accuracy: Individual keypoint performance

  • Some keypoints harder than others (e.g., hips vs shoulders)
  • Useful for identifying weaknesses
  • Wrists and ankles typically hardest

Choosing the Right Model

By Model Variant

ViTPose-Small

  • Fastest inference and training
  • Good for real-time applications
  • Edge device deployment
  • 60-70% mAP on COCO

ViTPose-Base

  • Balanced performance and speed
  • Most versatile choice
  • Good for general applications
  • 70-75% mAP on COCO

ViTPose-Large

  • High accuracy applications
  • More computational resources
  • Professional use cases
  • 75-78% mAP on COCO

ViTPose-Huge

  • Maximum accuracy
  • Research and production systems
  • Substantial computational requirements
  • 78-80% mAP on COCO

By Use Case

Fitness and Exercise Tracking

  • ViTPose-Base or ViTPose-Small
  • Need real-time feedback
  • Pose accuracy for form correction
  • Mobile or edge deployment

Sports Performance Analysis

  • ViTPose-Large or ViTPose-Huge
  • Maximum accuracy for biomechanics
  • Frame-by-frame analysis acceptable
  • Professional athlete assessment

Motion Capture for Animation

  • ViTPose-Huge for best results
  • Accuracy critical for realistic animation
  • Post-processing acceptable
  • Multi-person tracking often needed

Security and Surveillance

  • ViTPose-Base for balance
  • Need to handle occlusions
  • Multiple people in frame
  • Varying distances and angles

Healthcare and Rehabilitation

  • ViTPose-Large for clinical accuracy
  • Gait analysis and movement assessment
  • Precise measurements required
  • Patient privacy considerations

AR/VR and Gaming

  • ViTPose-Small for real-time
  • Responsiveness critical
  • Can sacrifice some accuracy
  • Mobile device compatibility

Best Practices

Data Preparation

  1. High-Quality Annotations: Accurate keypoint labeling is critical

    • Precise keypoint placement on anatomical landmarks
    • Consistent annotation guidelines across dataset
    • Mark occluded keypoints appropriately (visibility=1)
    • Use experienced annotators for medical/sports applications
  2. Pose Diversity:

    • Various body poses and activities
    • Different camera angles and viewpoints
    • Multiple body types and demographics
    • Include challenging poses (occlusions, unusual positions)
  3. Person Scale Variation:

    • Close-up views (large person in frame)
    • Distant views (small person in frame)
    • Multiple people at different distances
    • OKS is scale-normalized but training benefits from variety
  4. Lighting and Conditions:

    • Indoor and outdoor lighting
    • Various clothing types and colors
    • Different backgrounds (cluttered vs clean)
    • Weather conditions if applicable
  5. Dataset Balance:

    • At least 500 annotated people for fine-tuning
    • 2,000+ for training from scratch
    • Balanced representation of poses
    • Include edge cases and difficult examples

Training Strategy

  1. Start with Pre-trained Weights: Always use COCO pre-trained models

    • Transfer learning dramatically reduces training time
    • Better final accuracy with limited data
    • Converges faster and more reliably
  2. Choose Appropriate Model Size:

    • Start with Base variant for experimentation
    • Scale up to Large/Huge if accuracy insufficient
    • Scale down to Small if speed critical
  3. Monitor Training Metrics:

    • Watch OKS/mAP on validation set
    • Check per-keypoint accuracy for weaknesses
    • Loss should decrease steadily
    • Validate on diverse test set
  4. Adjust for Domain:

    • Sports: May need more epochs for unusual poses
    • Medical: Requires highest quality annotations
    • Fitness: Balance speed and accuracy
    • AR/VR: Prioritize low latency

Common Pitfalls

Inaccurate Keypoints on Specific Joints

  • Check annotation consistency for those joints
  • May need more training examples of that pose type
  • Some joints inherently harder (wrists, ankles)
  • Consider joint-specific confidence thresholds

Poor Performance on Occluded Keypoints

  • Ensure training data includes occlusions
  • Mark occluded keypoints with visibility=1
  • Model learns to predict occluded positions
  • Some occlusion errors inevitable

Struggles with Unusual Poses

  • Collect more diverse pose examples
  • Include athletic, dance, yoga poses
  • Train longer on specialized datasets
  • Consider data augmentation

Multiple People Confusion

  • Ensure person detection accurate
  • Check bounding box quality
  • May need better person detector
  • Consider multi-person specific models

Scale Sensitivity Issues

  • Include various person sizes in training
  • Check if small/large people detected correctly
  • Verify OKS scale normalization working
  • May need to adjust input resolution

GPU Requirements

Memory Guidelines

ViTPose-Small:

  • 4-6GB GPU sufficient
  • Batch size 8-16
  • Fast training and inference

ViTPose-Base:

  • 8-12GB GPU recommended
  • Batch size 4-8
  • Balanced performance

ViTPose-Large:

  • 12-16GB GPU recommended
  • Batch size 2-4
  • High accuracy

ViTPose-Huge:

  • 16GB+ GPU required
  • Batch size 1-2
  • Maximum accuracy

Training Time Estimates

Small Dataset (500 people):

  • ViTPose-Small: 15-30 minutes per epoch
  • ViTPose-Base: 30-60 minutes per epoch
  • ViTPose-Large/Huge: 1-2 hours per epoch

Medium Dataset (2,000 people):

  • ViTPose-Small: 1-2 hours per epoch
  • ViTPose-Base: 2-4 hours per epoch
  • ViTPose-Large/Huge: 4-8 hours per epoch

Large Dataset (10,000+ people):

  • ViTPose-Small: 4-6 hours per epoch
  • ViTPose-Base: 8-12 hours per epoch
  • ViTPose-Large/Huge: 16-24 hours per epoch

Times assume modern GPU (RTX 3080/4080 or better)

Dataset Size Guidelines

Minimum: 500 annotated people with 17 keypoints each
Good: 2,000-5,000 annotated people
Excellent: 10,000+ annotated people

Quality matters more than quantity - accurate keypoint annotations are crucial for good performance.

Inference Configuration

Confidence Threshold: Minimum confidence for keypoint predictions

  • 0.3 default (liberal, accepts uncertain keypoints)
  • 0.5 for higher precision (fewer false positives)
  • 0.7+ for strict applications (medical, professional sports)
  • Lower threshold: More detections, may include errors
  • Higher threshold: Fewer detections, more reliable
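Applying the threshold is a simple per-keypoint filter. A sketch that masks low-confidence points instead of dropping them, so indices still line up with the keypoint order:

```python
# Filter predicted keypoints by a confidence threshold; low-confidence
# points are masked out (set to None) rather than removed, so list
# indices still correspond to the keypoint names.
def filter_keypoints(keypoints, scores, threshold=0.3):
    """keypoints: list of (x, y); scores: per-keypoint confidences."""
    return [kp if s >= threshold else None
            for kp, s in zip(keypoints, scores)]

kept = filter_keypoints([(10, 20), (30, 40)], [0.9, 0.2], threshold=0.3)
# -> [(10, 20), None]
```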

Advanced Considerations

Multi-Person Pose Estimation

  • Bottom-up approach: Detect all keypoints, group by person
  • Top-down approach: Detect people first, then keypoints per person
  • ViTPose uses top-down approach (requires person detector)
  • Consider computational cost of multiple people
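The top-down pipeline can be sketched as: detect people, crop each box, run the pose model on the crop, then shift keypoints back into image coordinates. In this sketch `detect_people` and `estimate_pose` are hypothetical placeholders for a person detector and a pose model (ViTPose itself is top-down and expects an external detector):

```python
# Top-down pose estimation sketch. `detect_people` and `estimate_pose`
# are hypothetical callables standing in for a real person detector
# and pose model.
def top_down_pose(image, detect_people, estimate_pose):
    poses = []
    for box in detect_people(image):        # (x, y, w, h) per person
        crop = crop_to_box(image, box)
        keypoints = estimate_pose(crop)     # keypoints in crop coordinates
        poses.append(shift_to_image(keypoints, box))
    return poses

def crop_to_box(image, box):
    """Crop a row-major image (nested lists or array-like) to a box."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def shift_to_image(keypoints, box):
    """Translate crop-relative keypoints back to full-image coordinates."""
    x, y, _, _ = box
    return [(kx + x, ky + y) for kx, ky in keypoints]
```

Note that inference cost grows linearly with the number of detected people, which is the computational concern mentioned above.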

Temporal Consistency

  • Video pose tracking needs frame-to-frame smoothing
  • Use tracking algorithms for consistent IDs
  • Temporal models reduce jitter in videos
  • Important for motion capture applications
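One simple way to reduce frame-to-frame jitter is an exponential moving average over per-frame keypoints. A minimal sketch (real pipelines would also handle missing detections and identity tracking):

```python
# Exponential moving average smoothing of per-frame keypoints.
# alpha controls responsiveness: 1.0 means no smoothing, smaller
# values smooth more but lag behind fast motion.
def smooth_keypoints(frames, alpha=0.5):
    """frames: list of per-frame keypoint lists [(x, y), ...],
    all frames assumed to contain the same keypoints in order."""
    smoothed, prev = [], None
    for kps in frames:
        if prev is None:
            cur = list(kps)                 # first frame passes through
        else:
            cur = [(alpha * x + (1 - alpha) * px,
                    alpha * y + (1 - alpha) * py)
                   for (x, y), (px, py) in zip(kps, prev)]
        smoothed.append(cur)
        prev = cur
    return smoothed
```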

3D Pose Estimation

  • ViTPose predicts 2D keypoints only
  • 3D reconstruction requires multiple cameras or depth info
  • Lifting 2D to 3D possible with specialized models
  • Consider if depth information needed for application
