Keypoint Detection
Train models to detect and localize body keypoints for human pose estimation
Keypoint detection, also known as pose estimation, identifies the spatial locations of key body joints in images. This task goes beyond simple object detection by predicting specific anatomical landmarks like shoulders, elbows, wrists, hips, knees, and ankles. Keypoint detection is fundamental to applications in sports analytics, fitness tracking, motion capture, healthcare, human-computer interaction, and augmented reality.
Learn About Keypoint Detection
New to keypoint detection? Visit our Keypoint Detection Concepts Guide to learn about pose estimation fundamentals, keypoint formats, skeleton connectivity, and evaluation metrics like OKS and PCK.
Available Models
Vision Transformer-Based Models
State-of-the-art pose estimation using transformer architectures for superior performance.
- ViTPose - Vision Transformer for human pose estimation with multiple model sizes
Common Configuration
Data Requirements
Training Images: Directory containing images of people in various poses
Keypoint Annotations: JSON file in COCO keypoint format containing:
- Image information (filename, dimensions)
- Person bounding boxes
- Keypoint coordinates (x, y) with visibility flags
- Keypoint connections defining skeleton structure
COCO Keypoint Format Example:
{
"images": [
{"id": 1, "file_name": "person1.jpg", "height": 480, "width": 640}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [100, 50, 200, 400],
"keypoints": [
120, 80, 2, // nose (x, y, visibility)
110, 70, 2, // left_eye
130, 70, 2, // right_eye
// ... 17 keypoints total for COCO format
],
"num_keypoints": 17
}
],
"categories": [
{
"id": 1,
"name": "person",
"keypoints": ["nose", "left_eye", "right_eye", ...],
"skeleton": [[16, 14], [14, 12], ...] // connections as 1-indexed keypoint pairs
}
]
}
Visibility Flags:
- 0: Not labeled (person is outside the image)
- 1: Labeled but not visible (occluded)
- 2: Labeled and visible
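The flat `keypoints` array above can be decoded into (x, y, visibility) triplets. A minimal sketch in Python (the helper names are illustrative, not from any library):

```python
# Decode a COCO-format flat keypoints list into named (x, y, visibility) triplets.
# COCO stores 17 keypoints per person as [x1, y1, v1, x2, y2, v2, ...].

COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def decode_keypoints(flat):
    """Split a flat [x, y, v, ...] list into named (x, y, v) triplets."""
    assert len(flat) % 3 == 0
    triplets = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    return dict(zip(COCO_KEYPOINT_NAMES, triplets))

def count_labeled(flat):
    """num_keypoints in COCO counts keypoints with visibility > 0."""
    return sum(1 for v in flat[2::3] if v > 0)
```

Note that `num_keypoints` counts keypoints with visibility 1 or 2; unlabeled keypoints (visibility 0) still occupy three slots in the array, typically as zeros.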
Key Training Parameters
Model Variant: Size of the pose estimation model
- Small: Fast, lightweight, lower accuracy
- Base: Balanced performance and speed
- Large: Higher accuracy, slower inference
- Huge: Maximum accuracy, highest computational cost
Number of Keypoints: Total keypoints to detect
- 17 for COCO format (standard human pose)
- Custom numbers for specialized applications
- Face keypoints: 68+ points
- Hand keypoints: 21 points per hand
Batch Size: Number of images processed together
- 4-8 typical for ViTPose base
- 2-4 for larger variants
- 8-16 for small variant
- Reduce if out-of-memory errors occur
Epochs: Complete passes through training data
- 1-5 epochs typical for fine-tuning
- More epochs for training from scratch
- Pose estimation benefits more from annotation quality than from raw quantity
Understanding Metrics
OKS (Object Keypoint Similarity): Primary metric for keypoint detection
- Similar to IoU but for keypoints
- Accounts for keypoint distance from ground truth
- Normalized by person scale (larger people = more tolerance)
- Range: 0 to 1, higher is better
- OKS > 0.5: Generally acceptable
- OKS > 0.75: High-quality pose estimation
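OKS for one person is the average, over labeled keypoints, of exp(-d²/(2·s²·k²)), where d is the pixel distance between prediction and ground truth, s² is the person's area, and k is a per-keypoint falloff constant. A sketch using the standard COCO sigmas (the constants below match the COCO evaluation; the function itself is a simplified illustration, not the official implementation):

```python
import math

# Per-keypoint falloff constants (sigmas) from the COCO keypoint evaluation,
# listed in standard COCO keypoint order.
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035,   # nose, eyes, ears
    0.079, 0.079, 0.072, 0.072,          # shoulders, elbows
    0.062, 0.062, 0.107, 0.107,          # wrists, hips
    0.087, 0.087, 0.089, 0.089,          # knees, ankles
]

def oks(gt, pred, area):
    """Object Keypoint Similarity between one ground-truth and one predicted pose.

    gt, pred: flat [x1, y1, v1, x2, y2, v2, ...] lists (COCO layout).
    area: ground-truth person box area, used as the scale term s^2.
    Only keypoints labeled in the ground truth (v > 0) contribute.
    """
    total, labeled = 0.0, 0
    for i, sigma in enumerate(COCO_SIGMAS):
        x_g, y_g, v = gt[3 * i], gt[3 * i + 1], gt[3 * i + 2]
        if v == 0:
            continue
        x_p, y_p = pred[3 * i], pred[3 * i + 1]
        d2 = (x_p - x_g) ** 2 + (y_p - y_g) ** 2
        k = 2 * sigma
        total += math.exp(-d2 / (2 * area * k ** 2 + 1e-9))
        labeled += 1
    return total / labeled if labeled else 0.0
```

The larger sigmas on hips and ankles encode the scale-normalized tolerance mentioned above: the same pixel error is penalized less on joints that annotators themselves place less consistently.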
mAP (mean Average Precision): Average precision across OKS thresholds
- mAP@0.5: Average Precision at OKS threshold 0.5
- mAP@0.5:0.95: COCO standard, averaged over thresholds
- Higher is better, ranges from 0 to 1 (or 0% to 100%)
PCK (Percentage of Correct Keypoints): Simpler evaluation metric
- Percentage of keypoints within threshold distance
- PCKh@0.5: Within 50% of head segment length
- Easier to interpret than OKS
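PCKh reduces to a counting rule: a keypoint is "correct" if it lands within a fraction of the head segment length. A minimal sketch (function name and argument layout are illustrative):

```python
import math

def pckh(gt_points, pred_points, head_length, alpha=0.5):
    """PCKh: fraction of keypoints whose prediction falls within
    alpha * head_length of the ground truth.

    gt_points, pred_points: lists of (x, y) tuples for labeled keypoints.
    head_length: ground-truth head segment length in pixels.
    alpha=0.5 gives the common PCKh@0.5 variant.
    """
    threshold = alpha * head_length
    correct = sum(
        1 for (xg, yg), (xp, yp) in zip(gt_points, pred_points)
        if math.hypot(xp - xg, yp - yg) <= threshold
    )
    return correct / len(gt_points)
```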
Per-Keypoint Accuracy: Individual keypoint performance
- Some keypoints harder than others (e.g., hips vs shoulders)
- Useful for identifying weaknesses
- Wrists and ankles typically hardest
Choosing the Right Model
By Model Variant
ViTPose-Small
- Fastest inference and training
- Good for real-time applications
- Edge device deployment
- 60-70% mAP on COCO
ViTPose-Base
- Balanced performance and speed
- Most versatile choice
- Good for general applications
- 70-75% mAP on COCO
ViTPose-Large
- High accuracy applications
- More computational resources
- Professional use cases
- 75-78% mAP on COCO
ViTPose-Huge
- Maximum accuracy
- Research and production systems
- Substantial computational requirements
- 78-80% mAP on COCO
By Use Case
Fitness and Exercise Tracking
- ViTPose-Base or ViTPose-Small
- Need real-time feedback
- Pose accuracy for form correction
- Mobile or edge deployment
Sports Performance Analysis
- ViTPose-Large or ViTPose-Huge
- Maximum accuracy for biomechanics
- Frame-by-frame analysis acceptable
- Professional athlete assessment
Motion Capture for Animation
- ViTPose-Huge for best results
- Accuracy critical for realistic animation
- Post-processing acceptable
- Multi-person tracking often needed
Security and Surveillance
- ViTPose-Base for balance
- Need to handle occlusions
- Multiple people in frame
- Varying distances and angles
Healthcare and Rehabilitation
- ViTPose-Large for clinical accuracy
- Gait analysis and movement assessment
- Precise measurements required
- Patient privacy considerations
AR/VR and Gaming
- ViTPose-Small for real-time
- Responsiveness critical
- Can sacrifice some accuracy
- Mobile device compatibility
Best Practices
Data Preparation
High-Quality Annotations: Accurate keypoint labeling is critical
- Precise keypoint placement on anatomical landmarks
- Consistent annotation guidelines across dataset
- Mark occluded keypoints appropriately (visibility=1)
- Use experienced annotators for medical/sports applications
Pose Diversity:
- Various body poses and activities
- Different camera angles and viewpoints
- Multiple body types and demographics
- Include challenging poses (occlusions, unusual positions)
Person Scale Variation:
- Close-up views (large person in frame)
- Distant views (small person in frame)
- Multiple people at different distances
- OKS is scale-normalized but training benefits from variety
Lighting and Conditions:
- Indoor and outdoor lighting
- Various clothing types and colors
- Different backgrounds (cluttered vs clean)
- Weather conditions if applicable
Dataset Balance:
- At least 500 annotated people for fine-tuning
- 2,000+ for training from scratch
- Balanced representation of poses
- Include edge cases and difficult examples
Training Strategy
Start with Pre-trained Weights: Always use COCO pre-trained models
- Transfer learning dramatically reduces training time
- Better final accuracy with limited data
- Converges faster and more reliably
Choose Appropriate Model Size:
- Start with Base variant for experimentation
- Scale up to Large/Huge if accuracy insufficient
- Scale down to Small if speed critical
Monitor Training Metrics:
- Watch OKS/mAP on validation set
- Check per-keypoint accuracy for weaknesses
- Loss should decrease steadily
- Validate on diverse test set
Adjust for Domain:
- Sports: May need more epochs for unusual poses
- Medical: Requires highest quality annotations
- Fitness: Balance speed and accuracy
- AR/VR: Prioritize low latency
Common Pitfalls
Inaccurate Keypoints on Specific Joints
- Check annotation consistency for those joints
- May need more training examples of that pose type
- Some joints inherently harder (wrists, ankles)
- Consider joint-specific confidence thresholds
Poor Performance on Occluded Keypoints
- Ensure training data includes occlusions
- Mark occluded keypoints with visibility=1
- Model learns to predict occluded positions
- Some occlusion errors inevitable
Struggles with Unusual Poses
- Collect more diverse pose examples
- Include athletic, dance, yoga poses
- Train longer on specialized datasets
- Consider data augmentation
Multiple People Confusion
- Ensure the person detector is accurate
- Check bounding box quality
- A stronger person detector may be needed
- Consider multi-person specific models
Scale Sensitivity Issues
- Include various person sizes in training
- Check if small/large people detected correctly
- Verify that OKS scale normalization is working
- May need to adjust input resolution
GPU Requirements
Memory Guidelines
ViTPose-Small:
- 4-6GB GPU sufficient
- Batch size 8-16
- Fast training and inference
ViTPose-Base:
- 8-12GB GPU recommended
- Batch size 4-8
- Balanced performance
ViTPose-Large:
- 12-16GB GPU recommended
- Batch size 2-4
- High accuracy
ViTPose-Huge:
- 16GB+ GPU required
- Batch size 1-2
- Maximum accuracy
Training Time Estimates
Small Dataset (500 people):
- ViTPose-Small: 15-30 minutes per epoch
- ViTPose-Base: 30-60 minutes per epoch
- ViTPose-Large/Huge: 1-2 hours per epoch
Medium Dataset (2,000 people):
- ViTPose-Small: 1-2 hours per epoch
- ViTPose-Base: 2-4 hours per epoch
- ViTPose-Large/Huge: 4-8 hours per epoch
Large Dataset (10,000+ people):
- ViTPose-Small: 4-6 hours per epoch
- ViTPose-Base: 8-12 hours per epoch
- ViTPose-Large/Huge: 16-24 hours per epoch
Times assume a modern GPU (RTX 3080/4080 or better).
Dataset Size Guidelines
Minimum: 500 annotated people with 17 keypoints each
Good: 2,000-5,000 annotated people
Excellent: 10,000+ annotated people
Quality matters more than quantity - accurate keypoint annotations are crucial for good performance.
Inference Configuration
Confidence Threshold: Minimum confidence for keypoint predictions
- 0.3 default (liberal, accepts uncertain keypoints)
- 0.5 for higher precision (fewer false positives)
- 0.7+ for strict applications (medical, professional sports)
- Lower threshold: More detections, may include errors
- Higher threshold: Fewer detections, more reliable
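Applying the threshold is a simple per-keypoint filter over the model's confidence scores. A sketch (the function name is illustrative; returning `None` for suppressed keypoints is one convention among several):

```python
def filter_keypoints(keypoints, scores, threshold=0.3):
    """Suppress low-confidence keypoint predictions.

    keypoints: list of (x, y) predictions, one per keypoint.
    scores: per-keypoint confidence scores from the model head.
    Returns (x, y) for confident keypoints and None for suppressed ones,
    so downstream code (e.g., skeleton drawing) can skip them.
    """
    return [kp if s >= threshold else None
            for kp, s in zip(keypoints, scores)]
```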
Advanced Considerations
Multi-Person Pose Estimation
- Bottom-up approach: Detect all keypoints, group by person
- Top-down approach: Detect people first, then keypoints per person
- ViTPose uses the top-down approach (requires a separate person detector)
- Consider computational cost of multiple people
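The top-down control flow can be sketched independently of any specific model: detect boxes, crop, run the single-person pose model per crop, then map keypoints back to image coordinates. `detect_people` and `estimate_pose` below are stand-ins for your detector and pose model, not real APIs:

```python
# Shape of a top-down pipeline. Cost scales with the number of detected
# people, since the pose model runs once per crop.

def crop(image, box):
    """Crop a row-major nested-list image to an (x, y, w, h) box."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def top_down_poses(image, detect_people, estimate_pose):
    """Run the pose model once per detected person, mapping each crop's
    keypoints back into full-image coordinates."""
    poses = []
    for (x, y, w, h) in detect_people(image):
        local_kps = estimate_pose(crop(image, (x, y, w, h)))
        poses.append([(kx + x, ky + y) for (kx, ky) in local_kps])
    return poses
```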
Temporal Consistency
- Video pose tracking needs frame-to-frame smoothing
- Use tracking algorithms for consistent IDs
- Temporal models reduce jitter in videos
- Important for motion capture applications
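As a baseline for jitter reduction, an exponential moving average over per-frame keypoint positions is often enough to demonstrate the effect; production systems typically use more sophisticated filters such as the One Euro filter. A sketch (function name and data layout are illustrative):

```python
def ema_smooth(frames, alpha=0.5):
    """Exponential moving average over per-frame keypoint positions.

    frames: list of frames, each a list of (x, y) keypoints for one
    tracked person (smoothing assumes consistent IDs across frames).
    alpha: weight of the new frame; lower = smoother but more lag.
    """
    smoothed = [list(frames[0])]
    for frame in frames[1:]:
        prev = smoothed[-1]
        smoothed.append([
            (alpha * x + (1 - alpha) * px, alpha * y + (1 - alpha) * py)
            for (x, y), (px, py) in zip(frame, prev)
        ])
    return smoothed
```

The lag/smoothness trade-off controlled by `alpha` is why motion-capture pipelines favor adaptive filters: a fixed low `alpha` smooths jitter but blurs fast, genuine motion.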
3D Pose Estimation
- ViTPose predicts 2D keypoints only
- 3D reconstruction requires multiple cameras or depth info
- Lifting 2D to 3D possible with specialized models
- Consider if depth information needed for application