Keypoint Detection
Train models to detect and localize body keypoints for human pose estimation
Keypoint detection, also known as pose estimation, identifies the spatial locations of key body joints in images. This task goes beyond simple object detection by predicting specific anatomical landmarks like shoulders, elbows, wrists, hips, knees, and ankles. Keypoint detection is fundamental to applications in sports analytics, fitness tracking, motion capture, healthcare, human-computer interaction, and augmented reality.
Learn About Keypoint Detection
New to keypoint detection? Visit our Keypoint Detection Concepts Guide to learn about pose estimation fundamentals, keypoint formats, skeleton connectivity, and evaluation metrics like OKS and PCK.
Available Models
Vision Transformer-Based Models
State-of-the-art pose estimation using transformer architectures for superior performance.
- ViTPose - Vision Transformer for human pose estimation with multiple model sizes
Common Configuration
Data Requirements
Training Images: Directory containing images of people in various poses
Keypoint Annotations: JSON file in COCO keypoint format containing:
- Image information (filename, dimensions)
- Person bounding boxes
- Keypoint coordinates (x, y) with visibility flags
- Keypoint connections defining skeleton structure
COCO Keypoint Format Example:
{
"images": [
{"id": 1, "file_name": "person1.jpg", "height": 480, "width": 640}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [100, 50, 200, 400],
"keypoints": [
120, 80, 2, // nose (x, y, visibility)
110, 70, 2, // left_eye
130, 70, 2, // right_eye
// ... 17 keypoints total for COCO format
],
"num_keypoints": 17
}
],
"categories": [
{
"id": 1,
"name": "person",
"keypoints": ["nose", "left_eye", "right_eye", ...],
"skeleton": [[16, 14], [14, 12], ...] // connections as 1-indexed keypoint pairs
}
]
}
Visibility Flags:
- 0: Not labeled (person is outside the image)
- 1: Labeled but not visible (occluded)
- 2: Labeled and visible
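The flat `keypoints` array above can be decoded into (x, y, visibility) triplets. A minimal sketch in Python (the helper names are illustrative, not from any library):

```python
# Decode a COCO-format flat keypoints list into named (x, y, visibility) triplets.
# COCO stores 17 keypoints per person as [x1, y1, v1, x2, y2, v2, ...].

COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def decode_keypoints(flat):
    """Split a flat [x, y, v, ...] list into named (x, y, v) triplets."""
    assert len(flat) % 3 == 0
    triplets = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    return dict(zip(COCO_KEYPOINT_NAMES, triplets))

def count_labeled(flat):
    """num_keypoints in COCO counts keypoints with visibility > 0."""
    return sum(1 for v in flat[2::3] if v > 0)
```

Note that `num_keypoints` counts keypoints with visibility 1 or 2; unlabeled keypoints (visibility 0) still occupy three slots in the array, typically as zeros.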
Key Training Parameters
Model Variant: Size of the pose estimation model
- Small: Fast, lightweight, lower accuracy
- Base: Balanced performance and speed
- Large: Higher accuracy, slower inference
- Huge: Maximum accuracy, highest computational cost
Number of Keypoints: Total keypoints to detect
- 17 for COCO format (standard human pose)
- Custom numbers for specialized applications
- Face keypoints: 68+ points
- Hand keypoints: 21 points per hand
Batch Size: Number of images processed together
- 4-8 typical for ViTPose base
- 2-4 for larger variants
- 8-16 for small variant
- Reduce if out-of-memory errors occur
Epochs: Complete passes through training data
- 1-5 epochs typical for fine-tuning
- More epochs for training from scratch
- Pose estimation benefits more from annotation quality than from raw quantity
Understanding Metrics
OKS (Object Keypoint Similarity): Primary metric for keypoint detection
- Similar to IoU but for keypoints
- Accounts for keypoint distance from ground truth
- Normalized by person scale (larger people = more tolerance)
- Range: 0 to 1, higher is better
- OKS > 0.5: Generally acceptable
- OKS > 0.75: High-quality pose estimation
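OKS for one person is the average, over labeled keypoints, of exp(-d²/(2·s²·k²)), where d is the pixel distance between prediction and ground truth, s² is the person's area, and k is a per-keypoint falloff constant. A sketch using the standard COCO sigmas (the constants below match the COCO evaluation; the function itself is a simplified illustration, not the official implementation):

```python
import math

# Per-keypoint falloff constants (sigmas) from the COCO keypoint evaluation,
# listed in standard COCO keypoint order.
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035,   # nose, eyes, ears
    0.079, 0.079, 0.072, 0.072,          # shoulders, elbows
    0.062, 0.062, 0.107, 0.107,          # wrists, hips
    0.087, 0.087, 0.089, 0.089,          # knees, ankles
]

def oks(gt, pred, area):
    """Object Keypoint Similarity between one ground-truth and one predicted pose.

    gt, pred: flat [x1, y1, v1, x2, y2, v2, ...] lists (COCO layout).
    area: ground-truth person box area, used as the scale term s^2.
    Only keypoints labeled in the ground truth (v > 0) contribute.
    """
    total, labeled = 0.0, 0
    for i, sigma in enumerate(COCO_SIGMAS):
        x_g, y_g, v = gt[3 * i], gt[3 * i + 1], gt[3 * i + 2]
        if v == 0:
            continue
        x_p, y_p = pred[3 * i], pred[3 * i + 1]
        d2 = (x_p - x_g) ** 2 + (y_p - y_g) ** 2
        k = 2 * sigma
        total += math.exp(-d2 / (2 * area * k ** 2 + 1e-9))
        labeled += 1
    return total / labeled if labeled else 0.0
```

The larger sigmas on hips and ankles encode the scale-normalized tolerance mentioned above: the same pixel error is penalized less on joints that annotators themselves place less consistently.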
mAP (mean Average Precision): Average precision across OKS thresholds
- mAP@0.5: Average Precision at OKS threshold 0.5
- mAP@0.5:0.95: COCO standard, averaged over thresholds
- Higher is better, ranges from 0 to 1 (or 0% to 100%)
PCK (Percentage of Correct Keypoints): Simpler evaluation metric
- Percentage of keypoints within threshold distance
- PCKh@0.5: Within 50% of head segment length
- Easier to interpret than OKS
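PCKh reduces to a counting rule: a keypoint is "correct" if it lands within a fraction of the head segment length. A minimal sketch (function name and argument layout are illustrative):

```python
import math

def pckh(gt_points, pred_points, head_length, alpha=0.5):
    """PCKh: fraction of keypoints whose prediction falls within
    alpha * head_length of the ground truth.

    gt_points, pred_points: lists of (x, y) tuples for labeled keypoints.
    head_length: ground-truth head segment length in pixels.
    alpha=0.5 gives the common PCKh@0.5 variant.
    """
    threshold = alpha * head_length
    correct = sum(
        1 for (xg, yg), (xp, yp) in zip(gt_points, pred_points)
        if math.hypot(xp - xg, yp - yg) <= threshold
    )
    return correct / len(gt_points)
```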
Per-Keypoint Accuracy: Individual keypoint performance
- Some keypoints harder than others (e.g., hips vs shoulders)
- Useful for identifying weaknesses
- Wrists and ankles typically hardest
Choosing the Right Model
By Model Variant
ViTPose-Small
- Fastest inference and training
- Good for real-time applications
- Edge device deployment
- 60-70% mAP on COCO
ViTPose-Base
- Balanced performance and speed
- Most versatile choice
- Good for general applications
- 70-75% mAP on COCO
ViTPose-Large
- High accuracy applications
- More computational resources
- Professional use cases
- 75-78% mAP on COCO
ViTPose-Huge
- Maximum accuracy
- Research and production systems
- Substantial computational requirements
- 78-80% mAP on COCO
By Use Case
Fitness and Exercise Tracking
- ViTPose-Base or ViTPose-Small
- Need real-time feedback
- Pose accuracy for form correction
- Mobile or edge deployment
Sports Performance Analysis
- ViTPose-Large or ViTPose-Huge
- Maximum accuracy for biomechanics
- Frame-by-frame analysis acceptable
- Professional athlete assessment
Motion Capture for Animation
- ViTPose-Huge for best results
- Accuracy critical for realistic animation
- Post-processing acceptable
- Multi-person tracking often needed
Security and Surveillance
- ViTPose-Base for balance
- Need to handle occlusions
- Multiple people in frame
- Varying distances and angles
Healthcare and Rehabilitation
- ViTPose-Large for clinical accuracy
- Gait analysis and movement assessment
- Precise measurements required
- Patient privacy considerations
AR/VR and Gaming
- ViTPose-Small for real-time
- Responsiveness critical
- Can sacrifice some accuracy
- Mobile device compatibility
Best Practices
Data Preparation
High-Quality Annotations: Accurate keypoint labeling is critical
- Precise keypoint placement on anatomical landmarks
- Consistent annotation guidelines across dataset
- Mark occluded keypoints appropriately (visibility=1)
- Use experienced annotators for medical/sports applications
Pose Diversity:
- Various body poses and activities
- Different camera angles and viewpoints
- Multiple body types and demographics
- Include challenging poses (occlusions, unusual positions)
Person Scale Variation:
- Close-up views (large person in frame)
- Distant views (small person in frame)
- Multiple people at different distances
- OKS is scale-normalized but training benefits from variety
Lighting and Conditions:
- Indoor and outdoor lighting
- Various clothing types and colors
- Different backgrounds (cluttered vs clean)
- Weather conditions if applicable
Dataset Balance:
- At least 500 annotated people for fine-tuning
- 2,000+ for training from scratch
- Balanced representation of poses
- Include edge cases and difficult examples
Training Strategy
Start with Pre-trained Weights: Always use COCO pre-trained models
- Transfer learning dramatically reduces training time
- Better final accuracy with limited data
- Converges faster and more reliably
Choose Appropriate Model Size:
- Start with Base variant for experimentation
- Scale up to Large/Huge if accuracy insufficient
- Scale down to Small if speed critical
Monitor Training Metrics:
- Watch OKS/mAP on validation set
- Check per-keypoint accuracy for weaknesses
- Loss should decrease steadily
- Validate on diverse test set
Adjust for Domain:
- Sports: May need more epochs for unusual poses
- Medical: Requires highest quality annotations
- Fitness: Balance speed and accuracy
- AR/VR: Prioritize low latency
Common Pitfalls
Inaccurate Keypoints on Specific Joints
- Check annotation consistency for those joints
- May need more training examples of that pose type
- Some joints inherently harder (wrists, ankles)
- Consider joint-specific confidence thresholds
Poor Performance on Occluded Keypoints
- Ensure training data includes occlusions
- Mark occluded keypoints with visibility=1
- Model learns to predict occluded positions
- Some occlusion errors inevitable
Struggles with Unusual Poses
- Collect more diverse pose examples
- Include athletic, dance, yoga poses
- Train longer on specialized datasets
- Consider data augmentation
Multiple People Confusion
- Ensure the person detector is accurate
- Check bounding box quality
- A stronger person detector may be needed
- Consider multi-person specific models
Scale Sensitivity Issues
- Include various person sizes in training
- Check if small/large people detected correctly
- Verify that OKS scale normalization is working
- May need to adjust input resolution
GPU Requirements
Memory Guidelines
ViTPose-Small:
- 4-6GB GPU sufficient
- Batch size 8-16
- Fast training and inference
ViTPose-Base:
- 8-12GB GPU recommended
- Batch size 4-8
- Balanced performance
ViTPose-Large:
- 12-16GB GPU recommended
- Batch size 2-4
- High accuracy
ViTPose-Huge:
- 16GB+ GPU required
- Batch size 1-2
- Maximum accuracy
Training Time Estimates
Small Dataset (500 people):
- ViTPose-Small: 15-30 minutes per epoch
- ViTPose-Base: 30-60 minutes per epoch
- ViTPose-Large/Huge: 1-2 hours per epoch
Medium Dataset (2,000 people):
- ViTPose-Small: 1-2 hours per epoch
- ViTPose-Base: 2-4 hours per epoch
- ViTPose-Large/Huge: 4-8 hours per epoch
Large Dataset (10,000+ people):
- ViTPose-Small: 4-6 hours per epoch
- ViTPose-Base: 8-12 hours per epoch
- ViTPose-Large/Huge: 16-24 hours per epoch
Times assume a modern GPU (RTX 3080/4080 or better).
Dataset Size Guidelines
Minimum: 500 annotated people with 17 keypoints each
Good: 2,000-5,000 annotated people
Excellent: 10,000+ annotated people
Quality matters more than quantity - accurate keypoint annotations are crucial for good performance.
Inference Configuration
Confidence Threshold: Minimum confidence for keypoint predictions
- 0.3 default (liberal, accepts uncertain keypoints)
- 0.5 for higher precision (fewer false positives)
- 0.7+ for strict applications (medical, professional sports)
- Lower threshold: More detections, may include errors
- Higher threshold: Fewer detections, more reliable
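Applying the threshold is a simple per-keypoint filter over the model's confidence scores. A sketch (the function name is illustrative; returning `None` for suppressed keypoints is one convention among several):

```python
def filter_keypoints(keypoints, scores, threshold=0.3):
    """Suppress low-confidence keypoint predictions.

    keypoints: list of (x, y) predictions, one per keypoint.
    scores: per-keypoint confidence scores from the model head.
    Returns (x, y) for confident keypoints and None for suppressed ones,
    so downstream code (e.g., skeleton drawing) can skip them.
    """
    return [kp if s >= threshold else None
            for kp, s in zip(keypoints, scores)]
```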
Advanced Considerations
Multi-Person Pose Estimation
- Bottom-up approach: Detect all keypoints, group by person
- Top-down approach: Detect people first, then keypoints per person
- ViTPose uses the top-down approach (requires a separate person detector)
- Consider computational cost of multiple people
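The top-down control flow can be sketched independently of any specific model: detect boxes, crop, run the single-person pose model per crop, then map keypoints back to image coordinates. `detect_people` and `estimate_pose` below are stand-ins for your detector and pose model, not real APIs:

```python
# Shape of a top-down pipeline. Cost scales with the number of detected
# people, since the pose model runs once per crop.

def crop(image, box):
    """Crop a row-major nested-list image to an (x, y, w, h) box."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def top_down_poses(image, detect_people, estimate_pose):
    """Run the pose model once per detected person, mapping each crop's
    keypoints back into full-image coordinates."""
    poses = []
    for (x, y, w, h) in detect_people(image):
        local_kps = estimate_pose(crop(image, (x, y, w, h)))
        poses.append([(kx + x, ky + y) for (kx, ky) in local_kps])
    return poses
```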
Temporal Consistency
- Video pose tracking needs frame-to-frame smoothing
- Use tracking algorithms for consistent IDs
- Temporal models reduce jitter in videos
- Important for motion capture applications
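As a baseline for jitter reduction, an exponential moving average over per-frame keypoint positions is often enough to demonstrate the effect; production systems typically use more sophisticated filters such as the One Euro filter. A sketch (function name and data layout are illustrative):

```python
def ema_smooth(frames, alpha=0.5):
    """Exponential moving average over per-frame keypoint positions.

    frames: list of frames, each a list of (x, y) keypoints for one
    tracked person (smoothing assumes consistent IDs across frames).
    alpha: weight of the new frame; lower = smoother but more lag.
    """
    smoothed = [list(frames[0])]
    for frame in frames[1:]:
        prev = smoothed[-1]
        smoothed.append([
            (alpha * x + (1 - alpha) * px, alpha * y + (1 - alpha) * py)
            for (x, y), (px, py) in zip(frame, prev)
        ])
    return smoothed
```

The lag/smoothness trade-off controlled by `alpha` is why motion-capture pipelines favor adaptive filters: a fixed low `alpha` smooths jitter but blurs fast, genuine motion.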
3D Pose Estimation
- ViTPose predicts 2D keypoints only
- 3D reconstruction requires multiple cameras or depth info
- Lifting 2D to 3D possible with specialized models
- Consider if depth information needed for application