Random Forest
Train a Random Forest to predict categorical outcomes
An ensemble of decision trees that vote on the final prediction. Each tree is trained on a random bootstrap sample of the data and considers a random subset of features at each split.
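A minimal sketch of fitting and scoring such an ensemble, assuming scikit-learn's `RandomForestClassifier` and a synthetic toy dataset (the dataset and sizes are illustrative):

```python
# Minimal sketch: fit a Random Forest on toy data, then score it on a holdout set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data: 500 rows, 10 features, 2 classes
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees; each is fit on a bootstrap sample with random feature subsets per split
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # mean accuracy on the holdout set
```

Predictions come from majority voting across the trees; `predict_proba` averages the per-tree class probabilities instead.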
When to use:
- Robust baseline - works well on most problems
- Handles non-linear relationships
- Can handle missing values (depending on the implementation; not all libraries support this natively)
- Feature importance needed
- More resistant to overfitting than a single decision tree
Strengths: Very accurate, handles non-linearity, robust to noise, provides feature importance
Weaknesses: Can be slow to train and predict, large model size, less interpretable than a single tree
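Since feature importance is listed as a strength, here is a short sketch of reading it out after fitting, assuming scikit-learn (variable names are illustrative):

```python
# Sketch: extract per-feature importances from a fitted forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean impurity decrease per feature, normalized to sum to 1
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
```

Impurity-based importances are fast but can favor high-cardinality features; permutation importance is a common cross-check.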
Model Parameters
N Estimators (default: 100) Number of trees in the forest. More trees generally improve accuracy and stability but increase training and prediction time, with diminishing returns.
- 50-100: Fast training
- 100-300: Good default
- 500+: Maximum accuracy, slower
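The accuracy/speed trade-off above can be seen directly by timing fits at different tree counts (a sketch on toy data; the counts mirror the tiers listed):

```python
# Sketch: training time grows roughly linearly with n_estimators.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=12, random_state=1)

timings = {}
for n in (50, 100, 300):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=n, random_state=1).fit(X, y)
    timings[n] = time.perf_counter() - start  # seconds to fit n trees
```

On larger datasets, `n_jobs=-1` parallelizes tree construction across cores to offset this cost.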
Max Depth (default: None) Maximum tree depth. Controls model complexity.
- None: Trees grow until pure (may overfit)
- Low values (3-10): Simple, fast, prevents overfitting
- High values (20-50): Complex patterns, may overfit
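The overfitting risk noted above shows up as a gap in training accuracy between shallow and unlimited-depth trees. A sketch on noisy toy data (the 20% label noise is illustrative):

```python
# Sketch: unlimited depth memorizes noisy training labels; a shallow cap does not.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# flip_y=0.2 randomly flips 20% of labels, simulating label noise
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=2)

train_acc = {}
for depth in (3, None):  # shallow cap vs. grow-until-pure
    clf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=2)
    clf.fit(X, y)
    train_acc[depth] = clf.score(X, y)  # accuracy on the training set itself
```

Near-perfect training accuracy on noisy data is a warning sign; validate depth choices on held-out data.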
Min Samples Split (default: 2) Minimum samples needed to split a node. Higher values prevent overfitting.
Min Samples Leaf (default: 1) Minimum samples in a leaf node. Higher values create smoother decision boundaries.
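One way to see the smoothing effect of `min_samples_leaf` is to count leaves: larger values force each tree to stop splitting sooner, producing smaller trees. A sketch on toy data:

```python
# Sketch: larger min_samples_leaf -> fewer leaves per tree -> smoother fit.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=3)

avg_leaves = {}
for msl in (1, 20):  # default vs. a strongly regularized setting
    clf = RandomForestClassifier(n_estimators=10, min_samples_leaf=msl, random_state=3)
    clf.fit(X, y)
    # average leaf count across the ensemble's trees
    avg_leaves[msl] = sum(t.get_n_leaves() for t in clf.estimators_) / len(clf.estimators_)
```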
Max Features Features to consider at each split:
- sqrt: Square root of total features (good default for classification)
- log2: Log2 of total features
- None: Use all features
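For a concrete sense of what `sqrt` means: with 16 features, each split examines only 4 candidates, which decorrelates the trees. A sketch assuming scikit-learn:

```python
# Sketch: "sqrt" caps the candidate features examined at each split.
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=16, random_state=4)

clf = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=4)
clf.fit(X, y)

# With 16 total features, "sqrt" means floor(sqrt(16)) = 4 candidates per split
per_split = int(math.sqrt(X.shape[1]))
```

Setting `max_features=None` removes this randomness, making trees more similar to one another and weakening the ensemble effect.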
Bootstrap (default: true) Whether to use bootstrap sampling. Keep true for better generalization.
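A useful side effect of bootstrap sampling: each tree skips roughly 37% of the rows, and those out-of-bag rows can serve as a built-in validation set. A sketch using scikit-learn's `oob_score` option:

```python
# Sketch: out-of-bag score, a free generalization estimate when bootstrap=True.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=5)

clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,   # required for OOB scoring
    oob_score=True,   # evaluate each row using only trees that never saw it
    random_state=5,
)
clf.fit(X, y)
oob = clf.oob_score_  # accuracy on out-of-bag samples
```

The OOB score is often close to a cross-validated estimate, at no extra training cost.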
Criterion Split quality measure:
- gini: Gini impurity (default, faster)
- entropy: Information gain (occasionally slightly more accurate, a bit slower to compute)
- log_loss: Log loss (for probability calibration)
Random State (default: 42) Seed for the random number generator. Fixing it makes bootstrap sampling and feature selection, and therefore the trained model, reproducible across runs.
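Reproducibility can be checked directly: two forests fit with the same seed produce identical predictions. A small sketch:

```python
# Sketch: identical random_state -> identical forests -> identical predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=7)

pred_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y).predict(X)
pred_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y).predict(X)
same = np.array_equal(pred_a, pred_b)  # True: same seed, same model
```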