Classification
Train classification models to predict categorical outcomes
Classification models predict which category or class a data point belongs to. Use classification when you want to sort data into discrete groups, such as spam vs. not spam, disease vs. healthy, or customer segments.
🎓 Learn About Classification
New to classification? Visit our Classification Concepts Guide to learn about evaluation metrics, common approaches, and when to use classification for your machine learning tasks.
Available Models
We support 14 different classification algorithms, each with its own strengths:
Linear Models
- Logistic Regression - Fast, interpretable baseline for binary and multiclass problems
- Ordinal Logistic Regression - For ordered categories (low, medium, high)
Tree-Based Models
- Decision Tree - Simple, interpretable rules-based model
- Random Forest - Ensemble of decision trees, robust and accurate
- Extra Trees - Like Random Forest but faster to train
Gradient Boosting Models
- XGBoost - Industry standard, excellent performance
- LightGBM - Fast and memory efficient
- CatBoost - Handles categorical features automatically
- Gradient Boosting - Classic boosting algorithm
- AdaBoost - Adaptive boosting for weak learners
Other Models
- Support Vector Machine (SVM) - Effective for high-dimensional spaces
- K-Nearest Neighbors - Instance-based learning
- Naive Bayes - Fast probabilistic classifier
- Multi-layer Perceptron - Neural network for complex patterns
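For orientation, here is a minimal sketch of what a baseline from this list looks like in code, assuming scikit-learn and synthetic data (the platform trains these models for you, so this is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Logistic Regression: a fast, interpretable baseline
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")
```

Swapping in any other model from the list above (e.g. `RandomForestClassifier`) follows the same fit/score pattern.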
Common Configuration
All models share these common settings:
Feature Configuration
Feature Columns (required): Select which columns from your dataset to use as input features for training. These are the variables the model will learn from to make predictions.
Target Column (required): The column containing the categories you want to predict. This should be a categorical column with discrete class labels.
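In code terms, this split maps to selecting feature and target columns from a table. A hedged pandas sketch, with hypothetical column names:

```python
import pandas as pd

# Hypothetical dataset; column names are illustrative
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 60_000, 80_000, 52_000],
    "segment": ["basic", "premium", "premium", "basic"],
})

feature_columns = ["age", "income"]  # input features the model learns from
target_column = "segment"            # categorical labels to predict

X = df[feature_columns]
y = df[target_column]
```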
Hyperparameter Tuning
Enable Hyperparameter Tuning: Automatically search for the best model parameters. This improves accuracy but takes longer to train.
- Disabled: Use default parameters (faster)
- Enabled: Search for optimal parameters (better accuracy)
Tuning Method (when tuning is enabled)
- Grid Search: Try all combinations systematically (slow but thorough)
- Random Search: Try random combinations (faster, good results)
- Bayesian Search: Intelligently search the parameter space (most efficient)
CV Folds (when tuning is enabled): Number of cross-validation folds (default: 5). Higher values give more reliable estimates but take longer.
N Iterations (for Random/Bayesian search): How many parameter combinations to try (default: 10). More iterations may find better parameters but take longer.
Scoring Metric (when tuning is enabled): How to evaluate model performance:
- Accuracy: Percentage of correct predictions
- Precision (weighted): Fraction of predicted positives that are actually positive
- Recall (weighted): Fraction of actual positives that are correctly identified
- F1 Score (weighted): Harmonic mean of precision and recall
- ROC AUC (OVR): Area under the ROC curve, computed one-vs-rest (good for imbalanced data)
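A tuning configuration like the one above corresponds to a cross-validated search. As a rough sketch of what Random Search with 5 CV folds, 10 iterations, and a weighted F1 scoring metric looks like in scikit-learn (the parameter grid and data are illustrative assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Random Search: sample n_iter parameter combinations, score each with 5-fold CV
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": [None, 5, 10],
    },
    n_iter=10,              # N Iterations
    cv=5,                   # CV Folds
    scoring="f1_weighted",  # Scoring Metric
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Grid Search (`GridSearchCV`) tries every combination instead of sampling, which is why it is slower but exhaustive.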
Choosing the Right Model
Quick Start Guide
- Start simple: Try Logistic Regression first
- Move to ensembles: Random Forest or XGBoost for better accuracy
- Fine-tune: Use hyperparameter tuning on your best model
- Specialized: Try model-specific features (CatBoost for categories, SVM for high dimensions)
By Dataset Size
- Small (<1k rows): Logistic Regression, SVM, Decision Tree
- Medium (1k-100k): Random Forest, XGBoost, LightGBM
- Large (>100k): LightGBM, XGBoost (with GPU)
By Priority
- Accuracy: XGBoost, LightGBM, CatBoost
- Speed: Naive Bayes, Logistic Regression, LightGBM
- Interpretability: Logistic Regression, Decision Tree
- Robustness: Random Forest, Extra Trees
By Data Type
- Categorical features: CatBoost, XGBoost
- Text data: Naive Bayes, Logistic Regression
- High-dimensional: SVM, Logistic Regression
- Non-linear patterns: XGBoost, Random Forest, Multi-layer Perceptron
Best Practices
- Always start with a baseline - Logistic Regression trains fast and sets the bar
- Scale your features - Critical for SVM, KNN, and Multi-layer Perceptron (not needed for tree-based models)
- Handle imbalanced data - Use class_weight='balanced' or sampling techniques
- Use cross-validation - Enable hyperparameter tuning for reliable results
- Monitor for overfitting - Check train vs. validation metrics
- Feature engineering matters - Better features > fancier models
- Start simple, iterate - Don't jump to neural networks immediately
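Several of these practices combine naturally: scaling inside a pipeline, reweighting imbalanced classes, and evaluating with cross-validation. A minimal sketch, assuming scikit-learn and synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data (roughly 90/10 class split)
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=1)

# Scaling inside a pipeline keeps test-fold statistics out of training,
# and class_weight='balanced' upweights the minority class
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
scores = cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
print(f"CV F1 (weighted): {scores.mean():.2f} +/- {scores.std():.2f}")
```

Fitting the scaler per fold (rather than on the whole dataset up front) is what prevents the leakage that inflates validation metrics.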
Next Steps
Ready to train? Head to the Training page and:
- Select your dataset
- Choose a classification model
- Configure the parameters based on this guide
- Enable hyperparameter tuning for best results
- Compare multiple models to find the winner
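If you want to prototype the comparison step locally before running it on the platform, a hedged sketch (models and data are illustrative, not an exhaustive comparison):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=7)

# Candidate models, scored with 5-fold cross-validated accuracy
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=7),
    "Random Forest": RandomForestClassifier(random_state=7),
}
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in candidates.items()}

winner = max(results, key=results.get)
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```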