F1 Lap Time Prediction and Feature Analysis
End-to-end machine learning framework for F1 lap time prediction using real-time telemetry data, achieving 94.8% R² through advanced feature engineering with track curvature, elevation profiles, and driver performance metrics.
Overview
This project builds an end-to-end machine learning framework that predicts Formula 1 lap times from real-time telemetry data. Using engineered track, elevation, and driver features that were checked with statistical tests, the model reaches an R² of 94.8%.
Repository: github.com/SidShah2953/F1-Telemetry-Analysis
Affiliation: Boston University, Department of Computer Science
Program: MS in Applied Data Analytics
Date: December 2024
Key Results
Model Performance
- R² Score: 94.8% (the model explains 94.8% of the variance in lap times)
- Features: 20+ engineered variables
- Methodology: Advanced feature engineering with time-series quantization
- Validation: ANOVA and statistical hypothesis testing
Top 5 Most Influential Features
| Feature | Importance | Description |
|---|---|---|
| Track Length | 56.40% | Total circuit distance |
| Elevation Std Dev | 14.58% | Vertical terrain variation |
| Total Elevation Change | 6.23% | Cumulative altitude gain/loss |
| Curvature Std Dev | 5.17% | Track corner complexity |
| Number of Corners | 4.16% | Total turning points |
Top 5 Total Contribution: 86.54% of predictive power
Project Objectives
Primary Goal
Build a robust predictive model that can forecast F1 lap times based on:
- Track characteristics (geometry, elevation, surface)
- Driver performance patterns
- Car telemetry data
- Environmental conditions
Research Questions
- Which track features most significantly impact lap times?
- How do driver styles differ in performance characteristics?
- Can we predict lap times for new circuits?
- What is the optimal feature set for prediction accuracy?
Data Sources
Telemetry Data
- Real-time metrics: Speed, throttle, brake, steering angle
- Frequency: High-resolution (millisecond-level)
- Coverage: Multiple seasons, all circuits
- Volume: Millions of data points
Track Characteristics
- Geometric Data: GPS coordinates, corner angles, straight lengths
- Elevation Profiles: Altitude changes, gradients
- Surface Data: Track temperature, weather conditions
- Configuration: Circuit layout, sector boundaries
Driver Performance
- Historical Lap Times: Race and qualifying data
- Sector Times: Granular performance breakdown
- Team Information: Constructor, car specifications
- Session Conditions: Practice, qualifying, race
Feature Engineering
Track Geometry Features
Curvature Analysis:
- Average curvature
- Standard deviation of curvature
- Maximum curvature (tightest corner)
- Curvature distribution (quantiles)
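The curvature statistics listed above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the track centreline is available as planar `x`/`y` coordinates in metres (for example, GPS points already projected to a local frame), and the function and key names are hypothetical. It uses the standard parametric curvature formula κ = |x′y″ − y′x″| / (x′² + y′²)^(3/2).

```python
import numpy as np

def curvature_features(x, y):
    """Summarise the curvature (1/m) of a track centreline given in metres."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # First and second derivatives with respect to the sample index.
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    # Parametric curvature; independent of how densely the track is sampled.
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return {
        "curvature_mean": float(kappa.mean()),
        "curvature_std": float(kappa.std()),
        "curvature_max": float(kappa.max()),  # tightest corner
        "curvature_quantiles": np.quantile(kappa, [0.25, 0.5, 0.75]).tolist(),
    }

# Sanity check on a circle of radius 100 m, whose curvature is 1/100 everywhere.
theta = np.linspace(0.0, 2.0 * np.pi, 2000)
circle = curvature_features(100.0 * np.cos(theta), 100.0 * np.sin(theta))
```

Note that `np.gradient` falls back to one-sided differences at the array ends, so the first and last couple of samples are less accurate; for a closed circuit, wrapping the coordinates before differentiating would avoid that.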
Corner Characteristics:
- Number of corners (total count)
- Corner complexity score
- Slow/medium/fast corner distribution
- Corner entry/exit angles
Elevation Features
Vertical Metrics:
- Elevation Standard Deviation (14.58% importance)
- Total Elevation Change (6.23% importance)
- Maximum gradient (steepest climb/descent)
- Elevation gain vs. loss
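As a rough sketch of how the vertical metrics above could be derived, the snippet below assumes an elevation profile sampled along the lap as two arrays, distance and altitude, both in metres. The function name, the dictionary keys, and the sample profile are illustrative assumptions rather than the project's actual code or data.

```python
import numpy as np

def elevation_features(distance_m, elevation_m):
    """Compute simple vertical metrics from an elevation profile."""
    distance_m = np.asarray(distance_m, dtype=float)
    elevation_m = np.asarray(elevation_m, dtype=float)
    steps = np.diff(elevation_m)              # altitude change per segment
    gradient = steps / np.diff(distance_m)    # rise over run per segment
    gain = float(steps[steps > 0].sum())      # total climbing
    loss = float(-steps[steps < 0].sum())     # total descending, as a positive number
    return {
        "elevation_std": float(elevation_m.std()),
        "total_elevation_change": gain + loss,  # cumulative gain plus loss
        "max_gradient": float(np.abs(gradient).max()),
        "elevation_gain": gain,
        "elevation_loss": loss,
    }

# Made-up five-point profile, only to show the arithmetic.
profile = elevation_features([0, 100, 200, 300, 400], [10, 14, 12, 12, 7])
```

Here the segment changes are +4, −2, 0, and −5 m, so the gain is 4 m, the loss is 7 m, and the total elevation change is 11 m.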
Impact: Elevation changes affect:
- Engine power delivery
- Aerodynamic efficiency
- Tire wear patterns
- Driver energy management
Track Length & Layout
Distance Metrics:
- Track Length (56.40% importance, the dominant factor)
- Straight length (longest vs. average)
- Sector length distribution
- Track type (street vs. permanent circuit)
Time-Series Quantization
Technique: Discretizing continuous telemetry signals
- Speed binning (low, medium, high)
- Throttle application quantiles
- Brake pressure zones
- Steering angle categories
Benefits:
- Captures non-linear relationships
- Reduces noise in telemetry data
- Enables pattern recognition
- Improves model generalization
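The quantization idea can be illustrated with a small sketch. The speed thresholds (120 and 220 km/h), the function names, and the sample values are assumptions made for this example; the source does not state which bin edges were used. Fixed-edge binning is shown for speed and equal-frequency (quantile) binning for throttle.

```python
import numpy as np

SPEED_EDGES_KMH = [120.0, 220.0]          # hypothetical low/medium/high thresholds
SPEED_LABELS = ["low", "medium", "high"]

def quantize_speed(speed_kmh):
    """Map each speed sample to a 'low', 'medium', or 'high' bin."""
    idx = np.digitize(speed_kmh, SPEED_EDGES_KMH)
    return [SPEED_LABELS[i] for i in idx]

def bin_shares(labels):
    """Fraction of samples in each bin, usable as per-lap features."""
    n = len(labels)
    return {name: labels.count(name) / n for name in SPEED_LABELS}

def quantile_codes(signal, n_bins=4):
    """Assign each sample to an equal-frequency bin, coded 0 .. n_bins - 1."""
    signal = np.asarray(signal, dtype=float)
    inner_edges = np.quantile(signal, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(signal, inner_edges)

speed = [80, 310, 150, 95, 240, 200, 118, 290]   # made-up samples in km/h
labels = quantize_speed(speed)
shares = bin_shares(labels)
throttle_codes = quantile_codes([0, 5, 20, 40, 60, 80, 95, 100])
```

Turning a raw signal into bin shares per lap gives a fixed-length feature vector regardless of how many telemetry samples the lap contains.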
Driver-Specific Features
Performance Metrics:
- Historical lap time averages
- Qualifying vs. race pace differential
- Tire degradation patterns
- Consistency scores (lap time variance)
Driving Style Indicators:
- Aggressive vs. smooth braking
- Corner entry speed preferences
- Throttle application patterns
- Energy management strategies
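Two of the performance metrics above have simple definitions that can be sketched directly. The exact formulas used in the project are not given, so this example assumes that the consistency score is the sample standard deviation of a driver's lap times and that the pace differential is the median race lap minus the best qualifying lap. All names and lap times are made up for illustration.

```python
import statistics

def consistency_score(lap_times_s):
    """Sample standard deviation of lap times (s); lower means more consistent."""
    return statistics.stdev(lap_times_s)

def pace_differential(qualifying_best_s, race_laps_s):
    """Median race lap minus best qualifying lap, in seconds."""
    return statistics.median(race_laps_s) - qualifying_best_s

race_laps = [92.1, 92.4, 91.9, 92.3, 92.8]   # made-up lap times in seconds
score = consistency_score(race_laps)
gap = pace_differential(90.5, race_laps)
```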
Statistical Analysis
ANOVA (Analysis of Variance)
Purpose: Determine which features significantly impact lap times
Methodology:
- F-statistic calculation for each feature
- p-value analysis (significance testing)
- Effect size quantification
- Multiple comparison corrections
Key Findings:
- Track length: Highest F-statistic
- Elevation features: Statistically significant
- Curvature metrics: Strong predictive power
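A one-way ANOVA of this kind might look like the sketch below. It assumes a continuous feature (here track length) has first been discretized into levels, which the source does not confirm, and it uses synthetic lap times, so the numbers it produces say nothing about real results. `scipy.stats.f_oneway` provides the F-statistic and p-value; the eta-squared effect size and the Bonferroni correction are written out by hand.

```python
import numpy as np
from scipy.stats import f_oneway

def eta_squared(groups):
    """Effect size: share of total variance explained by group membership."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Synthetic lap times (s) for three hypothetical track-length levels.
rng = np.random.default_rng(0)
groups = [
    rng.normal(75.0, 1.5, size=40),    # short tracks
    rng.normal(90.0, 1.5, size=40),    # medium tracks
    rng.normal(105.0, 1.5, size=40),   # long tracks
]
f_stat, p_value = f_oneway(*groups)
effect = eta_squared(groups)
adjusted = bonferroni([0.004, 0.02, 0.5])   # example p-values from three tests
```

When many features are tested, the correction matters: a raw p-value of 0.004 across three tests becomes 0.012 after Bonferroni adjustment and would no longer pass a 0.01 threshold.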
Hypothesis Testing
Null Hypothesis (H₀): Feature has no effect on lap time
Alternative Hypothesis (H₁): Feature significantly affects lap time
Results:
- Rejected H₀ for the top 15 features (p < 0.01)
- Strong evidence for track geometry impact
- Validated feature selection methodology
Multi-Dimensional Feature Interactions
Complex Interactions Engineered:
- Track length × curvature (handling vs. straight-line speed)
- Elevation change × number of corners (energy management)
- Driver consistency × track complexity
- Temperature × tire compound × track abrasiveness
These interactions capture non-linear effects that simple features miss.
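As a minimal sketch, interaction terms like these are plain element-wise products of existing feature columns. The column names and values below are hypothetical placeholders, not the project's schema.

```python
import numpy as np

def add_interactions(features):
    """Return a copy of `features` with two product columns appended."""
    out = dict(features)
    out["length_x_curvature"] = (
        features["track_length_km"] * features["curvature_std"]
    )
    out["elevation_x_corners"] = (
        features["total_elevation_change_m"] * features["n_corners"]
    )
    return out

# Two made-up circuits, one value per circuit in each column.
features = {
    "track_length_km": np.array([3.3, 7.0]),
    "curvature_std": np.array([0.02, 0.01]),
    "total_elevation_change_m": np.array([40.0, 100.0]),
    "n_corners": np.array([19, 20]),
}
with_interactions = add_interactions(features)
```

Tree-based models can discover some interactions on their own, so explicit product columns tend to help most in linear models such as the regressions used for driver comparison.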
Driver Comparative Analysis
Max Verstappen vs. Lando Norris
The two drivers are compared statistically using multiple linear regression models of their driving styles and performance characteristics:
Track Complexity Management
Verstappen Advantages:
- More consistent performance on complex tracks
- Lower sensitivity to number of corners
- Lower sensitivity to maximum curvature
- More efficient adaptation to track characteristics
Interpretation: Superior technical skill in handling varied circuit types
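A sensitivity comparison of this kind can be sketched by fitting one ordinary least squares model per driver and comparing the coefficients. Everything below is an assumption made for illustration: the data are synthetic, the drivers are anonymous placeholders (`driver_a`, `driver_b`), and the coefficients are built into the simulation, so the output does not reflect any real driver.

```python
import numpy as np

def fit_sensitivities(n_corners, max_curvature, lap_time_s):
    """OLS fit of lap_time ~ intercept + n_corners + max_curvature."""
    X = np.column_stack([np.ones(len(lap_time_s)), n_corners, max_curvature])
    y = np.asarray(lap_time_s, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return {
        "intercept": float(coef[0]),
        "per_corner_s": float(coef[1]),       # seconds added per extra corner
        "per_curvature_s": float(coef[2]),    # seconds per unit of max curvature
    }

# Synthetic laps: driver A is simulated as less sensitive to corner count.
rng = np.random.default_rng(42)
n_corners = rng.integers(10, 24, size=200)
max_curvature = rng.uniform(0.01, 0.08, size=200)
noise = rng.normal(0.0, 0.05, size=(2, 200))
lap_a = 70.0 + 0.80 * n_corners + 50.0 * max_curvature + noise[0]
lap_b = 70.0 + 0.95 * n_corners + 60.0 * max_curvature + noise[1]

driver_a = fit_sensitivities(n_corners, max_curvature, lap_a)
driver_b = fit_sensitivities(n_corners, max_curvature, lap_b)
```

A smaller `per_corner_s` coefficient is what "lower sensitivity to number of corners" means in this framing: each additional corner costs that driver less lap time.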
Tire Management
Verstappen Strengths:
- Lower lap time degradation with tire wear
- More consistent performance across tire compounds
- Better tire preservation in race conditions
Impact: Strategic advantage in race simulations and pit stop strategy
Performance Factors
Key Differences:
- Different sensitivity to track temperatures
- Varying responses to weather conditions
- Distinct optimal setup preferences
Application: Team strategy optimization and car setup directions
Machine Learning Model
Algorithm Selection
Model Type: Gradient Boosting Regressor (XGBoost/LightGBM)
Rationale:
- Handles non-linear feature interactions
- Feature importance extraction
- Robust to outliers
- High predictive accuracy for tabular data
Model Architecture
- Input Layer: 20+ engineered features
- Training Strategy: K-fold cross-validation
- Optimization: Hyperparameter tuning (Grid Search)
- Validation: Hold-out test set + temporal split
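The training setup described here (K-fold cross-validation, grid search, and a hold-out test set) can be sketched with scikit-learn. Several things are assumptions: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost/LightGBM to keep the example dependency-light, the data are synthetic placeholders, the grid is deliberately tiny, and the hold-out split is random rather than temporal.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in for the engineered feature matrix and lap times (s).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 4))
y = 80.0 + 20.0 * X[:, 0] + 5.0 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.3, size=300)

# Keep a hold-out set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X_train, y_train)          # 5-fold CV for each grid point
test_r2 = search.score(X_test, y_test)  # R² of the refit best model on hold-out data
```

Reporting the hold-out score rather than the best cross-validation score avoids the optimistic bias that comes from selecting hyperparameters on the same folds.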
Feature Importance Analysis
Method: SHAP (SHapley Additive exPlanations) values
Insights:
- Track length dominates (56.40%)
- Elevation features collectively contribute 20.81%
- Curvature metrics (curvature std dev plus number of corners) add 9.33%
- Driver-specific features: 8-10%
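The grouped figures can be reproduced from the Top 5 table with a little arithmetic. The sketch below assumes the percentages are normalized importance shares (for example, mean absolute SHAP values scaled to sum to 100), which the source does not spell out; the feature keys and the grouping into families are hypothetical, chosen because they reproduce the reported 20.81% and 9.33%.

```python
def to_percent(raw_importance):
    """Scale non-negative importance scores so they sum to 100."""
    total = sum(raw_importance.values())
    return {name: 100.0 * value / total for name, value in raw_importance.items()}

# Top-5 shares as reported in the Top 5 table (percent).
top5_pct = {
    "track_length": 56.40,
    "elevation_std": 14.58,
    "total_elevation_change": 6.23,
    "curvature_std": 5.17,
    "n_corners": 4.16,
}

# Hypothetical grouping of features into families.
families = {
    "track_length": ["track_length"],
    "elevation": ["elevation_std", "total_elevation_change"],
    "curvature": ["curvature_std", "n_corners"],
}
family_pct = {
    family: round(sum(top5_pct[name] for name in members), 2)
    for family, members in families.items()
}
top5_total = round(sum(top5_pct.values()), 2)   # share covered by the top five
remaining = round(100.0 - top5_total, 2)        # left for all other features
```

The top five features account for 86.54%, leaving 13.46% for every remaining feature, which is consistent with driver-specific features contributing roughly 8-10%.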
Technical Implementation
Python Stack
Core Libraries:
- Pandas: Data manipulation and time-series operations
- NumPy: Numerical computations
- Scikit-learn: Machine learning models and metrics
- XGBoost: Gradient boosting implementation
- Matplotlib/Seaborn: Data visualization
Statistical Analysis:
- SciPy: ANOVA, hypothesis testing
- Statsmodels: Regression analysis
- SHAP: Feature importance interpretation
Data Pipeline
- Data Ingestion: Load telemetry and track data
- Feature Engineering: Create 20+ derived features
- Time-Series Quantization: Discretize continuous signals
- Data Normalization: Standardize feature scales
- Train/Test Split: Temporal and circuit-based splits
- Model Training: Hyperparameter optimization
- Evaluation: R², RMSE, MAE metrics
- Interpretation: SHAP analysis and visualizations
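For reference, the three evaluation metrics named in the pipeline have standard definitions, sketched here with NumPy. The function name and the sample lap times are illustrative only.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for predicted lap times (seconds)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = (residual**2).sum()                    # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt((residual**2).mean())),
        "mae": float(np.abs(residual).mean()),
    }

# Made-up actual vs. predicted lap times, each prediction off by one second.
metrics = regression_metrics([90.0, 75.0, 105.0, 82.0], [91.0, 74.0, 104.0, 83.0])
```

Because lap times vary widely between circuits, `ss_tot` is large, so R² can be high even when RMSE is a sizeable fraction of a second; reporting RMSE and MAE alongside R² gives the error in seconds.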
Code Structure
# Simplified workflow
1. load_telemetry_data()
2. engineer_track_features()
3. quantize_time_series()
4. build_driver_features()
5. train_model(features, target=lap_time)
6. evaluate_performance(test_set)
7. analyze_feature_importance()
8. visualize_predictions()
Applications
Racing Teams (F1 Constructor Applications)
Strategy Optimization:
- Lap time predictions for circuit planning
- Driver-circuit matching analysis
- Tire strategy simulations
- Pit stop timing optimization
Car Development:
- Understanding which car characteristics matter most
- Prioritizing aerodynamic vs. mechanical grip
- Optimizing for specific circuit types
Sports Analytics
Broadcasting & Media:
- Real-time prediction graphics
- Performance comparison visualizations
- Insightful commentary support
Fantasy Sports:
- Driver performance forecasting
- Optimal team selection
- Risk assessment for picks
Data Science Showcase
Transferable Skills:
- Feature Engineering: Complex interaction terms
- Time-Series Analysis: Quantization techniques
- Statistical Rigor: ANOVA, hypothesis testing
- Model Interpretation: SHAP analysis
- Domain Expertise: F1 racing knowledge integration
Insights & Discoveries
Track Length Dominance (56.40%)
- Longer tracks produce longer lap times (an obvious relationship, but now quantified)
- Serves as baseline normalization factor
- Other features explain residual variation
Elevation Impact (20.81% combined)
- Underestimated factor in lap time prediction
- Affects engine load and aerodynamics
- More important than raw corner count
Curvature Complexity (9.33%)
- Standard deviation more important than mean
- Track-to-track variation matters
- Technical circuits favor skilled drivers
Driver Differences (Verstappen vs. Norris)
- Quantified performance gaps on complex tracks
- Tire management measurably different
- Temperature/weather sensitivity varies
Challenges & Solutions
Challenge 1: High-Dimensional Data
Solution: Feature selection via ANOVA and recursive feature elimination
Challenge 2: Non-Linear Relationships
Solution: Time-series quantization and interaction terms
Challenge 3: Driver Heterogeneity
Solution: Driver-specific features and mixed-effects modeling
Challenge 4: Overfitting Risk
Solution: Cross-validation, regularization, and temporal splits
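The split strategies mentioned for guarding against overfitting can be sketched in a few lines. The record layout (`season`, `circuit`, `lap_time_s`), the function names, and the lap times are assumptions for illustration. A temporal split trains on earlier seasons and tests on later ones, while a circuit-based split holds out entire tracks to mimic predicting a new circuit.

```python
def temporal_split(laps, last_train_season):
    """Train on seasons up to `last_train_season`, test on later seasons."""
    train = [lap for lap in laps if lap["season"] <= last_train_season]
    test = [lap for lap in laps if lap["season"] > last_train_season]
    return train, test

def circuit_split(laps, held_out_circuits):
    """Hold out every lap from the given circuits."""
    held_out = set(held_out_circuits)
    train = [lap for lap in laps if lap["circuit"] not in held_out]
    test = [lap for lap in laps if lap["circuit"] in held_out]
    return train, test

# Placeholder records; the lap times are made up.
laps = [
    {"season": 2022, "circuit": "Monza", "lap_time_s": 83.1},
    {"season": 2022, "circuit": "Spa", "lap_time_s": 107.2},
    {"season": 2023, "circuit": "Monza", "lap_time_s": 82.6},
    {"season": 2023, "circuit": "Suzuka", "lap_time_s": 91.4},
    {"season": 2024, "circuit": "Spa", "lap_time_s": 106.0},
]
train_t, test_t = temporal_split(laps, last_train_season=2023)
train_c, test_c = circuit_split(laps, held_out_circuits=["Spa"])
```

With either split, no lap from the test period or the held-out circuit leaks into training, which a random shuffle cannot guarantee.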
Future Enhancements
Planned Additions
- Real-time predictions: Live race lap time forecasting
- Strategy simulation: Pit stop and tire strategy optimization
- Weather integration: Rain impact on lap times
- Machine learning ensemble: Combining multiple models
Advanced Features
- Tire compound effects (soft vs. medium vs. hard)
- Fuel load degradation curves
- DRS (Drag Reduction System) impact
- Traffic and overtaking difficulty
Deep Learning Exploration
- LSTM for sequential telemetry data
- CNN for circuit image analysis
- Transformer models for attention-based predictions
Conclusion
This project demonstrates end-to-end data science expertise in a complex, real-world domain:
- 94.8% R² accuracy through rigorous feature engineering
- Statistical validation via ANOVA and hypothesis testing
- Interpretable models with SHAP feature importance
- Domain integration combining F1 knowledge with ML techniques
The framework is transferable to financial modeling (trading strategies), sports analytics (performance prediction), and any time-series regression problem requiring sophisticated feature engineering.
Key Takeaways
- Track length is the dominant predictor (56.40%)
- Elevation variation significantly impacts lap times (20.81%)
- Driver characteristics create measurable performance differences
- Advanced feature engineering unlocks predictive accuracy
Repository
GitHub: SidShah2953/F1-Telemetry-Analysis
Contents:
- Complete data pipeline code
- Feature engineering notebooks
- Statistical analysis scripts
- Model training and evaluation
- Visualization tools
- Documentation and results
This project showcases the intersection of machine learning, statistical analysis, and domain expertise, skills that are directly applicable to quantitative finance and data-driven decision making.