Directly predicting player box-score statistics (points, rebounds, assists) suffers from high variance because playing time fluctuates from game to game. Scoring 8 points in 12 minutes (about 0.67 points per minute) is very different from scoring 8 points in 36 minutes (about 0.22 points per minute), but a naive model that looks only at the box-score total treats the two performances as the same.
This project addresses that issue with a pipeline that models availability (minutes played) and production (per-minute output) separately. The goal is to predict minutes accurately first, then use that prediction to infer the box-score statistics.
Modeling Approach
The system uses a two-stage pipeline:
- Minutes Model: Predicts expected minutes played based on recent usage, opponent context, and game conditions.
- Production Model: Predicts per-minute rates (points, rebounds, assists), conditioned on the predicted minutes.
The final forecast is produced by combining both stages into a single probabilistic output.
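One way to picture how the two stages combine is a small Monte Carlo simulation: draw minutes, draw a per-minute rate, and multiply. The sketch below is illustrative only and is not the project's actual implementation. The function name `simulate_points`, the normal distributions, the 48-minute cap, and all numeric inputs are assumptions made for the example.

```python
import random
import statistics


def simulate_points(minutes_mean, minutes_sd, rate_mean, rate_sd,
                    n_draws=10_000, max_minutes=48.0, seed=0):
    """Return simulated point totals from chaining the two stages.

    All inputs are hypothetical stand-ins for the outputs of the two models.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Stage 1 (assumed normal): sample minutes, clipped to [0, max_minutes].
        minutes = min(max(rng.gauss(minutes_mean, minutes_sd), 0.0), max_minutes)
        # Stage 2 (assumed normal): sample a points-per-minute rate, floored at 0.
        rate = max(rng.gauss(rate_mean, rate_sd), 0.0)
        draws.append(minutes * rate)
    return draws


draws = simulate_points(minutes_mean=30.0, minutes_sd=4.0, rate_mean=0.6, rate_sd=0.1)
cuts = statistics.quantiles(draws, n=20)  # 19 cut points at 5%, 10%, ..., 95%
print(f"mean={statistics.mean(draws):.1f}, 90% interval=({cuts[0]:.1f}, {cuts[-1]:.1f})")
```

Because the output is a set of draws rather than a single number, uncertainty in minutes carries through to the forecast: a player with an unstable role gets a wider interval even if the per-minute rate is well estimated.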
Statistical Methods
- Bayesian hierarchical regression with partial pooling
- Player-level and opponent-level effects
- Weakly informative priors to stabilize estimates under sparse data
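Partial pooling allows the model to “borrow” information across similar contexts while still learning player-specific behavior: estimates for players with few games are pulled toward the group average, while estimates for players with many games are driven mostly by their own data.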
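The shrinkage behind partial pooling can be illustrated with the closed-form estimate for a simple normal model with known variances. This is a simplified sketch, not the project's model: a full Bayesian hierarchical regression would estimate the variances from the data as well. The function name `partially_pooled_rate` and every number below are assumptions chosen for the example.

```python
def partially_pooled_rate(player_mean, n_games, league_mean,
                          within_var, between_var):
    """Shrink a player's observed rate toward the league mean.

    within_var:  assumed game-to-game variance of one player's rate
    between_var: assumed variance of true rates across players
    """
    # The weight on the player's own data grows with the number of games.
    weight = n_games / (n_games + within_var / between_var)
    return weight * player_mean + (1.0 - weight) * league_mean


# Hypothetical numbers: observed 0.9 points per minute, league average 0.5.
few_games = partially_pooled_rate(0.9, n_games=2, league_mean=0.5,
                                  within_var=0.04, between_var=0.01)
many_games = partially_pooled_rate(0.9, n_games=60, league_mean=0.5,
                                   within_var=0.04, between_var=0.01)
print(round(few_games, 3), round(many_games, 3))
```

With only two games the estimate stays close to the league average, while with sixty games it sits near the player's own observed rate.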
Evaluation
To avoid data leakage, all models were evaluated using a walk-forward backtesting framework. The model is never tested on data from a period it was trained on; it is trained on past games and evaluated only on games that occur later.
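A minimal sketch of walk-forward splitting is shown below, assuming each row of the dataset carries a game date. The helper `walk_forward_splits` and its `min_train_dates` parameter are hypothetical names for illustration; the project's actual backtesting code may differ.

```python
from datetime import date


def walk_forward_splits(game_dates, min_train_dates=3):
    """Yield (cutoff, train_idx, test_idx) for an expanding-window backtest.

    For each cutoff date, training rows are strictly earlier than the cutoff
    and test rows fall exactly on it, so no future data reaches the model.
    """
    unique_dates = sorted(set(game_dates))
    for cutoff in unique_dates[min_train_dates:]:
        train_idx = [i for i, d in enumerate(game_dates) if d < cutoff]
        test_idx = [i for i, d in enumerate(game_dates) if d == cutoff]
        yield cutoff, train_idx, test_idx


# Hypothetical schedule: two games on the first date, then one per date.
dates = [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 3),
         date(2024, 1, 5), date(2024, 1, 7)]
for cutoff, train_idx, test_idx in walk_forward_splits(dates, min_train_dates=2):
    print(cutoff, train_idx, test_idx)
```

Splitting on dates rather than on row order keeps games played on the same day together, so a model is never trained on one game from a date and tested on another from that same date.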
Results & Insights
I am still working on this project, so there will be more to report here in the future.