Our Python physical-logic layer enforces the laws of cricket across 5.2M deliveries. To fill in missing metadata, we pulled in the R `cricketdata` package via pre-compiled CRAN binaries to enrich 16,101 player registry entries. We built a dual-stage venue geocoding pipeline that queries Nominatim (OpenStreetMap) first, then Gemini 2.5 Flash for high-volume grounds with naming variants. Finally, we deployed a regex-based tournament taxonomy to assign Match Weights, so that World Cup finals are weighted separately from bilateral dead rubbers.
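To make the Match Weight idea concrete, here is a minimal sketch of a regex-based taxonomy. The patterns, tier names, and numeric weights are all hypothetical placeholders chosen for illustration, and the `match_weight` helper and its `series_decided` flag are assumptions about how a dead rubber might be signalled, not the values or logic used in our pipeline.

```python
import re

# Hypothetical tiers and weights, for illustration only.
TIER_RULES = [
    (re.compile(r"\bworld cup\b", re.I), "world_cup", 2.0),
    (re.compile(r"\b(premier league|big bash)\b", re.I), "franchise_league", 1.5),
]
FINAL_RE = re.compile(r"^\s*final\s*$", re.I)  # matches "Final" but not "Semi Final"
FINAL_MULTIPLIER = 2.0       # assumed boost for a tournament final
BILATERAL_WEIGHT = 1.0       # assumed baseline for bilateral series
DEAD_RUBBER_DISCOUNT = 0.5   # assumed discount once a series is already decided


def match_weight(event_name, stage="", series_decided=False):
    """Return (tier, weight) for a match based on its event name and stage."""
    for pattern, tier, weight in TIER_RULES:
        if pattern.search(event_name):
            if FINAL_RE.match(stage):
                return f"{tier}_final", weight * FINAL_MULTIPLIER
            return tier, weight
    # Anything that matches no tournament pattern is treated as bilateral.
    weight = BILATERAL_WEIGHT
    if series_decided:
        weight *= DEAD_RUBBER_DISCOUNT
    return "bilateral", weight
```

With these placeholder numbers, a World Cup final gets four times the baseline weight, while a bilateral match in an already-decided series gets half of it.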
SYSTEM ARCHITECTURE.
Transparent, intricate, and honestly imperfect.
Building this prototype required a dedicated 64GB beast VM, a transition from flat files to semantic data marts, and a 17-step pipeline that combines heavy data engineering with advanced machine learning. Here is the technical breakdown of how we reached an analysis-ready state, the specific scripts and models we deployed, and our concessions about what the system cannot yet see.
Compute VM: 64GB (Local Beast Machine)
Pipeline Scripts: 17 (Prep to Synthesis)
ML Techniques: 11 (WP, VAEP, WAR, etc.)
LLM Synthesizers: 2 (Gemini Flash Engines)
End-to-End Pipeline
Cricsheet Raw JSON -> Physical-Logic Layer -> Gold Record Store -> XGBoost WP Models -> VAEP Calculator -> Narrative Cage -> Next.js Edge
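As an illustration of what the Physical-Logic Layer stage checks, the sketch below walks one innings as a small state machine and flags deliveries that break basic rules: more than six legal balls in an over, an over that ends early, or a ball bowled after ten wickets have fallen. The `Delivery` fields and the `validate_innings` helper are hypothetical names for this example, and the rule set is a deliberately small subset, not our production validator.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Delivery:
    over: int                # zero-based over index
    is_legal: bool           # False for wides and no-balls
    is_wicket: bool = False


def validate_innings(deliveries: Iterable[Delivery],
                     balls_per_over: int = 6,
                     max_wickets: int = 10) -> List[str]:
    """Return a list of rule violations found in one innings."""
    errors: List[str] = []
    current_over, legal_balls, wickets = 0, 0, 0
    for i, d in enumerate(deliveries):
        if wickets >= max_wickets:
            errors.append(f"delivery {i}: bowled after the side was all out")
            break
        if d.over != current_over:
            # Moving to a new over: it must be the next one, and the
            # previous over must have been completed.
            if d.over != current_over + 1:
                errors.append(f"delivery {i}: over jumped from {current_over} to {d.over}")
            elif legal_balls < balls_per_over:
                errors.append(f"delivery {i}: over {current_over} ended after {legal_balls} legal balls")
            current_over, legal_balls = d.over, 0
        if d.is_legal:
            legal_balls += 1
            if legal_balls > balls_per_over:
                errors.append(f"delivery {i}: over {current_over} has more than {balls_per_over} legal balls")
        if d.is_wicket:
            wickets += 1
    return errors
```

Returning a list of messages rather than raising on the first violation lets a record be quarantined with its full set of problems attached. Real scorecards occasionally contain miscounted overs, so a flagged innings is a candidate for review and not automatically corrupt data.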
System Breakdown
Cricsheet JSON -> R-Registry Sync -> Dual-Stage Geocoding -> State Machine Validation -> Gold Record Store
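The Dual-Stage Geocoding step follows a familiar fallback pattern, sketched below. The two geocoders are passed in as plain callables so the example makes no claims about the Nominatim or Gemini client APIs. The `geocode_venue` helper, the cache, and the "fall back when stage one returns nothing" trigger are all assumptions for illustration; the actual routing of high-volume grounds with naming variants may differ.

```python
from typing import Callable, Dict, Optional, Tuple

Coords = Tuple[float, float]                  # (latitude, longitude)
Geocoder = Callable[[str], Optional[Coords]]  # returns None when no match is found


def geocode_venue(name: str,
                  primary: Geocoder,
                  fallback: Geocoder,
                  cache: Dict[str, Optional[Coords]]) -> Optional[Coords]:
    """Resolve a venue name, trying the primary geocoder before the fallback."""
    # Normalize case and whitespace so trivial naming variants share a cache entry.
    key = " ".join(name.lower().split())
    if key in cache:
        return cache[key]
    coords = primary(name)
    if coords is None:
        # Stage two only runs when stage one fails (assumed trigger).
        coords = fallback(name)
    cache[key] = coords
    return coords
```

Caching the result, including a `None`, means each distinct venue name costs at most one call per stage, which matters when the second stage is a paid LLM request.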
Technique Distribution
wp_swing: 32
player_vs_self: 14
league_rank: 34
vaep_analysis: 2
advanced_modeling: 14