Architecture

Transparent, intricate, and honestly imperfect.

Building this prototype required a dedicated 64GB beast of a VM, a move from flat files to semantic data marts, and a 17-step pipeline that combines heavy data engineering with machine learning. Here is the technical breakdown: how we reached an analysis-ready state, which scripts and models we deployed, and what the system cannot yet see.

Compute VM: 64GB (Local Beast Machine)
Pipeline Scripts: 17 (Prep to Synthesis)
ML Techniques: 11 (WP, VAEP, WAR, etc.)
LLM Synthesizers: 2 (Gemini Flash Engines)

End-to-End Pipeline

Cricsheet Raw JSON
Physical-Logic Layer
Gold Record Store
XGBoost WP Models
VAEP Calculator
Narrative Cage
Next.js Edge

Our Python physical-logic layer enforces the laws of cricket across 5.2M deliveries. To fill in missing metadata, we brought in the R `cricketdata` package via pre-compiled CRAN binaries and used it to enrich 16,101 player registry entries. Venue geocoding runs in two stages: Nominatim (OpenStreetMap) first, then Gemini 2.5 Flash for high-volume grounds with naming variants. Finally, a regex-based tournament taxonomy assigns Match Weights, so that World Cup finals are weighted separately from bilateral dead rubbers.
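
As a rough illustration of how a regex-based taxonomy can turn event names into Match Weights, here is a minimal sketch. The patterns, labels, weight values, and the `series_decided` flag are all assumptions made up for this example, not the rules or numbers used in the actual pipeline.

```python
import re

# Hypothetical rules: (pattern, label, weight). Order matters, because the
# first matching rule wins. None of these values come from the real pipeline.
RULES = [
    (re.compile(r"world cup.*\b(semi|quarter)[ -]?final", re.I), "world_cup_knockout", 2.5),
    (re.compile(r"world cup.*\bfinal\b", re.I), "world_cup_final", 3.0),
    (re.compile(r"world cup", re.I), "world_cup", 2.0),
    (re.compile(r"\btour of\b", re.I), "bilateral", 1.0),
]


def match_weight(event_name, stage="", series_decided=False):
    """Return a (label, weight) pair for a match.

    `series_decided` is an assumed input: spotting a dead rubber needs
    series state that a regex on the event name alone cannot provide.
    """
    text = f"{event_name} {stage}".strip()
    label, weight = "other", 1.0
    for pattern, rule_label, rule_weight in RULES:
        if pattern.search(text):
            label, weight = rule_label, rule_weight
            break
    # Down-weight bilateral matches played after the series is already won.
    if label == "bilateral" and series_decided:
        label, weight = "bilateral_dead_rubber", 0.5
    return label, weight
```

Calling `match_weight("ICC Cricket World Cup", "Final")` hits the final rule, while `match_weight("England tour of India", series_decided=True)` falls through to the bilateral rule and is then down-weighted. Keeping the rules in an ordered list makes the precedence explicit and easy to audit.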

Cricsheet JSON -> R-Registry Sync -> Dual-Stage Geocoding -> State Machine Validation -> Gold Record Store
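
To make the state machine validation step more concrete, the sketch below walks one innings ball by ball and flags sequences that break basic laws of the game: more than six legal balls in an over, an over that ends early, skipped over numbers, or play continuing after ten wickets. The `Delivery` record is a simplified, hypothetical schema (not Cricsheet's actual JSON layout), and the checks shown are only a small subset of what a full validator would need.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    """Simplified, hypothetical delivery record (not the Cricsheet schema)."""
    over: int               # zero-based over number
    is_legal: bool          # False for wides and no-balls
    is_wicket: bool = False


def validate_innings(deliveries, max_wickets=10, balls_per_over=6):
    """Return a list of human-readable rule violations for one innings."""
    errors = []
    current_over, legal_in_over, wickets = 0, 0, 0
    for i, d in enumerate(deliveries):
        if wickets >= max_wickets:
            errors.append(f"delivery {i}: bowled after the side was all out")
            break
        if d.over != current_over:
            if d.over != current_over + 1:
                errors.append(f"delivery {i}: over jumps from {current_over} to {d.over}")
            elif legal_in_over < balls_per_over:
                errors.append(
                    f"delivery {i}: over {current_over} ended after {legal_in_over} legal balls"
                )
            # Move the state machine into the new over.
            current_over, legal_in_over = d.over, 0
        if d.is_legal:
            legal_in_over += 1
            if legal_in_over > balls_per_over:
                errors.append(f"delivery {i}: ball {legal_in_over} in over {current_over}")
        if d.is_wicket:
            wickets += 1
    return errors
```

An over containing a wide passes as long as six legal balls are bowled, whereas a seventh legal ball in the same over is reported. Returning a list of violations instead of raising on the first one lets a batch job log every problem in a match before deciding whether the record is clean enough to store.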

Technique Distribution

wp_swing: 32
player_vs_self: 14
league_rank: 34
vaep_analysis: 2
advanced_modeling: 14