What we’re building now — detailed
A deeper look at the current fetcher, the exact fields we collect, and how those fields turn into prop evaluation.
Current state of the fetcher
truth_poc.py is our “truth-layer” fetcher. It is intentionally small and auditable. Its job is to:
- Take a date (YYYY-MM-DD) and pull that day’s NBA games.
- For each game, pull player box score rows (minutes + key stats).
- Optionally pull play-by-play (event log) for deeper derived metrics later.
- Write everything into a local SQLite database in canonical tables.
Right now, the script is stable enough for small test pulls (1–2 games) and is being hardened for safe re-runs and larger backfills.
What data we’re gathering (today)
The fetcher writes three canonical tables. These are the “truth spine” we’ll join against odds/props later.
1) canonical_schedule_rest
One row per game per capture timestamp. Key columns:
captured_time_utc, game_id, raw_game_date, start_time_utc, home_team_id, away_team_id, source, requested_date
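To make "one row per game per capture timestamp" concrete, here is a minimal sketch of how such a table could be declared and read back. The column names come from the list above; the column types, the composite primary key, and the helper latest_schedule_rows are assumptions for illustration, not the actual schema in truth_poc.py. The sample rows are made up.

```python
import sqlite3

# Assumed types and primary key; the source only names the columns.
SCHEDULE_DDL = """
CREATE TABLE IF NOT EXISTS canonical_schedule_rest (
    captured_time_utc TEXT NOT NULL,
    game_id           TEXT NOT NULL,
    raw_game_date     TEXT,
    start_time_utc    TEXT,
    home_team_id      INTEGER,
    away_team_id      INTEGER,
    source            TEXT,
    requested_date    TEXT,
    PRIMARY KEY (game_id, captured_time_utc)
)
"""


def latest_schedule_rows(conn):
    """Return the most recent capture of each game (hypothetical helper)."""
    return conn.execute(
        """
        SELECT s.game_id, s.start_time_utc, s.home_team_id, s.away_team_id
        FROM canonical_schedule_rest AS s
        JOIN (
            SELECT game_id, MAX(captured_time_utc) AS latest
            FROM canonical_schedule_rest
            GROUP BY game_id
        ) AS m
          ON m.game_id = s.game_id AND m.latest = s.captured_time_utc
        ORDER BY s.game_id
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEDULE_DDL)

# Two captures of the same (made-up) game; the second has a revised start time.
rows = [
    ("2025-12-15T10:00:00Z", "G1", "2025-12-15", "2025-12-16T00:00:00Z", 1, 2, "nba_api", "2025-12-15"),
    ("2025-12-15T12:00:00Z", "G1", "2025-12-15", "2025-12-16T00:30:00Z", 1, 2, "nba_api", "2025-12-15"),
]
conn.executemany(
    "INSERT INTO canonical_schedule_rest VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
)
latest = latest_schedule_rows(conn)
```

Keeping every capture and selecting the latest at read time preserves the history of schedule changes, at the cost of a slightly more involved query. The MAX comparison relies on timestamps being stored in one consistent ISO 8601 format.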
2) canonical_box_score
One row per player stat line per game. Key columns:
captured_time_utc, game_id, team_id, player_id, minutes, points, rebounds, assists, threes_made, steals, blocks, turnovers, source
Why this matters for props: minutes + production are the starting point for every prop model. Minutes are the volume lever; per-minute rates are the efficiency lever.
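As a rough illustration of the "efficiency lever", per-minute rates can be read straight off the box score table. This sketch assumes minutes is stored as a number (decimal minutes) and that rates are computed as total production over total minutes; both are assumptions, as are the column types, the helper name per_minute_rates, and the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the document; the types are assumed.
conn.execute(
    """
    CREATE TABLE canonical_box_score (
        captured_time_utc TEXT, game_id TEXT, team_id INTEGER, player_id INTEGER,
        minutes REAL, points INTEGER, rebounds INTEGER, assists INTEGER,
        threes_made INTEGER, steals INTEGER, blocks INTEGER, turnovers INTEGER,
        source TEXT
    )
    """
)

# Made-up stat lines for one hypothetical player across two games.
conn.executemany(
    "INSERT INTO canonical_box_score VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("2025-12-16T05:00:00Z", "G1", 1, 101, 30.0, 20, 5, 6, 2, 1, 0, 3, "nba_api"),
        ("2025-12-18T05:00:00Z", "G2", 1, 101, 34.0, 28, 7, 10, 4, 2, 1, 2, "nba_api"),
    ],
)


def per_minute_rates(conn, player_id):
    """Total production divided by total minutes (hypothetical helper)."""
    row = conn.execute(
        """
        SELECT SUM(points)  * 1.0 / NULLIF(SUM(minutes), 0),
               SUM(assists) * 1.0 / NULLIF(SUM(minutes), 0)
        FROM canonical_box_score
        WHERE player_id = ?
        """,
        (player_id,),
    ).fetchone()
    return {"points_per_min": row[0], "assists_per_min": row[1]}


rates = per_minute_rates(conn, 101)
```

With the sample rows, the player has 48 points and 16 assists in 64 minutes, so the rates are 0.75 points and 0.25 assists per minute. NULLIF guards against division by zero for players with no recorded minutes, in which case the rate comes back as None.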
3) canonical_pbp
One row per play-by-play event (optional). Key columns:
captured_time_utc, game_id, event_num, period, clock, event_type, score, description, source, raw_json
Why PBP matters: it allows deeper features later (usage proxies, possession timing, lineup segments, foul trouble effects, etc.). We keep raw_json to stay forward-compatible as upstream schemas evolve.
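The forward-compatibility argument for raw_json can be sketched as follows: the full upstream event is stored next to the parsed columns, so a field that nobody extracted at write time can still be recovered later. The event dictionaries, their keys (including shot_distance), and the helpers store_event and extract_field are hypothetical; the real upstream payload will look different.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the document; the types are assumed.
conn.execute(
    """
    CREATE TABLE canonical_pbp (
        captured_time_utc TEXT, game_id TEXT, event_num INTEGER, period INTEGER,
        clock TEXT, event_type TEXT, score TEXT, description TEXT,
        source TEXT, raw_json TEXT
    )
    """
)


def store_event(conn, captured, game_id, source, event):
    """Write the parsed columns plus the untouched event as JSON text."""
    conn.execute(
        "INSERT INTO canonical_pbp VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            captured, game_id,
            event.get("event_num"), event.get("period"), event.get("clock"),
            event.get("event_type"), event.get("score"), event.get("description"),
            source, json.dumps(event, sort_keys=True),
        ),
    )


def extract_field(conn, game_id, field):
    """Recover a field that has no column of its own from raw_json."""
    cursor = conn.execute(
        "SELECT event_num, raw_json FROM canonical_pbp "
        "WHERE game_id = ? ORDER BY event_num",
        (game_id,),
    )
    return [(num, json.loads(raw).get(field)) for num, raw in cursor]


# Made-up events; "shot_distance" stands in for a field we did not model up front.
events = [
    {"event_num": 1, "period": 1, "clock": "11:42", "event_type": "shot",
     "score": "2-0", "description": "Made jump shot", "shot_distance": 24},
    {"event_num": 2, "period": 1, "clock": "11:20", "event_type": "rebound",
     "score": "2-0", "description": "Defensive rebound"},
]
for event in events:
    store_event(conn, "2025-12-16T05:00:00Z", "G1", "nba_api", event)

distances = extract_field(conn, "G1", "shot_distance")
```

Events that lack the requested field yield None rather than raising, which keeps later feature extraction tolerant of schema drift.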
Command to run (developer-friendly)
Jeffrey can paste this into a terminal in any Python environment after installing the dependencies:
python truth_poc.py --date 2025-12-15 --max-games 1 --db pilot.sqlite --pbp-mode nba_api --sleep 1.5
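For readers unfamiliar with the flags, the sketch below shows one way a script could parse this command line with argparse. It is not the parser in truth_poc.py: the flag names come from the command above, but the defaults, the help text, and the interpretation of --sleep as a pause between upstream requests are assumptions.

```python
import argparse
from datetime import date, datetime


def iso_date(text):
    """Parse YYYY-MM-DD; argparse reports a usage error if this raises ValueError."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def build_parser():
    # Defaults and help strings are illustrative assumptions.
    parser = argparse.ArgumentParser(description="Truth-layer fetcher (sketch)")
    parser.add_argument("--date", type=iso_date, required=True,
                        help="game date, YYYY-MM-DD")
    parser.add_argument("--max-games", type=int, default=None,
                        help="cap on the number of games to pull")
    parser.add_argument("--db", default="pilot.sqlite",
                        help="path to the SQLite database file")
    parser.add_argument("--pbp-mode", default=None,
                        help="play-by-play source; omit to skip play-by-play")
    parser.add_argument("--sleep", type=float, default=0.0,
                        help="assumed: seconds to pause between requests")
    return parser


args = build_parser().parse_args(
    ["--date", "2025-12-15", "--max-games", "1", "--db", "pilot.sqlite",
     "--pbp-mode", "nba_api", "--sleep", "1.5"]
)
```

Parsing the date at the boundary means a malformed value fails immediately with a usage message instead of surfacing later as an empty result.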
Tip: If you re-run often, use a new DB filename per run or apply the idempotency patch so re-runs don’t collide on primary keys.
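To show what avoiding primary-key collisions can look like, here is a minimal sketch using SQLite's INSERT OR REPLACE. It is not the idempotency patch itself. The table is a trimmed, hypothetical version of canonical_box_score, and the primary key (game_id, player_id) is an assumption based on "one row per player stat line per game".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed, hypothetical table; the real primary key is not stated in the source.
conn.execute(
    """
    CREATE TABLE canonical_box_score (
        captured_time_utc TEXT NOT NULL,
        game_id           TEXT NOT NULL,
        player_id         INTEGER NOT NULL,
        minutes           REAL,
        points            INTEGER,
        PRIMARY KEY (game_id, player_id)
    )
    """
)


def write_box_rows(conn, rows):
    """Insert rows; a row with an existing key replaces the old one instead of failing."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO canonical_box_score "
            "(captured_time_utc, game_id, player_id, minutes, points) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )


# Made-up data: the second pull carries a corrected stat line for the same player and game.
first_pull = [("2025-12-16T05:00:00Z", "G1", 101, 30.0, 20)]
second_pull = [("2025-12-16T06:00:00Z", "G1", 101, 30.0, 22)]

write_box_rows(conn, first_pull)
write_box_rows(conn, second_pull)  # a plain INSERT would raise IntegrityError here

stored = conn.execute(
    "SELECT captured_time_utc, points FROM canonical_box_score"
).fetchall()
```

INSERT OR REPLACE keeps only the newest capture per key. If earlier captures need to be preserved, an alternative is to include captured_time_utc in the key and select the latest row at read time.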
How we use these tables to build a prop model
The canonical tables are not the model itself; they are the data substrate. The modeling pipeline will:
- Compute baseline rates (e.g., points per minute, assists per minute) from historical box scores.
- Project minutes for the upcoming game (recent minutes + role + rotation changes).
- Adjust for context: pace, rest, home/away, injuries/usage shifts, and (lightly weighted) matchup factors.
- Generate a distribution of outcomes (not just a point estimate) so we can price “over/under” lines.
- Compare to the market once we ingest sportsbook odds/line snapshots; compute expected value (EV) and recommend only when the edge is meaningful.
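The steps above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not the planned model: the Poisson distribution is a placeholder for whatever outcome distribution is eventually chosen, the context adjustment is collapsed into a single multiplier, and the 3% minimum edge, the decimal odds, and all input numbers are made up.

```python
import math


def project_mean(rate_per_min, projected_minutes, context_multiplier=1.0):
    """Baseline rate x projected minutes, scaled by one context factor (assumed form)."""
    return rate_per_min * projected_minutes * context_multiplier


def poisson_prob_over(mean, line):
    """P(X > line) for X ~ Poisson(mean). Poisson is a placeholder distribution."""
    k_max = math.floor(line)
    cdf = sum(
        math.exp(-mean) * mean**k / math.factorial(k) for k in range(k_max + 1)
    )
    return 1.0 - cdf


def expected_value(prob_win, decimal_odds, stake=1.0):
    """Expected profit of a bet: win (odds - 1) x stake or lose the stake."""
    return prob_win * (decimal_odds - 1.0) * stake - (1.0 - prob_win) * stake


def should_recommend(ev, min_edge=0.03):
    """Recommend only above a minimum edge; the 3% threshold is hypothetical."""
    return ev >= min_edge


# Made-up inputs: 0.75 points per minute, 32 projected minutes, no context adjustment.
mean_points = project_mean(0.75, 32)           # 24.0 expected points
p_over = poisson_prob_over(mean_points, 22.5)  # chance of 23 or more points
ev_over = expected_value(p_over, decimal_odds=1.91)
recommend = should_recommend(ev_over)
```

Working from a full distribution rather than a point estimate is what makes the comparison with a line possible: the same projected mean gives a different probability of going over at 22.5 than at 25.5, and the EV calculation needs that probability rather than the mean.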