In one of the earliest iterations of Talking Tech, we built a random forest classifier to predict play calls for college football. In another edition, we used my personal favorite method of building an artificial neural network to predict college football games. In this edition, we'll dive into another type of machine learning methodology. Like the earlier walkthrough using a random forest classifier, we'll look at another type of ensemble method. An ensemble method builds numerous disparate models and relies on strength through sheer numbers. In the random forest method, a multitude of decision trees is generated and their outputs are all gathered together into the final output. In this post, we'll use an ensemble method that's a little less, well, random.
Gradient boosting is similar in many ways to random forest methods. Both are ensemble models. Both typically employ decision trees. Both can also be used for either classification or regression. So what sets them apart? If you remember, random forest methods typically generate a multitude of decision trees at random, relying on the erroneous trees to more or less cancel each other out while the stronger trees rise to the top. Gradient boosting, on the other hand, starts with one decision tree, evaluates it, and then uses the resulting error to generate another decision tree that is incrementally more accurate. Rinse and repeat.
Eventually, this results in a multitude of trees all chained together, each one using the insights of its predecessors to make itself more accurate. But you don't simply discard the older models. All of the generated trees make up the final model, which makes this another ensemble method. You can see how this method might perform much better than random forests. In fact, gradient boosted decision tree models are usually among the top performers in Kaggle competitions and the like.
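To make that loop concrete, here is a minimal sketch of the residual-fitting idea using plain scikit-learn decision trees. This is not how XGBoost is implemented internally (the data and hyperparameters here are purely illustrative), just the core mechanic of boosting:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: learn a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
trees = []
prediction = np.zeros_like(y)  # start from a trivial (all-zero) prediction

for _ in range(50):
    residuals = y - prediction          # where is the current ensemble wrong?
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)              # fit the next tree to those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)                  # every tree remains part of the final model

Each new tree is trained on the errors of everything that came before it, and the final prediction is the sum of all the trees' contributions.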
When it comes to gradient boosting in Python, there are two libraries with which I'm familiar: XGBoost and LightGBM. While both libraries are solid options, we'll be using XGBoost in this post. However, I do recommend going back and giving LightGBM a look at some point.
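If you do end up trying LightGBM later, its scikit-learn wrapper is close to a drop-in replacement for what we do below (assuming you have installed the lightgbm package):

from lightgbm import LGBMRegressor

# Same scikit-learn style interface as XGBRegressor:
# model = LGBMRegressor(random_state=0)
# model.fit(X_train, y_train)
# model.predict(X_valid)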
We will be using the CBBD Python library to pull data from the CollegeBasketballData.com REST API. In total, we will be using these packages: cbbd, pandas, sklearn, xgboost. Make sure you have these all installed via pip or however you manage your Python dependencies. We will start by importing everything we need up front. We will also set up our CBBD API key, so enter yours into the placeholder below. If you need a key, you can acquire one from the main CBBD website.
import cbbd
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
configuration = cbbd.Configuration(
    access_token = 'your_api_key_here'
)
I should also note that we will be making a total of 22 API calls, well within the free tier of 1,000 monthly calls offered by CBBD and enough to rerun this model many times over.
Next, we will compile all NCAA tournament games from 2014 through 2024. You can go further back if you wish. Note that we're passing in a parameter of tournament='NCAA'. This allows us to conveniently query all tournament games for a given year.
games = []

with cbbd.ApiClient(configuration) as api_client:
    games_api = cbbd.GamesApi(api_client)

    for season in range(2024, 2013, -1):
        results = games_api.get_games(season=season, tournament='NCAA')
        games += results

len(games)
That returned 686 games. Let's see what data is included in a game record.
games[0]
GameInfo(id=12010, source_id='401638579', season_label='20232024', season=2024, season_type=SeasonType.POSTSEASON, start_date=datetime.datetime(2024, 3, 19, 18, 40, tzinfo=datetime.timezone.utc), start_time_tbd=False, neutral_site=True, conference_game=False, game_type='TRNMNT', tournament='NCAA', game_notes="Men's Basketball Championship – West Region – First Four", status=, attendance=0, home_team_id=114, home_team='Howard', home_conference_id=18, home_conference='MEAC', home_seed=16, home_points=68, home_period_points=[27, 41], home_winner=False, away_team_id=341, away_team='Wagner', away_conference_id=21, away_conference='NEC', away_seed=16, away_points=71, away_period_points=[38, 33], away_winner=True, excitement=4.7, venue_id=76, venue='UD Arena', city='Dayton', state='OH')
Now we need to load up some stats to incorporate as features in our model. We will use the CBBD Stats API to query team season stats for the same years for which we queried tournament game data. Note that we're passing in a season_type='regular' parameter. THIS IS IMPORTANT. We want to ONLY grab statistics from the regular season. In other words, stats that were available prior to the start of the tournament in a given year. Failing to pass in the filter will result in a model that isn't predictive, but retrodictive. This is a VERY common mistake people make when including data and statistics that weren't available at the time of the games they're seeking to predict.
Anyway, run the code below to grab team season stats.
stats = []

with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)

    for season in range(2024, 2013, -1):
        results = stats_api.get_team_season_stats(season=season, season_type='regular')
        stats += results

len(stats)
And we'll also check out the contents of the stats records.
stats[0]
TeamSeasonStats(season=2024, season_label='20232024', team_id=1, team='Abilene Christian', conference='WAC', games=32, wins=15, losses=17, total_minutes=1325, pace=61.1, team_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=43.2, attempted=1877, made=811), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.4, attempted=1393, made=646), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=34.1, attempted=484, made=165), free_throws=TeamSeasonUnitStatsFieldGoals(pct=73.1, attempted=729, made=533), rebounds=TeamSeasonUnitStatsRebounds(total=1070, defensive=756, offensive=314), turnovers=TeamSeasonUnitStatsTurnovers(team_total=12, total=404), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=635), points=TeamSeasonUnitStatsPoints(fast_break=319, off_turnovers=466, in_paint=1138, total=2320), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=38.8, offensive_rebound_pct=29.3, turnover_ratio=0.2, effective_field_goal_pct=47.6), assists=405, blocks=65, steals=253, possessions=2028, rating=114.4, true_shooting=52.8), opponent_stats=TeamSeasonUnitStats(field_goals=TeamSeasonUnitStatsFieldGoals(pct=46.5, attempted=1792, made=833), two_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=52.6, attempted=1227, made=645), three_point_field_goals=TeamSeasonUnitStatsFieldGoals(pct=33.3, attempted=565, made=188), free_throws=TeamSeasonUnitStatsFieldGoals(pct=68.7, attempted=723, made=497), rebounds=TeamSeasonUnitStatsRebounds(total=1171, defensive=859, offensive=312), turnovers=TeamSeasonUnitStatsTurnovers(team_total=23, total=478), fouls=TeamSeasonUnitStatsFouls(flagrant=0, technical=6, total=619), points=TeamSeasonUnitStatsPoints(fast_break=316, off_turnovers=411, in_paint=1120, total=2351), four_factors=TeamSeasonUnitStatsFourFactors(free_throw_rate=40.3, offensive_rebound_pct=26.6, turnover_ratio=0.2, effective_field_goal_pct=51.7), assists=388, blocks=108, steals=206, possessions=2023, rating=116.2, true_shooting=55.7))
That's a lot of stats! The final step here is to match the team statistics with each game record and put these into a data frame. We're going to create a list of dict objects to combine this data, which will be quite easy to load up into pandas.
In the code below, we're converting each game object into a dict, querying team stats for the home and away teams, and then loading data points from each stats object into the dict. You can completely change these up if you wish or add different stats. I'm not trying to build the most comprehensive or accurate model in this exercise. I'm merely trying to give you a good idea of how to combine the data and get it into the right format.
records = []

for game in games:
    record = game.to_dict()
    home_stats = [stat for stat in stats if stat.team_id == game.home_team_id and stat.season == game.season][0]
    away_stats = [stat for stat in stats if stat.team_id == game.away_team_id and stat.season == game.season][0]

    record['home_pace'] = home_stats.pace
    record['home_o_rating'] = home_stats.team_stats.rating
    record['home_d_rating'] = home_stats.opponent_stats.rating
    record['home_free_throw_rate'] = home_stats.team_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate'] = home_stats.team_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio'] = home_stats.team_stats.four_factors.turnover_ratio
    record['home_efg'] = home_stats.team_stats.four_factors.effective_field_goal_pct
    record['home_free_throw_rate_allowed'] = home_stats.opponent_stats.four_factors.free_throw_rate
    record['home_offensive_rebound_rate_allowed'] = home_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['home_turnover_ratio_forced'] = home_stats.opponent_stats.four_factors.turnover_ratio
    record['home_efg_allowed'] = home_stats.opponent_stats.four_factors.effective_field_goal_pct
    record['away_pace'] = away_stats.pace
    record['away_o_rating'] = away_stats.team_stats.rating
    record['away_d_rating'] = away_stats.opponent_stats.rating
    record['away_free_throw_rate'] = away_stats.team_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate'] = away_stats.team_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio'] = away_stats.team_stats.four_factors.turnover_ratio
    record['away_efg'] = away_stats.team_stats.four_factors.effective_field_goal_pct
    record['away_free_throw_rate_allowed'] = away_stats.opponent_stats.four_factors.free_throw_rate
    record['away_offensive_rebound_rate_allowed'] = away_stats.opponent_stats.four_factors.offensive_rebound_pct
    record['away_turnover_ratio_forced'] = away_stats.opponent_stats.four_factors.turnover_ratio
    record['away_efg_allowed'] = away_stats.opponent_stats.four_factors.effective_field_goal_pct

    records.append(record)

len(records)
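One aside on those list comprehensions: scanning the entire stats list for every game is fine at this scale, but if you pull many more seasons, you could index the stats once and do constant-time lookups instead. A quick sketch of that idea (same output, just a different lookup structure):

# Build a lookup keyed by (team_id, season) so each game does a dict hit
# rather than a full scan of the stats list.
stats_lookup = {(stat.team_id, stat.season): stat for stat in stats}

records = []
for game in games:
    record = game.to_dict()
    home_stats = stats_lookup[(game.home_team_id, game.season)]
    away_stats = stats_lookup[(game.away_team_id, game.season)]
    # ... populate the record fields exactly as above ...
    records.append(record)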
All that's left to do is load this into a data frame. Once it's loaded up, I'm going to compute a new column for the final scoring margin based on the home and away points columns.
df = pd.DataFrame(records)
df['margin'] = df.homePoints - df.awayPoints

df.head()
      id   sourceId seasonLabel  season             seasonType                 startDate  startTimeTbd  neutralSite  conferenceGame gameType  ...  away_d_rating  away_free_throw_rate  away_offensive_rebound_rate  away_turnover_ratio  away_efg  away_free_throw_rate_allowed  away_offensive_rebound_rate_allowed  away_turnover_ratio_forced  away_efg_allowed  margin
0  12010  401638579    20232024    2024  SeasonType.POSTSEASON 2024-03-19 18:40:00+00:00         False         True           False   TRNMNT  ...           98.3                  26.2                         31.4                  0.2      45.4                          29.1                                 25.4                         0.2              47.9      -3
1  12009  401638580    20232024    2024  SeasonType.POSTSEASON 2024-03-19 21:10:00+00:00         False         True           False   TRNMNT  ...          102.0                  32.4                         23.5                  0.2      55.4                          31.4                                 28.4                         0.2              48.8     -25
2  12023  401638581    20232024    2024  SeasonType.POSTSEASON 2024-03-20 18:40:00+00:00         False         True           False   TRNMNT  ...          114.5                  39.1                         29.7                  0.2      48.9                          32.6                                 32.2                         0.2              49.0      -7
3  12022  401638582    20232024    2024  SeasonType.POSTSEASON 2024-03-20 21:28:00+00:00         False         True           False   TRNMNT  ...          102.7                  35.3                         27.0                  0.2      55.3                          28.1                                 29.1                         0.2              49.3      -7
4  12022  401638582    20232024    2024  SeasonType.POSTSEASON 2024-03-20 21:28:00+00:00         False         True           False   TRNMNT  ...          102.7                  35.3                         27.0                  0.2      55.3                          28.1                                 29.1                         0.2              49.3      -7

5 rows × 58 columns
The first step here is feature selection. Let's see what columns are currently included in the data frame.
df.columns
Index(['id', 'sourceId', 'seasonLabel', 'season', 'seasonType', 'startDate',
       'startTimeTbd', 'neutralSite', 'conferenceGame', 'gameType',
       'tournament', 'gameNotes', 'status', 'attendance', 'homeTeamId',
       'homeTeam', 'homeConferenceId', 'homeConference', 'homeSeed',
       'homePoints', 'homePeriodPoints', 'homeWinner', 'awayTeamId',
       'awayTeam', 'awayConferenceId', 'awayConference', 'awaySeed',
       'awayPoints', 'awayPeriodPoints', 'awayWinner', 'excitement', 'venueId',
       'venue', 'city', 'state', 'home_pace', 'home_o_rating', 'home_d_rating',
       'home_free_throw_rate', 'home_offensive_rebound_rate',
       'home_turnover_ratio', 'home_efg', 'home_free_throw_rate_allowed',
       'home_offensive_rebound_rate_allowed', 'home_turnover_ratio_forced',
       'home_efg_allowed', 'away_pace', 'away_o_rating', 'away_d_rating',
       'away_free_throw_rate', 'away_offensive_rebound_rate',
       'away_turnover_ratio', 'away_efg', 'away_free_throw_rate_allowed',
       'away_offensive_rebound_rate_allowed', 'away_turnover_ratio_forced',
       'away_efg_allowed', 'margin'],
      dtype='object')
We're going to pull out the columns we will be using, specifically the features for training and the output we will be training against (margin).
features = [
    'home_o_rating',
    'home_d_rating',
    'home_pace',
    'home_free_throw_rate',
    'home_offensive_rebound_rate',
    'home_turnover_ratio',
    'home_efg',
    'home_free_throw_rate_allowed',
    'home_offensive_rebound_rate_allowed',
    'home_turnover_ratio_forced',
    'home_efg_allowed',
    'away_o_rating',
    'away_d_rating',
    'away_pace',
    'away_free_throw_rate',
    'away_offensive_rebound_rate',
    'away_turnover_ratio',
    'away_efg',
    'away_free_throw_rate_allowed',
    'away_offensive_rebound_rate_allowed',
    'away_turnover_ratio_forced',
    'away_efg_allowed',
    'homeSeed',
    'awaySeed'
]

outputs = ['margin']
df[features + outputs]
     home_o_rating  home_d_rating  home_pace  home_free_throw_rate  home_offensive_rebound_rate  home_turnover_ratio  home_efg  home_free_throw_rate_allowed  home_offensive_rebound_rate_allowed  home_turnover_ratio_forced  ...  away_offensive_rebound_rate  away_turnover_ratio  away_efg  away_free_throw_rate_allowed  away_offensive_rebound_rate_allowed  away_turnover_ratio_forced  away_efg_allowed  homeSeed  awaySeed  margin
0            107.8          106.2       67.4                  41.9                         31.0                  0.2      52.4                          39.2                                 33.5                         0.2  ...                         31.4                  0.2      45.4                          29.1                                 25.4                         0.2              47.9        16        16      -3
1            103.6           96.8       59.4                  25.1                         26.9                  0.1      49.3                          25.7                                 27.2                         0.2  ...                         23.5                  0.2      55.4                          31.4                                 28.4                         0.2              48.8        10        10     -25
2            111.7          109.8       65.2                  29.7                         22.2                  0.2      54.5                          35.9                                 26.5                         0.2  ...                         29.7                  0.2      48.9                          32.6                                 32.2                         0.2              49.0        16        16      -7
3            113.6          101.3       65.2                  36.8                         30.7                  0.2      52.2                          31.9                                 24.8                         0.2  ...                         27.0                  0.2      55.3                          28.1                                 29.1                         0.2              49.3        10        10      -7
4            113.6          101.3       65.2                  36.8                         30.7                  0.2      52.2                          31.9                                 24.8                         0.2  ...                         27.0                  0.2      55.3                          28.1                                 29.1                         0.2              49.3        10        10      -7
..             ...            ...        ...                   ...                          ...                  ...       ...                           ...                                  ...                         ...  ...                          ...                  ...       ...                           ...                                  ...                         ...               ...       ...       ...     ...
681          118.4           96.6       59.2                  43.4                         32.5                  0.2      52.7                          32.6                                 31.7                         0.2  ...                         28.5                  0.2      51.4                          35.5                                 36.4                         0.2              43.9         1         7     -10
682          118.4           96.6       59.2                  43.4                         32.5                  0.2      52.7                          32.6                                 31.7                         0.2  ...                         28.5                  0.2      51.4                          35.5                                 36.4                         0.2              43.9         1         7     -10
683          120.4          105.2       61.2                  44.1                         26.5                  0.1      53.1                          25.9                                 29.1                         0.2  ...                         35.6                  0.2      49.7                          37.8                                 36.1                         0.2              45.0         2         8      -1
684          115.2          101.1       61.7                  38.6                         28.5                  0.2      51.4                          35.5                                 36.4                         0.2  ...                         35.6                  0.2      49.7                          37.8                                 36.1                         0.2              45.0         7         8       6
685          115.2          101.1       61.7                  38.6                         28.5                  0.2      51.4                          35.5                                 36.4                         0.2  ...                         35.6                  0.2      49.7                          37.8                                 36.1                         0.2              45.0         7         8       6

686 rows × 25 columns
Again, feel free to mix that up. If you added or modified any of the statistics in the prior section, this is where you will need to incorporate them.
We will now split our data set into training data and testing data. Training data will be used to train the model. Testing data is held back to test out the model once it's ready to go. In this example, I'm pulling 2024 tournament games as my test set. If you are running through this looking to make predictions on tourney games that are in the future, you can pull those games out instead (assuming you pulled games and statistics for that season into the data set).
training = df.query("season != 2024").copy()
testing = df.query("season == 2024").copy()
We're going to further split the training data into training and validation sets. Both of these sets will be used in training the model. The training set is what actually gets fed into the model, while the validation set is used during training to validate whether the model is actually improving. This mechanism mitigates overfitting on the training data.
X_train, X_valid, y_train, y_valid = train_test_split(training[features], training[outputs], train_size=0.8, test_size=0.2, random_state=0)
Note that this splits the training features (X) out from the expected outputs (y). In the example above, we're randomly holding back 20% of the data set to be used for validation.
We're ready to train! We will be using XGBRegressor since we're using our gradient boosting model for regression. If we were doing classification, we would use XGBClassifier instead.
model = XGBRegressor(random_state=0)
model.fit(X_train, y_train)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=0, ...)
And just like that, we have a trained model! We can now make predictions against our validation set.
predictions = model.predict(X_valid)

predictions
array([-1.87790477e+00, 7.16752386e+00, 1.32060270e+01, 6.78795004e+00,
1.44662819e+01, -2.85689831e+00, -8.69423985e-01, 8.75045967e+00,
3.85790849e+00, -6.43919373e+00, -8.83276880e-01, 6.97011662e+00,
4.38355398e+00, 8.06833267e+00, -8.77752018e+00, 5.22899723e+00,
2.80364990e+00, 3.31810045e+00, -9.09639931e+00, -1.38665593e+00,
4.66550255e+00, 3.16841202e+01, 9.18671894e+00, -2.34628081e+00,
1.58264847e+01, 9.93082142e+00, 9.44772053e+00, 1.88728504e+01,
2.87765160e+01, 3.31487012e+00, 1.30118427e+01, -1.30986392e-01,
5.33917189e+00, 8.50678921e+00, -3.34483713e-01, 2.57094145e+00,
1.66184235e+01, 5.99199915e+00, -2.74236417e+00, 1.33841276e+00,
-5.50944662e+00, -8.56299973e+00, 9.36406422e+00, 1.27445345e+01,
-5.79891968e+00, 9.32999039e+00, 4.99850559e+00, 1.41290035e+01,
1.27072744e+01, 5.49775696e+00, 2.92133301e-01, 2.85389748e+01,
-2.77683735e+00, 1.41666784e+01, 1.65023022e+01, 6.03557158e+00,
2.24876385e+01, -5.69163513e+00, 5.78824818e-01, 2.18679352e+01,
1.81881466e+01, 6.27820158e+00, -3.48073578e+00, -2.05786265e-02,
2.38070393e+01, 7.80937290e+00, 2.68855405e+00, 1.00340958e+01,
1.03051748e+01, 6.70673037e+00, -4.66818810e+00, 1.42929211e+01,
5.93736887e+00, 2.18488560e+01, -3.96203065e+00, -6.01904249e+00,
1.15123062e+01, 1.06525719e+00, -5.60221529e+00, -2.91650534e+00,
8.13025475e+00, -2.16232657e+00, -7.38539994e-02, -7.47696776e-03,
6.57202673e+00, 3.21248150e+00, 3.89195323e-01, 2.67519027e-01,
-1.49262440e+00, -5.93076229e+00, 1.55619888e+01, -9.42352295e-01,
6.86150503e+00, 2.09990826e+01, -2.62024927e+00, -3.10824728e+00,
1.55272758e+00, 6.41326475e+00, 2.17659950e+00, 2.06855249e+00,
1.48680840e+01, 3.38636231e+00, 1.16376562e+01, -1.75216424e+00,
1.12170439e+01, 1.02640734e+01, 1.19243898e+01, 6.55053318e-01,
1.79168587e+01, 1.12861748e+01, 1.15750656e+01, -1.21279058e+01,
-6.30171585e+00, 2.97097254e+00, 5.94197321e+00, -1.26525140e+00,
1.78847879e-01, 1.99955502e+01, 1.16229486e+01, 9.16914749e+00,
1.56323729e+01, 2.16536427e+01, 4.01582432e+00, 2.84138560e-01],
dtype=float32)
If your validation set contains games that have already been played, we can use it to calculate the mean absolute error (or any other metric) of our model.
mae = mean_absolute_error(predictions, y_valid)
mae
7.965800762176514
I got an MAE of ~7.96. I'll be honest, I don't know how good that is since I'm a bit newer to basketball modeling. Based on my reading, an MAE of around 6.5 is pretty good. So this is perhaps not great, but a fine starting point. My goal is not to have the best model but to walk you through the process. It will be up to you to make changes and get better predictions.
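If you want a quick sanity check on numbers like that, one option is to compare against a naive baseline, such as always predicting the mean margin from the training set. A minimal sketch using the variables we already have:

import numpy as np

# Naive baseline: always predict the training set's mean margin.
# A useful model should beat this comfortably.
baseline = np.full(len(y_valid), y_train['margin'].mean())
baseline_mae = mean_absolute_error(baseline, y_valid)
baseline_mae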
What might fine tuning look like? For one, we can update the parameters on the model. The code snippet below runs through the same process as above but explicitly sets the number of estimators, the learning rate, and the number of jobs for the model.
model = XGBRegressor(n_estimators=100, learning_rate=0.05, n_jobs=4)
model.fit(X_train, y_train)

predictions = model.predict(X_valid)
mae = mean_absolute_error(predictions, y_valid)

mae
7.976924419403076
As you can see, my MAE is no better, but you can play around with these parameters and see if you get anything different. The biggest improvements will likely come from tweaking the input features and adding more stats.
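Two other knobs worth knowing about. XGBoost can stop adding trees once the validation score stalls (early stopping), and a trained model exposes rough feature importances that can guide which features to tweak. Here is a sketch of both with illustrative hyperparameter values (note that on older xgboost versions, early_stopping_rounds is passed to fit() instead of the constructor):

# Early stopping: train up to 1,000 trees, but stop once validation MAE
# has not improved for 20 consecutive rounds.
model = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    early_stopping_rounds=20,
    eval_metric='mae',
    random_state=0,
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

# Rough importance of each input feature to the trained model
for name, importance in sorted(zip(features, model.feature_importances_), key=lambda pair: -pair[1]):
    print(f'{name}: {importance:.3f}')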
Let's go back to our testing set, generate predictions, and compare them to the actual outcomes from the 2024 NCAA Tournament.
predictions = model.predict(testing[features])
testing['prediction'] = predictions

testing[['homeSeed', 'homeTeam', 'awaySeed', 'awayTeam', 'margin', 'prediction']]
    homeSeed       homeTeam  awaySeed        awayTeam  margin  prediction
0         16         Howard        16          Wagner      -3    4.429741
1         10       Virginia        10  Colorado State     -25    0.494260
2         16  Montana State        16       Grambling      -7   -0.163861
3         10    Boise State        10        Colorado      -7    0.399193
4         10    Boise State        10        Colorado      -7    0.399193
..       ...            ...       ...             ...     ...         ...
65         1         Purdue         2       Tennessee       6   -4.878470
66         4           Duke        11        NC State     -12    0.975319
67         1         Purdue        11        NC State      13   12.650157
68         1          UConn         4         Alabama      14    6.204337
69         1          UConn         1          Purdue      15    0.927093

70 rows × 6 columns
Let's calculate the actual percentage of games our model picked correctly straight up.
testing.query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing.shape[0]
0.6428571428571429
My model correctly picked games in the 2024 Tournament straight up at a 64.3% clip. Let's look at just the first round. I'm going to use the gameNotes property (which contains round information) to filter down to first round games.
testing[testing['gameNotes'].str.contains('1st')].query("(margin < 0 and prediction < 0) or (margin > 0 and prediction > 0)").shape[0] / testing[testing['gameNotes'].str.contains('1st')].shape[0]
0.696969696969697
For the first round, I'm at a slightly better 69.696969% clip (nice).
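As an aside, the same straight-up accuracy can be computed a bit more tersely by comparing the signs of the actual and predicted margins. A small sketch (this would count an exact zero margin or prediction as a correct pick, which effectively never happens with float predictions and untied basketball games):

import numpy as np

# A pick is correct when the predicted margin has the same sign as the actual margin.
correct = np.sign(testing['margin']) == np.sign(testing['prediction'])

first_round = testing['gameNotes'].str.contains('1st')

correct.mean(), correct[first_round].mean()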
At this point, we should save our model so that we can load it up and use it at a later time.
model.save_model('xgboostmodel')
This exports the model into a file. Replace xgboostmodel above with a filename of your choosing, especially if you want to train and save multiple models. If we want to use our model later on to make predictions, we can load it back up as follows.
model = XGBRegressor()
model.load_model('xgboostmodel')
Let's say I wanted to predict a hypothetical matchup that hasn't yet occurred and isn't even scheduled. This would be useful for, say, filling out a bracket. Here is an example of how I would do that with a reusable method.
with cbbd.ApiClient(configuration) as api_client:
    stats_api = cbbd.StatsApi(api_client)
    stats = stats_api.get_team_season_stats(season=2025, season_type='regular')
def predict_game(model, stats, projected_home_seed, home_team, projected_away_seed, away_team):
    home_stats = [stat for stat in stats if stat.team == home_team][0]
    away_stats = [stat for stat in stats if stat.team == away_team][0]

    record = {
        'home_o_rating': home_stats.team_stats.rating,
        'home_d_rating': home_stats.opponent_stats.rating,
        'home_pace': home_stats.pace,
        'home_free_throw_rate': home_stats.team_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate': home_stats.team_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio': home_stats.team_stats.four_factors.turnover_ratio,
        'home_efg': home_stats.team_stats.four_factors.effective_field_goal_pct,
        'home_free_throw_rate_allowed': home_stats.opponent_stats.four_factors.free_throw_rate,
        'home_offensive_rebound_rate_allowed': home_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'home_turnover_ratio_forced': home_stats.opponent_stats.four_factors.turnover_ratio,
        'home_efg_allowed': home_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'away_o_rating': away_stats.team_stats.rating,
        'away_d_rating': away_stats.opponent_stats.rating,
        'away_pace': away_stats.pace,
        'away_free_throw_rate': away_stats.team_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate': away_stats.team_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio': away_stats.team_stats.four_factors.turnover_ratio,
        'away_efg': away_stats.team_stats.four_factors.effective_field_goal_pct,
        'away_free_throw_rate_allowed': away_stats.opponent_stats.four_factors.free_throw_rate,
        'away_offensive_rebound_rate_allowed': away_stats.opponent_stats.four_factors.offensive_rebound_pct,
        'away_turnover_ratio_forced': away_stats.opponent_stats.four_factors.turnover_ratio,
        'away_efg_allowed': away_stats.opponent_stats.four_factors.effective_field_goal_pct,
        'homeSeed': projected_home_seed,
        'awaySeed': projected_away_seed
    }

    return model.predict(pd.DataFrame([record]))[0]
predict_game(model, stats, 5, 'Michigan', 11, 'Dayton')
np.float32(6.149086)
In the above example, I loaded up data from the current season, created a method that constructs a data frame record using the required features, and then called that method to get a prediction, passing in a model, a stats collection, and each team's projected seed and name. The model predicts that Michigan as a 5 seed would beat Dayton as an 11 seed by 6.1 points. Voila!
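One caveat worth noting: the model was trained with designated home and away teams, but tournament games are played on neutral courts, so which team you pass in the home slot can sway the output. A simple hedge is to average the prediction both ways with a small wrapper (predict_neutral here is my own illustrative helper, not part of the walkthrough above):

def predict_neutral(model, stats, seed_a, team_a, seed_b, team_b):
    # Average the margin with each team in the home slot to wash out
    # any home/away asymmetry the model may have learned.
    forward = predict_game(model, stats, seed_a, team_a, seed_b, team_b)
    reverse = predict_game(model, stats, seed_b, team_b, seed_a, team_a)
    return (forward - reverse) / 2

predict_neutral(model, stats, 5, 'Michigan', 11, 'Dayton')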
And this is where I leave you. As mentioned, there are many improvements that can be made to get this thing ready for prime time. There are many features returned by the Stats API that we aren't even using. And none of our stats are opponent-adjusted. And you're not limited to the Stats API, either. Try incorporating other endpoints or even other data sources.
As always, let me know what you think on Twitter, Bluesky, Discord, etc. And good luck with your brackets!