BSA Workshop - NFL Betting Data

Joey Maurer
1/23/2019

Introduction
This workshop will outline the basic process of exploratory analysis and more advanced predictions using R.
Most analysis projects have a similar framework. The methods here will be carried out on a relatively large
data set.

Data
NFL betting data covering the 2000 through 2018 seasons (games through January 2019). Includes game results,
point spread, over/under, and other variables such as weather, stadium, etc.
library(ggplot2)
df <- read.csv("../../Datasets/2019-January/scores.csv",header=TRUE,stringsAsFactors=FALSE)

#df <- read.csv("https://raw.githubusercontent.com/bruinsportsanalytics/Resource-Folder/master/Data/Foot

Let’s take a look at the basic structure of the data.


head(df)

## schedule_date schedule_season schedule_week schedule_playoff


## 1 9/3/00 2000 1 FALSE
## 2 9/3/00 2000 1 FALSE
## 3 9/3/00 2000 1 FALSE
## 4 9/3/00 2000 1 FALSE
## 5 9/3/00 2000 1 FALSE
## 6 9/3/00 2000 1 FALSE
## team_home score_home team_away score_away
## 1 Atlanta Falcons 36 San Francisco 49ers 28
## 2 Buffalo Bills 16 Tennessee Titans 13
## 3 Cleveland Browns 7 Jacksonville Jaguars 27
## 4 Dallas Cowboys 14 Philadelphia Eagles 41
## 5 Green Bay Packers 16 New York Jets 20
## 6 Kansas City Chiefs 14 Indianapolis Colts 27
## team_favorite spread_favorite over_under_line
## 1 Atlanta Falcons -6.5 46.5
## 2 Buffalo Bills -1.0 40.0
## 3 Jacksonville Jaguars -10.5 38.5
## 4 Dallas Cowboys -6.0 39.5
## 5 Green Bay Packers -2.5 44.0
## 6 Indianapolis Colts -3.5 44.0
## stadium stadium_neutral weather_temperature weather_detail
## 1 Georgia Dome FALSE 72 DOME
## 2 Ralph Wilson Stadium FALSE 70
## 3 FirstEnergy Stadium FALSE 75
## 4 Texas Stadium FALSE 95

## 5 Lambeau Field FALSE 69
## 6 Arrowhead Stadium FALSE 86
str(df)

## 'data.frame': 5050 obs. of 15 variables:


## $ schedule_date : chr "9/3/00" "9/3/00" "9/3/00" "9/3/00" ...
## $ schedule_season : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ schedule_week : chr "1" "1" "1" "1" ...
## $ schedule_playoff : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ team_home : chr "Atlanta Falcons" "Buffalo Bills" "Cleveland Browns" "Dallas Cowboys" ..
## $ score_home : int 36 16 7 14 16 14 23 30 16 10 ...
## $ team_away : chr "San Francisco 49ers" "Tennessee Titans" "Jacksonville Jaguars" "Philade
## $ score_away : int 28 13 27 41 20 27 0 27 21 14 ...
## $ team_favorite : chr "Atlanta Falcons" "Buffalo Bills" "Jacksonville Jaguars" "Dallas Cowboys
## $ spread_favorite : num -6.5 -1 -10.5 -6 -2.5 -3.5 -2 -4.5 -3 0 ...
## $ over_under_line : num 46.5 40 38.5 39.5 44 44 36 46.5 36 40.5 ...
## $ stadium : chr "Georgia Dome" "Ralph Wilson Stadium" "FirstEnergy Stadium" "Texas Stadi
## $ stadium_neutral : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ weather_temperature: int 72 70 75 95 69 86 84 72 63 72 ...
## $ weather_detail : chr "DOME" "" "" "" ...
summary(df)

## schedule_date schedule_season schedule_week schedule_playoff


## Length:5050 Min. :2000 Length:5050 Mode :logical
## Class :character 1st Qu.:2004 Class :character FALSE:4848
## Mode :character Median :2009 Mode :character TRUE :202
## Mean :2009
## 3rd Qu.:2014
## Max. :2018
##
## team_home score_home team_away score_away
## Length:5050 Min. : 0.00 Length:5050 Min. : 0.0
## Class :character 1st Qu.:16.00 Class :character 1st Qu.:13.0
## Mode :character Median :23.00 Mode :character Median :20.0
## Mean :23.18 Mean :20.6
## 3rd Qu.:30.00 3rd Qu.:27.0
## Max. :62.00 Max. :59.0
## NA's :4 NA's :4
## team_favorite spread_favorite over_under_line stadium
## Length:5050 Min. :-26.500 Min. :30.00 Length:5050
## Class :character 1st Qu.: -7.000 1st Qu.:39.50 Class :character
## Mode :character Median : -4.500 Median :43.00 Mode :character
## Mean : -5.385 Mean :43.14
## 3rd Qu.: -3.000 3rd Qu.:46.50
## Max. : 0.000 Max. :63.50
##
## stadium_neutral weather_temperature weather_detail
## Mode :logical Min. :-6.00 Length:5050
## FALSE:4994 1st Qu.:50.00 Class :character
## TRUE :56 Median :64.00 Mode :character
## Mean :60.41
## 3rd Qu.:72.00
## Max. :97.00

## NA's :120
This is already a pretty clean data set. Often, you'll have to spend a significant chunk of time redefining and
creating new variables, fixing types, and tracking down mistakes/errors in the data set. This may be the
most important step, and it will pay off when you get to the flashier visualizations and predictive models.
For now, we'll add a couple of features to this data set.
# Specify who won the game
determine_winner <- function(h_team,h_score,a_team,a_score) {
  if(h_score>a_score) {
    return(h_team)
  } else if(a_score>h_score) {
    return(a_team)
  } else {
    return("TIE")
  }
}

df$game_winner <- apply(df,1,function(x){determine_winner(x[5],as.numeric(x[6]),x[7],as.numeric(x[8]))})

## Error in if (h_score > a_score) {: missing value where TRUE/FALSE needed


Whoops, we are already getting an error. It looks like there is an NA somewhere in one of the scores columns:
df[is.na(df$score_home)|is.na(df$score_away),]

## schedule_date schedule_season schedule_week schedule_playoff


## 5047 1/5/19 2018 Wildcard TRUE
## 5048 1/5/19 2018 Wildcard TRUE
## 5049 1/6/19 2018 Wildcard TRUE
## 5050 1/6/19 2018 Wildcard TRUE
## team_home score_home team_away score_away
## 5047 Houston Texans NA Indianapolis Colts NA
## 5048 Dallas Cowboys NA Seattle Seahawks NA
## 5049 Chicago Bears NA Philadelphia Eagles NA
## 5050 Baltimore Ravens NA Los Angeles Chargers NA
## team_favorite spread_favorite over_under_line stadium
## 5047 Houston Texans -1 48.5 NRG Stadium
## 5048 Dallas Cowboys -2 43.0 AT&T Stadium
## 5049 Chicago Bears -6 41.0 Soldier Field
## 5050 Baltimore Ravens -3 42.0 M&T Bank Stadium
## stadium_neutral weather_temperature weather_detail
## 5047 FALSE 72 DOME
## 5048 FALSE 72 DOME
## 5049 FALSE 37
## 5050 FALSE 43
We see the recent playoff games are listed at the bottom with no scores. Let’s remove all playoff games from
this year.
df <- df[!(df$schedule_season=="2018" & df$schedule_playoff==TRUE),]
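A per-column count of missing values is a quick way to catch this kind of problem before a row-wise function trips over it. The sketch below uses a small made-up data frame (toy and its values are assumptions for illustration, not rows from the workshop data); the same calls could be pointed at df.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  score_home = c(21, NA, 14),
  score_away = c(17, NA, 30),
  weather_temperature = c(72, 65, NA)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Row numbers where either score is missing
missing_rows <- which(is.na(toy$score_home) | is.na(toy$score_away))
print(missing_rows)
```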

The bottom four rows have been removed and we can run the game winner code again with no errors.
df$game_winner <- apply(df,1,function(x){determine_winner(x[5],as.numeric(x[6]),x[7],as.numeric(x[8]))})
head(df[,c(5:8,16)])

## team_home score_home team_away score_away

## 1 Atlanta Falcons 36 San Francisco 49ers 28
## 2 Buffalo Bills 16 Tennessee Titans 13
## 3 Cleveland Browns 7 Jacksonville Jaguars 27
## 4 Dallas Cowboys 14 Philadelphia Eagles 41
## 5 Green Bay Packers 16 New York Jets 20
## 6 Kansas City Chiefs 14 Indianapolis Colts 27
## game_winner
## 1 Atlanta Falcons
## 2 Buffalo Bills
## 3 Jacksonville Jaguars
## 4 Philadelphia Eagles
## 5 New York Jets
## 6 Indianapolis Colts
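Because apply() hands each row to the function as character strings (hence the as.numeric() calls), a vectorized version may be easier to read and faster on larger data sets. The following is a minimal sketch on a small made-up data frame; toy and its values are illustrative assumptions, not workshop data.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  team_home  = c("Home A", "Home B", "Home C"),
  score_home = c(24, 17, 20),
  team_away  = c("Away A", "Away B", "Away C"),
  score_away = c(10, 21, 20),
  stringsAsFactors = FALSE
)

# Vectorized equivalent of determine_winner(): ifelse() works on whole columns
toy$game_winner <- ifelse(toy$score_home > toy$score_away, toy$team_home,
                   ifelse(toy$score_away > toy$score_home, toy$team_away, "TIE"))

print(toy$game_winner)
```

One design difference worth knowing: ifelse() returns NA for rows with missing scores instead of stopping with an error, so missing data would pass through silently rather than being flagged.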
A few more variables...
# Total points scored in game
df$total_points <- df$score_home+df$score_away

# Winner against the spread


spread_winner <- function(h_team,h_score,a_team,a_score,favorite,spread) {
  if(h_team==favorite) {
    h_score <- h_score + spread
  } else if(a_team==favorite) {
    a_score <- a_score + spread
  }
  return(determine_winner(h_team,h_score,a_team,a_score))
}

df$spread_winner <- apply(df,1,function(x){spread_winner(x[5],as.numeric(x[6]),x[7],as.numeric(x[8]),x[9
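The logic here is that the (negative) spread is added to the favorite's score, and the favorite "covers" only if its adjusted score still beats the underdog's. A minimal sketch of that arithmetic follows; the helper name cover_result and the example scores are hypothetical, chosen only to show the three possible outcomes.

```r
# Hypothetical helper: classify each game against the spread.
# fav_score / dog_score: points scored by the favorite and the underdog
# spread: the favorite's spread, a negative number such as -6.5
cover_result <- function(fav_score, dog_score, spread) {
  margin <- fav_score + spread - dog_score   # favorite's margin after the spread
  ifelse(margin > 0, "favorite covers",
         ifelse(margin < 0, "underdog covers", "push"))
}

# Made-up examples: an 8-point win as a 6.5-point favorite, a 3-point win
# as a 6-point favorite, and a 3-point win as a 3-point favorite
res <- cover_result(fav_score = c(28, 20, 23),
                    dog_score = c(20, 17, 20),
                    spread    = c(-6.5, -6, -3))
print(res)
```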

There are plenty of avenues to explore now. Here are a few visualizations so we can get a better look at the
distributions of some of these variables.
# install.packages("ggplot2")
# library(ggplot2)

# Histogram of spreads
ggplot(df,aes(x=spread_favorite)) +
  geom_histogram(binwidth=1,col="black") +
  scale_x_continuous(breaks=seq(-30,0))

[Figure: histogram of spread_favorite (x-axis: spread_favorite, y-axis: count)]
# Density curve for over/under
ggplot(df,aes(x=over_under_line)) +
  geom_line(stat="density") +
  geom_vline(xintercept=median(df$over_under_line),linetype="dashed")

[Figure: density curve of over_under_line with a dashed vertical line at the median (x-axis: over_under_line, y-axis: density)]
QUESTION: Can you generate a graph of over/under line vs actual number of points scored? (Hint: Use
geom_point() in ggplot)
ggplot(df,aes(x=over_under_line,y=total_points)) +
  geom_point()

[Figure: scatter plot of total_points vs over_under_line (x-axis: over_under_line, y-axis: total_points)]
Correlation between the two.
cor(df$over_under_line,df$total_points)

## [1] 0.3184939
Moderately positive, but not as strong as you might expect...
A few tables with descriptive statistics:
How often does the favorite win the game?
tempdf <- df[df$team_favorite!="EVEN",]
table(tempdf$team_favorite==tempdf$game_winner) / nrow(tempdf)

##
## FALSE TRUE
## 0.3367306 0.6632694
How often does the favorite cover the spread?
table(tempdf$team_favorite==tempdf$spread_winner) / nrow(tempdf)

##
## FALSE TRUE
## 0.5273781 0.4726219
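Interesting. The underdog seems to have a slight advantage against the spread. People like betting the favorite.
Note that pushes (games where spread_winner is "TIE") are counted in the FALSE column here. Does a small
favorite vs. a large favorite make a difference?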
small_fav <- tempdf[tempdf$spread_favorite>=quantile(tempdf$spread_favorite,probs=.75),]
large_fav <- tempdf[tempdf$spread_favorite<=quantile(tempdf$spread_favorite,probs=.25),]

table(small_fav$team_favorite==small_fav$spread_winner) / nrow(small_fav)

##
## FALSE TRUE
## 0.5438402 0.4561598
table(large_fav$team_favorite==large_fav$spread_winner) / nrow(large_fav)

##
## FALSE TRUE
## 0.5326504 0.4673496
The proportions are similar, though slightly more extreme for small favorites: when the spread is small, the
favorite covers a bit less often (45.6% vs. 46.7%). Of course, statistical tests can be performed to check the
significance of these results. Here is an example of a binomial test on all games with a favorite.
binom.test(sum(tempdf$team_favorite==tempdf$spread_winner),nrow(tempdf),p=.5)

##
## Exact binomial test
##
## data: sum(tempdf$team_favorite == tempdf$spread_winner) and nrow(tempdf)
## number of successes = 2365, number of trials = 5004, p-value =
## 0.0001133
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4587086 0.4865671
## sample estimates:
## probability of success
## 0.4726219
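The binomial test above looks at a single proportion. To compare two groups, such as small vs. large favorites, a two-sample test of proportions is one option. The sketch below uses prop.test() from base R with made-up counts; the numbers are assumptions for illustration and are not computed from the workshop data.

```r
# Hypothetical counts (NOT taken from the workshop data):
# games in which the favorite covered, and total games, for each group
covers <- c(small_fav = 410, large_fav = 365)
games  <- c(small_fav = 900, large_fav = 780)

# Two-sample test of equal proportions
res <- prop.test(x = covers, n = games)

print(res$estimate)  # the two sample proportions
print(res$p.value)   # a large p-value gives no evidence of a difference
```

With real data, the counts could be obtained as sum(small_fav$team_favorite==small_fav$spread_winner) and nrow(small_fav), and likewise for large_fav.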
QUESTION: What is the winning percentage of a favorite of 10 points or more?
ten_point_df <- tempdf[tempdf$spread_favorite<=-10,]
table(ten_point_df$team_favorite==ten_point_df$game_winner) / nrow(ten_point_df)

##
## FALSE TRUE
## 0.1625207 0.8374793
Create a data frame with two columns. The first is a vector from 1 to 10 in steps of 0.5. The second is the winning
percentage of the favorite in games with that spread. You can make a helper function to compute this. Then
make a bar chart with the results.
win_df <- data.frame(spread=seq(1,10,by=.5),pct=NA)
compute_win_pct <- function(df,spread) {
  tdf <- df[abs(df$spread_favorite)==spread,]
  return(mean(tdf$team_favorite==tdf$game_winner))
}
win_df$pct <- sapply(win_df$spread,function(x){compute_win_pct(tempdf,x)})

ggplot(data=win_df,aes(x=spread,y=pct)) +
  geom_bar(stat="identity",fill="forestgreen",col="black") +
  geom_hline(yintercept=.5,linetype="dashed",col="brown",size=1) +
  coord_cartesian(ylim=c(.3,.9)) +
  scale_x_continuous(breaks=seq(1,10,by=.5)) +
  ggtitle("Favorite Win % by Spread NFL 2000-2018")

[Figure: bar chart "Favorite Win % by Spread NFL 2000-2018" (x-axis: spread, y-axis: pct)]
Obviously, there are many more exploratory analyses like this that you can do. We haven't even looked at playoffs
vs. regular season, weather, stadium features, or trends over time. Any of these would make a very solid basis
for a data journalism article.
Let's get a little fancier and try to predict the result of this year's Super Bowl: Los Angeles Rams vs. New
England Patriots. The Patriots are 2.5-point favorites (-2.5) as of 1/22/2019, and the over/under is 57.5, which
implies a Vegas score of 30 - 27.5 New England.
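The implied score follows from two conditions: the two scores must add up to the over/under, and they must differ by the spread. A short sketch of that arithmetic, using the line values quoted above (the variable names are just for illustration):

```r
over_under <- 57.5   # total points line from the text
spread     <- -2.5   # favorite's spread from the text

# The two scores sum to the total and differ by the size of the spread
favorite_pts <- (over_under - spread) / 2
underdog_pts <- (over_under + spread) / 2

print(c(favorite = favorite_pts, underdog = underdog_pts))
```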
Both teams against the spread this year:
pats_games <- tempdf[tempdf$schedule_season=="2018" & (tempdf$team_home=="New England Patriots" | tempdf
rams_games <- tempdf[tempdf$schedule_season=="2018" & (tempdf$team_home=="Los Angeles Rams" | tempdf$tea

# NE
paste(sum(pats_games$spread_winner=="New England Patriots"),"-",sum(pats_games$spread_winner!="New Engla

## [1] "9 - 7 - 0"


# LAR
paste(sum(rams_games$spread_winner=="Los Angeles Rams"),"-",sum(rams_games$spread_winner!="Los Angeles R

## [1] "7 - 8 - 1"


Note: The Patriots and Rams were favored in every game they played this season.
Build a linear regression model to predict the total number of points scored in the game.
# Variables to work with: Playoff game, spread, over/under line, weather
lin_reg_df <- df[,c(4,10,11,14,17)]
model <- lm(total_points~.,data=lin_reg_df)
summary(model)

##
## Call:
## lm(formula = total_points ~ ., data = lin_reg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.823 -9.330 -0.790 8.678 67.334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.639169 1.785155 2.599 0.00938 **
## schedule_playoffTRUE 2.214652 1.550786 1.428 0.15333
## spread_favorite 0.030379 0.056709 0.536 0.59220
## over_under_line 0.919591 0.039765 23.125 < 2e-16 ***
## weather_temperature -0.006876 0.012488 -0.551 0.58192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.5 on 4921 degrees of freedom
## (120 observations deleted due to missingness)
## Multiple R-squared: 0.1024, Adjusted R-squared: 0.1017
## F-statistic: 140.4 on 4 and 4921 DF, p-value: < 2.2e-16
Not a very good model. Only the over/under line is a significant predictor, and the R^2 is very low (about 0.10).
The diagnostics tell a similar story.
plot(model)

[Figures: diagnostic plots for lm(total_points ~ .): Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
We can also get a predicted value of the total points scored:
predict_df <- data.frame(schedule_playoff=TRUE,spread_favorite=-2.5,over_under_line=57.5,weather_tempera
predict_df$pred_total_pts <- predict(model,newdata=predict_df,type="response")
predict_df

## schedule_playoff spread_favorite over_under_line weather_temperature


## 1 TRUE -2.5 57.5 54
## pred_total_pts
## 1 59.28301
This model predicts 59.3 total points.
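With a residual standard error of 13.5, a single point prediction hides a lot of uncertainty. One way to show this is a prediction interval, which predict() can return for lm objects. The sketch below fits a one-predictor model to simulated data; the data and coefficients are made up for illustration and are not the workshop model.

```r
set.seed(1)  # simulated data, for illustration only
n <- 200
sim <- data.frame(over_under_line = runif(n, 35, 55))
sim$total_points <- 5 + 0.9 * sim$over_under_line + rnorm(n, sd = 13)

fit <- lm(total_points ~ over_under_line, data = sim)

# interval = "prediction" returns the fitted value plus lower and upper
# bounds for a single new game
pred_int <- predict(fit, newdata = data.frame(over_under_line = 57.5),
                    interval = "prediction", level = 0.95)
print(pred_int)
```

The same interval argument could be passed when predicting from the workshop's model and predict_df.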
Build a logistic regression model to predict the probability of a Patriots win (that is, the favorite wins the
game).
log_reg_df <- df[,c(4,9,10,11,14,16)]
log_reg_df$result <- apply(log_reg_df,1,function(x){if(x[2]==x[6]) return(1) else return(0)})
model2 <- glm(result~schedule_playoff+spread_favorite+over_under_line+weather_temperature,data=log_reg_d
summary(model2)

##
## Call:
## glm(formula = result ~ schedule_playoff + spread_favorite + over_under_line +
## weather_temperature, data = log_reg_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9921 -0.5680 0.2584 0.3881 0.5055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4669435 0.0610466 7.649 2.42e-14 ***

## schedule_playoffTRUE -0.0174920 0.0530319 -0.330 0.7415
## spread_favorite -0.0311797 0.0019393 -16.078 < 2e-16 ***
## over_under_line 0.0017076 0.0013598 1.256 0.2093
## weather_temperature -0.0008243 0.0004271 -1.930 0.0536 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2132729)
##
## Null deviance: 1107.7 on 4925 degrees of freedom
## Residual deviance: 1049.5 on 4921 degrees of freedom
## (120 observations deleted due to missingness)
## AIC: 6374.8
##
## Number of Fisher Scoring iterations: 2
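Again, not the best model, but this gives you a basic idea of the process. Note that the summary above reports a
gaussian family (the glm() default), so the fit is effectively a linear probability model; a true logistic
regression requires family=binomial. We can get a predicted probability: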
predict_df2 <- data.frame(schedule_playoff=TRUE,spread_favorite=-2.5,over_under_line=57.5,weather_temper
predict_df2$pred_probability <- predict(model2,newdata=predict_df2,type="response")
predict_df2

## schedule_playoff spread_favorite over_under_line weather_temperature


## 1 TRUE -2.5 57.5 54
## pred_probability
## 1 0.581076
This model gives the Patriots a 58.1% chance of winning the Super Bowl.
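For comparison, here is a minimal sketch of a logistic regression fitted with family = binomial, which keeps predicted probabilities between 0 and 1. It uses simulated data with an assumed relationship between spread and win probability, so the numbers it produces are not estimates for the actual game.

```r
set.seed(42)  # simulated data, for illustration only
n <- 500
sim <- data.frame(spread_favorite = -runif(n, 1, 14))

# Assumed relationship: bigger favorites win more often
p_win <- plogis(0.1 - 0.15 * sim$spread_favorite)
sim$result <- rbinom(n, size = 1, prob = p_win)

# family = binomial requests a logistic regression (logit link)
logit_fit <- glm(result ~ spread_favorite, data = sim, family = binomial)

# type = "response" returns a probability rather than log-odds
p_hat <- predict(logit_fit, newdata = data.frame(spread_favorite = -2.5),
                 type = "response")
print(p_hat)
```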
Hopefully this has given you some ideas for a potential article. I encourage you to keep playing around with
the data. Hands-on projects are one of the best ways to learn and become better at programming.
Thank you everyone!
