A repo containing all my Data Analysis projects
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
After running the first cell to load all necessary libraries, we need to load our dataset. Using pandas, load the dataset `season_2021.csv` and save it as `nfl`. Inspect the first few rows.
# load dataset
nfl = pd.read_csv('season_2021.csv')
# inspect first few rows
nfl.head()
 | team_name | week | day | date | result | opponent | tm_score | opp_score | 1stD_offense | TotYd_offense | PassY_offense | RushY_offense | TO_offense | 1stD_defense | TotYd_defense | PassY_defense | RushY_defense | TO_defense |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Arizona Cardinals | 1 | Sun | September 12 | W | Tennessee Titans | 38 | 13 | 22 | 416 | 280 | 136 | 1 | 17 | 248 | 162 | 86 | 3 |
1 | Arizona Cardinals | 2 | Sun | September 19 | W | Minnesota Vikings | 34 | 33 | 21 | 474 | 371 | 103 | 2 | 22 | 419 | 242 | 177 | 0 |
2 | Arizona Cardinals | 3 | Sun | September 26 | W | Jacksonville Jaguars | 31 | 19 | 23 | 407 | 316 | 91 | 1 | 20 | 361 | 202 | 159 | 4 |
3 | Arizona Cardinals | 4 | Sun | October 3 | W | Los Angeles Rams | 37 | 20 | 27 | 465 | 249 | 216 | 0 | 24 | 401 | 280 | 121 | 2 |
4 | Arizona Cardinals | 5 | Sun | October 10 | W | San Francisco 49ers | 17 | 10 | 20 | 304 | 210 | 94 | 1 | 19 | 338 | 186 | 152 | 1 |
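As an optional extra check (not part of the original instructions), it can help to confirm the size of the dataset and look for missing values before modeling, using only pandas methods on the `nfl` DataFrame loaded above.

```python
# optional: confirm dataset dimensions and check for missing values
print(nfl.shape)         # (rows, columns)
print(nfl.isna().sum())  # count of missing values per column
```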
Next, we want to examine our outcome variable to find out how wins and losses are recorded. Check the counts of each value of the `result` variable.
# check result value counts
nfl.result.value_counts()
W 284
L 284
T 2
Name: result, dtype: int64
We have two problems with the `result` variable:

- The values are recorded as letters (W, L, T) rather than the numbers our model needs.
- The T group is very small, which can lead to issues with our model’s performance.

We can solve both of these issues in one step! We’ll group the tie with the losses and convert to 1 for wins and 0 for ties and losses.
Using the provided encoder, use the `.replace()` function to convert the `result` column values to numeric values. Then check the value counts again to make sure you have only two categories that are numbers rather than letters.
# nested dictionary to encode alphanumeric values to numeric values
result_encoder = {'result': {'W': 1, 'T': 0, 'L': 0}}
# encode result column using encoder
nfl.replace(result_encoder, inplace=True)
# check result value counts
nfl['result'].value_counts()
0 286
1 284
Name: result, dtype: int64
Now let’s take a moment to explore trends in the stats we will be using to predict wins. The variable `stat` has been set to `1stD_offense` by default.

Use `sns.boxplot()` to create a box plot of `stat` by wins and losses. Set the `x`, `y`, and `data` parameters inside the function and save the plot as `stat_plot`.

We’ve included code for plot labels and to view a list of the names of the stats in the dataset. Try changing the value of the `stat` variable to any one of the stat names and run the cell again to see a plot of how losing teams’ stats compare to winning teams’ stats.
# change stat to view plot
stat = '1stD_offense'
# box plot of stat
stat_plot = sns.boxplot(data=nfl, x='result', y=stat)
# plot labels
stat_plot.set_xticklabels(['loss/tie','win'])
plt.show()
# list feature names
print(nfl.columns[8:])
Index(['1stD_offense', 'TotYd_offense', 'PassY_offense', 'RushY_offense',
'TO_offense', '1stD_defense', 'TotYd_defense', 'PassY_defense',
'RushY_defense', 'TO_defense'],
dtype='object')
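If you’d like to see every stat at once instead of re-running the cell for each one, here is an optional sketch (not in the original notebook) that loops over the stat columns and draws the box plots as a grid of subplots; the names `stat_cols`, `fig`, and `axes` are new and used only for this illustration.

```python
# optional: box plots for every stat in a single figure
import matplotlib.pyplot as plt
import seaborn as sns

stat_cols = nfl.columns[8:]
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
for ax, col in zip(axes.flatten(), stat_cols):
    # one box plot of the stat by loss/tie vs. win per subplot
    sns.boxplot(data=nfl, x='result', y=col, ax=ax)
    ax.set_xticklabels(['loss/tie', 'win'])
plt.tight_layout()
plt.show()
```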
Before running our regression, we need to prepare our data by standardizing all the game stats. The provided code saves the game stats to a variable named `features` and saves the necessary scaling function as `scaler`.

Use the function `scaler.fit()` to fit `features` to the scaling function. Then use `scaler.transform()` to standardize the game stats. Save this output as `X`.
# select feature variables to be scaled
features = nfl.iloc[:,8:]
scaler = StandardScaler()
# fit the transformer to the features
scaler.fit(features)
# transform and save as X
X = scaler.transform(features)
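As a quick optional sanity check (not one of the original steps), the standardized columns of `X` should have a mean of roughly 0 and a standard deviation of roughly 1:

```python
# optional: verify the scaled features are standardized
print(X.mean(axis=0).round(2))  # approximately 0 for every column
print(X.std(axis=0).round(2))   # approximately 1 for every column
```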
Let’s also separate our game outcome variable for easier reference. Save the game outcomes as a variable called `y`.
# save result variable as y
y = nfl['result']
We need to randomly split the data into two groups: training data used to fit the model, and test data used to evaluate how well the model predicts games it has not seen.
Use the `train_test_split()` function imported from the `sklearn` library to split the data. This function will split up our features and result labels into training data and testing data, with `test_size` corresponding to the proportion of data reserved for testing. Set `test_size` to 0.5 and `random_state` to 42.
# create train-test split of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
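If you want to confirm the split behaved as expected, an optional check of the resulting shapes should show roughly half of the 570 games in each set:

```python
# optional: confirm the sizes of the training and test sets
print(X_train.shape, X_test.shape)  # features split roughly in half
print(y_train.shape, y_test.shape)  # labels split the same way
```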
In this step, we’ll train our model to use the patterns of the offensive and defensive stats to predict the probability of a winning game.

Create a `LogisticRegression()` classifier and save it to the variable `lrc`. Then call the `.fit()` function using the training data `X_train` and `y_train`.
# create the classifier
lrc = LogisticRegression()
# fit classifier to the training data
lrc.fit(X_train, y_train)
LogisticRegression()
With our classifier fitted (trained) to the training data, we can use the trained classifier to make predictions on the test data. Pass the test features `X_test` as a parameter of `lrc.predict()` and save the resulting predictions as `y_pred`.

Now we can check the percentage of outcomes that our model predicted correctly. Use the `accuracy_score()` function imported from the `sklearn` library to compare our predicted test values `y_pred` to the true values `y_test`.
# predict with the classifier using the .predict() function
y_pred = lrc.predict(X_test)
# view the model accuracy with the accuracy_score() function
accuracy_score(y_pred, y_test)
0.8280701754385965
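Accuracy is a single summary number. If you also want to see how the errors break down between predicted wins and losses, scikit-learn’s `confusion_matrix()` (not imported in the first cell) gives the counts of each outcome; this optional aside reuses the same `y_test` and `y_pred`:

```python
# optional: break the accuracy down into a confusion matrix
from sklearn.metrics import confusion_matrix

# rows are actual loss/tie (0) and win (1); columns are predicted 0 and 1
print(confusion_matrix(y_test, y_pred))
```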
We can improve our model performance by closely studying how different parameters affect performance. Let’s consider two hyperparameters for the `LogisticRegression` classifier: `penalty` and `C`.

- `penalty` imposes a regularization penalty on the model for having too many variables. Our options generally are `l1` and `l2` regularization.
- `C` is the inverse of regularization strength: smaller values of `C` apply a stronger penalty to large parameter values in order to reduce overfitting.

The following code runs a logistic regression on our same data and gets an accuracy score for each combination of `penalty` and `C`. Run the code to see how model accuracy changes when we use different values of these hyperparameters. If you’d like, try changing the values of `C` in the list.
# create a list of penalties
penalties = ['l1', 'l2']
# create a list of values for C
C = [0.01, 0.1, 1.0, 10.0, 1000.0]
for penalty in penalties:
    for c in C:
        # instantiate the classifier
        lrc_tuned = LogisticRegression(penalty=penalty, C=c, solver='liblinear')
        # fit the classifier to the training data
        lrc_tuned.fit(X_train, y_train)
        # predict with the classifier using the .predict() function
        y_pred = lrc_tuned.predict(X_test)
        # view the model accuracy with the accuracy_score() function
        accuracy = accuracy_score(y_test, y_pred)
        accuracy_rd = round(accuracy*100, 1)
        # print accuracy for each combination of penalty and C
        print(f'Accuracy: {accuracy_rd}% | penalty = {penalty}, C = {c}')
Accuracy: 52.3% | penalty = l1, C = 0.01
Accuracy: 84.6% | penalty = l1, C = 0.1
Accuracy: 83.2% | penalty = l1, C = 1.0
Accuracy: 82.5% | penalty = l1, C = 10.0
Accuracy: 82.5% | penalty = l1, C = 1000.0
Accuracy: 80.4% | penalty = l2, C = 0.01
Accuracy: 83.5% | penalty = l2, C = 0.1
Accuracy: 82.8% | penalty = l2, C = 1.0
Accuracy: 82.5% | penalty = l2, C = 10.0
Accuracy: 82.5% | penalty = l2, C = 1000.0
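As an aside, the nested loops above are a manual grid search; scikit-learn’s `GridSearchCV` can run the same sweep with cross-validation on the training data. This optional sketch is not part of the original exercise, and the names `param_grid` and `grid` are new:

```python
# optional: the same hyperparameter sweep with cross-validation via GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {'penalty': ['l1', 'l2'],
              'C': [0.01, 0.1, 1.0, 10.0, 1000.0]}
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
# best combination found and its cross-validated accuracy
print(grid.best_params_, round(grid.best_score_ * 100, 1))
```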
In the parameter sweep above, we see that changing the hyperparameters did not yield much improvement in prediction accuracy. Let’s try another method of tuning: changing the test size of the train-test split. A list of test sizes between 0 and 1 has been coded for you. Similar to the last task, at each test size the code performs a train-test split, fits the model, and computes an accuracy score.
Run the code to see how test size affects accuracy. If you’d like, try changing the list of test sizes to get better accuracy.
# optimal penalty and C
penalty = 'l1'
C = 0.1
# create a list of test_sizes
test_sizes = [val/100 for val in range(20,36)]
for test_size in test_sizes:
    # train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    # instantiate the classifier
    lrc_tts = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
    # fit the classifier to the training data
    lrc_tts.fit(X_train, y_train)
    # predict with the classifier using the .predict() function
    y_pred = lrc_tts.predict(X_test)
    # view the model accuracy with the accuracy_score() function
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_rd = round(accuracy*100, 1)
    # print accuracy for each combination of penalty and test size
    print(f'Accuracy: {accuracy_rd}% | test size = {test_size}')
Accuracy: 87.7% | test size = 0.2
Accuracy: 87.5% | test size = 0.21
Accuracy: 87.3% | test size = 0.22
Accuracy: 87.9% | test size = 0.23
Accuracy: 88.3% | test size = 0.24
Accuracy: 88.8% | test size = 0.25
Accuracy: 87.9% | test size = 0.26
Accuracy: 88.3% | test size = 0.27
Accuracy: 88.1% | test size = 0.28
Accuracy: 88.6% | test size = 0.29
Accuracy: 87.1% | test size = 0.3
Accuracy: 87.6% | test size = 0.31
Accuracy: 86.9% | test size = 0.32
Accuracy: 87.3% | test size = 0.33
Accuracy: 86.1% | test size = 0.34
Accuracy: 86.0% | test size = 0.35
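Keep in mind that changing the test size changes both how much data the model trains on and which games it is evaluated on, so some of the variation above is noise. If you’d like a more stable estimate, `cross_val_score` averages accuracy over several splits of the full standardized data; this is an optional alternative rather than part of the original notebook, and `cv_model` and `scores` are new names used only here:

```python
# optional: estimate accuracy with 5-fold cross-validation instead of a single split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

cv_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
scores = cross_val_score(cv_model, X, y, cv=5, scoring='accuracy')
print(round(scores.mean() * 100, 1), '% average accuracy across folds')
```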
Now that we know which parameters optimize our model, let’s run and save the final model with our choices for `test_size`, `penalty`, and `C`. Fill in the code to run and save the final model as `optLr`. Continue setting `random_state=42` for the split.
# set the test size and hyperparameters
test_size = 0.25
penalty = 'l1'
C = 0.1
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
# instantiate the classifier
optLr = LogisticRegression(penalty = penalty, C = C, solver='liblinear')
# fit the classifier to the training data
optLr.fit(X_train, y_train)
Let’s find out which stats were most important to our model predicting wins. The absolute values of the model coefficients have been saved for you as `importance`. We’ll print and plot these scores to see which stat has the highest score.

Add code to create a bar plot of the feature importances.
# get importance
optLr.fit(X_train, y_train)
importance = abs(optLr.coef_[0])
# visualize feature importance
sns.barplot(x=importance, y=features.columns)
# add labels and titles
plt.suptitle('Feature Importance for Logistic Regression')
plt.xlabel('Score')
plt.ylabel('Stat')
plt.show()
# summarize feature importance
for i, v in enumerate(importance.round(2)):
    print(f'Feature: {features.columns[i]}, Score: {v}')
Feature: 1stD_offense, Score: 0.08
Feature: TotYd_offense, Score: 0.56
Feature: PassY_offense, Score: 0.0
Feature: RushY_offense, Score: 0.13
Feature: TO_offense, Score: 0.75
Feature: 1stD_defense, Score: 0.19
Feature: TotYd_defense, Score: 0.6
Feature: PassY_defense, Score: 0.0
Feature: RushY_defense, Score: 0.21
Feature: TO_defense, Score: 0.75
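To rank the stats from most to least important rather than reading them in column order, an optional one-liner builds a sorted pandas Series from the same scores; `importance_ranked` is a new name used only for this illustration:

```python
# optional: rank the stats from most to least important
import pandas as pd

importance_ranked = pd.Series(importance, index=features.columns).sort_values(ascending=False)
print(importance_ranked.round(2))
```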
Congratulations! You’ve conducted a successful case study on NFL data where the outcome of a game can be predicted using the team’s offensive and defensive stats from a given game.
Want to see how your model holds up for 2022? Change the `team` variable to your favorite team’s name in the code cell below. We’ve provided the helper function `get_new_data()` that will get that team’s data for the given year from the site Pro Football Reference.

We’ve provided the code for this final step, but we encourage learners who feel confident enough to try the challenge of coding the solution themselves!
# set team name and year
team = 'Dallas Cowboys'
year = 2022
# use helper function to pull new data
from helper import get_new_data
new_data = get_new_data(team=team, year=year)
# view head of new data
new_data.head()
 | team_name | week | day | date | result | opponent | tm_score | opp_score | 1stD_offense | TotYd_offense | PassY_offense | RushY_offense | TO_offense | 1stD_defense | TotYd_defense | PassY_defense | RushY_defense | TO_defense |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dallas Cowboys | 1 | Sun | September 11 | 0 | Tampa Bay Buccaneers | 3 | 19 | 12 | 244 | 173 | 71 | 1 | 18 | 347 | 195 | 152 | 1 |
1 | Dallas Cowboys | 2 | Sun | September 18 | 1 | Cincinnati Bengals | 20 | 17 | 19 | 337 | 230 | 107 | 1 | 19 | 254 | 165 | 89 | 0 |
2 | Dallas Cowboys | 3 | Mon | September 26 | 1 | New York Giants | 23 | 16 | 23 | 391 | 215 | 176 | 0 | 22 | 336 | 169 | 167 | 1 |
3 | Dallas Cowboys | 4 | Sun | October 2 | 1 | Washington Commanders | 25 | 10 | 15 | 279 | 217 | 62 | 0 | 17 | 297 | 155 | 142 | 2 |
4 | Dallas Cowboys | 5 | Sun | October 9 | 1 | Los Angeles Rams | 22 | 10 | 10 | 239 | 76 | 163 | 0 | 14 | 323 | 285 | 38 | 3 |
Before we can run the data in our model and get predictions, we need to standardize the stats using the same `scaler` we used for our original dataset.
# select just the game stats
new_X = new_data.loc[:,features.columns]
# standardize using original data's scaling
new_X_sc = scaler.transform(new_X)
Now we can use our model to make predictions and get an accuracy score for how well our model predicted wins with the new data.
# get new predictions
new_preds = optLr.predict(new_X_sc)
# get actual results and set type to float
new_results = new_data['result'].astype(float)
# get accuracy score for new data
acc_score = accuracy_score(new_results, new_preds)
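If you’d rather see win probabilities than hard 0/1 predictions, `LogisticRegression` also provides `predict_proba()`; this optional sketch applies it to the same scaled 2022 stats, with `win_probs` as a new illustrative name:

```python
# optional: predicted probability of a win for each 2022 game
win_probs = optLr.predict_proba(new_X_sc)[:, 1]  # column 1 is the probability of class 1 (win)
print(win_probs.round(2))
```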
Let’s put all this information together in a table and print out our accuracy score.
# select only game data
col_names = ['day', 'date', 'result', 'opponent', 'tm_score', 'opp_score']
game_data = new_data.loc[:,col_names]
# create comparison table
comp_table = game_data.assign(predicted = new_preds,
actual = new_results.astype(int))
# print title and table
print(f'Predicted Wins vs Actual Wins for {team} in {year}')
comp_table
Predicted Wins vs Actual Wins for Dallas Cowboys in 2022
 | day | date | result | opponent | tm_score | opp_score | predicted | actual |
---|---|---|---|---|---|---|---|---|
0 | Sun | September 11 | 0 | Tampa Bay Buccaneers | 3 | 19 | 0 | 0 |
1 | Sun | September 18 | 1 | Cincinnati Bengals | 20 | 17 | 1 | 1 |
2 | Mon | September 26 | 1 | New York Giants | 23 | 16 | 1 | 1 |
3 | Sun | October 2 | 1 | Washington Commanders | 25 | 10 | 1 | 1 |
4 | Sun | October 9 | 1 | Los Angeles Rams | 22 | 10 | 1 | 1 |
5 | Sun | October 16 | 0 | Philadelphia Eagles | 17 | 26 | 0 | 0 |
6 | Sun | October 23 | 1 | Detroit Lions | 24 | 6 | 1 | 1 |
7 | Sun | October 30 | 1 | Chicago Bears | 49 | 29 | 1 | 1 |
8 | Sun | November 13 | 0 | Green Bay Packers | 28 | 31 | 0 | 0 |
9 | Sun | November 20 | 1 | Minnesota Vikings | 40 | 3 | 1 | 1 |
10 | Thu | November 24 | 1 | New York Giants | 28 | 20 | 0 | 1 |
11 | Sun | December 4 | 1 | Indianapolis Colts | 54 | 19 | 1 | 1 |
12 | Sun | December 11 | 1 | Houston Texans | 27 | 23 | 1 | 1 |
13 | Sun | December 18 | 0 | Jacksonville Jaguars | 34 | 40 | 0 | 0 |
14 | Sat | December 24 | 1 | Philadelphia Eagles | 40 | 34 | 1 | 1 |
# print accuracy
print(f'\nCurrent Accuracy Score: {round(acc_score*100, 1)}%')
Current Accuracy Score: 93.3%
Our table gives us some context on the game, the opponent, and our prediction. Feel free to go back and change the team name or year (you can look at past years too!).