A repo containing all my Data Analysis projects
import pandas as pd
import datetime as dt
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# set plot theme and palette
sns.set_theme()
sns.set_palette('colorblind')
After running the first cell to load all necessary libraries, we need to load our dataset. Using pandas, load the dataset traffic.csv and save it as traffic. Inspect the first few rows.
# load dataset
traffic = pd.read_csv('traffic.csv')
# inspect first few rows
traffic.head()
| | Date | Crashes_per_100k | Season |
---|---|---|---|
0 | 2006-01-01 | 169.176541 | Winter |
1 | 2006-02-01 | 154.028836 | Winter |
2 | 2006-03-01 | 159.930002 | Spring |
3 | 2006-04-01 | 155.741270 | Spring |
4 | 2006-05-01 | 168.179208 | Spring |
The traffic data frame contains three columns: Date, Crashes_per_100k, and Season. In order to plot the Crashes_per_100k column as a time series, we need to make sure that the Date column is in date format. Inspect the data types in the data frame, convert the Date column to date format, and inspect the data types a second time.
# inspect data types
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 180 non-null object
1 Crashes_per_100k 180 non-null float64
2 Season 180 non-null object
dtypes: float64(1), object(2)
memory usage: 4.3+ KB
Convert the Date column to the datetime data type using the pd.to_datetime(column) function.
# convert Date to date format
traffic['Date'] = pd.to_datetime(traffic['Date'])
# inspect data types
traffic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 180 non-null datetime64[ns]
1 Crashes_per_100k 180 non-null float64
2 Season 180 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 4.3+ KB
To get a sense of trends that may exist in the data, use seaborn’s sns.lineplot() function to create a line plot of the traffic data with Date on the x-axis and Crashes_per_100k on the y-axis.
# create line plot
sns.lineplot(x='Date', y='Crashes_per_100k', data=traffic)
<AxesSubplot:xlabel='Date', ylabel='Crashes_per_100k'>
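Monthly values can be noisy, so a rolling mean is one common way to make a longer-term trend easier to see. The sketch below is a minimal illustration on a hypothetical six-month frame (`toy`, with made-up crash rates), not the real traffic data, and the 3-month window is an arbitrary choice.

```python
import pandas as pd

# hypothetical monthly series with made-up values, for illustration only
toy = pd.DataFrame({
    'Date': pd.date_range('2018-01-01', periods=6, freq='MS'),
    'Crashes_per_100k': [160.0, 150.0, 155.0, 165.0, 170.0, 175.0],
})

# average each month with the two months before it to smooth short-term swings;
# the first two rows are NaN because a full 3-month window is not yet available
toy['Rolling_mean'] = toy['Crashes_per_100k'].rolling(window=3).mean()
print(toy)
```

The smoothed column could then be passed to sns.lineplot() in place of the raw values.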
Since we saw a fair amount of variance in the number of collisions occurring throughout the year, we might hypothesize that the number of collisions increases or decreases during different seasons. We can visually explore this with a box plot.
Use sns.boxplot() with crash rate on the x-axis and season on the y-axis. Remove the anomalous 2020 data by adjusting the data parameter to traffic[traffic.Date.dt.year != 2020].
# create box plot by season
sns.boxplot(x='Crashes_per_100k', y='Season',data=traffic[traffic.Date.dt.year != 2020])
<AxesSubplot:xlabel='Crashes_per_100k', ylabel='Season'>
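As a numeric complement to the box plot, the same year filter can feed a per-season summary. This is a minimal sketch on a hypothetical six-row frame (`toy`, with made-up crash rates), not the real traffic data; it only illustrates the boolean mask and a groupby median.

```python
import pandas as pd

# hypothetical toy data with made-up crash rates, for illustration only
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-07-01',
                            '2019-08-01', '2020-01-01', '2020-07-01']),
    'Crashes_per_100k': [160.0, 150.0, 170.0, 180.0, 90.0, 100.0],
    'Season': ['Winter', 'Winter', 'Summer', 'Summer', 'Winter', 'Summer'],
})

# keep every row whose year is not 2020
not_2020 = toy[toy['Date'].dt.year != 2020]

# median crash rate per season, which is the centre line of each box in a box plot
season_medians = not_2020.groupby('Season')['Crashes_per_100k'].median()
print(season_medians)
```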
The dataset crashes_smartphones.csv contains smartphone data from Pew Research Center matched to normalized crash rates from the traffic data frame for the years 2011 to 2019.
Load the dataset as smartphones and inspect the first few rows.
# import dataset
smartphones = pd.read_csv('crashes_smartphones.csv')
# inspect first few rows
smartphones.head()
| | Month_Year | Crashes_per_100k | Season | Smartphone_Survey_Date | Smartphone_usage |
---|---|---|---|---|---|
0 | Apr-12 | 133.213685 | Spring | 4/3/12 | 46 |
1 | Apr-15 | 150.077792 | Spring | 4/12/15 | 67 |
2 | Apr-16 | 172.401948 | Spring | 4/4/16 | 72 |
3 | Aug-12 | 145.403147 | Summer | 8/5/12 | 44 |
4 | Dec-12 | 169.160811 | Winter | 12/9/12 | 45 |
Similar to the traffic data frame, the smartphones data frame has a date column that is not properly formatted. Convert the Smartphone_Survey_Date column to the date data type using the pd.to_datetime() function and then inspect the data types in the data frame.
# change to datetime object
smartphones['Smartphone_Survey_Date'] = pd.to_datetime(smartphones['Smartphone_Survey_Date'])
# inspect data types
smartphones.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Month_Year 28 non-null object
1 Crashes_per_100k 28 non-null float64
2 Season 28 non-null object
3 Smartphone_Survey_Date 28 non-null datetime64[ns]
4 Smartphone_usage 28 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 1.2+ KB
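The survey dates in the preview are written as month/day/two-digit year (for example 4/3/12). As a hedged sketch, passing an explicit `format` to pd.to_datetime() states that interpretation outright rather than leaving it to inference. The small `sample` frame below is hypothetical and only reuses three values from the preview.

```python
import pandas as pd

# hypothetical three-row sample reusing values from the preview
sample = pd.DataFrame({'Smartphone_Survey_Date': ['4/3/12', '4/12/15', '12/9/12']})

# parse as month/day/two-digit year so '4/3/12' becomes 3 April 2012
sample['Smartphone_Survey_Date'] = pd.to_datetime(
    sample['Smartphone_Survey_Date'], format='%m/%d/%y'
)

# show the parsed dates in ISO form to confirm month and day were not swapped
parsed = sample['Smartphone_Survey_Date'].dt.strftime('%Y-%m-%d').tolist()
print(parsed)
```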
Now let’s take a look at smartphone use over time. Create a line plot of the smartphones data with Smartphone_Survey_Date on the x-axis and Smartphone_usage on the y-axis.
# create line plot
sns.lineplot(x='Smartphone_Survey_Date', y='Smartphone_usage', data=smartphones)
plt.show()
A scatter plot with smartphone usage on one axis and crash rates on the other axis will give us an idea of whether there is a relationship between these two variables.
Create a scatter plot with a regression line using seaborn’s sns.regplot() with Smartphone_usage on the x-axis and Crashes_per_100k on the y-axis.
# create scatter plot with regression line
sns.regplot(x='Smartphone_usage', y='Crashes_per_100k', data=smartphones)
plt.show()
To test whether the correlation between Smartphone_usage and Crashes_per_100k is statistically significant, we can calculate the Pearson’s r correlation coefficient and the associated p-value.
Use corr, p = pearsonr(column1, column2) on the Smartphone_usage and Crashes_per_100k columns in the smartphones dataframe. Then use the provided code to print corr and p to see the results.
# find Pearson's r and p-value
corr, p = pearsonr(smartphones['Smartphone_usage'], smartphones['Crashes_per_100k'])
# print corr and p
print("Pearson's r =", round(corr,3))
print("p = ", round(p,3))
Pearson's r = 0.513
p = 0.005
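For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it in plain Python on made-up numbers (the `usage` and `crashes` lists are hypothetical, not the survey data) to show what pearsonr returns as its first value. It does not compute the p-value.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of cross-products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # sums of squared deviations from the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# hypothetical values, for illustration only
usage = [40, 50, 60, 70, 80]
crashes = [140.0, 150.0, 145.0, 160.0, 165.0]

print(round(pearson_r(usage, crashes), 3))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.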
We can use a linear regression to predict crash rates based on smartphone usage. Let’s regress crash rates on smartphone usage. Then we can predict the crash rate in 2020 and see if it matches the actual crash rate in 2020!
We have provided the code to convert the variables to NumPy arrays that will work with the modeling function. The Smartphone_usage array is saved as X, and the Crashes_per_100k array is saved as y.
Initiate the model by saving LinearRegression() to the variable lm. Then fit the model and run the regression with .fit().
# convert columns to arrays
X = smartphones['Smartphone_usage'].to_numpy().reshape(-1, 1)
y = smartphones['Crashes_per_100k'].to_numpy().reshape(-1, 1)
# initiate the linear regression model
lm = LinearRegression()
# fit the model
lm.fit(X, y)
LinearRegression()
Let’s see the values our model produced. Print the coefficients from our lm model. Then think about which parts of the regression line equation these values represent.
# print the coefficients
print("Coef: \n", lm.intercept_, lm.coef_)
Coef:
[120.6637106] [[0.66103316]]
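In the printed output, the first value is the intercept and the second is the slope of the line Crashes_per_100k = intercept + slope * Smartphone_usage. As a sketch of where such values come from, the ordinary least squares fit for a single predictor has a closed form, shown below in plain Python on made-up points (`xs` and `ys` are hypothetical). It illustrates the idea and is not a description of how scikit-learn computes the fit internally.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations divided by sum of squared x deviations
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    # the fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope

# hypothetical points that lie exactly on y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```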
Let’s assume smartphone usage was the same for 2020 as it was for 2019. This is a reasonable assumption since the increase in smartphone usage that we observed in our plot started to plateau at the end of the time series. Let’s use this approximation and our regression model to predict the crash rate in 2020.
From our model output, the regression line equation is Crashes_per_100k = 120.6637 + (0.6610 * Smartphone_usage). Run the provided code to view the smartphone usage rate for 2019. Then substitute this value into the equation, using Python as a calculator to predict the crash rate for 2020.
# get the smartphone usage rate from 2019
smartphones[smartphones['Month_Year'] == "Feb-19"].Smartphone_usage
7 81
Name: Smartphone_usage, dtype: int64
# predict the crash rate in 2020 using the regression equation
Crashes_per_100k = 120.6637 + (0.6610 * 81)
print(Crashes_per_100k)
174.2047
How good was our prediction? Get the actual crash rate for February of 2020 from the traffic dataframe using pd.to_datetime("2020-02-01") as the value for Date.
# get the actual crash rate in Feb 2020
traffic[traffic['Date'] == pd.to_datetime("2020-02-01")].Crashes_per_100k
169 157.88955
Name: Crashes_per_100k, dtype: float64
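One way to answer that question is to compute the gap between the two numbers. The sketch below reuses the predicted value (174.2047) and the actual February 2020 value (157.88955) from the outputs above; the variable names are illustrative.

```python
predicted = 174.2047   # from the regression equation
actual = 157.88955     # February 2020 value from the traffic data frame

# a positive error means the model overestimated the crash rate
error = predicted - actual

# express the gap relative to the actual value
percent_error = error / actual * 100

print(round(error, 2), round(percent_error, 1))
```

A positive error here means the regression line overshot the observed 2020 crash rate.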
Let’s plot our regression plot again, but let’s add two new points on top: the predicted 2020 crash rate and the actual 2020 crash rate.
Code has been provided for the original regression plot and a legend title.
Add a scatter plot layer to add the 2020 predicted and actual crash rates that both used the 2019 smartphone usage rate. Use different colors and marker shapes for the predicted and actual 2020 crash rates.
# recreate the regression plot we made earlier
sns.regplot(x = 'Smartphone_usage', y = 'Crashes_per_100k', data = smartphones)
# add a scatter plot layer to show the actual and predicted 2020 values
sns.scatterplot(x=[81, 81], y=[174.2047, 157.88955],
                hue=['predicted', 'actual'],
                style=['predicted', 'actual'],
                markers=['X', 'o'],
                palette=['navy', 'orange'],
                s=200)
# add legend title
plt.legend(title='2020')
plt.show()