© Copyright Quantopian Inc.
© Modifications Copyright QuantRocket LLC
Licensed under the Creative Commons Attribution 4.0.
By Chris Fenaroli and Max Margenot
Linear regression is one of our most fundamental modeling techniques. We use it to estimate a linear relationship between a set of independent variables $X_i$ and a dependent outcome variable $y$. Our model takes the form of:
$$ y_i = \beta_0 + \beta_1 x_{i, 1} + \dots + \beta_p x_{i, p} + \epsilon_i = x_i'\beta + \epsilon_i $$

for $i \in \{1, \dots, n\}$, where $n$ is the number of observations. We write this in vector form as:
$$ y = X\beta + \epsilon $$

where $y$ is an $n \times 1$ vector, $X$ is an $n \times p$ matrix, $\beta$ is a $p \times 1$ vector of coefficients, and $\epsilon$ is an $n \times 1$ vector of (here, standard normal) error terms. Typically we call a model with $p = 1$ a simple linear regression and a model with $p > 1$ a multiple linear regression. More background information on regressions can be found in the lectures on simple linear regression and multiple linear regression.
Whenever we build a model, there will be gaps between what the model predicts and what is observed in the sample. The differences between these values are known as the residuals of the model and can be used to check some of the basic assumptions that go into it. The key assumptions to check are that the relationship between $X$ and $Y$ is in fact linear, that the errors have constant variance (homoscedasticity), and that the residuals are not autocorrelated.
We can use the residuals to help diagnose whether the relationship we have estimated is real or spurious.
Statistical error is a similar metric associated with regression analysis, with one important difference: while residuals quantify the gap between a regression model's predictions and the observed sample, statistical error is the gap between an observed value and its unobservable true expected value. We use residuals in an attempt to estimate this error.
# Import libraries
import numpy as np
import pandas as pd
from statsmodels import regression
import statsmodels.api as sm
import statsmodels.stats.diagnostic as smd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math
First we'll define a function that performs linear regression and plots the results.
def linreg(X, Y):
    # Running the linear regression
    X = sm.add_constant(X)
    model = sm.OLS(Y, X).fit()
    B0 = model.params[0]
    B1 = model.params[1]
    X = X[:, 1]

    # Return summary of the regression and plot results
    X2 = np.linspace(X.min(), X.max(), 100)
    Y_hat = X2 * B1 + B0
    plt.scatter(X, Y, alpha=1) # Plot the raw data
    plt.plot(X2, Y_hat, 'r', alpha=1); # Add the regression line, colored in red
    plt.xlabel('X Value')
    plt.ylabel('Y Value')
    return model, B0, B1
Let's define a toy relationship between $X$ and $Y$ that we can model with a linear regression. Here we define the relationship and construct a model on it, drawing the determined line of best fit with the regression parameters.
n = 50
X = np.random.randint(0, 100, n)
epsilon = np.random.normal(0, 1, n)
Y = 10 + 0.5 * X + epsilon
model, B0, B1 = linreg(X, Y)
print("Line of best fit: Y = {0} + {1}*X".format(B0, B1))
Line of best fit: Y = 10.293236475704886 + 0.492974457218984*X
This toy example has some generated noise, but all real data will also have noise. This is inherent in sampling from any sort of wild data-generating process. As a result, our line of best fit will never exactly fit the data (which is why it is only "best", not "perfect"). Having a model that fits every single observation that you have is a sure sign of overfitting.
For all fit models, there will be a difference between what the regression model predicts and what was observed, which is where residuals come in.
The definition of a residual is the difference between what is observed in the sample and what is predicted by the regression. For any residual $r_i$, we express this as
$$r_i = Y_i - \hat{Y}_i$$

where $Y_i$ is the observed $Y$-value and $\hat{Y}_i$ is the predicted $Y$-value. We plot these differences on the following graph:
model, B0, B1 = linreg(X,Y)
residuals = model.resid
plt.errorbar(X,Y,xerr=0,yerr=[residuals,0*residuals],linestyle="None",color='Green');
We can pull the residuals directly out of the fit model.
residuals = model.resid
print(residuals)
[-0.67164157 -1.76099618 -0.30365386 0.89951205 0.13111216 1.84579793 -0.24615041 -0.61291441 -0.16805376 -2.01007856 0.97105556 1.0335287 0.30758715 -0.39443363 -0.20703832 -0.84868733 -0.3618303 -1.33749005 -0.33532895 -0.11257807 2.15170323 0.74535268 -0.73381961 -1.04266584 -1.09253471 0.1517854 -0.81191279 2.33493536 0.47629945 1.91093397 -0.74114386 -0.28935872 0.07483554 0.46211878 -1.73681098 -0.70175562 1.17377211 0.35187893 1.89796585 -1.23642993 -1.63012988 1.17657191 -0.2298605 -1.84124037 0.55669397 -0.69165981 2.08042533 1.06844539 -0.2232824 0.57116899]
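As a quick sanity check, here is a minimal sketch (using the model, X, Y, and epsilon defined above) verifying that the residuals match the definition $r_i = Y_i - \hat{Y}_i$, and comparing them to the true noise term, which would be unobservable for real data:

# Residuals are the observed values minus the fitted values
print(np.allclose(model.resid, Y - model.predict()))

# Because this data is simulated, we can also compare the residuals to the
# true errors (epsilon), which are unobservable in practice
print(np.corrcoef(model.resid, epsilon)[0, 1])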
Many of the assumptions that are necessary to have a valid linear regression model can be checked by identifying patterns in the residuals of that model. We can make a quick visual check by looking at the residual plot of a given model.
With a residual plot, we look at the predicted values of the model versus the residuals themselves. What we want to see is just a cloud of unrelated points, like so:
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
plt.xlim([1,50]);
What we want is a fairly random distribution of residuals. The points should form no discernible pattern. This would indicate that a plain linear model is likely a good fit. If we see any sort of trend, this might indicate the presence of autocorrelation or heteroscedasticity in the model.
By looking for patterns in residual plots we can determine whether a linear model is appropriate in the first place. A plain linear regression would not be appropriate for an underlying relationship of the form:
$$Y = \beta_0 + \beta_1 X^2$$

as a linear function would not be able to fully explain the relationship between $X$ and $Y$.
If the relationship is not a good fit for a linear model, the residual plot will show a distinct pattern. In general, a residual plot of a linear regression on a non-linear relationship will show bias and be asymmetrical with respect to the residual = 0 line, while a residual plot of a linear regression on a linear relationship will be roughly symmetrical about the residual = 0 axis.
As an example, let's consider a new relationship between the variables $X$ and $Y$ that incorporates a non-linear (power) term.
n = 50
X = np.random.randint(0, 50, n)
epsilon = np.random.normal(0, 1, n)
Y_nonlinear = 10 - X**1.2 + epsilon
model = sm.OLS(Y_nonlinear, sm.add_constant(X)).fit()
B0, B1 = model.params
residuals = model.resid
print('beta_0:', B0)
print('beta_1:', B1)
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
beta_0: 16.749406324930508
beta_1: -2.2582325051107133
The "inverted-U" shape shown by the residuals is a sign that a non-linear model might be a better fit than a linear one.
One of the main assumptions behind a linear regression is that the underlying data has constant variance. If some parts of the data have a different variance from other parts, then the data is not appropriate for a linear regression. Heteroscedasticity refers to data with non-constant variance, as opposed to homoscedasticity, in which the data has constant variance.
Significant heteroscedasticity invalidates linear regression results by biasing the standard error of the model. As a result, we can't trust the outcomes of significance tests and confidence intervals generated from the model and its parameters.
To avoid these consequences it is important to use residual plots to check for heteroscedasticity and adjust if necessary.
As an example of detecting and correcting heteroscedasticity, let's consider yet another relationship between $X$ and $Y$:
n = 50
X = np.random.randint(0, 100, n)
epsilon = np.random.normal(0, 1, n)
Y_heteroscedastic = 100 + 2*X + epsilon*X
model = sm.OLS(Y_heteroscedastic, sm.add_constant(X)).fit()
B0, B1 = model.params
residuals = model.resid
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
Heteroscedasticity often manifests as this spread, giving us a tapered cloud in one direction or another. As we move along in the $x$-axis, the magnitudes of the residuals are clearly increasing. A linear regression is unable to explain this varying variability and the regression standard errors will be biased.
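As an aside (a sketch, not one of the adjustment methods covered below), one symptom of this bias can be seen by comparing the usual OLS standard errors with heteroscedasticity-robust (White) standard errors, which statsmodels can compute from the same fit:

# Compare the usual OLS standard errors with heteroscedasticity-robust ones;
# a large gap between them is another warning sign of heteroscedasticity
robust_results = model.get_robustcov_results(cov_type='HC1')
print("OLS standard errors:   ", model.bse)
print("Robust standard errors:", robust_results.bse)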
Generally, we want to back up qualitative observations on a residual plot with a quantitative method. The residual plot led us to believe that the data might be heteroscedastic. Let's confirm that result with a statistical test.
A common way to test for the presence of heteroscedasticity is the Breusch-Pagan hypothesis test. It's good to combine the qualitative analysis of a residual plot with the quantitative analysis of at least one hypothesis test. We could add the White test as well, but for now we will use only Breusch-Pagan to test the relationship above. A function called het_breuschpagan in the statsmodels package simplifies the computation:
breusch_pagan_p = smd.het_breuschpagan(model.resid, model.model.exog)[1]
print(breusch_pagan_p)
if breusch_pagan_p > 0.05:
    print("The relationship is not heteroscedastic.")
if breusch_pagan_p < 0.05:
    print("The relationship is heteroscedastic.")
0.0010935999868479371
The relationship is heteroscedastic.
We set our significance level at $\alpha = 0.05$, so a Breusch-Pagan p-value below $0.05$ tells us that the relationship is heteroscedastic. For more on hypothesis tests and interpreting p-values, refer to the lecture on hypothesis testing. Using a hypothesis test bears the risk of a false positive or a false negative, which is why it can be good to confirm with additional tests if we are skeptical.
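Since we mentioned the White test above as one such complement, here is a sketch of how it could be run with statsmodels' het_white on the same fitted model; as with Breusch-Pagan, a p-value below 0.05 indicates heteroscedasticity:

# White test for heteroscedasticity, using the same residuals and regressors
white_p = smd.het_white(model.resid, model.model.exog)[1]
print("White test p-value:", white_p)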
If, after creating a residual plot and conducting tests, you believe you have heteroscedasticity, there are a number of methods you can use to attempt to adjust for it. The three we will focus on are differences analysis, log transformations, and Box-Cox transformations.
A differences analysis involves looking at the first-order differences between adjacent values. With this, we are looking at the changes from period to period of a variable rather than looking directly at its values. Often, by looking at the differences instead of the raw values, we can remove heteroscedasticity and use the resulting model on the differences.
# Finding first-order differences in Y_heteroscedastic
Y_heteroscedastic_diff = np.diff(Y_heteroscedastic)
Now that we have stored the first-order differences of Y_heteroscedastic in Y_heteroscedastic_diff, let's repeat the regression and residual plot to see if the heteroscedasticity is still present:
model = sm.OLS(Y_heteroscedastic_diff, sm.add_constant(X[1:])).fit()
B0, B1 = model.params
residuals = model.resid
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
breusch_pagan_p = smd.het_breuschpagan(residuals, model.model.exog)[1]
print(breusch_pagan_p)
if breusch_pagan_p > 0.05:
    print("The relationship is not heteroscedastic.")
if breusch_pagan_p < 0.05:
    print("The relationship is heteroscedastic.")
0.0049117416665093875
The relationship is heteroscedastic.
Note: This new regression was conducted on the differences between data points, and therefore the regression output must be back-transformed to reach a prediction on the original scale. Since we regressed the differences, we can add each predicted difference onto the previous original value to get our estimate:

$$\hat{Y}_i = Y_{i-1} + \hat{Y}_{i,\text{diff}}$$
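A minimal sketch of this back-transformation, using the differenced model and the variables from the cells above:

# Predicted first-order differences from the model fit on Y_heteroscedastic_diff
Y_hat_diff = model.predict()

# Add each predicted difference to the previous observed value to get a
# prediction on the original scale
Y_hat_original_scale = Y_heteroscedastic[:-1] + Y_hat_diff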
Next, we apply a log transformation to the underlying data. A log transformation will bring the residuals closer together and ideally remove heteroscedasticity. In many (though not all) cases, a log transformation is sufficient to stabilize the variance of a relationship.
# Taking the log of the previous data Y_heteroscedastic and saving it in Y_heteroscedastic_log
Y_heteroscedastic_log = np.log(Y_heteroscedastic)
Now that we have stored the log transformed version of Y_heteroscedastic in Y_heteroscedastic_log, let's repeat the regression and residual plot to see if the heteroscedasticity is still present:
model = sm.OLS(Y_heteroscedastic_log, sm.add_constant(X)).fit()
B0, B1 = model.params
residuals = model.resid
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
# Running and interpreting a Breusch-Pagan test
breusch_pagan_p = smd.het_breuschpagan(residuals, model.model.exog)[1]
print(breusch_pagan_p)
if breusch_pagan_p > 0.05:
    print("The relationship is not heteroscedastic.")
if breusch_pagan_p < 0.05:
    print("The relationship is heteroscedastic.")
0.06983179032520836
The relationship is not heteroscedastic.
Note: This new regression was conducted on the log of the original data, so the scale has been altered and the regression estimates lie on this transformed scale. To bring the estimates back to the original scale, you must back-transform the values using the inverse of the log, i.e. by exponentiating:

$$\hat{Y} = e^{\log(\hat{Y})}$$
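A minimal sketch of this back-transformation, using the model fit on Y_heteroscedastic_log above:

# Fitted values are on the log scale; exponentiate to return to the original scale
Y_hat_log_scale = model.predict()
Y_hat_original_scale = np.exp(Y_hat_log_scale)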
Finally, we examine the Box-Cox transformation. The Box-Cox transformation is a powerful method that works on many types of heteroscedastic relationships. It searches over values of $\lambda$ (typically within the range $[-5, 5]$) to find the one that makes the output of the following equation closest to normally distributed:

$$ Y^{(\lambda)} = \begin{cases} \frac{Y^{\lambda}-1}{\lambda} & : \lambda \neq 0\\ \log{Y} & : \lambda = 0 \end{cases} $$
The "best" $\lambda$ will be used to transform the series along the above function. Instead of having to do all of this manually, we can simply use the scipy
function boxcox
. We use this to adjust $Y$ and hopefully remove heteroscedasticity.
Note: The Box-Cox transformation can only be used if all of the data is positive.
# Finding the Box-Cox power-transformed version of Y_heteroscedastic
Y_heteroscedastic_box_cox = stats.boxcox(Y_heteroscedastic)[0]
Now that we have stored the power transformed version of Y_heteroscedastic in Y_heteroscedastic_box_cox, let's repeat the regression and residual plot to see if the heteroscedasticity is still present:
model = sm.OLS(Y_heteroscedastic_box_cox, sm.add_constant(X)).fit()
B0, B1 = model.params
residuals = model.resid
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
# Running and interpreting a Breusch-Pagan test
breusch_pagan_p = smd.het_breuschpagan(residuals, model.model.exog)[1]
print(breusch_pagan_p)
if breusch_pagan_p > 0.05:
    print("The relationship is not heteroscedastic.")
if breusch_pagan_p < 0.05:
    print("The relationship is heteroscedastic.")
0.11981079385385478
The relationship is not heteroscedastic.
Note: Now that the relationship is not heteroscedastic, a linear regression is appropriate. However, because the data was power transformed, the regression estimates lie on a different scale than the original data. This is why it is important to remember to back-transform results using the inverse of the Box-Cox function:

$$\hat{Y} = (Y^{(\lambda)}\lambda + 1)^{1/\lambda}$$
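A minimal sketch of this back-transformation; note that stats.boxcox also returns the fitted $\lambda$, which we discarded above:

# Recover the lambda chosen by the Box-Cox transformation
_, lmbda = stats.boxcox(Y_heteroscedastic)

# Fitted values are on the Box-Cox scale; invert the transform
# (this assumes lmbda != 0; if lmbda == 0, the inverse is np.exp instead)
Y_hat_box_cox_scale = model.predict()
Y_hat_original_scale = (Y_hat_box_cox_scale * lmbda + 1) ** (1 / lmbda)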
Another approach to dealing with heteroscedasticity is a GARCH (generalized autoregressive conditional heteroscedasticity) model. More information can be found in the lecture on GARCH modeling.
Another assumption behind linear regressions is that the residuals are not autocorrelated. A series is autocorrelated when it is correlated with a delayed version of itself. An example of a potentially autocorrelated time series would be daily high temperatures. Today's temperature gives you information about tomorrow's temperature with reasonable confidence (i.e. if it is 90 °F today, you can be very confident that it will not be below freezing tomorrow). A series of fair die rolls, however, would not be autocorrelated, as seeing one roll gives you no information about what the next might be. Each roll is independent of the last.
In finance, stock prices are usually autocorrelated while stock returns are independent from one day to the next. We represent a time dependency on previous values like so:
$$Y_i = Y_{i-1} + \epsilon_i$$

If the residuals of a model are autocorrelated, you will be able to make predictions about adjacent residuals. In the case of $Y$ above, we know the series will be autocorrelated because each value is built on the previous one, so adjacent values (and hence adjacent residuals) tend to be close to one another.
n = 50
X = np.linspace(0, n, n)
Y_autocorrelated = np.zeros(n)
Y_autocorrelated[0] = 50
for t in range(1, n):
    Y_autocorrelated[t] = Y_autocorrelated[t-1] + np.random.normal(0, 1)
# Regressing X and Y_autocorrelated
model = sm.OLS(Y_autocorrelated, sm.add_constant(X)).fit()
B0, B1 = model.params
residuals = model.resid
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
Autocorrelation in the residuals is not immediately obvious in this example, so our check is more to make absolutely certain.
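Before running a formal test, a quick sketch of the lag-1 sample autocorrelation of the residuals gives a rough sense of how much adjacent residuals move together:

# Correlation between each residual and the one immediately before it;
# a value far from zero suggests the residuals are autocorrelated
lag_1_autocorrelation = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print("Lag-1 autocorrelation of the residuals:", lag_1_autocorrelation)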
As with all statistical properties, we require a statistical test to ultimately decide whether there is autocorrelation in our residuals or not. To this end, we use a Ljung-Box test.
A Ljung-Box test is used to detect autocorrelation in a time series. It examines autocorrelation at all lag intervals up to a specified maximum and returns the test statistic and p-value for every tested lag.
Let's use the acorr_ljungbox function in statsmodels to test for autocorrelation in the residuals of our above model. We use a maximum lag of $10$ and see if any of the lags show significant autocorrelation:
ljung_box = smd.acorr_ljungbox(residuals, lags=10, return_df=True)
print("Lagrange Multiplier Statistics:", ljung_box.lb_stat.tolist())
print("\nP-values:", ljung_box.lb_pvalue.tolist(), "\n")
if (ljung_box.lb_pvalue < 0.05).any():
    print("The residuals are autocorrelated.")
else:
    print("The residuals are not autocorrelated.")
Lagrange Multiplier Statistics: [41.20303806067113, 68.48106724559787, 85.60238863877254, 95.72944552604835, 100.55673658072398, 102.15007597664821, 102.28262654391536, 102.57102275502551, 104.21587287946316, 107.85978429224278]

P-values: [1.3720911307573524e-10, 1.3474888808945885e-15, 1.9268496770484735e-18, 7.972792573801277e-20, 4.033747699491496e-20, 8.92857505419026e-20, 3.6415800501787816e-19, 1.27192245084767e-18, 2.2024721331799075e-18, 1.4403010494720476e-18]

The residuals are autocorrelated.
Because the Ljung-Box test yielded a p-value below $0.05$ for at least one lag interval, we can conclude that the residuals of our model are autocorrelated.
We can adjust for autocorrelation in many of the same ways that we adjust for heteroscedasticity. Let's see if a model on the first-order differences of $Y$ has autocorrelated residuals:
# Finding first-order differences in Y_autocorrelated
Y_autocorrelated_diff = np.diff(Y_autocorrelated)
model = sm.OLS(Y_autocorrelated_diff, sm.add_constant(X[1:])).fit()
B0, B1 = model.params
residuals = model.resid
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted Values');
plt.ylabel('Residuals');
# Running and interpreting a Ljung-Box test
ljung_box = smd.acorr_ljungbox(residuals, lags=10, return_df=True)
print("P-values:", ljung_box.lb_pvalue.tolist(), "\n")
if (ljung_box.lb_pvalue < 0.05).any():
    print("The residuals are autocorrelated.")
else:
    print("The residuals are not autocorrelated.")
P-values: [0.17264886534613838, 0.3278814889855232, 0.5242570878937549, 0.6292323413314802, 0.6953534497764703, 0.7279135451873229, 0.8220961266759883, 0.8892880400214722, 0.916320911885328, 0.48499986345032586]

The residuals are not autocorrelated.
Note: As before, this regression was conducted on the differences between data points, and therefore the regression output must be back-transformed to reach a prediction on the original scale. Since we regressed the differences, we can add each predicted difference onto the previous original value to get our estimate:

$$\hat{Y}_i = Y_{i-1} + \hat{Y}_{i,\text{diff}}$$

We can also perform a log transformation, if we so choose. This process is identical to the one we performed on the heteroscedastic data up above, so we will leave it out this time.
Let's calculate the market beta between AAPL and SPY using a simple linear regression, and then conduct a residual analysis on the regression to ensure the validity of our results. To regress AAPL and SPY, we will focus on their returns, not their price, and set SPY returns as our independent variable and AAPL returns as our outcome variable. The regression will give us a line of best fit:
$$\hat{r}_{AAPL} = \hat{\beta}_0 + \hat{\beta}_1 r_{SPY}$$

The slope of the regression line, $\hat{\beta}_1$, represents our market beta: for every 1% change in the returns of SPY, the predicted returns of AAPL change by $\hat{\beta}_1$%.
Let's start by conducting the regression on the returns of the two assets.
from quantrocket.master import get_securities
from quantrocket import get_prices
securities = get_securities(symbols=["AAPL", "SPY"], vendors='usstock')
start = '2017-01-01'
end = '2018-01-01'
closes = get_prices("usstock-free-1min", data_frequency="daily", sids=securities.index.tolist(), fields='Close', start_date=start, end_date=end).loc['Close']
sids_to_symbols = securities.Symbol.to_dict()
closes = closes.rename(columns=sids_to_symbols)
asset = closes['AAPL']
benchmark = closes['SPY']
# We have to take the percent changes to get to returns
# Get rid of the first (0th) element because it is NAN
r_a = asset.pct_change()[1:].values
r_b = benchmark.pct_change()[1:].values
# Regressing the benchmark b and asset a
r_b = sm.add_constant(r_b)
model = sm.OLS(r_a, r_b).fit()
r_b = r_b[:, 1]
B0, B1 = model.params
# Plotting the regression
A_hat = (B1*r_b + B0)
plt.scatter(r_b, r_a, alpha=1) # Plot the raw data
plt.plot(r_b, A_hat, 'r', alpha=1); # Add the regression line, colored in red
plt.xlabel('SPY Returns')
plt.ylabel('AAPL Returns')
# Print our result
print("Estimated AAPL Beta:", B1)
# Calculating the residuals
residuals = model.resid
Estimated AAPL Beta: 1.3605712244705852
Our regression yielded an estimated market beta of 1.36; according to the regression, for every 1% in return we see from the SPY, we should see 1.36% from AAPL.
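As a quick cross-check (a sketch, not part of the residual analysis), the OLS slope should match the textbook formula $\beta = \mathrm{Cov}(r_{AAPL}, r_{SPY}) / \mathrm{Var}(r_{SPY})$:

# Market beta as the ratio of the covariance between the asset and the
# benchmark to the variance of the benchmark
beta_check = np.cov(r_a, r_b)[0, 1] / np.var(r_b, ddof=1)
print("Beta from covariance formula:", beta_check)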
Now that we have the regression results and residuals, we can conduct our residual analysis. Our first step will be to plot the residuals and look for any red flags:
plt.scatter(model.predict(), residuals);
plt.axhline(0, color='red')
plt.xlabel('Predicted AAPL Returns');
plt.ylabel('Residuals');
By simply observing the distribution of residuals, there do not appear to be any abnormalities. The distribution is relatively random and no patterns can be observed (the clustering around the origin reflects the tendency of daily returns to cluster around 0 and is not a red flag). Our qualitative conclusion is that the data is homoscedastic and not autocorrelated, and therefore satisfies the assumptions for linear regression.
Our qualitative assessment of the residual plot is nicely supplemented with a couple of statistical tests. Let's begin by testing for heteroscedasticity with a Breusch-Pagan test, using the het_breuschpagan function from the statsmodels package:
bp_test = smd.het_breuschpagan(residuals, model.model.exog)
print("Lagrange Multiplier Statistic:", bp_test[0])
print("P-value:", bp_test[1])
print("f-value:", bp_test[2])
print("f_p-value:", bp_test[3], "\n")
if bp_test[1] > 0.05:
    print("The relationship is not heteroscedastic.")
if bp_test[1] < 0.05:
    print("The relationship is heteroscedastic.")
Lagrange Multiplier Statistic: 0.03631046966792728
P-value: 0.8488757668605337
f-value: 0.03602521828096212
f_p-value: 0.8496186813191156

The relationship is not heteroscedastic.
Because the P-value is greater than 0.05, we do not have enough evidence to reject the null hypothesis that the relationship is homoscedastic. This result matches up with our qualitative conclusion.
Let's also check for autocorrelation quantitatively using a Ljung-Box test, using the acorr_ljungbox function from the statsmodels package with a maximum lag of 1:
ljung_box = smd.acorr_ljungbox(residuals, lags=1, return_df=True)
print("P-Values:", ljung_box.lb_pvalue.tolist(), "\n")
if (ljung_box.lb_pvalue < 0.05).any():
    print("The residuals are autocorrelated.")
else:
    print("The residuals are not autocorrelated.")
P-Values: [0.7154123872552114]

The residuals are not autocorrelated.
Because the Ljung-Box test yielded p-values above 0.05 for all lags, we can conclude that the residuals are not autocorrelated. This result matches up with our qualitative conclusion.
After having visually assessed the residual plot of the regression and then backing it up using statistical tests, we can conclude that the data satisfies the main assumptions and the linear model is valid.
This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian") or QuantRocket LLC ("QuantRocket"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, neither Quantopian nor QuantRocket has taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information believed to be reliable at the time of publication. Neither Quantopian nor QuantRocket makes any guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.