© Copyright Quantopian Inc.
© Modifications Copyright QuantRocket LLC
Licensed under the Creative Commons Attribution 4.0.
By Evgenia "Jenny" Nitishinskaya and Delaney Granizo-Mackenzie with example algorithms by David Edwards
The Spearman Rank Correlation Coefficient allows us to determine whether or not two data series move together; that is, when one increases (decreases) the other also increases (decreases). This is more general than a linear relationship; for instance, $y = e^x$ is a monotonic function, but not a linear one. Therefore, in computing it we compare not the raw data but the ranks of the data.
This is useful when your data sets may be in different units and therefore not linearly related (for example, the price of a square plot of land and its side length: the price is more likely to be linear in the area, and hence monotonic but non-linear in the side length). It's also suitable for data sets that do not satisfy the assumptions other tests require, such as normally distributed observations, as would be necessary for a t-test.
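As a quick illustration of the difference, here is a minimal sketch (with made-up numbers, not from the original lecture) of that land-price example: the price is perfectly monotonic in the side length but not linear, so the Spearman rank correlation is 1 while the Pearson correlation falls below 1.
import numpy as np
import scipy.stats as stats

# Hypothetical example: price is linear in the area, hence quadratic in the side length
side_lengths = np.linspace(1, 100, 50)
prices = 3.0 * side_lengths**2  # assumed price per unit of area

print('Pearson correlation: ', stats.pearsonr(side_lengths, prices)[0])
print('Spearman correlation:', stats.spearmanr(side_lengths, prices)[0])
# Spearman is exactly 1 because the relationship is perfectly monotonic;
# Pearson is below 1 because the relationship is not linear.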
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math
# Example of ranking data
l = [10, 9, 5, 7, 5]
print('Raw data:', l)
print('Ranking:', list(stats.rankdata(l, method='average')))
Raw data: [10, 9, 5, 7, 5]
Ranking: [5.0, 4.0, 1.5, 3.0, 1.5]
The intuition is now that instead of looking at the relationship between the two variables, we look at the relationship between the ranks. This is robust to outliers and to the scale of the data.
The argument method='average' indicates that when we have a tie, we average the ranks that the numbers would occupy. For example, the two $5$s above, which would take up ranks 1 and 2, each get assigned a rank of $1.5$.
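For comparison, rankdata supports other tie-breaking methods; a small illustration on the same list (a brief aside, not from the original lecture):
print('min:    ', list(stats.rankdata(l, method='min')))      # ties share the lowest rank they span
print('ordinal:', list(stats.rankdata(l, method='ordinal')))  # ties broken by order of appearance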
To compute the Spearman rank correlation for two data sets $X$ and $Y$, each of size $n$, we use the formula $$r_S = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the ranks of the $i$th pair of observations, $d_i = \operatorname{rank}(X_i) - \operatorname{rank}(Y_i)$. (This closed form is exact when there are no tied ranks.)
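As a sanity check on the formula, here is a minimal sketch (with made-up, tie-free data) that computes $r_S$ by hand and compares it to scipy's built-in spearmanr:
# Hypothetical data with no ties
X = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
Y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
n = len(X)

d = stats.rankdata(X) - stats.rankdata(Y)  # rank differences d_i
r_s_manual = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print('By hand:', r_s_manual)
print('scipy:  ', stats.spearmanr(X, Y)[0])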
The result will always be between $-1$ and $1$. A positive value indicates a positive relationship between the variables, while a negative value indicates an inverse relationship. A value of 0 implies the absence of any monotonic relationship. This does not mean that there is no relationship; for instance, if $Y$ is equal to $X$ with a delay of 2, they are related simply and precisely, but their $r_S$ can be close to zero (we will demonstrate this below).
Let's see what happens if we draw $X$ from a Poisson distribution (non-normal), and then set $Y = e^X + \epsilon$ where $\epsilon$ is drawn from a normal distribution. We'll take the Spearman rank correlation and the regular correlation coefficient on this data and then run the entire experiment many times. Because $e^X$ produces many values that are far away from the rest, we can think of this as modeling 'outliers' in our data. Spearman rank correlation compresses the outliers and does better at measuring the correlation. Regular correlation is confused by the outliers and on average will measure less of a relationship than is actually there.
## Let's see an example of this
n = 100
def compare_correlation_and_spearman_rank(n, noise):
    X = np.random.poisson(size=n)
    Y = np.exp(X) + noise * np.random.normal(size=n)

    Xrank = stats.rankdata(X, method='average')
    Yrank = stats.rankdata(Y, method='average')

    diffs = Xrank - Yrank # order doesn't matter since we'll be squaring these values
    r_s = 1 - 6*sum(diffs*diffs)/(n*(n**2 - 1))

    c_c = np.corrcoef(X, Y)[0,1]
    return r_s, c_c
experiments = 1000
spearman_dist = np.ndarray(experiments)
correlation_dist = np.ndarray(experiments)
for i in range(experiments):
    r_s, c_c = compare_correlation_and_spearman_rank(n, 1.0)
    spearman_dist[i] = r_s
    correlation_dist[i] = c_c
print('Spearman Rank Coefficient: ' + str(np.mean(spearman_dist)))
# Compare to the regular correlation coefficient
print('Correlation coefficient: ' + str(np.mean(correlation_dist)))
Spearman Rank Coefficient: 0.8784615961596159
Correlation coefficient: 0.7713366604097229
Let's take a look at the distribution of measured correlation coefficients and compare the Spearman rank metric with the regular one.
plt.hist(spearman_dist, bins=50, alpha=0.5)
plt.hist(correlation_dist, bins=50, alpha=0.5)
plt.legend(['Spearman Rank', 'Regular Correlation'])
plt.xlabel('Correlation Coefficient')
plt.ylabel('Frequency');
Now let's see how the Spearman rank and regular correlation coefficients cope when we add more noise to the situation.
n = 100
noises = np.linspace(0, 3, 30)
experiments = 100
spearman = np.ndarray(len(noises))
correlation = np.ndarray(len(noises))
for i in range(len(noises)):
    # Run many experiments for each noise setting
    rank_coef = 0.0
    corr_coef = 0.0
    noise = noises[i]
    for j in range(experiments):
        r_s, c_c = compare_correlation_and_spearman_rank(n, noise)
        rank_coef += r_s
        corr_coef += c_c
    spearman[i] = rank_coef/experiments
    correlation[i] = corr_coef/experiments
plt.scatter(noises, spearman, color='r')
plt.scatter(noises, correlation)
plt.legend(['Spearman Rank', 'Regular Correlation'])
plt.xlabel('Amount of Noise')
plt.ylabel('Average Correlation Coefficient')
We can see that the Spearman rank correlation copes with the non-linear relationship much better at most levels of noise. Interestingly, at very high levels of noise, it seems to do worse than regular correlation.
Alternatively, one process might affect another only after a time lag. Now let's see what happens if we add such a delay.
n = 100
X = np.random.rand(n)
Xrank = stats.rankdata(X, method='average')
# Y is X delayed by two periods; the first two entries are placeholder values
Yrank = stats.rankdata([1,1] + list(X[:(n-2)]), method='average')
diffs = Xrank - Yrank # order doesn't matter since we'll be squaring these values
r_s = 1 - 6*sum(diffs*diffs)/(n*(n**2 - 1))
print(r_s)
-0.040975097509750924
Sure enough, the relationship is not detected. When using either regular or Spearman correlation, it is important to check for lagged relationships by offsetting your data and testing at different offset values.
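A minimal sketch of such a check (the helper name and lag range here are illustrative, not from the original lecture): shift one series against the other and record the Spearman coefficient at each candidate lag.
def spearman_at_lag(x, y, lag):
    """Spearman rank correlation of x[t] against y[t + lag]."""
    x_aligned = x[:len(x) - lag]  # drop the last `lag` observations of x
    y_aligned = y[lag:]           # drop the first `lag` observations of y
    m = len(x_aligned)
    diffs = stats.rankdata(x_aligned) - stats.rankdata(y_aligned)
    return 1 - 6 * sum(diffs * diffs) / (m * (m**2 - 1))

# Y is X delayed by two periods, as in the cell above
Y_delayed = np.array([1, 1] + list(X[:(n-2)]))
for lag in range(5):
    print(lag, round(spearman_at_lag(X, Y_delayed, lag), 3))
# The coefficient should jump to (nearly) 1 at lag 2, where the two series align.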
We can also use the spearmanr function in the scipy.stats library:
# Generate two random data sets
np.random.seed(161)
X = np.random.rand(10)
Y = np.random.rand(10)
r_s = stats.spearmanr(X, Y)
print('Spearman Rank Coefficient:', r_s[0])
print('p-value:', r_s[1])
Spearman Rank Coefficient: 0.23636363636363633 p-value: 0.5108853175152002
We now have ourselves an $r_S$, but how do we interpret it? It's positive, so we know that the variables are not anticorrelated. It's not very large, so we know they aren't perfectly positively correlated, but it's hard to say from a glance just how significant the correlation is. Luckily, spearmanr also computes the p-value for this coefficient and sample size for us. The p-value here is above 0.05, so we cannot reject the null hypothesis of no correlation; we cannot claim that $X$ and $Y$ are correlated.
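The p-value depends on both the strength of the rank relationship and the sample size; here is a quick sketch (with simulated data and arbitrarily chosen parameters) of how more observations make the same weak relationship easier to distinguish from noise:
np.random.seed(10)
for m in [10, 100, 1000]:
    x = np.random.rand(m)
    y = x + np.random.normal(scale=2.0, size=m)  # weak monotonic signal buried in noise
    coef, pval = stats.spearmanr(x, y)
    print('n = %4d  coefficient = %6.3f  p-value = %.4f' % (m, coef, pval))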
Now that we've seen how Spearman rank correlation works, we'll quickly go through the process again with some real data. For instance, we may wonder whether the expense ratio of a mutual fund is indicative of its three-year Sharpe ratio. That is, does spending more money on administration, management, etc. lower the risk or increase the returns? Our p-value cutoff will be the usual default of 0.05.
Thanks to Matthew Madurski for the data. First, download the data:
NOTE: prefixing a cell with ! causes the cell to be executed as a shell command instead of as Python code.
!curl -L 'https://gist.github.com/dursk/82eee65b7d1056b469ab/raw/2b3ad3b3b1b8964a22db73730c24b366e0df51b0/mutual_fund_data.csv' -o 'mutual_fund_data.csv'
mutual_fund_data = pd.read_csv('mutual_fund_data.csv')
expense = mutual_fund_data['Annual Expense Ratio'].values
sharpe = mutual_fund_data['Three Year Sharpe Ratio'].values
plt.scatter(expense, sharpe)
plt.xlabel('Expense Ratio')
plt.ylabel('Sharpe Ratio')
r_S = stats.spearmanr(expense, sharpe)
print('Spearman Rank Coefficient: ', r_S[0])
print('p-value: ', r_S[1])
Spearman Rank Coefficient: -0.23757393235503554 p-value: 0.016746509711640854
Our p-value is below the cutoff, so we reject the null hypothesis of no correlation. The negative coefficient indicates a negative correlation: more expensive mutual funds tend to have worse Sharpe ratios. However, there is some odd clustering in the data: there seem to be a few expensive groups with low Sharpe ratios, and a main group whose Sharpe ratio is unrelated to the expense ratio. Further analysis would be required to understand what's going on here.
NOTE: The Factor Analysis lecture covers this topic in much greater detail.
Let's say that we have some way of ranking securities and that we'd like to test how well our ranking performs in practice. In this case our model just takes the mean daily return for the last month and ranks the stocks by that metric.
We hypothesize that this will be predictive of the mean returns over the next month. To test this we score the stocks based on a lookback window, then take the Spearman rank correlation of the scores and the mean returns over the walk-forward month.
from quantrocket.master import get_securities
from quantrocket import get_prices
symbol_list = ['AAPL', 'MSFT', 'JNJ', 'HD', 'MON', 'XOM', 'KKD']
securities = get_securities(symbols=symbol_list, vendors='usstock')
# Get the returns over the lookback window
start = '2014-12-01'
end = '2015-01-01'
prices = get_prices('usstock-free-1min', data_frequency='daily', sids=securities.index.tolist(), fields='Close', start_date=start, end_date=end)
historical_returns = prices.loc['Close'].pct_change()[1:]
sids_to_symbols = securities.Symbol.to_dict()
historical_returns = historical_returns.rename(columns=sids_to_symbols)
# Compute our stock score
scores = np.mean(historical_returns)
print('Our Scores\n')
print(scores)
print('\n')
start = '2015-01-01'
end = '2015-02-01'
prices = get_prices('usstock-free-1min', data_frequency='daily', sids=securities.index.tolist(), fields='Close', start_date=start, end_date=end)
walk_forward_returns = prices.loc['Close'].pct_change()[1:]
walk_forward_returns = walk_forward_returns.rename(columns=sids_to_symbols)
walk_forward_returns = np.mean(walk_forward_returns)
print('The Walk Forward Returns\n')
print(walk_forward_returns)
print('\n')
plt.scatter(scores, walk_forward_returns)
plt.xlabel('Scores')
plt.ylabel('Walk Forward Returns')
r_s = stats.spearmanr(scores, walk_forward_returns)
print('Correlation Coefficient:' + str(r_s[0]))
print('p-value:' + str(r_s[1]))
Our Scores

Sid
AAPL   -0.001875
HD      0.002908
JNJ    -0.001483
KKD    -0.002153
MON     0.000292
MSFT   -0.002075
XOM     0.000181
dtype: float64


The Walk Forward Returns

Sid
AAPL    0.003927
HD      0.000668
JNJ    -0.002168
KKD     0.000332
MON    -0.000732
MSFT   -0.007298
XOM    -0.003050
dtype: float64


Correlation Coefficient:0.14285714285714288
p-value:0.7599453002180929
The p-value is well above our cutoff, so we fail to reject the null hypothesis that our ranking was no better than random. This is a really good check of any ranking system one devises for constructing a long-short equity portfolio.
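As a hedged sketch of how one might extend this check (the helper name and inputs are hypothetical, not part of the original notebook), the same score-versus-forward-return comparison can be repeated across many walk-forward periods and the resulting coefficients summarized, rather than judged from a single month:
def rank_score_check(scores_by_period, forward_returns_by_period):
    """Spearman rank correlation between scores and subsequent returns, per period."""
    coefs = {}
    for period, scores in scores_by_period.items():
        coef, _ = stats.spearmanr(scores, forward_returns_by_period[period])
        coefs[period] = coef
    return pd.Series(coefs)

# Hypothetical usage, given dicts of per-month scores and realized forward returns:
# coefs = rank_score_check(scores_by_month, returns_by_month)
# print(coefs.mean(), coefs.std())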
This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by Quantopian, Inc. ("Quantopian") or QuantRocket LLC ("QuantRocket"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, neither Quantopian nor QuantRocket has taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information believed to be reliable at the time of publication. Neither Quantopian nor QuantRocket makes any guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.