Friday, February 6, 2015

py9. Correlation in Python

The outcomes of two dice throws are independent and therefore uncorrelated, so the correlation calculated from a large sample should be close to 0.
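How close to 0 depends on the sample size. The sketch below is not part of ex9.py: it assumes a recent NumPy with default_rng (Python 3), and the helper name dice_correlation and the seed 0 are made up for illustration. It prints the sample correlation for a few sizes next to 1/sqrt(n), which is roughly the spread to expect for independent data.

```python
# Sketch (not part of ex9.py): sample correlation of two independent dice
# for growing sample sizes. The seed 0 is an arbitrary, illustrative choice.
import numpy as np

rng = np.random.default_rng(0)

def dice_correlation(n, rng):
    """Correlation between n throws of two independent fair dice."""
    x = rng.integers(1, 7, n)  # upper bound 7 is exclusive, so faces 1..6
    y = rng.integers(1, 7, n)
    return np.corrcoef(x, y)[0, 1]

sizes = [100, 10000, 1000000]
results = {n: dice_correlation(n, rng) for n in sizes}
for n in sizes:
    # 1/sqrt(n) approximates the standard deviation of r for independent x and y
    print("n =", n, " r =", results[n], " 1/sqrt(n) =", 1 / np.sqrt(n))
```

For the 50 thousand samples used in this post, 1/sqrt(n) is about 0.0045, so a value of a few thousandths or less is what one would expect.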


x and y hold 50 thousand simulated throws of the first and second die, respectively.


The pearsonr function from scipy.stats is used. It returns the correlation coefficient together with a p-value, so cor keeps only element [0]. The function cor could also have been defined with a def statement rather than a lambda. However, a lambda is commonly used when the function is a single expression. The name cor was chosen to match the function of the same name in R.
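As a small illustration of the def alternative mentioned above, here is a sketch with both forms side by side. The name cor_def and the six-element test arrays are invented for this example.

```python
# Sketch: the same helper written with lambda and with def.
import numpy as np
from scipy.stats import pearsonr

# pearsonr returns the coefficient and a p-value; [0] keeps only the coefficient
cor = lambda x, y: pearsonr(x, y)[0]

def cor_def(x, y):
    """Pearson correlation coefficient of x and y (illustrative name)."""
    return pearsonr(x, y)[0]

# small made-up data set, just to compare the two helpers
x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 1, 4, 3, 6, 5])
print("lambda:", cor(x, y))
print("def:   ", cor_def(x, y))
```

Both forms call pearsonr in the same way, so they return the same value. The def version has room for a docstring and shows its own name in tracebacks.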

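# ex9.py
from __future__ import division, print_function
from numpy.random import randint
from numpy import sqrt
from pandas import DataFrame
from scipy.stats import pearsonr

# pearsonr returns (coefficient, p-value); keep only the coefficient
cor = lambda x, y: pearsonr(x, y)[0]

num_samples = 50000
# randint excludes the upper bound, so randint(1, 7, ...) gives faces 1..6
x = randint(1, 7, num_samples)
y = randint(1, 7, num_samples)
df = DataFrame({'x': x, 'y': y})

# deviations from the mean
df['a'] = df['x'] - df['x'].mean()
df['b'] = df['y'] - df['y'].mean()
# products and squares of the deviations
df['ab'] = df['a'] * df['b']
df['sqa'] = df['a']**2
df['sqb'] = df['b']**2

# Pearson correlation: sum(a*b) / sqrt(sum(a^2) * sum(b^2))
den = sqrt(df['sqa'].sum() * df['sqb'].sum())
correlation = df['ab'].sum() / den

print("correlation =", correlation)
print("cor(x,y) =", cor(x, y))
err = abs(correlation - cor(x, y))
print("error =", err)

# Output of one run (no seed is set, so the values change from run to run):
#correlation = 0.000423927126959
#cor(x,y) = 0.000423927126959
#error = 0.0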
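
For reference, the column-by-column calculation in ex9.py matches the standard formula for the Pearson correlation coefficient. The index i and the sample size n are notation added here, not names from the script. Columns 'a' and 'b' hold the deviations, 'ab' holds their products, and 'sqa' and 'sqb' hold their squares.

```latex
r = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2} \, \sum_{i=1}^{n} b_i^{2}}},
\qquad a_i = x_i - \bar{x}, \quad b_i = y_i - \bar{y}
```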
