Friday, February 27, 2015

R29. Discrete Cosine Transform in R

The Discrete Cosine Transform (DCT) is a real transform, whereas the FFT is a complex one. The relevant equations are given, for example, at Wikipedia.


The program below finds the DCT and inverse DCT using fft and ifft. For both fft and ifft, we have to solve a 2N-point problem. For example, for the DCT, x is joined with its reverse. Besides the fft and ifft, we have to apply phase terms and proper scaling constants.


The example x is of length 16. We plot x and its DCT. In R, a negative index indicates an element that we do not want to keep.

# ex29.R

dct <- function(X) {
  x <- c(X,rev(X))
  N <- 2*length(X)
  f0 <- 1/sqrt(2*N)
  f1 <- 1/sqrt(N)
  Y <- fft(x)
  kin <- 0:(N-1)
  Mul <- cos(pi/N*kin)-1i*sin(pi/N*kin)
  Y <- Y * Mul
  Y <- Y * f1
  Y[1] <- Y[1]*f0/f1
  Re(Y[1:(N/2)])
}

idct <- function(X) {
  N <- length(X)
  f0 <- 1/sqrt(4*N)
  f1 <- 1/sqrt(2*N)
  kin <- 0:(N-1)
  FTR <- X*cos(.5*pi/N*kin)
  FTI <- X*sin(.5*pi/N*kin)
  FTR <- FTR/f1
  FTI <- FTI/f1
  FTR[1] <- FTR[1] * f1/f0
  FTI[1] <- FTI[1] * f1/f0
  FTR1 <- c(FTR,0,rev(FTR[-1]))
  FTI1 <- c(FTI,0,rev(-1*FTI[-1]))
  y <- fft(FTR1+1i*FTI1, TRUE)/length(FTR1)
  Re(y[1:N])
}

x <- c(0:3,2:4,10:2)
y <- dct(x)
z <- idct(y)
err <- sd(z-x)
cat('err =',err,'\n')
plot(y, col = 'blue', type = 'b',
     xlab = 'n', ylab = 'x, y',
     main = 'x (red), y = dct(x) (blue) ')
points(x, col = 'red', type = 'b')
# err = 1.110223e-16 
    

Output:

py29. Discrete Cosine Transform in Python

The Discrete Cosine Transform (DCT) is a real transform, whereas the FFT is a complex one. The relevant equations are given, for example, at Wikipedia. Type 2 is the dct and type 3 is the idct. We have to convert these equations to a form suitable for the fft and ifft.
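
For reference, with the orthonormal scaling (norm = 'ortho') used below, the DCT-II and its inverse (DCT-III) can be written as:

Y[k] = f(k) * sum(n = 0 to N-1) x[n]*cos(pi*(2*n+1)*k/(2*N))

x[n] = sum(k = 0 to N-1) f(k)*Y[k]*cos(pi*(2*n+1)*k/(2*N))

where f(0) = sqrt(1/N) and f(k) = sqrt(2/N) for k > 0.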


The program below finds the DCT and inverse DCT using fft and ifft. For both fft and ifft, we have to solve a 2N-point problem. For example, for the DCT, x is joined with its reverse. Besides the fft and ifft, we have to apply phase terms and proper scaling constants.


The example x is of length 16. We plot x and its DCT. The main code runs only if the file is run directly, and not imported by other Python programs.


For comparison, we use the dct and idct functions from the scipy.fftpack module, with the aliases sdct and sidct.

# ex29.py

from __future__ import print_function, division
import numpy as np
from scipy.fftpack import dct as sdct, idct as sidct
from numpy.fft import fft, ifft
import matplotlib.pyplot as plt

def dct(X):
    x = np.concatenate([X,np.flipud(X)])
    N = 2*len(X)
    f0 = 1/np.sqrt(2*N)
    f1 = 1/np.sqrt(N)
    Y = fft(x)
    kin = np.arange(N)
    Mul = np.cos(np.pi/N*kin)-1j*np.sin(np.pi/N*kin)
    Y *= Mul
    Y *= f1
    Y[0] *= f0/f1
    return Y[:N//2].real

def idct(X):
    N = len(X)
    f0 = 1/np.sqrt(4*N)
    f1 = 1/np.sqrt(2*N)
    kin = np.arange(N)
    FTR = X*np.cos(.5*np.pi/N*kin)
    FTI = X*np.sin(.5*np.pi/N*kin)
    FTR /= f1
    FTI /= f1
    FTR[0] *= f1/f0
    FTI[0] *= f1/f0
    FTR1 = np.concatenate((FTR,np.zeros(1),np.flipud(FTR[1:])))
    FTI1 = np.concatenate((FTI,np.zeros(1),-1*np.flipud(FTI[1:])))
    y = ifft(FTR1+1j*FTI1)
    return y[:N].real

if __name__ == '__main__':
    x = np.array(list(range(4)) + list(range(2,5)) + list(range(10,1,-1)))
    y = dct(x)
    y1 = sdct(x.astype('float'), norm = 'ortho')
    z = idct(y)
    z1 = sidct(y1, norm = 'ortho')
    err1 = (z-x).std()
    err2 = (z1-x).std()
    print ('err1 =',err1)
    print ('err2 =',err2)
    plt.plot(x,'ro--', markerfacecolor='w')
    plt.plot(y,'bo--', markerfacecolor='w')
    plt.xlabel('n')
    plt.ylabel('x,y')
    plt.title('x (red), y = dct(x) (blue)')
    plt.xlim(-.5,15.5)
    plt.show()

# err1 = 1.43874580946e-15
# err2 = 4.52626554495e-16

Output:

Thursday, February 26, 2015

py28. Fast Fourier Transform in Python

The Discrete Fourier Transform (DFT) is the pair of transforms for the transform and its inverse:

X[k] = sum(n = 0 to N-1) x[n]*exp(-i*2*pi*k*n/N)

x[n] = (1/N) * sum(k = 0 to N-1) X[k]*exp(+i*2*pi*k*n/N)

The Fast Fourier Transform (FFT) computes the DFT with an algorithm that exploits the symmetry of the equations.
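
A minimal radix-2 sketch of that symmetry (fft_recursive is an illustrative helper, assuming the length is a power of two, as with the 64-length example below):

import numpy as np

def fft_recursive(x):
    # Cooley-Tukey: the DFT splits into the DFTs of the even- and
    # odd-indexed halves, combined with twiddle factors.
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    E = fft_recursive(x[0::2])               # even-indexed samples
    O = fft_recursive(x[1::2])               # odd-indexed samples
    w = np.exp(-2j*np.pi*np.arange(N//2)/N)  # twiddle factors
    return np.concatenate([E + w*O, E - w*O])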


We use the fft function from the numpy.fft module to solve for the transform of a 64-length numpy array. It returns a complex numpy array (dtype = 'complex'), which is sent to the ifft function in the same module. We compare this to the result from dft and inverse dft. The results are the same, with errors on the order of double-precision tolerance.

# ex28.py

from __future__ import print_function, division
import numpy as np
from numpy.fft import fft, ifft
from numpy import pi, exp
def dft(x, inverse = False):
    N = len(x)
    s,d = (1,N) if inverse else (-1,1)
    X = np.zeros(N, dtype='complex')
    for k in range(N):
        for n in range(N):
            X[k] += x[n]*exp(1j*s*2*pi*k*n/N)
    return X/d

x = np.array(list(range(32)) + list(range(31,-1,-1)))
y = fft(x)
y1 = dft(x)
z = ifft(y)
z1 = dft(y1,True)
xR = z.real
x1R = z1.real
error_fft = (x-xR).std()
error_dft = (x-x1R).std()
print("error_fft = %e\nerror_dft = %e" % (error_fft,error_dft))
# error_fft = 1.122558e-15
# error_dft = 1.060977e-13

R28. Fast Fourier Transform in R

The Discrete Fourier Transform (DFT) is the pair of transforms for the transform and its inverse:

X[k] = sum(n = 0 to N-1) x[n]*exp(-i*2*pi*k*n/N)

x[n] = (1/N) * sum(k = 0 to N-1) X[k]*exp(+i*2*pi*k*n/N)

The Fast Fourier Transform (FFT) computes the DFT with an algorithm that exploits the symmetry of the equations.


We use the fft function to solve for the transform of a 64-length vector. It returns a complex vector, which is again sent to the fft function but with inverse set to TRUE, so the sign in the exponent is plus. We compare this to the result from dft and inverse dft. The results are the same, with errors on the order of double-precision tolerance.

# ex28.R

dft <- function(x, inverse = FALSE) {
  s <- if(inverse) 1 else -1
  N <- length(x)
  X <- complex(N)
  for (k in 0:(N-1)) {
    for (n in 0:(N-1)) {
      X[k+1] = X[k+1] + x[n+1]*exp(s*1i*2*pi*k*n/N)
    }
  }
  X
}

x <- c(0:31,31:0)
y <- fft(x)
y1 <- dft(x)
z <- fft(y,TRUE)/length(y)
z1 <- dft(y1,TRUE)/length(y1)
xR <- Re(z)
x1R <- Re(z1)
error.fft <- sd(x-xR)
error.dft <- sd(x-x1R)
out <- sprintf("error.fft = %e\nerror.dft = %e",
               error.fft, error.dft)
cat(out)
# error.fft = 2.976939e-15
# error.dft = 1.067366e-13

py27. Cauchy Distribution in Python


The Cauchy Cumulative Distribution Function is:

P(X<=x) = (1/pi)*arctan((x-x0)/gamma) + 1/2

We use this formula, as well as the scipy.stats.cauchy.cdf function, in the plot, and the two lines are drawn with different linewidths so that both stay visible.


# ex27.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
from numpy import linspace, pi, arctan as atan
from scipy.stats import cauchy
N = 500
gamma = 2
x0 = 3
x = linspace(-5*gamma+x0,+5*gamma+x0,num = N)
y = 1/pi*atan((x-x0)/gamma) + 0.5
y1 = cauchy.cdf(x, loc = x0, scale = gamma)
title = 'Cauchy CDF, gamma = %.1f, and x0 = %.1f' % (gamma, x0)
plt.plot(x,y, 'k', lw = 10)
plt.plot(x,y1,'g', lw = 5)
plt.title(title)
plt.show()

Output:

R27. Cauchy Distribution in R


The Cauchy Cumulative Distribution Function is:

P(X<=x) = (1/pi)*arctan((x-x0)/gamma) + 1/2

We use this formula, as well as the pcauchy function, in the plot, and the two lines are drawn with different linewidths so that both stay visible.


# ex27.R
N <- 500
gamma <- 2
x0 <- 3
x <- seq(-5*gamma+x0,+5*gamma+x0,length.out = N)
y <- 1/pi*atan((x-x0)/gamma) + 0.5
main <- sprintf('Cauchy CDF, gamma = %.1f, and x0 = %.1f',
                gamma, x0)
plot(x,y, col = 'black', lwd = 10)
lines(x,pcauchy(x,location = x0, scale = gamma),
      col = 'green', lwd = 5)
title(main = main)

Output:

Monday, February 23, 2015

py26. Magic Function in Python

Magic matrices have the sum of every row and the sum of every column equal to some number. In addition, the sums of the primary and secondary diagonals equal that number, called the magic number. The magic number depends only on the size of the matrix.


The program will solve magic matrices of odd size, such as 11 by 11 (N = 11).


For a magic matrix of size 11 by 11, we have the elements from 1 to 11 squared. The magic number is the mean of all elements from 1 to N squared, multiplied by N.
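
A quick arithmetic check of that statement (the mean of 1 to N^2 is (N^2+1)/2):

N = 11
print(N*(N*N + 1)//2)   # 11*61 = 671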


We use the numpy.zeros function to create an N*N matrix and then loop over the two dimensions to place the individual elements, {1,2,3,...,N^2}, into their correct positions. Magic matrices are not unique; there are many others, depending on the algorithm, with the same sum property. However, the magic number for a particular N is unique.


Lastly, we check all the relevant sums.

# ex26.py
from __future__ import print_function, division
import numpy as np

N = 11

magic_num = sum(range(1,N**2+1))/N

def mat(i,j):
    i = i % N
    j = j % N
    return i,j
    
A = np.zeros((N,N))
st = (N-1)//2, (N+1)//2

for k2 in range(N):
    st = st[0]+1,st[1]-1
    for k in range(N):
        A[mat(st[0]+k,st[1]+k)] = k+k2*N+1

A = A.astype('int')
AR = np.fliplr(A)
print('magic(%d) = \n%s' % (N,A))
print('\nmagic number = %d' % magic_num)
print('sum primary diag = %d' % sum(A.diagonal()))
print('sum secondary diag = %d' % sum(AR.diagonal()))
for k2 in range(N):
    print('sum row %d = %d' % (k2,sum(A[k2,:])))
    print('sum col %d = %d' % (k2,sum(A[:,k2])))

#magic(11) = 
#[[ 56 117  46 107  36  97  26  87  16  77   6]
# [  7  57 118  47 108  37  98  27  88  17  67]
# [ 68   8  58 119  48 109  38  99  28  78  18]
# [ 19  69   9  59 120  49 110  39  89  29  79]
# [ 80  20  70  10  60 121  50 100  40  90  30]
# [ 31  81  21  71  11  61 111  51 101  41  91]
# [ 92  32  82  22  72   1  62 112  52 102  42]
# [ 43  93  33  83  12  73   2  63 113  53 103]
# [104  44  94  23  84  13  74   3  64 114  54]
# [ 55 105  34  95  24  85  14  75   4  65 115]
# [116  45 106  35  96  25  86  15  76   5  66]]
#
#magic number = 671
#sum primary diag = 671
#sum secondary diag = 671
#sum row 0 = 671
#sum col 0 = 671
#sum row 1 = 671
#sum col 1 = 671
#sum row 2 = 671
#sum col 2 = 671
#sum row 3 = 671
#sum col 3 = 671
#sum row 4 = 671
#sum col 4 = 671
#sum row 5 = 671
#sum col 5 = 671
#sum row 6 = 671
#sum col 6 = 671
#sum row 7 = 671
#sum col 7 = 671
#sum row 8 = 671
#sum col 8 = 671
#sum row 9 = 671
#sum col 9 = 671
#sum row 10 = 671
#sum col 10 = 671

R26. Magic Function in R

Magic matrices have the sum of every row and the sum of every column equal to some number. In addition, the sums of the primary and secondary diagonals equal that number, called the magic number. The magic number depends only on the size of the matrix.


The program will solve magic matrices of odd size, such as 11 by 11 (N = 11).


For a magic matrix, such as that of size 11 by 11, we have the elements from 1 to 11 squared. The magic number is the mean of all elements from 1 to N squared, multiplied by N. For N = 11, that is 11*(121+1)/2 = 671.


We use the matrix function to create a matrix and then loop over the two dimensions to place the individual elements, {1,2,3,...,N^2}, into their correct positions. Magic matrices are not unique; there are many others, depending on the algorithm, with the same sum property. However, the magic number for a particular N is unique.


Lastly, we check all the relevant sums.

# ex26.R

N <- 11

magic_num <- sum(1:N^2)/N

mat <- function(i,j){
  i <- (i-1) %% N
  j <- (j-1) %% N
  c(x = i+1, y = j+1)
}
    
A <- matrix(numeric(N^2), nrow = N)
st <- c(N-(N-1)/2,(N+3)/2)

for (k2 in 1:N) {
  st <- c(st[1]+1,st[2]-1)
  for (k in 1:N) {
    cxy <- mat(st[1]+k-1,st[2]+k-1)
    A[cxy["x"],cxy["y"]] <- k+(k2-1)*N
  }
}

AR <- matrix(numeric(N^2),nrow = N)
for (k2 in 1:N) {
  AR[k2,] <- rev(A[k2,])
}

cat('\nmagic(', N, ') = \n', sep = '')
print(A)
cat('\nmagic number =', magic_num)
cat('\nsum primary diag =', sum(diag(A)))
cat('\nsum secondary diag =', sum(diag(AR)))
for (k2 in 1:N) {
  cat('\nsum row', k2, '=', sum(A[k2,]))
  cat('\nsum col', k2, '=', sum(A[,k2]))
}
# magic(11) = 
#   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
# [1,]   56  117   46  107   36   97   26   87   16    77     6
# [2,]    7   57  118   47  108   37   98   27   88    17    67
# [3,]   68    8   58  119   48  109   38   99   28    78    18
# [4,]   19   69    9   59  120   49  110   39   89    29    79
# [5,]   80   20   70   10   60  121   50  100   40    90    30
# [6,]   31   81   21   71   11   61  111   51  101    41    91
# [7,]   92   32   82   22   72    1   62  112   52   102    42
# [8,]   43   93   33   83   12   73    2   63  113    53   103
# [9,]  104   44   94   23   84   13   74    3   64   114    54
# [10,]   55  105   34   95   24   85   14   75    4    65   115
# [11,]  116   45  106   35   96   25   86   15   76     5    66
# 
# magic number = 671
# sum primary diag = 671
# sum secondary diag = 671
# sum row 1 = 671
# sum col 1 = 671
# sum row 2 = 671
# sum col 2 = 671
# sum row 3 = 671
# sum col 3 = 671
# sum row 4 = 671
# sum col 4 = 671
# sum row 5 = 671
# sum col 5 = 671
# sum row 6 = 671
# sum col 6 = 671
# sum row 7 = 671
# sum col 7 = 671
# sum row 8 = 671
# sum col 8 = 671
# sum row 9 = 671
# sum col 9 = 671
# sum row 10 = 671
# sum col 10 = 671
# sum row 11 = 671
# sum col 11 = 671

Saturday, February 21, 2015

py25. Gamma Distribution in Python

Like the Poisson and Exponential distributions, the Gamma distribution also depends on the exponential series.


Besides computing the CDF directly (for integer shape k, P(X<=x) = 1 - sum(i = 0 to k-1) exp(-x/s)*(x/s)^i/i!), we can use the function scipy.stats.gamma.cdf.
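
As a quick check of that connection (a small sketch; k and s are the example's shape and scale, and x = 10.0 is an arbitrary point):

from scipy.stats import gamma, poisson
k, s, x = 7, 2, 10.0
print(gamma.cdf(x, k, scale = s))    # Gamma CDF at x
print(1 - poisson.cdf(k-1, x/s))     # same value, via the Poisson sum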

# ex25.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gamma
# shape, scale parameters
k,s = 7,2

def get_num(x,kv):
    xp = x/s
    coeff = np.exp(-xp)
    if kv == 0: return coeff
    cv = xp/np.arange(1,kv+1)
    return coeff*np.prod(cv)

def gamma_dis(x, calc):
    dis_calc = np.zeros(len(x))
    for i in range(len(x)):
        S = [get_num(x[i],kv) for kv in range(k)]
        dis_calc[i] = 1 - sum(S)
    dis_R = gamma.cdf(x, k, scale = s)
    if calc: return dis_calc
    else: return dis_R

main = ('Gamma Cumulative Distribution Function'
        '\nShape = 7, Scale = 2')
x = np.arange(40, step = .1)
plt.plot(x, gamma_dis(x,False),'r')
plt.xlabel('x')
plt.ylabel('P')
plt.title(main)
plt.plot(x, gamma_dis(x,True), 'b')
plt.show()

Output:

Thursday, February 19, 2015

R25. Gamma Distribution in R


Like the Poisson and Exponential distributions, the Gamma distribution also depends on the exponential series.


Besides computing the CDF directly (for integer shape k, P(X<=x) = 1 - sum(i = 0 to k-1) exp(-x/s)*(x/s)^i/i!), we can use the function pgamma.

# ex25.R
# shape, scale parameters
k <- 7
s <- 2

get_num <- function(x,kv) {
  xp <- x/s
  coeff <- exp(-xp)
  if (kv == 0) return(coeff) 
  cv <- xp/1:kv
  coeff*prod(cv)
}

gamma.dis <- function(x, calc) {
  dis.calc <- numeric(length(x))
  for (i in 1:length(x)) {
    S <- mapply(get_num,x[i],0:(k-1))
    dis.calc[i] <- 1 - sum(S)
  }
  dis.R <- pgamma(x, shape = k, scale = s)
  if (calc) return(dis.calc) else return(dis.R)
}

main <- paste('Gamma Cumulative Distribution Function',
              'Shape = 7, Scale = 2', sep = '\n', collapse = '')
x <- seq(0,40,.1)
plot(x, gamma.dis(x,FALSE), type = 'l', col = 'red', 
     ylab = 'P', main = main)
lines(x, gamma.dis(x,TRUE), col = 'blue')

Output:

R24. Exponential Distribution in R

The exponential distribution gives the probability of an event in time (or distance, etc.), usually denoted by x, given a rate lambda. The Poisson distribution gives the number of events at a given rate. Rate and time are reciprocals, so the expected value of x is 1/lambda.


The exponential distribution cumulative distribution function, CDF, is:


P(X<=x) = 1 - exp(-lambda*x)


Three important x values are:


x = 0. P(X<=0) = 0, as exp(0) = 1; the probability at time 0 is 0, which is where the function starts.


x = 1/lambda. P(X<=1/lambda) = 0.632, as exp(-1) = 0.368; that is a little less than 2/3 probability.


x = Infinity. P(X<=Infinity) = 1, as exp(-Infinity) = 0, since the event has to happen in some finite time.


The probability density function is the derivative with respect to x, lambda*exp(-lambda*x).


The function pexp is used for the exponential CDF.

# ex24.R
N <- 1.1
x <- seq(0,N,.01)

exp.dis <- function(lambda, calc) {
  dis.calc <- 1 - exp(-lambda*x)
  dis.R <- pexp(x,lambda)
  if (calc) {
    r <- dis.calc
  } else {
    r <- dis.R
  }
  r
}

c0 <- 0.8
plot(x,exp.dis(1, FALSE), col = rgb(0,1-c0,c0),
     type = 'l', xlab = 'x', ylab = 'p',
     main = "Exponential Distribution (lambda = 9,7,5,3,1)")
abline(h = 1 - exp(-1), col = 'red')
for (i in seq(3,9,2)) {
  c0 <- c0-.1
  pr <- (i-1)/2 + 1
  points(x,exp.dis(i, TRUE), 
         col = rgb(0,1-c0,c0), type = 'l')
}
err <- exp.dis(9,TRUE)-exp.dis(9,FALSE)
cat('err for lambda of 9 =', sd(err))
# err for lambda of 9 = 3.462564e-17

Output:

py24. Exponential Distribution in Python

The exponential distribution gives the probability of an event in time (or distance, etc.), usually denoted by x, given a rate lambda. The Poisson distribution gives the number of events at a given rate. Rate and time are reciprocals, so the expected value of x is 1/lambda.


The exponential distribution cumulative distribution function, CDF, is:


P(X<=x) = 1 - exp(-lambda*x)


Three important x values are:


x = 0. P(X<=0) = 0, as exp(0) = 1; the probability at time 0 is 0, which is where the function starts.


x = 1/lambda. P(X<=1/lambda) = 0.632, as exp(-1) = 0.368; that is, we have a little less than 2/3 probability.


x = Infinity. P(X<=Infinity) = 1, as exp(-Infinity) = 0, since the event has to happen in some finite time.


The probability density function is the derivative with respect to x, lambda*exp(-lambda*x).


The function scipy.stats.expon.cdf is used for the exponential CDF. In the call to this function, we have to pass the mean value, 1/lambda, as the scale. Again, since lambda is a reserved word in Python, we use lamb_da in the example.
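
Those three values can be checked directly with the cdf (a small sketch; lam is an arbitrary rate):

from scipy.stats import expon
import numpy as np
lam = 3.0                               # arbitrary rate
print(expon.cdf(0, scale = 1/lam))      # 0.0
print(expon.cdf(1/lam, scale = 1/lam))  # 0.632..., i.e. 1 - exp(-1)
print(1 - np.exp(-1))                   # the same value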

# ex24.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import expon

N = 1.1
x = np.arange(N+.01,step=.01)

def exp_dis(lamb_da, calc):
    dis_calc = 1 - np.exp(-lamb_da*x)
    dis_R = expon.cdf(x,scale = 1/lamb_da)
    if calc: return dis_calc
    else: return dis_R

c0 = 0.8
plt.plot(x,exp_dis(1, False), color = (0,1-c0,c0))
plt.xlabel('x')
plt.ylabel('p')
plt.title('Exponential Distribution (lambda = 9,7,5,3,1)')
plt.plot(x, [1-np.exp(-1)]*len(x), color = 'red')
for i in 3,5,7,9:
    c0 -= .1
    plt.plot(x,exp_dis(i, False),
         color = (0,1-c0,c0))
plt.show()         
err = exp_dis(9,True)-exp_dis(9,False)
print('err for lambda of 9 =', err.std())
# err for lambda of 9 = 3.03591780425e-17

Output:

Wednesday, February 18, 2015

R23. Poisson Distribution in R

Given a rate lambda for an event in some interval, the Poisson distribution gives the probability of k such events in the same interval. We know that the expected value of k has to be lambda.


For a particular k, we have the formula lambda^k/factorial(k) multiplied by a normalization constant. We can rewrite lambda^k/factorial(k) as:


(lambda/k)*(lambda/(k-1))*(lambda/(k-2))*...*(lambda/2)*(lambda) for k not 0, and 1 for k equal to 0


The function get_num finds this value multiplied by the normalization constant, which is exp(-lambda). With the normalization, the total probability over all k is 1.


Besides calculating it directly, we can use the dpois function.

# ex23.R

N <- 20

get_num <- function(lambda,kv) {
  coeff <- exp(-lambda)
  if (kv == 0) return(coeff) 
  cv <- lambda/1:kv
  coeff*prod(cv)
}

poisson <- function(lambda, calc) {
  k <- 0:N
  dis.calc <- mapply(get_num,lambda,k)
  dis.R <- dpois(k,lambda)
  if (calc) {
    r <- dis.calc
  } else {
    r <- dis.R
  }
  r
}

c0 <- 0.8
pr <- 6
plot(0:N,poisson(1, FALSE), col = rgb(0,1-c0,c0),
     type = 'o', pch = pr, lty = pr,
     xlab = 'k', ylab = 'p',
     main = "Poisson Distribution (lambda = 1,3,5,7,9)")
for (i in seq(3,9,2)) {
  c0 <- c0-.1
  pr <- (i-1)/2 + 1
  points(0:N,poisson(i, TRUE), 
         col = rgb(0,1-c0,c0), type = 'o', pch = pr, lty = pr)
}
err <- poisson(9,TRUE)-poisson(9,FALSE)
cat('err for lambda of 9 =', sd(err))
# err for lambda of 9 = 1.755738e-17

Output:

py23. Poisson Distribution in Python

Given a rate lambda for an event in some interval, the Poisson distribution gives the probability of k such events in the same interval. We know that the expected value of k has to be lambda.


For a particular k, we have the formula lambda^k/factorial(k) multiplied by a normalization constant. We can rewrite lambda^k/factorial(k) as:


(lambda/k)*(lambda/(k-1))*(lambda/(k-2))*...*(lambda/2)*(lambda) for k not 0, and 1 for k equal to 0
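
A quick numeric check that this product form agrees with exp(-lambda)*lambda^k/factorial(k) (lam and k are arbitrary here):

import numpy as np
from math import exp, factorial
lam, k = 5.0, 7
direct = exp(-lam)*lam**k/factorial(k)
as_prod = exp(-lam)*np.prod(lam/np.arange(1, k+1))
print(direct, as_prod)   # the two agree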


The function get_num finds this value multiplied by the normalization constant, which is exp(-lambda). With the normalization, the total probability over all k is 1.


Besides calculating it directly, we can use the scipy.stats.poisson.pmf function. Since lambda is a keyword in Python, it is written as lamb_da in the example.

# ex23.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import poisson
N = 20
k = np.arange(0,N+1)

def get_num(lamb_da,k):
    coeff = np.exp(-lamb_da)
    if k==0: return coeff
    cv = lamb_da/np.arange(1,k+1)
    return coeff*np.prod(cv)
    

def pois(lamb_da, calc):
    dis_calc = np.array([get_num(lamb_da,i) for i in range(N+1)])
    dis_R = poisson.pmf(k,lamb_da)
    if calc: return dis_calc
    else: return dis_R

c0 = 0.8
pr = 0
st = ['o','v','^','s','d']
plt.plot(k,pois(1, False), st[pr] + '--', color = (0,1-c0,c0))
for i in 3,5,7,9:
    c0 -= .1
    pr = (i-1)//2
    plt.plot(k,pois(i, True), st[pr] + '--', color = (0,1-c0,c0))
    plt.xlabel('k')
    plt.ylabel('p')
    plt.title('Poisson Distribution (lambda = 1,3,5,7,9)')

err = pois(9,True) - pois(9,False)
print('err for lambda of 9 =', err.std())
# err for lambda of 9 = 9.4458398513e-17

Output:

py22. One Sample t-test in Python

A one-sample t-test is conducted over a range of mean values. The null hypothesis is that the test mean is the true mean of the population.


The sample is of length 500 and is a random sample from a standard normal distribution, which has a mean of 0. Thus, we expect values far from 0 to be rejected. For the normal sample, we use the scipy.stats.norm.rvs function.


If you uncomment the print() lines, you will get intermediate results. You can use the scipy.stats.ttest_1samp function or a calculation based on the t-distribution. We have a two-sided alternative hypothesis, that is, we sum over both tails. This is the default for ttest_1samp from the scipy.stats module. The t distribution from this module is used, under the alias st.
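
The two tail branches in the code below can also be collapsed into one line using the absolute value of the statistic (a small sketch; t_crit and df are arbitrary values):

from scipy.stats import t as st
t_crit, df = 2.1, 499                  # arbitrary example values
p = 2*(1 - st.cdf(abs(t_crit), df))    # two-sided: double the tail area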

# ex22.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm, ttest_1samp, t as st
N = 500
mu = np.arange(-.2,+.201,.001) # test for the population mean
pr = np.zeros(len(mu))
rej = np.zeros(len(mu))
samp = norm.rvs(size = N)
mean_samp = samp.mean()
sd_samp = samp.std(ddof = 1)
for i in range(len(mu)):
    m = mu[i]
    t_crit = (mean_samp-m)/(sd_samp/np.sqrt(N))
    if t_crit>=0:
        p = 2*(1-st.cdf(t_crit, df = N-1))
    else:
        p = 2*st.cdf(t_crit, df = N-1 )
    pr[i] = p
    res = ttest_1samp(samp, popmean = m)
    # t1,t2 tuples for prints
    t1 = t_crit, res[0]
    t2 = p, res[1]
    #print('\n-- t_crit = %.5f \tres[0] = %.5f' % t1)
    #print('\n-- p = %.5f \tres[1] = %.5f' % t2)
    if p<0.05:
        #print('\nmu =',m,' rejected')
        rej[i] = 1
    else:
        #print('\nmu =',m,' not rejected')
        rej[i] = 0
plt.plot(mu,rej,'r')
plt.plot(mu,pr,'b')
plt.plot(mu,len(mu)*[0.05],'g')
plt.title('Means rejected (p-value: blue), (threshold: green)')
plt.xlabel('mean')
plt.ylabel('Reject = 1 (red)')
plt.ylim(-.05,1.05)
plt.show()

Output:

Tuesday, February 17, 2015

R22. One Sample t-test in R

A one-sample t-test is conducted over a range of mean values. The null hypothesis is that the test mean is the true mean of the population.


The sample is of length 500 and is a random sample from a standard normal distribution, which has a mean of 0. Thus, we expect values far from 0 to be rejected.


If you uncomment the cat() lines, you will get intermediate results. You can use the t.test function or a calculation based on the t-distribution. We have a two-sided alternative hypothesis, that is, we sum over both tails. This is the default for t.test.

# ex22.R
N <- 500
mu <- seq(-.2,+.2,.001) # test for the population mean
pr <- numeric(length(mu))
rej <- numeric(length(mu))
samp <- rnorm(N)
mean.samp <- mean(samp)
sd.samp <- sd(samp)
for (i in 1:length(mu)) {
  m <- mu[i]
  t.crit <- (mean.samp-m)/(sd.samp/sqrt(N))
  if (t.crit>=0) {
    p <- 2*(1-pt(t.crit, df = N-1))
  } else {
    p <- 2*pt(t.crit, df = N-1 )
  }
  pr[i] <- p
  res <- t.test(samp, mu = m)
  #cat('\n-- t.crit = ',t.crit,'\tres$statistic =',res$statistic)
  #cat('\n-- p =',p,'\tres$p.value =', res$p.value)
  if (p<0.05) {
    #cat('\nmu =',m,' rejected')
    rej[i] <- 1
  } else {
    #cat('\nmu =',m,' not rejected')
    rej[i] <- 0
  }
}
plot(mu, rej, type = 'l',
     main = "Means rejected (p-value: blue), (threshold: green)",
     col = 'red', xlab = 'mean', ylab = "Reject = 1 (red)")
lines(mu, pr, col = 'blue')
abline(h = .05, col = 'green')

Output:

py21. Negative Binomial Distribution in Python

In the negative binomial distribution, we find the probability of k successes in n trials, with the requirement that the last trial be a success. This requirement ensures the total probability is 1 and that there is no double counting.


Unlike the binomial distribution, in the negative binomial, k is fixed. Instead of using n, we use r, the number of failures, as the x-parameter, which is just n-k.


Thus, if we have 4 trials and we want 3 successes, we can have these combinations (p = success, q = failure):


p p q p


p q p p


q p p p


The above can be found from choose(3,2) = 3. The function scipy.misc.comb is used with the alias choose.
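
A brute-force check of that count (a small sketch using itertools):

from itertools import product
# length-4 sequences with 3 successes ('p') whose last trial is a
# success; the count should equal choose(3,2) = 3
seqs = [s for s in product('pq', repeat = 4)
        if s.count('p') == 3 and s[-1] == 'p']
print(len(seqs))   # 3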


Thus, if n is the number of trials and k is the number of successes, we have choose(n-1,k-1) combinations. This is due to the loss of freedom in the final trial. Each of these combinations has the probability p^k*q^(n-k), where q, the probability of failure, is 1-p.


We can rewrite the formulas in terms of failures, r, where r = n - k, giving P(r) = choose(r+k-1,k-1)*p^k*(1-p)^r. This distribution goes from r = 0 to infinity; however, it goes to zero very fast. The reason we change the formulas to depend on k and r is so we can compare with the Python probability function for the negative binomial distribution, scipy.stats.nbinom.pmf.

# ex21.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import nbinom
from scipy.misc import comb as choose
N = 12
p = 0.55
k = 3 
suc = np.zeros(N)
for r in range(N):
    suc[r] = choose(r+k-1,k-1)*p**k*(1-p)**r

dis = nbinom.pmf(range(N),k,p)
# dis == suc
err = sum(dis-suc)
print("err =",err)
plt.bar(range(N),suc, color='gray')
plt.xlabel('r fails')
plt.ylabel('{} suc with 0:{} fails'.format(k,N-1))
plt.title('Negative Binomial Distribution, p = {}'.format(p))
plt.show()
# err = -3.0617869351e-16

Output:

Monday, February 16, 2015

R21. Negative Binomial Distribution in R

In the negative binomial distribution, we find the probability of k successes in n trials, with the requirement that the last trial be a success. This requirement ensures the total probability is 1 and that there is no double counting.


Thus, if we have 4 trials and we want 3 successes, we can have these combinations (p = success, q = failure):


p p q p


p q p p


q p p p


The above can be found from choose(3,2) = 3.


Thus, if n is the number of trials and k is the number of successes, we have choose(n-1,k-1) combinations. This is due to the loss of freedom in the final trial. Each of these combinations has the probability p^k*q^(n-k), where q, the probability of failure, is 1-p.


We can rewrite the formulas in terms of failures, r, where r = n - k, giving P(r) = choose(r+k-1,k-1)*p^k*(1-p)^r. This distribution goes from r = 0 to infinity; however, it goes to zero very fast. The reason we change the formulas to depend on k and r is so we can compare with the R function for the negative binomial distribution, dnbinom.

# ex21.R
N <- 12
p <- 0.55
k <- 3 
suc <- numeric(N)
for (r in 0:(N-1)) {
  suc[r+1] <- choose(r+k-1,k-1)*p^k*(1-p)^r
}
dis <- dnbinom(0:(N-1), size = k, prob = p)
# dis == suc
err <- sum(dis-suc)
cat("err =",err)
names(suc) <- 0:(N-1)
barplot(suc, xlab = 'r fails',
        ylab = paste(k,'suc with 0:',N-1,'fails',collapse=''),
        main = 'Negative Binomial Distribution',
        sub = paste('p =',p, collapse=''))
# err = -1.539567e-16

Output:

Sunday, February 15, 2015

R20. qqnorm in R

The qqnorm function is used for the qqnorm plots.


The quantiles of x are plotted against the normal distribution quantiles. The probabilities of the quantiles are slightly different, as you can see by printing prob1 and prob2. This is because the normal is infinite in x, while any sample is finite. For the sample, the 0% and 100% quantiles are known: the minimum and maximum values. For the theoretical normal, there are no minimum and maximum values (-Inf and +Inf).


The calculated values, as well as those from qqnorm, are plotted. Of course, there is no difference. qqnorm returns a list of x and y coordinates, which we can compare with the calculated ones. The compare library is used, which may be installed with the install.packages function. The reason the identical function cannot be used is that comparing real numbers leads to small errors, in the 10^-16 range, for double-precision numbers.

# ex20.R
library(compare)
N <- 100
nm <- c('normal','beta','uniform')
ld <- 1
if (ld==1) {
  y <- rnorm(N)
} else if (ld == 2) {
  y <- rbeta(N,2,5)
} else if (ld == 3) {
  y <- runif(N,-.1,3)
}
z <- (y-mean(y))/sd(y)
prob1 <- seq(0, 1, length.out = N)
prob2 <- seq(1/(2*N), by = 1/N, along.with = prob1)
zs <- quantile(sort(z),probs = prob1)
par(mfrow = c(1,2))
pz <- qnorm(prob2)
g <- qqnorm(sort(z))
qqline(sort(z))
mtext(nm[ld], line = -1)
plot(pz,sort(z),main = "Calculated")
abline(0,1)
mtext(nm[ld], line = -1)
print(compare(g$y,sort(z)))
print(compare(g$x,pz))
# TRUE
# TRUE

Output:

py20. qqnorm in Python

The qqplot function from statsmodels.api is used for the qqnorm plots.


The percentiles of x are plotted against the normal distribution values at the same percentiles, or quantiles.


The calculated plot and the one from statsmodels.api are both drawn. Of course, there is no difference, as we will also see in the R program.

# ex20.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
from statsmodels.api import qqplot
N = 100
nm = ['normal','beta','uniform']
ld = 0
if ld == 0:
    y = np.random.normal(size = N)
elif ld == 1:
    y = np.random.beta(2,5,N)
elif ld == 2:
    y = np.random.uniform(-.1,3,N)

z = (y-y.mean())/y.std()
z.sort()
prob1 = np.linspace(0,100, num = N)
prob2 = np.arange(1/(2*N), 1, step = 1/N)
zs = np.percentile(z,prob1)
pz = norm.ppf(prob2)
plt.plot(pz,zs,'go',pz,pz,'k')
plt.title('Calculated - ' + nm[ld],)
plt.xlabel('pz')
plt.ylabel(' zs')
qqplot(z, fit = True, line='45')
plt.title('qqnorm - ' + nm[ld])
plt.show()

Output:

Saturday, February 14, 2015

R19. Geometric Distribution in R

The Geometric Distribution gives the chance of the first success occurring at trial n. The probability of success at the first trial is p.


The probability of the first success at trial n is (1-p)^(n-1)*p.


The formula is compared with dgeom, which should give the same result.

# ex19.R
prob <- 0.2
N <- 12
p <- numeric(N)
d <- numeric(N)
for (k in 1:N) {
  p[k] <- (1-prob)^(k-1)*prob
  d[k] <- dgeom(k-1,prob)
}
names(p) <- 1:N
barplot(p, xlab = 'n', ylab = 'p(n)', main = 'First Success at n')
cat('Probability success after', N ,'trials is',
    (1-sum(p)),'\naccording to formula.\n')
cat('Probability success after', N ,'trials is',
    (1-sum(d)),'\naccording to dgeom')
# Probability success after 12 trials is 0.06871948
# according to formula.
# Probability success after 12 trials is 0.06871948
# according to dgeom

Output:

py19. Geometric Distribution in Python

The Geometric Distribution gives the chance of the first success occurring at trial n. The probability of success at the first trial is p.


The probability of the first success at trial n is (1-p)^(n-1)*p. In Python, the ^ operation is performed by **.


The formula is compared with scipy.stats.geom.pmf, which should give the same result.
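
The leftover probability printed below has a closed form: no success in any of the N trials is (1-p)^N. A quick check with the example's values:

p, N = 0.2, 12
print((1 - p)**N)   # 0.068719476736, the value printed below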

# ex19.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import geom
prob = 0.2
N = 12
p = np.zeros(N)
d = np.zeros(N)
for k in range(1,N+1):
    p[k-1] = (1-prob)**(k-1)*prob
    d[k-1] = geom.pmf(k,prob)
plt.bar(range(1,13),p, color = 'gray')
plt.xlim(1,13)
plt.xlabel('n')
plt.ylabel('p(n)')
plt.title('First Success at n')
plt.show()

print('Probability success after', N ,'trials is',
      (1-sum(p)),'\naccording to formula.\n')
print('Probability success after', N ,'trials is',
      (1-sum(d)),'\naccording to geom.pmf')
# Probability success after 12 trials is 0.068719476736
# according to formula.
# Probability success after 12 trials is 0.068719476736
# according to geom.pmf

Output:

Friday, February 13, 2015

R18. Student t Distribution in R

We can use the function dt for the Student t-Distribution to get the probability density function.


The Student t-Distribution is for the case where the sample comes from a normal population but the sample size is less than 30.


The t-Distribution comes very near the normal as n reaches 30. It is hard to spot the difference in the graph.

# ex18.R
x <- seq(-3,+3, length = 500)
n <- 5
y <- dt(x, n-1)
plot(x, y, ylim = c(0,.4), col = 'red', type = 'l',
     ylab = 'Probability Density Function',
     main = 't dist, n=5:red, n=10:green, n=30:blue, norm:magenta')
n <- 10
y <- dt(x, n-1)
lines(x, y, col = 'green')
n <- 30
y <- dt(x, n-1)
lines(x, y, col = 'blue')
y <- dnorm(x)
lines(x, y, col = 'magenta')

Output:

py18. Student t Distribution in Python

We can use the function scipy.stats.t for the Student t-Distribution. The function t is imported under the alias st. We use the st.pdf function for the probability density function.


The Student t-Distribution is for the case where the sample comes from a normal population but the sample size is less than 30.


The t-Distribution comes very near the normal as n reaches 30. It is hard to spot the difference in the graph.

# ex18.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm, t as st
x = np.linspace(-3, +3, num = 500)
n = 5
y = st.pdf(x, n-1)
plt.plot(x, y, 'r')
plt.xlabel('x')
plt.ylabel('Probability Density Function')
plt.title('t dist, n=5:red, n=10:green, n=30:blue, norm:magenta')
n = 10
y = st.pdf(x, n-1)
plt.plot(x, y, 'g')
n = 30
y = st.pdf(x, n-1)
plt.plot(x, y, 'b')
y = norm.pdf(x)
plt.plot(x, y, 'm')
plt.show()

Output:

Thursday, February 12, 2015

R17. Central Limit Theorem and Mean of Beta Distribution in R

For any distribution (such as the Beta, but not limited to it), the sample mean is approximately a normal random variable. It does require that we work with statistics estimated from the samples, since the population parameters (alpha and beta for the Beta, for example) are generally not known.


For the Beta distribution, we use rbeta. For the confidence level calculations, we use qnorm, the inverse of the cumulative distribution function.


The percentage of means inside the margin of error is calculated for the 5000 Beta samples.

# ex17.R
N <- 5000
n <- 1000
alpha <- 2
beta <- 5
mn <- numeric(N)
for (i in 1:N) {
    x <- rbeta(n, alpha, beta)
    mn[i] <- mean(x)
}
mn.samps <- mean(mn)
conf.range <- seq(5,95,5)
zin <- numeric(length(conf.range))
std.samps <- sd(mn)
for (i in 1:length(conf.range)) {
    p <- conf.range[i]
    z.crit <- qnorm((100+p)/200)
    moe <- z.crit*std.samps
    ins <- (mn.samps-moe < mn) & (mn < mn.samps+moe)
    zin[i] <- sum(ins)/N
}
plot(conf.range, zin*100, col = 'blue', type = 'l',
     xlab = 'Confidence Level',
     ylab = 'Percentage mean inside moe',
     main = 'Mean of Beta Samples',
     xlim = c(0,100), ylim = c(0,100))

Output:

py17. Central Limit Theorem and Mean of Beta Distribution in Python

For any distribution (such as the Beta, but not limited to it), the sample mean is approximately a normal random variable. It does require that we work with statistics estimated from the samples, since the population parameters (alpha and beta for the Beta, for example) are generally not known.


For the Beta distribution, we use numpy.random.beta. For the confidence level calculations, we use scipy.stats.norm.ppf, the inverse of the cumulative distribution function.


The percentage of means inside the margin of error is calculated for the 5000 Beta samples.

# ex17.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import beta
from scipy.stats import norm
N = 5000
n = 1000
a = 2
b = 5
mn = np.zeros(N)
for i in range(N):
    x = beta(a, b, n)
    mn[i] = x.mean()

mn_samps = mn.mean()
conf_range = range(5,100,5)
zin = np.zeros(len(conf_range))
std_samps = mn.std()
for i,p in enumerate(conf_range):
    z_crit = norm.ppf((100+p)/200)
    moe = z_crit*std_samps
    ins = (mn_samps-moe < mn) & (mn < mn_samps+moe)
    zin[i] = sum(ins)/N

plt.plot(conf_range, zin*100, 'b')
plt.xlabel('Confidence Level')
plt.ylabel('Percentage mean inside moe')
plt.title('Mean of Beta Samples')
plt.show()

Output:

R16. One Beta Distributed Sample in R

The function rbeta is used to create one Beta distributed sample of size N, which is 10000 in the example.


The alpha is 2 and beta is 5, same as before.


The mean of the sample is compared with the theoretical value.

# ex16.R
N <- 10000
alpha <- 2
beta <- 5
x <- rbeta(N, alpha,beta)
hist(x, breaks = 50, col = 'pink')
m1 <- mean(x)
m2 <- alpha/(alpha+beta)
error <- (m1-m2)/m2
cat("mean of sample = ", m1, " but should be ", m2)
cat("\nThe error = ", error)
# mean of sample =  0.2833446  but should be  0.2857143
# The error =  -0.008294062

py16. One Beta Distributed Sample in Python


The function numpy.random.beta is used to create one Beta distributed sample of size N, which is 10000 in the example.


The alpha (renamed a) is 2 and beta (renamed b) is 5, same as before.


The mean of the sample is compared with the theoretical value.

# ex16.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
from numpy.random import beta
N = 10000
a = 2
b = 5
x = beta(a, b, N)
plt.hist(x, bins = 50, color = 'pink')
plt.xlabel('x')
plt.ylabel('Frequency')
plt.title('Histogram of x - Beta Distribution')
m1 = x.mean()
m2 = a/(a+b)
error = (m1-m2)/m2
print("mean of sample = %s but should be %s" % (m1,m2))
print("The error = %s" % error)
plt.show()
# mean of sample = 0.286522609772 but should be 0.285714285714
# The error = 0.00282913420239

Output:

R15. Beta Distribution in R

The Beta distribution depends on two parameters, alpha and beta. The probability is nonzero between x = 0 and x = 1, and goes to zero near the two boundaries. You can see both y[1] and y[N] are zero.


Here, x is a numeric vector from 0 to 1, with a total number of points N.


All probability density functions start with d: dnorm, dbinom, dbeta, etc. You can get a full listing by typing ?distributions at the console.


We also compare two strings. In R, string comparison follows the locale's collation order, which typically ignores case, so 'python' > 'R' is FALSE.

# ex15.R
N <- 10000
x <- seq(0, 1, length = N)
alpha <- 2
beta <- 5
y <- dbeta(x,alpha,beta)
main <- paste("alpha =", alpha, ", beta =", beta)
plot(x, y, col='blue', type = 'l',
     ylab = 'Beta', main = main)
cat('Total Probability:  ',1/N*sum(y),'\n')
cat('Error, Total Probability:  ',1-1/N*sum(y),'\n')
m1 <- 1/N*sum(x*y)
m2 <- alpha/(alpha+beta)
cat('Approx. Mean:', m1, '\n')
cat('Mean:', m2, '\n')
error <- (m1-m2)/m2
cat('Error, Mean:   ', error, '\n')
cat('python > R:  ', 'python' > 'R')
# Total Probability:   0.9999 
# Error, Total Probability:   0.000100025 
# Approx. Mean: 0.2856857 
# Mean: 0.2857143 
# Error, Mean:    -1e-04
# python > R:   FALSE

Output:

py15. Beta Distribution in Python

The Beta distribution depends on two parameters, alpha and beta. Here, the names a and b are used instead. The probability is nonzero between x = 0 and x = 1, and goes to zero near the two boundaries.


Here, x is a numpy array from 0 to 1, with a total number of points N. In Python, we can access the last element with index -1. Both y[0] and y[-1] should be 0.


We use scipy.stats.beta for the beta distribution. If you are using IPython, type beta, a dot, and press tab to see the beta functions. One of the beta functions is pdf, which stands for the probability density function.


We also compare two strings. In Python, string comparisons are ASCII-based, with lowercase letters higher than uppercase letters. For example, 'A' is 65 (or 41 hex) and 'a' is 97 (or 61 hex).

# ex15.py
from __future__ import print_function, division
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import beta
N = 10000
x = np.linspace(0, 1, num = N)
a = 2
b = 5
y = beta.pdf(x,a,b)
plt.plot(x, y, 'b')
plt.xlabel("x")
plt.ylabel("Beta")
plt.title(r"$\alpha$ = %s,   $\beta$ = %s" % (a,b))
plt.show()
print('Total Probability:  %s' % (1/N*sum(y)))
print('Error, Total Probability:  %s' % (1-1/N*sum(y)))
m1 = 1/N*sum(x*y)
m2 = a/(a+b)
print('Approx. Mean: ', m1)
print('Mean: ', m2)
error = (m1-m2)/m2
print('Error, Mean: ', error)
print('python > R: ', 'python'>'R')
#Total Probability:  0.999899974997
#Error, Total Probability:  0.000100025002502
#Approx. Mean:  0.285685714286
#Mean:  0.285714285714
#Error, Mean:  -0.000100000000004
#python > R:  True

Output:

Wednesday, February 11, 2015

R14. New objects in R

mtcars is a built-in dataset in R. Only the first 3 rows and the first column are retained, in cars. The drop argument is given as FALSE, so a vector is not returned.


If you extract a column, you are always creating a new object in R. Any changes to the new object do not affect the original object. While this might seem to lead to easier programming, it also makes R more memory intensive.

# ex14.R
# get only 3 rows and 1 column
cars <- mtcars[1:3,1,drop = FALSE]
cat("Original cars =\n")
print(cars)
mpg <- cars["mpg"]
cat("\nOriginal mpg =\n")
print(mpg)
mpg[1] <- 42
cat("\nModified mpg =\n")
print(mpg)
cat("\nMazda RX4 mpg is not modified\n")
cat("as mpg is a copy\n")
print(cars)
# Original cars =
#   mpg
# Mazda RX4     21.0
# Mazda RX4 Wag 21.0
# Datsun 710    22.8
# 
# Original mpg =
#   mpg
# Mazda RX4     21.0
# Mazda RX4 Wag 21.0
# Datsun 710    22.8
# 
# Modified mpg =
#   mpg
# Mazda RX4      42
# Mazda RX4 Wag  42
# Datsun 710     42
# 
# Mazda RX4 mpg is not modified
# as mpg is a copy
# mpg
# Mazda RX4     21.0
# Mazda RX4 Wag 21.0
# Datsun 710    22.8

py14. Copy in Python

mtcars is a built-in dataset in R. To load it into Python, you may find a copy at https://raw.githubusercontent.com/ChrisBeaumont/mplfacet/master/mtcars.csv. Download it to your working directory, that is, the directory of the script.


The csv file is loaded using the pandas read_csv function, and only the first 3 rows and first 2 columns are retained, in cars.


If you extract a column, you are extracting only a view, and any modification of the view affects the original object as well. However, if you use copy() on the view, a new object is returned, and changes to it do not affect the original object. While this may seem more complex at first, it leads to more efficient memory management in Python. Most of the time you are not going to modify the data, so you will rarely need to copy() the view.

# ex14.py
from __future__ import print_function, division
import pandas as pd
# get only 3 rows and 2 columns
cars = pd.read_csv("mtcars.csv").iloc[:3,:2]
print("Original cars =\n%s" % cars)
mpg = cars["mpg"]
print("\nOriginal mpg =\n%s" % mpg)
mpg[0] = 42 # Warning
print("\nModified mpg =\n%s" % mpg)
print("\nMazda RX4 mpg modified\n%s" % cars)
cars = pd.read_csv("mtcars.csv").iloc[:3,:2]
print("\nOriginal cars\n%s" % cars)
mpg = cars["mpg"].copy()
print("\nmpg is now a new object")
print("\nmpg before change\n%s" % mpg)
mpg[0] = 42
print("\nmpg after change:\n%s" % mpg)
print("\nMazda RX4 mpg is not modified\n%s" % cars)
#Original cars =
#             #""   mpg
#0      Mazda RX4  21.0
#1  Mazda RX4 Wag  21.0
#2     Datsun 710  22.8
#
#Original mpg =
#0    21.0
#1    21.0
#2    22.8
#Name: mpg, dtype: float64
#
#Modified mpg =
#0    42.0
#1    21.0
#2    22.8
#Name: mpg, dtype: float64
#
#Mazda RX4 mpg modified
#             #""   mpg
#0      Mazda RX4  42.0
#1  Mazda RX4 Wag  21.0
#2     Datsun 710  22.8
#
#Original cars
#             #""   mpg
#0      Mazda RX4  21.0
#1  Mazda RX4 Wag  21.0
#2     Datsun 710  22.8
#
#mpg is now a new object
#
#mpg before change
#0    21.0
#1    21.0
#2    22.8
#Name: mpg, dtype: float64
#
#mpg after change:
#0    42.0
#1    21.0
#2    22.8
#Name: mpg, dtype: float64
#
#Mazda RX4 mpg is not modified
#             #""   mpg
#0      Mazda RX4  21.0
#1  Mazda RX4 Wag  21.0
#2     Datsun 710  22.8

Monday, February 9, 2015

R13. Central Limit Theorem and Confidence Level in R

A normal sample of size 500 is used. We create 5000 such samples and record their mean values.


The calculated mean will be slightly different from the population mean, within the margin of error (moe).


For confidence levels from 5% to 95%, the moe is found, along with the proportion of means inside the range. We expect a straight line with slope 1 (y = x).

# ex13.R
N <- 5000
n <- 500
p <- 0.6
val <- numeric(N)
pop.sd <- runif(1,10,25)
pop.mn <- p*n
for (j in 1:N) {
  s <- rnorm(n, mean = pop.mn, sd = pop.sd)
  val[j] <- mean(s)
}

# In general, population values are not known.
# Only 1 sample will be used in practice. 
# Many samples are used for the CLT
use.samp <- TRUE
if (use.samp) {
  mn1 <- mean(val)
  sd1 <- sd(val)
} else {
  mn1 <- pop.mn
  sd1 <- pop.sd/sqrt(n)
}
conf.range <- seq(5,95,5)
v <- numeric(length(conf.range))
for (i in 1:length(conf.range)) {
  per <- conf.range[i]
  z <- qnorm((100+per)/200)
  moe <- z * sd1
  ins <- (val > mn1 - moe) & (val < mn1 + moe)
  v[i] <- sum(ins)/N
}
plot(conf.range,v*100, type='l', col='blue',
     xlab = 'Confidence level',
     ylab = 'Proportion mean within moe',
     xlim = c(0,100), ylim = c(0,100))

Output:

py13. Central Limit Theorem and Confidence Level in Python

A normal sample of size 500 is used. We create 5000 such samples and record their mean values.


The calculated mean will be slightly different from the population mean, within the margin of error (moe).


For confidence levels from 5% to 95%, the moe is found, along with the proportion of means inside the range. We expect a straight line with slope 1 (y = x).

# ex13.py
from __future__ import print_function, division
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
N = 5000
n = 500
p = 0.6
val = np.zeros(N)
pop_sd = (15*np.random.sample(1)+10)[0] 
pop_mn = p*n
for j in range(N):
    s = np.random.normal(loc = pop_mn, scale = pop_sd, size = n)
    val[j] = s.mean()

# In general, population values are not known
# Only 1 sample will be used in practice. 
# Many samples are used for the CLT
use_sample = True
if use_sample:
    mn1 = val.mean()
    sd1 = val.std()
else:
    mn1 = pop_mn
    sd1 = pop_sd/np.sqrt(n)

conf_range = range(5,100,5)
v = np.zeros(len(conf_range))
for i,per in enumerate(conf_range):
    z = norm.ppf((100+per)/200)
    moe = z * sd1
    ins = (val > mn1 - moe) & (val < mn1 + moe)
    v[i] = sum(ins)/N

plt.plot(conf_range,v*100,'b')
plt.xlabel('Confidence level')
plt.ylabel('Proportion mean within moe')
plt.show()

Output:

Sunday, February 8, 2015

R12. Normal Approximation to Binomial in R

According to the Central Limit Theorem, distributions built from many independent trials, such as the Binomial, can be approximated by the Normal for large n.


The functions pnorm and dbinom are used. For the binomial, the probabilities are summed over a range near p*n.


For the normal, pnorm at the mean is 0.5. Thus only the right-side probability is used, multiplied by 2, since the normal is symmetric about its mean.

# ex12.R
p <- 0.4
n <- seq(100,2000,100)
s <- numeric(length(n))
nrm <- numeric(length(n))
print('Probability of successes n*(p-.01):n*(p+.01) in n: -> s')
for (j in 1:length(n)) {
  vals <- as.integer(c(n[j]*(p-.01), n[j]*(p+.01)))
  s[j] <- sum(dbinom(vals[1]:vals[2], size = n[j], prob = p ))
}
print("Normal approximation: -> nrm")
for (j in 1:length(n)) {
  mn <- n[j]*p
  std <- sqrt(n[j]*p*(1-p))
  vals <- as.integer(c(n[j]*(p-.01), n[j]*(p+.01)))
  nrm[j] <- 2*(pnorm(vals[2]+0.5, mean = mn, sd = std)-.5)
}
error <- (s-nrm)^2
plot(n, error, type='l', col = 'blue')
# [1] "Probability of successes n*(p-.01):n*(p+.01) in n: -> s"
# [1] "Normal approximation: -> nrm"

Output:

py12. Normal Approximation to Binomial in Python

According to the Central Limit Theorem, distributions built from many independent trials, such as the Binomial, can be approximated by the Normal for large n.


The functions norm and binom from scipy.stats are used. For the binomial, the probabilities are summed over a range near p*n.


For the normal, the cdf at the mean is 0.5. Thus only the right-side probability is used, multiplied by 2, since the normal is symmetric about its mean.
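
A small sketch of that symmetry argument (b is an arbitrary point above the mean of a standard normal):

from scipy.stats import norm
b = 1.7                            # arbitrary point above the mean
print(norm.cdf(0))                 # 0.5 at the mean
print(2*(norm.cdf(b) - 0.5))       # doubled right-side probability
print(norm.cdf(b) - norm.cdf(-b))  # the same mass, computed two-sided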

# ex12.py
from __future__ import print_function, division
import numpy as np
from scipy.stats import norm, binom
import matplotlib.pyplot as plt
p = 0.4
n = np.arange(100,2001,100)
s = np.zeros(len(n))
nrm = np.zeros(len(n))
print('Probability of successes n*(p-.01):n*(p+.01) in n: = s')
for j in range(len(n)):
    vals = range(int(n[j]*(p-.01)), int(n[j]*(p+.01))+1)
    s[j] = sum(binom.pmf(vals, n[j], p ))
print("Normal approximation: = nrm")
for j in range(len(n)):
    mn = n[j]*p
    std = np.sqrt(n[j]*p*(1-p))
    val = n[j]*(p+.01)
    nrm[j] = 2*(norm.cdf(val+0.5, loc = mn, scale = std)-.5)
error = (s-nrm)**2
plt.plot(n, error, 'b')
plt.show()
#Probability of successes n*(p-.01):n*(p+.01) in n: = s
#Normal approximation: = nrm

Output:

Saturday, February 7, 2015

R11. Frequency table in R

The csv file is downloaded from http://countrylist.net/en/. The delimiter is a semicolon and not a comma.


According to the file, there are 3 countries in Antarctica.


The ftable function is used to organize the data under the category of 'continent'. The $ can be used since a data.frame is a list as well. The class of group is 'ftable', and we may extract the names using one of its attributes.

# ex11.R
df <- read.csv('2015-02-07.dump.countrylist.net.csv',
                sep = ";") 
print('***The columns are')
for (i in names(df)) cat(i,'\n')
cat('\n***The number of rows is\n')
print(nrow(df))
cat('\n***Countries in Antarctica:\n')
df1 <- df[df['continent'] == 'Antarctica',]
print(df1['name'])
group <- ftable(df$continent)
name <- attr(group, 'col.vars')[[1]]
cat('\n\n***Frequency table:\n')
for (i in 1:length(group)) {
  cat(name[i],group[i],'\n')
}
# [1] "***The columns are"
# id 
# continent 
# name 
# capital 
# iso.2 
# iso.3 
# ioc 
# tld 
# currency 
# phone 
# utc 
# wiki 
# name_de 
# capital_de 
# wiki_de 
# 
# ***The number of rows is
# [1] 250
# 
# ***Countries in Antarctica:
#   name
# 11                  Antarctica
# 33               Bouvet Island
# 65 French Southern Territories
# 
# 
# ***Frequency table:
# Africa 62 
# Antarctica 3 
# Asia 55 
# Australia 26 
# Europe 53 
# North America 31 
# South America 20 

py11. Frequency table in Python

The csv file is downloaded from http://countrylist.net/en/. The delimiter is a semicolon and not a comma.


According to the file, there are 3 countries in Antarctica.


The groupby function is used to organize the data under the category of 'continent'. We are usually only interested in the 'name' column.

# ex11.py
from __future__ import division, print_function
from pandas import read_table
df = read_table('2015-02-07.dump.countrylist.net.csv',
                sep = ';') 
print('\n***The columns are')
for i in df.columns: print(i)
print('\n\n***There are %d rows.' % len(df))
print('\n***Countries in Antarctica:')
df1 = df[df['continent'] == 'Antarctica']
print(df1['name'])
group = df.groupby('continent')
print('\n\n***Frequency table:')
print(group.count()['name'])
#***The columns are
#id
#continent
#name
#capital
#iso-2
#iso-3
#ioc
#tld
#currency
#phone
#utc
#wiki
#name_de
#capital_de
#wiki_de
#
#
#***There are 250 rows.
#
#***Countries in Antarctica:
#10                     Antarctica
#32                  Bouvet Island
#64    French Southern Territories
#Name: name, dtype: object
#
#
#***Frequency table:
#continent
#Africa           62
#Antarctica        3
#Asia             55
#Australia        26
#Europe           53
#North America    31
#South America    20
#Name: name, dtype: int64

R10. Reading a csv table in R

This data is from https://raw.githubusercontent.com/datasets/population/master/data/population.csv. Download the csv file into the working directory.


The read.csv function is used to read the text csv file and return a data.frame object.


When the rows for the United States and the Arab World are found, we have to include a comma within the brackets to indicate that we want the rows matching the logical condition. Also, R puts a period for the space in 'Country Name', so it becomes 'Country.Name'.

# ex10.R
df <- read.csv('population.csv')
print('***class of df =')
print(class(df))
print('***The first 6 rows are')
print(head(df))
print('***The last 6 rows are')
print(tail(df))
print('***The total number of rows is')
print(nrow(df))
df1 <- df[df['Country.Name'] == 'United States',]
df2 <- df[df['Country.Name'] == 'Arab World',]
plot(df2[['Year']],df2[['Value']]/1e6, type = 'l',
     col = 'green', xlab = 'Year',
     ylab = 'Population (millions)',
     main = 'USA - blue, Arab World - green')
lines(df1[['Year']],df1[['Value']]/1e6, col = 'blue')
# [1] "***class of df ="
# [1] "data.frame"
# [1] "***The first 6 rows are"
# Country.Name Country.Code Year     Value
# 1   Arab World          ARB 1960  96388069
# 2   Arab World          ARB 1961  98882541
# 3   Arab World          ARB 1962 101474076
# 4   Arab World          ARB 1963 104169209
# 5   Arab World          ARB 1964 106978105
# 6   Arab World          ARB 1965 109907857
# [1] "***The last 6 rows are"
# Country.Name Country.Code Year    Value
# 12402     Zimbabwe          ZWE 2005 12570686
# 12403     Zimbabwe          ZWE 2006 12529655
# 12404     Zimbabwe          ZWE 2007 12481245
# 12405     Zimbabwe          ZWE 2008 12451543
# 12406     Zimbabwe          ZWE 2009 12473992
# 12407     Zimbabwe          ZWE 2010 12571000
# [1] "***The total number of rows is"
# [1] 12407

Output:

py10. Reading a csv table in Python

This data is from https://raw.githubusercontent.com/datasets/population/master/data/population.csv. Download the csv file into the working directory. If you want, you can give the url instead of the filename in the read_csv call.


The read_csv function from the pandas module is used to read text csv file and return a DataFrame.


If working in IPython, typing the DataFrame name, such as df, pressing period, and then hitting tab will list the methods that may be applied. To get help on a particular function, you can type ? at the end, such as df.head?

# ex10.py
from __future__ import division, print_function
import matplotlib.pyplot as plt
from pandas import read_csv
df = read_csv('population.csv')
print('***type of df = \n',type(df))
print('\n***The first 5 rows are\n',df.head())
print('\n***The last 5 rows are\n',df.tail())
print('\n***There are %d rows.' % len(df))
df1 = df[df['Country Name'] == 'United States']
df2 = df[df['Country Name'] == 'Arab World']
plt.plot(df1['Year'],df1['Value']/1e6,'b',
         df2['Year'],df2['Value']/1e6,'g')
plt.xlabel('Year')
plt.ylabel('Population (millions)')
plt.title('USA - blue, Arab World - green')
plt.show()
#***type of df =
#  <class 'pandas.core.frame.DataFrame'>
#
#***The first 5 rows are
#   Country Name Country Code  Year        Value
#0   Arab World          ARB  1960   96388069.0
#1   Arab World          ARB  1961   98882541.4
#2   Arab World          ARB  1962  101474075.8
#3   Arab World          ARB  1963  104169209.2
#4   Arab World          ARB  1964  106978104.6
#
#***The last 5 rows are
#       Country Name Country Code  Year     Value
#12402     Zimbabwe          ZWE  2006  12529655
#12403     Zimbabwe          ZWE  2007  12481245
#12404     Zimbabwe          ZWE  2008  12451543
#12405     Zimbabwe          ZWE  2009  12473992
#12406     Zimbabwe          ZWE  2010  12571000
#
#***There are 12407 rows.

Output:

Friday, February 6, 2015

R9. Correlation in R

x and y correspond to 50 thousand throws of the first and second die, respectively.


The correlation is near 0, as the two samples are independent, and will go to 0 as the number of samples is increased.


The calculated correlation and the result from cor(x,y) are compared.

# ex9.R
num_samples <- 50000
x <- sample(6, num_samples, replace = T)
y <- sample(6, num_samples, replace = T)
df <- data.frame(x,y)
df["a"] <- df["x"]-mean(df[["x"]])
df["b"] <- df["y"]-mean(df[["y"]])
df["ab"] <- df["a"]*df["b"]
df["sqa"] <- df["a"]*df["a"]
df["sqb"] <- df["b"]*df["b"]
den <- sqrt(sum(df[["sqa"]])*sum(df[["sqb"]]))
correlation <- sum(df[["ab"]])/den
print("correlation =")
print(correlation)
print("cor(x,y) =")
print(cor(x,y))
err <- abs(correlation-cor(x,y))
print("error =")
print(err)
# [1] "correlation ="
# [1] 0.0008143925
# [1] "cor(x,y) ="
# [1] 0.0008143925
# [1] "error ="
# [1] 1.12757e-17

py9. Correlation in Python

Two dice throws are independent and uncorrelated, and the calculated correlation should be near 0.
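
A minimal sketch of this trend (the specific values vary from run to run): the sample correlation of independent throws shrinks toward 0, roughly like 1/sqrt(n), as the number of samples grows.

from numpy.random import randint
from scipy.stats import pearsonr

# correlation of independent die throws at increasing sample sizes
for n in (500, 5000, 500000):
    x = randint(1, 7, n)
    y = randint(1, 7, n)
    print(n, pearsonr(x, y)[0])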


x and y correspond to 50 thousand throws of the first and second die, respectively.


The pearsonr function from scipy.stats is used. The function cor could also have been defined with a def statement rather than a lambda; however, a lambda is usually used when the function is a single expression. The name cor is chosen to match the one in R.
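
For reference, a def version of the same helper, equivalent to the lambda used below:

from scipy.stats import pearsonr

def cor(x, y):
    # pearsonr returns (correlation, p-value); keep only the correlation
    return pearsonr(x, y)[0]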

# ex9.py
from __future__ import division, print_function
from numpy.random import randint
from numpy import sqrt
from pandas import DataFrame
from scipy.stats import pearsonr
cor = lambda x,y: pearsonr(x,y)[0]
num_samples = 50000
x = randint(1,7,num_samples)
y = randint(1,7,num_samples)
df = DataFrame({'x':x,'y':y})
df['a'] = df['x']-df['x'].mean()
df['b'] = df['y']-df['y'].mean()
df['ab'] = df['a']*df['b']
df['sqa'] = df['a']**2
df['sqb'] = df['b']**2
den = sqrt(df['sqa'].sum()*df['sqb'].sum())
correlation = df['ab'].sum()/den
print("correlation =", correlation)
print("cor(x,y) =", cor(x,y))
err = abs(correlation-cor(x,y))
print("error =", err)
#correlation = 0.000423927126959
#cor(x,y) = 0.000423927126959
#error = 0.0

Thursday, February 5, 2015

R8. Rolling Two Dice in R

There are 5000 trials of rolling 2 dice. The sum of the two dice goes from 2 to 12.


x1 and x2 are random samples of {1,2,3,4,5,6}, uniformly distributed. They are both vectors of length 5000.


We have to use double brackets in the plot calls to extract a vector from the data frame.
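
pandas draws a related distinction, though the bracket roles are reversed; a small sketch:

from pandas import DataFrame

df = DataFrame({'x': [1, 2, 3]})
print(type(df['x']))    # single brackets: a Series (vector-like)
print(type(df[['x']]))  # double brackets: a one-column DataFrame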

# ex8.R
num_samples <- 5000
x1 <- sample(6, num_samples, replace = T)
x2 <- sample(6, num_samples, replace = T)
df <- data.frame(x1,x2)
df['sum'] <- df['x1'] + df['x2']
x <- numeric(13-2)
for (i in 2:12) {
  x[i-1] <- sum(df['sum']==i)
}
df1 <- data.frame(x, row.names = 2:12)
df1['cum_x'] <- cumsum(df1['x'])
df1['cum_p'] <- df1['cum_x']/num_samples
x_lim <- c(1:6,5:1)
x_lim <- num_samples/36 * x_lim
df1['x_lim'] <- x_lim
df1['cum_lim'] <- cumsum(df1['x_lim'])
df1['cum_p_lim'] <- df1['cum_lim']/num_samples
print(df1)
err_df <- df1['cum_p_lim']-df1['cum_p']
err <- sd(err_df[,1])
print(sprintf('error = %.8f',err))
plot(2:12, df1[['cum_p']], type = 'l', col = 'red', 
     ylab = 'Cum Prob', xlab = 'x')
lines(2:12, df1[['cum_p_lim']], col = 'blue')
#     x cum_x  cum_p    x_lim   cum_lim  cum_p_lim
# 2  130   130 0.0260 138.8889  138.8889 0.02777778
# 3  305   435 0.0870 277.7778  416.6667 0.08333333
# 4  408   843 0.1686 416.6667  833.3333 0.16666667
# 5  584  1427 0.2854 555.5556 1388.8889 0.27777778
# 6  666  2093 0.4186 694.4444 2083.3333 0.41666667
# 7  851  2944 0.5888 833.3333 2916.6667 0.58333333
# 8  701  3645 0.7290 694.4444 3611.1111 0.72222222
# 9  553  4198 0.8396 555.5556 4166.6667 0.83333333
# 10 377  4575 0.9150 416.6667 4583.3333 0.91666667
# 11 301  4876 0.9752 277.7778 4861.1111 0.97222222
# 12 124  5000 1.0000 138.8889 5000.0000 1.00000000
# [1] "error = 0.00329718"

Output:

py8. Rolling Two Dice in Python

There are 5000 trials, each rolling 2 dice. The sum of the dice is between 2 and 12.


x1 holds die 1 of the 5000 rolls, and x2 holds die 2. In the randint call, we give the upper bound as 7, which is not included. You should always check the documentation to see whether the endpoints are included, via randint? in IPython or help(randint) in Python. For scientific work, IPython is the more productive platform.
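
A quick empirical check of the endpoints, as a sketch:

from numpy.random import randint

draws = randint(1, 7, 100000)
print(draws.min(), draws.max())  # 1 6: the upper bound 7 is excluded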


The expected counts under the limiting probabilities are given in x_lim. There is 1 combination that sums to 2 (both dice showing 1), 2 combinations that sum to 3 ({1,2} and {2,1}), and so on. If we go above 5000 for num_samples, the error should decrease on average.
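
One way to compute these limiting probabilities, as a sketch, is to convolve the pmf of a single die with itself:

import numpy as np

one_die = np.ones(6) / 6             # pmf of one fair die over {1,...,6}
pmf = np.convolve(one_die, one_die)  # pmf of the sum, over sums 2..12
print(pmf * 36)  # [1. 2. 3. 4. 5. 6. 5. 4. 3. 2. 1.]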

# ex8.py
from __future__ import division, print_function
import numpy as np
from numpy.random import randint
from pandas import DataFrame
import matplotlib.pyplot as plt
num_samples = 5000
x1 = randint(1,7,num_samples)
x2 = randint(1,7,num_samples)
df = DataFrame({'x1':x1, 'x2':x2})
df['sum'] = df['x1'] + df['x2']
x = np.zeros(13-2)
for i in range(2,13):
    x[i-2] = sum(df['sum']==i)
df1 = DataFrame({'x':x}, index = range(2,13))
df1['cum_x'] = df1['x'].cumsum()
df1['cum_p'] = df1['cum_x']/num_samples
x_lim = np.array([1,2,3,4,5,6,5,4,3,2,1])
x_lim = num_samples/36 * x_lim
df1['x_lim'] = x_lim
df1['cum_lim'] = df1['x_lim'].cumsum()
df1['cum_p_lim'] = df1['cum_lim']/num_samples
print(df1)
err = (df1['cum_p_lim']-df1['cum_p']).std()
print('\nerror =', err)
plt.plot(range(2,13), df1['cum_p'], 'r',
         range(2,13), df1['cum_p_lim'], 'b')
plt.show()
#      x  cum_x   cum_p       x_lim      cum_lim  cum_p_lim
#2   145    145  0.0290  138.888889   138.888889   0.027778
#3   272    417  0.0834  277.777778   416.666667   0.083333
#4   413    830  0.1660  416.666667   833.333333   0.166667
#5   548   1378  0.2756  555.555556  1388.888889   0.277778
#6   689   2067  0.4134  694.444444  2083.333333   0.416667
#7   837   2904  0.5808  833.333333  2916.666667   0.583333
#8   661   3565  0.7130  694.444444  3611.111111   0.722222
#9   559   4124  0.8248  555.555556  4166.666667   0.833333
#10  429   4553  0.9106  416.666667  4583.333333   0.916667
#11  303   4856  0.9712  277.777778  4861.111111   0.972222
#12  144   5000  1.0000  138.888889  5000.000000   1.000000
#
#error = 0.00353882300178

Output:

Wednesday, February 4, 2015

R7. Adding Mean Column and Row to a data.frame in R

The rowMeans and colMeans functions are used.


If we did not explicitly set the row.names argument in the data.frame constructor, the row numbers would start at 1. This is in contrast to languages such as Python, which use zero-based indexing.
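
In pandas, for example, the default index is zero-based; a small sketch:

from pandas import DataFrame

df = DataFrame({'x1': [5, 7, 9]})
print(df.index)  # starts at 0 (a RangeIndex in recent pandas versions)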


Whenever we assign to a new row or column index, a new row or column is created.

# ex7.R
x1 <- sample(0:9)
x2 <- sample(0:9)
x3 <- sample(0:9)
df <- data.frame(x1,x2,x3,row.names = c(0:9))
print('data.frame df =')
print(df)
print('First Column =')
print(df['x1'])
print('First Row =')
print(df[1,])
print('First Column, First Row = ')
print(df[1,'x1'])
print('Adding ave column')
df['ave'] <- rowMeans(df)
print('df =')
print(df)
print('Adding ave row')
df["ave_row",] <- colMeans(df)
print('df =')
print(df)
# [1] "data.frame df ="
# x1 x2 x3
# 0  2  3  4
# 1  7  8  2
# 2  1  4  5
# 3  0  1  7
# 4  9  0  1
# 5  8  6  8
# 6  3  5  6
# 7  4  7  0
# 8  6  9  3
# 9  5  2  9
# [1] "First Column ="
# x1
# 0  2
# 1  7
# 2  1
# 3  0
# 4  9
# 5  8
# 6  3
# 7  4
# 8  6
# 9  5
# [1] "First Row ="
# x1 x2 x3
# 0  2  3  4
# [1] "First Column, First Row = "
# [1] 2
# [1] "Adding ave column"
# [1] "df ="
# x1 x2 x3      ave
# 0  2  3  4 3.000000
# 1  7  8  2 5.666667
# 2  1  4  5 3.333333
# 3  0  1  7 2.666667
# 4  9  0  1 3.333333
# 5  8  6  8 7.333333
# 6  3  5  6 4.666667
# 7  4  7  0 3.666667
# 8  6  9  3 6.000000
# 9  5  2  9 5.333333
# [1] "Adding ave row"
# [1] "df ="
# x1  x2  x3      ave
# 0       2.0 3.0 4.0 3.000000
# 1       7.0 8.0 2.0 5.666667
# 2       1.0 4.0 5.0 3.333333
# 3       0.0 1.0 7.0 2.666667
# 4       9.0 0.0 1.0 3.333333
# 5       8.0 6.0 8.0 7.333333
# 6       3.0 5.0 6.0 4.666667
# 7       4.0 7.0 0.0 3.666667
# 8       6.0 9.0 3.0 6.000000
# 9       5.0 2.0 9.0 5.333333
# ave_row 4.5 4.5 4.5 4.500000

py7. Adding Mean Column and Row to a DataFrame in Python

DataFrames in pandas are similar to data.frame in R and can hold tabular data.


Every column has a specific type, such as integer or float. This is similar to R, in which each column has a class such as integer or numeric.
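
The column types can be inspected through the dtypes attribute; a small sketch:

from pandas import DataFrame

df = DataFrame({'x1': [1, 2], 'x2': [0.5, 1.5]})
print(df.dtypes)  # x1 an integer dtype, x2 a float dtype (exact widths are platform-dependent)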


The mean function is a method of the DataFrame, so the period is significant here. In R, by contrast, periods are mostly treated as part of a name.
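
A small sketch of the axis argument used below; axis=0 averages down each column and axis=1 across each row:

from pandas import DataFrame

df = DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.mean(axis=0))  # column means: a 1.5, b 3.5
print(df.mean(axis=1))  # row means: 0 2.0, 1 3.0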


A dictionary d of the 3 samples is created, and then a DataFrame is constructed from the dictionary.

# ex7.py
from __future__ import division, print_function
from pandas import DataFrame
from numpy.random import choice
x1 = choice(10, size = 10, replace = False)
x2 = choice(10, size = 10, replace = False)
x3 = choice(10, size = 10, replace = False)
d = {'x1':x1,'x2':x2,'x3':x3}
df = DataFrame(d)
print('\nDataFrame df =\n%s' % df)
print('\nFirst Column =\n%s' % df['x1'])
print('\nFirst Row =\n%s' % df.iloc[0])  # iloc selects by position; .ix is removed in current pandas
print('\nFirst Column, First Row = %s' % df.loc[0,'x1'])
print('\nAdding ave column')
df['ave'] = df.mean(axis=1)
print('\ndf =\n%s' % df)
print('\nAdding ave row')
df.loc['ave_row'] = df.mean(axis=0)  # loc with a new label appends a row; .ix is removed in current pandas
print('\ndf =\n%s' % df)
#DataFrame df =
#   x1  x2  x3
#0   6   9   8
#1   3   1   6
#2   9   5   5
#3   0   3   3
#4   5   2   9
#5   2   8   7
#6   7   7   4
#7   4   0   1
#8   1   6   0
#9   8   4   2
#
#First Column =
#0    6
#1    3
#2    9
#3    0
#4    5
#5    2
#6    7
#7    4
#8    1
#9    8
#Name: x1, dtype: int32
#
#First Row =
#x1    6
#x2    9
#x3    8
#Name: 0, dtype: int32
#
#First Column, First Row = 6
#
#Adding ave column
#
#df =
#   x1  x2  x3       ave
#0   6   9   8  7.666667
#1   3   1   6  3.333333
#2   9   5   5  6.333333
#3   0   3   3  2.000000
#4   5   2   9  5.333333
#5   2   8   7  5.666667
#6   7   7   4  6.000000
#7   4   0   1  1.666667
#8   1   6   0  2.333333
#9   8   4   2  4.666667
#
#Adding ave row
#
#df =
#          x1   x2   x3       ave
#0        6.0  9.0  8.0  7.666667
#1        3.0  1.0  6.0  3.333333
#2        9.0  5.0  5.0  6.333333
#3        0.0  3.0  3.0  2.000000
#4        5.0  2.0  9.0  5.333333
#5        2.0  8.0  7.0  5.666667
#6        7.0  7.0  4.0  6.000000
#7        4.0  0.0  1.0  1.666667
#8        1.0  6.0  0.0  2.333333
#9        8.0  4.0  2.0  4.666667
#ave_row  4.5  4.5  4.5  4.500000