Comparing R and Python for PCA PyData Boston 2013files.meetup.com/1676436/PyData-2013-PCA.pdf · 2014-03-15 · Dataframes is an essential part of languages supporting data analysis

© 2013 IBM Corporation

Comparing R and Python for PCA PyData Boston 2013

Vipin Sachdeva

Senior Engineer, IBM Research


Comparison of R and Python for Principal Component Analysis

§  R and Python are popular choices for data analysis.

§  How do they compare in terms of programmer productivity and performance ?

§  Use a common task for both R and Python – Principal Component Analysis (PCA) – PCA is a very commonly used technique for dimension reduction.

§  Dataframes is an essential part of languages supporting data analysis – R provides data frame with numerous statistical packages. – Python has included numPy (arrays) and Pandas (dataframe) for data handling which

we use.

§  Both language have rich development environments – Rstudio for R –  iPython for Python.

§  Both languages have many features that helps in data analysis. –  In this talk we compare those features with some code examples to solve our

problem.

§  This talk is not about as much about principal component analysis as about programming and performance of Python and R

§  Let’s get started


PCA – Short Introduction

§  PCA is a standard tool in modern data analysis.

§  Simple method to extract information from confusing datasets – Reduce a complex dataset to a lower dimension

•  PCA projects the data along the direction where data varies the most. •  Directions are determined by the direction of the eigenvectors coresponding to

largest eigenvalues


PCA – Mathematical approaches

§  Find eigenvalues of standardized covariance matrix.

§  Choose eigenvalues with sum exceeding a threshold.

§  Reduction in dimension from N to K: – Create data with subset of eigenvalues (whose sum exceeds that threshold).

-- --

- 7 -

• How to choose the principal components?

- To choose K , use the following criterion:

K

i=1Σ i

N

i=1Σ i

> Threshold (e.g., 0.9 or 0.95)

• What is the error due to dimensionality reduction?

- We saw above that an original vector x can be reconstructed using its the prin-cipla components:

x̂ − x =K

i=1Σ biui or x̂ =

K

i=1Σ biui + x

- It can be shown that the low-dimensional basis based on principal componentsminimizes the reconstruction error:

e = ||x − x̂||

- It can be shown that the error is equal to:

e = 1/2N

i=K+1Σ i

• Standardization

- The principal components are dependent on the units used to measure the orig-inal variables as well as on the range of values they assume.

- We should always standardize the data prior to using PCA.

- A common standardization method is to transform all the data to have zeromean and unit standard deviation:

xi − ( and are the mean and standard deviation of xi’s)

-- --

- 7 -

• How to choose the principal components?

- To choose K , use the following criterion:

K

i=1Σ i

N

i=1Σ i

> Threshold (e.g., 0.9 or 0.95)

• What is the error due to dimensionality reduction?

- We saw above that an original vector x can be reconstructed using its the prin-cipla components:

x̂ − x =K

i=1Σ biui or x̂ =

K

i=1Σ biui + x

- It can be shown that the low-dimensional basis based on principal componentsminimizes the reconstruction error:

e = ||x − x̂||

- It can be shown that the error is equal to:

e = 1/2N

i=K+1Σ i

• Standardization

- The principal components are dependent on the units used to measure the orig-inal variables as well as on the range of values they assume.

- We should always standardize the data prior to using PCA.

- A common standardization method is to transform all the data to have zeromean and unit standard deviation:

xi − ( and are the mean and standard deviation of xi’s)


PCA using Singular Value Decomposition (SVD)

§  More generalized approach for performing PCA.

§  Decompose X=UDVT

§  D*D is eigenvalues of covariance matrix.

§  Reconstruction of data by zeroing out regions as shown below

§  Choose q (as before)

Figure 5:

We can embed x into an orthogonal space via rotation. D scales, V rotates, and U is aperfect circle.

PCA cuts o↵ SVD at q dimensions. In Figure 6, U is a low dimensional representation.Examples 3 and 1.3 use q = 2 and N = 130. D reflects the variance so we cut o↵ dimensionswith low variance (remember d11 d22...). Lastly, V are the principle components.

Figure 6:

2 Factor Analysis

Figure 7: The hidden variable is the point on the hyperplane (line). The observed value isx, which is dependant on the hidden variable.

Factor analysis is another dimension-reduction technique. The low-dimension represen-tation of higher-dimensional space is a hyperplane drawn through the high dimensionalspace. For each datapoint, we select a point on the hyperplane and choose data from theGaussian around that point. These chosen points are observable whereas the point on thehyperplane is latent.

4


PCA: What data to use ?

§  How about PCA on current 500 S&P stocks data for a “period of time” ?

§  Download symbols from S&P 500 website and create a vector.

§  Use this vector to download symbols data from 1970 to 2012 in a dataframe (if possible).

§  R and Python have various packages for financial data download – quantMod (R) – pandas.io.data.DataReader (Python)

§  Need a package that provides a single dataframe as output from a single call.

Dates MMM ABT … 01-01-1970 109.62 NA … 01-02-1970 107.12 NA NA … … … … 12-31-2012 NA 108.66 104.32


Data Download – R only

§  I am a C/C++/Fortran HPC programmer, and I do use for loops in R and Python. –  for loops are slow in R

§  Can any package return data for S&P stocks as a single dataframe ? – Use fImport package of R to download daily data. – stocksData<-yahooSeries(symbols_nospaces,from="1970-01-01",to="2012-12-31”)

#symbols_nospaces is S&P stock symbols – Extract columns with closing dates.

§  Write to a csv file for repeated runs (takes a long time to download)

§  Read the file in R/Python to get the data –  read.table in R created a R dataframe – Pandas read_table created a Pandas dataframe.

§  Many symbols have NA’s for dates where data is not available.

§  Work with a subset of data – How about 200 stocks for quarter of a century (1988-2012) ?

#Snippet of code to get closing data colname<-paste(symbols_nospaces[i],".Close",sep="") print(colname) stockData_df[,i+1]<-get(colname) colnames(stockData_df)[i+1]<-symbols_nospaces[i]


Combined Data Preparation – R and Python

§  Read the file in R/Python to get the data –  read.table in R created a R dataframe – Pandas read_table created a Pandas dataframe.

§  Many symbols have NA’s for dates where data is not available.

§  Work with a subset of data – How about 200 stocks for quarter of a century (1988-2012) ?

§  Find first occurrence of “1988” in dataframe’s Dates column. – str.contains(“1970”) in Python – agrep(…,fixed=TRUE) in R.

§  Extract stock columns which do not have NA on the first trading day of 1988 –  !is.na in R/ not math.isnan in Python – Get 200 stocks which satisfy above requirement

§  Result: combined data for 200 stocks from 1988-2012 in R/Python dataframes. – Drop rows with NA for any stock.

•  na.omit()/drop.na() – 6162 entries for 200 stocks in total.

§  Both R and Python are remarkably similar for this step.


Code in R and Python for data preparation R code extractData<-function(filename,yearrange, numstocks) { stocksreturns<-read.table("./stocksdata_dataframe.txt", header=T) yrpattern<-paste(yearrange[1],"-*",sep="”) stockdates<-stocksreturns$Dates sindex<-agrep(yrpattern,stockdates,fixed=TRUE)[1] eindex<-length(stockdates) colnames<-colnames(stocksreturns[i]) stocksreturns_short<-data.frame(stockdates[startindex:endindex])]) colnames(stocksreturns_short)[1]<-"Dates"

j<-2

stockindex<-0 for(i in 2:501) { if(stockindex<numstocks) { if(!is.na(stocksreturns[,i][k])) { stocksreturns_short[,j]<-stocksreturns[,i][sindex:eindex] colnames(stocksreturns_short)[j]<-colnames[i][i] j<-j+1 stockindex<-stockindex+1}}}}

9

Python code def extractData(filename,lowyear, numstocks): stocksreturns=pd.read_table('stocksdata_dataframe.txt', sep='\s+') yrpattern="%d-*" % (lowyear) x=stocksreturns['"Dates"'].str.contains(str(lowyear)) for i in range(size(x)): if(x[i]==True): break startindex=i colnames=stocksreturns.columns stocksreturns_short=DataFrame(stocksreturns.ix[startindex:size(x)]['"Dates"']) stockindex=0 for i in range(1,501): if stockindex<numstocks: if not math.isnan(stocksreturns[colnames[i]][startindex]): stocksreturns_short[colnames[i]]=stocksreturns[colnames[i]][startindex:len(stocksreturns.index)] stockindex=stockindex+1 return stocksreturns_short


R/Python packages for PCA

§  Size of data is only about 23 MB in txt format. – No memory-bound issues running on my laptop.

§  R has many choices (one too many) for PCA: – prcomp/princomp/PCA/dudi.pca/acp – prcomp scales and centers data (very convenient)

•  prcomp(stocksreturns_short,scale=TRUE,center=TRUE,retx=TRUE) •  Reconstruct data with predict function. •  prcomp uses svd beneath the covers

§  Python seems to have several choices for PCA as well. – matplotlib.mca.pca – MDP (module for data processing) PCA – numpy.eig/scipy.eig etc

§  Both packages seem to have adequate support for PCA in multiple ways.

§  Our approach: Use SVD in both R/Python: – Do same operations and compare runtimes. – svd in numpy returns transpose(V), while R returns V – Both R and Python return d as a vector; trivial to make a diagonal matrix for

reconstruction of data. – Things start in Python from 0; in R from 1 J

covariance matrix/eigenvalues/eigenvectors approach.


PCA on combined data using SVD

§  PCA is a SVD operation:

§  X is stocks data (6162x200)

§  D*D is eigenvalues. (p=200)

§  Reconstruction of data by zeroing out regions as shown below

§  Choose q (explained ahead)

Figure 5:

We can embed x into an orthogonal space via rotation. D scales, V rotates, and U is aperfect circle.

PCA cuts o↵ SVD at q dimensions. In Figure 6, U is a low dimensional representation.Examples 3 and 1.3 use q = 2 and N = 130. D reflects the variance so we cut o↵ dimensionswith low variance (remember d11 d22...). Lastly, V are the principle components.

Figure 6:

2 Factor Analysis

Figure 7: The hidden variable is the point on the hyperplane (line). The observed value isx, which is dependant on the hidden variable.

Factor analysis is another dimension-reduction technique. The low-dimension represen-tation of higher-dimensional space is a hyperplane drawn through the high dimensionalspace. For each datapoint, we select a point on the hyperplane and choose data from theGaussian around that point. These chosen points are observable whereas the point on thehyperplane is latent.

4


PCA on combined data – Approach

§  Perform a SVD of stock returns data.

§  Find number of eigenvalues q comprising 50%,75%,90% and 100% of sum of all the eigenvalues

– Eigenvalues=d*d from SVD

§  Zero out remaining eigenvectors/eigenvalues –  In Python, use copy.copy for copying eigenvalues/vectors from SVD (assignment is

done using references)

§  Reconstruct data with matrix-multiply operations. – X_reconstructed=U*D*t(V) – Measure the std_dev(data_reconstructed-original_data)


Code in Python for PCA on combined data colnames=stocksreturns_short.columns diff_data_combined=np.zeros(shape=(stocksreturns_short.shape[0]-1,stocksreturns_short.shape[1])) for i in range(0,numstocks): diff_data_combined[:,i]=diff(stocksreturns_short[:,i+1]) [u_original,d_original,v_original]=np.linalg.svd(diff_data_combined,full_matrices=False) d_diag_original=diag(d_original) eigvals_combined = d_original*d_original totalsumeigvals=sum(eigvals_combined) for percent in eigvalspercent:

sumeigvals=0 for i in range(0,200): sumeigvals=sumeigvals+eigvals_combined[i] if sumeigvals>=(percent*totalsumeigvals):

neigvals=i+1 break u=copy.copy(u_original) d_diag=copy.copy(d_diag_original) v=copy.copy(v_original) nvals=shape(diff_data_combined)[1] u[:][neigvals:nvals]=0 d_diag[neigvals:nvals][neigvals:nvals]=0 v[neigvals:nvals][:]=0 dproduct=np.dot(u,np.dot(d_diag,v))

13


Code in R for PCA on combined data data.combined<-diff_stockyr svd.combined<-svd(data.combined) #find SVD eigvalues.combined<-svd.combined$d * svd.combined$d totalsum<-sum(eigvalues.combined) proportionrange<-c(0.5,0.75,0.90,1) for(proportion in proportionrange){ sum<-0 neigvalues<-0 for(i in 1:numstocks) { sum<-sum+eigvalues.combined[i] neigvalues<-neigvalues+1 if((sum/totalsum)>=proportion) { cat(sprintf("Number of eigenvalues for combined data for proportion %f = %d\n",proportion,neigvalues)) break; }} nvals<-dim(data.combined)[2] u<-svd.combined$u d<-diag(svd.combined$d) v<-svd.combined$v #Copy SVD matrices u[,(neigvalues):nvals]<-0 d[(neigvalues):nvals,(neigvalues):nvals]<-0 v[,(neigvalues):nvals]<-0 stock.data<-u %*% d %*% t(v) #Do a matrix multiply to get data

14


PCA on combined data results

•  138 stocks out of 200 account for 90% of the sum of all the eigenvalues •  Reconstruct data with 138 stocks has negligible error (10^-5)


Yearly PCA

§  Instead of doing a PCA on combined data from 1988-2012, how about yearly PCA ? – PCA on yearly data

§  Separate the combined dataframe into yearly dataframes (1 for each year).

§  Number of observations vary for each year.

§  Calculate number of eigenvalues accounting for 50%, 75% ,90% ,100% of sum of all eigenvalues (same operation as PCA on combined data)

§  Do a reconstruction for each proportion/each year. (step operation as PCA on combined data)

– 25 separate PCA’s – 100 reconstructions in total.

Dates MMM ABT …

01-01-1988 109.62 NA …

01-02-1988 107.12 NA NA

01-03-1988 … NA …

Dataframe 1..

Dates MMM ABT …

01-01-2012 109.62 NA …

01-02-2012 107.12 NA NA

01-03-2012 … NA …

…..Dataframe 25


Code in Python for extracting yearly dataframes

Python code colnames=stocksreturns_short.columns x=stocksreturns_short['"Dates"'].str.contains(str(years)) stocksreturns_yr=stocksreturns_short.ix[x] shape0,shape1=np.shape(stocksreturns_yr) diff_data_yr=np.zeros((shape0-1,shape1)) for i in range(0,numstocks): data_yr[:,i]=(stocksreturns_yr[:][colnames[i+1]]) diff_data_yr[:,i]=np.diff(data_yr[:,i])

17

R code yrpattern<-paste(years,"-*",sep="") yrindices<-agrep(yrpattern,stocksreturns_short$Dates,fixed=TRUE) val_stockyr<-data.frame(stocksreturns_short$Dates[yrindices[1]:tail(yrindices,n=1)]) colnames(val_stockyr)[1]<-"Dates" for(i in 2:(numstocks+1)) { val_stockyr[,i]<-stocksreturns_short[,i][yrindices[1]:tail(yrindices,n=1)] colnames(val_stockyr)[i]<-colnames(stocksreturns_short)[i] } diff_stockyr<-data.frame(matrix(NA,nrow=(dim(log_stockyr)[1]-1),ncol=numstocks)) for(j in 1:200) diff_stockyr[,j]<-diff(log_stockyr[,j])


Analysis of yearly PCA data

§  Number of eigenvalues with 50% of total sum drops to 1 in 2008

§  Stock movement is highly correlated due to macro-economic trends.


PCA on yearly data

§  Being a C/C++/Fortran HPC programmer, I use for loops in R/Python

§  Not efficient for R (for loop is an object; assignment in R is a copy operation)

§  Python’s assignment is done with references so it works better with for loops, and lesser overhead of functions.

§  Development Environment: –  ipython-2.7 with pandas and numpy installed through ports package – Rstudio 0.97 with R binary downloaded for Mac – No attempt to optimize the build for either R and Python.

§  Total code for R takes above 20 seconds versus about 11.9 seconds for Python on my Macbook Pro.

§  Timings may change with less reliance on for loops in the code.


Parallelizing yearly PCA

§  Can we use parallelism in R and Python productively ?

§  Both R and Python provide several ways for parallelization – Multiple cores – Distributed parallelism using MPI or sockets

§  Use coarse-grained parallelism to speed up our computations. – Look into how both packages allow use of multicores on modern day processors

§  Very easy to apply coarse-grained parallelism to yearly PCA – Divide years amongst threads/processes.

§  For R use doMC/foreach package that works on the multiple cores.

§  Python threads does not work well due to global interpreter lock (GIL). – Use iPython ipcluster parallelization framework.

§  Further evaluation using MPI on distributed clusters needed.


Parallelizing yearly PCA in R

§  foreach depends on a backend for execution § We register DoMC (multiple cores) as backend for the yearly PCA in this case § MPI can also be used as a backend for distributed clusters. §  Snow package another option (higher level for distributed clusters). § Not just limited to for loops:

§  Use mclapply for multi-core lapply etc.

#Sequential code

for(years in 1988:2012) { for(proportion in c(0.5,0.75,0.90,1){ Code to extract yearly data, do PCA and then reconstruct

}}

#Parallel code

registerDoMC(4) #Register multicore as backend with 4 cores

foreach(years in 1988:2012) %dopar%{ for(proportion in c(0.5,0.75,0.90,1){ Code to extract yearly data, do PCA and then reconstruct

§  }}


Timing Results

Intel Core i7 Macbook Pro (4 cores, 8 hyper-threading threads)

Threads


Parallelizing yearly PCA in Python

§  Using iPython’s Direct Interface – Start backend of iPython – % ipcluster-2.7 start –n 4 (4 is the number of processes)

§  Rewrite pcaData function so that it can be used with the map API of Python – pcaData(stocksreturns_short,year) – pcaData extracts data for year from stocksreturns_short, performs a SVD and then

reconstructions with eigenvalues percentages as before. §  Processes (unlike threads in R) makes us reimport all the modules inside the function.

– Higher memory footprint – More heavyweight compared to threads.

§  Create a list for each process’s function arguments. §  Parallelize across years as in R

– Each process computes a subset of SVD’s and reuses a single SVD for 4 reconstructions.

§  #code for map_async x=[] for i in range(0,25): x.append(stocksreturns_short) starttime=datetime.now() map_sync(pcaData,x,range(1988,2013)) print(datetime.now()-starttime)


Timing results in Python

§  Starting ipcluster=8 leads to processes hanging. §  ipython with multiple processes led to some memory issues. §  Scalability of Python shows similar trend as R

Threads


Summary

§  Both R and Python offer good choices for PCA

§  R has many packages for tasks such as downloading financial data, PCA etc. – Python has a good support as well.

§  R offers a cohesive framework –  Installing packages is pain-free – Parallelization in R is very simple.

§  R seems to be slower as assignment operator requires copy operations which is a lot of overhead (and my use of for loops).

§  Python is more forgiving of usage of for loop, and seems to require lesser statements to do the same work.

– Pandas/Numpy adds dataframe capabilities to Python’s native string handling capabilities to provide a strong platform for data analysis.


Future Work

§  Profiling of code at statement level etc.

§  How does R/Python work for memory-bound/compute-bound problems ?

§  Work with Distributed matrices (disnumpy for Python,r-pbd for R)

§  Use MPI as backend for parallelization on a cluster

§  Make interpreted code faster for both R/Python through compilers(cmpfun for R, Cython for Python)

Documents

Comparing R and Python for PCA PyData Boston 2013files.meetup.com/1676436/PyData-2013-PCA.pdf · 2014-03-15 · Dataframes is an essential part of languages supporting data analysis