Principal Component Analysis#

PCA is an unsupervised learning algorithm that is widely used to reduce the dimensionality of large datasets. It can also be used for feature extraction, so that we can work with a limited number of features. PCA helps identify the feature(s) that contribute most to the variance in the data.

PCA transforms the original data into a lower-dimensional space while minimizing information loss.
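
As a quick illustration of the idea, here is a minimal sketch on a synthetic dataset (not the rice data used below): 200 correlated 3-dimensional points are projected onto two principal components, and most of the variance ends up in the first component.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: three correlated features, so most variance lies along one direction
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.1, size=200),
                     rng.normal(scale=0.5, size=200)])

pca = PCA(n_components=2)             # keep two components
X_reduced = pca.fit_transform(X)      # shape (200, 2)
print(pca.explained_variance_ratio_)  # first component captures most of the variance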

Here we’ll use a dataset from this manuscript to recreate some of the analysis it presents. The manuscript analyses 8 phenotypic traits in 169 japonica rice varieties. The data used here is available as a download in the supplementary material of the manuscript. Save this Excel file in CSV format.
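
If you prefer to do the conversion programmatically, the sketch below uses pandas (the file names are placeholders; point them at wherever you saved the supplementary file, and reading .xlsx files requires the openpyxl package). Adjust the sheet_name/skiprows arguments as needed so that the resulting CSV has no header row, since that is how it is read further below.

import pandas as pd

# Hypothetical file names; replace with the actual supplementary file
raw = pd.read_excel("pnas_suppl.xlsx", header=None)
raw.to_csv("pnas_suppl.csv", header=False, index=False)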

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Read the data in CSV format; it contains measurements of different aspects of the plant over two years: columns 3 to 10 hold the 2014 measurements and columns 11 to 18 the 2015 measurements. The first column is the “Line number”, which we’ll use as the index of our dataframe, and the second column is the plant variety. There is no header row in the CSV file. Let’s read the data and preprocess it.

df1 = pd.read_csv("pnas_suppl.csv", index_col=0,header=None)
df1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0
Line 1 GORIKI 98 112.0 7.0 26.7 19.6 12.7 41.3 206.3 101 117.1 8.3 25.4 21.0 13.7 46.7 221.7
Line 2 HINOMARU 87 95.7 9.0 23.4 18.6 12.7 39.0 199.0 87 99.8 7.7 23.6 17.3 12.0 41.8 213.0
Line 3 MANGOKU 93 109.7 9.0 24.6 20.7 16.7 36.0 210.0 91 106.0 10.7 23.6 18.3 14.7 32.0 177.7
Line 4 KAIRYO SHINKOU 89 82.6 9.0 22.8 16.1 9.7 26.3 139.7 87 86.3 7.3 25.3 19.6 12.7 35.7 182.0
Line 5 SENICHI 98 103.1 11.0 23.1 17.1 10.3 24.3 136.0 98 102.0 14.0 26.0 20.0 11.7 26.7 140.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Line 165 HIMENOMOCHI 87 74.8 7.0 20.0 13.7 10.3 25.7 137.3 81 82.8 10.3 21.5 15.8 13.7 30.0 165.7
Line 166 SHINSHUU 92 91.3 16.0 20.1 13.4 11.7 9.7 93.0 91 85.8 11.3 20.3 14.5 10.3 16.0 102.0
Line 167 AICHI ASAHI_2*** 113 84.3 12.0 23.4 16.7 9.0 23.0 123.7 107 88.3 10.7 24.8 18.6 10.0 25.7 141.0
Line 168 RAIDEN 113 111.8 11.0 26.4 19.7 11.7 36.7 183.0 109 108.3 8.7 24.7 18.9 12.3 36.3 165.3
Line 169 HOUMANSHINDEN INE 124 107.9 15.0 22.7 18.2 16.0 17.7 175.0 120 122.0 12.3 23.6 18.7 13.7 42.3 220.7

169 rows × 17 columns

Data pre-processing#

Next, we’ll extract the data for the two years separately and merge them vertically into one dataframe. This reformatting will be useful for subsequent analysis. We’ll also add column headers to our dataframes; these headers are the names of the different traits.

df_2015 = df1[[1,10,11,12,13,14,15,16,17]].copy()  # .copy() avoids a SettingWithCopyWarning when adding columns below
df_2015.columns=["Variety","Days-to-heading", "Culm length (cm)","Panicle No.", "Panicle length (cm)", "Rachis length (cm)", "Primary branch No.", "Sec. branch No.", "Spikelet No."]
df_2015["Year"]=2015
df_2015.head()
Variety Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No. Year
0
Line 1 GORIKI 101 117.1 8.3 25.4 21.0 13.7 46.7 221.7 2015
Line 2 HINOMARU 87 99.8 7.7 23.6 17.3 12.0 41.8 213.0 2015
Line 3 MANGOKU 91 106.0 10.7 23.6 18.3 14.7 32.0 177.7 2015
Line 4 KAIRYO SHINKOU 87 86.3 7.3 25.3 19.6 12.7 35.7 182.0 2015
Line 5 SENICHI 98 102.0 14.0 26.0 20.0 11.7 26.7 140.3 2015

Remove from df1 the columns that were copied to df_2015 and add headers (the same as in df_2015) to the result. Also add the Year column, so that the new dataframe, df_2014, holds the 2014 data only.

df_2014 = df1.drop(columns=[10,11,12,13,14,15,16,17])
df_2014.columns=["Variety","Days-to-heading", "Culm length (cm)","Panicle No.", "Panicle length (cm)", "Rachis length (cm)", "Primary branch No.", "Sec. branch No.", "Spikelet No."]
df_2014["Year"]=2014
df_2014.head()
Variety Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No. Year
0
Line 1 GORIKI 98 112.0 7.0 26.7 19.6 12.7 41.3 206.3 2014
Line 2 HINOMARU 87 95.7 9.0 23.4 18.6 12.7 39.0 199.0 2014
Line 3 MANGOKU 93 109.7 9.0 24.6 20.7 16.7 36.0 210.0 2014
Line 4 KAIRYO SHINKOU 89 82.6 9.0 22.8 16.1 9.7 26.3 139.7 2014
Line 5 SENICHI 98 103.1 11.0 23.1 17.1 10.3 24.3 136.0 2014

Now we have two dataframes holding the plant traits for the two years, i.e. df_2014 and df_2015 for 2014 and 2015, respectively. Let’s concatenate them vertically to get the full data in one dataframe.

df_final = pd.concat([df_2014,df_2015])
df_final
Variety Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No. Year
0
Line 1 GORIKI 98 112.0 7.0 26.7 19.6 12.7 41.3 206.3 2014
Line 2 HINOMARU 87 95.7 9.0 23.4 18.6 12.7 39.0 199.0 2014
Line 3 MANGOKU 93 109.7 9.0 24.6 20.7 16.7 36.0 210.0 2014
Line 4 KAIRYO SHINKOU 89 82.6 9.0 22.8 16.1 9.7 26.3 139.7 2014
Line 5 SENICHI 98 103.1 11.0 23.1 17.1 10.3 24.3 136.0 2014
... ... ... ... ... ... ... ... ... ... ...
Line 165 HIMENOMOCHI 81 82.8 10.3 21.5 15.8 13.7 30.0 165.7 2015
Line 166 SHINSHUU 91 85.8 11.3 20.3 14.5 10.3 16.0 102.0 2015
Line 167 AICHI ASAHI_2*** 107 88.3 10.7 24.8 18.6 10.0 25.7 141.0 2015
Line 168 RAIDEN 109 108.3 8.7 24.7 18.9 12.3 36.3 165.3 2015
Line 169 HOUMANSHINDEN INE 120 122.0 12.3 23.6 18.7 13.7 42.3 220.7 2015

338 rows × 10 columns
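
Before building the model, a pairplot of the traits, coloured by year, gives a quick look at their pairwise relationships and distributions.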

df_temp = df_final.reset_index(drop=True)
sns.pairplot(data=df_temp.iloc[:,range(1,10)], hue="Year", palette="deep")
<seaborn.axisgrid.PairGrid at 0x1d82c9db9d0>
[Figure: pairwise scatter plots and distributions of the eight traits, coloured by year]

Model building#

Perform PCA on the normalized feature vectors to generate five components. We’ll first create an object of the PCA class with five as the value of the n_components argument, then use the normalized matrix to generate the principal components and save the results to a dataframe. We’ll start with the 2015 data only.

# Exclude the Variety and Year columns; keep only the eight traits
df_2015.iloc[:,[1,2,3,4,5,6,7,8]]
Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No.
0
Line 1 101 117.1 8.3 25.4 21.0 13.7 46.7 221.7
Line 2 87 99.8 7.7 23.6 17.3 12.0 41.8 213.0
Line 3 91 106.0 10.7 23.6 18.3 14.7 32.0 177.7
Line 4 87 86.3 7.3 25.3 19.6 12.7 35.7 182.0
Line 5 98 102.0 14.0 26.0 20.0 11.7 26.7 140.3
... ... ... ... ... ... ... ... ...
Line 165 81 82.8 10.3 21.5 15.8 13.7 30.0 165.7
Line 166 91 85.8 11.3 20.3 14.5 10.3 16.0 102.0
Line 167 107 88.3 10.7 24.8 18.6 10.0 25.7 141.0
Line 168 109 108.3 8.7 24.7 18.9 12.3 36.3 165.3
Line 169 120 122.0 12.3 23.6 18.7 13.7 42.3 220.7

169 rows × 8 columns

Normalize the features to zero mean and unit variance:

pca_2015_standard = StandardScaler().fit_transform(df_2015.iloc[:,[1,2,3,4,5,6,7,8]])

Fit the data and create a dataframe of principal components:

pca_2015_model = PCA(n_components=5) #number of principal components
pca_2015_fit = pca_2015_model.fit_transform(pca_2015_standard) 

df_2015_pca = pd.DataFrame(data = pca_2015_fit, columns = ['pc1', 'pc2','pc3','pc4','pc5'])

The variance in the data explained by each component can be checked via the explained_variance_ratio_ attribute of the PCA object. The output is the fraction of the variance associated with each of the principal components (five in this case). We can also calculate the total variance explained by the five components: as shown below, together they account for 95.9% of the variance in the data.

print(pca_2015_model.explained_variance_ratio_)
print(pca_2015_model.explained_variance_ratio_.cumsum())
[0.62283251 0.16410666 0.07756149 0.05617533 0.03833603]
[0.62283251 0.78693917 0.86450066 0.92067599 0.95901202]
pca_2015_model.components_
array([[-0.16505669, -0.36898558,  0.24285994, -0.3816588 , -0.408246  ,
        -0.3591661 , -0.40835848, -0.41175211],
       [-0.7266037 , -0.2441704 , -0.47751906, -0.21155821, -0.13636855,
         0.2424033 ,  0.15019558,  0.19932913],
       [-0.22625375,  0.20730652,  0.74838672, -0.30732605, -0.09178569,
         0.41958235,  0.14012806,  0.21723589],
       [-0.41965494, -0.32972906,  0.31061083,  0.53659936,  0.50113939,
         0.04226267, -0.23352974, -0.15260259],
       [ 0.30823769, -0.18116406, -0.18564851, -0.1950358 ,  0.12892964,
         0.73701089, -0.43027916, -0.23391603]])
df_2015_pca.head()
pc1 pc2 pc3 pc4 pc5
0 -4.241790 0.275651 0.358924 -0.495200 -0.275406
1 -2.030945 1.637908 0.070001 -0.686920 -0.911864
2 -1.854638 0.671051 1.212392 0.068783 0.847572
3 -1.925394 1.483075 -0.648590 0.746669 0.069074
4 -0.947741 -1.384563 0.557716 1.403352 -0.043060
df_2015_pca.plot.scatter(x="pc1",y="pc2")
plt.show()
[Figure: scatter plot of the 2015 data projected onto pc1 and pc2]

Let’s plot the percentage of variance explained against the components. This plot combines two graphs: bars showing the variance explained by each component and a line showing the cumulative sum of the variance explained.

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

ax1.bar(range(len(pca_2015_model.explained_variance_ratio_)), \
        pca_2015_model.explained_variance_ratio_, alpha=0.5)
ax2.plot(pca_2015_model.explained_variance_ratio_.cumsum())
ax2.scatter(range(len(pca_2015_model.explained_variance_ratio_)), \
            pca_2015_model.explained_variance_ratio_.cumsum())

ax2.set_ylim([0,1])

ax1.set_ylabel("Variance Explained")
ax2.set_ylabel("Cumulative Sum")
ax1.set_xlabel("Principal Components")
plt.xticks(ticks=range(len(pca_2015_model.explained_variance_ratio_)), \
           labels=["PC1", "PC2", "PC3", "PC4", "PC5"])
plt.show()
[Figure: variance explained per principal component (bars) with cumulative sum (line)]

PCA Loadings#

The loading values range from -1 to 1; a value close to either end (1 or -1) indicates a strong influence of that feature on the corresponding component. Loadings can be thought of as analogous to correlation coefficients. To get the loadings, we multiply the eigenvectors by the square root of the eigenvalues (reference).

pca_2015_model.explained_variance_
array([5.01231878, 1.3206679 , 0.62418533, 0.45207762, 0.30851375])
loadings2 = pca_2015_model.components_.T * np.sqrt(pca_2015_model.explained_variance_)
loading_matrix = pd.DataFrame(loadings2, columns=['pc1', 'pc2','pc3','pc4','pc5'], index=df_2015.columns.values[1:-1])
loading_matrix
pc1 pc2 pc3 pc4 pc5
Days-to-heading -0.369532 -0.835015 -0.178753 -0.282162 0.171208
Culm length (cm) -0.826093 -0.280601 0.163783 -0.221699 -0.100626
Panicle No. 0.543720 -0.548766 0.591266 0.208845 -0.103117
Panicle length (cm) -0.854466 -0.243123 -0.242804 0.360792 -0.108331
Rachis length (cm) -0.913990 -0.156715 -0.072516 0.336950 0.071613
Primary branch No. -0.804109 0.278571 0.331493 0.028416 0.409365
Sec. branch No. -0.914241 0.172605 0.110709 -0.157018 -0.238994
Spikelet No. -0.921839 0.229070 0.171628 -0.102605 -0.129926

The signs of the loadings are arbitrary. The absolute value represents the strength of the correlation with the corresponding principal component; features with opposite signs on the same component are negatively correlated along it.
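
To see the “loadings as correlations” interpretation concretely, we can run a quick sanity check (not part of the original analysis): compare a loading with the Pearson correlation between the standardized feature and the corresponding PC scores. The two numbers are nearly identical; they differ only by a factor of sqrt((n-1)/n) that comes from the different variance conventions used by StandardScaler and PCA.

# Correlation between the first standardized feature (Days-to-heading) and the pc1 scores,
# compared with the corresponding entry of the loading matrix
r = np.corrcoef(pca_2015_standard[:, 0], df_2015_pca["pc1"])[0, 1]
print(r, loading_matrix.iloc[0, 0])  # both close to -0.37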

Biplot#

A biplot is a useful visualization method for PCA, as it shows the relationship of the different features with the principal components. The plot comprises two layers (hence the name):

  1. A scatter plot showing the transformed data along the first two principal components

  2. PCA loadings as arrows

Visual analysis of this plot shows the extent to which different features contribute to a specific principal component.

coeff = loading_matrix.values
plt.scatter(pca_2015_fit[:,0], pca_2015_fit[:,1], marker='.', alpha=0.5)
for i in range(8):
    plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'k', \
              alpha=0.5, linestyle = '-',linewidth = 1.5, overhang=0.2)
[Figure: biplot of the 2015 data, pc1 vs pc2 scores with loading arrows]

Let’s customise the biplot to match the figure given in the manuscript. The circle of radius 1 marks the boundary at which a feature’s variance would be fully (100%) explained by the first two components.

#coeff = np.transpose(pca_2015_model.components_[0:2, :])  # alternative: use the raw components instead of the loadings
coeff = loading_matrix.values
for i in range(8):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'k', alpha = 0.9, \
                  linestyle = '-',linewidth = 1.5, overhang=0.2)
        plt.text(coeff[i,0]*1.25, coeff[i,1]*1.25, \
                 df_2015.columns.values[1:-1][i], color = 'k', ha = 'center', \
                 va = 'center',fontsize=10)
circle1 = plt.Circle((0, 0), 1, color='grey',fill=False)
plt.gca().add_patch(circle1)
plt.plot([0, 0], [-1, 1], color='grey', linestyle='-', linewidth=1)
plt.plot([-1, 1], [0, 0], color='grey', linestyle='-', linewidth=1)

plt.gca().set_box_aspect(1)

plt.xlabel(f"PC1 ({pca_2015_model.explained_variance_ratio_[0]*100:.2f}%)")
plt.ylabel(f"PC2 ({pca_2015_model.explained_variance_ratio_[1]*100:.2f}%)")

plt.yticks(ticks=[-1,-0.5,0,0.5,1])
plt.title("PCA loading for the 2015 data")

plt.show()
[Figure: PCA loading plot for the 2015 data, with unit circle and labelled loading arrows]

Compare the above graph with Figure 1B of this manuscript. For a discussion of the results, see the supplementary material.

Exercise#

Create a figure showing PCA loadings for the 2014 data.