Principal Component Analysis#

PCA is an unsupervised learning algorithm that is widely used to reduce the dimensionality of large datasets. It can also be used for feature extraction, so that we can work with a limited number of features. PCA helps identify the feature(s) that contribute most to the variance in the data.

PCA transforms the original data into a lower-dimensional space while minimizing information loss.
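
As a quick illustration of the idea, here is a minimal sketch on a synthetic dataset (not the rice data used below): 200 correlated 3-dimensional points are projected onto two principal components, and most of the variance ends up in the first component.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: three correlated features, so most variance lies along one direction
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.1, size=200),
                     rng.normal(scale=0.5, size=200)])

pca = PCA(n_components=2)             # keep two components
X_reduced = pca.fit_transform(X)      # shape (200, 2)
print(pca.explained_variance_ratio_)  # first component captures most of the variance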

Here we’ll use a dataset from this manuscript to recreate some of the analysis it presents. The manuscript analyses 8 phenotypic traits in 169 japonica rice varieties. The data used here is available as a download in the supplementary material of the manuscript. Save this Excel file in CSV format.
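
If you prefer to do the conversion programmatically, the sketch below uses pandas (the file names are placeholders; point them at wherever you saved the supplementary file, and reading .xlsx files requires the openpyxl package). Adjust the sheet_name/skiprows arguments as needed so that the resulting CSV has no header row, since that is how it is read further below.

import pandas as pd

# Hypothetical file names; replace with the actual supplementary file
raw = pd.read_excel("pnas_suppl.xlsx", header=None)
raw.to_csv("pnas_suppl.csv", header=False, index=False)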

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Read the data in CSV format; it contains measurements of different aspects of the plant over two years: columns 3 to 10 hold the 2014 measurements and columns 11 to 18 the 2015 measurements. The first column is the “Line number”, which we’ll use as the index of our dataframe, and the second column is the plant variety. There is no header row in the CSV file. Let’s read the data and preprocess it.

df1 = pd.read_csv("pnas_suppl.csv", index_col=0,header=None)
df1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0
Line 1 GORIKI 98 112.0 7.0 26.7 19.6 12.7 41.3 206.3 101 117.1 8.3 25.4 21.0 13.7 46.7 221.7
Line 2 HINOMARU 87 95.7 9.0 23.4 18.6 12.7 39.0 199.0 87 99.8 7.7 23.6 17.3 12.0 41.8 213.0
Line 3 MANGOKU 93 109.7 9.0 24.6 20.7 16.7 36.0 210.0 91 106.0 10.7 23.6 18.3 14.7 32.0 177.7
Line 4 KAIRYO SHINKOU 89 82.6 9.0 22.8 16.1 9.7 26.3 139.7 87 86.3 7.3 25.3 19.6 12.7 35.7 182.0
Line 5 SENICHI 98 103.1 11.0 23.1 17.1 10.3 24.3 136.0 98 102.0 14.0 26.0 20.0 11.7 26.7 140.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Line 165 HIMENOMOCHI 87 74.8 7.0 20.0 13.7 10.3 25.7 137.3 81 82.8 10.3 21.5 15.8 13.7 30.0 165.7
Line 166 SHINSHUU 92 91.3 16.0 20.1 13.4 11.7 9.7 93.0 91 85.8 11.3 20.3 14.5 10.3 16.0 102.0
Line 167 AICHI ASAHI_2*** 113 84.3 12.0 23.4 16.7 9.0 23.0 123.7 107 88.3 10.7 24.8 18.6 10.0 25.7 141.0
Line 168 RAIDEN 113 111.8 11.0 26.4 19.7 11.7 36.7 183.0 109 108.3 8.7 24.7 18.9 12.3 36.3 165.3
Line 169 HOUMANSHINDEN INE 124 107.9 15.0 22.7 18.2 16.0 17.7 175.0 120 122.0 12.3 23.6 18.7 13.7 42.3 220.7

169 rows × 17 columns

Data pre-processing#

Next, we’ll extract the data for the two years separately and merge them vertically into one dataframe. This reformatting will be useful for subsequent analysis. We’ll also add column headers to our dataframes; these headers are the names of the different traits.

df_2015 = df1[[1,10,11,12,13,14,15,16,17]].copy()  # .copy() avoids a SettingWithCopyWarning when adding columns below
df_2015.columns=["Variety","Days-to-heading", "Culm length (cm)","Panicle No.", "Panicle length (cm)", "Rachis length (cm)", "Primary branch No.", "Sec. branch No.", "Spikelet No."]
df_2015["Year"]=2015
df_2015.head()
Variety Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No. Year
0
Line 1 GORIKI 101 117.1 8.3 25.4 21.0 13.7 46.7 221.7 2015
Line 2 HINOMARU 87 99.8 7.7 23.6 17.3 12.0 41.8 213.0 2015
Line 3 MANGOKU 91 106.0 10.7 23.6 18.3 14.7 32.0 177.7 2015
Line 4 KAIRYO SHINKOU 87 86.3 7.3 25.3 19.6 12.7 35.7 182.0 2015
Line 5 SENICHI 98 102.0 14.0 26.0 20.0 11.7 26.7 140.3 2015

Remove from df1 the columns that were copied to df_2015 and add headers (the same as in df_2015) to the result. Also add the Year column, so that the new dataframe, df_2014, holds the 2014 data only.

df_2014 = df1.drop(columns=[10,11,12,13,14,15,16,17])
df_2014.columns=["Variety","Days-to-heading", "Culm length (cm)","Panicle No.", "Panicle length (cm)", "Rachis length (cm)", "Primary branch No.", "Sec. branch No.", "Spikelet No."]
df_2014["Year"]=2014
df_2014.head()
Variety Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No. Year
0
Line 1 GORIKI 98 112.0 7.0 26.7 19.6 12.7 41.3 206.3 2014
Line 2 HINOMARU 87 95.7 9.0 23.4 18.6 12.7 39.0 199.0 2014
Line 3 MANGOKU 93 109.7 9.0 24.6 20.7 16.7 36.0 210.0 2014
Line 4 KAIRYO SHINKOU 89 82.6 9.0 22.8 16.1 9.7 26.3 139.7 2014
Line 5 SENICHI 98 103.1 11.0 23.1 17.1 10.3 24.3 136.0 2014

Now we have two dataframes holding the plant traits for the two years, i.e. df_2014 and df_2015 for 2014 and 2015, respectively. Let’s concatenate them vertically to get the full data in one dataframe.

df_final = pd.concat([df_2014,df_2015])
df_final
Variety Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No. Year
0
Line 1 GORIKI 98 112.0 7.0 26.7 19.6 12.7 41.3 206.3 2014
Line 2 HINOMARU 87 95.7 9.0 23.4 18.6 12.7 39.0 199.0 2014
Line 3 MANGOKU 93 109.7 9.0 24.6 20.7 16.7 36.0 210.0 2014
Line 4 KAIRYO SHINKOU 89 82.6 9.0 22.8 16.1 9.7 26.3 139.7 2014
Line 5 SENICHI 98 103.1 11.0 23.1 17.1 10.3 24.3 136.0 2014
... ... ... ... ... ... ... ... ... ... ...
Line 165 HIMENOMOCHI 81 82.8 10.3 21.5 15.8 13.7 30.0 165.7 2015
Line 166 SHINSHUU 91 85.8 11.3 20.3 14.5 10.3 16.0 102.0 2015
Line 167 AICHI ASAHI_2*** 107 88.3 10.7 24.8 18.6 10.0 25.7 141.0 2015
Line 168 RAIDEN 109 108.3 8.7 24.7 18.9 12.3 36.3 165.3 2015
Line 169 HOUMANSHINDEN INE 120 122.0 12.3 23.6 18.7 13.7 42.3 220.7 2015

338 rows × 10 columns
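
Before building the model, a pairplot of the traits, coloured by year, gives a quick look at their pairwise relationships and distributions.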

df_temp = df_final.reset_index(drop=True)
sns.pairplot(data=df_temp.iloc[:,range(1,10)], hue="Year", palette="deep")
<seaborn.axisgrid.PairGrid at 0x1d82c9db9d0>
[Figure: pairwise scatter plots and distributions of the eight traits, coloured by year]

Model building#

Perform PCA on the normalized feature vectors to generate five components. We’ll first create an object of the PCA class with five as the value of the n_components argument, then use the normalized matrix to generate the principal components and save the results to a dataframe. We’ll start with the 2015 data only.

# Exclude the Variety and Year columns; keep only the eight traits
df_2015.iloc[:,[1,2,3,4,5,6,7,8]]
Days-to-heading Culm length (cm) Panicle No. Panicle length (cm) Rachis length (cm) Primary branch No. Sec. branch No. Spikelet No.
0
Line 1 101 117.1 8.3 25.4 21.0 13.7 46.7 221.7
Line 2 87 99.8 7.7 23.6 17.3 12.0 41.8 213.0
Line 3 91 106.0 10.7 23.6 18.3 14.7 32.0 177.7
Line 4 87 86.3 7.3 25.3 19.6 12.7 35.7 182.0
Line 5 98 102.0 14.0 26.0 20.0 11.7 26.7 140.3
... ... ... ... ... ... ... ... ...
Line 165 81 82.8 10.3 21.5 15.8 13.7 30.0 165.7
Line 166 91 85.8 11.3 20.3 14.5 10.3 16.0 102.0
Line 167 107 88.3 10.7 24.8 18.6 10.0 25.7 141.0
Line 168 109 108.3 8.7 24.7 18.9 12.3 36.3 165.3
Line 169 120 122.0 12.3 23.6 18.7 13.7 42.3 220.7

169 rows × 8 columns

Normalize the features to zero mean and unit variance:

pca_2015_standard = StandardScaler().fit_transform(df_2015.iloc[:,[1,2,3,4,5,6,7,8]])

Fit the data and create a dataframe of principal components:

pca_2015_model = PCA(n_components=5) #number of principal components
pca_2015_fit = pca_2015_model.fit_transform(pca_2015_standard) 

df_2015_pca = pd.DataFrame(data = pca_2015_fit, columns = ['pc1', 'pc2','pc3','pc4','pc5'])

The variance in the data explained by each component can be checked via the explained_variance_ratio_ attribute of the PCA object. The output is the fraction of the variance associated with each of the principal components (five in this case). We can also calculate the total variance explained by the five components: as shown below, together they account for 95.9% of the variance in the data.

print(pca_2015_model.explained_variance_ratio_)
print(pca_2015_model.explained_variance_ratio_.cumsum())
[0.62283251 0.16410666 0.07756149 0.05617533 0.03833603]
[0.62283251 0.78693917 0.86450066 0.92067599 0.95901202]
pca_2015_model.components_
array([[-0.16505669, -0.36898558,  0.24285994, -0.3816588 , -0.408246  ,
        -0.3591661 , -0.40835848, -0.41175211],
       [-0.7266037 , -0.2441704 , -0.47751906, -0.21155821, -0.13636855,
         0.2424033 ,  0.15019558,  0.19932913],
       [-0.22625375,  0.20730652,  0.74838672, -0.30732605, -0.09178569,
         0.41958235,  0.14012806,  0.21723589],
       [-0.41965494, -0.32972906,  0.31061083,  0.53659936,  0.50113939,
         0.04226267, -0.23352974, -0.15260259],
       [ 0.30823769, -0.18116406, -0.18564851, -0.1950358 ,  0.12892964,
         0.73701089, -0.43027916, -0.23391603]])
df_2015_pca.head()
pc1 pc2 pc3 pc4 pc5
0 -4.241790 0.275651 0.358924 -0.495200 -0.275406
1 -2.030945 1.637908 0.070001 -0.686920 -0.911864
2 -1.854638 0.671051 1.212392 0.068783 0.847572
3 -1.925394 1.483075 -0.648590 0.746669 0.069074
4 -0.947741 -1.384563 0.557716 1.403352 -0.043060
df_2015_pca.plot.scatter(x="pc1",y="pc2")
plt.show()
[Figure: scatter plot of the 2015 data projected onto pc1 and pc2]

Let’s plot the percentage of variance explained against the components. This plot combines two graphs: bars showing the variance explained by each component and a line showing the cumulative sum of the variance explained.

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()

ax1.bar(range(len(pca_2015_model.explained_variance_ratio_)), \
        pca_2015_model.explained_variance_ratio_, alpha=0.5)
ax2.plot(pca_2015_model.explained_variance_ratio_.cumsum())
ax2.scatter(range(len(pca_2015_model.explained_variance_ratio_)), \
            pca_2015_model.explained_variance_ratio_.cumsum())

ax2.set_ylim([0,1])

ax1.set_ylabel("Variance Explained")
ax2.set_ylabel("Cumulative Sum")
ax1.set_xlabel("Principal Components")
plt.xticks(ticks=range(len(pca_2015_model.explained_variance_ratio_)), \
           labels=["PC1", "PC2", "PC3", "PC4", "PC5"])
plt.show()
[Figure: variance explained per principal component (bars) with cumulative sum (line)]

PCA Loadings#

The loading values range from -1 to 1; a value close to either end (1 or -1) indicates a strong influence of that feature on the corresponding component. Loadings can be thought of as analogous to correlation coefficients. To get the loadings, we multiply the eigenvectors by the square root of the eigenvalues (reference).

pca_2015_model.explained_variance_
array([5.01231878, 1.3206679 , 0.62418533, 0.45207762, 0.30851375])
loadings2 = pca_2015_model.components_.T * np.sqrt(pca_2015_model.explained_variance_)
loading_matrix = pd.DataFrame(loadings2, columns=['pc1', 'pc2','pc3','pc4','pc5'], index=df_2015.columns.values[1:-1])
loading_matrix
pc1 pc2 pc3 pc4 pc5
Days-to-heading -0.369532 -0.835015 -0.178753 -0.282162 0.171208
Culm length (cm) -0.826093 -0.280601 0.163783 -0.221699 -0.100626
Panicle No. 0.543720 -0.548766 0.591266 0.208845 -0.103117
Panicle length (cm) -0.854466 -0.243123 -0.242804 0.360792 -0.108331
Rachis length (cm) -0.913990 -0.156715 -0.072516 0.336950 0.071613
Primary branch No. -0.804109 0.278571 0.331493 0.028416 0.409365
Sec. branch No. -0.914241 0.172605 0.110709 -0.157018 -0.238994
Spikelet No. -0.921839 0.229070 0.171628 -0.102605 -0.129926

The signs of the loadings are arbitrary. The absolute value represents the strength of the correlation with the corresponding principal component; features with opposite signs on the same component are negatively correlated along it.
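
To see the “loadings as correlations” interpretation concretely, we can run a quick sanity check (not part of the original analysis): compare a loading with the Pearson correlation between the standardized feature and the corresponding PC scores. The two numbers are nearly identical; they differ only by a factor of sqrt((n-1)/n) that comes from the different variance conventions used by StandardScaler and PCA.

# Correlation between the first standardized feature (Days-to-heading) and the pc1 scores,
# compared with the corresponding entry of the loading matrix
r = np.corrcoef(pca_2015_standard[:, 0], df_2015_pca["pc1"])[0, 1]
print(r, loading_matrix.iloc[0, 0])  # both close to -0.37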

Biplot#

A biplot is a useful visualization method for PCA, as it shows the relationship of the different features with the principal components. The plot comprises two layers (hence the name):

  1. A scatter plot showing the transformed data along the first two principal components

  2. PCA loadings as arrows

Visual analysis of this plot shows the extent to which different features contribute to a specific principal component.

coeff = loading_matrix.values
plt.scatter(pca_2015_fit[:,0], pca_2015_fit[:,1], marker='.', alpha=0.5)
for i in range(8):
    plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'k', \
              alpha=0.5, linestyle = '-',linewidth = 1.5, overhang=0.2)
[Figure: biplot of the 2015 data, pc1 vs pc2 scores with loading arrows]

Let’s customise the biplot to match the figure given in the manuscript. The circle of radius 1 marks the boundary at which a feature’s variance would be fully (100%) explained by the first two components.

#coeff = np.transpose(pca_2015_model.components_[0:2, :])  # alternative: use the raw components instead of the loadings
coeff = loading_matrix.values
for i in range(8):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1], color = 'k', alpha = 0.9, \
                  linestyle = '-',linewidth = 1.5, overhang=0.2)
        plt.text(coeff[i,0]*1.25, coeff[i,1]*1.25, \
                 df_2015.columns.values[1:-1][i], color = 'k', ha = 'center', \
                 va = 'center',fontsize=10)
circle1 = plt.Circle((0, 0), 1, color='grey',fill=False)
plt.gca().add_patch(circle1)
plt.plot([0, 0], [-1, 1], color='grey', linestyle='-', linewidth=1)
plt.plot([-1, 1], [0, 0], color='grey', linestyle='-', linewidth=1)

plt.gca().set_box_aspect(1)

plt.xlabel(f"PC1 ({pca_2015_model.explained_variance_ratio_[0]*100:.2f}%)")
plt.ylabel(f"PC2 ({pca_2015_model.explained_variance_ratio_[1]*100:.2f}%)")

plt.yticks(ticks=[-1,-0.5,0,0.5,1])
plt.title("PCA loading for the 2015 data")

plt.show()
[Figure: PCA loading plot for the 2015 data, with unit circle and labelled loading arrows]

Compare the above graph with Figure 1B of this manuscript. For a discussion of the results, see the supplementary material.

Exercise#

Create a figure showing PCA loadings for the 2014 data.