A partial code implementation of principal component analysis theory, based on sklearn

Theoretical part

Feature dimensionality reduction

Feature dimensionality reduction is an application of unsupervised learning: it reduces n-dimensional data to m-dimensional data (n > m). It can be applied to data compression and related fields; in this article, for example, 64-feature digit images are compressed to 20 features.

Principal component analysis (PCA)

Principal component analysis is a commonly used feature dimensionality reduction method. For n-dimensional data $A$, dimensionality reduction produces m-dimensional data $B$ (n > m) satisfying $B = f(A)$ and $A \approx g(f(A))$, where $f$ is the encoding function and $g$ is the decoding function.
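
As a concrete illustration (a minimal sketch on synthetic data, not part of the original article), sklearn's PCA exposes the encoder as fit_transform and the decoder as inverse_transform:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
A = rng.rand(100, 5)                 # 100 samples in 5 dimensions

pca = PCA(n_components=2)
B = pca.fit_transform(A)             # B = f(A): the 2-dimensional encoding
A_hat = pca.inverse_transform(B)     # g(f(A)): the reconstruction

print(B.shape)                       # (100, 2)
print(np.abs(A - A_hat).max())       # reconstruction error; A is only approximately recovered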

When performing principal component analysis, the optimization goal is $c^* = \underset{c}{\arg\min} \lVert x - g(c) \rVert_{2}$, where $c$ is the encoding of a data point $x$ and $g$ is the decoding function.
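
For the linear decoder used by PCA, $g(c) = Dc$ with $D$ having orthonormal columns, this objective has a closed-form solution (a standard derivation, sketched here; it is not spelled out in the original text):

$$
\begin{aligned}
c^* &= \underset{c}{\arg\min}\; \lVert x - Dc \rVert_2^2
     = \underset{c}{\arg\min}\; \bigl( x^\top x - 2\,x^\top D c + c^\top D^\top D c \bigr) \\
    &= \underset{c}{\arg\min}\; \bigl( -2\,x^\top D c + c^\top c \bigr)
     \qquad \text{since } D^\top D = I .
\end{aligned}
$$

Setting the gradient with respect to $c$ to zero gives $-2D^\top x + 2c = 0$, i.e. $c^* = D^\top x$: the optimal encoding is a single matrix multiplication.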

Code

Import data set

import numpy as np
import pandas as pd

# Load the UCI Optical Recognition of Handwritten Digits data set;
# each row holds 64 pixel features followed by the digit label in column 64
digits_train = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra', header=None)
digits_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes', header=None)
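
A quick sanity check (my addition; the shapes in the comments are what the UCI files should contain) confirms the layout that the split below relies on, 64 feature columns plus a label column at index 64:

print(digits_train.shape)   # (3823, 65): 64 pixel features + 1 label column
print(digits_test.shape)    # (1797, 65)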

Split data and labels

# Columns 0-63 are the pixel features; column 64 is the digit label
train_x, train_y = digits_train[np.arange(64)], digits_train[64]
test_x, test_y = digits_test[np.arange(64)], digits_test[64]

Principal component analysis

from sklearn.decomposition import PCA

# Reduce the 64-dimensional pixel features to 20 principal components
estimator = PCA(n_components=20)
pca_train_x = estimator.fit_transform(train_x)  # fit on the training data, then encode it
pca_test_x = estimator.transform(test_x)        # encode the test data with the same components
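
To check how much of the original variance the 20 components retain, one can inspect the fitted estimator; explained_variance_ratio_ is a standard attribute of sklearn's PCA (this check is my addition, not in the original article):

# Fraction of the total variance captured by the 20 retained components
print(estimator.explained_variance_ratio_.sum())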

Training a support vector machine

from sklearn.svm import LinearSVC

Raw data

# Baseline: train a linear SVM on the raw 64-dimensional features
svc = LinearSVC()
svc.fit(X=train_x, y=train_y)
svc.score(test_x, test_y)
0.9393433500278241

PCA-processed data

# Train the same model on the 20-dimensional PCA-encoded features
svc_pca = LinearSVC()
svc_pca.fit(pca_train_x, train_y)
svc_pca.score(pca_test_x, test_y)
0.91819699499165275
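
So the PCA-compressed features lose roughly two percentage points of accuracy while keeping fewer than a third of the original 64 dimensions. To tie this back to the $A \approx g(f(A))$ relation from the theory section, one can also measure the reconstruction error directly (a sketch of my own, using PCA's inverse_transform):

import numpy as np

# Decode the 20-dimensional codes back to 64 dimensions and measure
# the mean absolute reconstruction error per feature value
reconstructed = estimator.inverse_transform(pca_train_x)   # g(f(A))
print(np.mean(np.abs(train_x.values - reconstructed)))
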
Reference: sklearn-based partial code implementation of principal component analysis theory, Tencent Cloud Developer Community, https://cloud.tencent.com/developer/article/1110770