Shiyuan Li

1. Letter data

First, I use a data set of hand-written letters from the UCI Machine Learning Repository to illustrate the use of classification methods. The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The data contain 1 categorical variable and 16 numerical variables, which need some explanation. The following picture shows the description of the variables.
We first write a Bayes classifier function.

In this function, we first shuffle the 20000 samples and then split them into training and testing data. Here, “training_size” is the size of the training set, and “number_you_like” is the seed we set before shuffling the data.
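As a rough Python sketch of only the shuffle-and-split step described above: the function name `split_train_test` and the placeholder rows are hypothetical, and `training_size` and `seed` simply mirror the two parameters mentioned in the text. The original function was written in R, so the exact ordering produced by a given seed will differ.

```python
import random

def split_train_test(rows, training_size, seed):
    """Shuffle a copy of rows with a fixed seed, then split it in two."""
    rng = random.Random(seed)   # local generator, so the split is reproducible
    shuffled = list(rows)       # copy: the caller's data is left untouched
    rng.shuffle(shuffled)
    return shuffled[:training_size], shuffled[training_size:]

# Stand-in for the 20000 letter records (hypothetical placeholder values).
rows = list(range(20000))
train, test = split_train_test(rows, training_size=16000, seed=1)
print(len(train), len(test))  # 16000 4000
```

Using a local `random.Random(seed)` rather than the global generator keeps the split reproducible without affecting other random draws in the same program.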
We choose a training size of 16000 and, when shuffling the data, call set.seed(1). Our output is as follows:
From the output, we see that the Bayes classifier is not powerful for this data set, so we try an SVM to see whether it gives a better classifier.
We write an SVM classifier function and, as with the Bayes one, first shuffle the data and then choose the training sample.
We again use a training size of 16000 and set.seed(1). Our output is as follows:
From the result, we see that the SVM is much more powerful than the Bayes classifier.

2. Text Classification

2.1 Introduction

The idea of text classification is this: given a set of articles of different kinds, such as news and scientific articles, we can use them as a training set to classify future articles. Here, I use 5 news groups from the BBC to illustrate some methods for text classification.

2.2 Classification

In order to do classification, we need to translate our text files into an \(n\times p\) matrix, where \(n\) denotes the size of our training data and \(p\) is the number of features used to describe one text file. Thus, our first goal is to transform the text into a matrix; we then use SVM or KNN to construct a classifier.

2.3 Pre-processing of the text

In this experiment, we mainly use Python 2.7 to process the data and R to construct the classifiers and make predictions.
First, we need to set the path of our data set and select the classes to classify.

import os
folder_path = '/Users/shiyuanli/Desktop/bbc'
# Skip the first directory entry and keep the next five as the classes.
classes = os.listdir(folder_path)[1:6]

Here the news groups are ’business’, ’entertainment’, ’politics’, ’tech’, and ’sport’.
Next, we read all files from each class and remove any characters that are not valid UTF-8. We then randomly select 500 news articles as our testing data and use the rest as our training data.

from random import shuffle
x = corpus
shuffle(x)  # shuffle in place so that the last 500 items are a random sample
testing_data = x[-500:]
training_data = x[0:(len(x)-len(testing_data))]
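The cleaning step mentioned above is not shown. One way it might look is sketched below, under the assumption that each file is read as raw bytes and invalid sequences are simply dropped. The helper name `clean_bytes` is hypothetical, and the sketch uses Python 3 rather than the Python 2.7 used in the original experiment.

```python
def clean_bytes(raw):
    """Decode raw bytes as UTF-8, silently dropping invalid byte sequences."""
    return raw.decode('utf-8', errors='ignore')

# b'\xe9' is the Latin-1 encoding of "é", which is not valid UTF-8 on its own.
sample = b'caf\xe9 ok'
cleaned = clean_bytes(sample)
print(cleaned)  # caf ok
```

Dropping bytes loses a character here and there; passing `errors='replace'` instead would keep a visible marker where something was removed.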

Now, we use the training data as our corpus and select the 3000 most frequent words as the vocabulary used to represent each news article. We can then transform the training data into a \(1721\times 3000\) data matrix and the testing data into a \(500\times 3000\) data matrix as follows:

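from sklearn.feature_extraction.text import CountVectorizer
# A fixed vocabulary gives the training and testing matrices identical columns.
count_v1 = CountVectorizer(stop_words = 'english', vocabulary = vocabulary1)
counts_train = count_v1.fit_transform(corpus_adj_3)
count_v2 = CountVectorizer(vocabulary = vocabulary1)
counts_test = count_v2.fit_transform(testing_data_2)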
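The construction of the vocabulary itself is not shown. The following pure-Python sketch illustrates the idea on a toy corpus: count word frequencies, keep the most frequent words, and then count those words in each document. The function names, the toy documents, and the simple `[a-z]+` tokenizer are assumptions made for illustration; unlike the code above, the sketch does not remove stop words, and its tokenization is simpler than what `CountVectorizer` does.

```python
from collections import Counter
import re

def tokenize(doc):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", doc.lower())

def build_vocabulary(docs, size):
    """Return the `size` most frequent tokens across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(size)]

def count_matrix(docs, vocabulary):
    """One row per document, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

# Toy corpus (hypothetical); the article uses 3000 words, here only 3.
docs = ["the market fell as the bank cut rates",
        "the team won the match",
        "the bank raised rates"]
vocab = build_vocabulary(docs, size=3)
matrix = count_matrix(docs, vocab)
print(vocab)
print(matrix)
```

With \(n\) documents and a vocabulary of \(p\) words, `matrix` has \(n\) rows and \(p\) columns, which is the \(n\times p\) layout described in Section 2.2.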

2.4 Basic Experiments

Now that we have our data matrices, we can load them into RStudio and train a classifier.

train_data = read.csv("train.csv",header = FALSE)
test_data = read.csv("test.csv",header = FALSE)

Here, the train data is our \(1721\times 3000\) data matrix, where each row is one news article, and the test data is our \(500\times 3000\) data matrix.

Linear SVM

We first perform 10-fold cross-validation to find the best slack parameter (cost), and then make predictions for the test data using the best model.

tune.out = tune.svm(train_class_2~., data = train, kernel = 'linear',
           cost = c(0.01,0.1,0.5,1,10,100), tunecontrol = tune.control(cross = 10))
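To make the idea of 10-fold cross-validation concrete, here is a small Python sketch of how the training rows might be partitioned into folds. It only illustrates the splitting; `tune.svm` performs its own partitioning internally, and the function name `k_fold_indices` and the seed are hypothetical.

```python
import random

def k_fold_indices(n, k, seed):
    """Split indices 0..n-1 into k shuffled folds; yield (train, validation)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation

splits = list(k_fold_indices(n=1721, k=10, seed=1))
print(len(splits))                            # 10
print(len(splits[0][0]), len(splits[0][1]))   # 1548 173
```

For each candidate cost, one would fit the model on `train`, score it on `validation`, average the score over the 10 folds, and keep the cost with the best average.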

Next, we construct an SVM classifier using the linear kernel and the slack parameter 0.1, and get the confusion matrix of the result.

ptm <- proc.time()
classifier = ksvm(train_class_2~ ., data = train, kernel ="vanilladot",C=0.1)
prediction = predict(classifier, test)
proc.time() - ptm
con_mat = table(prediction,test_class_2)

prediction      business entertainment politics sport tech
  business           112             2        3     0    1
  entertainment        1            71        2     0    1
  politics             4             1       87     0    0
  sport                1             0        2   118    1
  tech                 1             1        1     0   90

Naive Bayes Classifier


ptm <- proc.time()
classifier_kde = naiveBayes(train_class_2 ~ ., data = train, usekernel = TRUE)
prediction_kde = predict(classifier_kde, test_dat)
proc.time() - ptm
con_mat_kde = table(prediction_kde,test_class_2)

prediction_kde  business entertainment politics sport tech
  business            79             0        3     0    2
  entertainment        1            34        0     1    2
  politics             3             0       68     0    4
  sport               35            40       24   117   27
  tech                 1             1        0     0   58

2.5 K Nearest Neighbor

knn_pred = knn(
  train = train_data,
  test = test_dat,
  cl = train_class_2,
  k = 6
)
knn_con = table(true = test_class_2, model = knn_pred)

true            business entertainment politics sport tech
  business            71             1        1    45    1
  entertainment        2            42        0    31    0
  politics             6             4       64    20    1
  sport                0             0        0   118    0
  tech                 4             7        2    35   45
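For readers unfamiliar with the method, the following is a minimal Python sketch of what a k-nearest-neighbor prediction does: find the k training rows closest to a query row and take a majority vote over their labels. The function name, the toy count vectors, and the use of Euclidean distance with first-seen tie-breaking are assumptions for illustration and are not taken from the R `knn` implementation used above.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Label `query` by majority vote among its k nearest training rows."""
    scored = []
    for row, label in zip(train_x, train_y):
        # Squared Euclidean distance gives the same ordering as the distance.
        squared = sum((a - b) ** 2 for a, b in zip(row, query))
        scored.append((squared, label))
    scored.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Toy word-count vectors (hypothetical): two features, two classes.
train_x = [[3, 0], [2, 1], [0, 3], [1, 2]]
train_y = ['business', 'business', 'sport', 'sport']
print(knn_predict(train_x, train_y, [2, 0], k=3))  # business
print(knn_predict(train_x, train_y, [0, 2], k=3))  # sport
```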

Now, we get the accuracy of each classification method.

               Linear SVM   Naive Bayes    KNN
Accuracy (%)         95.6          71.2   68.0
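Each accuracy is the number of correctly classified test articles (the diagonal of the confusion matrix) divided by the total number of test articles. As a short Python sketch, using the linear SVM and naive Bayes confusion matrices printed above (the function name `accuracy` is a hypothetical helper):

```python
def accuracy(confusion):
    """Fraction of cases on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Rows: predicted class; columns: true class (values copied from above).
svm = [[112, 2, 3, 0, 1],
       [1, 71, 2, 0, 1],
       [4, 1, 87, 0, 0],
       [1, 0, 2, 118, 1],
       [1, 1, 1, 0, 90]]
naive_bayes = [[79, 0, 3, 0, 2],
               [1, 34, 0, 1, 2],
               [3, 0, 68, 0, 4],
               [35, 40, 24, 117, 27],
               [1, 1, 0, 0, 58]]
print(round(100 * accuracy(svm), 1))          # 95.6
print(round(100 * accuracy(naive_bayes), 1))  # 71.2
```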

2.6 Using PCA to reduce dimensions

Since constructing a classifier and making predictions takes a long time, we want to try dimension reduction methods to see how much time they save and how they affect the prediction error.
In order to select the features, we need to find how many eigenvalues are needed to account for more than 96% of the total variance.

cov_mat = cov(train_data)
eig = eigen(cov_mat)
vectors = eig$vectors
total_eigen = eig$values
for (i in 1:3000){
  if (sum(total_eigen[1:i])/sum(total_eigen) > 0.96){