Shiyuan Li

1. Letter data

Firstly, I use a data set of hand-written letters from the UCI Machine Learning Repository to illustrate the use of classification methods. The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The data contains 1 categorical variable (the letter) and 16 numerical variables. These variables need a clearer explanation; the following picture shows their description.
Here, we first write a Bayes Classifier function.

In this function, we first shuffle the 20000 samples and then split them into training and testing data. Here, “training_size” is the size of the training data, and “number_you_like” is the seed we set before shuffling the data.
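The shuffle-and-split step can be sketched in Python as follows. This is a minimal illustration, not the R function used here: the name `shuffle_split` and the toy records are assumptions, and a Python seed does not reproduce the ordering that R's set.seed produces.

```python
import random


def shuffle_split(rows, training_size, seed):
    """Shuffle a copy of rows with a fixed seed, then split into train/test."""
    if not 0 <= training_size <= len(rows):
        raise ValueError("training_size must be between 0 and len(rows)")
    shuffled = list(rows)          # copy so the caller's data is untouched
    rng = random.Random(seed)      # private generator: same seed, same order
    rng.shuffle(shuffled)
    return shuffled[:training_size], shuffled[training_size:]


# Toy stand-in for the letter records: (label, 16 features) pairs.
records = [(chr(ord("A") + i % 26), [i % 16] * 16) for i in range(100)]
train, test = shuffle_split(records, training_size=80, seed=1)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, so two classifiers can be compared on exactly the same training and testing samples.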
We choose a training size of 16000 and shuffle the data with set.seed(1). The output is as follows:
From the output, we see that the Bayes classifier is not powerful for this data set, so we try an SVM to see whether we can find a better classifier.
We write an SVM classifier function that, like the Bayes one, first shuffles the data and then selects the training sample.
We again choose a training size of 16000 and set.seed(1). The output is as follows:
From the result, we see that the SVM is much more powerful than the Bayes classifier.

2. Text Classification

2.1 Introduction

The idea of text classification is this: given a set of articles of different kinds, such as news and scientific articles, we can use them as a training set to classify future articles. Here, I use 5 news groups from the BBC to illustrate some methods for text classification.

2.2 Classification

In order to do classification, we need to translate our text files into an $$n\times p$$ matrix, where n denotes the size of our training data and p denotes the number of features used to describe one text file. Thus, our first goal is to transform the text into a matrix; we then use SVM or KNN to construct a classifier.

2.3 Pre-processing of the text

In this experiment, we mainly use Python 2.7 to process the data and R to construct the classifiers and make predictions.
First, we need to set the path of our data set and select the classes to classify.

import os

folder_path = '/Users/shiyuanli/Desktop/bbc'
# Skip the first directory entry and keep the next five, the class folders.
classes = os.listdir(folder_path)[1:6]

Here the news groups are ’business’, ’entertainment’, ’politics’, ’tech’, and ’sport’.
Next, we read all files from each class and remove characters that are not UTF-8 encoded. We then randomly select 500 news articles as our testing data and use the rest as our training data.

from random import shuffle
x = corpus
shuffle(x)
testing_data = x[-500:]
training_data = x[0:(len(x)-len(testing_data))]

Now, we use the training data as our corpus and select the 3000 most frequent words as the vocabulary used to represent each news article. We can then transform the training data into a $$1721\times 3000$$ data matrix and the testing data into a $$500\times 3000$$ data matrix as follows:

from sklearn.feature_extraction.text import CountVectorizer

count_v1 = CountVectorizer(stop_words = 'english', vocabulary = vocabulary1)

count_v2 = CountVectorizer(vocabulary = vocabulary1)
counts_test = count_v2.fit_transform(testing_data_2)
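How a frequency-based vocabulary turns documents into count matrices can be sketched with the standard library alone. The helper names (`tokenize`, `build_vocabulary`, `count_matrix`) and the toy documents are assumptions for illustration. Unlike the vectorizer call above, this sketch does not remove stop words, and it is not how `vocabulary1` was actually built.

```python
from collections import Counter
import re

TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN.findall(text.lower())


def build_vocabulary(documents, size):
    """Return the `size` most frequent tokens across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(tokenize(doc))
    return [word for word, _ in totals.most_common(size)]


def count_matrix(documents, vocabulary):
    """Return an n x p list of lists: one row per document, one column per word."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows


# Toy corpus: the vocabulary comes from the training documents only.
train_docs = ["shares rise and markets rally", "markets fall and shares slide"]
test_docs = ["team wins cup final", "shares rally"]
vocab = build_vocabulary(train_docs, size=3)
train_matrix = count_matrix(train_docs, vocab)
test_matrix = count_matrix(test_docs, vocab)
print(vocab)
print(train_matrix)
print(test_matrix)
```

Using the same vocabulary for both matrices keeps their columns aligned, which is what lets a classifier trained on one matrix predict on the other.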

2.4 Basic Experiments

Now that we have our data matrices, we can load them into RStudio and train a classifier.

setwd("/Users/shiyuanli/Desktop")

Here, the train data is our $$1721\times 3000$$ data matrix, where each row is one news article, and the test data is our $$500\times 3000$$ data matrix.

Linear SVM

We first perform 10-fold cross-validation to choose the best slack parameter (cost), and then make a prediction for the test data using the best model.
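The mechanics of k-fold cross-validation over a grid of cost values can be sketched in Python as follows. The names `k_fold_indices`, `best_cost` and `error_fn` are hypothetical, and `toy_error` is only a placeholder for "train on the training folds with this cost and return the validation error"; no real model is fitted here.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, valid_idx) pairs that partition range(n) into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal folds
    for i, valid in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


def best_cost(costs, n, k, error_fn):
    """Return the cost with the lowest mean validation error over k folds."""
    def mean_error(cost):
        errors = [error_fn(cost, tr, va) for tr, va in k_fold_indices(n, k)]
        return sum(errors) / len(errors)
    return min(costs, key=mean_error)


def toy_error(cost, train_idx, valid_idx):
    """Placeholder error function, NOT a real classifier."""
    return abs(cost - 0.1)


fold_sizes = [len(valid) for _, valid in k_fold_indices(1721, 10)]
chosen = best_cost([0.01, 0.1, 0.5, 1, 10, 100], n=1721, k=10, error_fn=toy_error)
print(fold_sizes, chosen)
```

Each sample is used for validation exactly once, so the mean error over the folds uses all of the training data without ever scoring a model on samples it was fitted to. The R call that performs this tuning is: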

tune.out = tune.svm(train_class_2~., data = train, kernel = 'linear',
cost = c(0.01,0.1,0.5,1,10,100), tunecontrol = tune.control(cross = 10))
tune.out$best.model

Next, we construct an SVM classifier, using the linear kernel and the slack parameter 0.1, and get the confusion matrix of the result.

ptm <- proc.time()
classifier = ksvm(train_class_2~ ., data = train, kernel ="vanilladot",C=0.1)
prediction = predict(classifier, test)
proc.time() - ptm
con_mat = table(prediction,test_class_2)
con_mat

               test_class_2
prediction      business entertainment politics sport tech
  business           112             2        3     0    1
  entertainment        1            71        2     0    1
  politics             4             1       87     0    0
  sport                1             0        2   118    1
  tech                 1             1        1     0   90

Naive Bayes Classifier

ptm <- proc.time()
classifier_kde = naiveBayes(train_class_2 ~ ., data = train, usekernel = TRUE)
prediction_kde = predict(classifier_kde, test_dat)
proc.time() - ptm
con_mat_kde = table(prediction_kde,test_class_2)

               test_class_2
prediction_kde  business entertainment politics sport tech
  business            79             0        3     0    2
  entertainment        1            34        0     1    2
  politics             3             0       68     0    4
  sport               35            40       24   117   27
  tech                 1             1        0     0   58

2.5 K Nearest Neighbor

knn_pred = knn( train = train_data, test = test_dat, cl = train_class_2, k = 6 )
knn_con = table(true = test_class_2, model = knn_pred)

               model
true            business entertainment politics sport tech
  business            71             1        1    45    1
  entertainment        2            42        0    31    0
  politics             6             4       64    20    1
  sport                0             0        0   118    0
  tech                 4             7        2    35   45

Now, we get the accuracy of each classification method.

Accuracy (%)   Linear SVM   Naive Bayes   KNN
               95.6         71.2          72

2.6 Using PCA to reduce dimensions

Since constructing a classifier and making predictions takes a long time, we want to try dimension reduction methods to see how much time they save and how they affect the prediction error. In order to select the features, we need to find how many eigenvalues are needed to explain more than 96% of the total variance.

cov_mat = cov(train_data)
eig = eigen(cov_mat)
vectors = eig$vectors
total_eigen = eig$values
for (i in 1:3000){
  if (sum(total_eigen[1:i])/sum(total_eigen) > 0.96){
    print(i)
    break
  }
}
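The loop above stops at the first i for which the i largest eigenvalues account for more than 96% of the sum of all eigenvalues. The same rule can be sketched in Python on a toy spectrum; the function name `components_for_variance` and the eigenvalues below are illustrative assumptions, not values from the BBC data.

```python
def components_for_variance(eigenvalues, threshold):
    """Return the smallest i such that the i largest eigenvalues explain
    more than `threshold` of the total variance (strict '>', as in the R loop)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        if running / total > threshold:
            return i
    return len(ordered)


# Toy spectrum: cumulative shares are 0.4, 0.7, 0.9, 1.0.
toy_eigenvalues = [4.0, 3.0, 2.0, 1.0]
print(components_for_variance(toy_eigenvalues, 0.85))  # prints 3
```

Keeping only the eigenvectors that belong to those leading eigenvalues gives a lower-dimensional representation that retains the chosen share of the variance.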