Statistics LibreTexts

Shiyuan Li


    1. Letter data

    First, I use a data set of hand-written letters from the UCI Machine Learning Repository to illustrate the use of classification methods. The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters of the English alphabet. The data contain 1 categorical variable (the letter) and 16 numerical variables. These variables need some explanation; the following picture describes them.
    We first write a Bayes classifier function.

    In this function, we first shuffle the 20000 samples and then split them into training and testing data. Here, “training_size” is the size of the training set, and “number_you_like” is the seed passed to set.seed() before shuffling.
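
    The shuffle-and-split step described above can be sketched in a few lines. This is only an illustration in Python 3, not the original R function: the name `split_train_test` and the placeholder records are assumptions made for the example.

```python
import random

def split_train_test(rows, training_size, seed):
    """Shuffle a copy of rows with a fixed seed, then split it in two."""
    rng = random.Random(seed)   # local generator, so the global state is untouched
    shuffled = list(rows)       # copy: the caller's data keeps its order
    rng.shuffle(shuffled)
    return shuffled[:training_size], shuffled[training_size:]

# Stand-in for the 20000 letter records (hypothetical placeholder values).
records = list(range(20000))
train, test = split_train_test(records, training_size=16000, seed=1)
print(len(train), len(test))
```

    Fixing the seed makes the split reproducible, so two classifiers can be compared on exactly the same training and testing samples.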
    We choose a training set of 16000 samples and call set.seed(1) before shuffling. Our output is as follows:
    From the output, we see that the Bayes classifier is not powerful for this data set. So, we try an SVM to see whether we can find a better classifier.
    We write an SVM classifier function and, as with the Bayes one, shuffle the data first and then choose our sample.
    We again use a training set of 16000 samples and set.seed(1). Our output is as follows:
    From the result, we see that the SVM is much more powerful than the Bayes classifier.

    2. Text Classification

    2.1 Introduction

    The idea of text classification is this: given a set of articles of different kinds, such as news and scientific articles, we can use them as a training set to classify future articles. Here, I use 5 news groups from the BBC to illustrate some methods for text classification.

    2.2 Classification

    To do classification, we need to translate our text files into an \(n\times p\) matrix, where \(n\) is the number of documents in the training data and \(p\) is the number of features describing one text file. Thus, our first goal is to transform the text into a matrix; then we use SVM or KNN to construct a classifier.

    2.3 Pre-processing of the text

    In this experiment, we mainly use Python 2.7 to process the data and R to construct the classifiers and make predictions.
    First, we set the path of our data set and select the classes to classify.

    import os

    folder_path = '/Users/shiyuanli/Desktop/bbc'
    classes = os.listdir(folder_path)[1:6]  # skip the first directory entry, keep the next five

    Here the news groups are ’business’, ’entertainment’, ’politics’, ’tech’, and ’sport’.
    Next, we read all files from each class and remove any characters that are not valid UTF-8. Then we randomly select 500 news articles as our testing data and use the rest as our training data.

    from random import shuffle
    x = corpus
    shuffle(x)  # shuffle in place so the 500 test articles are a random sample
    testing_data = x[-500:]
    training_data = x[:-500]
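
    The cleaning step mentioned above (dropping characters that are not valid UTF-8) is not shown in the original code. One possible way to do it, sketched in Python 3 under the assumption that each file is read as raw bytes, is to decode with `errors="ignore"`. The helper name `clean_utf8` and the sample bytes are made up for this example.

```python
def clean_utf8(raw_bytes):
    """Decode bytes as UTF-8, dropping any byte sequence that is not valid UTF-8."""
    return raw_bytes.decode("utf-8", errors="ignore")

# Hypothetical sample: the lone byte \xa3 is not valid UTF-8 on its own.
sample = b"Profits rose 5\xa3 this year"
cleaned = clean_utf8(sample)
print(cleaned)
```

    Dropping the offending bytes loses a little information; `errors="replace"` would keep a visible marker instead, at the cost of introducing a replacement character into the text.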

    Now, we use the training data as our corpus and select the 3000 most frequent words as the vocabulary used to represent each news article. We can then transform the training data into a \(1721\times 3000\) data matrix and the testing data into a \(500\times 3000\) data matrix as follows:

    from sklearn.feature_extraction.text import CountVectorizer
    count_v1 = CountVectorizer(stop_words='english', vocabulary=vocabulary1)
    counts_train = count_v1.fit_transform(corpus_adj_3)
    count_v2 = CountVectorizer(vocabulary=vocabulary1)
    counts_test = count_v2.fit_transform(testing_data_2)
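
    To make the idea of a document-by-word count matrix concrete, here is a minimal sketch in plain Python 3 that does not rely on scikit-learn. It is not the code used in the experiment: the helper names, the simple letters-only tokenizer, and the toy documents are all assumptions, and stop-word removal is left out.

```python
from collections import Counter
import re

def tokenize(text):
    """Lower-case the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents, size):
    """Return the `size` most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    return [word for word, _ in counts.most_common(size)]

def count_matrix(documents, vocabulary):
    """One row per document, one column per vocabulary word."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in tokenize(doc):
            j = index.get(word)
            if j is not None:      # words outside the vocabulary are ignored
                row[j] += 1
        matrix.append(row)
    return matrix

# Toy documents (hypothetical), with the vocabulary built from training text only.
train_docs = ["shares rise as profits rise", "team wins cup final", "profits fall"]
test_docs = ["cup final profits", "unknown words only"]
vocab = build_vocabulary(train_docs, size=4)
X_train = count_matrix(train_docs, vocab)
X_test = count_matrix(test_docs, vocab)
```

    The vocabulary is built from the training documents only and then reused for the testing documents, so both matrices have the same columns.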

    2.4 Basic Experiments

    Now that we have our data matrices, we can load them into RStudio and train a classifier.

    train_data = read.csv("train.csv",header = FALSE)
    test_data = read.csv("test.csv",header = FALSE)

    Here, the train data is our \(1721\times 3000\) data matrix, where each row is one news article, and the test data is our \(500\times 3000\) data matrix.

    Linear SVM

    We first perform 10-fold cross-validation to find the best cost (slack) parameter, and then make predictions for the test data using the best model.

    tune.out = tune.svm(train_class_2~., data = train, kernel = 'linear',
               cost = c(0.01,0.1,0.5,1,10,100), tunecontrol = tune.control(cross = 10))
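
    Conceptually, 10-fold cross-validation splits the training rows into ten parts and holds each part out once for validation; the cost value with the lowest average validation error is then kept. The following Python 3 sketch shows only the index bookkeeping. It is an illustration of the idea, not how the R tuning function is implemented, and the name `k_fold_indices` is an assumption.

```python
import random

def k_fold_indices(n_samples, n_folds, seed):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield training, validation

# 1721 training articles, 10 folds, as in the experiment above.
splits = list(k_fold_indices(n_samples=1721, n_folds=10, seed=1))
for training, validation in splits[:2]:
    print(len(training), len(validation))
```

    Every row is used for validation exactly once, so the averaged error is an estimate based on all 1721 training articles.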

    Next, we construct an SVM classifier using the linear kernel and cost parameter C = 0.1, and get the confusion matrix of the result.

    ptm <- proc.time()
    classifier = ksvm(train_class_2~ ., data = train, kernel ="vanilladot",C=0.1)
    prediction = predict(classifier, test)
    proc.time() - ptm
    con_mat = table(prediction,test_class_2)
    prediction      business entertainment politics sport tech
      business           112             2        3     0    1
      entertainment        1            71        2     0    1
      politics             4             1       87     0    0
      sport                1             0        2   118    1
      tech                 1             1        1     0   90

    Naive Bayes Classifier

    ptm <- proc.time()
    classifier_kde = naiveBayes(train_class_2 ~ ., data = train, usekernel = TRUE)
    prediction_kde = predict(classifier_kde, test_dat)
    proc.time() - ptm
    con_mat_kde = table(prediction_kde,test_class_2)
    prediction_kde  business entertainment politics sport tech
      business            79             0        3     0    2
      entertainment        1            34        0     1    2
      politics             3             0       68     0    4
      sport               35            40       24   117   27
      tech                 1             1        0     0   58

    2.5 K Nearest Neighbor

    knn_pred = knn(
      train = train_data,
      test = test_dat,
      cl = train_class_2,
      k = 6
    )
    knn_con = table(true = test_class_2, model = knn_pred)
    true            business entertainment politics sport tech
      business            71             1        1    45    1
      entertainment        2            42        0    31    0
      politics             6             4       64    20    1
      sport                0             0        0   118    0
      tech                 4             7        2    35   45
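
    The K nearest neighbor rule itself is simple: find the k training rows closest to a test row and take a majority vote over their labels. A minimal Python 3 sketch follows, assuming Euclidean distance; the function name and the tiny two-column data are made up for illustration, and ties are not handled specially.

```python
from collections import Counter
import math

def knn_predict(train_rows, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical word-count vectors with two features each.
train_rows = [[3, 0], [2, 1], [0, 3], [1, 2], [0, 2]]
train_labels = ["business", "business", "sport", "sport", "sport"]
pred_a = knn_predict(train_rows, train_labels, query=[3, 1], k=3)
pred_b = knn_predict(train_rows, train_labels, query=[0, 3], k=3)
print(pred_a, pred_b)
```

    Because every prediction compares the test row with all training rows, the cost grows with both the number of training articles and the number of columns.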

    Now, we get the accuracy of each classification method.

    Method         Accuracy (%)
    Linear SVM     95.6
    Naive Bayes    71.2
    KNN            72
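
    Accuracy is the sum of the diagonal of a confusion matrix divided by the total number of test articles. As a check, the short Python 3 sketch below recomputes it from the Linear SVM and Naive Bayes confusion matrices printed above; the function name `accuracy` is an assumption.

```python
def accuracy(confusion):
    """Fraction of correct predictions: diagonal sum over the grand total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Confusion matrices copied from the Linear SVM and Naive Bayes outputs above.
svm = [
    [112, 2, 3, 0, 1],
    [1, 71, 2, 0, 1],
    [4, 1, 87, 0, 0],
    [1, 0, 2, 118, 1],
    [1, 1, 1, 0, 90],
]
naive_bayes = [
    [79, 0, 3, 0, 2],
    [1, 34, 0, 1, 2],
    [3, 0, 68, 0, 4],
    [35, 40, 24, 117, 27],
    [1, 1, 0, 0, 58],
]
svm_acc = accuracy(svm)          # 478 correct out of 500
nb_acc = accuracy(naive_bayes)   # 356 correct out of 500
print(svm_acc, nb_acc)
```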

    2.6 Using PCA to reduce dimensions

    Since constructing a classifier and making predictions takes a long time, we try dimension reduction to see how much time it saves and how it affects the prediction error.
    To select the features, we need to find how many eigenvalues are needed to account for more than 96% of the total variance:

    cov_mat = cov(train_data)
    eig = eigen(cov_mat)
    vectors = eig$vectors
    total_eigen = eig$values
    for (i in 1:3000){
      if (sum(total_eigen[1:i])/sum(total_eigen) > 0.96){
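
    The selection rule in the loop above (take the smallest number of leading eigenvalues whose cumulative share of the total exceeds 0.96) can be sketched on its own in Python 3. The function name and the toy eigenvalues are assumptions for illustration; the eigenvalues are assumed to be sorted in decreasing order, as R's eigen() returns them for a covariance matrix.

```python
from itertools import accumulate

def components_needed(eigenvalues, threshold):
    """Smallest count of leading eigenvalues whose cumulative share exceeds threshold."""
    total = sum(eigenvalues)
    for count, running in enumerate(accumulate(eigenvalues), start=1):
        if running / total > threshold:
            return count
    return len(eigenvalues)

# Toy eigenvalues (hypothetical), sorted in decreasing order; they sum to 10.
toy = [5.0, 3.0, 1.5, 0.4, 0.1]
n_components = components_needed(toy, threshold=0.96)
print(n_components)
```

    With these toy values the first three eigenvalues cover 95% of the total and the first four cover 99%, so four components are kept.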

    Shiyuan Li is shared under a not declared license and was authored, remixed, and/or curated by LibreTexts.
