Statistics LibreTexts

8.2: A.2- Describing...

  • First, look at the basic characteristics of every character:

    Code \(\PageIndex{1}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    summary(data)

    Since SEX and COLOR are categorical, the summary statistics for these columns make no sense, so you may want to convert them into “true” categorical data. There are several ways to do this, but the simplest is conversion into a factor:

    Code \(\PageIndex{2}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    data1 <- data
    data1$SEX <- factor(data1$SEX, labels=c("female", "male"))
    data1$COLOR <- factor(data1$COLOR, labels=c("red", "blue", "green"))

    (To retain the original data, we first copied it into a new object, data1. Please check the result with summary() yourself.)
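
    One way to run that check is sketched below. Since data/bugs.txt may not be at hand, the sketch builds a small made-up data frame, toy, with the same column names; its values are invented for illustration and are not the real bug data. It assumes the same coding as above (0/1 for SEX, 1/2/3 for COLOR).

    ```r
    # Hypothetical stand-in for bugs.txt (invented values, same column names)
    toy <- data.frame(SEX = c(0, 1, 0, 1, 1),
                      COLOR = c(1, 1, 2, 3, 2),
                      WEIGHT = c(10.1, 9.8, 10.4, 11.0, 9.5),
                      LENGTH = c(9.4, 10.6, 10.4, 8.9, 10.1))

    # Same conversion as in the text, applied to a copy
    toy1 <- toy
    toy1$SEX <- factor(toy1$SEX, labels = c("female", "male"))
    toy1$COLOR <- factor(toy1$COLOR, labels = c("red", "blue", "green"))

    str(toy1)          # SEX and COLOR are now reported as factors
    summary(toy1)      # factors are summarized as counts per level
    levels(toy1$SEX)   # the labels, in the order of the original codes
    ```

    Note that labels= are assigned in the sorted order of the original codes, so this works only if every code actually occurs in the column.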

    The summary() command applies not only to a whole data frame but also to individual characters (or variables, or columns):

    Code \(\PageIndex{3}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    summary(data$WEIGHT)

    It is possible to calculate characteristics from summary() one by one. Maximum and minimum:

    Code \(\PageIndex{4}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    min(data$WEIGHT)
    max(data$WEIGHT)

    ... median:

    Code \(\PageIndex{5}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    median(data$WEIGHT)

    ... mean for WEIGHT and for each character:

    Code \(\PageIndex{6}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    mean(data$WEIGHT)

    and

    Code \(\PageIndex{7}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    colMeans(data)

    ... and also round the result to one decimal place:

    Code \(\PageIndex{8}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    round(colMeans(data), 1)

    (Again, the output of colMeans() makes no sense for SEX and COLOR.)

    Unfortunately, the commands above (but not summary()) return NA instead of a number if the data contain missing values (NA):

    Code \(\PageIndex{9}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    data2 <- data
    data2[3, 3] <- NA
    mean(data2$WEIGHT)

    To calculate the mean while ignoring missing data, enter

    Code \(\PageIndex{10}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    data2 <- data
    data2[3, 3] <- NA                  # re-create the missing value
    mean(data2$WEIGHT, na.rm=TRUE)     # na.rm=TRUE drops NA before averaging

    Another way is to remove rows with NA from the data with:

    Code \(\PageIndex{11}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    data2 <- data
    data2[3, 3] <- NA                  # re-create the missing value
    data2.o <- na.omit(data2)          # drop every row that contains NA

    Then, data2.o will be free from missing values.
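
    The effect of na.omit() can be checked with a short sketch like the one below. It uses a small invented data frame, toy2, rather than the real bug data.

    ```r
    # Hypothetical data with one missing weight (invented values)
    toy2 <- data.frame(SEX = c(0, 1, 0, 1),
                       WEIGHT = c(10.1, 9.8, NA, 11.0))

    sum(is.na(toy2$WEIGHT))    # how many weights are missing
    complete.cases(toy2)       # FALSE marks rows that contain NA

    toy2.o <- na.omit(toy2)
    nrow(toy2)                 # rows before
    nrow(toy2.o)               # rows after dropping incomplete rows
    anyNA(toy2.o)              # FALSE when no missing values remain

    # Both approaches give the same mean for this column
    all.equal(mean(toy2.o$WEIGHT), mean(toy2$WEIGHT, na.rm = TRUE))
    ```

    Keep in mind that na.omit() removes whole rows, so a missing value in one column also discards the valid values of the other columns in that row, whereas na.rm=TRUE affects only the calculation at hand.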

    Sometimes, you need to calculate the sum of all values of a character:

    Code \(\PageIndex{12}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    sum(data$WEIGHT)

    ... or the sum of all values in one row (we will try the second row):

    Code \(\PageIndex{13}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    sum(data[2, ])

    ... or the sum of all values for every row:

    Code \(\PageIndex{14}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    apply(data, 1, sum)

    (These summarizing exercises are here for training purposes only.)
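
    For row and column sums, base R also provides the dedicated functions rowSums() and colSums(). The sketch below, on an invented data frame toy, suggests that rowSums() gives the same numbers as the apply() call above.

    ```r
    # Hypothetical all-numeric data frame (invented values)
    toy <- data.frame(SEX = c(0, 1, 0),
                      COLOR = c(1, 1, 2),
                      WEIGHT = c(10.1, 9.8, 10.4),
                      LENGTH = c(9.4, 10.6, 10.4))

    by_apply <- apply(toy, 1, sum)   # sum over each row, as in the text
    by_rowsums <- rowSums(toy)       # dedicated function for the same task

    all.equal(unname(by_apply), unname(by_rowsums))
    colSums(toy)                     # one sum per column
    ```

    Both rowSums() and colSums() accept na.rm=TRUE, which is convenient when the data contain missing values.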

    For categorical data, it is sensible to look at how many times each value appears in the data file (this also helps to find out all values of the character):

    Code \(\PageIndex{15}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    table(data$SEX)
    table(data$COLOR)

    Now transform the frequencies into percentages (100% is the total number of bugs):

    Code \(\PageIndex{16}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    100*prop.table(table(data$SEX))

    One of the most important characteristics of data variability is the standard deviation:

    Code \(\PageIndex{17}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    sd(data$WEIGHT)
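
    To see what sd() computes, the sample standard deviation can be reproduced by hand: the square root of the sum of squared deviations from the mean, divided by n - 1. The sketch below uses an invented vector of weights, w, not the real data.

    ```r
    w <- c(10.1, 9.8, 10.4, 11.0, 9.5)   # invented weights

    n <- length(w)
    sd_manual <- sqrt(sum((w - mean(w))^2) / (n - 1))

    sd_manual
    all.equal(sd_manual, sd(w))          # sd() uses the same n - 1 denominator
    ```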

    Calculate standard deviation for each numerical column (columns 3 and 4):

    Code \(\PageIndex{18}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    sapply(data[, 3:4], sd)

    If you want to do the same for data with a missing value, you need something like:

    Code \(\PageIndex{19}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    data2 <- data
    data2[3, 3] <- NA                       # re-create the missing value
    sapply(data2[, 3:4], sd, na.rm=TRUE)    # na.rm=TRUE is passed on to sd()

    Calculate also the coefficient of variation (CV):

    Code \(\PageIndex{20}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    100*sd(data$WEIGHT)/mean(data$WEIGHT)
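
    If the CV is needed repeatedly, the formula can be wrapped in a small function. The helper cv() below is a hypothetical name introduced here for illustration (it is not a base R function), and the weights are invented.

    ```r
    # Hypothetical helper: coefficient of variation in percent
    cv <- function(x, na.rm = FALSE) {
      100 * sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
    }

    w <- c(10.1, 9.8, NA, 11.0, 9.5)   # invented weights with one NA

    cv(w)                  # NA, because of the missing value
    cv(w, na.rm = TRUE)    # CV of the non-missing weights
    ```

    Passing na.rm to both sd() and mean() keeps the two parts of the ratio based on the same set of values.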

    We can calculate any characteristic separately for males and females. Means for insect weights:

    Code \(\PageIndex{21}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    tapply(data$WEIGHT, data$SEX, mean)

    How many individuals of each color are among males and females?

    Code \(\PageIndex{22}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    table(data$COLOR, data$SEX)

    (Rows are colors, columns are males and females.)

    Now the same in percentages:

    Code \(\PageIndex{23}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    100*prop.table(table(data$COLOR, data$SEX))
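
    In the call above, 100% is the total number of bugs. If percentages within each sex or within each color are wanted instead, prop.table() takes a margin argument. The sketch below uses an invented data frame toy with the same 0/1 and 1/2/3 coding.

    ```r
    # Hypothetical data (invented values)
    toy <- data.frame(SEX = c(0, 1, 0, 1, 1, 0),
                      COLOR = c(1, 1, 2, 3, 2, 1))

    tab <- table(toy$COLOR, toy$SEX)   # rows are colors, columns are sexes

    overall <- 100 * prop.table(tab)                    # all cells sum to 100
    within_sex <- 100 * prop.table(tab, margin = 2)     # each column sums to 100
    within_color <- 100 * prop.table(tab, margin = 1)   # each row sums to 100

    sum(overall)
    colSums(within_sex)
    rowSums(within_color)
    ```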

    Finally, calculate mean values of weight separately for every combination of color and sex (i.e., for red males, red females, green males, green females, and so on):

    Code \(\PageIndex{24}\) (R):

    data <- read.table("data/bugs.txt", h=TRUE)
    tapply(data$WEIGHT, list(data$SEX, data$COLOR), mean)
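
    The tapply() call above returns a table with sexes as rows and colors as columns. As an alternative sketch, aggregate() returns the same group means in "long" form, one row per observed combination. The data frame toy below is invented for illustration.

    ```r
    # Hypothetical data (invented values)
    toy <- data.frame(SEX = c(0, 1, 0, 1, 1, 0),
                      COLOR = c(1, 1, 2, 3, 2, 1),
                      WEIGHT = c(10.1, 9.8, 10.4, 11.0, 9.5, 10.3))

    # Wide form: a sex-by-color matrix, NA where a combination is absent
    wide <- tapply(toy$WEIGHT, list(toy$SEX, toy$COLOR), mean)
    wide

    # Long form: a data frame listing only combinations present in the data
    long <- aggregate(WEIGHT ~ SEX + COLOR, data = toy, FUN = mean)
    long
    ```

    The wide form is easier to read as a table, while the long form is often more convenient for further calculations.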