# 5.5: Descriptive Statistics Separately for each Group

It is very common to need descriptive statistics broken down by some grouping variable. This is pretty easy to do in R, and there are three functions in particular that are worth knowing about: `by()`, `describeBy()` and `aggregate()`. Let’s start with the `describeBy()` function, which is part of the `psych` package. The `describeBy()` function is very similar to the `describe()` function, except that it has an additional argument called `group` which specifies a grouping variable. For instance, let’s say I want to look at the descriptive statistics for the `clin.trial` data, broken down separately by `therapy` type. The command I would use here is:

    describeBy( x=clin.trial, group=clin.trial$therapy )

```
##
## Descriptive statistics by group
## group: no.therapy
## vars n mean sd median trimmed mad min max range skew kurtosis
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00 -1.81
## therapy* 2 9 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## mood.gain 3 9 0.72 0.59 0.5 0.72 0.44 0.1 1.7 1.6 0.51 -1.59
## se
## drug* 0.29
## therapy* 0.00
## mood.gain 0.20
## --------------------------------------------------------
## group: CBT
## vars n mean sd median trimmed mad min max range skew
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00
## therapy* 2 9 2.00 0.00 2.0 2.00 0.00 2.0 2.0 0.0 NaN
## mood.gain 3 9 1.04 0.45 1.1 1.04 0.44 0.3 1.8 1.5 -0.03
## kurtosis se
## drug* -1.81 0.29
## therapy* NaN 0.00
## mood.gain -1.12 0.15
```

As you can see, the output is essentially identical to the output that the `describe()` function produces, except that it now gives you means, standard deviations, etc. separately for the `CBT` group and the `no.therapy` group. Notice that, as before, the output displays asterisks for factor variables, in order to draw your attention to the fact that the descriptive statistics that it has calculated won’t be very meaningful for those variables. Nevertheless, this command has given us some really useful descriptive statistics for the `mood.gain` variable, broken down as a function of `therapy`.
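To make the idea of "statistics broken down by group" concrete, here is a minimal base-R sketch of the same calculation for `mood.gain` alone. It assumes a small stand-in for the `clin.trial` data frame, with values chosen to be consistent with the output shown in this section; the names `gain.by.therapy`, `group.means` and `group.sds` are purely illustrative.

```r
# Stand-in for the clin.trial data frame (an assumption: the values are
# chosen to be consistent with the output shown in this section).
clin.trial <- data.frame(
  drug = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2),
                levels = c("placebo", "anxifree", "joyzepam")),
  therapy = factor(rep(c("no.therapy", "CBT"), each = 9),
                   levels = c("no.therapy", "CBT")),
  mood.gain = c(0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 1.4, 1.7, 1.3,
                0.6, 0.9, 0.3, 1.1, 0.8, 1.2, 1.8, 1.3, 1.4)
)

# split() cuts the mood.gain vector into one piece per therapy group
gain.by.therapy <- split(clin.trial$mood.gain, clin.trial$therapy)

# sapply() then applies a function to each piece in turn
group.means <- sapply(gain.by.therapy, mean)
group.sds <- sapply(gain.by.therapy, sd)

print(round(group.means, 2))
print(round(group.sds, 2))
```

The rounded means and standard deviations should agree with the `mood.gain` rows of the `describeBy()` output. Functions like `describeBy()` save you from writing this split-and-apply step by hand for every variable and every statistic.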

A somewhat more general solution is offered by the `by()` function. There are three arguments that you need to specify when using this function: the `data` argument specifies the data set, the `INDICES` argument specifies the grouping variable, and the `FUN` argument specifies the name of a function that you want to apply separately to each group. To give a sense of how powerful this is, you can reproduce the output of the `describeBy()` function by using a command like this:

    by( data=clin.trial, INDICES=clin.trial$therapy, FUN=describe )

```
## clin.trial$therapy: no.therapy
## vars n mean sd median trimmed mad min max range skew kurtosis
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00 -1.81
## therapy* 2 9 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## mood.gain 3 9 0.72 0.59 0.5 0.72 0.44 0.1 1.7 1.6 0.51 -1.59
## se
## drug* 0.29
## therapy* 0.00
## mood.gain 0.20
## --------------------------------------------------------
## clin.trial$therapy: CBT
## vars n mean sd median trimmed mad min max range skew
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00
## therapy* 2 9 2.00 0.00 2.0 2.00 0.00 2.0 2.0 0.0 NaN
## mood.gain 3 9 1.04 0.45 1.1 1.04 0.44 0.3 1.8 1.5 -0.03
## kurtosis se
## drug* -1.81 0.29
## therapy* NaN 0.00
## mood.gain -1.12 0.15
```

This produces the same statistics as the `describeBy()` command shown earlier; only the labels that introduce each group differ. However, there’s nothing special about the `describe()` function. You could just as easily use the `by()` function in conjunction with the `summary()` function. For example:

    by( data=clin.trial, INDICES=clin.trial$therapy, FUN=summary )

```
## clin.trial$therapy: no.therapy
## drug therapy mood.gain
## placebo :3 no.therapy:9 Min. :0.1000
## anxifree:3 CBT :0 1st Qu.:0.3000
## joyzepam:3 Median :0.5000
## Mean :0.7222
## 3rd Qu.:1.3000
## Max. :1.7000
## --------------------------------------------------------
## clin.trial$therapy: CBT
## drug therapy mood.gain
## placebo :3 no.therapy:0 Min. :0.300
## anxifree:3 CBT :9 1st Qu.:0.800
## joyzepam:3 Median :1.100
## Mean :1.044
## 3rd Qu.:1.300
## Max. :1.800
```

Again, this output is pretty easy to interpret. It’s the output of the `summary()` function, applied separately to the `CBT` group and the `no.therapy` group. For the two factors (`drug` and `therapy`) it prints out a frequency table, whereas for the numeric variable (`mood.gain`) it prints out the minimum and maximum, the first and third quartiles, the mean and the median.
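Because `FUN` can be any function that accepts a data frame, you are not limited to built-in summaries. The sketch below passes a small user-written function to `by()`. It assumes a stand-in for `clin.trial` with values chosen to be consistent with the output in this section, and `mean.and.sd` is a hypothetical helper name, not something supplied by R or by any package.

```r
# Stand-in for the clin.trial data frame (an assumption: the values are
# chosen to be consistent with the output shown in this section).
clin.trial <- data.frame(
  drug = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2),
                levels = c("placebo", "anxifree", "joyzepam")),
  therapy = factor(rep(c("no.therapy", "CBT"), each = 9),
                   levels = c("no.therapy", "CBT")),
  mood.gain = c(0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 1.4, 1.7, 1.3,
                0.6, 0.9, 0.3, 1.1, 0.8, 1.2, 1.8, 1.3, 1.4)
)

# Hypothetical helper: by() hands it the rows belonging to one group,
# as a data frame, and it returns the mean and sd of mood.gain.
mean.and.sd <- function(d) {
  c(mean = mean(d$mood.gain), sd = sd(d$mood.gain))
}

out <- by(data = clin.trial, INDICES = clin.trial$therapy, FUN = mean.and.sd)
print(out)
```

Writing the helper yourself lets you skip the factor columns entirely, so none of the output consists of statistics that are not meaningful.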

What if you have multiple grouping variables? Suppose, for example, you would like to look at the average mood gain separately for all possible combinations of drug and therapy. It is actually possible to do this using the `by()` and `describeBy()` functions, but I usually find it more convenient to use the `aggregate()` function in this situation. There are again three arguments that you need to specify. The `formula` argument is used to indicate which variable you want to analyse, and which variables are used to specify the groups. For instance, if you want to look at `mood.gain` separately for each possible combination of `drug` and `therapy`, the formula you want is `mood.gain ~ drug + therapy`. The `data` argument is used to specify the data frame containing all the data, and the `FUN` argument is used to indicate what function you want to calculate for each group (e.g., the `mean`). So, to obtain group means, use this command:

    aggregate( formula = mood.gain ~ drug + therapy,   # mood.gain by drug/therapy combination
               data = clin.trial,                      # data is in the clin.trial data frame
               FUN = mean                              # print out group means
    )

```
## drug therapy mood.gain
## 1 placebo no.therapy 0.300000
## 2 anxifree no.therapy 0.400000
## 3 joyzepam no.therapy 1.466667
## 4 placebo CBT 0.600000
## 5 anxifree CBT 1.033333
## 6 joyzepam CBT 1.500000
```

or, alternatively, if you want to calculate the standard deviations for each group, you would use the following command (argument names omitted this time):

    aggregate( mood.gain ~ drug + therapy, clin.trial, sd )

```
## drug therapy mood.gain
## 1 placebo no.therapy 0.2000000
## 2 anxifree no.therapy 0.2000000
## 3 joyzepam no.therapy 0.2081666
## 4 placebo CBT 0.3000000
## 5 anxifree CBT 0.2081666
## 6 joyzepam CBT 0.2645751
```
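If you want both statistics in a single table, one option is to give `aggregate()` a function that returns several values at once. The following is a sketch under stated assumptions rather than the only way to do it: the data frame is a stand-in for `clin.trial` with values chosen to be consistent with the output above, and `agg` and `flat` are illustrative names. The formula is passed by position here because, as far as I know, the name of that first argument changed from `formula` to `x` in R 4.2.0, and a positional argument sidesteps the question.

```r
# Stand-in for the clin.trial data frame (an assumption: the values are
# chosen to be consistent with the output shown in this section).
clin.trial <- data.frame(
  drug = factor(rep(rep(c("placebo", "anxifree", "joyzepam"), each = 3), times = 2),
                levels = c("placebo", "anxifree", "joyzepam")),
  therapy = factor(rep(c("no.therapy", "CBT"), each = 9),
                   levels = c("no.therapy", "CBT")),
  mood.gain = c(0.5, 0.3, 0.1, 0.6, 0.4, 0.2, 1.4, 1.7, 1.3,
                0.6, 0.9, 0.3, 1.1, 0.8, 1.2, 1.8, 1.3, 1.4)
)

# FUN returns a named vector, so the mood.gain column of the result is
# a two-column matrix (mean, sd) rather than a plain numeric vector.
agg <- aggregate(mood.gain ~ drug + therapy, clin.trial,
                 function(v) c(mean = mean(v), sd = sd(v)))

# Flatten the matrix column into ordinary data frame columns,
# named mood.gain.mean and mood.gain.sd.
flat <- do.call(data.frame, agg)
print(flat)
```

Each row of `flat` corresponds to one drug and therapy combination, in the same order as the two tables above.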