5.5: Descriptive Statistics Separately for each Group
- Page ID
- 3969
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)
( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\id}{\mathrm{id}}\)
\( \newcommand{\Span}{\mathrm{span}}\)
\( \newcommand{\kernel}{\mathrm{null}\,}\)
\( \newcommand{\range}{\mathrm{range}\,}\)
\( \newcommand{\RealPart}{\mathrm{Re}}\)
\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)
\( \newcommand{\Argument}{\mathrm{Arg}}\)
\( \newcommand{\norm}[1]{\| #1 \|}\)
\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)
\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)
\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)
\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)
\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vectorC}[1]{\textbf{#1}} \)
\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)
\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)
\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)
\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)
\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)
\(\newcommand{\avec}{\mathbf a}\) \(\newcommand{\bvec}{\mathbf b}\) \(\newcommand{\cvec}{\mathbf c}\) \(\newcommand{\dvec}{\mathbf d}\) \(\newcommand{\dtil}{\widetilde{\mathbf d}}\) \(\newcommand{\evec}{\mathbf e}\) \(\newcommand{\fvec}{\mathbf f}\) \(\newcommand{\nvec}{\mathbf n}\) \(\newcommand{\pvec}{\mathbf p}\) \(\newcommand{\qvec}{\mathbf q}\) \(\newcommand{\svec}{\mathbf s}\) \(\newcommand{\tvec}{\mathbf t}\) \(\newcommand{\uvec}{\mathbf u}\) \(\newcommand{\vvec}{\mathbf v}\) \(\newcommand{\wvec}{\mathbf w}\) \(\newcommand{\xvec}{\mathbf x}\) \(\newcommand{\yvec}{\mathbf y}\) \(\newcommand{\zvec}{\mathbf z}\) \(\newcommand{\rvec}{\mathbf r}\) \(\newcommand{\mvec}{\mathbf m}\) \(\newcommand{\zerovec}{\mathbf 0}\) \(\newcommand{\onevec}{\mathbf 1}\) \(\newcommand{\real}{\mathbb R}\) \(\newcommand{\twovec}[2]{\left[\begin{array}{r}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\ctwovec}[2]{\left[\begin{array}{c}#1 \\ #2 \end{array}\right]}\) \(\newcommand{\threevec}[3]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\cthreevec}[3]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \end{array}\right]}\) \(\newcommand{\fourvec}[4]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\cfourvec}[4]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \end{array}\right]}\) \(\newcommand{\fivevec}[5]{\left[\begin{array}{r}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\cfivevec}[5]{\left[\begin{array}{c}#1 \\ #2 \\ #3 \\ #4 \\ #5 \\ \end{array}\right]}\) \(\newcommand{\mattwo}[4]{\left[\begin{array}{rr}#1 \amp #2 \\ #3 \amp #4 \\ \end{array}\right]}\) \(\newcommand{\laspan}[1]{\text{Span}\{#1\}}\) \(\newcommand{\bcal}{\cal B}\) \(\newcommand{\ccal}{\cal C}\) \(\newcommand{\scal}{\cal S}\) \(\newcommand{\wcal}{\cal W}\) \(\newcommand{\ecal}{\cal E}\) \(\newcommand{\coords}[2]{\left\{#1\right\}_{#2}}\) \(\newcommand{\gray}[1]{\color{gray}{#1}}\) \(\newcommand{\lgray}[1]{\color{lightgray}{#1}}\) \(\newcommand{\rank}{\operatorname{rank}}\) \(\newcommand{\row}{\text{Row}}\) \(\newcommand{\col}{\text{Col}}\) \(\renewcommand{\row}{\text{Row}}\) \(\newcommand{\nul}{\text{Nul}}\) \(\newcommand{\var}{\text{Var}}\) \(\newcommand{\corr}{\text{corr}}\) \(\newcommand{\len}[1]{\left|#1\right|}\) \(\newcommand{\bbar}{\overline{\bvec}}\) \(\newcommand{\bhat}{\widehat{\bvec}}\) \(\newcommand{\bperp}{\bvec^\perp}\) \(\newcommand{\xhat}{\widehat{\xvec}}\) \(\newcommand{\vhat}{\widehat{\vvec}}\) \(\newcommand{\uhat}{\widehat{\uvec}}\) \(\newcommand{\what}{\widehat{\wvec}}\) \(\newcommand{\Sighat}{\widehat{\Sigma}}\) \(\newcommand{\lt}{<}\) \(\newcommand{\gt}{>}\) \(\newcommand{\amp}{&}\) \(\definecolor{fillinmathshade}{gray}{0.9}\)It is very commonly the case that you find yourself needing to look at descriptive statistics, broken down by some grouping variable. This is pretty easy to do in R, and there are three functions in particular that are worth knowing about: by()
, describeBy()
and aggregate()
. Let’s start with the describeBy()
function, which is part of the psych
package. The describeBy()
function is very similar to the describe()
function, except that it has an additional argument called group
which specifies a grouping variable. For instance, let’s say, I want to look at the descriptive statistics for the clin.trial
data, broken down separately by therapy
type. The command I would use here is:
describeBy( x=clin.trial, group=clin.trial$therapy )
##
## Descriptive statistics by group
## group: no.therapy
## vars n mean sd median trimmed mad min max range skew kurtosis
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00 -1.81
## therapy* 2 9 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## mood.gain 3 9 0.72 0.59 0.5 0.72 0.44 0.1 1.7 1.6 0.51 -1.59
## se
## drug* 0.29
## therapy* 0.00
## mood.gain 0.20
## --------------------------------------------------------
## group: CBT
## vars n mean sd median trimmed mad min max range skew
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00
## therapy* 2 9 2.00 0.00 2.0 2.00 0.00 2.0 2.0 0.0 NaN
## mood.gain 3 9 1.04 0.45 1.1 1.04 0.44 0.3 1.8 1.5 -0.03
## kurtosis se
## drug* -1.81 0.29
## therapy* NaN 0.00
## mood.gain -1.12 0.15
As you can see, the output is essentially identical to the output that the describe()
function produce, except that the output now gives you means, standard deviations etc separately for the CBT
group and the no.therapy
group. Notice that, as before, the output displays asterisks for factor variables, in order to draw your attention to the fact that the descriptive statistics that it has calculated won’t be very meaningful for those variables. Nevertheless, this command has given us some really useful descriptive statistics mood.gain
variable, broken down as a function of therapy
.
A somewhat more general solution is offered by the by()
function. There are three arguments that you need to specify when using this function: the data
argument specifies the data set, the INDICES
argument specifies the grouping variable, and the FUN
argument specifies the name of a function that you want to apply separately to each group. To give a sense of how powerful this is, you can reproduce the describeBy()
function by using a command like this:
by( data=clin.trial, INDICES=clin.trial$therapy, FUN=describe )
## clin.trial$therapy: no.therapy
## vars n mean sd median trimmed mad min max range skew kurtosis
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00 -1.81
## therapy* 2 9 1.00 0.00 1.0 1.00 0.00 1.0 1.0 0.0 NaN NaN
## mood.gain 3 9 0.72 0.59 0.5 0.72 0.44 0.1 1.7 1.6 0.51 -1.59
## se
## drug* 0.29
## therapy* 0.00
## mood.gain 0.20
## --------------------------------------------------------
## clin.trial$therapy: CBT
## vars n mean sd median trimmed mad min max range skew
## drug* 1 9 2.00 0.87 2.0 2.00 1.48 1.0 3.0 2.0 0.00
## therapy* 2 9 2.00 0.00 2.0 2.00 0.00 2.0 2.0 0.0 NaN
## mood.gain 3 9 1.04 0.45 1.1 1.04 0.44 0.3 1.8 1.5 -0.03
## kurtosis se
## drug* -1.81 0.29
## therapy* NaN 0.00
## mood.gain -1.12 0.15
This will produce the exact same output as the command shown earlier. However, there’s nothing special about the describe()
function. You could just as easily use the by()
function in conjunction with the summary()
function. For example:
by( data=clin.trial, INDICES=clin.trial$therapy, FUN=summary )
## clin.trial$therapy: no.therapy
## drug therapy mood.gain
## placebo :3 no.therapy:9 Min. :0.1000
## anxifree:3 CBT :0 1st Qu.:0.3000
## joyzepam:3 Median :0.5000
## Mean :0.7222
## 3rd Qu.:1.3000
## Max. :1.7000
## --------------------------------------------------------
## clin.trial$therapy: CBT
## drug therapy mood.gain
## placebo :3 no.therapy:0 Min. :0.300
## anxifree:3 CBT :9 1st Qu.:0.800
## joyzepam:3 Median :1.100
## Mean :1.044
## 3rd Qu.:1.300
## Max. :1.800
Again, this output is pretty easy to interpret. It’s the output of the summary()
function, applied separately to CBT
group and the no.therapy
group. For the two factors (drug
and therapy
) it prints out a frequency table, whereas for the numeric variable (mood.gain
) it prints out the range, interquartile range, mean and median.
What if you have multiple grouping variables? Suppose, for example, you would like to look at the average mood gain separately for all possible combinations of drug and therapy. It is actually possible to do this using the by()
and describeBy()
functions, but I usually find it more convenient to use the aggregate()
function in this situation. There are again three arguments that you need to specify. The formula
argument is used to indicate which variable you want to analyse, and which variables are used to specify the groups. For instance, if you want to look at mood.gain
separately for each possible combination of drug
and therapy
, the formula you want is mood.gain ~ drug + therapy
. The data
argument is used to specify the data frame containing all the data, and the FUN
argument is used to indicate what function you want to calculate for each group (e.g., the mean
). So, to obtain group means, use this command:
aggregate( formula = mood.gain ~ drug + therapy, # mood.gain by drug/therapy combination
data = clin.trial, # data is in the clin.trial data frame
FUN = mean # print out group means
)
## drug therapy mood.gain
## 1 placebo no.therapy 0.300000
## 2 anxifree no.therapy 0.400000
## 3 joyzepam no.therapy 1.466667
## 4 placebo CBT 0.600000
## 5 anxifree CBT 1.033333
## 6 joyzepam CBT 1.500000
or, alternatively, if you want to calculate the standard deviations for each group, you would use the following command (argument names omitted this time):
aggregate( mood.gain ~ drug + therapy, clin.trial, sd )
## drug therapy mood.gain
## 1 placebo no.therapy 0.2000000
## 2 anxifree no.therapy 0.2000000
## 3 joyzepam no.therapy 0.2081666
## 4 placebo CBT 0.3000000
## 5 anxifree CBT 0.2081666
## 6 joyzepam CBT 0.2645751