2.2: Pirate-plots

Last updated

Dec 16, 2022

An alternative graphical display for comparing multiple groups that we will use is the pirate-plot (Phillips 2017) from the yarrr package²⁵. Figure 2.4 shows an example of a pirate-plot, a side-by-side display that contains the density curves, the original observations that generated the density curves as jittered points (jittered both vertically and horizontally a little), the sample mean of each group (wide bar), and vertical lines to horizontal bars that represent the confidence interval for the true mean of that group. For each group, the density curves are mirrored to aid in visual assessment of the shape of the distribution. With skewed distributions, this mirroring also creates a shape that resembles the outline of a violin, so versions of this display have also been called a “violin plot” or a “bean plot” (I call these “enhanced violin plots” when I use them in journal articles instead of “pirate plots”). Altogether, this plot shows us information on the original observations, the center (mean) and its confidence interval, the spread, and the shape of the distributions of the responses. Our inferences typically focus on the means of the groups, and this plot allows us to compare those across the groups while gaining information on the shapes of the distributions of responses in each group.

To use the pirateplot function we need to install and then load the yarrr package. The function works like the boxplot function used previously except that the type of confidence interval needs to be specified with inf.method = "ci"; otherwise you will get a different kind of interval than you learned in introductory statistics, and we don’t want to get caught up in trying to understand the kind of interval it makes by default. It also seems useful to add inf.disp = "line" as an additional option to add bars for the confidence interval²⁶. There are many other options in the function that might be useful in certain situations, but these are the only ones that are really needed to get started with pirate-plots. While we could build this plot using ggplot, the simplicity of this function keeps it a favorite way to display a quantitative variable across groups, even though we lose the grammar of graphics way of modifying the plot.

Figure 2.4: Pirate-plot of distances by outfit group. Bold horizontal lines correspond to sample mean of each group, boxes around lines (here they are very tight to the lines for the means) are the 95% confidence intervals.

library(yarrr)
pirateplot(Distance ~ Condition, data = dd, inf.method = "ci", inf.disp = "line")
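As a rough check on what the wide bar and its interval represent, the sketch below computes a conventional $t$-based 95% confidence interval from the summary statistics reported for the casual group later in this section (mean 117.6110, SD 29.86954, n = 779). It assumes that inf.method = "ci" corresponds to the usual mean ± t* × s/√n interval from introductory statistics; the variable names are only for illustration and this is not the yarrr code itself.

```r
# Summary statistics for the casual group (copied from the favstats output)
xbar <- 117.6110   # sample mean (cm)
s    <- 29.86954   # sample standard deviation (cm)
n    <- 779        # sample size

# Assumed form of the interval: mean +/- t* x SE, with n - 1 degrees of freedom
se    <- s / sqrt(n)
tstar <- qt(0.975, df = n - 1)
ci    <- c(lower = xbar - tstar * se, upper = xbar + tstar * se)
ci
```

The margin of error works out to roughly 2 cm, which is small relative to the spread of the observations and is consistent with the caption's remark that the intervals sit very tight to the lines for the means.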

Figure 2.4 suggests that the distributions are relatively symmetric, which implies that the means and medians are similar even though only the means are displayed in these plots. In this display, none of the observations are flagged as outliers (flagging is not a part of this display). It is up to the consumer of the graphic to decide if observations look to be outside of the overall pattern of the rest of the observations. By plotting the observations by groups, we can also explore the narrowest (and likely most scary) overtakes in the data set. The police and racer conditions seem to have all observations over 25 cm, and the closest passes were in the novice and polite outfits, including the two 2 cm passes. By displaying the original observations, we are able to explore and identify features that aggregation and summarization in plots can sometimes obfuscate. But the pirate-plots also allow you to compare the shape of the distributions (relatively symmetric and somewhat bell-shaped), variability (they look to have relatively similar variability), and the means of the groups. Our inferences are going to focus on the means, but those inferences are only valid if the distributions are either approximately normal or at least have similar shapes and spreads (more on this soon).

It appears that the mean for police is higher than the other groups but that the others are not too different. But is this difference real? We will never know the answer to that question, but we can assess how likely we are to have seen a result as extreme or more extreme than our result, assuming that there is no difference in the means of the groups. If the observed result is (extremely) unlikely to occur, then we have (extremely) strong evidence against the hypothesis that the groups have the same mean and can conclude that there is likely a real difference. If we discover that our result was not very unlikely, given the assumption of no difference in the means of the groups, then we can’t conclude that there is a difference but also can’t conclude that they are equal, just that we failed to find enough evidence against the equal means assumption to discard it as a possibility. Whether the result is unusual or not, we will want to carefully explore how big the estimated differences in the means are: is the difference in means large enough to matter to you? We would be more interested in the implications of the difference in the means when there is strong evidence against the null hypothesis that the means are equal, but the size of the estimated differences should always be of some interest.

To accompany the pirate-plot that displays estimated means, we need to have numerical values to compare. We can easily get means and standard deviations by group using the same formula notation as for the plots with the mean and sd functions, if the mosaic package is loaded.

library(mosaic)
mean(Distance ~ Condition, data = dd)

##   casual  commute    hiviz   novice   police   polite    racer 
## 117.6110 114.6079 118.4383 116.9405 122.1215 114.0518 116.7559

sd(Distance ~ Condition, data = dd)

##   casual  commute    hiviz   novice   police   polite    racer 
## 29.86954 29.63166 29.03384 29.03812 29.73662 31.23684 30.60059

We can also use the favstats function to get those summaries and others by groups.

favstats(Distance ~ Condition, data = dd)

##   Condition min    Q1 median  Q3 max     mean       sd   n missing
## 1    casual  17 100.0    117 134 245 117.6110 29.86954 779       0
## 2   commute   8  98.0    116 132 222 114.6079 29.63166 857       0
## 3     hiviz  12 101.0    117 134 237 118.4383 29.03384 737       0
## 4    novice   2 100.5    118 133 274 116.9405 29.03812 807       0
## 5    police  34 104.0    119 138 253 122.1215 29.73662 790       0
## 6    polite   2  95.0    114 133 225 114.0518 31.23684 868       0
## 7     racer  28  98.0    117 135 231 116.7559 30.60059 852       0

Based on these results, we can see that there is an estimated difference of over 8 cm between the smallest mean (polite at 114.05 cm) and the largest mean (police at 122.12 cm). The differences among some of the other groups are much smaller, such as between casual and commute with sample means of 117.611 and 114.608 cm, respectively. Because there are seven groups being compared in this study, we will have to wait until Chapter 3 and the One-Way ANOVA test to fully assess evidence related to some difference among the seven groups. For now, we are going to focus on comparing the mean Distance between the casual and commute groups, which is a two independent sample mean situation and something you should have seen before. Remember that the “independent” sample part of this refers to observations that are independently observed for the two groups, as opposed to the paired sample situation that you may have explored, where one observation from the first group is related to an observation in the second group. Examples of pairing include the same person with one measurement in each group (we generically call this “repeated measures”) or the famous “twin” studies with one twin assigned to each group. This study has some potential violations of the “independent” sample situation (for example, repeated measurements made during a single ride), but those do not clearly fit into the matched pairs situation, so we will note this potential issue and proceed with exploring the method that assumes that we have independent samples, even though this is not true here. In Chapter 9, methods for more complex study designs like this one will be discussed briefly, but mostly this is beyond the scope of this material.
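The comparisons of group means quoted here can be reproduced with a little arithmetic. The sketch below types in the sample means printed above by hand (rather than recomputing them from dd) and finds the groups with the smallest and largest means along with the two differences discussed; the vector name means is only for illustration.

```r
# Sample means by group, copied from the mean(Distance ~ Condition, data = dd) output
means <- c(casual = 117.6110, commute = 114.6079, hiviz = 118.4383,
           novice = 116.9405, police = 122.1215, polite = 114.0518,
           racer = 116.7559)

# Groups with the smallest and largest sample means
names(means)[which.min(means)]   # "polite"
names(means)[which.max(means)]   # "police"

# Largest estimated difference (cm) and the casual versus commute difference (cm)
max(means) - min(means)                      # about 8.07
unname(means["casual"] - means["commute"])   # about 3.00
```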

Here we are going to use the “simple” two independent group scenario to review some basic statistical concepts and connect two different frameworks for conducting statistical inference: randomization and parametric inference techniques. Parametric statistical methods involve making assumptions about the distribution of the responses and obtaining confidence intervals and/or p-values using a named distribution (like the $z$- or $t$-distributions). Typically these results are generated using formulas and looking up areas under curves or cutoffs using a table or a computer. Randomization-based statistical methods use a computer to shuffle, sample, or simulate observations in ways that allow you to obtain distributions of possible results to find areas and cutoffs without resorting to using tables and named distributions. Randomization methods are nonparametric methods that often make fewer assumptions (they are not free of assumptions!) and so can handle a larger set of problems more easily than parametric methods. When the assumptions involved in the parametric procedures are met by a data set, the randomization methods often provide very similar results to those provided by the parametric techniques. To be a more sophisticated statistical consumer, it is useful to have some knowledge of both of these techniques for performing statistical inference, and the fact that they can provide similar results might deepen your understanding of both approaches.
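To make the contrast between the two frameworks concrete, here is a minimal sketch using simulated data, not the overtake data. The group labels, sample sizes, means, standard deviation, seed, and number of shuffles are all made up for illustration. The parametric route uses the base R t.test function; the randomization route shuffles the group labels and recomputes the difference in means each time.

```r
set.seed(406)  # arbitrary seed so the simulated example is reproducible

# Simulated (hypothetical) responses for two groups; NOT the overtake data
g <- factor(rep(c("A", "B"), each = 50))
y <- rnorm(100, mean = ifelse(g == "A", 117, 114), sd = 30)

# Observed difference in sample means (B minus A)
obs <- mean(y[g == "B"]) - mean(y[g == "A"])

# Parametric route: two-sample t-test that relies on a t-distribution
p_param <- t.test(y ~ g)$p.value

# Randomization route: shuffle the group labels many times and recompute
# the difference in means to build a distribution under "no difference"
perm <- replicate(1000, {
  gs <- sample(g)
  mean(y[gs == "B"]) - mean(y[gs == "A"])
})
p_perm <- mean(abs(perm) >= abs(obs))

c(parametric = p_param, randomization = p_perm)
```

With roughly bell-shaped responses and similar spreads in the two groups, the two p-values would typically be close to one another, which is the similarity described in the paragraph above.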

To be able to work just with the observations from two of the conditions (casual and commute) we could remove all the other observations in a spreadsheet program and read that new data set back into R, but it is actually pretty easy to use R to do data management once the data set is loaded. It is also a better scientific process to do as much of your data management within R as possible so that your steps in managing the data are fully documented and reproducible. Highlighting and clicking in spreadsheet programs is a dangerous way to work and can make it impossible to recreate the steps that were taken from the initial data set to the version that was analyzed. In R, we could identify the rows that contain the observations we want to retain and just extract those rows, but this is hard with over five thousand observations. The filter function from the dplyr package (part of the tidyverse suite of packages) is the best way to focus on observations that meet a particular condition; we can “filter” the data set to retain just those rows. The filter function takes the data set via the pipe operator, and then we need to define the condition that rows must meet to be retained. Here we need to define the variable we want to work with, Condition, and then request rows that meet a condition (are %in%) and the aspects that meet that condition (here by concatenating the two levels of “casual” and “commute”), leading to code of:

dd %>% filter(Condition %in% c("casual", "commute"))

We want to save that new filtered data set into a new tibble for future work, so we can use the assignment operator (<-) to save the reduced data set into ddsub:

ddsub <- dd %>% filter(Condition %in% c("casual", "commute"))

There is also the select function, which we could use with an additional pipe operator to focus on certain columns in the data set, here to retain just the Condition and Distance variables using:

ddsub <- dd %>% 
  filter(Condition %in% c("casual","commute")) %>%
  select(Distance, Condition)
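To see what the %in% condition is doing row by row, the sketch below uses a small made-up data frame and base R indexing instead of dplyr. The object names toy, keep, and toysub and the Extra column are hypothetical and only serve to show the row and column selection.

```r
# Toy data frame with made-up values; the real dd has many more rows and columns
toy <- data.frame(
  Condition = factor(c("casual", "police", "commute", "racer", "casual")),
  Distance  = c(120, 131, 98, 110, 105),
  Extra     = 1:5   # hypothetical extra column to show column selection
)

# %in% returns one TRUE/FALSE per row: is this row's Condition in the set?
keep <- toy$Condition %in% c("casual", "commute")
keep   # TRUE FALSE TRUE FALSE TRUE

# Base R analogue of filter() followed by select(): matching rows, two columns
toysub <- toy[keep, c("Distance", "Condition")]
nrow(toysub)   # 3
```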

The select function shows up in multiple packages so you might need to use dplyr::select() which tells R to use the version of select that is in dplyr. When you are working to filter or subset your data set you should always check that the correct observations were dropped either using View(ddsub) or by doing a quick summary of the Condition variable in the new tibble.

summary(ddsub$Condition)

##  casual commute   hiviz  novice  police  polite   racer 
##     779     857       0       0       0       0       0

It turns out that R remembers the categories for observations that we removed even though there are 0 observations in them now, and that can cause us some problems. When we remove a group of observations, we sometimes need to clean up categorical variables to just reflect the categories that are present. The factor function creates categorical variables based on the levels of the variables that are observed and is useful to run here to clean up Condition to just reflect the categories that are now present.

ddsub <- ddsub %>% mutate(Condition = factor(Condition))
summary(ddsub$Condition)

##  casual commute 
##     779     857
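The same behavior can be seen on a tiny made-up factor using only base R. The sketch below (object names are hypothetical) shows the unused level surviving the subsetting step and then being removed by re-running factor; base R's droplevels function is shown as an alternative that gives the same levels here.

```r
# Toy factor with three levels (made-up values)
cond <- factor(c("casual", "commute", "police", "casual"))

# Subsetting keeps the unused "police" level around with a count of 0
sub <- cond[cond %in% c("casual", "commute")]
table(sub)
levels(sub)               # "casual" "commute" "police"

# Re-running factor() rebuilds the levels from the values actually present
levels(factor(sub))       # "casual" "commute"

# droplevels() is a base R function with the same effect here
levels(droplevels(sub))   # "casual" "commute"
```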

The two categories of interest now were selected because neither looks particularly “racey” or has high visibility, but they could present a common choice between getting fully “geared up” for the commute or just jumping on a bike to go to work. Now if we remake the boxplots and pirate-plots, they only contain results for the two groups of interest here, as seen in Figure 2.5. Note that these results were available in the previous versions of the plots, but now we will just focus on these two groups.

Figure 2.5: Boxplot and pirate-plot of the Distance responses on the reduced ddsub data set.

boxplot(Distance ~ Condition, data = ddsub) 
pirateplot(Distance ~ Condition, data = ddsub, inf.method = "ci", inf.disp = "line")

The two-sample mean techniques you learned in your previous course all start with comparing the means of the two groups. We can obtain the two means using the mean function or directly obtain the difference in the means using the diffmean function (both require the mosaic package). The diffmean function provides $\bar{x}_\text{commute} - \bar{x}_\text{casual}$ where $\bar{x}$ (read as “x-bar”) is the sample mean of observations in the subscripted group. Note that there are two directions in which you could compare the means, and this function chooses to take the mean from the second group name alphabetically and subtract the mean from the first alphabetical group name. It is always good to check the direction of this calculation, as having a difference of $-3.003$ cm versus $3.003$ cm could be important.

mean(Distance ~ Condition, data = ddsub)

##   casual  commute 
## 117.6110 114.6079

diffmean(Distance ~ Condition, data = ddsub)

##  diffmean 
## -3.003105
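The direction of the subtraction can be checked by hand on a tiny made-up data set. This base R sketch mimics the "second level minus first level" direction described above; it is not the mosaic implementation of diffmean, and the toy values and the Condition2 column are hypothetical.

```r
# Toy data (made-up values) with two groups
toy <- data.frame(
  Condition = factor(c("casual", "casual", "commute", "commute")),
  Distance  = c(120, 110, 100, 112)
)

# Group means in level order: casual (115) first, commute (106) second
m <- tapply(toy$Distance, toy$Condition, mean)

# Second level minus first level, the direction described above
m[[2]] - m[[1]]     # -9

# Reversing the level order flips the sign of the difference
toy$Condition2 <- factor(toy$Condition, levels = c("commute", "casual"))
m2 <- tapply(toy$Distance, toy$Condition2, mean)
m2[[2]] - m2[[1]]   # 9
```

Checking names(m) before subtracting is a quick way to confirm which group is being treated as first and which as second.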
