6.4: Answers to exercises
- Page ID
- 3578
Correlation and linear models
Answer to the question of human traits. Inspect the data, load it and check the object:
Code \(\PageIndex{1}\) (Python):
traits <- read.table("data/traits.txt", h=TRUE, row.names=1) str(traits)
Data is binary, so Kendall’s correlation is most natural:
Code \(\PageIndex{2}\) (Python):
traits <- read.table("data/traits.txt", h=TRUE, row.names=1) Cor(traits, method="kendall", dec=1) # asmisc.r traits.c <- cor(traits, method="kendall") Topm(traits.c) # asmisc.r
We will visualize correlation with Pleiad(), one of advantages of it is to show which correlations are connected, grouped—so-called “correlation pleiads”:
Code \(\PageIndex{3}\) (Python):
traits <- read.table("data/traits.txt", h=TRUE, row.names=1) Cor(traits, method="kendall", dec=1) # asmisc.r traits.c <- cor(traits, method="kendall") Pleiad(traits.c, corr=TRUE, lcol=1, legend=FALSE, off=1.12, pch=21, bg="white", cex=1.1) # asmisc.r
(Look on the title page to see correlations. One pleiad, CHIN, TONGUE and THUMB is the most apparent.)
Answer to the question of the linear dependence between height and weight for the artificial data. Correlation is present but the dependence is weak (Figure \(\PageIndex{1}\)):
Code \(\PageIndex{4}\) (Python):
hwc <- read.table("data/hwc.txt", h=TRUE) cor.test(hwc$WEIGHT, hwc$HEIGHT) w.h <- lm(WEIGHT ~ HEIGHT, data=hwc) summary(w.h) plot(WEIGHT ~ HEIGHT, data=hwc, xlab="Height, cm", ylab="Weight, kg") abline(w.h) Cladd(w.h, data=hwc) # asmisc.r
The conclusion about weak dependence was made because of low R-squared which means that predictor variable, height, does not explain much of the dependent variable, weight. In addition, many residuals are located outside of IQR. This is also easy to see on the plot where many data points are distant from the regression line and even from 95% confidence bands.
Answer to spring draba question. Check file, load and check the object:
Code \(\PageIndex{5}\) (Python):
ee <- read.table("http://ashipunov.info/shipunov/open/erophila.txt", h=TRUE) str(ee) # asmisc.r
Now, check normality and correlations with the appropriate method:
Code \(\PageIndex{6}\) (Python):
ee <- read.table("http://ashipunov.info/shipunov/open/erophila.txt", h=TRUE) Normality <- function(a) {ifelse(shapiro.test(a)$p.value < 0.05, "NOT NORMAL", "NORMAL")} sapply(ee[, 2:4], Normality) # asmisc.r Topm(cor(ee[, 2:4], method="spearman")) # asmisc.r with(ee, cor.test(FRUIT.L, FRUIT.MAXW, method="spearman"))
Therefore, FRUIT.L and FRUIT.MAXW are best candidates for linear model analysis. We will plot it first (Figure \(\PageIndex{2}\)):
Code \(\PageIndex{7}\) (Python):
ee <- read.table("http://ashipunov.info/shipunov/open/erophila.txt", h=TRUE) ee.lm <- lm(FRUIT.MAXW ~ FRUIT.L, data=ee) plot(FRUIT.MAXW ~ FRUIT.L, data=ee, type="n") Points(ee$FRUIT.L, ee$FRUIT.MAXW, scale=.5) # asmisc.r Cladd(ee.lm, ee, ab.lty=1) # asmisc.r
(Points() is a “single” variant of PPoints() from the above, and was used because there are multiple overlaid data points.)
Finally, check the linear model and assumptions:
Code \(\PageIndex{8}\) (Python):
ee <- read.table("http://ashipunov.info/shipunov/open/erophila.txt", h=TRUE) ee.lm <- lm(FRUIT.MAXW ~ FRUIT.L, data=ee) summary(ee.lm) plot(ee.lm, which=1)
There is a reliable model (p-value: < 2.2e-16) which has a high R-squared value (sqrt(0.4651) = 0.6819824). Slope coefficient is significant whereas intercept is not. Homogeneity of residuals is apparent, their normality is also out of question:
Code \(\PageIndex{9}\) (Python):
ee <- read.table("http://ashipunov.info/shipunov/open/erophila.txt", h=TRUE) ee.lm <- lm(FRUIT.MAXW ~ FRUIT.L, data=ee) Normality <- function(a) {ifelse(shapiro.test(a)$p.value < 0.05, "NOT NORMAL", "NORMAL")} Normality(ee.lm$residuals)
Answer to the heterostyly question. First, inspect the file, load the data and check it:
Code \(\PageIndex{10}\) (Python):
he <- read.table("http://ashipunov.info/shipunov/open/heterostyly.txt", h=TRUE) str(he)
This is how to visualize the phenomenon of heterostyly for all data:
Code \(\PageIndex{11}\) (Python):
he <- read.table("http://ashipunov.info/shipunov/open/heterostyly.txt", h=TRUE) boxplot((STYLE.L-STAMEN.L) ~ (STYLE.L-STAMEN.L)>0, names=c("short","long"), data=he)
(Please review this plot yourself.)
Now we need to visualize linear relationships of question. There are many overlaid data points so the best way is to employ the PPoints() function (Figure \(\PageIndex{3}\)):
Code \(\PageIndex{12}\) (Python):
he <- read.table("http://ashipunov.info/shipunov/open/heterostyly.txt", h=TRUE) plot(STYLE.L ~ STAMEN.L, data=he, type="n", xlab="Length of stamens, mm", ylab="Length of style, mm") PPoints(he$SPECIES, he$STAMEN.L, he$STYLE.L, scale=.9, cols=1) abline(lm(STYLE.L ~ STAMEN.L, data=he[he$SPECIES=="veris", ])) abline(lm(STYLE.L ~ STAMEN.L, data=he[he$SPECIES=="vulgaris", ]), lty=2) legend("topright", pch=1:2, lty=1:2, legend=c("Primula veris", "P. vulgaris"), text.font=3)
Now to the models. We will assume that length of stamens is the independent variable. Explore, check assumptions and AIC for the full model:
Code \(\PageIndex{13}\) (Python):
he <- read.table("http://ashipunov.info/shipunov/open/heterostyly.txt", h=TRUE) summary(he.lm1 <- lm(STYLE.L ~ STAMEN.L * SPECIES, data=he)) plot(he.lm1, which=1:2) AIC(he.lm1)
Reduced (additive) model:
Code \(\PageIndex{14}\) (Python):
he <- read.table("http://ashipunov.info/shipunov/open/heterostyly.txt", h=TRUE) summary(he.lm2 <- update(he.lm1, . ~ . - STAMEN.L:SPECIES)) AIC(he.lm2)
Full model is better, most likely because of strong interactions. To check interactions graphically is possible also with the interaction plot which will treat independent variable as factor:
Code \(\PageIndex{15}\) (Python):
he <- read.table("http://ashipunov.info/shipunov/open/heterostyly.txt", h=TRUE) with(he, interaction.plot(cut(STAMEN.L, quantile(STAMEN.L)), SPECIES, STYLE.L))
This technical plot (check it yourself) shows the reliable differences between lines of different species. This differences are bigger when stamens are longer. This plot is more suitable for the complex ANOVA but as you see, works also for linear models.
Answer to the question about sundew (Drosera) populations. First, inspect the file, then load it and check the structure of object:
Code \(\PageIndex{16}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) str(dr) # asmisc.r
Since we a required to calculate correlation, check the normality first:
Code \(\PageIndex{17}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) Normality <- function(a) {ifelse(shapiro.test(a)$p.value < 0.05, "NOT NORMAL", "NORMAL")} sapply(dr[,-1], Normality)
Well, to this data we can apply only nonparametric methods:
Code \(\PageIndex{18}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) dr.cor <- cor(dr[,-1], method="spearman", use="pairwise") Topm(dr.cor) # asmisc.r Pleiad(dr.cor, corr=TRUE, legtext=2, legpos="bottom", leghoriz=TRUE, pch=19, cex=1.2) # asmisc.r
(Note that "pairwise" was employed, there are many NAs.)
The last plot (Figure \(\PageIndex{4}\)) shows two most important correlation pleiads: one related with leaf size, and another—with inflorescence.
Since we know now which characters are most correlated, proceed to linear model. Since in the development of sundews stalk formed first, let us accept STALK.L as independent variable (influence), and INFL.L as dependent variable (response):
Code \(\PageIndex{19}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) summary(dr.lm <- lm(INFL.L ~ STALK.L, data=dr)) plot(dr.lm, which=1:2)
Reliable model with high R-squared. However, normality of residuals is not perfect (please check model plots yourself).
Now to the analysis of leaf length. Determine which three populations are largest and subset the data:
Code \(\PageIndex{20}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) (largest3 <- rev(sort(table(dr[, 1])))[1:3]) dr3 <- dr[dr$POP %in% names(largest3), ] dr3$POP <- droplevels(dr3$POP)
Now we need to plot them and check if there are visual differences:
Code \(\PageIndex{21}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) largest3 <- rev(sort(table(dr[, 1])))[1:3] dr3 <- dr[dr$POP %in% names(largest3), ] boxplot(LEAF.L ~ POP, data=dr3)
Yes, they probably exist (please check the plot yourself.)
It is worth to look on similarity of ranges:
Code \(\PageIndex{22}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) largest3 <- rev(sort(table(dr[, 1])))[1:3] dr3 <- dr[dr$POP %in% names(largest3), ] dr3$POP <- droplevels(dr3$POP) tapply(dr3$LEAF.L, dr3$POP, mad, na.rm=TRUE) fligner.test(LEAF.L ~ POP, data=dr3)
The robust range statistic, MAD (median absolute deviation) shows that variations are similar. We also ran the nonparametric analog of Bartlett test to see the statistical significance of this similarity. Yes, variances are statistically similar.
Since we have three populations to analyze, we will need something ANOVA-like, but nonparametric:
Code \(\PageIndex{23}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) largest3 <- rev(sort(table(dr[, 1])))[1:3] dr3 <- dr[dr$POP %in% names(largest3), ] kruskal.test(LEAF.L ~ POP, data=dr3)
Yes, there is at least one population where leaf length is different from all others. To see which, we need a post hoc, pairwise test:
Code \(\PageIndex{24}\) (Python):
dr <- read.table( "http://ashipunov.info/shipunov/open/drosera.txt", h=TRUE) largest3 <- rev(sort(table(dr[, 1])))[1:3] dr3 <- dr[dr$POP %in% names(largest3), ] dr3$POP <- droplevels(dr3$POP) pairwise.wilcox.test(dr3$LEAF.L, dr3$POP)
Population N1 is most divergent whereas Q1 is not really different from L.
Logistic regression
Answer to the question about demonstration of objects. We will go the same way as in the example about programmers. After loading data, we attach it for simplicity:
Code \(\PageIndex{25}\) (Python):
seeing <- read.table("data/seeing.txt") attach(seeing)
Check the model:
Code \(\PageIndex{26}\) (Python):
seeing <- read.table("data/seeing.txt") seeing.logit <- glm(V3 ~ V2, family=binomial, data=seeing) summary(seeing.logit)
(Calling variables, we took into account the fact that R assign names like V1, V2, V3 etc. to “anonymous” columns.)
As one can see, the model is significant. It means that some learning takes place within the experiment.
It is possible to represent the logistic model graphically (Figure \(\PageIndex{5}\)):
Code \(\PageIndex{27}\) (Python):
seeing <- read.table("data/seeing.txt") seeing.logit <- glm(V3 ~ V2, family=binomial, data=seeing) tries <- seq(1, 5, length=50) # exactly 50 numbers from 1 to 5 seeing.p <- predict(seeing.logit, list(V2=tries), type="response") plot(V3 ~ jitter(V2, amount=.1), data=seeing, xlab="", ylab="") lines(tries, seeing.p)
We used predict() function to calculate probabilities of success for non-existed attempts, and also added small random noise with function jitter() to avoid the overlap.
Answer to the juniper questions. Check file, load it, check the object:
Code \(\PageIndex{28}\) (Python):
jj <- read.table( "http://ashipunov.info/shipunov/open/juniperus.txt", h=TRUE) jj$LOC <- factor(jj$LOC, labels=paste0("loc", levels(jj$LOC))) str(jj) # asmisc.r
Analyze morphological and ecological characters graphically (Figure \(\PageIndex{6}\)):
Code \(\PageIndex{29}\) (Python):
jj <- read.table( "http://ashipunov.info/shipunov/open/juniperus.txt", h=TRUE) j.cols <- colorRampPalette(c("steelblue", "white"))(5)[2:4] boxplot(jj[, 2:7], jj$LOC, legpos="top", boxcols=j.cols) # asmisc.r
Now plot length of needles against location (Figure \(\PageIndex{7}\)):
Code \(\PageIndex{30}\) (Python):
jj <- read.table( "http://ashipunov.info/shipunov/open/juniperus.txt", h=TRUE) j.cols <- colorRampPalette(c("steelblue", "white"))(5)[2:4] spineplot(LOC ~ NEEDLE.L, data=jj, col=j.cols)
(As you see, spine plot works with measurement data.)
Since there is a measurement character and several locations, the most appropriate is ANOVA-like approach. We need to check assumptions first:
Code \(\PageIndex{31}\) (Python):
Normality <- function(a) {ifelse(shapiro.test(a)$p.value < 0.05, "NOT NORMAL", "NORMAL")} jj <- read.table( "http://ashipunov.info/shipunov/open/juniperus.txt", h=TRUE) Normality(jj$NEEDLE.L) tapply(jj$NEEDLE.L, jj$LOC, var) bartlett.test(NEEDLE.L ~ LOC, data=jj)
Code \(\PageIndex{32}\) (Python):
jj <- read.table( "http://ashipunov.info/shipunov/open/juniperus.txt", h=TRUE) oneway.test(NEEDLE.L ~ LOC, data=jj) (eta.squared <-summary(lm(NEEDLE.L ~ LOC, data=jj))$adj.r.squared) pairwise.t.test(jj$NEEDLE.L, jj$LOC)
(Note how we calculated eta-squared, the effect size of ANOVA. As you see, this could be done through linear model.)
There is significant difference between the second and two other locations.
And to the second problem. First, we make new variable based on logical expression of character differences:
Code \(\PageIndex{33}\) (Python):
jj <- read.table( "http://ashipunov.info/shipunov/open/juniperus.txt", h=TRUE) is.sibirica <- with(jj, (NEEDLE.L < 8 & HEIGHT < 100)) sibirica <- factor(is.sibirica, labels=c("communis", "sibirica")) summary(sibirica)
There are both “species” in the data. Now, we plot conditional density and analyze logistic regression:
Code \(\PageIndex{34}\) (Python):
jj <- read.table( "http://ashipunov.info/shipunov/open/juniperus.txt", h=TRUE) is.sibirica <- with(jj, (NEEDLE.L < 8 & HEIGHT < 100)) sibirica <- factor(is.sibirica, labels=c("communis", "sibirica")) cdplot(sibirica ~ PINE.N, data=jj, col=j.cols[c(1, 3)]) summary(glm(sibirica ~ PINE.N, data=jj, family=binomial))
Conditional density plot (Figure \(\PageIndex{8}\)) shows an apparent tendency, and model summary outputs significance for slope coefficient.
On the next page, there is a table (Table \(\PageIndex{1}\)) with a key which could help to choose the right inferential method if you know number of samples and type of the data.
Type of data | One variable | Two variables | Many variables |
---|---|---|---|
Measurement, normally distributed | t-test |
Difference: t-test (paired and non-paired), F-test (scale) Effect size: Cohen's d, Lyubishchev's K Relation: correlation, linear models |
Linear models, ANOVA, one-way test, Bartlett test (scale) Post hoc: pairwise-test, Tukey HSD Effect size: R-squared |
Measurement and ranked | Wilcoxon test, Shapiro-Wilk test |
Difference: Wilcoxon test (paired and non-paired), sign test, robust rank irder test, Ansari-Bradley test (scale) Effect size: Cliff's delta, Lyubishchev's K Relation: nonparametric correlation |
Linear models, LOESS, Kruskal-Wallis test, Friedman test, Fligner-Killeen test (scale) Post hoc: pairwise Wilcoxon test, pairwise robust rank order test Effect size: R-squared |
Categorical | One sample test of proportions, goodness-of-fit test |
Association: Chi-squared test, Fisher's exact test, test of proportions, G-test, McNemar's test (paired) Effect size: Cramer's V, Tschuprow's T, odds ratio |
Association tests (see on the left); generalized linear models of binomial family (= logistic regression) Post hoc: pairwise table test |
Table \(\PageIndex{1}\) Key to the most important inferential statistical methods (except multivariate). After you narrow the search with couple of methods, proceed to the main text.