# 4.4: Dummy Variable Regression

##### Model 3 - Dummy Variable Regression

$Y_{ij} = \mu + \mu_{i} + \epsilon_{ij}, \text{ fitted as } Y_{ij} = \beta_{0} + \beta_{Level \ 1} + \beta_{Level \ 2} + \ldots + \beta_{Level \ T-1} + \epsilon_{ij}$

where $$\beta_{Level \ 1}, \beta_{Level \ 2}, \ldots, \beta_{Level \ T-1}$$ are regression coefficients for the $$T-1$$ indicator-coded "dummy" variables that correspond to the first $$T-1$$ levels of the categorical factor. The mean of the $$T^{th}$$ factor level is given by the regression intercept $$\beta_{0}$$.

The General Linear Model (GLM) applied to data with categorical predictors can be viewed from a regression modeling perspective as an ordinary multiple linear regression (MLR) with "dummy" coding, also known as indicator coding, for the categorical treatment levels. Typically, software performing the MLR will automatically include an intercept, which corresponds to the first column of the design matrix and is a column of $$1$$'s. This automatic inclusion of the intercept can lead to complications when interpreting the regression coefficients.
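
To make the coding concrete, here is a sketch of the fitted equation for $$T = 3$$ levels, with the last level carried by the intercept. The indicator symbols $$X_{1}$$ and $$X_{2}$$ are notation assumed here for illustration; they are not part of the model statement above.

$Y_{ij} = \beta_{0} + \beta_{Level \ 1} X_{1} + \beta_{Level \ 2} X_{2} + \epsilon_{ij}, \quad X_{k} = \begin{cases} 1 & \text{if the observation is in level } k \\ 0 & \text{otherwise} \end{cases}$

Setting the indicators to the values for each level gives the level means:

$\begin{aligned} \text{Level 1: } E(Y) &= \beta_{0} + \beta_{Level \ 1} \\ \text{Level 2: } E(Y) &= \beta_{0} + \beta_{Level \ 2} \\ \text{Level 3: } E(Y) &= \beta_{0} \end{aligned}$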

The SAS Mixed procedure, and also the GLM procedure which we may encounter later, use the "Dummy Variable Regression" model. For the $$Y$$ data used in sections 4.2 and 4.3, the design matrix for this model can be entered into IML as:

/* Dummy Variable Regression Model */
x = {
1    1    0,
1    1    0,
1    0    1,
1    0    1,
1    0    0,
1    0    0};


Notice that the design matrix above has only two indicator columns even though the study has three treatment levels. If we added a third indicator column for the third treatment level, as in the matrix below, the resulting four columns would be linearly dependent, a condition that prevents the computations from proceeding.

$\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}$

In the matrix containing all four columns, columns 2-4 sum to the first column, which represents the intercept. This linear dependence makes $$X'X$$ singular, so it cannot be inverted and the matrix computations will not run. For this reason, one of the treatment levels is omitted from the coding in the IML design matrix, and the omitted level is called the ‘reference’ level. In SAS, the treatment level with the highest label is typically defined as the reference level, so in this study it is treatment level 3.
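
The linear dependence can be checked numerically. The following is a minimal sketch in R; the object names (`x_full`, `x_ref`, `rank_full`, `inv_attempt`) are chosen here for illustration and do not come from the SAS code.

```r
# Full indicator coding: intercept plus one indicator per treatment level
x_full <- matrix(c(1, 1, 0, 0,
                   1, 1, 0, 0,
                   1, 0, 1, 0,
                   1, 0, 1, 0,
                   1, 0, 0, 1,
                   1, 0, 0, 1), ncol = 4, byrow = TRUE)

# Columns 2-4 add up to the intercept column in every row
all(rowSums(x_full[, 2:4]) == x_full[, 1])   # TRUE

# Only 3 of the 4 columns are linearly independent
rank_full <- qr(x_full)$rank
rank_full                                    # 3

# Inverting X'X is expected to fail; try() captures the error instead of stopping
inv_attempt <- try(solve(t(x_full) %*% x_full), silent = TRUE)
inherits(inv_attempt, "try-error")

# Dropping the level-3 indicator leaves a full-column-rank design
x_ref <- x_full[, 1:3]
qr(x_ref)$rank == ncol(x_ref)                # TRUE
```

With the reference level removed, the rank equals the number of columns, so $$X'X$$ can be inverted.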

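Note that the parameter vector for the dummy variable regression model is $\boldsymbol{\beta} = \begin{bmatrix} \beta_{0} \\ \beta_{Level \ 1} \\ \beta_{Level \ 2} \end{bmatrix}$. Writing $$\mu_{i}$$ for the mean of treatment level $$i$$, the intercept is the reference level mean, $$\beta_{0} = \mu_{3}$$, and the other two coefficients are differences from it: $$\beta_{Level \ 1} = \mu_{1} - \mu_{3}$$ and $$\beta_{Level \ 2} = \mu_{2} - \mu_{3}$$.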

Running IML with the design matrix for the dummy variable regression model, we get the following output:

| Regression Coefficients | |
|---|---|
| Beta_0 | 5.5 |
| Beta_1 | -4 |
| Beta_2 | -2 |

The coefficient $$\beta_{0}$$ is the mean for treatment level 3. The mean for treatment level 1 is then calculated from $$\hat{\beta}_{0} + \hat{\beta}_{1} = 1.5$$. Likewise, the mean for treatment level 2 is calculated as $$\hat{\beta}_{0} + \hat{\beta}_{2} = 3.5$$.
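
As a quick check of this interpretation, the sketch below recovers the coefficients directly from the treatment level means. It uses the response values from the R example in this section; the factor `trt` (two observations per level, in order) is reconstructed here from the rows of the design matrix, and the names `level_means`, `b0`, `b1`, `b2` are illustrative.

```r
y   <- c(2, 1, 3, 4, 6, 5)           # responses, ordered by treatment level
trt <- factor(rep(1:3, each = 2))    # levels 1, 1, 2, 2, 3, 3

# Sample mean of y within each treatment level
level_means <- tapply(y, trt, mean)
level_means                          # 1.5 3.5 5.5

# Intercept = reference-level mean; slopes = differences from the reference
b0 <- unname(level_means["3"])
b1 <- unname(level_means["1"] - level_means["3"])
b2 <- unname(level_means["2"] - level_means["3"])
c(b0 = b0, b1 = b1, b2 = b2)         # 5.5 -4.0 -2.0
```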

Notice that the $$F$$ statistic calculated from this model is the same as that produced from the Cell Means model.

ANOVA

|  | df | SS | MS | F |
|---|---|---|---|---|
| Treatment | 2 | 16 | 8 | 16 |
| Error | 3 | 1.5 | 0.5 | |
| Total | 5 | 17.5 | | |
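
One way to see why the two models agree is to compare the quantity $$\hat{\boldsymbol{\beta}}'X'y$$ under each design matrix. The sketch below assumes a standard cell means design (one indicator per level, no intercept column), written out here as `x_cell` because that matrix is not reproduced in this section; the helper `model_ss` is also an illustrative name.

```r
y <- matrix(c(2, 1, 3, 4, 6, 5), ncol = 1)

# Dummy variable regression design (level 3 is the reference)
x_dummy <- matrix(c(1,1,0, 1,1,0, 1,0,1, 1,0,1, 1,0,0, 1,0,0),
                  ncol = 3, byrow = TRUE)
# Assumed cell means design: one indicator per level, no intercept
x_cell  <- matrix(c(1,0,0, 1,0,0, 0,1,0, 0,1,0, 0,0,1, 0,0,1),
                  ncol = 3, byrow = TRUE)

# Uncorrected model sum of squares b'X'y for a given design matrix
model_ss <- function(x, y) {
  b <- solve(t(x) %*% x, t(x) %*% y)
  drop(t(b) %*% t(x) %*% y)
}

n  <- nrow(y)
cf <- sum(y)^2 / n                              # correction factor, 73.5

ss_trt   <- model_ss(x_dummy, y) - cf           # 16
ss_error <- sum(y^2) - model_ss(x_dummy, y)     # 1.5
f_stat   <- (ss_trt / 2) / (ss_error / 3)       # df: 2 treatment, 3 error
p_value  <- pf(f_stat, 2, 3, lower.tail = FALSE)

# Both codings span the same column space, so b'X'y is identical
all.equal(model_ss(x_dummy, y), model_ss(x_cell, y))   # TRUE
```

Because both design matrices produce the same fitted values, every sum of squares in the table, and therefore $$F$$, is unchanged by the choice of coding.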

## Using Technology

##### Minitab Example

We can confirm our ANOVA table now by running the analysis in software such as Minitab.

Steps in Minitab

First input the data:

Minitab offers several coding options, and the choice of coding determines the design matrix. To set the coding, choose:

Stat > ANOVA > General Linear Model > Fit General Linear Model and place the variables in the appropriate boxes:

Then select Coding… and choose the (1,0) coding as shown below:

Select OK to exit the nested windows. This produces the regular ANOVA output:

###### Analysis of Variance

And also the Regression Equation:

###### Regression Equation

y = 5.500 - 4.000 trt_level1 - 2.000 trt_level2 + 0.0 trt_level3

##### SAS Example
Steps in SAS

In SAS, the default coding is indicator coding, so when you specify the option

model y=trt / solution;


you get the regression coefficients:

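Solution for Fixed Effects

| Effect | trt | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|---|---|---|---|---|---|---|
| Intercept | | 5.5000 | 0.5000 | 3 | 11.00 | 0.0016 |
| trt | level1 | -4.0000 | 0.7071 | 3 | -5.66 | 0.0109 |
| trt | level2 | -2.0000 | 0.7071 | 3 | -2.83 | 0.0663 |
| trt | level3 | 0 | | | | |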

And the same ANOVA table:

Type 3 Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | Expected Mean Square | Error Term | Error DF | F Value | Pr > F |
|---|---|---|---|---|---|---|---|---|
| trt | 2 | 16.000000 | 8.000000 | Var(Residual)+Q(trt) | MS(Residual) | 3 | 16.00 | 0.0251 |
| Residual | 3 | 1.500000 | 0.500000 | Var(Residual) | | | | |

The intermediate calculations for this model are:

xprimex
6 2 2
2 2 0
2 0 2
check
1 -2.22E-16 0
3.331E-16 1 0
0 0 1
xprimey
21
3
7
SumY2
89.5
CF
73.5
xprimexinv
0.5 -0.5 -0.5
-0.5 1 0.5
-0.5 0.5 1
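
As a worked check of these intermediates (a sketch using only the values printed above), the coefficient estimates follow from $$\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'y$$, that is, `xprimexinv` times `xprimey`:

$\hat{\boldsymbol{\beta}} = \begin{bmatrix} 0.5 & -0.5 & -0.5 \\ -0.5 & 1 & 0.5 \\ -0.5 & 0.5 & 1 \end{bmatrix} \begin{bmatrix} 21 \\ 3 \\ 7 \end{bmatrix} = \begin{bmatrix} 10.5 - 1.5 - 3.5 \\ -10.5 + 3 + 3.5 \\ -10.5 + 1.5 + 7 \end{bmatrix} = \begin{bmatrix} 5.5 \\ -4 \\ -2 \end{bmatrix}$

The treatment sum of squares is then `SumY2` minus `CF`:

$SS_{trt} = 89.5 - 73.5 = 16$
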
##### R Example
Steps in R

1. Define response variable and design matrix

y<-matrix(c(2,1,3,4,6,5), ncol=1)
x = matrix(c(1,1,0,1,1,0,1,0,1,1,0,1,1,0,0,1,0,0),ncol=3,nrow=6,byrow=TRUE)


2. Regression coefficients

beta<-solve(t(x)%*%x)%*%(t(x)%*%y)
#  beta
#      [,1]
# [1,]  5.5
# [2,] -4.0
# [3,] -2.0


3. Calculate the entries of the ANOVA Table

n<-nrow(y)
p<-ncol(x)
J<-matrix(1,n,n)
ss_tot = (t(y)%*%y) - (1/n)*(t(y)%*%J)%*%y #17.5
ss_trt = t(beta)%*%(t(x)%*%y) - (1/n)*(t(y)%*%J)%*%y #16
ss_error = ss_tot - ss_trt #1.5
total_df=n-1 #5
trt_df=p-1 #2
error_df=n-p #3
MS_trt = ss_trt/(p-1) #8
MS_error = ss_error / error_df #0.5
F=MS_trt/MS_error #16


4. Creating the ANOVA table

ANOVA <- data.frame(
c ("","Treatment","Error", "Total"),
c("DF", trt_df,error_df,total_df),
c("SS", ss_trt, ss_error, ss_tot),
c("MS", MS_trt, MS_error, ""),
c("F",F,"",""),
stringsAsFactors = FALSE)
names(ANOVA) <- c(" ", "  ", " ","","")


5. Print the ANOVA table

print(ANOVA)
# 1           DF   SS  MS  F
# 2 Treatment  2   16   8 16
# 3     Error  3  1.5 0.5
# 4     Total  5 17.5


6. Intermediates in the matrix computations

xprimex<-t(x)%*%x
#  xprimex
#      [,1] [,2] [,3]
# [1,]    6    2    2
# [2,]    2    2    0
# [3,]    2    0    2
xprimey<-t(x)%*%y
#   xprimey
#      [,1]
# [1,]   21
# [2,]    3
# [3,]    7
xprimexinv<-solve(t(x)%*%x)
#  xprimexinv
#      [,1] [,2] [,3]
# [1,]  0.5 -0.5 -0.5
# [2,] -0.5  1.0  0.5
# [3,] -0.5  0.5  1.0
check<-xprimexinv%*%xprimex
#  check
#               [,1]          [,2] [,3]
# [1,]  1.000000e+00  0.000000e+00    0
# [2,] -1.110223e-16  1.000000e+00    0
# [3,]  0.000000e+00 -1.110223e-16    1
SumY2<-t(beta)%*%(t(x)%*%y) # 89.5
CF<-(1/n)*(t(y)%*%J)%*%y # 73.5


7. Regression Equation and ANOVA table

trt_level1<-x[,2]
trt_level2<-x[,3]
model<-lm(y~trt_level1+trt_level2)


8. With the command summary(model) we can get the following output:

Call:
lm(formula = y ~ trt_level1 + trt_level2)
Residuals:
1    2    3    4    5    6
0.5 -0.5 -0.5  0.5  0.5 -0.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.5000     0.5000  11.000  0.00161 **
trt_level1   -4.0000     0.7071  -5.657  0.01094 *
trt_level2   -2.0000     0.7071  -2.828  0.06628 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7071 on 3 degrees of freedom
Multiple R-squared:  0.9143,    Adjusted R-squared:  0.8571
F-statistic:    16 on 2 and 3 DF,  p-value: 0.02509


From the output, we can see the estimates for the coefficients are b0=5.5, b1=-4, b2=-2 and the F-statistic is 16 with a p-value of 0.02509.

By using the estimates we can write the regression equation:

y=5.5-4 trt_level1-2 trt_level2+0 trt_level3
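
Instead of typing the indicator columns by hand, the same fit can be obtained from a factor. The sketch below is one possible approach, not part of the original steps: it assumes the factor `trt` reconstructed from the design matrix rows and uses R's `contr.SAS` contrasts, which take the last level as the reference (R's default treatment contrasts use the first level instead).

```r
y   <- c(2, 1, 3, 4, 6, 5)
trt <- factor(rep(1:3, each = 2))    # levels 1, 1, 2, 2, 3, 3

# Design matrix generated from the factor, with level 3 as the reference
x_auto <- model.matrix(~ trt, contrasts.arg = list(trt = "contr.SAS"))
unname(x_auto[, 1:3])                # same rows as the hand-entered matrix x

# Fit with the same coding; the coefficients match the hand-coded model
model_sas <- lm(y ~ trt, contrasts = list(trt = "contr.SAS"))
coef(model_sas)                      # 5.5, -4, -2
```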

9. With the command anova(model) we can get the following output

Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value  Pr(>F)
trt_level1  1   12.0    12.0      24 0.01628 *
trt_level2  1    4.0     4.0       8 0.06628 .
Residuals   3    1.5     0.5
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


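Note: R reports sequential (Type I) sums of squares in this ANOVA table, so the treatment effect is split into two 1-df lines, one per indicator variable. Together they add up to the treatment sum of squares from the earlier table: 12 + 4 = 16 on 2 df.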
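
To see the single 2-df treatment line in R, the indicators can be replaced by one factor. This is a minimal sketch under the same assumed factor layout as before; `seq_tab`, `ss_seq`, and `pooled_tab` are illustrative names.

```r
y   <- c(2, 1, 3, 4, 6, 5)
trt <- factor(rep(1:3, each = 2))

# Hand-coded indicators, as in step 7
trt_level1 <- as.numeric(trt == "1")
trt_level2 <- as.numeric(trt == "2")

# Sequential table: one 1-df line per indicator
seq_tab <- anova(lm(y ~ trt_level1 + trt_level2))
ss_seq  <- seq_tab[c("trt_level1", "trt_level2"), "Sum Sq"]
sum(ss_seq)                          # 16

# Factor version: one pooled treatment line on 2 df
pooled_tab <- anova(lm(y ~ trt))
pooled_tab["trt", c("Df", "Sum Sq", "F value")]
```

The pooled line reproduces the treatment row of the ANOVA table computed by matrix algebra in step 3.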

This page titled 4.4: Dummy Variable Regression is shared under a CC BY-NC 4.0 license and was authored, remixed, and/or curated by Penn State's Department of Statistics.