# Regression through the origin

### Regression through the origin

Sometimes the nature of the problem (e.g. (i) a physical law in which one variable is proportional to another, and the goal is to determine the constant of proportionality; (ii) X = sales, Y = profit from sales) or empirical considerations (in the full regression model the intercept \(\beta_0\) turns out to be insignificant) suggest fitting the model \(Y_i = \beta_1 X_i + \varepsilon_i\), where the errors \(\varepsilon_i\) are assumed to be uncorrelated with mean 0 and variance \(\sigma^2\). The least squares estimates are then:

$$\widehat \beta_1 = \widetilde b_1 = \frac{\sum_{i=1}^n X_iY_i}{\sum_{i=1}^n X_i^2}, \qquad \widetilde{SSE} = \sum_{i=1}^n (Y_i - \widetilde b_1 X_i)^2 = \sum_i Y_i^2 - \widetilde b_1^2 \sum_i X_i^2. $$
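As a brief sketch of where these formulas come from (a standard least squares argument; the symbol \(Q\) is introduced here only for illustration), minimizing \(Q(b) = \sum_i (Y_i - bX_i)^2\) over \(b\) gives the normal equation

$$ \frac{dQ}{db} = -2\sum_{i=1}^n X_i (Y_i - bX_i) = 0 \quad\Longrightarrow\quad \widetilde b_1 = \frac{\sum_{i=1}^n X_iY_i}{\sum_{i=1}^n X_i^2}. $$

Substituting \(\sum_i X_iY_i = \widetilde b_1 \sum_i X_i^2\) into the expanded sum of squares yields the shortcut form of \(\widetilde{SSE}\):

$$ \sum_i (Y_i - \widetilde b_1 X_i)^2 = \sum_i Y_i^2 - 2\widetilde b_1 \sum_i X_iY_i + \widetilde b_1^2 \sum_i X_i^2 = \sum_i Y_i^2 - \widetilde b_1^2 \sum_i X_i^2. $$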

Also, \(E(\widetilde b_1) = \beta_1\) and \(E(\widetilde{SSE}) = (n-1)\sigma^2\), so that \(\widetilde{MSE} = \frac{1}{n-1}\widetilde{SSE}\) is an unbiased estimator of \(\sigma^2\) with d.f.\((\widetilde{MSE}) = n-1\) (only one parameter is estimated). The variance \(\mathrm{Var}(\widetilde b_1) = \frac{\sigma^2}{\sum_i X_i^2}\) is estimated by \(s^2(\widetilde b_1) = \frac{\widetilde{MSE}}{\sum_i X_i^2}\).

- 100(1 - \(\alpha\))% confidence interval for \(\beta_1\) : \(\widetilde b_1 \pm t(1-\alpha/2; n-1) s(\widetilde b_1)\).
- Estimate of mean response for \(X = X_h\) : \(\widetilde Y_h = \widetilde b_1 X_h\) with estimated standard error \(s(\widetilde Y_h) = \sqrt{\widetilde{MSE} \frac{X_h^2}{\sum_i X_i^2}}\).
- 100(1 - \(\alpha\))% confidence interval for mean response : \(\widetilde Y_h \pm t(1-\alpha/2; n-1) s(\widetilde Y_h)\).
- ANOVA decomposition : \(\widetilde{SSTO} = \widetilde{SSR} + \widetilde{SSE}\), where \(\widetilde{SSTO} = \sum_i Y_i^2\) with d.f.\((\widetilde{SSTO}) = n\), and \(\widetilde{SSR} = \widetilde b_1^2 \sum_i X_i^2\) with d.f.\((\widetilde{SSR}) = 1\), so that \(\widetilde{MSR} = \widetilde{SSR}/1\). Reject \(H_0 : \beta_1 = 0\) if the F-ratio \(F^* = \frac{\widetilde{MSR}}{\widetilde{MSE}} > F(1-\alpha;1,n-1)\).
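The quantities above can be computed directly from the three sums \(\sum_i X_i^2\), \(\sum_i X_iY_i\) and \(\sum_i Y_i^2\). The following is a minimal sketch in plain Python. The function name `fit_through_origin`, the dictionary keys, and the small data set are all made up for illustration, and the critical value \(t(1-\alpha/2; n-1)\) is assumed to be looked up from a table and passed in by the caller.

```python
import math

def fit_through_origin(x, y, t_crit):
    """Least squares fit of Y = beta1 * X + error (no intercept).

    t_crit is the quantile t(1 - alpha/2; n - 1), supplied by the caller.
    """
    n = len(x)
    sxx = sum(xi * xi for xi in x)               # sum of X_i^2
    sxy = sum(xi * yi for xi, yi in zip(x, y))   # sum of X_i * Y_i
    ssto = sum(yi * yi for yi in y)              # uncorrected SSTO, d.f. n

    b1 = sxy / sxx                               # slope estimate
    ssr = b1 ** 2 * sxx                          # d.f. 1
    sse = ssto - ssr                             # d.f. n - 1
    mse = sse / (n - 1)
    se_b1 = math.sqrt(mse / sxx)                 # s(b1)
    f_star = ssr / mse                           # MSR / MSE with MSR = SSR / 1
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return {"b1": b1, "sse": sse, "mse": mse,
            "se_b1": se_b1, "f_star": f_star, "ci": ci}

# Hypothetical data, roughly proportional (not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# t(0.975; 4) = 2.776 gives a 95% confidence interval when n = 5.
fit = fit_through_origin(x, y, t_crit=2.776)
print(fit)
```

The sketch takes the critical value as an argument to stay free of external libraries. The resulting `f_star` would be compared with \(F(1-\alpha; 1, n-1)\) from an F table.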

### Inverse prediction, or calibration problem

In some experimental studies it is important to know the value of X needed to obtain, on average, a pre-specified value of Y. The following example illustrates such a situation.

| X | 10 | 15 | 15 | 20 | 20 | 20 | 25 | 25 | 28 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| Y | 160 | 171 | 175 | 182 | 184 | 181 | 188 | 193 | 195 | 200 |

Here Y = tensile strength of paper, X = amount (percentage) of hardwood in the pulp.

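We want to find \(X_{h(new)}\) for a given value of \(Y_{h(new)}\).

Using the estimates \(b_0\) and \(b_1\) from the usual regression of Y on X (with intercept), the estimate is \(\widehat X_{h(new)} = \frac{Y_{h(new)} - b_0}{b_1}\). The estimated standard error of prediction is \(s(\widehat X_{h(new)})\), where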

$$ s^2(\widehat X_{h(new)}) = \frac{MSE}{b_1^2} \left[1+ \frac{1}{n} + \frac{(\widehat X_{h(new)} - \overline{X})^2}{\sum_i(X_i - \overline{X})^2}\right]. $$

Then an approximate 100(1 - \(\alpha\))% prediction interval for \(X_{h(new)}\) is given by \(\widehat X_{h(new)} \pm t(1-\alpha/2;n-2) s(\widehat X_{h(new)})\).

For the data above (n = 10), the fitted model is \(\widehat Y = 143.8244 + 1.8786X\), with SSE = 38.8328, SSTO = 1300.9, SSR = 1262.1, \(R^2\) = 0.9701, \(\sum_i (X_i - \overline{X})^2\) = 357.6, MSE = 4.8541, \(\overline{X}\) = 20.8, \(\overline{Y}\) = 182.9.
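To make the calibration steps concrete, here is a minimal Python sketch that refits the line to the paper data and then applies the inverse prediction formulas. The target value \(Y_{h(new)} = 185\) is hypothetical (the notes do not specify one), the function name `inverse_prediction` is illustrative, and the critical value \(t(0.975; 8) = 2.306\) is taken from a t table and passed in by hand.

```python
import math

# Paper data: X = percentage of hardwood in the pulp, Y = tensile strength.
x = [10, 15, 15, 20, 20, 20, 25, 25, 28, 30]
y = [160, 171, 175, 182, 184, 181, 188, 193, 195, 200]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Ordinary least squares fit with intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

def inverse_prediction(y_new, t_crit):
    """Return the point estimate of X and its prediction interval.

    t_crit is the quantile t(1 - alpha/2; n - 2), supplied by the caller.
    """
    x_hat = (y_new - b0) / b1
    var = mse / b1 ** 2 * (1 + 1 / n + (x_hat - x_bar) ** 2 / sxx)
    half_width = t_crit * math.sqrt(var)
    return x_hat, (x_hat - half_width, x_hat + half_width)

# Hypothetical target strength of 185; t(0.975; 8) = 2.306 for a 95% interval.
x_hat, interval = inverse_prediction(185.0, t_crit=2.306)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, MSE = {mse:.4f}")
print(f"X_hat = {x_hat:.3f}, interval = ({interval[0]:.3f}, {interval[1]:.3f})")
```

The first printed line should agree with the fitted model and MSE quoted above, up to rounding.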

### Contributors

- Yingwen Li
- Debashis Paul