
8.1: The Issue of Boundedness


    The ordinary least squares (OLS) model requires that the range of the dependent variable \(Y\) is from \(-\infty\) to \(\infty\). Yet in the social and health sciences, analysts routinely encounter outcomes that are logically or practically constrained — from proportions and percentages (bounded at 0 and 1) to Likert rating scales (e.g., 1–7) or counts with a clear upper limit (e.g., number of correct items on a 10-question test) to counts with no upper limit (e.g., dents in a car or mistakes in a book). Applying standard OLS to such "bounded" data raises a significant concern: the model, being linear and unbounded, can produce predicted values that lie outside the possible range of the actual data, rendering them nonsensical and compromising inference. We saw this at the end of Section 7.3: Cows in Děčín.

    This section addresses this common dilemma: when is it permissible to rely on the simplicity and interpretability of OLS when \(Y\) is bounded, and what adaptations or diagnostics are required? We will focus on practical strategies for once-bounded (e.g., \(Y \ge 0\)) and twice-bounded (e.g., \(0 \le Y \le 1\)) outcomes. While specialized alternatives like Tobit, beta, or fractional regression exist, the techniques covered here provide a principled approach for determining when OLS remains a robust and informative tool, and how to implement it correctly when it does. The key lies not in abandoning OLS outright (at least not yet), but in understanding its limitations, checking for critical pathologies, and knowing when a transformation or adjusted interpretation can salvage its utility.

    Note that the techniques covered in this chapter are useful beyond ordinary least squares. Frequently, you will have data that you want to model using Technique X, but that technique requires an unbounded dependent variable. (OLS is not the only one!) You can use these techniques to adjust your data, while still using Technique X.

    ✦•················• ✦ •··················•✦

    We finished Chapter 7 with a model of vote proportions for ballot measures concerning keeping cows in the city. We applied that model to an upcoming vote in Děčín to predict the outcome. Finally, we used Monte Carlo methods to estimate the probability that the ballot measure would pass. In the end, we predicted that the ballot measure had a 20% chance of passing, with a point-prediction of 42% of the voters in favor of the bill.

    Results, however, suggest that there may be something gravely wrong with this model. To see this more clearly, let us predict the proportion of voters in support of a hypothetical 1994 ballot measure in Venkovský (religious percent = 85) that also banned chickens.

    From the results summarized in the table, the point-prediction for this 1994 Venkovský ballot measure is

    \begin{align*}
    \hat{p} &= 0.1512 - 0.0201 (\texttt{yearPassed}) - 0.0373 (\texttt{chickens}) + 0.0095 (\texttt{religPct}) \\[1em]
    &= 0.1512 - 0.0201 (-6) - 0.0373 (1) + 0.0095 (85) \\[1em]
    &= 1.0379
    \end{align*}

    Thus, this model predicts that the ballot measure will pass with over 103% of the vote — a physically impossible outcome. What went wrong? How can we fix this model so that this cannot happen?

    First, nothing "went wrong," per se. The model did exactly what it was supposed to do. The prediction, however, is based on assuming the effect (slope) is constant. If the slope is constant, one can find large enough (or small enough) values for the independent variables to make the prediction arbitrarily large or small. When we are predicting a bounded dependent variable, this will necessarily lead to an impossible prediction, such as a 103.79% support rate.

    Thus, the issue is either with the linear (constant slope) aspect of the prediction equation or with the bounded nature of the dependent variable (bounded below by 0 and above by 1).

    So, to improve the model, we can either model using non-linear coefficient functions or eliminate the boundedness. At this point, the easier of the two is to eliminate this boundedness; that is, we need to change the dependent variable so that all values make physical sense. This is done through the process of variable transformation. There are three steps:

    1. First, transform the dependent variable from a restricted range to an unrestricted range.
    2. Second, perform the analysis on this transformed variable.
    3. Finally, back-transform the estimated values (not estimated effects) into the original units.
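    These three steps can be sketched in R using a simulated proportion outcome y and a single predictor x (all names here are illustrative, and the logit and logistic functions are written out by hand rather than loaded from a package):

```r
logit    = function(p) log(p / (1 - p))      # maps (0,1) to the whole real line
logistic = function(z) 1 / (1 + exp(-z))     # its inverse: maps R back to (0,1)

set.seed(2718)
x = runif(50)
y = logistic(-1 + 2*x + rnorm(50, sd = 0.5)) # simulated proportions in (0,1)

yTilde = logit(y)                  # Step 1: transform to an unrestricted range
mod    = lm(yTilde ~ x)            # Step 2: analyze the transformed variable
preds  = logistic(predict(mod))    # Step 3: back-transform the estimated values
range(preds)                       # all predictions lie strictly in (0,1)
```

    Note that it is the estimated values, not the estimated effects (coefficients), that get back-transformed in the last step.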

    The key is the transformation. It must change the range of \(Y\) from its current limited version to an unlimited version, denoted \(\widetilde{Y}\). Luckily, there are two transformations that take care of most of our needs, in general: the logit (LOH-jit) and the logarithm transformations.

    Data Bounded by 0 and 1

    One type of data you may come across in your research is proportion data, data where the values are bounded below and above (by 0 and 1, respectively); that is, if \(Y\) is the dependent variable, then \(0 \le Y \le 1\). One appropriate function that transforms this bounded domain into an unbounded range is the logit function:

    \begin{equation}
    \tilde{y} = \mathrm{logit}(y) \stackrel{\text{def}}{=} \log \left( \frac{y}{1-y} \right)\label{eq:trans-logit}
    \end{equation}

    The logit function transforms (maps) variables bounded by 0 and 1 into unbounded variables; in symbols,

    \(\mathrm{logit} : (0,1) \to \mathbb{R}\)

    The logit's inverse, which maps from logit units back into level units, is called the logistic function:

    \begin{equation}
    y = \mathrm{logistic}(\tilde{y}) \stackrel{\text{def}}{=} \frac{1}{1 + \exp(-\tilde{y})} = \frac{\exp(\tilde{y})}{1+\exp(\tilde{y})}\label{eq:trans-logistic}
    \end{equation}

    The logistic function transforms unbounded variables into variables bounded by 0 and 1:

    \(\mathrm{logistic} : \mathbb{R} \to (0,1)\)
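    A quick numerical check that these two functions undo one another (both are written out by hand here, as stand-ins for any package versions):

```r
logit    = function(p) log(p / (1 - p))   # (0,1) -> R
logistic = function(z) 1 / (1 + exp(-z))  # R -> (0,1)

p = c(0.10, 0.50, 0.90)
logit(p)             # unbounded values; note logit(0.5) = 0
logistic(logit(p))   # recovers the original proportions
```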

    The figure below shows a graphic of the logistic function. The logit is the inverse.

    Figure \(\PageIndex{1}\): Graphic of the logistic function. The logit function is the inverse of the logistic. Note that the graph is symmetric about the point (0, 0.5).

    While other transforms are available (and will be covered when we get to GLMs), the logit is frequently used for the following three reasons:

    1. The transformation and its inverse are both functions (the transformation is a bijection). This means that the results are always commensurate with the original problem.
    2. The transformation is symmetric about \(y = 0.5\); in fact, \(\mathrm{logit}(1-y) = -\mathrm{logit}(y)\). This means that stretching toward 1 behaves the same as stretching toward 0.
    3. The function is exact, as opposed to the probit transform, which requires numerical approximation. This increases the speed and accuracy of your predictions.

    A careful reader will note that the domain of the logit includes neither 0 nor 1. This is because there is no continuous way of transforming a closed (or a half-closed) interval into an open interval such as ℝ while ensuring that the inverse is also a continuous function. This is a provable fact of mathematics (Strichartz 2000).

    Data that are Either 0 or 1

    But, what do we do if there are y-values that are zero or one?

    One solution is to add (subtract) an extremely small number, \(\delta\), to the zero (one). A second solution is to completely drop those data from the analysis.
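    A minimal sketch of the first solution; the value of \(\delta\) below is an illustrative choice, not a recommendation:

```r
y     = c(0, 0.25, 0.70, 1)               # data containing exact 0s and 1s
delta = 1e-4                              # illustrative small adjustment
yAdj  = pmin(pmax(y, delta), 1 - delta)   # 0 -> delta, 1 -> 1 - delta
yAdj                                      # now strictly inside (0,1)
```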

    Caution

    Neither of these solutions is perfect. If you insist on using linear regression, then you should use both methods and see how much your answer changes.

    A general rule of thumb is that if your underlying research model is correct, then the results should not vary wildly based on similar models. That is, if we know \(Y\) depends on \(X_1\) and \(X_2\), then all appropriate modeling techniques should give approximately the same results. If they do not, then there is something seriously wrong with our assumptions about the underlying relationships — the model.

    A third solution is to change the proportion into a bounded count and use a different paradigm (Chapter 15: Binary Dependent Variables). While this is the best option, it requires a bit more background before we can cover it.

    Example \(\PageIndex{1}\): Voting in Děčín

    Let us return to the cows data file and the example of Section 7.3: Cows in Děčín. The voters of Děčín are being sent to the polls to vote on a constitutional referendum that proposes to limit the number of cows kept in the city. This was not the first time that Ruritanians were sent to the polls to vote on this or a closely related issue. Given the information from previous votes, what is the probability that this ballot measure will pass in Děčín?

    Solution.
    Let us now answer this question more correctly. Recall that without performing a transformation of the dependent variable, the model could produce predictions that fell outside possible reality. To fix this, let us transform the dependent variable using the logit function, repeat the analysis, back-transform these transformed results to the original units, and compare the results.

    The first step is to transform the dependent variable. As the dependent variable is a proportion, let us use the logit transform (from the KnoxStats package). If we decide to call the new variable logitWin, then the command will be

    logitWin = logit(propWin)
    

    Now, this is our new dependent variable. As such, we perform the same analysis as in Section 7.3: Cows in Děčín:

    modLgt = lm(logitWin ~ yearPassed + chickens + religPct)
    

    The summary(modLgt) command provides the results summarized in the table below.

                                Estimate   Std. Error   t-value      p-value
    Constant Term                -1.8909       0.2898     -6.53    << 0.0001
    Year Passed (after 2000)     -0.0885       0.0157     -5.64    << 0.0001
    Contains a Chicken Ban       -0.2318       0.0878     -2.64       0.0134
    Percent Religious in Kraj     0.0475       0.0047     10.06    << 0.0001
    

    \(\blacksquare\)

    Interpreting the Coefficients ✨

    How shall we interpret the results? There are a few ways. As always, a good graphic is the best.

    However, an older manner relies on the "log odds ratio." The odds ratio is frequently used to illustrate the strength of the association between two variables. For every increase of 1 in the "percent religious in the kraj," the log of the odds of the vote passing increases by 0.0475. Said another way, the odds of the ballot measure passing increase by approximately 4.86% for each increase of 1 percentage point in religiosity. [Note: \(\exp[0.0475] = 1.0486\).]

    An increase of 2pp in religiosity increases the odds by about 10%. [Again, note: \(\exp \left[2 \times 0.0475\right] = 1.0997\).] Thus, if the original odds were 3-to-1 against (that is, odds of 1/3 in favor), increasing the religiosity by 2pp means the odds in favor become about \(1.0997 \times 1/3 = 0.3666\), or roughly 2.73-to-1 against.
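    In general, a logit-model coefficient \(b\) multiplies the odds by \(\exp(b \, \Delta x)\) for an increase of \(\Delta x\) in that predictor. A minimal sketch of this calculation (the value of b below is illustrative, not taken from the fitted model):

```r
# Multiplicative change in the odds for a change of dx in a predictor
oddsMultiplier = function(b, dx = 1) exp(b * dx)

b = 0.05                  # an illustrative log-odds coefficient
oddsMultiplier(b, 1)      # odds multiplier for a 1-unit increase
oddsMultiplier(b, 2)      # odds multiplier for a 2-unit increase
```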

    Note

    Beyond this, one cannot directly compare the magnitudes of these coefficients with the magnitudes of the previous coefficients; these effect estimates are in different units. The coefficients seen in the original model predict in the original units (proportions). The coefficients in the logit model predict in logit (of proportions) units.

    Furthermore, merely taking the logistic of the coefficients will not put them in level units; the transform is non-linear by design, so the effect of any variable depends on the values of all the variables. To compare the two models, we need to perform predictions (remembering to back-transform them).

    Predicting the proportion of the vote for the Děčín ballot measure is almost as easy as it was before. The only additional step is that we need to back-transform the prediction to get it in proportion units.

    So, according to this transformed model, what is the expected vote in Děčín? To answer this, we again need the Děčín information:

    yearPassed = 9
    chickens = 0
    religPct = 48

    With this information, and under the usual assumption that the model is correct, we have our prediction of \(-0.4091\) logits. Back-transforming this value gives \(\mathrm{logistic}(-0.4091) \approx 0.40\); that is, the model predicts that 40% of the voters will vote in favor of this ballot measure, just slightly different from our original prediction of 42%.

    DECIN = data.frame(yearPassed=9, chickens=0, religPct=48)
    voteLgt = predict(modLgt, newdata=DECIN)
    voteEst = logistic(voteLgt)
    

    However, remember that the original question was not about this point estimate; it was about the probability of the ballot measure passing. To determine this probability, we repeat the same steps as before, remembering to back-transform the results.

    Figure \(\PageIndex{2}\): Histogram of the results of the Monte Carlo experiment described in the text. Note that the distribution has a slight right-skew as a result of the transformation process. Also note that there are no predicted vote outcomes less than 0 or greater than 1, as compared to the original untransformed model of Section 7.3: Cows in Děčín. In fact, the lowest prediction is 9.0%, while the largest is 81.6%.

    The Monte Carlo results of the transformed model indicate that there is a 15% chance that the ballot measure will pass in Děčín. The histogram of a million predictions is presented in Figure \(\PageIndex{2}\). From this information, we can conclude that there is a definite possibility that the cow ballot measure will pass in Děčín (15%), with a predicted 40% vote in favor.

    If we were into betting, we could also conclude that this model predicts that the odds of this ballot measure passing is

    \begin{equation}
    \frac{1-p}{p}=\frac{1-0.15}{0.15} \approx 5.67
    \end{equation}

    which is about 5.67-to-1 against. Thus, a "fair" bet would pay $5.67 for every $1.00 bet in favor of the ballot measure and $1/5.67 = $0.176 for every dollar bet against the ballot measure passing.

    Regardless, since the probability of the measure passing is 15%, a pass would not be wholly unexpected. Its passing is more likely than flipping a fair coin three times and having it come up heads all three times (15% vs. 12.5%) — definitely not unheard of.

    The 95% prediction interval for the Děčín referendum outcome, according to our model, is from 23.5% to 59.0%. The observed value of 53% is within that interval.

    Note

    From this past discussion, we were able to estimate success probabilities and fair betting odds. This is yet another use of statistical modeling.

    Note that we are estimating the probability of an event. Unless that probability is 0 or 1, there is always a chance the event will (or will not) happen. Thus, the passing of the Děčín referendum in 2009 does not directly detract from our model. There was a 15% chance it would pass, according to our model.

    Caution

    Stay aware of what your statistical model says and does not say — the choice is humility or humiliation.

    Data Bounded Below by 0

    When the dependent variable represents a proportion (bounded by 0 and 1), we can use the logit function to transform it into an unbounded variable, perform the usual analysis, and back-transform those results into level units (the previous section). However, not all bounded variables follow this pattern; age, height, and income, for example, are bounded below by 0 and have no theoretical upper bound. For such variables, we may want to use the logarithm transform.

    The logarithm function transforms variables bounded below by 0 into unbounded variables; in symbols,

    \(\log : (0, \infty) \to \mathbb{R}\)

    Its inverse is the exponential function,

    \(\exp : \mathbb{R} \to (0,\infty)\)

    Both functions are bijections and strictly increasing and so are appropriate functions for transforming our variables.
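    A quick numerical check of this inverse pair:

```r
y = c(0.5, 1, 100)     # values bounded below by 0
yTilde = log(y)        # (0, Inf) -> R; note log(1) = 0
exp(yTilde)            # recovers the original values
```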

    Data that are 0

    Note that values of 0 are problematic for the logarithm in much the same way that values of 0 and 1 were problematic for the logistic function. The solutions are similar.

    Example \(\PageIndex{2}\): Wealth and Corruption

    The gross domestic product (GDP) per capita is one of many measures of average wealth in countries. If extant theory is correct, then the wealth in the country is directly affected by the level of honesty in the government — countries with high levels of honesty (low levels of corruption) should be wealthier than those with low levels of honesty (high levels of corruption). Furthermore, if theory is correct, the level of democracy in a country should also influence the country's level of wealth — countries with higher levels of democracy should be wealthier than countries with lower levels of democracy.

    Let us determine if reality (in the form of the data in the gdp data file) supports the current theory or if current theory needs to explain the severe discrepancies. Furthermore, let us estimate the GDP per capita for Ruritania and provide a 95% confidence interval for that estimate.

    Solution.
    For this section, recall that the level of honesty in government for Ruritania is 5.1 and the level of democracy is -7. With that information, I leave it as an exercise for you to model the data without transforming the dependent variable and to discover that the predicted GDP per capita for Ruritania is $26,795.64. This seems awesome for Ruritania. The 95% prediction interval is from $5232 to $48,360. That is rather wide, a consequence of the high level of variation in the data.

    However, to see a problem with the model, let us estimate the GDP per capita for Papua New Guinea (democracy=10, hig=2.1). According to the model, the predicted GDP per capita is -$2337, which is not physically possible. If nothing else, this prediction should suggest to you that the data needs transformation before being modeled.

    The process to estimate the GDP per capita in Ruritania using a transformed model is formulaic for us by now: transform the dependent variable by applying the logarithm function, model the transformed variable, estimate in the transformed units, and back-transform into level units (here, dollars).

    One feature of R that is shared by few other statistical packages is that you do not have to actually create a new variable; you can perform the transformation within the modeling command:

    modLog = lm(log(gdpcap) ~ democracy + hig)
    

    I'm not sure this is a strength. I prefer to clearly define my variables elsewhere.

    The results table for this model is provided here.

                             Estimate   StdErr   t-value     p-value
    Constant term              6.9333   0.1479     46.89   << 0.0001
    Level of Democracy        -0.0028   0.0113     -0.25      0.8055
    Honesty in Government      0.4702   0.0359     13.11   << 0.0001
    

    Again, as we have transformed the dependent variable, the coefficients are not in units of dollars. As such, their magnitudes cannot be directly compared to those in the untransformed model. Their directions, however, can be compared because the transformation we used was strictly increasing. Thus, this model tells us that higher levels of honesty in government correspond to countries with higher GDPs per capita (in this sample). Additionally, countries with higher democracy scores correspond to countries with lower GDPs per capita (in this sample).

    The first finding is so strong in this sample that we can conclude that there is evidence of this relationship in the population. The second finding, which conflicts with current theory, is not statistically significant at the usual \(\alpha=0.05\) level. Thus, we cannot conclude that the effect in the population is negative, positive, or null (zero). All we can conclude is that we did not detect an effect with these data. Whether this is due to a lack of effect in the population, the sample selected, or the sample size, no one can tell.

    With this model, we can estimate the GDP per capita in Ruritania using the standard method, but remembering that we must back-transform the final estimate. That is, if we used the commands

    RUR = data.frame(hig=5.1, democracy=-7)
    estLog = predict(modLog, newdata=RUR)
    

    then we would report Ruritania's GDP per capita as an estimated value of $11,508 (using exp(estLog)).

    \(\blacksquare\)

    Interpreting the Coefficients ✨

    From your mathematics course, you may recall that \(\log(1+x) \approx x\) for small values of \(x\). This means we can interpret the coefficients in the log-model as percent increases/decreases. For instance, the coefficient for the level of democracy in the country is -0.0028. We can interpret this as "one increase in the level of democracy decreases the GDP per capita by about 0.28%, on average." The coefficient of the level of honesty in government is 0.4702. We could interpret this as "one increase in the level of honesty in the government increases the GDP per capita by approximately 47%, on average."

    What do we mean by "small values of \(x\)"?

    Anything less than 0.20 is usually fine. As such, our interpretation of the honesty-in-government coefficient probably should not have been done. A log-coefficient value of 0.4702 really corresponds to a percent increase of about 60.0%. [Note: \(\exp(0.4702) = 1.6004\).] The exact value is more accurate, but less easily calculated.
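    A quick check of this exact conversion, using the honesty-in-government estimate from the table above:

```r
b = 0.4702              # honesty-in-government log-coefficient
100 * b                 # the small-x approximation: about 47%
100 * (exp(b) - 1)      # the exact percent change: about 60%
```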

    Here is my code to explore the relationship \(\log(1+x) \approx x\):

    x = seq(0,1, length=1e4)
    y = log(1+x)
    plot(x,y, col="blue1")
    abline(0,1, col="orange")
    

    ▪──── ⚔ ────▪

    The question asked us to calculate the estimate, but to also provide a 95% confidence interval. One way of doing this is to use Monte Carlo methods. The steps are all the same, with the additional step of back-transforming the estimates (last line).

    Here is the code for parametric bootstrapping:

    trials  = 1e6
    
    b.int   =  6.933298
    b.dem   = -0.002776
    b.hig   =  0.470225
    
    s.int   =  0.147873
    s.dem   =  0.011253
    s.hig   =  0.035855
    
    e.int   = rnorm(trials, m=b.int, s=s.int)
    e.dem   = rnorm(trials, m=b.dem, s=s.dem)
    e.hig   = rnorm(trials, m=b.hig, s=s.hig)
    
    outcome = e.int + e.dem*-7 + e.hig*5.1
    est     = exp(outcome)
    

    The assignments in the second and third groups are the coefficient estimates and standard errors from the model. The histogram of these results is provided in Figure \(\PageIndex{3}\). To calculate a 95% confidence interval, we merely find the values of est for which 2.5% and 97.5% of the data are less.

    quantile(est, c(0.025,0.975))
    

    From this, we can conclude that our model estimates the GDP per capita for Ruritania is $11,508, with a 95% confidence interval being from $7075 to $18,733. It is interesting to note that the actual GDP per capita in Ruritania is $55,000, which is well above our confidence interval. Thus, our question is this: Is our model that weak, or is Ruritania doing that well?

    Figure \(\PageIndex{3}\): Results of the Monte Carlo experiment estimating the GDP per capita for Ruritania and its 95% confidence interval. Note that 5% of the estimates fall in the rejection (orange) region, 2.5% above and 2.5% below. The median of this distribution is designated by \(\tilde{X}\).
    Note

    Here, I use the original estimate as the point estimate for the GDP per capita of Ruritania ($11,508). It would have also been appropriate to use the mean of the Monte Carlo trials ($11,870) or the median of the Monte Carlo trials ($11,510). All three are acceptable measures of the center. It is usual, however, to use the original prediction.

    Here is an interesting question:

    In the previous example, we estimated a confidence interval. How could we estimate a prediction interval?

    To answer this, we need to remember the only difference between confidence and prediction intervals. In a confidence interval, we are estimating an expected value. In a prediction interval, we are predicting a new outcome. That new outcome is a combination of the expected value and the \(\sigma^2\) from the \(\varepsilon\) term.

    And so, to get a prediction interval, we use the following. Check to see the difference between this and the previous script.

    trials  = 1e6
    
    b.int   =  6.933298
    b.dem   = -0.002776
    b.hig   =  0.470225
    b.err   =  0
    
    s.int   =  0.147873
    s.dem   =  0.011253
    s.hig   =  0.035855
    s.err   =  0.8841
    
    e.int   = rnorm(trials, m=b.int, s=s.int)
    e.dem   = rnorm(trials, m=b.dem, s=s.dem)
    e.hig   = rnorm(trials, m=b.hig, s=s.hig)
    e.err   = rnorm(trials, m=b.err, s=s.err)
    
    outcome = e.int + e.dem*-7 + e.hig*5.1 + e.err
    est     = exp(outcome)
    

    From this, the 95% prediction interval is from $1907 to $69,345. Note that it is much wider than the confidence interval. This should not surprise us at all: prediction intervals are always wider than the corresponding confidence intervals.

    Additional Bounds

    Thus far, we have looked at transforming the dependent variable when it is bounded above and below by 0 and 1 (two bounds), and when it is only bounded below by 0 (one bound). Other bounds are possible. In this section, we figure out how to handle all types of bounds. The basic steps are as follows:

    1. Determine whether the variable is bounded on one side or two.
    2. If bounded on one side, perform an algebraic transformation so that the new variable is bounded below by 0, then use the log transform.
    3. If bounded on two sides, perform an algebraic transformation so that the new variable is bounded by 0 and 1, then use the logit transform.

    In either case, you will need to remember to back-transform the predictions with this algebraic transformation.

    Note

    The only bounds I frequently come across in my own research are those bounded by 0 and 1, bounded by 0 and 100 (percentages), bounded by 0 and 4 (GPAs), and bounded below by 0. The quick solution for percentages is to divide them by 100 to make them proportions, then multiply the predictions by 100 to turn the predictions back into percentages.

    Bounded by L and U

    What if our data has a theoretic lower bound \(L\) and a theoretic upper bound \(U\)? As it is bounded above and below, we will change it into a proportion and use the logit transform as above. The algebraic transformation is

    \begin{equation}
    a(y) = p = \frac{y-L}{U-L}
    \end{equation}

    The back-transform is

    \begin{equation}
    a^{-1}(p) = y = p(U-L) + L
    \end{equation}
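    These two algebraic maps can be written as a pair of small helper functions (the names are illustrative):

```r
toProp   = function(y, L, U) (y - L) / (U - L)   # a(y): [L, U] -> [0, 1]
fromProp = function(p, L, U) p * (U - L) + L     # a^{-1}(p): back to [L, U]

y = c(135, 150, 168)                 # e.g., scores bounded by L = 130, U = 170
p = toProp(y, L = 130, U = 170)      # proportions, ready for the logit
fromProp(p, L = 130, U = 170)        # recovers the original values
```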

    Example \(\PageIndex{3}\): GRE Quant Score

    The scores on the quantitative portion of the Graduate Record Examination (GRE) range from \(L=130\) to \(U=170\).

    If we wished to properly model a person's GRE quantitative score, we would first subtract \(130\) from each score, then divide by \(170-130=40\). The new variable would range from 0 to 1, a proportion.

    Example \(\PageIndex{4}\): GPAs

    The grade point averages (GPAs) at Knox College are bounded below by \(L=0\) and above by \(U=4\).

    To appropriately model GPAs, we would have to subtract \(0\), then divide by \(4\). This new variable would now be a proportion.

    Bounded Below by L

    It may be that your dependent variable is bounded below by a specific value, \(L\), but not bounded above. As it is bounded on only one side, we will transform it into a variable bounded below by 0 and then apply the logarithm transform as above, remembering to back-transform with the additional transformation. The algebraic transformation is

    \begin{equation}
    a(y) = p = y - L
    \end{equation}

    The back-transform is

    \begin{equation}
    a^{-1}(p) = y = p + L
    \end{equation}

    Example \(\PageIndex{5}\): Excess Wage

    Hourly workers make at least $7.25 per hour.

    To model the excess hourly wage, we would subtract \(L=7.25\) from each hourly wage. This new variable is bounded below by \(0\), so we can apply the log transformation to it.
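    A sketch of this excess-wage transformation from start to finish (the wage values are made up for illustration):

```r
wages  = c(7.50, 12.00, 31.25)   # hourly wages, bounded below by L = 7.25
excess = wages - 7.25            # a(y): now bounded below by 0
wTilde = log(excess)             # unbounded, ready for modeling
exp(wTilde) + 7.25               # back-transform: recovers the wages
```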

    Bounded Above by U

    It may be that your dependent variable is theoretically bounded above by \(U\). As there is only one bound, we will perform an algebraic transformation so that it is bounded below by 0 and then apply the log transform as above, remembering to back-transform with the additional transformation. The algebraic transformation is

    \begin{equation}
    a(y) = p = U - y
    \end{equation}

    The back-transform is

    \begin{equation}
    a^{-1}(p) = y = U - p
    \end{equation}

    Example \(\PageIndex{6}\): Ocean Depths

    In the ocean, different species live at different depths. In fact, we can predict the depth based solely on the species observed. Ocean depth is bounded above by 0 and has no theoretic lower bound (although it certainly has a genuine lower bound at the Challenger Deep in the Mariana Trench, which has a depth of -35,994 ft). To transform the depths into a variable upon which we can perform a log transform, we subtract each value from \(U=0\). After we predict, we will have to back-transform by again subtracting each prediction from \(U=0\).
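    A sketch of this depth transformation (the depth values are made up for illustration):

```r
depth  = c(-10, -500, -4000)   # ocean depths, bounded above by U = 0
p      = 0 - depth             # a(y): distance below the surface, now > 0
dTilde = log(p)                # unbounded, ready for modeling
0 - exp(dTilde)                # back-transform: recovers the depths
```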

    Of course, the transformation in this last example is equivalent to measuring depth in terms of "distance below the surface," which is a positive number requiring no additional transformation.


    This page titled 8.1: The Issue of Boundedness is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Ole Forsberg.
