4.8: Evaluating a Continuous Distribution
The answer to the previous question is that we evaluate how normal or not normal the distribution is.
What are you looking for in the distribution? You need a picture (a histogram) and a table (a frequency table showing how many people obtained each score). You are looking for the shape of the distribution. There should be a bell curve and tails on each side.
First step – look at it.
The first step is to look at the distribution. If you see a bell-shaped curve with the scores clustered in the center of the graph, you likely have a normal distribution. If the clump of scores "leans" to the left or right of the graph rather than sitting in the center, and if there is an elongated tail on the left or right side, the distribution is skewed. When you look at the distribution, also look for "zits," or bumps, on the distribution. We call those bumps outliers. Think of outliers as zits: small bumps on the face. You'll see clumps of data, then a gap, then a single bump at the end of the distribution.
This is a heuristic, visual process, and by no means definitive, but it serves as a first step to see whether your distribution is close to or far from normal.
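You do not need SPSS or a fancy package for this first look. Here is a minimal Python sketch (the scores are simulated on a hypothetical 1-to-5 scale, purely for illustration) that prints a rough text histogram you can eyeball for a bell shape:

```python
import numpy as np

# Simulated scores on a hypothetical 1-to-5 scale (not real data)
rng = np.random.default_rng(0)
scores = rng.normal(loc=3, scale=1, size=200).clip(1, 5)

# A rough text histogram: one row per bin, one '#' per participant
counts, edges = np.histogram(scores, bins=8)
for count, left, right in zip(counts, edges, edges[1:]):
    print(f"{left:4.2f}-{right:4.2f} | {'#' * count}")
```

If the rows of `#` marks swell in the middle and taper off at both ends, the shape is at least roughly bell-like; a clump pushed to one side with a long trail of sparse bins is the "lean" described above.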
Second step – get a skew value.
A second step is to obtain a skewness value from your statistics program (such as SPSS). If the skewness value is 0, the distribution is symmetrical, and you have a normal distribution. Positive values mean that the distribution is positively skewed; negative values mean that the distribution is negatively skewed. The sign simply indicates direction, but the magnitude does matter. Some say skewness values beyond ±1.0 indicate a skewed distribution. My recommendation is to look for values beyond ±5.0. More about this later.
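If you are curious where the skewness number comes from, here is a sketch in Python computing the Fisher-Pearson skewness coefficient on simulated data (the seed and the distributions are invented for illustration; your statistics program reports a closely related, bias-adjusted value):

```python
import numpy as np

def sample_skewness(x):
    """Fisher-Pearson skewness: mean cubed deviation divided by SD cubed."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean()**1.5

rng = np.random.default_rng(1)
symmetric = rng.normal(3, 1, size=500)       # roughly bell-shaped
right_skewed = rng.exponential(2, size=500)  # long right tail

print(round(sample_skewness(symmetric), 2))    # near 0: symmetrical
print(round(sample_skewness(right_skewed), 2)) # clearly positive
```

The symmetric sample lands near 0 and the long-right-tail sample comes out clearly positive, matching the direction rule above.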
Third step – look for zits or outliers.
A third step is to look for outliers, otherwise known as those "damn" outliers. The presence of an outlier is an indicator of a possibly skewed distribution. Outliers are observations that are extremely high or low compared to the mean score. To count, an outlier must be a considerable one, and determining what is a "considerable" outlier is an analysis that is its own beast. More about this later.
Fourth step – Standard deviation greater than the mean.
A fourth step is this rule of thumb – if the SD is greater than the mean, that’s bad. This is a big rule, in my opinion. It is essentially this: SD > M. For those of you familiar with the mean and standard deviation, the mean is the average of the scores, and the standard deviation is a measure of the spread of scores above and below the mean.
This situation is a big deal because you get values that do not exist on the scale. If we have a mean of 3, a standard deviation of 1, and a scale range of 1 to 5, then the spread of scores around the mean makes sense because it is within the scale range: the scores spread from 2 (one standard deviation below the mean) to 4 (one standard deviation above the mean).
The situation gets interesting when you have a mean of 3 and a scale range of 1 to 5, but the standard deviation is 4. Now what happens? The spread of scores around the mean no longer makes sense because the upper and lower spread falls outside the scale range. Here, the mean is 3, and the standard deviation indicates that the scores spread from -1 (one standard deviation below the mean) to 7 (one standard deviation above the mean). However, neither -1 nor 7 can exist in the scale range; there is no way those values can occur.
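This rule-of-thumb check can be written out mechanically. A small sketch (the function name and the scale values are illustrative, not a standard API):

```python
def one_sd_band(mean, sd, scale_min, scale_max):
    """Return the mean +/- 1 SD band and whether it stays inside the scale."""
    lower, upper = mean - sd, mean + sd
    within = (lower >= scale_min) and (upper <= scale_max)
    return lower, upper, within

# Mean 3, SD 1 on a 1-to-5 scale: band is 2 to 4, inside the range
print(one_sd_band(3, 1, 1, 5))   # (2, 4, True)

# Mean 3, SD 4 on the same scale: band is -1 to 7, impossible values
print(one_sd_band(3, 4, 1, 5))   # (-1, 7, False)
```

The second call reproduces the troubling case in the text: the one-standard-deviation band reaches values that cannot exist on the scale.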
Are there situations where the SD is legitimately greater than the mean? Yes, especially when there is no upper bound, as with ratio variables: cancer cell counts, income, or number of arrests. When the lower bound is 0, values can only go up. Can you ever get a negative number below that lower bound of 0? Yes, if the negative number has meaning, as in finance or economics, where a negative value usually means debt owed or a trade deficit.
If the SD is greater than the mean, then the distribution is definitely skewed. At this point, you would need to consult with a statistician.
Fifth step – low sample size.
A fifth step is to check the sample size. Small sample sizes tend to result in skewed distributions. Few data points mean more chances for an outlier to emerge because there are not enough points to establish a center. More data points mean more chances for scores to fill out the distribution, so there is less chance for an outlier to stand out because there will be data points all along the distribution. A rule of thumb is that a sample size of 30 is ideal; anything below 20 is a problem. Why? We will revisit this issue when discussing the central limit theorem.
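You can see the small-sample problem directly by simulation: draw many samples from a perfectly normal population and watch how much the skewness estimate swings at each sample size. A sketch with simulated data (the trial counts and seed are arbitrary):

```python
import numpy as np

def sample_skewness(x):
    """Fisher-Pearson skewness: mean cubed deviation divided by SD cubed."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean()**1.5

rng = np.random.default_rng(2)

def skew_spread(n, trials=2000):
    """SD of skewness estimates across many samples of size n,
    all drawn from a truly normal population (true skewness = 0)."""
    return np.std([sample_skewness(rng.normal(size=n)) for _ in range(trials)])

print(round(skew_spread(10), 2))   # large: small samples often look skewed
print(round(skew_spread(100), 2))  # much tighter around 0
```

Even though every sample comes from a genuinely normal population, the skewness estimates at n = 10 scatter far more widely than at n = 100, which is why a small sample can look skewed by accident.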
Sixth step – is the distribution supposed to be skewed?
The sixth step is to ask yourself: "Does the skew make sense?" There are plenty of instances, variables, and constructs that are not normally distributed. In fact, I would be very surprised if some of these variables were normally distributed. For example, take the number of disciplinary referrals per student in each grade. We know that most students are good kids; there are only a few not-so-great kids. Most students will sit at the low, zero end with no disciplinary referrals. Some students will have one or two referrals. But then some will have, say, 10 or more. The distribution will be positively skewed because the tail, and the skew, will point toward the right, or high, end of the distribution.
It's the same scenario for the number of drinks on a given weekend on a college campus. Most students will likely have zero or one beer. Then, you get a crowd of students who will have a keg of beer: another positively skewed distribution. Take HIV viral load counts. In a given clinic, most patients will likely have low HIV viral load counts, but a few just starting treatment will have very high counts (and hopefully, with treatment, the load will come down). It would be surprising if these constructs were normally distributed. You would not expect an "average" number of drinks on a college campus, whatever that average might be; same for HIV viral load. These skewed distributions are supposed to be skewed, so there is no need to be alarmed if the distribution is skewed; in fact, you might be alarmed if the distribution were normal.
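As a quick illustration of the referral example, here is a simulation (the Poisson rate of 0.8 referrals per student is invented) showing the zero-heavy, right-tailed shape we would expect:

```python
import numpy as np

def sample_skewness(x):
    """Fisher-Pearson skewness: mean cubed deviation divided by SD cubed."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean()**1.5

# Hypothetical referral counts: a count process with a small mean,
# so most students sit at zero and a few rack up several
rng = np.random.default_rng(4)
referrals = rng.poisson(lam=0.8, size=300)

print(int((referrals == 0).sum()))           # most students: zero referrals
print(round(sample_skewness(referrals), 2))  # positive: tail points right
```

The pile-up at zero with a trailing right tail is exactly the positively skewed shape the text describes, and here it is the expected shape, not a problem to fix.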
Are continuous distributions different for ordinal, interval, and ratio variables?
The evaluation criteria are the same for each of those variables. The variables differ, however, in their propensity to produce a skewed distribution. Ordinal and ratio variables tend to have skewed distributions; interval variables tend to have normal distributions.
Ordinal variables might have skewed distributions depending on how many participants tend to get certain ranks. Suppose a lot of participants get the top ranking, and the top rank is the number one ranking. In that case, you get a positively skewed distribution because most participants will be clustered toward the left at the low value, and only a few are on the right, higher ranking.
Although the number of ranks can be unlimited, most ordinal variables have narrow ranges of ranks. Sports usually have just first, second, third, fourth, and fifth place. GPA is basically five ranks: A, B, C, D, and F. In other situations, the ranking system is not so apparent. We rank expertise as novice, intermediate, advanced, and expert: only four ranks. We rank clinical distress in three categories: normal functioning, sub-clinical stress, and clinical stress. For these variables, the distribution tends to be skewed toward one or two ranks at the high or low end. But does that mean the distribution will be skewed to the point of concern? Not likely, because most ordinal variables have a low number of ranks, which means fewer chances for outliers and values in the extreme ranges. Scores in the extreme ranges tend to pull distributions toward the extreme, and skewness tends not to be severe with ordinal variables with a small number of ranks.
Interval variables are basically Likert scales. Like ordinal variables, Likert scale ranges are usually limited. The common Likert scale ranges are 0 to 4, 1 to 5, 1 to 7, and 1 to 10. As with ordinal variables, the limited range means there is less opportunity for an extreme score to occur. Do skewed distributions happen with interval variables? If the range is expanded, yes, it could occur. But you have to admit, there is very little use for such an expanded range, and the skewed distribution is then the result of the expanded range, not because the variable itself has a distribution that is already skewed. For example, suppose you have a variable where the interval scale is 1 to 100. Let's say the construct is depression, with 100 being the most severe depression. What is the difference between 99 and 100, or for that matter, 90 and 100? Both are indicators of severe depression; a difference of 1 or 10 at that extreme end does not give us any additional, precise information about the nature of the depression. The sampling context matters. If the majority of your sample came from a general community, most will have no, mild, or moderate depression. Will a participant from the community have a depression score in the high 90s? It is possible, and if that occurs, the participant should probably be removed, because a score representing severe depression requiring hospitalization is unlikely to share the same characteristics as the community sample. The point is this: Likert scales tend not to have skewed distributions because most Likert scales have limited ranges, and limited ranges mean fewer opportunities for the extreme scores that produce a severely skewed distribution.
Ratio variables tend to have skewed distributions because the range is usually unlimited beyond the zero point, and in most contexts a wide range is expected: the number of points scored in a basketball game, the number of alcoholic drinks one has at a wedding, the number of miles a trainee travels to a practicum site. Sometimes the variable is loosely defined, and the unlimited range tends to result in a skewed distribution. The number of violent acts is difficult to define, and in a prison setting, because so many acts would likely qualify as violent, it would not be surprising if the distribution of the number of violent acts were positively skewed.
Positive versus negative skew.
Is there any need to compare positively and negatively skewed distributions? Positively skewed distributions are more likely to occur than negatively skewed ones. Ratio variables, with their unlimited upper bound, are quite common, which means that the distribution will likely be positively skewed because the extreme values will be found in the upper ranges.
Negatively skewed distributions are less frequent because there are very few cases where the upper bound is fixed and most participants score near that upper bound. One example would be a Likert scale from 1 to 5 for customer satisfaction, with 1 being not very satisfied and 5 being very satisfied. Most customers give fours and fives, and few customers give ones and twos. So, the distribution would be negatively skewed because most of the responses would be on the high end and few on the low end. In these cases, the skew would not be too severe because the Likert scale has a limited range. In most cases, you could collapse the categories: collapse categories 1 and 2 into one "not satisfied" category, leave category 3 as is, and collapse categories 4 and 5 into one "satisfied" category. The skew would be reduced because the range of scores is collapsed into only three values, leaving no opportunity for an extreme score to occur.
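The collapsing idea is simple enough to show in a few lines. A sketch with hypothetical satisfaction responses (the data are invented):

```python
from collections import Counter

# Hypothetical 1-5 satisfaction responses, negatively skewed:
# most customers answer 4 or 5
responses = [5, 4, 5, 5, 4, 4, 5, 3, 4, 5, 2, 5, 4, 1, 5]

# Collapse 1-2 into "not satisfied", keep 3, collapse 4-5 into "satisfied"
labels = {1: "not satisfied", 2: "not satisfied", 3: "neutral",
          4: "satisfied", 5: "satisfied"}
collapsed = Counter(labels[r] for r in responses)

print(collapsed)
```

Five response values become three categories, so the scattered low-end responses no longer sit out in a tail of their own.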
So, positively skewed distributions are more common than negatively skewed distributions. But there really are no advantages to the direction of the skew or how frequently a positive or negative skew will occur.
After going through the steps, now what?
So, after your analysis, then what? You need to decide if your distribution is normal or skewed. If it is normal, you can proceed with your statistical analysis. If it is not normal, then you have some decisions to make.
Here is your answer to your question about whether the distribution is normal or skewed – Your distribution is normal enough for your purposes.
Recall that a normal distribution is actually very rare to find. All distributions deviate from the normal distribution to some degree. And it really takes a LOT of problems to make a distribution really skewed. There are no criteria such as "when four of the six steps indicate a distribution is skewed, it means your distribution is skewed." The six steps above are just ways to guide your analysis of evaluating the quality of your distribution.
Most of the time, your statistical results will be robust even if the distribution shows signs of skewness. In statistical terms, we say the statistical test is robust, which means that despite a distribution with a few violations of normality, possibly a skewed one, you will probably get the same result, significant or not significant, whether the distribution is normal or not. In other words, you can probably use your variable's distribution as part of your statistical analysis.
Let’s start from the beginning.
Your main concern is a distribution that is supposed to be somewhat normal but turns out to be quite skewed. Could this scenario occur? Yes. Suppose you are examining the number of months that patients suffering from alcoholism stay sober after discharge. The mean is usually about three months. This distribution will likely lean toward a positive skew because the mean of three months is on the low end of the scale, and a few patients will do better than three months, say, up to one year of sobriety. Overall, though, the distribution is possibly close enough to normal because you likely do not have many instances of patients with extremely long sober periods after treatment. Is it possible to have a patient who stays sober for five years after discharge? It's entirely possible; some patients might just do very well. The distribution would then be positively skewed because the skew represents a patient who just happened to do very well in treatment.
In this case, the issue is not necessarily a statistical issue. The issue might be a sample size issue. If you simply collect more data on patients who stay sober over time, you might find patients who stay sober for two years, three years, or four years. In this case, the patient with five years sober does not look like an outlier; rather, the more data collected about the patient's length of time sober will eventually fill out the distribution, so the patient with five years does not look like much of an outlier anymore. In these cases, you might have a research design and data collection solution. It is not a problem that statistics can fix.
You will encounter folks (such as faculty) who insist that you should transform the distribution when you have a non-normal distribution. Transforming the distribution means using an algorithm (e.g., math operations such as square roots or logs) to reconfigure the non-normal distribution into a normal distribution. There are concerns about this approach. One concern is that it is difficult to interpret transformed data. How do you interpret the square root of a variable versus the actual values of the variable? In my experience, the transformation algorithm does not magically transform your non-normal distribution into a normal one. Usually, the statistical results from a transformed distribution and a non-transformed distribution are the same. If they turn out to be different, then there is a problem. What do you do if the non-transformed, non-normal distribution produces a significant result, but the square-root-transformed (now normal) distribution produces a non-significant result? Should we go with the significant result from the non-normal distribution, or the non-significant result from the transformed distribution? That's a difficult decision, and there are no great guidelines for this scenario. Hence, I stay away from using algorithms to transform distributions.
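For the record, here is what a transformation does, sketched on simulated right-skewed data (the skewness function and the data are illustrative): the square root and log pull in the long right tail, but notice that the transformed units no longer mean anything concrete.

```python
import numpy as np

def sample_skewness(x):
    """Fisher-Pearson skewness: mean cubed deviation divided by SD cubed."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return (d**3).mean() / (d**2).mean()**1.5

rng = np.random.default_rng(3)
raw = rng.exponential(2, size=500)  # simulated, strongly right-skewed scores

sqrt_scores = np.sqrt(raw)    # square-root transform
log_scores = np.log(raw + 1)  # log transform; the +1 guards against log(0)

print(round(sample_skewness(raw), 2))          # large and positive
print(round(sample_skewness(sqrt_scores), 2))  # tail pulled in
print(round(sample_skewness(log_scores), 2))   # tail pulled in
```

Both transforms reduce the skewness number, but the interpretation problem remains: "the square root of the score went up" is a hard sentence to explain to anyone.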
You will encounter folks (such as faculty) who insist that you must switch to a non-parametric test. More about non-parametric tests appears in the non-parametric chapter. Suffice it to say, non-parametric tests are used when the variable distributions do not meet the assumptions of a parametric test; in this case, the normal distribution assumption was not met. Every parametric test has a non-parametric equivalent. For example, the parametric two-sample t-test has a non-parametric equivalent, the Mann-Whitney test. Is this a better solution? For me, not really. Using a non-parametric test doesn't dramatically fix anything or give you a significant result when the non-normal distribution gave you a non-significant one.
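To demystify the Mann-Whitney test a little: its U statistic is just a count of pairwise comparisons between the two groups, which is why it cares about rank order rather than the shape of the distribution. A sketch (illustrative only; a real analysis would use a statistics package, which also supplies a p-value):

```python
import numpy as np

def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: the number of (a, b) pairs with a > b,
    counting ties as one half each."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    greater = (a[:, None] > b[None, :]).sum()
    ties = (a[:, None] == b[None, :]).sum()
    return greater + 0.5 * ties

# Two small hypothetical groups of scores
print(mann_whitney_u([3, 5, 7, 9], [1, 2, 4, 6]))  # 13.0
```

Because U depends only on which group's score is larger in each pair, one extreme outlier shifts it no more than any other large value would, which is the sense in which the test tolerates non-normal distributions.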
To jump ahead, if your distribution is really skewed or is meant to be skewed, you need to consult a statistician because none of the conventional statistical analyses will produce valid results. The problem is not a skewed distribution or finding the right statistical fix.
The problem is that your variable’s distribution simply cannot be analyzed with conventional statistics. If you have a variable whose distribution is erratic, then conventional statistics or transformation simply won’t work. For example, if you are examining the number of reported abuse incidents during childhood for abuse survivors, that distribution likely becomes very lumpy and skewed. Some survivors have only one incident; some have a few, some have several, and some have hundreds. Not all those incidents are similar. They will vary in intensity and duration per incident.
For a distribution like that, it is difficult to say what would be a better analysis, but it is highly likely that this variable’s distribution needs to be analyzed using an entirely different statistical model. This idea will be developed when we discuss non-parametric statistics, but for now, parametric statistics follow what is known as the general linear model. This model says that all relationships between variables follow a linear trend: as X increases, Y increases or decreases. For a variable such as abuse incidents, it is quite likely that this variable does not follow a general increasing trend. There is no way to determine what the analysis could be here, so you will need to consult a statistician.
Conclusion
When evaluating distributions of variables, consider the following.
- Is your distribution categorical or continuous?
- If categorical, then your evaluation criteria will consist of the following:
- For your sample, is the distribution of the number of participants per group for a given variable similar to the population under consideration?
- There are only two answers to this question – is the sample similar and expected, or is the sample not similar and unexpectedly different from the population under consideration?
- If the sample meets the population expectation, then proceed with your analyses.
- If not, then there is not much you can do for a statistical fix.
- You would have to collect more data and basically recruit more participants to achieve the number of participants you want for each group for the variable.
- You would have to consider the context and determine if this sample distribution is the best you can do under your research context. If so, then let’s leave it alone.
- Is your distribution continuous, as with ordinal, interval, and ratio variables?
- If continuous, then your evaluation is the question: “Does my distribution resemble a normal distribution?” The only two answers to this question are “Yes, it does” or “No, it does not; it is a skewed distribution.”
- Follow the six guidelines to determine if you have a normal or skewed distribution.
- If it is normal or the skew is not that bad, proceed with your conventional parametric analyses: t-tests, ANOVA, correlations, and regressions.
- If it is skewed and there are concerns….
- You could transform the variable by using square root or logarithmic transformations. But as noted, I don’t like this option.
- Examine your research context and determine if you need to collect more data.
- Or use a different statistical analysis that is not conventional.
- You need to consult with a statistician because there are no easy solutions for this.
- In the end…. Above all, leave it alone.
- Most distributions are going to be skewed to some degree.
- The skewed distribution is likely not going to be a single determining factor that severely affects your statistical analysis.