When to use it
Use this method for repeated tests of goodness-of-fit when you've done a goodness-of-fit experiment more than once; for example, you might look at the fit to a \(3:1\) ratio of a genetic cross in more than one family, or fit to a \(1:1\) sex ratio in more than one population, or fit to a \(1:1\) ratio of broken right and left ankles on more than one sports team. One question then is, should you analyze each experiment separately, risking the chance that the small sample sizes will have insufficient power? Or should you pool all the data, ignoring the possibility that the different experiments gave different results? This is when the additive property of the G–test of goodness-of-fit becomes important, because you can do a repeated G–test of goodness-of-fit and test several hypotheses at once.
You use the repeated G–test of goodness-of-fit when you have two nominal variables, one with two or more biologically interesting values (such as red vs. pink vs. white flowers), the other representing different replicates of the same experiment (different days, different locations, different pairs of parents). You compare the observed data with an extrinsic theoretical expectation (such as an expected \(1:2:1\) ratio in a genetic cross).
For example, Guttman et al. (1967) counted the number of people who fold their arms with the right arm on top (\(R\)) or the left arm on top (\(L\)) in six ethnic groups in Israel:
Ethnic group |
R |
L |
Percent R |
Yemen |
168 |
174 |
49.1% |
Djerba |
132 |
195 |
40.4% |
Kurdistan |
167 |
204 |
45.0% |
Libya |
162 |
212 |
43.3% |
Berber |
143 |
194 |
42.4% |
Cochin |
153 |
174 |
46.8% |
The null hypothesis is that half the people would be \(R\) and half would be \(L\). It would be possible to add together the numbers from all six groups and test the fit with a chi-square or G–test of goodness-of-fit, but that could overlook differences among the groups. It would also be possible to test each group separately, but that could overlook deviations from the null hypothesis that were too small to detect in each ethnic group sample, but would be detectable in the overall sample. The repeated goodness-of-fit test tests the data both ways.
I do not know if this analysis would be appropriate with an intrinsic hypothesis, such as the \(p^2:2pq:q^2\) Hardy-Weinberg proportions of population genetics.
How to do the test
First, decide what you're going to do if there is significant variation among the replicates. Ideally, you should decide this before you look at the data, so that your decision is not subconsciously biased towards making the results be as interesting as possible. Your decision should be based on whether your goal is estimation or hypothesis testing. For the arm-folding example, if you were already confident that fewer than \(50\%\) of people fold their arms with the right on top, and you were just trying to estimate the proportion of right-on-top folders as accurately as possible, your goal would be estimation. If this is the goal, and there is significant heterogeneity among the replicates, you probably shouldn't pool the results; it would be misleading to say "\(42\%\) of people are right-on-top folders" if some ethnic groups are \(30\%\) and some are \(50\%\); the pooled estimate would depend a lot on your sample size in each ethnic group, for one thing. But if there's no significant heterogeneity, you'd want to pool the individual replicates to get one big sample and therefore make a precise estimate.
If you're mainly interested in the knowing whether there's a deviation from the null expectation, and you're not as interested in the size of the deviation, then you're doing hypothesis testing, and you may want to pool the samples even if they are significantly different from each other. In the arm-folding example, finding out that there's asymmetry—that fewer than \(50\%\) of people fold with their right arm on top—could say something interesting about developmental biology and would therefore be interesting, but you might not care that much if the asymmetry was stronger in some ethnic groups than others. So you might decide to pool the data even if there is significant heterogeneity.
After you've planned what you're going to do, collect the data and do a G–test of goodness-of-fit for each individual data set. The resulting \(G\)-values are the "individual \(G\)-values." Also record the number of degrees of freedom for each individual data set; these are the "individual degrees of freedom."
Some programs use continuity corrections, such as the Yates correction or the Williams correction, in an attempt to make G–tests more accurate for small sample sizes. Do not use any continuity corrections when doing a replicated G–test, or the \(G\)-values will not add up properly. My spreadsheet for G–tests of goodness-of-fit gtestgof.xls can provide the uncorrected \(G\)-values.
Ethnic group |
R |
L |
Percent
R |
G-value |
Degrees
of freedom |
P value |
Yemen |
168 |
174 |
49.1% |
0.105 |
1 |
0.75 |
Djerba |
132 |
195 |
40.4% |
12.214 |
1 |
0.0005 |
Kurdistan |
167 |
204 |
45.0% |
3.696 |
1 |
0.055 |
Libya |
162 |
212 |
43.3% |
6.704 |
1 |
0.010 |
Berber |
143 |
194 |
42.4% |
7.748 |
1 |
0.005 |
Cochin |
153 |
174 |
46.8% |
1.350 |
1 |
0.25 |
As you can see, three of the ethnic groups (Djerba, Libya, and Berber) have \(P\) values less than \(0.05\). However, because you're doing \(6\) tests at once, you should probably apply a correction for multiple comparisons. Applying a Bonferroni correction leaves only the Djerba and Berber groups as significant.
Next, do a G–test of independence on the data. This give a "heterogeneity \(G\)-value," which for our example is \(G=6.750\), \(5d.f.\), \(P=0.24\). This means that the \(R:L\) ratio is not significantly different among the \(6\) ethnic groups. If there had been a significant result, you'd have to look back at what you decided in the first step to know whether to go on and pool the results or not.
If you're going to pool the results (either because the heterogeneity \(G\)-value was not significant, or because you decided to pool even if the heterogeneity was significant), add the numbers in each category across the repeated experiments, and do a G–test of goodness-of-fit on the totals. For our example, there are a total of \(925 R\) and \(1153 L\), which gives \(G=25.067\), \(1d.f.\), \(P=5.5\times 10^{-7}\). The interpretation of this "pooled \(G\)-value" is that overall, significantly fewer than \(50\%\) of people fold their arms with the right arm on top. Because the G–test of independence was not significant, you can be pretty sure that this is a consistent overall pattern, not just due to extreme deviations in one or two samples. If the G–test of independence had been significant, you'd be much more cautious about interpreting the goodness-of-fit test of the summed data.
If you did the pooling, the next step is to add up the \(G\)-values from the individual goodness-of-fit tests to get the "total \(G\)-value," and add up the individual degrees of freedom to get the total degrees of freedom. Use the CHIDIST function in a spreadsheet or online chi-square calculator to find the \(P\) value for the total \(G\)-value with the total degrees of freedom. For our example, the total \(G\)-value is \(31.817\) and the total degrees of freedom is \(6\), so enter "=CHIDIST(31.817, 6)" if you're using a spreadsheet. The result will be the \(P\) value for the total \(G\); in this case, \(P=1.8\times 10^{-5}\). If it is significant, you can reject the null hypothesis that all of the data from the different experiments fit the expected ratio. Usually, you'll be able to look at the other results and see that the total \(G\)-value is significant because the goodness-of-fit of the pooled data is significant, or because the test of independence shows significant heterogeneity among the replicates, or both. However, it is possible for the total \(G\)-value to be significant even if none of the other results are significant. This would be frustrating; it would tell you that there's some kind of deviation from the null hypotheses, but it wouldn't be entirely clear what that deviation was.
I've repeatedly mentioned that the main advantage of G–tests over chi-square tests is "additivity," and it's finally time to illustrate this. In our example, the \(G\)-value for the test of independence was \(6.750\), with \(5\) degrees of freedom, and the \(G\)-value for the goodness-of-fit test for the pooled data was \(25.067\), with \(1\) degree of freedom. Adding those together gives \(G=31.817\) with \(6\) degrees of freedom, which is exactly the same as the total of the \(6\) individual goodness-of-fit tests. Isn't that amazing? So you can partition the total deviation from the null hypothesis into the portion due to deviation of the pooled data from the null hypothesis of a \(1:1\) ratio, and the portion due to variation among the replicates. It's just an interesting little note for this design, but additivity becomes more important for more elaborate experimental designs.
Chi-square values are not additive. If you do the above analysis with chi-square tests, the test of independence gives a chi-square value of \(6.749\) and the goodness-of-fit test of the pooled data gives a chi-square value of \(25.067\), which adds up to \(31.816\). The 6 individual goodness-of-fit tests give chi-square values that add up to \(31.684\), which is close to \(31.816\) but not exactly the same.
Similar tests
If the numbers are small, you may want to use exact tests instead of G–tests. You'll lose the additivity and the ability to test the total fit, but the other results may be more accurate. First, do an exact test of goodness-of-fit for each replicate. Next, do Fisher's exact test of independence to compare the proportions in the different replicates. If Fisher's test is not significant, pool the data and do an exact test of goodness-of-fit on the pooled data.
Note that I'm not saying how small your numbers should be to make you uncomfortable using G–tests. If some of your numbers are less than \(10\) or so, you should probably consider using exact tests, while if all of your numbers are in the \(10s\) or \(100s\), you're probably okay using G–tests. In part this will depend on how important it is to test the total \(G\)-value.
If you have repeated tests of independence, instead of repeated tests of goodness-of-fit, you should use the Cochran-Mantel-Haenszel test.