This vignette explores the Anderson–Darling k-Sample test. CMH-17-1G [1] provides a formulation for this test that appears different than the formulation given by Scholz and Stephens in their 1987 paper [2].
The two references use different nomenclature, which is summarized as follows:
Term | CMH-17-1G | Scholz and Stephens |
---|---|---|
A sample | i | i |
The number of samples | k | k |
An observation within a sample | j | j |
The number of observations within the sample i | $n_i$ | $n_i$ |
The total number of observations within all samples | n | N |
Distinct values in combined data, ordered | $z_{(1)} \ldots z_{(L)}$ | $Z_1^* \ldots Z_L^*$ |
The number of distinct values in the combined data | L | L |
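Given the possibility of ties in the data, the discrete version of the test must be used. Scholz and Stephens (1987) give the test statistic as:

$$
A_{akN}^2 = \frac{N-1}{N} \sum_{i=1}^{k} \frac{1}{n_i} \sum_{j=1}^{L} \frac{l_j}{N} \frac{\left(N M_{aij} - n_i B_{aj}\right)^2}{B_{aj}\left(N - B_{aj}\right) - N l_j / 4}
$$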
CMH-17-1G gives the test statistic as:
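$$
ADK = \frac{n-1}{n^2 (k-1)} \sum_{i=1}^{k} \frac{1}{n_i} \sum_{j=1}^{L} h_j \frac{\left(n F_{ij} - n_i H_j\right)^2}{H_j\left(n - H_j\right) - n h_j / 4}
$$

By inspection, the CMH-17-1G version of this test statistic contains an extra factor of $\frac{1}{k-1}$.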
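The factor of $(k-1)$ between the two statistics can also be checked numerically. The following is a minimal sketch in plain Python, not the code used by `cmstatr` or `kSamples`. The function names are hypothetical, and the definitions of $l_j$, $B_{aj}$ and $M_{aij}$ (equivalently $h_j$, $H_j$ and $F_{ij}$) are not stated in the text above; they are assumed here to be the usual tie-adjusted counts, as noted in the comments.

```python
from collections import Counter


def _midrank_terms(samples):
    """Yield (n_i, l_j, M_aij, B_aj) for every sample i and distinct value j.

    Assumed definitions (taken from the published test, not from this text):
      l_j   = number of pooled observations equal to the j-th distinct value
      B_aj  = number of pooled observations below that value, plus l_j / 2
      M_aij = the same quantity as B_aj, counted within sample i only
    CMH-17-1G calls the same three quantities h_j, H_j and F_ij.
    """
    pooled = Counter(x for s in samples for x in s)
    distinct = sorted(pooled)
    for s in samples:
        own = Counter(s)
        below_own = 0  # observations in sample i below the current value
        below_all = 0  # pooled observations below the current value
        for z in distinct:
            l_j, f_ij = pooled[z], own[z]
            yield len(s), l_j, below_own + f_ij / 2, below_all + l_j / 2
            below_own += f_ij
            below_all += l_j


def a2akn(samples):
    """Statistic in the Scholz and Stephens form."""
    N = sum(len(s) for s in samples)
    total = sum(
        (l / N) * (N * M - n_i * B) ** 2 / (B * (N - B) - N * l / 4) / n_i
        for n_i, l, M, B in _midrank_terms(samples)
    )
    return (N - 1) / N * total


def adk(samples):
    """Statistic in the CMH-17-1G form."""
    n, k = sum(len(s) for s in samples), len(samples)
    total = sum(
        h * (n * F - n_i * H) ** 2 / (H * (n - H) - n * h / 4) / n_i
        for n_i, h, F, H in _midrank_terms(samples)
    )
    return (n - 1) / (n ** 2 * (k - 1)) * total


# Made-up data with ties, for illustration only
samples = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 6.0], [4.0, 5.0, 6.0, 7.0]]
k = len(samples)
# The two printed values agree up to floating-point rounding
print(a2akn(samples), adk(samples) * (k - 1))
```

The sketch assumes at least two distinct values in the pooled data; if every observation is identical, the denominator is zero.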
Scholz and Stephens indicate that one rejects $H_0$ at a significance level of $\alpha$ when:

$$
\frac{A_{akN}^2 - (k-1)}{\sigma_N} \ge t_{k-1}(\alpha)
$$
This can be rearranged to give a critical value:
$$
A_{crit}^2 = (k-1) + \sigma_N \, t_{k-1}(\alpha)
$$
CMH-17-1G gives the critical value for $ADK$ for $\alpha = 0.025$ as:

$$
ADC = 1 + \sigma_n \left(1.96 + \frac{1.149}{\sqrt{k-1}} - \frac{0.391}{k-1}\right)
$$
The definition of $\sigma_n$ from the two sources differs by a factor of $(k-1)$: because $ADK$ is $A_{akN}^2$ divided by $(k-1)$, its standard deviation is $\sigma_n = \sigma_N / (k-1)$.
The value in parentheses in the CMH-17-1G critical value corresponds to the interpolation formula for $t_m(\alpha)$ given in Scholz and Stephens’s paper. It should be noted that this is not Student’s t-distribution, but rather a distribution referred to as the $T_m$ distribution.
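The relationship between the two critical values can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the value of $\sigma_N$ is passed in rather than computed, and the sketch assumes that the CMH-17-1G $\sigma_n$ equals $\sigma_N / (k-1)$, which follows from the scaling of the statistic.

```python
import math


def t_interp_0025(m):
    """Interpolation formula for t_m(alpha) at alpha = 0.025, with m = k - 1.

    This is the expression in parentheses in the CMH-17-1G critical value.
    """
    return 1.96 + 1.149 / math.sqrt(m) - 0.391 / m


def critical_values(k, sigma_N):
    """Return (A2_crit, ADC) for k samples, given the Scholz-Stephens sigma_N."""
    t = t_interp_0025(k - 1)
    a2_crit = (k - 1) + sigma_N * t  # Scholz and Stephens form
    sigma_n = sigma_N / (k - 1)      # assumed CMH-17-1G scaling
    adc = 1 + sigma_n * t            # CMH-17-1G form
    return a2_crit, adc


# Illustrative inputs only; sigma_N here is a made-up number
a2_crit, adc = critical_values(k=4, sigma_N=1.2)
# The two printed values agree up to floating-point rounding
print(a2_crit / (4 - 1), adc)
```

Under that assumption the two critical values differ by the same factor of $(k-1)$ as the two statistics, so comparing $ADK$ with $ADC$ is equivalent to comparing $A_{akN}^2$ with $A_{crit}^2$.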
The `cmstatr` package uses the package `kSamples` to perform the k-sample Anderson–Darling tests. This package uses the original formulation from Scholz and Stephens, so the test statistic will differ from that given by software based on the CMH-17-1G formulation by a factor of $(k-1)$.
For comparison, SciPy’s implementation also uses the original Scholz and Stephens formulation. The statistic that it returns, however, is the normalized statistic, $[A_{akN}^2 - (k-1)]/\sigma_N$, rather than `kSamples`’s $A_{akN}^2$ value. To be consistent, SciPy also returns the critical values $t_{k-1}(\alpha)$ directly. (Currently, SciPy also floors/caps the returned p-value at 0.1% / 25%.) The values of $k$ and $\sigma_N$ are available in `cmstatr`’s `ad_ksample` return value, if an exact comparison to Python SciPy is necessary.
The conclusion drawn about the null hypothesis, however, is the same whether it is based on the R packages, the CMH-17-1G formulation, or SciPy.
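As a rough illustration of why the three conventions lead to the same decision, the sketch below takes a value of $A_{akN}^2$, together with $k$, $\sigma_N$ and a critical value $t_{k-1}(\alpha)$, and applies each rejection rule in turn. The function names and the input numbers are hypothetical; no R or SciPy function is called.

```python
import math


def normalized_statistic(a2akn, k, sigma_N):
    """Convert A2akN (as reported by kSamples) to the normalized form."""
    return (a2akn - (k - 1)) / sigma_N


def reject_all_forms(a2akn, k, sigma_N, t_crit):
    """Return the reject/accept decision under each of the three conventions."""
    # Scholz and Stephens: compare A2akN with (k - 1) + sigma_N * t
    original = a2akn >= (k - 1) + sigma_N * t_crit
    # CMH-17-1G: both sides divided by (k - 1), assuming sigma_n = sigma_N / (k - 1)
    cmh = a2akn / (k - 1) >= 1 + sigma_N / (k - 1) * t_crit
    # Normalized form: compare the normalized statistic with t directly
    normalized = normalized_statistic(a2akn, k, sigma_N) >= t_crit
    return original, cmh, normalized


# Made-up inputs: k = 3 samples, alpha = 0.025 interpolation for t
k, sigma_N = 3, 1.1
t_crit = 1.96 + 1.149 / math.sqrt(k - 1) - 0.391 / (k - 1)
print(reject_all_forms(6.0, k, sigma_N, t_crit))  # all three decisions agree
print(reject_all_forms(3.0, k, sigma_N, t_crit))  # all three decisions agree
```

The three comparisons are algebraic rearrangements of one inequality, so they can only disagree through floating-point rounding when the statistic sits almost exactly on the critical value.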