17.5.5.2 Algorithms (Two-Sample Kolmogorov-Smirnov Test)

The procedure below draws on NAG algorithms.

Consider two independent samples X and Y of sizes n_1\,\! and n_2\,\!, denoted x_1,x_2,\ldots ,x_{n_1}\,\! and y_1,y_2,\ldots ,y_{n_2}\,\! respectively. Let F(x) and G(x) represent their respective unknown distribution functions, and let S_1(x)\,\! and S_2(x)\,\! denote the corresponding sample empirical distribution functions.

The null hypothesis H_0\,\!: F(x)=G(x)

The alternative hypothesis H_1\,\!: F(x)\neq G(x), for which the associated p-value is a two-tailed probability;

or H_1\,\!: F(x)>G(x), for which the associated p-value is an upper-tailed probability;

or H_1\,\!: F(x)<G(x), for which the associated p-value is a lower-tailed probability.

For the first case of H_1\,\!, the statistic D_{n_1,n_2} \,\! represents the largest absolute deviation between the two empirical distribution functions.

For the second case of H_1\,\!, the statistic D_{n_1,n_2}^{+} \,\! represents the largest positive deviation between the empirical distribution function of the first sample and the empirical distribution function of the second sample, that is D_{n_1,n_2}^{+}=\max_x \{S_1(x)-S_2(x),0\}\,\! .

For the third case of H_1\,\!, the statistic D_{n_1,n_2}^{-} \,\! represents the largest positive deviation between the empirical distribution function of the second sample and the empirical distribution function of the first sample, that is D_{n_1,n_2}^{-}=\max_x \{S_2(x)-S_1(x),0\}\,\! .
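The three statistics can be illustrated with a minimal Python sketch. It is not the NAG implementation; the function name `ks_two_sample_statistics` is hypothetical, and the code assumes the empirical distribution functions are right-continuous step functions evaluated at every pooled observation.

```python
from bisect import bisect_right


def ks_two_sample_statistics(x, y):
    """Return (D, D_plus, D_minus) for two samples (hypothetical helper)."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    d_plus = 0.0   # largest value of S1 - S2, floored at 0
    d_minus = 0.0  # largest value of S2 - S1, floored at 0
    # The empirical distribution functions only change at observed values,
    # so evaluating them at every pooled data point is sufficient.
    for t in xs + ys:
        s1 = bisect_right(xs, t) / n1  # S1(t): fraction of x values <= t
        s2 = bisect_right(ys, t) / n2  # S2(t): fraction of y values <= t
        d_plus = max(d_plus, s1 - s2)
        d_minus = max(d_minus, s2 - s1)
    # The two-sided statistic is the larger of the two one-sided ones.
    return max(d_plus, d_minus), d_plus, d_minus
```

For completely separated samples such as `[1, 2, 3]` and `[4, 5, 6]`, the sketch returns `(1.0, 1.0, 0.0)`: the first sample's empirical distribution function reaches 1 before the second one leaves 0.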

KS-test2 also returns the standardized statistic Z=\sqrt{\frac{n_1 n_2}{n_1+n_2}}\,D\,\!,

where D\,\! may be D_{n_1,n_2}\,\!, D_{n_1,n_2}^{+} \,\!, or D_{n_1,n_2}^{-} \,\!, depending on the choice of the alternative hypothesis.
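As a small illustration of the scaling, the following Python sketch evaluates Z for a given D and pair of sample sizes. The function name `ks_standardized_statistic` is an assumption made for this example only.

```python
import math


def ks_standardized_statistic(d, n1, n2):
    """Scale a KS statistic d by sqrt(n1*n2/(n1+n2)) (hypothetical helper)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    return math.sqrt(n1 * n2 / (n1 + n2)) * d
```

With n_1 = n_2 = 8 the scaling factor is \sqrt{64/16} = 2, so D = 0.5 gives Z = 1.0.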

The distribution of the statistic Z\,\! converges asymptotically to a distribution given by Smirnov as n_1\,\! and n_2\,\! increase. The p-value is the probability, under the null hypothesis, of obtaining a value of the test statistic at least as extreme as that observed.
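For orientation, the classical limiting form of the two-sided tail probability is 2\sum_{k\geq 1}(-1)^{k-1}e^{-2k^2z^2}. The Python sketch below evaluates that series. It is an illustration of the asymptotic distribution only, under the assumption that this textbook form is the one meant here; it is not the exact method or the approximation used by the NAG routine, and the name `smirnov_two_sided_tail` is hypothetical.

```python
import math


def smirnov_two_sided_tail(z, max_terms=10000, tol=1e-16):
    """Asymptotic P(Z > z) for the two-sided statistic (hypothetical helper)."""
    if z <= 0.0:
        return 1.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * z * z)
        # Alternating series: odd k adds, even k subtracts.
        total += term if k % 2 == 1 else -term
        if term < tol:
            break
    # Clip to [0, 1] to guard against rounding at very small z.
    return min(1.0, max(0.0, 2.0 * total))
```

The series converges quickly for moderate z; for example, z = 1.36 gives a tail probability close to 0.05.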

If \max(n_1,n_2)\leq 2500\,\! and n_1 n_2\leq 10000\,\!, the exact method given by Kim and Jenrich (1973) is used. Otherwise, p\,\! is computed using the approximations suggested by Kim and Jenrich (1973).
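The selection rule can be written as a short Python sketch. The function name `ks_pvalue_method` and its return labels are hypothetical and only mirror the two conditions stated above.

```python
def ks_pvalue_method(n1, n2):
    """Return which p-value method the stated rule selects (hypothetical helper)."""
    if max(n1, n2) <= 2500 and n1 * n2 <= 10000:
        return "exact"
    return "approximate"
```

Under this rule, sample sizes of 100 and 100 fall on the exact side (product 10000), while 200 and 100 (product 20000) or 3000 and 2 (larger sample above 2500) fall on the approximate side.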

Note that the method used is exact only for continuous theoretical distributions.

This method computes the two-sided probability. The one-sided probabilities are estimated by halving the two-sided probability. This is a good estimate for small p\,\!, that is p\leq 0.10\,\!, but it becomes very poor for larger p\,\!.
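Why halving degrades for larger p\,\! can be seen by comparing it with the classical asymptotic one-sided tail e^{-2z^2}. The Python sketch below uses the textbook limiting formulas purely for illustration; they are assumptions of this example rather than the routine's own computation, and both function names are hypothetical.

```python
import math


def two_sided_tail(z, max_terms=10000, tol=1e-16):
    """Asymptotic two-sided tail 2*sum((-1)^(k-1) * exp(-2 k^2 z^2))."""
    if z <= 0.0:
        return 1.0
    total = 0.0
    for k in range(1, max_terms + 1):
        term = math.exp(-2.0 * k * k * z * z)
        total += term if k % 2 == 1 else -term
        if term < tol:
            break
    return min(1.0, max(0.0, 2.0 * total))


def one_sided_tail(z):
    """Asymptotic one-sided tail exp(-2 z^2)."""
    return math.exp(-2.0 * z * z) if z > 0.0 else 1.0


if __name__ == "__main__":
    for z in (0.5, 1.0, 1.5):
        halved = two_sided_tail(z) / 2.0
        print(f"z={z}: halved two-sided={halved:.4f}, one-sided={one_sided_tail(z):.4f}")
```

The halved value can never exceed 0.5, whereas a one-sided probability can approach 1, so the two agree closely in the tail (large z, small p\,\!) and drift apart as z decreases.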

For more details on the algorithm, please refer to nag_2_sample_ks_test (g08cdc).