Computes the Ball Divergence statistic, a generic dispersion measure in Banach spaces.
bd(
x,
y = NULL,
distance = FALSE,
size = NULL,
num.threads = 1,
kbd.type = c("sum", "maxsum", "max")
)
a numeric vector, matrix, data.frame, or a list containing at least two numeric vectors, matrices, or data.frames.
a numeric vector, matrix, data.frame.
if distance = TRUE, the elements of x will be considered as a distance matrix. Default: distance = FALSE.
a vector recording sample size of each group.
number of threads. If num.threads = 0, then all available cores will be used. Default: num.threads = 1.
a character string specifying the \(K\)-sample Ball Divergence test statistic;
must be one of "sum", "maxsum", or "max". Any unambiguous substring can be given.
Default: kbd.type = "sum".
bd
Ball Divergence statistic
Given samples that contain no missing values, bd returns the Ball Divergence statistic.
If we set distance = TRUE, the arguments x and y can be a dist object or a
symmetric numeric matrix recording the distances between samples;
otherwise, these arguments are treated as data.
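The calls below are a minimal sketch of the three input styles described above, not an excerpt from the package examples. The simulated data and group sizes are illustrative. The third call assumes that a distance object computed on the pooled sample is passed as x together with size, which is how the size argument is described above; treat that usage as an assumption. The calls only run if the Ball package is installed.

```r
set.seed(1)
x <- rnorm(50)
y <- rnorm(50, mean = 1)
z <- rnorm(50, mean = 2)

# Distances over the pooled two-sample data (x first, then y).
dxy <- dist(c(x, y))

if (requireNamespace("Ball", quietly = TRUE)) {
  # Two-sample statistic computed from raw data.
  print(Ball::bd(x, y))

  # K-sample statistic from a list of groups, using the "maxsum" aggregation.
  print(Ball::bd(list(x, y, z), kbd.type = "maxsum"))

  # Assumed usage: pooled distance object plus the size of each group.
  print(Ball::bd(dxy, size = c(50, 50), distance = TRUE))
}
```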
The Ball Divergence statistic measures the difference between the distributions of two datasets in Banach spaces. Ball Divergence is proven to be zero if and only if the two underlying distributions are identical.
The Ball Divergence statistic is defined as follows. Suppose we are given two independent samples \( \{X_{1}, \ldots, X_{n}\} \) with the associated probability measure \(\mu\) and \( \{Y_{1}, \ldots, Y_{m}\} \) with \(\nu\), where the observations in each sample are i.i.d. Let \(\delta(x,y,z)=I(z\in \bar{B}(x, \rho(x,y)))\), where \(\delta(x,y,z)\) indicates whether \(z\) is located in the closed ball \(\bar{B}(x, \rho(x,y))\) with center \(x\) and radius \(\rho(x, y)\). We denote:
$$ A_{ij}^{X}=\frac{1}{n}\sum_{u=1}^{n}{\delta(X_i,X_j,X_u)}, \quad A_{ij}^{Y}=\frac{1}{m}\sum_{v=1}^{m}{\delta(X_i,X_j,Y_v)}, $$
$$ C_{kl}^{X}=\frac{1}{n}\sum_{u=1}^{n}{\delta(Y_k,Y_l,X_u)}, \quad C_{kl}^{Y}=\frac{1}{m}\sum_{v=1}^{m}{\delta(Y_k,Y_l,Y_v)}. $$
\(A_{ij}^X\) is the proportion of the sample \( \{X_{1}, \ldots, X_{n}\} \) located in the ball \(\bar{B}(X_i,\rho(X_i,X_j))\), and \(A_{ij}^Y\) is the proportion of the sample \( \{Y_{1}, \ldots, Y_{m}\} \) located in the same ball. Likewise, \(C_{kl}^X\) and \(C_{kl}^Y\) are the corresponding proportions located in the ball \(\bar{B}(Y_k,\rho(Y_k,Y_l))\). We then define two statistics that average the squared differences between these proportions:
$$ A_{n,m}=\frac{1}{n^{2}}\sum_{i,j=1}^{n}{(A_{ij}^{X}-A_{ij}^{Y})^{2}}, \quad C_{n,m}=\frac{1}{m^{2}}\sum_{k,l=1}^{m}{(C_{kl}^{X}-C_{kl}^{Y})^{2}}. $$
The Ball Divergence statistic is defined as:
$$D_{n,m}=A_{n,m}+C_{n,m}$$
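The definition can be traced step by step with a small base-R sketch. It assumes Euclidean distance for \(\rho\) and takes \(A_{n,m}\) and \(C_{n,m}\) to be the averages of the squared differences \((A_{ij}^{X}-A_{ij}^{Y})^2\) and \((C_{kl}^{X}-C_{kl}^{Y})^2\), following the reference cited at the end of this page. The function name ball_divergence_sketch is hypothetical, and the loop is written for readability rather than speed. It is not the package implementation, so its values need not match bd exactly.

```r
# Two-sample Ball Divergence written directly from the definition.
# x, y: numeric vectors or matrices whose rows are observations.
ball_divergence_sketch <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  m <- nrow(y)

  # Pairwise Euclidean distances over the pooled sample (X rows first, then Y).
  d <- as.matrix(dist(rbind(x, y)))
  ix <- seq_len(n)
  iy <- n + seq_len(m)

  # Average squared gap between the two empirical ball proportions, taken over
  # all closed balls whose center and radius-defining point both lie in `own`.
  one_side <- function(own, other) {
    total <- 0
    for (i in own) {
      for (j in own) {
        r <- d[i, j]                       # radius rho(center, point j)
        p_own <- mean(d[i, own] <= r)      # proportion of own sample in the ball
        p_other <- mean(d[i, other] <= r)  # proportion of other sample in the ball
        total <- total + (p_own - p_other)^2
      }
    }
    total / length(own)^2
  }

  # A_{n,m} uses balls built from X; C_{n,m} uses balls built from Y.
  one_side(ix, iy) + one_side(iy, ix)
}

ball_divergence_sketch(c(0, 1), c(10, 11))  # well-separated samples
ball_divergence_sketch(c(1, 2, 3), c(1, 2, 3))  # identical samples
```

For the well-separated pair, every ball built from one sample contains no point of the other, so both averages come out at 0.625 and the sketch returns 1.25. For identical samples every pair of proportions coincides and the result is 0.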
Ball Divergence can be generalized to the \(K\)-sample test problem. Suppose we
have \(K\) groups of samples, where the \(k\)-th group contains \(n_{k}\) observations.
The \(K\)-sample Ball Divergence statistic can be defined in one of three ways:
by directly summing the two-sample Ball Divergence statistics of all sample pairs (kbd.type = "sum")
$$\sum_{1 \leq k < l \leq K}{D_{n_{k},n_{l}}},$$
by finding the one sample with the largest difference from the others (kbd.type = "maxsum")
$$\max_{t}{\sum_{s=1, s \neq t}^{K}{D_{n_{s}, n_{t}}}},$$
or by aggregating the \(K-1\) largest two-sample Ball Divergence statistics (kbd.type = "max")
$$\sum_{k=1}^{K-1}{D_{(k)}},$$
where \(D_{(1)}, \ldots, D_{(K-1)}\) are the largest \(K-1\) two-sample Ball Divergence statistics among
\(\{D_{n_s, n_t}| 1 \leq s < t \leq K\}\). When \(K=2\),
all three types reduce to the two-sample Ball Divergence statistic.
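The three aggregation rules can be compared on a small made-up example. The sketch below assumes the pairwise two-sample statistics have already been computed and stored in a symmetric \(K \times K\) matrix. The function name aggregate_kbd_sketch and the numbers are hypothetical and only illustrate the formulas.

```r
# pairwise_d: symmetric K x K matrix whose (s, t) entry holds the two-sample
# statistic for groups s and t; the diagonal is ignored.
aggregate_kbd_sketch <- function(pairwise_d,
                                 kbd.type = c("sum", "maxsum", "max")) {
  kbd.type <- match.arg(kbd.type)  # accepts unambiguous substrings
  K <- nrow(pairwise_d)
  diag(pairwise_d) <- 0
  upper <- pairwise_d[upper.tri(pairwise_d)]  # each pair counted once
  switch(kbd.type,
    sum = sum(upper),
    maxsum = max(rowSums(pairwise_d)),
    max = sum(sort(upper, decreasing = TRUE)[seq_len(K - 1)])
  )
}

# Made-up pairwise statistics for K = 3 groups.
d <- matrix(0, 3, 3)
d[1, 2] <- d[2, 1] <- 0.1
d[1, 3] <- d[3, 1] <- 0.4
d[2, 3] <- d[3, 2] <- 0.2

aggregate_kbd_sketch(d, "sum")     # 0.1 + 0.4 + 0.2
aggregate_kbd_sketch(d, "maxsum")  # group 3 differs most: 0.4 + 0.2
aggregate_kbd_sketch(d, "max")     # two largest pairs: 0.4 + 0.2
```

With these numbers "sum" gives 0.7, while "maxsum" and "max" both give 0.6. For larger \(K\) the last two generally differ, because "max" may combine pairs that do not share a common group.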
See bd.test for a test of distribution equality based on the Ball Divergence.
Wenliang Pan, Yuan Tian, Xueqin Wang, Heping Zhang. Ball Divergence: Nonparametric two sample test. Ann. Statist. 46 (2018), no. 3, 1109--1137. doi:10.1214/17-AOS1579. https://projecteuclid.org/euclid.aos/1525313077