TY - JOUR
T1 - Distribution-free detection of structured anomalies
T2 - permutation and rank-based scans
AU - Arias-Castro, Ery
AU - Castro, Rui M.
AU - Tánczos, Ervin
AU - Wang, Meng
PY - 2018/4/3
Y1 - 2018/4/3
AB - The scan statistic is by far the most popular method for anomaly detection, with applications in syndromic surveillance, signal and image processing, and target detection based on sensor networks, among others. The use of the scan statistic in such settings yields a hypothesis testing procedure, where the null hypothesis corresponds to the absence of anomalous behavior. If the null distribution is known, then calibration of a scan-based test is relatively easy, as it can be done by Monte Carlo simulation. When the null distribution is unknown, calibration is less straightforward. We investigate two procedures. The first is a calibration by permutation and the second is a rank-based scan test, which is distribution-free and less sensitive to outliers. Furthermore, the rank scan test requires only a one-time calibration for a given data size, making it computationally much more appealing. In both cases, we quantify the performance loss with respect to an oracle scan test that knows the null distribution. We show that using one of these calibration procedures results in only a very small loss of power in the context of a natural exponential family. This includes the classical normal location model, popular in signal processing, and the Poisson model, popular in syndromic surveillance. We perform numerical experiments on simulated data, further supporting our theory, and also on a real dataset from genomics. Supplementary materials for this article are available online.
KW - Permutation tests
KW - Rank tests
KW - Scan statistic
UR - http://www.scopus.com/inward/record.url?scp=85048120512&partnerID=8YFLogxK
U2 - 10.1080/01621459.2017.1286240
DO - 10.1080/01621459.2017.1286240
M3 - Article
AN - SCOPUS:85048120512
SN - 0162-1459
VL - 113
SP - 789
EP - 801
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
IS - 522
ER -