Outlier detection through mixtures with an improper component


Abstract


This paper investigates the use of a finite mixture model with an additional uniform density for outlier detection and robust estimation. Its main contributions are an analysis of the properties of the improper component and a modified EM algorithm which, beyond providing the maximum likelihood estimates of the mixture parameters, endogenously provides a numerical value for the density of the uniform distribution used for the improper component. The mixing proportion of outliers may be known or unknown. Applications to robust estimation and outlier detection are discussed, with particular attention to the normal mixture case.

DOI Code: 10.1285/i20705948v13n1p146

Keywords: Gaussian mixture, outlier detection, robust estimation, improper EM algorithm, improper component.
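
To make the idea of an improper component concrete, the following is a minimal sketch, not the paper's modified EM algorithm. It assumes a single univariate normal component plus a constant "density" c that absorbs outliers, and it takes c as a fixed, user-chosen value, whereas the paper's algorithm determines that value endogenously. The function name, the starting values and the stopping rule are all illustrative assumptions.

```python
import numpy as np


def em_normal_plus_improper(x, c, n_iter=500, tol=1e-10):
    """EM for one univariate normal plus a constant-density outlier component.

    Assumption: c (> 0) is a fixed constant chosen by the user; it is not
    estimated here. Returns (mu, sigma, pi_out, resp_out), where resp_out[i]
    is the posterior weight of the improper component for x[i].
    """
    x = np.asarray(x, dtype=float)
    # Robust starting values: median and scaled median absolute deviation.
    mu = np.median(x)
    sigma = max(1.4826 * np.median(np.abs(x - mu)), 1e-8)
    pi_out = 0.1  # illustrative starting proportion of outliers
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior weight of the improper component.
        dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        total = (1.0 - pi_out) * dens + pi_out * c
        resp_out = pi_out * c / total
        # Pseudo log-likelihood at the current parameters (c is not a proper density).
        ll = np.sum(np.log(total))
        # M-step: weighted normal estimates and updated mixing proportion.
        w = 1.0 - resp_out
        mu = np.sum(w * x) / np.sum(w)
        sigma = max(np.sqrt(np.sum(w * (x - mu) ** 2) / np.sum(w)), 1e-8)
        pi_out = resp_out.mean()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu, sigma, pi_out, resp_out
```

Under these assumptions, observations with a large `resp_out` (say, above 0.5) would be flagged as outliers, while `mu` and `sigma` act as robust estimates because flagged points receive little weight in the M-step. If the outlier proportion is known, the update of `pi_out` can simply be skipped.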




Creative Commons License
This work is licensed under a Creative Commons Attribution - NonCommercial - NoDerivatives 3.0 Italy License.