Algorithms from and for Nature and Life
Classification and Data Analysis
Berthold Lausen, Dirk Van den Poel, Alfred Ultsch (Editors)
Studies in Classification, Data Analysis, and Knowledge Organization
Managing Editors: H.-H. Bock, Aachen; W. Gaul, Karlsruhe; M. Vichi, Rome; C. Weihs, Dortmund
Editorial Board: D. Baier, Cottbus; F. Critchley, Milton Keynes; R. Decker, Bielefeld; E. Diday, Paris; M. Greenacre, Barcelona; C.N. Lauro, Naples; J. Meulman, Leiden; P. Monari, Bologna; S. Nishisato, Toronto; N. Ohsumi, Tokyo; O. Opitz, Augsburg; G. Ritter, Passau; M. Schader, Mannheim
Editors
Berthold Lausen, Department of Mathematical Sciences, University of Essex, Colchester, United Kingdom
Dirk Van den Poel, Department of Marketing, Ghent University, Ghent, Belgium
Alfred Ultsch, Databionics, FB 12, University of Marburg, Marburg, Germany
ISSN 1431-8814
ISBN 978-3-319-00035-0 (eBook)
DOI 10.1007/978-3-319-00035-0
Springer Cham Heidelberg New York Dordrecht London
Library of Congress Control Number: 2013945874
© Springer International Publishing Switzerland 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
Preface
Revised versions of selected papers presented at the Joint Conference of the German Classification Society (GfKl) – 35th Annual Conference – GfKl 2011 – , the German Association for Pattern Recognition (DAGM) – 33rd annual symposium – DAGM 2011 – and the Symposium of the International Federation of Classification Societies (IFCS) – IFCS 2011 – held at the University of Frankfurt (Frankfurt am Main, Germany) August 30 – September 2, 2011, are contained in this volume of “Studies in Classification, Data Analysis, and Knowledge Organization”.
One aim of the conference was to provide a platform for discussions on results concerning the interface that data analysis has in common with other areas such as, e.g., computer science, operations research, and statistics from a scientific perspective, as well as with various application areas when “best” interpretations of data that describe underlying problem situations need knowledge from different research directions.
Practitioners and researchers interested in data analysis in the broad sense had the opportunity to discuss recent developments and to establish cross-disciplinary cooperation in their fields of interest. More than 420 persons attended the conference, and more than 180 papers (including plenary and semiplenary lectures) were presented. The audience of the conference was very international.
Fifty-five of the papers presented at the conference are contained in this volume. As an unambiguous assignment of topics addressed in single papers is sometimes difficult, the contributions are grouped in a way that the editors found appropriate. Within (sub)chapters the presentations are listed in alphabetical order with respect to the authors’ names. At the end of this volume an index is included that, additionally, should help the interested reader.
The editors would like to thank the members of the scientific program committee:
D. Baier, H.-H. Bock, R. Decker, A. Ferligoj, W. Gaul, Ch. Hennig, I. Herzog, E. Hüllermeier, K. Jajuga, H. Kestler, A. Koch, S. Krolak-Schwerdt, H. Locarek-Junge, G. McLachlan, F.R. McMorris, G. Menexes, B. Mirkin, M. Mizuta, A. Montanari, R. Nugent, A. Okada, G. Ritter, M. de Rooij, I. van Mechelen, G. Venturini, J. Vermunt, M. Vichi and C. Weihs
and the additional reviewers:
A. Cerioli, M. Costa, N. Dean, P. Eilers, S.L. France, J. Gertheiss, A. Geyer-Schulz, W.J. Heiser, Ch. Hohensinn, H. Holzmann, Th. Horvath, H. Kiers, B. Lorenz, H. Lukashevich, V. Makarenkov, F. Meyer, I. Morlini, H.-J. Mucha, U. Müller-Funk, J.W. Owsinski, P. Rokita, A. Rutkowska-Ziarko, R. Samworth, I. Schmädecke and A. Sokolowski.
Last but not least, we would like to thank all participants of the conference for their interest and various activities which, again, made the 35th annual GfKl conference and this volume an interdisciplinary possibility for scientific discussion, in particular all authors and all colleagues who reviewed papers, chaired sessions or were otherwise involved. Additionally, we gratefully take the opportunity to acknowledge support by Deutsche Forschungsgemeinschaft (DFG) of the Symposium of the International Federation of Classification Societies (IFCS) – IFCS 2011.
As always we thank Springer Verlag, Heidelberg, especially Dr. Martina Bihn, for excellent cooperation in publishing this volume.
Colchester, UK: Berthold Lausen
Ghent, Belgium: Dirk Van den Poel
Marburg, Germany: Alfred Ultsch
Contributors
Ulas Akkucuk, Department of Management, Bogazici University, Istanbul, Turkey
Alexander Albert, Clinic of Cardiovascular Surgery, Heinrich-Heine University, 40225 Düsseldorf, Germany
Theodore Alexandrov, Center for Industrial Mathematics, University of Bremen, 28359 Bremen, Germany
Grigory Alexandrovich, Department of Mathematics and Computer Science, Marburg University, Marburg, Germany
Daniel Baier, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany
Hans-Georg Bartel, Department of Chemistry, Humboldt University, Brook-Taylor-Straße 2, 12489 Berlin, Germany
Nadja Bauer, Faculty of Statistics, Chair of Computational Statistics, TU Dortmund, Dortmund, Germany
Christoph Bernau, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Universität München (LMU), Munich, Germany
Wolfgang Bessler, Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany
Holger Blume, Institute of Microelectronic Systems, Appelstr. 4, 30167 Hannover, Germany
Alix Boc, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, QC H3C 3J7, Canada
Anne-Laure Boulesteix, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Universität München (LMU), Munich, Germany
Nina Büchel, European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany
Carlos Cuevas-Covarrubias, Anahuac University, Naucalpan, State of Mexico, Mexico
J. Douglas Carroll, Rutgers Business School, Newark and New Brunswick, Newark, NJ, USA
Andrea Cerioli, Dipartimento di Economia, Università di Parma, Parma, Italy
Magdalena Chudy, Centre for Digital Music, Queen Mary University of London, Mile End Road, London, E1 4NS, UK
Antonio D’Ambrosio, Department of Mathematics and Statistics, University of Naples Federico II, Via Cinthia, M.te S. Angelo, Naples, Italy
Ines Daniel, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany
José G. Dias, UNIDE, ISCTE – University Institute of Lisbon, Edifício ISCTE, Av. das Forças Armadas, 1649-026 Lisboa, Portugal
Simon Dixon, Centre for Digital Music, Queen Mary University of London, Mile End Road, London, E1 4NS, UK
Jens Dolata, Head Office for Cultural Heritage Rhineland-Palatinate (GDKE), Große Langgasse 29, 55116 Mainz, Germany
Florent Domenach, Department of Computer Science, University of Nicosia,
Plamen Dragiev, Département d’Informatique, Université du Québec à Montréal, C.P. 8888, succ. Centre-Ville, Montreal, QC H3C 3P8, Canada; Department of Human Genetics, McGill University, 1205 Dr. Penfield Ave., Montreal, QC H3A-1B1, Canada
Kai Eckert, KR & KM Research Group, University of Mannheim, Mannheim, Germany
Jorge Eduardo Ortiz, Facultad de Estadística, Universidad Santo Tomás, Bogotá, Colombia
Thomas Fober, Department of Mathematics and Computer Science, Philipps-Universität, 35032 Marburg, Germany
Stephen L. France, Lubar School of Business, University of Wisconsin-Milwaukee, P. O. Box 742, Milwaukee, WI, 53201-0742, USA
Sarah Frost, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany
Wolfgang Gaul, Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, Germany
Jan Gertheiss, Department of Statistics, LMU Munich, Akademiestr. 1, 80799 Munich, Germany
Andreas Geyer-Schulz, Information Services and Electronic Markets, IISM, Karlsruhe Institute of Technology, Kaiserstrasse 12, D-76128 Karlsruhe, Germany
Erhard Godehardt, Clinic of Cardiovascular Surgery, Heinrich-Heine University, 40225 Düsseldorf, Germany
Isobel Claire Gormley, University College Dublin, Dublin, Ireland
Bettina Grün, Department of Applied Statistics, Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Austria
Dereje W. Gudicha, Tilburg University, PO Box 50193, 5000 LE Tilburg, The Netherlands
Reinhold Hatzinger, Institute for Statistics and Mathematics, WU Vienna Uni-
Willem J. Heiser, Institute of Psychology, Leiden University, P.O. Box 9555, 2300 RB Leiden, The Netherlands
Irmela Herzog, LVR-Amt für Bodendenkmalpflege im Rheinland, Bonn
Kay F. Hildebrand, European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany
Stefanie Hillebrand, Faculty of Statistics, TU Dortmund, 44221 Dortmund, Germany
Paul Hofmarcher, Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria
Christine Hohensinn, Faculty of Psychology, Department of Psychological Assessment and Applied Psychometrics, University of Vienna, Vienna, Austria
Hajo Holzmann, Department of Mathematics and Computer Science, Marburg University, Marburg, Germany; Fachbereich Mathematik und Informatik, Philipps-Universität Marburg, Hans-
Kurt Hornik, Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria
Eyke Hüllermeier, Department of Mathematics and Computer Science, Philipps-Universität, 35032 Marburg, Germany
Eugene Kaciak, Brock University, St. Catharines, ON, Canada
Rebecca Klages, Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, Germany
Gerhard Klebe, Department of Mathematics and Computer Science, Philipps-Universität, 35032 Marburg, Germany
Jan Hendrik Kobarg, Center for Industrial Mathematics, University of Bremen, 28359 Bremen, Germany
Daniel Krausche, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany
Klaus D. Kubinger, Faculty of Psychology, Department of Psychological Assessment and Applied Psychometrics, University of Vienna, Vienna, Austria
Katarzyna Kuziak, Department of Financial Investments and Risk Management, Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland
Pierre Legendre, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, QC H3C 3J7, Canada
Caterina Liberati, Economics Department, University of Milano-Bicocca, P.zza Ateneo Nuovo n.1, 20126 Milan, Italy
Artur Lichtenberg, Clinic of Cardiovascular Surgery, Heinrich-Heine University, 40225 Düsseldorf, Germany
Loureiro Sandra Maria Correia, Marketing, Operations and General Management Department, ISCTE-IUL Business School, Av. Forças Armadas, 1649-026 Lisbon, Portugal
Marco Maier, Institute for Statistics and Mathematics, WU Vienna University of Economics and Business, Augasse 2-6, 1090 Vienna, Austria
Patrick Mair, Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria
Vladimir Makarenkov, Département d’Informatique, Université du Québec à Montréal, C.P. 8888, succursale Centre Ville, Montreal, QC H3C 3P8, Canada
Paolo Mariani, Statistics Department, University of Milano-Bicocca, via Bicocca degli Arcimboldi, n.8, 20126 Milan, Italy
Verena Mattern, Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany
Damien McParland, University College Dublin, Dublin, Ireland
Miguel Angel Mendez-Mendez, Universidad Anahuac, Mexico City, Mexico
Hans-Joachim Mucha, Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Germany
Ulrich Müller-Funk, European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany
Thomas Brendan Murphy, School of Mathematical Sciences and Complex and Adaptive Systems Laboratory, University College Dublin, Dublin 4, Ireland
Robert Nadon, Department of Human Genetics, McGill University, 1205 Dr. Penfield Ave., Montreal, QC H3A-1B1, Canada
Akinori Okada, Graduate School of Management and Information Sciences, Tama University, Tokyo, Japan
Michael Ovelgönne, Information Services and Electronic Markets, IISM, Karlsruhe Institute of Technology, Kaiserstrasse 12, D-76128 Karlsruhe, Germany
Francesco Palumbo, Università degli Studi di Napoli Federico II, Naples, Italy
Campo Elías Pardo, Departamento de Estadística, Universidad Nacional de Colombia, Bogotá, Colombia
Krzysztof Piontek, Department of Financial Investments and Risk Management, Wroclaw University of Economics, ul. Komandorska 118/120, 53-345 Wroclaw, Poland
Surajit Ray, Department of Mathematics and Statistics, Boston University, Boston, USA
Manuel Reif, Faculty of Psychology, Department of Psychological Assessment and Applied Psychometrics, University of Vienna, Vienna, Austria
Marco Riani, Dipartimento di Economia, Università di Parma, Parma, Italy
Adrian Richter, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Universität München (LMU), Munich, Germany
Günther Rötter, Institute for Music and Music Science, TU Dortmund, Dortmund, Germany
Günter Rudolph, Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany
Thomas Rusch, Institute for Statistics and Mathematics, WU Vienna University of Economics and Business, Augasse 2-6, 1090 Vienna, Austria
Anna Rutkowska-Ziarko, Faculty of Economic Sciences, University of Warmia and Mazury, Oczapowskiego 4, 10-719 Olsztyn, Poland
Adam Sagan, Cracow University of Economics, Kraków, Poland
Michael Salter-Townshend, School of Mathematical Sciences and Complex and Adaptive Systems Laboratory, University College Dublin, Dublin 4, Ireland
Julia Schiffner, Faculty of Statistics, Chair of Computational Statistics, TU Dortmund, 44221 Dortmund, Germany
Diana Schindler, Department of Business Administration and Economics, Biele-
Ingo Schmädecke, Institute of Microelectronic Systems, Appelstr. 4, 30167 Hannover, Germany
Ingo Schmitt, Institute of Computer Science, Information and Media Technology, BTU Cottbus, Postbox 101344, D-03013 Cottbus, Germany
Alexandra Schwarz, German Institute for International Educational Research,
Frank Siegmund, Heinrich-Heine-Universität Düsseldorf, Düsseldorf
Martin Stein, Information Services and Electronic Markets, IISM, Karlsruhe
Veronika Stelz, Department of Statistics, LMU Munich, Akademiestr. 1, 80799 Munich, Germany
Dominik Stork, KR & KM Research Group, University of Mannheim, Mannheim, Germany
Heiner Stuckenschmidt, KR & KM Research Group, University of Mannheim, Mannheim, Germany
Mireille Gettler Summa, CEREMADE, CNRS, Université Paris Dauphine, Paris, France
Ali Tayari, Department of Computer Science, University of Nicosia, Flat 204, Democratias 16, 2370 Nicosia, Cyprus
Myriam Thömmes, Humboldt University of Berlin, Spandauer Str. 1, 10099 Berlin, Germany
Francesca Torti, Dipartimento di Economia, Università di Parma, Parma, Italy; Dipartimento di Statistica, Università di Milano Bicocca, Milan, Italy
Cristina Tortora, Università degli Studi di Napoli Federico II, Naples, Italy; CEREMADE, CNRS, Université Paris Dauphine, Paris, France
Matthias Trendtel, Chair for Methods in Empirical Educational Research, TUM School of Education, Technische Universität München, München, Germany
Gerhard Tutz, Department of Statistics, LMU Munich, Akademiestr. 1, 80799 Munich, Germany
Ali Ünlü, Chair for Methods in Empirical Educational Research, TUM School of Education, Technische Universität München, München, Germany
Igor Vatolkin, Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany
Jeroen K. Vermunt, Tilburg University, PO Box 50193, 5000 LE Tilburg, The Netherlands
Carmen Villar-Patiño, Universidad Anahuac, Mexico City, Mexico
Sergio B. Villas-Boas, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Dominique Vincent, Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, Germany
Adilson Elias Xavier, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Vinicius Layter Xavier, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Sascha Voekler, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany
Claus Weihs, Faculty of Statistics, Chair of Computational Statistics, TU Dortmund, 44221 Dortmund, Germany
Peter Winker, Department of Statistics and Econometrics, Justus-Liebig-University Giessen, Licher Str. 74, 35394 Giessen, Germany
Christoph Winkler, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany
Dominik Wolff, Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany
Satoru Yokoyama, Faculty of Economics, Department of Business Administration, Teikyo University, Utsunomiya, Japan
Part I Invited
Size and Power of Multivariate Outlier Detection Rules
Andrea Cerioli, Marco Riani, and Francesca Torti

Abstract Multivariate outliers are usually identified by means of robust distances. A statistically principled method for accurate outlier detection requires both availability of a good approximation to the finite-sample distribution of the robust distances and correction for the multiplicity implied by repeated testing of all the observations for outlyingness. These principles are not always met by the currently available methods. The goal of this paper is thus to provide data analysts with useful information about the practical behaviour of some popular competing techniques. Our conclusion is that the additional information provided by a data-driven level of trimming is an important bonus which ensures an often considerable gain in power.
1 Introduction
Obtaining reliable information on the quality of the available data is often the first of the challenges facing the statistician. It is thus not surprising that the systematic study of methods for detecting outliers and immunizing against their effect has a long history in the statistical literature. It has been remarked (p. 271) that “Robustness of statistical methods in the sense of insensitivity to grossly wrong measurements is probably as old as the experimental approach to science”. Perhaps less known is the fact that similar concerns were also present in Ancient Greece more than 2,400 years ago, as reported by Thucydides in his History of the Peloponnesian War (III 20): “The Plataeans, who were still besieged by the Peloponnesians and Boeotians, . . . made ladders equal in length to the height of the enemy’s wall, which they calculated by the help of the layers of bricks on the side facing the town . . . A great many counted at once, and, although some might make mistakes, the calculation would be oftener right than wrong; for they repeated the process again and again . . . In this manner . . .”

A. Cerioli, M. Riani, F. Torti: Dipartimento di Economia, Università di Parma, Parma, Italy
With multivariate data, outliers are usually identified by means of robust distances. A statistically principled rule for accurate multivariate outlier detection requires:
(a) An accurate approximation to the finite-sample distribution of the robust distances under the postulated model for the “good” part of the data;
(b) Correction for the multiplicity implied by repeated testing of all the observations for outlyingness.
These principles are not always met by the currently available methods. The goal of this paper is to provide data analysts with useful information about the practical behaviour of popular competing techniques. We focus on methods based on alternative high-breakdown estimators of multivariate location and scatter, and compare them to the results from a rule adopting a more flexible level of trimming, for different data dimensions. The present paper thus extends earlier work in which only low dimensional data are considered. Our conclusion is that the additional information provided by a data-driven approach to trimming is an important bonus often ensuring a considerable gain in power. This gain may be even larger when the number of variables increases.
2 Distances for Multivariate Outlier Detection
2.1 Mahalanobis Distances and the Wilks’ Rule
Let $y_1, \ldots, y_n$ be a sample of $v$-dimensional observations from a population with mean vector $\mu$ and covariance matrix $\Sigma$. The basic population model for which most of the results described in this paper were obtained is that

$$y_i \sim N(\mu, \Sigma), \qquad i = 1, \ldots, n. \tag{1}$$

The sample mean is denoted by $\hat{\mu}$ and $\hat{\Sigma}$ is the unbiased sample estimate of $\Sigma$. The Mahalanobis distance of observation $y_i$ is

$$d_i^2 = (y_i - \hat{\mu})' \hat{\Sigma}^{-1} (y_i - \hat{\mu}). \tag{2}$$

For simplicity, we omit the fact that $d_i^2$ is squared and we call it a distance.

Wilks showed in a seminal paper that, under the multivariate normal model (1), the Mahalanobis distances follow a scaled Beta distribution:

$$d_i^2 \sim \frac{(n-1)^2}{n}\, \mathrm{Beta}\!\left(\frac{v}{2}, \frac{n-v-1}{2}\right), \qquad i = 1, \ldots, n. \tag{3}$$

Wilks also conjectured that a Bonferroni bound could be used to test outlyingness of the most remote observation without losing too much power. Therefore, for a nominal test size $\alpha$, Wilks’ rule for multivariate outlier identification takes the largest Mahalanobis distance among $d_1^2, \ldots, d_n^2$, and compares it to the $1 - \alpha/n$ quantile of the scaled Beta distribution (3). This gives an outlier test of nominal test size $\alpha$.

Wilks’ rule, adhering to the basic statistical principles (a) and (b) stated in the Introduction, provides an accurate and powerful test for detecting a single outlier even in small and moderate samples, as many simulation studies later confirmed. However, it can break down very easily in presence of more than one outlier, due to the effect of masking. Masking occurs when a group of extreme outliers modifies $\hat{\mu}$ and $\hat{\Sigma}$ in such a way that the corresponding distances become negligible.
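A minimal Python sketch of Wilks’ rule follows; it is an illustration, not code from the paper. It computes the squared Mahalanobis distances (2) and compares the largest one with the $1 - \alpha/n$ quantile of the scaled Beta distribution (3). The function names `mahalanobis_sq` and `wilks_rule`, the simulated sample and the planted outlier are assumptions made for the example, and NumPy and SciPy are assumed to be installed.

```python
import numpy as np
from scipy.stats import beta


def mahalanobis_sq(Y):
    """Squared Mahalanobis distances (2) from the sample mean and unbiased covariance."""
    Y = np.asarray(Y, dtype=float)
    mu_hat = Y.mean(axis=0)
    sigma_hat = np.atleast_2d(np.cov(Y, rowvar=False))  # unbiased: divides by n - 1
    resid = Y - mu_hat
    # Solve Sigma^{-1} r_i for every row instead of forming the inverse explicitly.
    return np.einsum("ij,ij->i", resid, np.linalg.solve(sigma_hat, resid.T).T)


def wilks_rule(Y, alpha=0.01):
    """Return (index of the most remote observation, its distance, cutoff, flag)."""
    n, v = np.asarray(Y).shape
    d2 = mahalanobis_sq(Y)
    scale = (n - 1) ** 2 / n
    # Bonferroni bound: 1 - alpha/n quantile of the scaled Beta distribution (3).
    cutoff = scale * beta.ppf(1 - alpha / n, v / 2, (n - v - 1) / 2)
    i_max = int(np.argmax(d2))
    return i_max, float(d2[i_max]), float(cutoff), bool(d2[i_max] > cutoff)


# Hypothetical data: 50 standard normal observations in 3 dimensions,
# with observation 7 shifted by 8 in every coordinate.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 3))
Y[7] += 8.0
i_max, d2_max, cutoff, is_outlier = wilks_rule(Y)
print(i_max, round(d2_max, 2), round(cutoff, 2), is_outlier)
```

Two exact properties make useful sanity checks: the distances (2) always sum to $(n-1)v$, and none of them can exceed $(n-1)^2/n$, the upper end of the support of (3).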
2.2 Robust Distances
One effective way to avoid masking is to replace $\hat{\mu}$ and $\hat{\Sigma}$ in (2) with high-breakdown estimators. A robust distance is then defined as

$$\tilde{d}_i^2 = (y_i - \tilde{\mu})' \tilde{\Sigma}^{-1} (y_i - \tilde{\mu}), \tag{4}$$

where $\tilde{\mu}$ and $\tilde{\Sigma}$ denote the chosen robust estimators of location and scatter. We can expect multivariate outliers to be highlighted by large values of $\tilde{d}_i^2$, even if masked in the corresponding Mahalanobis distances (2), because now $\tilde{\mu}$ and $\tilde{\Sigma}$ are not affected by the outliers.

One popular choice of $\tilde{\mu}$ and $\tilde{\Sigma}$ is related to the Minimum Covariance Determinant (MCD) criterion. In the first stage, we fix a coverage $\lfloor n/2 \rfloor \le h < n$ and we define the MCD subset to be the sub-sample of $h$ observations whose covariance matrix has the smallest determinant. The MCD estimator of $\mu$, say $\tilde{\mu}_{(\mathrm{MCD})}$, is the average of the MCD subset, whereas the MCD estimator of $\Sigma$, say $\tilde{\Sigma}_{(\mathrm{MCD})}$, is proportional to the dispersion matrix of this subset. A second stage is then added with the aim of increasing efficiency, while preserving the high-breakdown properties of $\tilde{\mu}_{(\mathrm{MCD})}$ and $\tilde{\Sigma}_{(\mathrm{MCD})}$. Therefore, a one-step reweighting scheme is applied by giving weight $w_i = 0$ to observations whose first-stage robust distance exceeds a threshold value. Otherwise the weight is $w_i = 1$. We consider the Reweighted MCD (RMCD) estimator of $\mu$ and $\Sigma$, which is defined as

$$\tilde{\mu}_{\mathrm{RMCD}} = \frac{\sum_{i=1}^{n} w_i y_i}{w}, \qquad \tilde{\Sigma}_{\mathrm{RMCD}} = \kappa\, \frac{\sum_{i=1}^{n} w_i (y_i - \tilde{\mu}_{\mathrm{RMCD}})(y_i - \tilde{\mu}_{\mathrm{RMCD}})'}{w - 1},$$

where $w = \sum_{i=1}^{n} w_i$ and the scaling $\kappa$, depending on the values of $w$, $n$ and $v$, serves the purpose of ensuring consistency at the normal model. The resulting robust distances for multivariate outlier detection are then

$$\tilde{d}_{i(\mathrm{RMCD})}^2 = (y_i - \tilde{\mu}_{\mathrm{RMCD}})' \tilde{\Sigma}_{\mathrm{RMCD}}^{-1} (y_i - \tilde{\mu}_{\mathrm{RMCD}}), \qquad i = 1, \ldots, n. \tag{5}$$
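The one-step reweighting stage can be sketched in Python as follows. This is a hedged illustration under several assumptions rather than the RMCD itself: the first-stage estimates are supplied by the caller (the combinatorial search for the MCD subset is not shown), the consistency factor is left at `kappa=1.0`, and the demonstration uses the coordinatewise median and a MAD-based diagonal scatter matrix as a simple stand-in for the first-stage MCD estimates. The function names and the threshold are likewise illustrative choices.

```python
import math

import numpy as np


def robust_dist_sq(Y, mu, sigma):
    """Squared distances of the rows of Y from (mu, sigma), as in (4)."""
    resid = Y - mu
    return np.einsum("ij,ij->i", resid, np.linalg.solve(sigma, resid.T).T)


def reweight_step(Y, mu_first, sigma_first, threshold, kappa=1.0):
    """One-step reweighting: w_i = 0 if the first-stage distance exceeds threshold."""
    Y = np.asarray(Y, dtype=float)
    w_i = (robust_dist_sq(Y, mu_first, sigma_first) <= threshold).astype(float)
    w = w_i.sum()
    mu_r = (w_i[:, None] * Y).sum(axis=0) / w
    resid = Y - mu_r
    # kappa = 1.0 is a placeholder; the consistency factor is not derived here.
    sigma_r = kappa * (w_i[:, None] * resid).T @ resid / (w - 1)
    return mu_r, sigma_r, w_i


# Hypothetical data: 45 bivariate observations, the first five shifted by 10.
rng = np.random.default_rng(1)
Y = rng.standard_normal((45, 2))
Y[:5] += 10.0

# Stand-in first stage (an assumption, not the MCD): median and scaled MAD.
med = np.median(Y, axis=0)
mad = 1.4826 * np.median(np.abs(Y - med), axis=0)

# Illustrative threshold: 0.999 quantile of chi-squared with 2 degrees of
# freedom, which has the closed form -2 * log(1 - p).
threshold = -2.0 * math.log(0.001)

mu_r, sigma_r, w_i = reweight_step(Y, med, np.diag(mad ** 2), threshold)
d2_rmcd = robust_dist_sq(Y, mu_r, sigma_r)  # distances in the spirit of (5)
print(int(w_i.sum()), mu_r.round(2))
```

Because the planted observations receive weight 0, they do not enter the reweighted estimates, so their final distances remain large instead of being masked.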
Multivariate S estimators are another common option for $\tilde{\mu}$ and $\tilde{\Sigma}$. For $\tilde{\mu} \in \mathbb{R}^v$ and $\tilde{\Sigma}$ a positive definite symmetric $v \times v$ matrix, they are defined to be the solution of the minimization problem $|\tilde{\Sigma}| = \min$ under the constraint

$$\frac{1}{n} \sum_{i=1}^{n} \rho(\tilde{d}_i^2) = \zeta, \tag{6}$$

where $\tilde{d}_i^2$ is given in (4), $\rho(x)$ is a smooth function satisfying suitable regularity and robustness properties, and $\zeta = E\{\rho(z'z)\}$ for a $v$-dimensional vector $z \sim N(0, I)$. The function $\rho$ in (6) rules the weight given to each observation to achieve robustness. Different specifications of $\rho(x)$ lead to numerically and statistically different S estimators. In this paper we deal with two such specifications. The first one is the popular Tukey’s Biweight function

$$\rho(x) = \begin{cases} \dfrac{x^2}{2} - \dfrac{x^4}{2c^2} + \dfrac{x^6}{6c^4} & \text{if } |x| \le c \\[2ex] \dfrac{c^2}{6} & \text{if } |x| > c, \end{cases} \tag{7}$$

where $c > 0$ is a tuning constant which controls the breakdown point of S estimators. The second alternative that we consider is the slightly more complex Rocke’s Biflat function, which gives non-null weights to distance values close to the median, but null weights outside a user-defined interval. Specifically, let

$$\delta = \min\left(1, \frac{\chi^2_{v,(1-\gamma)}}{v} - 1\right), \tag{8}$$

where $\chi^2_{v,(1-\gamma)}$ is the $1 - \gamma$ quantile of $\chi^2_v$. Then, the weight under Rocke’s Biflat function is 0 whenever a normalized version of the robust distance $\tilde{d}_i^2$ is outside the interval $[1 - \delta, 1 + \delta]$. This definition ensures better performance of S estimators when $v$ is large. Indeed, it can be proved (p. 221) that the weights assigned by Tukey’s Biweight function (7) become almost constant as $v \to \infty$. Therefore, robustness of multivariate S estimators is lost in many practical situations where $v$ is large. Examples of this behaviour will be seen later, even for $v$ as small as 10.
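To make Tukey’s Biweight function (7) concrete, the short Python sketch below evaluates $\rho(x)$ together with the weight $\rho'(x)/x = \{1 - (x/c)^2\}^2$ for $|x| \le c$ (and 0 otherwise), which follows from differentiating (7). The function names and the value $c = 4$ are illustrative assumptions; in particular, $c$ is not calibrated here to any specific breakdown point.

```python
def tukey_rho(x, c):
    """Tukey's Biweight rho function (7)."""
    if abs(x) > c:
        return c * c / 6.0
    return x**2 / 2.0 - x**4 / (2.0 * c**2) + x**6 / (6.0 * c**4)


def tukey_weight(x, c):
    """Weight rho'(x) / x implied by (7): (1 - (x/c)^2)^2 inside [-c, c], else 0."""
    if abs(x) > c:
        return 0.0
    return (1.0 - (x / c) ** 2) ** 2


c = 4.0  # hypothetical tuning constant, chosen only for illustration
for x in (0.0, 1.0, 2.0, 4.0, 6.0):
    print(x, round(tukey_rho(x, c), 4), round(tukey_weight(x, c), 4))
```

The function is continuous at $|x| = c$, where both branches equal $c^2/6$, and the weight decreases smoothly from 1 at the origin to 0 at $|x| = c$, so observations beyond $c$ have no influence.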
Given the robust, but potentially inefficient, S estimators of $\mu$ and $\Sigma$, an improvement in efficiency is sometimes advocated by computing refined location and shape estimators which satisfy a more efficient version of (6). These estimators, called MM estimators, are defined as the minimizers of

$$\frac{1}{n} \sum_{i=1}^{n} \rho^{*}(\tilde{\tilde{d}}_i^{\,2}), \tag{9}$$

where

$$\tilde{\tilde{d}}_i^{\,2} = (y_i - \tilde{\tilde{\mu}})' \tilde{\tilde{\Sigma}}^{-1} (y_i - \tilde{\tilde{\mu}}) \tag{10}$$

and the function $\rho^{*}(x)$ provides higher efficiency than $\rho(x)$ at the null model. Minimization of (9) is performed over all $\tilde{\tilde{\mu}} \in \mathbb{R}^v$ and all $\tilde{\tilde{\Sigma}}$ belonging to the set of positive definite symmetric $v \times v$ matrices with $|\tilde{\tilde{\Sigma}}| = 1$. The MM estimator of $\mu$ is then $\tilde{\tilde{\mu}}$, while the estimator of $\Sigma$ is a rescaled version of $\tilde{\tilde{\Sigma}}$. Practical implementation of MM estimators is available using Tukey’s Biweight function only. Therefore, we follow the same convention in the performance comparison described later in the paper.
2.3 The Forward Search
The idea behind the Forward Search (FS) is to apply a flexible and data-driven trimming strategy to combine protection against outliers and high efficiency of estimators. For this purpose, the FS divides the data into a good portion that agrees with the postulated model and a set of outliers, if any. The method starts from a small, robustly chosen, subset of the data and then fits subsets of increasing size, in such a way that outliers and other observations not following the general structure are revealed by diagnostic monitoring. Let $m_0$ be the size of the starting subset. Usually $m_0 = v + 1$ or slightly larger. Let $S^{(m)}$ be the subset of data fitted by the FS at step $m$ ($m = m_0, \ldots, n$), yielding estimates $\hat{\mu}(m)$, $\hat{\Sigma}(m)$ and distances

$$\hat{d}_i^2(m) = \{y_i - \hat{\mu}(m)\}' \hat{\Sigma}(m)^{-1} \{y_i - \hat{\mu}(m)\}, \qquad i = 1, \ldots, n.$$

These distances are ordered to obtain the fitting subset at step $m + 1$. Whilst $S^{(m)}$ remains outlier free, they will not suffer from masking.

The main diagnostic quantity computed by the FS at step $m$ is

$$\hat{d}_{i_{\min}}^2(m), \qquad i_{\min} = \arg\min_{i \notin S^{(m)}} \hat{d}_i^2(m), \tag{11}$$

i.e. the distance of the closest observation to $S^{(m)}$, among those not belonging to this subset. The rationale is that the robust distance of the observation entering the fitting subset at step $m + 1$ will be large if this observation is an outlier. Its peculiarity will then be revealed by a peak in the forward plot of $\hat{d}_{i_{\min}}^2(m)$.

All the FS routines, as well as the algorithms for computing most of the commonly adopted estimators for regression and multivariate analysis, are contained in the FSDA toolbox for MATLAB and are freely downloadable, for example from the web site of the Joint Research Centre of the European Commission. This toolbox also contains a series of dynamic tools which enable the user to link the information present in the different plots produced by the FS, such as the index or forward plot of robust Mahalanobis distances $\hat{d}_i^2(m)$ and the scatter plot matrix.
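The mechanics of the search can be sketched in a few lines of Python. This is a simplified illustration and not the FSDA implementation: the starting subset is taken as the $m_0$ observations closest to the coordinatewise median (a simple stand-in for a robustly chosen start), no envelopes are computed, and the function name `forward_search`, the simulated data and $m_0 = 15$ are assumptions made for the example.

```python
import numpy as np


def forward_search(Y, initial_subset):
    """Return the steps m and the minimum squared distance outside S(m), as in (11)."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    subset = np.array(sorted(initial_subset))
    steps, d2_min = [], []
    for m in range(len(subset), n):
        # Fit location and scatter on the current subset S(m) of size m.
        mu_m = Y[subset].mean(axis=0)
        sigma_m = np.atleast_2d(np.cov(Y[subset], rowvar=False))
        resid = Y - mu_m
        d2 = np.einsum("ij,ij->i", resid, np.linalg.solve(sigma_m, resid.T).T)
        outside = np.setdiff1d(np.arange(n), subset)
        steps.append(m)
        d2_min.append(d2[outside].min())
        # The m + 1 observations with the smallest distances form the next subset.
        subset = np.argsort(d2)[: m + 1]
    return np.array(steps), np.array(d2_min)


# Hypothetical data: 55 clean bivariate observations plus 5 shifted by 12.
rng = np.random.default_rng(3)
n_clean, n_out, v = 55, 5, 2
Y = rng.standard_normal((n_clean + n_out, v))
Y[n_clean:] += 12.0

# Stand-in starting subset (an assumption): points closest to the median.
m0 = 15
med = np.median(Y, axis=0)
start = np.argsort(((Y - med) ** 2).sum(axis=1))[:m0]

steps, d2_min = forward_search(Y, start)
peak_step = int(steps[np.argmax(d2_min)])
print(peak_step)
```

In this example the sequence of minimum distances stays low while the subset contains only clean observations and peaks at the step where the first planted outlier is the closest observation outside the subset, which is the behaviour the forward plot is meant to reveal.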
3 Comparison of Alternative Outlier Detection Rules
Precise outlier identification requires cut-off values for the robust distances when model (1) is true. When $\tilde{\mu} = \tilde{\mu}_{\mathrm{RMCD}}$ and $\tilde{\Sigma} = \tilde{\Sigma}_{\mathrm{RMCD}}$, it has been shown that the usually trusted asymptotic approximation based on the $\chi^2_v$ distribution can be largely unsatisfactory. Instead, a much more accurate approximation has been proposed, based on the distributional rules

$$\tilde{d}_{i(\mathrm{RMCD})}^2 \sim \frac{(w-1)^2}{w}\, \mathrm{Beta}\!\left(\frac{v}{2}, \frac{w-v-1}{2}\right) \qquad \text{if } w_i = 1 \tag{12}$$

$$\tilde{d}_{i(\mathrm{RMCD})}^2 \sim \frac{w+1}{w}\, \frac{(w-1)v}{w-v}\, F_{v,w-v} \qquad \text{if } w_i = 0, \tag{13}$$

where $w_i$ and $w$ are defined as for the RMCD estimator above. The same distributional results can also be applied to deal with multiplicity of tests to increase power and to provide control of alternative error rates in the outlier detection process.
In the context of the Forward Search, a formal outlier test has been proposed, based on the sequence $\hat{d}_{i_{\min}}^2(m)$, $m = m_0, \ldots, n - 1$, obtained from (11). In this test, the values of $\hat{d}_{i_{\min}}^2(m)$ are compared to the FS envelope

$$V^2_{m,\alpha} / \sigma_T^2(m),$$

where $V^2_{m,\alpha}$ is the $100\alpha\%$ cut-off point of the $(m + 1)$th order statistic from the scaled $F$ distribution

$$\frac{(m^2 - 1)v}{m(m - v)}\, F_{v,m-v}, \tag{14}$$

and the factor

$$\sigma_T^2(m) = \frac{P(X^2_{v+2} < \chi^2_{v,m/n})}{m/n} \tag{15}$$
2
2
2