Algorithms from and for Nature and Life

  Classification and Data Analysis

  Studies in Classification, Data Analysis, and Knowledge Organization

  Berthold Lausen, Dirk Van den Poel, Alfred Ultsch (Editors)

Studies in Classification, Data Analysis, and Knowledge Organization

  Managing Editors: H.-H. Bock, Aachen; W. Gaul, Karlsruhe; M. Vichi, Rome; C. Weihs, Dortmund

  Editorial Board: D. Baier, Cottbus; F. Critchley, Milton Keynes; R. Decker, Bielefeld; E. Diday, Paris; M. Greenacre, Barcelona; C.N. Lauro, Naples; J. Meulman, Leiden; P. Monari, Bologna; S. Nishisato, Toronto; N. Ohsumi, Tokyo; O. Opitz, Augsburg; G. Ritter, Passau; M. Schader, Mannheim

  For further volumes:


  Editors

  Berthold Lausen, Department of Mathematical Sciences, University of Essex, Colchester, United Kingdom

  Dirk Van den Poel, Department of Marketing, Ghent University, Ghent, Belgium

  Alfred Ultsch, Databionics, FB 12, University of Marburg, Marburg, Germany

  ISSN 1431-8814

  ISBN 978-3-319-00035-0 (eBook)

  DOI 10.1007/978-3-319-00035-0

  Springer Cham Heidelberg New York Dordrecht London

  Library of Congress Control Number: 2013945874

  © Springer International Publishing Switzerland 2013

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of

the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,

broadcasting, reproduction on microfilms or in any other physical way, and transmission or information

storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology

now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection

with reviews or scholarly analysis or material supplied specifically for the purpose of being entered

and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of

this publication or parts thereof is permitted only under the provisions of the Copyright Law of the

Publisher’s location, in its current version, and permission for use must always be obtained from Springer.

Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations

are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication

does not imply, even in the absence of a specific statement, that such names are exempt from the relevant

protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of

publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for

any errors or omissions that may be made. The publisher makes no warranty, express or implied, with

respect to the material contained herein.

  Printed on acid-free paper

  Preface

  Revised versions of selected papers presented at the Joint Conference of the German Classification Society (GfKl) – 35th Annual Conference – GfKl 2011 – , the German Association for Pattern Recognition (DAGM) – 33rd annual symposium – DAGM 2011 – and the Symposium of the International Federation of Classification Societies (IFCS) – IFCS 2011 – held at the University of Frankfurt (Frankfurt am Main, Germany) August 30 – September 2, 2011, are contained in this volume of “Studies in Classification, Data Analysis, and Knowledge Organization”.

  One aim of the conference was to provide a platform for discussions on results concerning the interface that data analysis has in common with other areas such as, e.g., computer science, operations research, and statistics from a scientific perspective, as well as with various application areas when “best” interpretations of data that describe underlying problem situations need knowledge from different research directions.

  Practitioners and researchers – interested in data analysis in the broad sense – had the opportunity to discuss recent developments and to establish cross-disciplinary cooperation in their fields of interest. More than 420 persons attended the conference, and more than 180 papers (including plenary and semiplenary lectures) were presented. The audience of the conference was very international.

  Fifty-five of the papers presented at the conference are contained in this volume. As an unambiguous assignment of topics addressed in single papers is sometimes difficult, the contributions are grouped in a way that the editors found appropriate. Within (sub)chapters the presentations are listed in alphabetical order with respect to the authors’ names. At the end of this volume an index is included that, additionally, should help the interested reader.

  The editors would like to thank the members of the scientific program committee:

  D. Baier, H.-H. Bock, R. Decker, A. Ferligoj, W. Gaul, Ch. Hennig, I. Herzog, E. Hüllermeier, K. Jajuga, H. Kestler, A. Koch, S. Krolak-Schwerdt, H. Locarek-Junge, G. McLachlan, F.R. McMorris, G. Menexes, B. Mirkin, M. Mizuta, A. Montanari, R. Nugent, A. Okada, G. Ritter, M. de Rooij, I. van Mechelen, G. Venturini, J. Vermunt, M. Vichi and C. Weihs

  and the additional reviewers:

  A. Cerioli, M. Costa, N. Dean, P. Eilers, S.L. France, J. Gertheiss, A. Geyer-Schulz, W.J. Heiser, Ch. Hohensinn, H. Holzmann, Th. Horvath, H. Kiers, B. Lorenz, H. Lukashevich, V. Makarenkov, F. Meyer, I. Morlini, H.-J. Mucha, U. Müller-Funk, J.W. Owsinski, P. Rokita, A. Rutkowska-Ziarko, R. Samworth, I. Schmädecke and A. Sokolowski.

  Last but not least, we would like to thank all participants of the conference for their interest and various activities which, again, made the 35th annual GfKl conference and this volume an interdisciplinary forum for scientific discussion, in particular all authors and all colleagues who reviewed papers, chaired sessions or were otherwise involved. Additionally, we gratefully take the opportunity to acknowledge support by Deutsche Forschungsgemeinschaft (DFG) of the Symposium of the International Federation of Classification Societies (IFCS) – IFCS 2011.

  As always we thank Springer Verlag, Heidelberg, especially Dr. Martina Bihn, for excellent cooperation in publishing this volume.

  Colchester, UK: Berthold Lausen

  Ghent, Belgium: Dirk Van den Poel

  Marburg, Germany: Alfred Ultsch


  Contributors

Ulas Akkucuk, Department of Management, Bogazici University, Istanbul, Turkey

Alexander Albert, Clinic of Cardiovascular Surgery, Heinrich-Heine University, 40225 Düsseldorf, Germany

Theodore Alexandrov, Center for Industrial Mathematics, University of Bremen, 28359 Bremen, Germany

Grigory Alexandrovich, Department of Mathematics and Computer Science, Marburg University, Marburg, Germany

Daniel Baier, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany

Hans-Georg Bartel, Department of Chemistry, Humboldt University, Brook-Taylor-Straße 2, 12489 Berlin, Germany

Nadja Bauer, Faculty of Statistics, Chair of Computational Statistics, TU Dortmund, Dortmund, Germany

Christoph Bernau, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Universität München (LMU), Munich, Germany

Wolfgang Bessler, Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany

Holger Blume, Institute of Microelectronic Systems, Appelstr. 4, 30167 Hannover, Germany

Alix Boc, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, QC H3C 3J7 Canada

  

Anne-Laure Boulesteix, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Universität München (LMU), Munich, Germany

Nina Büchel, European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany

Carlos Cuevas-Covarrubias, Anahuac University, Naucalpan, State of Mexico, Mexico

J. Douglas Carroll, Rutgers Business School, Newark and New Brunswick, Newark, NJ, USA

Andrea Cerioli, Dipartimento di Economia, Università di Parma, Parma, Italy

Magdalena Chudy, Centre for Digital Music, Queen Mary University of London, Mile End Road, London, E1 4NS UK

Antonio D’Ambrosio, Department of Mathematics and Statistics, University of Naples Federico II, Via Cinthia, M.te S. Angelo, Naples, Italy

Ines Daniel, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany

José G. Dias, UNIDE, ISCTE – University Institute of Lisbon, Lisbon, Portugal; Edifício ISCTE, Av. das Forças Armadas, 1649-026 Lisboa, Portugal

Simon Dixon, Centre for Digital Music, Queen Mary University of London, Mile End Road, London, E1 4NS UK

Jens Dolata, Head Office for Cultural Heritage Rhineland-Palatinate (GDKE), Große Langgasse 29, 55116 Mainz, Germany

Florent Domenach, Department of Computer Science, University of Nicosia,

Plamen Dragiev, Département d’Informatique, Université du Québec à Montréal, c.p. 8888, succ. Centre-Ville, Montreal, QC H3C 3P8 Canada; Department of Human Genetics, McGill University, 1205 Dr. Penfield Ave., Montreal, QC H3A-1B1 Canada

Kai Eckert, KR & KM Research Group, University of Mannheim, Mannheim, Germany

Jorge Eduardo Ortiz, Facultad de Estadística Universidad Santo Tomás, Bogotá, Colombia

  

Thomas Fober, Department of Mathematics and Computer Science, Philipps-Universität, 35032 Marburg, Germany

Stephen L. France, Lubar School of Business, University of Wisconsin-Milwaukee, P. O. Box 742, Milwaukee, WI, 53201-0742 USA

Sarah Frost, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, 03013 Cottbus, Germany

Wolfgang Gaul, Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, Germany

Jan Gertheiss, Department of Statistics, LMU Munich, Akademiestr. 1, 80799 Munich, Germany

Andreas Geyer-Schulz, Information Services and Electronic Markets, IISM, Karlsruhe Institute of Technology, Kaiserstrasse 12, D-76128 Karlsruhe, Germany

Erhard Godehardt, Clinic of Cardiovascular Surgery, Heinrich-Heine University, 40225 Düsseldorf, Germany

Isobel Claire Gormley, University College Dublin, Dublin, Ireland

Bettina Grün, Department of Applied Statistics, Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Austria

Dereje W. Gudicha, Tilburg University, PO Box 50193, 5000 LE Tilburg, The Netherlands

Reinhold Hatzinger, Institute for Statistics and Mathematics, WU Vienna Uni-

Willem J. Heiser, Institute of Psychology, Leiden University, P.O. Box 9555, 2300 RB Leiden, The Netherlands

Irmela Herzog, LVR-Amt für Bodendenkmalpflege im Rheinland, Bonn

Kay F. Hildebrand, European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany

Stefanie Hillebrand, Faculty of Statistics, TU Dortmund, 44221 Dortmund, Germany

Paul Hofmarcher, Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria

  

Christine Hohensinn, Faculty of Psychology, Department of Psychological Assessment and Applied Psychometrics, University of Vienna, Vienna, Austria

Hajo Holzmann, Department of Mathematics and Computer Science, Marburg University, Marburg, Germany; Fachbereich Mathematik und Informatik, Philipps-Universität Marburg, Hans-

Kurt Hornik, Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria

Eyke Hüllermeier, Department of Mathematics and Computer Science, Philipps-Universität, 35032 Marburg, Germany

Eugene Kaciak, Brock University, St. Catharines, ON, Canada

Rebecca Klages, Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, Germany

Gerhard Klebe, Department of Mathematics and Computer Science, Philipps-Universität, 35032 Marburg, Germany

Jan Hendrik Kobarg, Center for Industrial Mathematics, University of Bremen, 28359 Bremen, Germany

Daniel Krausche, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany

Klaus D. Kubinger, Faculty of Psychology, Department of Psychological Assessment and Applied Psychometrics, University of Vienna, Vienna, Austria

Katarzyna Kuziak, Department of Financial Investments and Risk Management, Wroclaw University of Economics, ul. Komandorska 118/120, Wroclaw, Poland

Pierre Legendre, Université de Montréal, C.P. 6128, succursale Centre-ville, Montréal, QC H3C 3J7 Canada

Caterina Liberati, Economics Department, University of Milano-Bicocca, P.zza Ateneo Nuovo n.1, 20126 Milan, Italy

Artur Lichtenberg, Clinic of Cardiovascular Surgery, Heinrich-Heine University, 40225 Düsseldorf, Germany

  

Loureiro Sandra Maria Correia, Marketing, Operations and General Management Department, ISCTE-IUL Business School, Av. Forças Armadas, 1649-026 Lisbon, Portugal

Marco Maier, Institute for Statistics and Mathematics, WU Vienna University of Economics and Business, Augasse 2-6, 1090 Vienna, Austria

Patrick Mair, Institute for Statistics and Mathematics, WU (Vienna University of Economics and Business), Augasse 2-6, 1090 Wien, Austria

Vladimir Makarenkov, Département d’Informatique, Université du Québec à Montréal, C.P.8888, succursale Centre Ville, Montreal, QC H3C 3P8 Canada

Paolo Mariani, Statistics Department, University of Milano-Bicocca, via Bicocca degli Arcimboldi, n.8, 20126 Milan, Italy

Verena Mattern, Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany

Damien McParland, University College Dublin, Dublin, Ireland

Miguel Angel Mendez-Mendez, Universidad Anahuac, Mexico City, Mexico

Hans-Joachim Mucha, Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Germany

Ulrich Müller-Funk, European Research Center for Information Systems (ERCIS), University of Münster, Münster, Germany

Thomas Brendan Murphy, School of Mathematical Sciences and Complex and Adaptive Systems Laboratory, University College Dublin, Dublin 4, Ireland

Robert Nadon, Department of Human Genetics, McGill University, 1205 Dr. Penfield Ave., Montreal, QC H3A-1B1 Canada

Akinori Okada, Graduate School of Management and Information Sciences, Tama University, Tokyo, Japan

Michael Ovelgönne, Information Services and Electronic Markets, IISM, Karlsruhe Institute of Technology, Kaiserstrasse 12, D-76128 Karlsruhe, Germany

Francesco Palumbo, Università degli Studi di Napoli Federico II, Naples, Italy

Campo Elías Pardo, Departamento de Estadística, Universidad Nacional de Colombia, Bogotá, Colombia

  

Krzysztof Piontek, Department of Financial Investments and Risk Management, Wroclaw University of Economics, ul. Komandorska 118/120, 53-345 Wroclaw, Poland

Surajit Ray, Department of Mathematics and Statistics, Boston University, Boston, USA

Manuel Reif, Faculty of Psychology, Department of Psychological Assessment and Applied Psychometrics, University of Vienna, Vienna, Austria

Marco Riani, Dipartimento di Economia, Università di Parma, Parma, Italy

Adrian Richter, Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Universität München (LMU), Munich, Germany

Günther Rötter, Institute for Music and Music Science, TU Dortmund, Dortmund, Germany

Günter Rudolph, Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany

Thomas Rusch, Institute for Statistics and Mathematics, WU Vienna University of Economics and Business, Augasse 2-6, 1090 Vienna, Austria

Anna Rutkowska-Ziarko, Faculty of Economic Sciences, University of Warmia and Mazury, Oczapowskiego 4, 10-719 Olsztyn, Poland

Adam Sagan, Cracow University of Economics, Kraków, Poland

Michael Salter-Townshend, School of Mathematical Sciences and Complex and Adaptive Systems Laboratory, University College Dublin, Dublin 4, Ireland

Julia Schiffner, Faculty of Statistics, Chair of Computational Statistics, TU Dortmund, 44221 Dortmund, Germany

Diana Schindler, Department of Business Administration and Economics, Biele-

Ingo Schmädecke, Institute of Microelectronic Systems, Appelstr. 4, 30167 Hannover, Germany

Ingo Schmitt, Institute of Computer Science, Information and Media Technology, BTU Cottbus, Postbox 101344, D-03013 Cottbus, Germany

Alexandra Schwarz, German Institute for International Educational Research,

Frank Siegmund, Heinrich-Heine-Universität Düsseldorf, Düsseldorf

Martin Stein, Information Services and Electronic Markets, IISM, Karlsruhe

Veronika Stelz, Department of Statistics, LMU Munich, Akademiestr. 1, 80799 Munich, Germany

Dominik Stork, KR & KM Research Group, University of Mannheim, Mannheim, Germany

Heiner Stuckenschmidt, KR & KM Research Group, University of Mannheim, Mannheim, Germany

Mireille Gettler Summa, CEREMADE, CNRS, Université Paris Dauphine, Paris, France

Ali Tayari, Department of Computer Science, University of Nicosia, Flat 204, Democratias 16, 2370 Nicosia, Cyprus

Myriam Thömmes, Humboldt University of Berlin, Spandauer Str. 1, 10099 Berlin, Germany

Francesca Torti, Dipartimento di Economia, Università di Parma, Parma, Italy; Dipartimento di Statistica, Università di Milano Bicocca, Milan, Italy

Cristina Tortora, Università degli Studi di Napoli Federico II, Naples, Italy; CEREMADE, CNRS, Université Paris Dauphine, Paris, France

Matthias Trendtel, Chair for Methods in Empirical Educational Research, TUM School of Education, Technische Universität München, München, Germany

Gerhard Tutz, Department of Statistics, LMU Munich, Akademiestr. 1, 80799 Munich, Germany

Ali Ünlü, Chair for Methods in Empirical Educational Research, TUM School of Education, Technische Universität München, München, Germany

Igor Vatolkin, Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany

Jeroen K. Vermunt, Tilburg University, PO Box 50193, 5000 LE Tilburg, The Netherlands

Carmen Villar-Patiño, Universidad Anahuac, Mexico City, Mexico

  

Sergio B. Villas-Boas, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

Dominique Vincent, Institute of Decision Theory and Management Science, Karlsruhe Institute of Technology (KIT), Kaiserstr. 12, 76128 Karlsruhe, Germany

Adilson Elias Xavier, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

Vinicius Layter Xavier, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

Sascha Voekler, Institute of Business Administration and Economics, Brandenburg University of Technology Cottbus, Postbox 101344, D-03013 Cottbus, Germany

Claus Weihs, Faculty of Statistics, Chair of Computational Statistics, TU Dortmund, 44221 Dortmund, Germany

Peter Winker, Department of Statistics and Econometrics, Justus-Liebig-University Giessen, Licher Str. 74, 35394 Giessen, Germany

Christoph Winkler, Karlsruhe Institute of Technology (KIT), 76128 Karlsruhe, Germany

Dominik Wolff, Center for Finance and Banking, University of Giessen, Licher Strasse 74, 35394 Giessen, Germany

Satoru Yokoyama, Faculty of Economics, Department of Business Administration, Teikyo University, Utsunomiya, Japan

  Part I Invited

  Size and Power of Multivariate Outlier Detection Rules

  Andrea Cerioli, Marco Riani, and Francesca Torti

Abstract Multivariate outliers are usually identified by means of robust distances. A statistically principled method for accurate outlier detection requires both availability of a good approximation to the finite-sample distribution of the robust distances and correction for the multiplicity implied by repeated testing of all the observations for outlyingness. These principles are not always met by the currently available methods. The goal of this paper is thus to provide data analysts with useful information about the practical behaviour of some popular competing techniques. Our conclusion is that the additional information provided by a data-driven level of trimming is an important bonus which ensures an often considerable gain in power.

1 Introduction

  Obtaining reliable information on the quality of the available data is often the first of the challenges facing the statistician. It is thus not surprising that the systematic study of methods for detecting outliers and immunizing against their effect has a long history in the statistical literature. It has been observed (p. 271) that “Robustness of statistical methods in the sense of insensitivity to grossly wrong measurements is probably as old as the experimental approach to science”. Perhaps less known is the fact that similar concerns were also present in Ancient Greece more than 2,400 years ago, as reported by Thucydides in his History of The Peloponnesian War (III 20): “The Plataeans, who were still besieged by the Peloponnesians and Boeotians, . . . made ladders equal in length to the height of the enemy’s wall, which they calculated by the help of the layers of bricks on the side facing the town . . . A great many counted at once, and, although some might make mistakes, the calculation would be oftener right than wrong; for they repeated the process again and again . . . In this manner

  A. Cerioli, M. Riani: Dipartimento di Economia, Università di Parma, Parma, Italy. F. Torti: Dipartimento di Economia, Università di Parma, Parma, Italy.

  With multivariate data, outliers are usually identified by means of robust distances. A statistically principled rule for accurate multivariate outlier detection requires:

  (a) An accurate approximation to the finite-sample distribution of the robust distances under the postulated model for the “good” part of the data;

  (b) Correction for the multiplicity implied by repeated testing of all the observations for outlyingness.

  These principles are not always met by the currently available methods. The goal of this paper is to provide data analysts with useful information about the practical behaviour of popular competing techniques. We focus on methods based on alternative high-breakdown estimators of multivariate location and scatter, and compare them to the results from a rule adopting a more flexible level of trimming, for different data dimensions. The present study thus extends earlier work in which only low-dimensional data are considered. Our conclusion is that the additional information provided by a data-driven approach to trimming is an important bonus often ensuring a considerable gain in power. This gain may be even larger when the number of variables increases.

2 Distances for Multivariate Outlier Detection

2.1 Mahalanobis Distances and the Wilks’ Rule

  Let $y_1, \ldots, y_n$ be a sample of $v$-dimensional observations from a population with mean vector $\mu$ and covariance matrix $\Sigma$. The basic population model for which most of the results described in this paper were obtained is that

  $$ y_i \sim N(\mu, \Sigma), \qquad i = 1, \ldots, n. \tag{1} $$

  The sample mean is denoted by $\hat{\mu}$ and $\hat{\Sigma}$ is the unbiased sample estimate of $\Sigma$. The Mahalanobis distance of observation $y_i$ is

  $$ d_i^2 = (y_i - \hat{\mu})' \hat{\Sigma}^{-1} (y_i - \hat{\mu}). \tag{2} $$

  For simplicity, we omit the fact that $d_i^2$ is squared and we call it a distance. Wilks showed in a seminal paper that, under the multivariate normal model (1), the Mahalanobis distances follow a scaled Beta distribution:

  $$ d_i^2 \sim \frac{(n-1)^2}{n}\, \mathrm{Beta}\!\left(\frac{v}{2}, \frac{n-v-1}{2}\right), \qquad i = 1, \ldots, n. \tag{3} $$

  Wilks also conjectured that a Bonferroni bound could be used to test outlyingness of the most remote observation without losing too much power. Therefore, for a nominal test size $\alpha$, Wilks’ rule for multivariate outlier identification takes the largest Mahalanobis distance among $d_1^2, \ldots, d_n^2$, and compares it to the $1 - \alpha/n$ quantile of the scaled Beta distribution (3). This gives an outlier test of nominal test size $\le \alpha$.

  Wilks’ rule, adhering to the basic statistical principles (a) and (b) of Sect. 1, provides an accurate and powerful test for detecting a single outlier even in small and moderate samples, as many simulation studies later confirmed. However, it can break down very easily in the presence of more than one outlier, due to the effect of masking. Masking occurs when a group of extreme outliers modifies $\hat{\mu}$ and $\hat{\Sigma}$ in such a way that the corresponding distances become negligible.
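  Wilks’ rule can be sketched in a few lines for the bivariate case. The snippet below is only an illustration under stated assumptions: it is restricted to $v = 2$, where the Beta distribution in (3) reduces to $\mathrm{Beta}(1, b)$ with $b = (n-3)/2$ and has the closed-form quantile $1 - (1-p)^{1/b}$, so no statistical library is needed. The function names and the toy data are hypothetical and do not come from the paper.

```python
def mahalanobis_sq_2d(data):
    """Squared Mahalanobis distances (2) for bivariate data, using the
    sample mean and the unbiased sample covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Quadratic form with the explicit inverse of the 2 x 2 covariance matrix.
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in data
    ]


def wilks_cutoff_2d(n, alpha):
    """1 - alpha/n quantile of the scaled Beta distribution (3) for v = 2.
    Beta(1, b) has CDF 1 - (1 - x)**b, hence a closed-form quantile."""
    b = (n - 3) / 2
    beta_quantile = 1 - (alpha / n) ** (1 / b)
    return (n - 1) ** 2 / n * beta_quantile


def wilks_rule_2d(data, alpha=0.01):
    """Return (index, distance, flag) for the most remote observation."""
    d2 = mahalanobis_sq_2d(data)
    i = max(range(len(d2)), key=lambda k: d2[k])
    return i, d2[i], d2[i] > wilks_cutoff_2d(len(data), alpha)


# Hypothetical toy data: ten regular points, then the same points plus one
# gross outlier appended at index 10.
clean = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
         (1, 1), (-1, -1), (1, -1), (-1, 1), (0.5, 0.5)]
contaminated = clean + [(10, 10)]
print(wilks_rule_2d(clean))
print(wilks_rule_2d(contaminated))
```

  With a single gross outlier the rule flags the appended point, whereas the clean sample is not flagged. The sketch deliberately says nothing about masking: with several outliers the non-robust estimates in (2) are themselves contaminated, which is the motivation for the robust distances of the next section.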

2.2 Robust Distances

  One effective way to avoid masking is to replace $\hat{\mu}$ and $\hat{\Sigma}$ in (2) with high-breakdown estimators. A robust distance is then defined as

  $$ \tilde{d}_i^2 = (y_i - \tilde{\mu})' \tilde{\Sigma}^{-1} (y_i - \tilde{\mu}), \tag{4} $$

  where $\tilde{\mu}$ and $\tilde{\Sigma}$ denote the chosen robust estimators of location and scatter. We can expect multivariate outliers to be highlighted by large values of $\tilde{d}_i^2$, even if masked in the corresponding Mahalanobis distances (2), because now $\tilde{\mu}$ and $\tilde{\Sigma}$ are not affected by the outliers.

  One popular choice of $\tilde{\mu}$ and $\tilde{\Sigma}$ is related to the Minimum Covariance Determinant (MCD) criterion. In the first stage, we fix a coverage $\lfloor n/2 \rfloor \le h < n$ and we define the MCD subset to be the sub-sample of $h$ observations whose covariance matrix has the smallest determinant.

  The MCD estimator of $\mu$, say $\tilde{\mu}_{(\mathrm{MCD})}$, is the average of the MCD subset, whereas the MCD estimator of $\Sigma$, say $\tilde{\Sigma}_{(\mathrm{MCD})}$, is proportional to the covariance matrix of this subset. A second stage is then added with the aim of increasing efficiency, while preserving the high-breakdown properties of $\tilde{\mu}_{(\mathrm{MCD})}$ and $\tilde{\Sigma}_{(\mathrm{MCD})}$. Therefore, a one-step reweighting scheme is applied by giving weight $w_i = 0$ to observations whose first-stage robust distance exceeds a threshold value. Otherwise the weight is $w_i = 1$. We consider the Reweighted MCD (RMCD) estimator of $\mu$ and $\Sigma$, which is defined as

  $$ \tilde{\mu}_{\mathrm{RMCD}} = \frac{\sum_{i=1}^{n} w_i y_i}{w}, \qquad \tilde{\Sigma}_{\mathrm{RMCD}} = \kappa\, \frac{\sum_{i=1}^{n} w_i (y_i - \tilde{\mu}_{\mathrm{RMCD}})(y_i - \tilde{\mu}_{\mathrm{RMCD}})'}{w - 1}, $$

  where $w = \sum_{i=1}^{n} w_i$ and the scaling $\kappa$, depending on the values of m, n and v, serves the purpose of ensuring consistency at the normal model. The resulting robust distances for multivariate outlier detection are then

  $$ \tilde{d}^2_{i(\mathrm{RMCD})} = (y_i - \tilde{\mu}_{\mathrm{RMCD}})' \tilde{\Sigma}_{\mathrm{RMCD}}^{-1} (y_i - \tilde{\mu}_{\mathrm{RMCD}}), \qquad i = 1, \ldots, n. \tag{5} $$
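  The second-stage reweighting step can be illustrated with a minimal sketch. It assumes that the 0/1 weights from the first stage are already available; computing the MCD subset itself is not attempted here. The function name is hypothetical, and the `kappa` argument is only a placeholder for the consistency factor, set to 1 by default because its actual value is not reproduced in this section.

```python
def reweighted_estimates(data, weights, kappa=1.0):
    """Reweighted location and scatter from 0/1 weights.

    data    : list of v-dimensional observations (tuples or lists)
    weights : list of 0/1 weights w_i from the first stage
    kappa   : placeholder consistency factor (hypothetical default of 1)
    """
    w = sum(weights)
    v = len(data[0])
    # Weighted mean: observations with weight 0 do not contribute.
    mean = [sum(wi * y[j] for wi, y in zip(weights, data)) / w
            for j in range(v)]
    # Weighted scatter matrix with divisor w - 1, rescaled by kappa.
    scatter = [
        [kappa * sum(wi * (y[j] - mean[j]) * (y[k] - mean[k])
                     for wi, y in zip(weights, data)) / (w - 1)
         for k in range(v)]
        for j in range(v)
    ]
    return mean, scatter


# Hypothetical toy data: the last observation received weight 0.
data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
weights = [1, 1, 1, 1, 0]
mean, scatter = reweighted_estimates(data, weights)
print(mean, scatter)
```

  Because the trimmed observation has weight 0, it affects neither the location nor the scatter estimate, which is why its distance in (5) is not masked.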

  Multivariate S estimators are another common option for $\tilde{\mu}$ and $\tilde{\Sigma}$. For $\tilde{\mu} \in \mathbb{R}^v$ and $\tilde{\Sigma}$ a positive definite symmetric $v \times v$ matrix, they are defined to be the solution of the minimization problem $|\tilde{\Sigma}| = \min$ under the constraint

  $$ \frac{1}{n} \sum_{i=1}^{n} \rho(\tilde{d}_i^2) = \zeta, \tag{6} $$

  where $\tilde{d}_i^2$ is given in (4), $\rho(x)$ is a smooth function satisfying suitable regularity and robustness properties, and $\zeta = E\{\rho(z'z)\}$ for a $v$-dimensional vector $z \sim N(0, I)$. The function $\rho$ in (6) rules the weight given to each observation to achieve robustness. Different specifications of $\rho(x)$ lead to numerically and statistically different S estimators. In this paper we deal with two such specifications. The first one is the popular Tukey’s Biweight function

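  $$ \rho(x) = \begin{cases} \dfrac{x^2}{2} - \dfrac{x^4}{2c^2} + \dfrac{x^6}{6c^4} & \text{if } |x| \le c \\[4pt] \dfrac{c^2}{6} & \text{if } |x| > c, \end{cases} \tag{7} $$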

  where $c > 0$ is a tuning constant which controls the breakdown point of S estimators. The second alternative that we consider is the slightly more complex Rocke’s Biflat function, which assigns positive weights to distance values close to the median, but null weights outside a user-defined interval. Specifically, let

  $$ \gamma = \min\left( \frac{\chi^2_{v,(1-\alpha)}}{v} - 1,\; 1 \right), \tag{8} $$

  where $\chi^2_{v,(1-\alpha)}$ is the $1 - \alpha$ quantile of $\chi^2_v$. Then, the weight under Rocke’s Biflat function is 0 whenever a normalized version of the robust distance $\tilde{d}_i^2$ is outside the interval $[1 - \gamma, 1 + \gamma]$. This definition ensures better performance of S estimators when $v$ is large. Indeed, it can be proved (p. 221) that the weights assigned by Tukey’s Biweight function (7) become almost constant as $v \to \infty$. Therefore, robustness of multivariate S estimators is lost in many practical situations where $v$ is large. Examples of this behaviour will be seen in Sect. 3 even for $v$ as small as 10.
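  As a small numerical illustration of Tukey’s Biweight function (7), the sketch below evaluates $\rho(x)$ directly from its piecewise definition. The function name and the value $c = 2$ are arbitrary choices made for the example and are not tuning constants taken from the paper.

```python
def tukey_biweight_rho(x, c):
    """Tukey's Biweight rho function, following the piecewise form in (7)."""
    if abs(x) <= c:
        return x ** 2 / 2 - x ** 4 / (2 * c ** 2) + x ** 6 / (6 * c ** 4)
    # Beyond c the function is constant, so extreme values have bounded influence.
    return c ** 2 / 6


c = 2.0  # arbitrary tuning constant, for illustration only
grid = [k / 10 for k in range(0, 51)]
values = [tukey_biweight_rho(x, c) for x in grid]
print(values[0], tukey_biweight_rho(c, c), values[-1])
```

  The two branches agree at $|x| = c$, where both equal $c^2/6$, so the function is continuous, non-decreasing in $|x|$ and bounded. This boundedness is what limits the contribution of very large distances to the constraint (6).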

  Given the robust, but potentially inefficient, S estimators of $\mu$ and $\Sigma$, an improvement in efficiency is sometimes advocated by computing refined location and shape estimators which satisfy a more efficient version of (6). These estimators, called MM estimators, are defined as the minimizers of

  $$ \frac{1}{n} \sum_{i=1}^{n} \rho^{*}(\tilde{\tilde{d}}_i^2), \tag{9} $$

  where

  $$ \tilde{\tilde{d}}_i^2 = (y_i - \tilde{\tilde{\mu}})' \tilde{\tilde{\Sigma}}^{-1} (y_i - \tilde{\tilde{\mu}}) \tag{10} $$

  and the function $\rho^{*}(x)$ provides higher efficiency than $\rho(x)$ at the null model (1). Minimization of (9) is performed over all $\tilde{\tilde{\mu}} \in \mathbb{R}^v$ and all $\tilde{\tilde{\Sigma}}$ belonging to the set of positive definite symmetric $v \times v$ matrices with $|\tilde{\tilde{\Sigma}}| = 1$. The MM estimator of $\mu$ is then $\tilde{\tilde{\mu}}$, while the estimator of $\Sigma$ is a rescaled version of $\tilde{\tilde{\Sigma}}$. Practical implementation of MM estimators is available using Tukey’s Biweight function only. Therefore, we follow the same convention in the performance comparison to be described in Sect. 3.

2.3 The Forward Search

  The idea behind the Forward Search (FS) is to apply a flexible and data-driven trimming strategy to combine protection against outliers and high efficiency of estimators. For this purpose, the FS divides the data into a good portion that agrees with the postulated model and a set of outliers, if any. The method starts from a small, robustly chosen, subset of the data and then fits subsets of increasing size, in such a way that outliers and other observations not following the general structure are revealed by diagnostic monitoring. Let $m_0$ be the size of the starting subset. Usually $m_0 = v + 1$ or slightly larger. Let $S^{(m)}$ be the subset of data fitted by the FS at step $m$ ($m = m_0, \ldots, n$), yielding estimates $\hat{\mu}(m)$, $\hat{\Sigma}(m)$ and distances

  $$ \hat{d}_i^2(m) = \{y_i - \hat{\mu}(m)\}' \hat{\Sigma}(m)^{-1} \{y_i - \hat{\mu}(m)\}, \qquad i = 1, \ldots, n. $$

  These distances are ordered to obtain the fitting subset at step $m + 1$. Whilst $S^{(m)}$ remains outlier free, they will not suffer from masking.

  The main diagnostic quantity computed by the FS at step $m$ is

  $$ \hat{d}^2_{i_{\min}}(m), \qquad i_{\min} = \arg\min \hat{d}_i^2(m) \ \text{ for } i \notin S^{(m)}, \tag{11} $$

  i.e. the distance of the closest observation to $S^{(m)}$, among those not belonging to this subset. The rationale is that the robust distance of the observation entering the fitting subset at step $m + 1$ will be large if this observation is an outlier. Its peculiarity will then be revealed by a peak in the forward plot of $\hat{d}^2_{i_{\min}}(m)$.
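  The mechanics of the search can be sketched for bivariate data as follows. This is a simplified illustration, not the FSDA implementation: the starting subset is supplied by hand rather than chosen robustly, only $v = 2$ is handled so that the covariance matrix can be inverted explicitly, and all function names and the toy data are hypothetical.

```python
def fit_2d(points):
    """Sample mean and unbiased covariance (sxx, sxy, syy) of bivariate points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def dist_sq(p, mean, cov):
    """Squared distance of p from mean in the metric of the 2 x 2 matrix cov."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def forward_search_2d(data, start):
    """Return a list of (m, i_min, minimum distance outside the subset)."""
    n = len(data)
    subset = list(start)
    path = []
    for m in range(len(subset), n):
        mean, cov = fit_2d([data[i] for i in subset])
        d2 = [dist_sq(p, mean, cov) for p in data]
        # Closest observation among those not in the current subset, as in (11).
        i_min = min((i for i in range(n) if i not in subset),
                    key=lambda i: d2[i])
        path.append((m, i_min, d2[i_min]))
        # The next subset holds the m + 1 observations with smallest distances.
        subset = sorted(range(n), key=lambda i: d2[i])[: m + 1]
    return path


# Hypothetical toy data: five regular points and one gross outlier (index 5).
data = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (8, 9)]
path = forward_search_2d(data, start=[0, 1, 2])
for m, i_min, d in path:
    print(m, i_min, round(d, 3))
```

  In this toy run the outlier is the last observation to join the subset, and the monitored minimum distance jumps sharply at the final step, which is the kind of peak that the forward plot is meant to reveal.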

  All the FS routines, as well as the algorithms for computing most of the commonly adopted estimators for regression and multivariate analysis, are contained in the FSDA toolbox for MATLAB and are freely downloadable, e.g., from the web site of the Joint Research Centre of the European Commission. This toolbox also contains a series of dynamic tools which enable the user to link the information present in the different plots produced by the FS, such as the index or forward plot of robust Mahalanobis distances $\hat{d}_i^2(m)$ and the scatter plot matrix.

3 Comparison of Alternative Outlier Detection Rules

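  Precise outlier identification requires cut-off values for the robust distances when model (1) is true. If $\tilde{\mu} = \tilde{\mu}_{\mathrm{RMCD}}$ and $\tilde{\Sigma} = \tilde{\Sigma}_{\mathrm{RMCD}}$, it has been shown that the usually trusted asymptotic approximation based on the $\chi^2_v$ distribution can be largely unsatisfactory. Instead, a much more accurate approximation has been proposed, based on the distributional rules

  $$ \tilde{d}^2_{i(\mathrm{RMCD})} \sim \frac{(w-1)^2}{w}\, \mathrm{Beta}\!\left(\frac{v}{2}, \frac{w-v-1}{2}\right) \qquad \text{if } w_i = 1, \tag{12} $$

  $$ \tilde{d}^2_{i(\mathrm{RMCD})} \sim \frac{w+1}{w}\, \frac{(w-1)v}{w-v}\, F_{v,\,w-v} \qquad \text{if } w_i = 0, \tag{13} $$

  where $w_i$ and $w$ are defined as in Sect. 2.2. It has also been shown how the same distributional results can be applied to deal with multiplicity of tests, to increase power and to provide control of alternative error rates in the outlier detection process.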
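  To see how the finite-sample rules (12) and (13) differ from the asymptotic $\chi^2_v$ approximation, the following sketch computes pointwise cut-offs in the special case $v = 2$. That case is chosen only because the quantiles of $\mathrm{Beta}(1, b)$, $F_{2,d}$ and $\chi^2_2$ all have closed forms, so no statistical library is required. The function name and the values $w = 50$ and $p = 0.99$ are illustrative assumptions, not settings taken from the paper.

```python
import math


def rmcd_cutoffs_v2(w, p):
    """p-quantiles of the reference distributions (12) and (13) for v = 2,
    together with the asymptotic chi-square(2) quantile."""
    v = 2
    # Rule (12): Beta(1, b) has CDF 1 - (1 - x)**b.
    b = (w - v - 1) / 2
    beta_q = 1 - (1 - p) ** (1 / b)
    cut_kept = (w - 1) ** 2 / w * beta_q
    # Rule (13): F(2, d) has CDF 1 - (1 + 2 * x / d) ** (-d / 2).
    d = w - v
    f_q = d / 2 * ((1 - p) ** (-2 / d) - 1)
    cut_trimmed = (w + 1) / w * (w - 1) * v / (w - v) * f_q
    # Chi-square with 2 degrees of freedom has CDF 1 - exp(-x / 2).
    chi2_q = -2 * math.log(1 - p)
    return cut_kept, cut_trimmed, chi2_q


cut_kept, cut_trimmed, chi2_q = rmcd_cutoffs_v2(w=50, p=0.99)
print(cut_kept, cut_trimmed, chi2_q)
```

  For these illustrative values the cut-off for observations with $w_i = 1$ lies below the $\chi^2_2$ quantile and the cut-off for trimmed observations lies above it, while all three values converge as $w$ grows. A single asymptotic threshold would therefore be too liberal for the trimmed observations and too conservative for the retained ones in this setting.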

  In the context of the Forward Search, a formal outlier test has been proposed based on the sequence $\hat{d}^2_{i_{\min}}(m)$, $m = m_0, \ldots, n - 1$, obtained from (11). In this test, the values of $\hat{d}^2_{i_{\min}}(m)$ are compared to the FS envelope

  $$ V^2_{m,\alpha} / \sigma_T(m)^2, $$

  where $V^2_{m,\alpha}$ is the $100\alpha\%$ cut-off point of the $(m+1)$th order statistic from the scaled $F$ distribution

  $$ \frac{(m^2 - 1)v}{m(m - v)}\, F_{v,\,m-v}, \tag{14} $$

  and the factor

  $$ \sigma_T(m)^2 = \frac{P(X^2_{v+2} < \chi^2_{v,m/n})}{m/n} \tag{15} $$

  2

  2

  2