Mathematical Analysis for Machine Learning and Data Mining

  


  Published by World Scientific Publishing Co. Pte. Ltd.

  USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
  UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

  Library of Congress Cataloging-in-Publication Data
  Names: Simovici, Dan A., author.
  Title: Mathematical analysis for machine learning and data mining / by Dan Simovici (University of Massachusetts, Boston, USA).
  Description: [Hackensack?] New Jersey : World Scientific, [2018] | Includes bibliographical references and index.
  Identifiers: LCCN 2018008584 | ISBN 9789813229686 (hc : alk. paper)
  Subjects: LCSH: Machine learning--Mathematics. | Data mining--Mathematics.
  Classification: LCC Q325.5 .S57 2018 | DDC 006.3/101515--dc23
  LC record available at https://lccn.loc.gov/2018008584

  British Library Cataloguing-in-Publication Data
  A catalogue record for this book is available from the British Library.

  Copyright © 2018 by World Scientific Publishing Co. Pte. Ltd.

  

All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

  For any available supplementary material, please visit http://www.worldscientific.com/worldscibooks/10.1142/10702#t=suppl

  Desk Editors: V. Vishnu Mohan/Steven Patt
  Typeset by Stallion Press
  Email: enquiries@stallionpress.com

  Making mathematics accessible to the educated layman, while keeping high scientific standards, has always been considered a treacherous navigation between the Scylla of professional contempt and the Charybdis of public misunderstanding.

  Gian-Carlo Rota

  


Preface

  Mathematical Analysis can be loosely described as the area of mathematics whose main object is the study of functions and of their behaviour with respect to limits. The term “function” refers to a broad collection of generalizations of real functions of real arguments to functionals, operators, measures, etc.

  There are several well-developed areas in mathematical analysis that present a special interest for machine learning: topology (with various flavors: point-set topology, combinatorial and algebraic topology), functional analysis on normed and inner product spaces (including Banach and Hilbert spaces), convex analysis, optimization, etc. Moreover, disciplines like measure and integration theory, which play a vital role in statistics, the other pillar of machine learning, are absent from the education of computer scientists. We aim to contribute to closing this gap, which is a serious handicap for people interested in research.

  The machine learning and data mining literature is vast and embraces a diversity of approaches, from informal to sophisticated mathematical presentations. However, the mathematical background needed for approaching research topics is usually presented in a terse and unmotivated manner, or is simply absent. This volume contains knowledge that complements the usual presentations in machine learning and provides motivation for the study of its mathematical aspects through the application chapters, which discuss optimization, iterative algorithms, neural networks, regression, and support vector machines.

  Each chapter ends with suggestions for further reading. Over 600 exercises and supplements are included; they form an integral part of the material. Some of the exercises are in reality supplemental material; for these, we include solutions. The mathematical background required for making the best use of this volume consists of the typical sequence of calculus, linear algebra, and discrete mathematics, as taught to Computer Science students in US universities.

  Special thanks are due to the librarians of the Joseph Healy Library at the University of Massachusetts Boston whose diligence was essential in completing this project. I also wish to acknowledge the helpfulness and competent assistance of Steve Patt and D. Rajesh Babu of World Scientific.

  Lastly, I wish to thank my wife, Doina, a steady source of strength and loving support.

  Dan A. Simovici Boston and Brookline

  January 2018

  

Contents

  Preface vii

  Part I. Set-Theoretical and Algebraic Preliminaries 1

  1. Preliminaries 3
  1.1 Introduction 3
  1.2 Sets and Collections 4
  1.3 Relations and Functions 8
  1.4 Sequences and Collections of Sets 16
  1.5 Partially Ordered Sets 18
  1.6 Closure and Interior Systems 28
  1.7 Algebras and σ-Algebras of Sets 34
  1.8 Dissimilarity and Metrics 43
  1.9 Elementary Combinatorics 47
  Exercises and Supplements 54
  Bibliographical Comments 64

  2. Linear Spaces 65
  2.1 Introduction 65
  2.2 Linear Spaces and Linear Independence 65
  2.3 Linear Operators and Functionals 74
  2.4 Linear Spaces with Inner Products 85
  2.5 Seminorms and Norms 88
  2.6 Linear Functionals in Inner Product Spaces 107
  2.7 Hyperplanes 110
  Exercises and Supplements 113
  Bibliographical Comments 116

  3. Algebra of Convex Sets 117
  3.1 Introduction 117
  3.2 Convex Sets and Affine Subspaces 117
  3.3 Operations on Convex Sets 129
  3.4 Cones 130
  3.5 Extreme Points 132
  3.6 Balanced and Absorbing Sets 138
  3.7 Polytopes and Polyhedra 142
  Exercises and Supplements 150
  Bibliographical Comments 158

  Part II. Topology 159

  4. Topology 161
  4.1 Introduction 161
  4.2 Topologies 162
  4.3 Closure and Interior Operators in Topological Spaces 166
  4.4 Neighborhoods 174
  4.5 Bases 180
  4.6 Compactness 189
  4.7 Separation Hierarchy 193
  4.8 Locally Compact Spaces 197
  4.9 Limits of Functions 201
  4.10 Nets 204
  4.11 Continuous Functions 210
  4.12 Homeomorphisms 218
  4.13 Connected Topological Spaces 222
  4.14 Products of Topological Spaces 225
  4.15 Semicontinuous Functions 230
  4.16 The Epigraph and the Hypograph of a Function 237
  Exercises and Supplements 239
  Bibliographical Comments 253

  5. Metric Space Topologies 255
  5.1 Introduction 255
  5.2 Sequences in Metric Spaces 260
  5.3 Limits of Functions on Metric Spaces 261
  5.4 Continuity of Functions between Metric Spaces 264
  5.5 Separation Properties of Metric Spaces 270
  5.6 Completeness of Metric Spaces 275
  5.7 Pointwise and Uniform Convergence 283
  5.8 The Stone-Weierstrass Theorem 286
  5.9 Totally Bounded Metric Spaces 291
  5.10 Contractions and Fixed Points 295
  5.11 The Hausdorff Metric Hyperspace of Compact Subsets 300
  5.12 The Topological Space (R, O) 303
  5.13 Series and Schauder Bases 307
  5.14 Equicontinuity 315
  Exercises and Supplements 318
  Bibliographical Comments 327

  6. Topological Linear Spaces 329
  6.1 Introduction 329
  6.2 Topologies of Linear Spaces 329
  6.3 Topologies on Inner Product Spaces 337
  6.4 Locally Convex Linear Spaces 338
  6.5 Continuous Linear Operators 340
  6.6 Linear Operators on Normed Linear Spaces 341
  6.7 Topological Aspects of Convex Sets 348
  6.8 The Relative Interior 351
  6.9 Separation of Convex Sets 356
  6.10 Theorems of Alternatives 366
  6.11 The Contingent Cone 370
  6.12 Extreme Points and Krein-Milman Theorem 373
  Exercises and Supplements 375
  Bibliographical Comments 381

  Part III. Measure and Integration 383

  7. Measurable Spaces and Measures 385
  7.1 Introduction 385
  7.2 Measurable Spaces 385
  7.3 Borel Sets 388
  7.4 Measurable Functions 392
  7.5 Measures and Measure Spaces 398
  7.6 Outer Measures 417
  7.7 The Lebesgue Measure on R^n 427
  7.8 Measures on Topological Spaces 450
  7.9 Measures in Metric Spaces 453
  7.10 Signed and Complex Measures 456
  7.11 Probability Spaces 464
  Exercises and Supplements 470
  Bibliographical Comments 484

  8. Integration 485
  8.1 Introduction 485
  8.2 The Lebesgue Integral 485
  8.2.1 The Integral of Simple Measurable Functions 486
  8.2.2 The Integral of Non-negative Measurable Functions 491
  8.2.3 The Integral of Real-Valued Measurable Functions 500
  8.2.4 The Integral of Complex-Valued Measurable Functions 505
  8.3 The Dominated Convergence Theorem 508
  8.4 Functions of Bounded Variation 512
  8.5 Riemann Integral vs. Lebesgue Integral 517
  8.6 The Radon-Nikodym Theorem 525
  8.7 Integration on Products of Measure Spaces 533
  8.8 The Riesz-Markov-Kakutani Theorem 540
  8.9 Integration Relative to Signed Measures and Complex Measures 547
  8.10 Indefinite Integral of a Function 549
  8.11 Convergence in Measure 551
  8.12 Lᵖ and ℒᵖ Spaces 556
  8.13 Fourier Transforms of Measures 565
  8.14 Lebesgue-Stieltjes Measures and Integrals 569
  8.15 Distributions of Random Variables 572
  8.16 Random Vectors 577
  Exercises and Supplements 582
  Bibliographical Comments 593

  Part IV. Functional Analysis and Convexity 595

  9. Banach Spaces 597
  9.1 Introduction 597
  9.2 Banach Spaces — Examples 597
  9.3 Linear Operators on Banach Spaces 603
  9.4 Compact Operators 610
  9.5 Duals of Normed Linear Spaces 612
  9.6 Spectra of Linear Operators on Banach Spaces 616
  Exercises and Supplements 619
  Bibliographical Comments 623

  10. Differentiability of Functions Defined on Normed Spaces 625
  10.1 Introduction 625
  10.2 The Fréchet and Gâteaux Differentiation 625
  10.3 Taylor’s Formula 649
  10.4 The Inverse Function Theorem in R^n 658
  10.5 Normal and Tangent Subspaces for Surfaces in R^n 663
  Exercises and Supplements 666
  Bibliographical Comments 675

  11. Hilbert Spaces 677
  11.1 Introduction 677
  11.2 Hilbert Spaces — Examples 677
  11.3 Classes of Linear Operators in Hilbert Spaces 679
  11.3.1 Self-Adjoint Operators 681
  11.3.2 Normal and Unitary Operators 683
  11.3.3 Projection Operators 684
  11.4 Orthonormal Sets in Hilbert Spaces 686
  11.5 The Dual Space of a Hilbert Space 703
  11.6 Weak Convergence 704
  11.7 Spectra of Linear Operators on Hilbert Spaces 707
  11.8 Functions of Positive and Negative Type 712
  11.9 Reproducing Kernel Hilbert Spaces 722
  11.10 Positive Operators in Hilbert Spaces 733
  Exercises and Supplements 736
  Bibliographical Comments 745

  12. Convex Functions 747
  12.1 Introduction 747
  12.2 Convex Functions — Basics 748
  12.3 Constructing Convex Functions 756
  12.4 Extrema of Convex Functions 759
  12.5 Differentiability and Convexity 760
  12.6 Quasi-Convex and Pseudo-Convex Functions 770
  12.7 Convexity and Inequalities 775
  12.8 Subgradients 780
  Exercises and Supplements 793
  Bibliographical Comments 815

  Part V. Applications 817

  13. Optimization 819
  13.1 Introduction 819
  13.2 Local Extrema, Ascent and Descent Directions 819
  13.3 General Optimization Problems 826
  13.4 Optimization without Differentiability 827
  13.5 Optimization with Differentiability 831
  13.6 Duality 843
  13.7 Strong Duality 849
  Exercises and Supplements 854
  Bibliographical Comments 863

  14. Iterative Algorithms 865
  14.1 Introduction 865
  14.2 Newton’s Method 865
  14.3 The Secant Method 869
  14.4 Newton’s Method in Banach Spaces 871
  14.5 Conjugate Gradient Method 874
  14.6 Gradient Descent Algorithm 879
  14.7 Stochastic Gradient Descent 882
  Exercises and Supplements 884
  Bibliographical Comments 892

  15. Neural Networks 893
  15.1 Introduction 893
  15.2 Neurons 893
  15.3 Neural Networks 895
  15.4 Neural Networks as Universal Approximators 896
  15.5 Weight Adjustment by Back Propagation 899
  Exercises and Supplements 902
  Bibliographical Comments 907

  16. Regression 909
  16.1 Introduction 909
  16.2 Linear Regression 909
  16.3 A Statistical Model of Linear Regression 912
  16.4 Logistic Regression 914
  16.5 Ridge Regression 916
  16.6 Lasso Regression and Regularization 917
  Exercises and Supplements 920
  Bibliographical Comments 924

  17. Support Vector Machines 925
  17.1 Introduction 925
  17.2 Linearly Separable Data Sets 925
  17.3 Soft Support Vector Machines 930
  17.4 Non-linear Support Vector Machines 933
  17.5 Perceptrons 939
  Exercises and Supplements 941
  Bibliographical Comments 947

  Bibliography 949

  Index 957

  

PART I

Set-Theoretical and Algebraic

Preliminaries

  Chapter 1

  Preliminaries

  1.1 Introduction

  This introductory chapter contains a mix of preliminary results and notations that we use in further chapters, ranging from set theory and combinatorics to metric spaces.

  The membership of x in a set S is denoted by x ∈ S; if x is not a member of the set S, we write x ∉ S.

  Throughout this book, we use standardized notations for certain important sets of numbers:

  C      the set of complex numbers
  R      the set of real numbers
  R≥0    the set of non-negative real numbers
  R>0    the set of positive real numbers
  R≤0    the set of non-positive real numbers
  R<0    the set of negative real numbers
  Q      the set of rational numbers
  I      the set of irrational numbers
  Z      the set of integers
  N      the set of natural numbers
  R̂      the set R ∪ {−∞, +∞}
  R̂≥0    the set R≥0 ∪ {+∞}
  R̂>0    the set R>0 ∪ {+∞}
  R̂≤0    the set R≤0 ∪ {−∞}
  Ĉ      the set C ∪ {∞}

  The usual order of real numbers is extended to the set R̂ by −∞ < x < ∞ for every x ∈ R. Addition and multiplication are extended by x + ∞ = ∞ + x = +∞ and x − ∞ = −∞ + x = −∞ for every x ∈ R. Also, if x ≠ 0, we assume that

  x · ∞ = ∞ · x = ∞ if x > 0 and −∞ if x < 0,
  x · (−∞) = (−∞) · x = −∞ if x > 0 and ∞ if x < 0.

  Additionally, we assume that 0 · ∞ = ∞ · 0 = 0 and 0 · (−∞) = (−∞) · 0 = 0. Note that ∞ − ∞ and −∞ + ∞ are undefined.

  Division is extended by x/∞ = x/(−∞) = 0 for every x ∈ R. The set of complex numbers C is extended by adding a single “infinity” element ∞. The sum ∞ + ∞ is not defined in the complex case. If S is a finite set, we denote by |S| the number of elements of S.
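  These conventions are easy to encode programmatically. The following Python sketch (an informal aside; the function name ext_mul is ad hoc) implements the extended multiplication rule above; note that it deliberately differs from IEEE floating-point arithmetic, where 0 * inf is NaN rather than 0.

  import math

  def ext_mul(x, y):
      # multiplication on the extended reals with the convention 0 * (±∞) = 0
      if x == 0 or y == 0:
          return 0.0
      return x * y        # ordinary rules otherwise, e.g. 2 · ∞ = ∞ and (−3) · ∞ = −∞

  assert ext_mul(0, math.inf) == 0
  assert ext_mul(2, math.inf) == math.inf
  assert ext_mul(-3, math.inf) == -math.inf
  assert math.isnan(0 * math.inf)   # IEEE semantics differ from the convention adopted here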

  1.2 Sets and Collections

  We assume that the reader is familiar with elementary set operations: union, intersection, difference, etc., and with their properties. The empty set is denoted by ∅.

  We give, without proof, several properties of the union and intersection of sets:
  (1) S ∪ (T ∪ U) = (S ∪ T) ∪ U (associativity of union),
  (2) S ∪ T = T ∪ S (commutativity of union),
  (3) S ∪ S = S (idempotency of union),
  (4) S ∪ ∅ = S,
  (5) S ∩ (T ∩ U) = (S ∩ T) ∩ U (associativity of intersection),
  (6) S ∩ T = T ∩ S (commutativity of intersection),
  (7) S ∩ S = S (idempotency of intersection),
  (8) S ∩ ∅ = ∅,
  for all sets S, T, U. The associativity of union and intersection allows us to denote unambiguously the union of three sets S, T, U by S ∪ T ∪ U and the intersection of three sets S, T, U by S ∩ T ∩ U.
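  For finite sets these identities can be spot-checked mechanically; the following Python fragment (an informal aside, using Python’s built-in set type) verifies all eight of them for one particular choice of S, T, U.

  S, T, U = {1, 2}, {2, 3}, {3, 4}

  # properties (1)-(4): union
  assert S | (T | U) == (S | T) | U      # associativity
  assert S | T == T | S                  # commutativity
  assert S | S == S                      # idempotency
  assert S | set() == S

  # properties (5)-(8): intersection
  assert S & (T & U) == (S & T) & U      # associativity
  assert S & T == T & S                  # commutativity
  assert S & S == S                      # idempotency
  assert S & set() == set()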

  Definition 1.1. The sets S and T are disjoint if S ∩ T = ∅.

  Sets may contain other sets as elements. For example, the set C = {∅, {0}, {0, 1}, {0, 2}, {1, 2, 3}} contains the empty set ∅ and {0}, {0, 1}, {0, 2}, {1, 2, 3} as its elements. We refer to such sets as collections of sets or simply collections. In general, we use calligraphic letters C, D, . . . to denote collections of sets.

  


  If C and D are two collections, we say that C is included in D, or that C is a subcollection of D, if every member of C is a member of D. This is denoted by C ⊆ D.

  Two collections C and D are equal if we have both C ⊆ D and D ⊆ C. This is denoted by C = D.

  Definition 1.2. Let C be a collection of sets. The union of C, denoted by ⋃C, is the set defined by

  ⋃C = {x | x ∈ S for some S ∈ C}.

  If C is a non-empty collection, its intersection is the set given by

  ⋂C = {x | x ∈ S for every S ∈ C}.

  If C = {S, T}, we have x ∈ ⋃C if and only if x ∈ S or x ∈ T, and x ∈ ⋂C if and only if x ∈ S and x ∈ T. The union and the intersection of this two-set collection are denoted by S ∪ T and S ∩ T and are referred to as the union and the intersection of S and T, respectively.
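  For a finite collection of finite sets, ⋃C and ⋂C can be computed by folding the binary operations over the collection; the Python sketch below (an informal aside, with the ad hoc names big_union and big_intersection) does exactly this.

  from functools import reduce

  def big_union(collection):
      # the union of a collection of sets; for the empty collection it returns ∅
      return reduce(lambda a, b: a | b, collection, set())

  def big_intersection(collection):
      # the intersection of a non-empty collection of sets
      if not collection:
          raise ValueError("the intersection of the empty collection is treated separately")
      return reduce(lambda a, b: a & b, collection)

  C = [{0}, {0, 1}, {0, 2}]
  assert big_union(C) == {0, 1, 2}
  assert big_intersection(C) == {0}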

  The difference of two sets S, T is denoted by S − T. When T is a subset of S we write T̄ for S − T, and we refer to the set T̄ as the complement of T with respect to S, or simply the complement of T. The relationship between set difference and set union and intersection is well known: for every set S and non-empty collection C of sets, we have

  S − ⋃C = ⋂{S − C | C ∈ C} and S − ⋂C = ⋃{S − C | C ∈ C}.

  For any sets S, T, U, we have S − (T ∪ U) = (S − T) ∩ (S − U) and S − (T ∩ U) = (S − T) ∪ (S − U). With the notation previously introduced for the complement of a set, these equalities state that the complement of T ∪ U is T̄ ∩ Ū and the complement of T ∩ U is T̄ ∪ Ū. For any sets T, U, V, we have

  (U ∪ V) ∩ T = (U ∩ T) ∪ (V ∩ T) and (U ∩ V) ∪ T = (U ∪ T) ∩ (V ∪ T).

  Note that if C and D are two collections such that C ⊆ D, then ⋃C ⊆ ⋃D and ⋂D ⊆ ⋂C.

  We initially excluded the empty collection from the definition of the intersection of a collection. However, within the framework of collections of subsets of a given set S, we extend the previous definition by taking ⋂∅ = S for the empty collection of subsets of S. This is consistent with the fact that ∅ ⊆ C implies ⋂C ⊆ S.

  The symmetric difference of sets, denoted by ⊕, is defined by U ⊕ V = (U − V) ∪ (V − U) for all sets U, V.

  We leave it to the reader to verify that for all sets U, V, T we have:
  (i) U ⊕ U = ∅;
  (ii) U ⊕ V = V ⊕ U;
  (iii) (U ⊕ V) ⊕ T = U ⊕ (V ⊕ T).
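  Python’s ^ operator on sets computes precisely the symmetric difference, so these three properties can be spot-checked as follows (an informal aside; the particular sets are arbitrary).

  U, V, T = {1, 2, 3}, {2, 3, 4}, {3, 4, 5}

  assert U ^ V == (U - V) | (V - U)          # the defining equality
  assert U ^ U == set()                      # property (i)
  assert U ^ V == V ^ U                      # property (ii), commutativity
  assert (U ^ V) ^ T == U ^ (V ^ T)          # property (iii), associativity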

  The next theorem allows us to introduce a type of set collection of fundamental importance.

  Theorem 1.1. Let {{x, y}, {x}} and {{u, v}, {u}} be two collections such that {{x, y}, {x}} = {{u, v}, {u}}. Then, we have x = u and y = v.

  Proof. Suppose that {{x, y}, {x}} = {{u, v}, {u}}.

  If x = y, the collection {{x, y}, {x}} consists of a single set, {x}, so the collection {{u, v}, {u}} also consists of a single set. This means that {u, v} = {u}, which implies u = v. Therefore, x = u, which gives the desired conclusion because we also have y = v.

  If x ≠ y, then neither {x, y} nor {u, v} is a singleton. However, they both contain exactly one singleton, namely {x} and {u}, respectively, so x = u. They also contain the sets {x, y} and {u, v}, which must be equal. Since v ∈ {x, y} and v ≠ u = x, we conclude that v = y.

  Definition 1.3. An ordered pair is a collection of sets of the form {{x, y}, {x}}.

  Theorem 1.1 implies that for an ordered pair {{x, y}, {x}}, x and y are uniquely determined. This justifies the following definition.

  Definition 1.4. Let p = {{x, y}, {x}} be an ordered pair. Then x is the first component of p and y is the second component of p.

  From now on, an ordered pair {{x, y}, {x}} is denoted by (x, y). If both x, y ∈ S, we refer to (x, y) as an ordered pair on the set S.

  Definition 1.5. Let X, Y be two sets. Their product is the set X × Y that consists of all pairs of the form (x, y), where x ∈ X and y ∈ Y.
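  The encoding of ordered pairs given in Definition 1.3 can be made concrete with Python frozensets; the sketch below (an informal aside, with the ad hoc names pair and components) also illustrates Theorem 1.1, which guarantees that the two components can be recovered unambiguously.

  def pair(x, y):
      # the ordered pair (x, y) represented as the collection {{x, y}, {x}}
      return frozenset({frozenset({x, y}), frozenset({x})})

  def components(p):
      # recover x and y from a collection built by pair(x, y)
      if len(p) == 1:                              # the case x = y: p = {{x}}
          (s,) = p
          (x,) = s
          return x, x
      singleton = next(s for s in p if len(s) == 1)
      doubleton = next(s for s in p if len(s) == 2)
      (x,) = singleton
      (y,) = doubleton - singleton
      return x, y

  assert components(pair(1, 2)) == (1, 2)
  assert components(pair(7, 7)) == (7, 7)
  assert pair(1, 2) != pair(2, 1)                  # the order of components matters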

  The set product is often referred to as the Cartesian product of sets.

  Example 1.1. Let X = {a, b, c} and let Y = {1, 2}. The Cartesian product X × Y is given by

  X × Y = {(a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2)}.
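  In Python, itertools.product enumerates Cartesian products directly; the fragment below (an informal aside) reproduces Example 1.1.

  from itertools import product

  X = {"a", "b", "c"}
  Y = {1, 2}
  XY = set(product(X, Y))
  assert XY == {("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 1), ("c", 2)}
  assert len(XY) == len(X) * len(Y)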

  Definition 1.6. Let C and D be two collections of sets such that ⋃C = ⋃D. D is a refinement of C if, for every D ∈ D, there exists C ∈ C such that D ⊆ C. This is denoted by C ⊑ D.

  Example 1.2. Consider the collections C = {(a, ∞) | a ∈ R} and D = {(a, b) | a, b ∈ R, a < b}. It is clear that ⋃C = ⋃D = R. Since we have (a, b) ⊆ (a, ∞) for every a, b ∈ R such that a < b, it follows that D is a refinement of C.

  Definition 1.7. A collection of sets C is hereditary if U ∈ C and W ⊆ U implies W ∈ C.

  Example 1.3. Let S be a set. The collection of subsets of S, denoted by P(S), is a hereditary collection of sets since a subset of a subset T of S is itself a subset of S.

  The set of subsets of S that contain k elements is denoted by P_k(S). Clearly, for every set S, we have P_0(S) = {∅} because there is only one subset of S that contains 0 elements, namely the empty set. The set of all finite subsets of a set S is denoted by P_fin(S). It is clear that P_fin(S) = ⋃_{k∈N} P_k(S).

  Example 1.4. If S = {a, b, c}, then P(S) consists of the following eight sets: ∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}. For the empty set, we have P(∅) = {∅}.
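  For a finite set S, the collections P_k(S) and P(S) coincide with what itertools.combinations produces; the sketch below (an informal aside, with ad hoc names) reproduces Example 1.4 and the remarks preceding it.

  from itertools import combinations

  def subsets_of_size(S, k):
      # P_k(S): the k-element subsets of S
      return {frozenset(c) for c in combinations(S, k)}

  def power_set(S):
      # P(S) as the union of the collections P_k(S) for k = 0, ..., |S|
      return set().union(*(subsets_of_size(S, k) for k in range(len(S) + 1)))

  S = {"a", "b", "c"}
  assert len(power_set(S)) == 8                    # Example 1.4
  assert subsets_of_size(S, 0) == {frozenset()}    # P_0(S) = {∅}
  assert power_set(set()) == {frozenset()}         # P(∅) = {∅}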

  Definition 1.8. Let C be a collection of sets and let U be a set. The trace of the collection C on the set U is the collection C_U = {U ∩ C | C ∈ C}.

  We conclude this presentation of collections of sets with two more operations on collections of sets.

  Definition 1.9. Let C and D be two collections of sets. The collections C ∨ D, C ∧ D, and C − D are given by

  C ∨ D = {C ∪ D | C ∈ C and D ∈ D},
  C ∧ D = {C ∩ D | C ∈ C and D ∈ D},
  C − D = {C − D | C ∈ C and D ∈ D}.

  Example 1.5. Let C and D be the collections of sets defined by

  C = {{x}, {y, z}, {x, y}, {x, y, z}},
  D = {{y}, {x, y}, {u, y, z}}.

  We have

  C ∨ D = {{x, y}, {y, z}, {x, y, z}, {u, y, z}, {u, x, y, z}},
  C ∧ D = {∅, {x}, {y}, {x, y}, {y, z}},
  C − D = {∅, {x}, {z}, {x, z}},
  D − C = {∅, {u}, {x}, {y}, {u, z}, {u, y, z}}.

  Unlike “∪” and “∩”, the operations “∨” and “∧” between collections of sets are not idempotent. Indeed, we have, for example,

  D ∨ D = {{y}, {x, y}, {u, y, z}, {u, x, y, z}} ≠ D.

  The trace C_K of a collection C on a set K can be written as C_K = C ∧ {K}.
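  The operations of Definition 1.9 are easily expressed over collections of frozensets; the Python sketch below (an informal aside, with the ad hoc names join, meet, and diff) reproduces part of Example 1.5.

  def join(C, D):   # C ∨ D
      return {c | d for c in C for d in D}

  def meet(C, D):   # C ∧ D
      return {c & d for c in C for d in D}

  def diff(C, D):   # C − D
      return {c - d for c in C for d in D}

  C = {frozenset({"x"}), frozenset({"y", "z"}), frozenset({"x", "y"}), frozenset({"x", "y", "z"})}
  D = {frozenset({"y"}), frozenset({"x", "y"}), frozenset({"u", "y", "z"})}

  assert meet(C, D) == {frozenset(), frozenset({"x"}), frozenset({"y"}),
                        frozenset({"x", "y"}), frozenset({"y", "z"})}
  assert join(D, D) != D      # "∨" is not idempotent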

  We conclude this section by introducing a special type of collection of subsets of a set.

  Definition 1.10. A partition of a non-empty set S is a collection π of non-empty subsets of S that are pairwise disjoint and whose union equals S.

  The members of π are referred to as the blocks of the partition π. The collection of partitions of a set S is denoted by PART(S). A partition is finite if it has a finite number of blocks. The set of finite partitions of S is denoted by PART_fin(S).

  If π ∈ PART(S), then a subset T of S is π-saturated if it is a union of blocks of π.

  Example 1.6. Let π = {{1, 3}, {4}, {2, 5, 6}} be a partition of S = {1, 2, 3, 4, 5, 6}. The set {1, 3, 4} is π-saturated because it is the union of the blocks {1, 3} and {4}.
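  Since a π-saturated set is exactly a union of blocks, saturation can be tested by checking that no block is cut in two by the set; the sketch below (an informal aside, with the ad hoc name is_saturated) verifies Example 1.6.

  def is_saturated(T, blocks):
      # T is saturated iff T ⊆ ⋃blocks and every block lies inside T or is disjoint from T
      return T <= set().union(*blocks) and all(b <= T or b.isdisjoint(T) for b in blocks)

  pi = [{1, 3}, {4}, {2, 5, 6}]
  assert is_saturated({1, 3, 4}, pi)     # the union of the blocks {1, 3} and {4}
  assert not is_saturated({1, 4}, pi)    # this set cuts the block {1, 3}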

  1.3 Relations and Functions

  Definition 1.11. Let X, Y be two sets. A relation on X, Y is a subset ρ of the set product X × Y. If X = Y = S, we refer to ρ as a relation on S.

  The relation ρ on S is:
  • reflexive if (x, x) ∈ ρ for every x ∈ S;
  • irreflexive if (x, x) ∉ ρ for every x ∈ S;
  • symmetric if (x, y) ∈ ρ implies (y, x) ∈ ρ for all x, y ∈ S;
  • antisymmetric if (x, y) ∈ ρ and (y, x) ∈ ρ imply x = y for all x, y ∈ S;
  • transitive if (x, y) ∈ ρ and (y, z) ∈ ρ imply (x, z) ∈ ρ for all x, y, z ∈ S.

  Denote by REFL(S), SYMM(S), ANTISYMM(S), and TRAN(S) the set of reflexive relations, the set of symmetric relations, the set of antisymmetric relations, and the set of transitive relations on S, respectively.

  A partial order on S is a relation ρ that belongs to REFL(S) ∩ ANTISYMM(S) ∩ TRAN(S), that is, a relation that is reflexive, antisymmetric, and transitive.

  Example 1.7. Let δ be the relation that consists of those pairs (p, q) of natural numbers such that q = pk for some natural number k. We have (p, q) ∈ δ if p evenly divides q. Since (p, p) ∈ δ for every p, it is clear that δ is reflexive.

  Suppose that we have both (p, q) ∈ δ and (q, p) ∈ δ. Then q = pk and p = qh. If either p or q is 0, then the other number is clearly 0. Assume that neither p nor q is 0. Then 1 = hk, which implies h = k = 1, so p = q, which proves that δ is antisymmetric.

  Finally, if (p, q), (q, r) ∈ δ, we have q = pk and r = qh for some k, h ∈ N, which implies r = p(hk), so (p, r) ∈ δ, which shows that δ is transitive.

  Example 1.8. Define the relation λ on R as the set of all ordered pairs (x, y) such that y = x + t, where t is a non-negative number. We have (x, x) ∈ λ because x = x + 0 for every x ∈ R, so λ is reflexive. If (x, y) ∈ λ and (y, x) ∈ λ, we have y = x + t and x = y + s for two non-negative numbers t, s, which implies 0 = t + s, so t = s = 0. This means that x = y, so λ is antisymmetric. Finally, if (x, y), (y, z) ∈ λ, we have y = x + u and z = y + v for two non-negative numbers u, v, which implies z = x + u + v, so (x, z) ∈ λ, which shows that λ is transitive.
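  For a finite carrier set, the defining properties of a partial order can be checked by brute force; the Python fragment below (an informal aside) does this for the divisibility relation of Example 1.7 restricted to {1, ..., 12}.

  S = range(1, 13)
  delta = {(p, q) for p in S for q in S if q % p == 0}     # p divides q

  reflexive     = all((x, x) in delta for x in S)
  antisymmetric = all(x == y for (x, y) in delta if (y, x) in delta)
  transitive    = all((x, z) in delta
                      for (x, y) in delta
                      for (u, z) in delta if u == y)

  assert reflexive and antisymmetric and transitive        # delta is a partial order on S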

  In current mathematical practice, we often write xρy instead of (x, y) ∈ ρ, where ρ is a relation on S and x, y ∈ S. Thus, we write pδq and xλy instead of (p, q) ∈ δ and (x, y) ∈ λ. Furthermore, we shall use the standard notations “|” and “≤” for δ and λ, that is, we shall write p | q and x ≤ y if p divides q and x is less than or equal to y. This alternative way of denoting the fact that (x, y) belongs to ρ is known as the infix notation.

  Example 1.9. Let P(S) be the set of subsets of S. It is easy to verify that the inclusion between subsets, “⊆”, is a partial order relation on P(S). If U, V ∈ P(S), we denote the inclusion of U in V by U ⊆ V using the infix notation.

  


  Functions are special relations that enjoy the property described in the next definition.

  Definition 1.12. Let X, Y be two sets. A function (or a mapping) from X to Y is a relation f on X, Y such that (x, y), (x, y′) ∈ f implies y = y′.

  In other words, the first component of a pair (x, y) ∈ f determines uniquely the second component of the pair. We denote the second component of a pair (x, y) ∈ f by f(x) and say, occasionally, that f maps x to y. If f is a function from X to Y we write f : X −→ Y.

  Definition 1.13. Let X, Y be two sets and let f : X −→ Y. The domain of f is the set Dom(f) = {x ∈ X | y = f(x) for some y ∈ Y}. The range of f is the set Ran(f) = {y ∈ Y | y = f(x) for some x ∈ X}.

  Definition 1.14. Let S be a set and let L be a subset of S. The characteristic function of L is the function 1_L : S −→ {0, 1} defined by

  1_L(x) = 1 if x ∈ L and 1_L(x) = 0 otherwise,

  for x ∈ S. The indicator function of L is the function I_L : S −→ R̂ defined by

  I_L(x) = 0 if x ∈ L and I_L(x) = ∞ otherwise,

  for x ∈ S.

  It is easy to see that

  1_{P∩Q}(x) = 1_P(x) · 1_Q(x),
  1_{P∪Q}(x) = 1_P(x) + 1_Q(x) − 1_P(x) · 1_Q(x),
  1_{P̄}(x) = 1 − 1_P(x),

  for every P, Q ⊆ S and x ∈ S.

  Theorem 1.2. Let X, Y, Z be three sets and let f : X −→ Y and g : Y −→ Z be two functions. The relation gf : X −→ Z that consists of all pairs (x, z) such that y = f(x) and g(y) = z for some y ∈ Y is a function.

  

  Proof. Let (x, z_1), (x, z_2) ∈ gf. There exist y_1, y_2 ∈ Y such that y_1 = f(x), y_2 = f(x), g(y_1) = z_1, and g(y_2) = z_2. The first two equalities imply y_1 = y_2; the last two yield z_1 = z_2, so gf is indeed a function.

  Note that the composition of the functions f and g has been denoted in Theorem 1.2 by gf rather than by the relation product fg. This manner of denoting function composition is applied throughout this book.
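  When functions with finite domains are represented extensionally as sets of pairs (Python dictionaries below), the composition gf of Theorem 1.2 can be computed directly; the helper name compose is ad hoc and belongs only to this informal aside.

  def compose(g, f):
      # the function gf, mapping x to g(f(x)), for dictionary-encoded functions
      return {x: g[f[x]] for x in f if f[x] in g}

  f = {1: "a", 2: "b", 3: "a"}          # f : {1, 2, 3} −→ {a, b}
  g = {"a": 10, "b": 20}                # g : {a, b} −→ {10, 20}
  assert compose(g, f) == {1: 10, 2: 20, 3: 10}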

  Definition 1.15. Let X, Y be two sets and let f : X −→ Y. If U is a subset of X, the restriction of f to U is the function g : U −→ Y defined by g(u) = f(u) for u ∈ U. The restriction of f to U is denoted by f ↿ U.

  Example 1.10. Let f be the function defined by f(x) = |x| for x ∈ R. Its restriction to R<0 is given by (f ↿ R<0)(x) = −x for x ∈ R<0.

  Definition 1.16. A function f : X −→ Y is:
  (i) injective or one-to-one if f(x_1) = f(x_2) implies x_1 = x_2 for x_1, x_2 ∈ Dom(f);
  (ii) surjective or onto if Ran(f) = Y;
  (iii) total if Dom(f) = X.
  If f is both injective and surjective, then it is a bijective function.

  Theorem 1.3. A function f : X −→ Y is injective if and only if there exists a function g : Y −→ X such that g(f(x)) = x for every x ∈ Dom(f).

  Proof. Suppose that f is an injective function. For x ∈ Dom(f) and y = f(x) define g(y) = x. Note that g is well defined, for if y = f(x_1) = f(x_2) then x_1 = x_2 due to the injectivity of f. It follows immediately that g(f(x)) = x for x ∈ Dom(f).

  Conversely, suppose that there exists a function g : Y −→ X such that g(f(x)) = x for every x ∈ Dom(f). If f(x_1) = f(x_2), then x_1 = g(f(x_1)) = g(f(x_2)) = x_2, which proves that f is indeed injective.

  The function g whose existence is established by Theorem 1.3 is said to be the left inverse of f. Thus, a function f : X −→ Y is injective if and only if it has a left inverse.
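  For functions with finite domains, the construction used in the proof of Theorem 1.3 is effective: inverting the graph of an injective f yields a left inverse. The sketch below (an informal aside, with the ad hoc name left_inverse) uses dictionary-encoded functions.

  def left_inverse(f):
      # return g with g(f(x)) = x for every x in Dom(f); f must be injective
      g = {}
      for x, y in f.items():
          assert y not in g, "f is not injective"
          g[y] = x
      return g

  f = {1: "a", 2: "b", 3: "c"}          # an injective function
  g = left_inverse(f)
  assert all(g[f[x]] == x for x in f)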

  To prove a similar result concerning surjective functions we need to state a basic axiom of set theory.

  


The Axiom of Choice: Let C = {C_i | i ∈ I} be a collection of non-empty sets indexed by a set I. There exists a function φ : I −→ ⋃_{i∈I} C_i (known as a choice function) such that φ(i) ∈ C_i for each i ∈ I.

  Theorem 1.4. A function f : X −→ Y is surjective if and only if there exists a function h : Y −→ X such that f(h(y)) = y for every y ∈ Y.

  Proof. Suppose that f is a surjective function. The collection {f^{-1}(y) | y ∈ Y} indexed by Y consists of non-empty sets. By the Axiom of Choice there exists a choice function for this collection, that is, a function h : Y −→ ⋃_{y∈Y} f^{-1}(y) such that h(y) ∈ f^{-1}(y), or f(h(y)) = y, for y ∈ Y.

  Conversely, suppose that there exists a function h : Y −→ X such that f(h(y)) = y for every y ∈ Y. Then, for every y ∈ Y we have f(x) = y for x = h(y), which shows that f is surjective.

  The function h whose existence is established by Theorem 1.4 is said to be the right inverse of f. Thus, a function has a right inverse if and only if it is surjective.
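  For functions with finite domains, the choice made in the proof of Theorem 1.4 can be carried out explicitly by picking one element in each preimage f^{-1}(y); the sketch below (an informal aside, with the ad hoc name right_inverse) uses dictionary-encoded functions.

  def right_inverse(f):
      # return h with f(h(y)) = y for every y in Ran(f)
      h = {}
      for x, y in f.items():
          h.setdefault(y, x)            # choose one preimage of y; this is a choice function
      return h

  f = {1: "a", 2: "b", 3: "a"}          # surjective onto {"a", "b"}
  h = right_inverse(f)
  assert all(f[h[y]] == y for y in set(f.values()))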

  Corollary 1.1. A function f : X −→ X is a bijection if and only if there exists a function k : X −→ X that is both a left inverse and a right inverse of f.

  Proof. This statement follows from Theorems 1.3 and 1.4.

  Indeed, if f is a bijection, there exists a left inverse g : X −→ X such that g(f(x)) = x and a right inverse h : X −→ X such that f(h(x)) = x for every x ∈ X. Since h(x) ∈ X we have g(f(h(x))) = h(x), which implies g(x) = h(x) because f(h(x)) = x for x ∈ X. This yields g = h.

  The relationship between the subsets of a set and characteristic functions defined on that set is discussed next.

  Theorem 1.5. There is a bijection Ψ : P(S) −→ (S −→ {0, 1}) between the set of subsets of S and the set of characteristic functions defined on S.

  Proof. For P ∈ P(S) define Ψ(P) = 1_P. The mapping Ψ is one-to-one. Indeed, assume that 1_P = 1_Q, where P, Q ∈ P(S). We have x ∈ P if and only if 1_P(x) = 1, which is equivalent to 1_Q(x) = 1. This happens if and only if x ∈ Q; hence, P = Q, so Ψ is one-to-one.

  Let f : S −→ {0, 1} be an arbitrary function. Define the set T_f = {x ∈ S | f(x) = 1}. Then f is the characteristic function of the set T_f; hence, Ψ(T_f) = f, which shows that the mapping Ψ is also onto; hence, it is a bijection.
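  Both the bijection Ψ of Theorem 1.5 and the identities listed after Definition 1.14 can be spot-checked for a small set S; the Python names char_fun and subset_of below are ad hoc and belong only to this informal aside.

  S = {1, 2, 3, 4}
  P, Q = {1, 2}, {2, 3}

  def char_fun(L):
      # the characteristic function 1_L of a subset L of S
      return lambda x: 1 if x in L else 0

  def subset_of(f):
      # the inverse direction of Ψ: recover {x ∈ S | f(x) = 1}
      return {x for x in S if f(x) == 1}

  assert subset_of(char_fun(P)) == P
  one_P, one_Q = char_fun(P), char_fun(Q)
  one_inter, one_union = char_fun(P & Q), char_fun(P | Q)
  assert all(one_inter(x) == one_P(x) * one_Q(x) for x in S)
  assert all(one_union(x) == one_P(x) + one_Q(x) - one_P(x) * one_Q(x) for x in S)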

  Definition 1.17. A set S is indexed by a set I if there exists a surjection f : I −→ S. In this case we refer to I as an index set.

  If S is indexed by the function f : I −→ S, we write the element f(i) simply as s_i if there is no risk of confusion.

  Definition 1.18. A sequence of length n on a set X is a function x : {0, 1, . . . , n − 1} −→ X. At times we will use the same term to designate a function x : {1, . . . , n} −→ X.

  Sequences are denoted by bold letters. If x is a sequence of length n, we refer to x(i) as the i-th element of x; this element of X is usually denoted by x_i.