Mathematical Analysis for Machine Learning and Data Mining
Published by World Scientific Publishing Co. Pte. Ltd.
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data
Names: Simovici, Dan A., author.
Title: Mathematical analysis for machine learning and data mining / by Dan Simovici (University of Massachusetts, Boston, USA).
Description: [Hackensack?] New Jersey : World Scientific, [2018] | Includes bibliographical references and index.
Identifiers: LCCN 2018008584 | ISBN 9789813229686 (hc : alk. paper)
Subjects: LCSH: Machine learning--Mathematics. | Data mining--Mathematics.
Classification: LCC Q325.5 .S57 2018 | DDC 006.3/101515--dc23
LC record available at https://lccn.loc.gov/2018008584

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
Copyright © 2018 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.

For any available supplementary material, please visit
http://www.worldscientific.com/worldscibooks/10.1142/10702#t=suppl

Desk Editors: V. Vishnu Mohan/Steven Patt
Typeset by Stallion Press
Email: enquiries@stallionpress.com

Making mathematics accessible to the educated layman, while keeping high scientific standards, has always been considered a treacherous navigation between the Scylla of professional contempt and the Charybdis of public misunderstanding.
Gian-Carlo Rota
Preface
Mathematical Analysis can be loosely described as the area of mathematics whose main object is the study of functions and of their behavior with respect to limits. The term “function” refers to a broad collection of generalizations of real functions of real arguments: functionals, operators, measures, etc.
There are several well-developed areas in mathematical analysis that present a special interest for machine learning: topology (in its various flavors: point-set topology, combinatorial and algebraic topology), functional analysis on normed and inner product spaces (including Banach and Hilbert spaces), convex analysis, optimization, etc. Moreover, disciplines like measure and integration theory, which play a vital role in statistics, the other pillar of machine learning, are absent from the education of computer scientists. We aim to contribute to closing this gap, which is a serious handicap for people interested in research.
The machine learning and data mining literature is vast and embraces a diversity of approaches, from informal to sophisticated mathematical presentations. However, the mathematical background needed for approaching research topics is usually presented in a terse and unmotivated manner, or is simply absent. This volume contains knowledge that complements the usual presentations in machine learning and provides motivation (through its application chapters, which discuss optimization, iterative algorithms, neural networks, regression, and support vector machines) for the study of these mathematical aspects.
Each chapter ends with suggestions for further reading. Over 600 exercises and supplements are included; they form an integral part of the material. Some of the exercises are in reality supplemental material; for these, we include solutions. The mathematical background required for making the best use of this volume consists of the typical sequence calculus — linear algebra — discrete mathematics, as it is taught to Computer Science students in US universities.
Special thanks are due to the librarians of the Joseph Healy Library at the University of Massachusetts Boston whose diligence was essential in completing this project. I also wish to acknowledge the helpfulness and competent assistance of Steve Patt and D. Rajesh Babu of World Scientific.
Lastly, I wish to thank my wife, Doina, a steady source of strength and loving support.
Dan A. Simovici Boston and Brookline
January 2018
Contents
Preface vii
Part I. Set-Theoretical and Algebraic Preliminaries 1

1. Preliminaries 3
1.1 Introduction . . . . . . 3
1.2 Sets and Collections . . . . . . 4
1.3 Relations and Functions . . . . . . 8
1.4 Sequences and Collections of Sets . . . . . . 16
1.5 Partially Ordered Sets . . . . . . 18
1.6 Closure and Interior Systems . . . . . . 28
1.7 Algebras and σ-Algebras of Sets . . . . . . 34
1.8 Dissimilarity and Metrics . . . . . . 43
1.9 Elementary Combinatorics . . . . . . 47
Exercises and Supplements . . . . . . 54
Bibliographical Comments . . . . . . 64

2. Linear Spaces 65
2.1 Introduction . . . . . . 65
2.2 Linear Spaces and Linear Independence . . . . . . 65
2.3 Linear Operators and Functionals . . . . . . 74
2.4 Linear Spaces with Inner Products . . . . . . 85
2.5 Seminorms and Norms . . . . . . 88
2.6 Linear Functionals in Inner Product Spaces . . . . . . 107
2.7 Hyperplanes . . . . . . 110
Exercises and Supplements . . . . . . 113
Bibliographical Comments . . . . . . 116
3. Algebra of Convex Sets 117
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.2 Convex Sets and Affine Subspaces . . . . . . . . . . . . . 117
3.3 Operations on Convex Sets . . . . . . . . . . . . . . . . . 129
3.4 Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.5 Extreme Points . . . . . . . . . . . . . . . . . . . . . . . . 132
3.6 Balanced and Absorbing Sets . . . . . . . . . . . . . . . . 138
3.7 Polytopes and Polyhedra . . . . . . 142
Exercises and Supplements . . . . . . 150
Bibliographical Comments . . . . . . 158
Part II. Topology 159

4. Topology 161
4.1 Introduction . . . . . . 161
4.2 Topologies . . . . . . 162
4.3 Closure and Interior Operators in Topological Spaces . . . . . . 166
4.4 Neighborhoods . . . . . . 174
4.5 Bases . . . . . . 180
4.6 Compactness . . . . . . 189
4.7 Separation Hierarchy . . . . . . 193
4.8 Locally Compact Spaces . . . . . . 197
4.9 Limits of Functions . . . . . . 201
4.10 Nets . . . . . . 204
4.11 Continuous Functions . . . . . . 210
4.12 Homeomorphisms . . . . . . 218
4.13 Connected Topological Spaces . . . . . . 222
4.14 Products of Topological Spaces . . . . . . 225
4.15 Semicontinuous Functions . . . . . . 230
4.16 The Epigraph and the Hypograph of a Function . . . . . . 237
Exercises and Supplements . . . . . . 239
Bibliographical Comments . . . . . . 253

5. Metric Space Topologies 255
5.1 Introduction . . . . . . 255
5.2 Sequences in Metric Spaces . . . . . . 260
5.3 Limits of Functions on Metric Spaces . . . . . . 261
5.4 Continuity of Functions between Metric Spaces . . . . . . 264
5.5 Separation Properties of Metric Spaces . . . . . . 270
5.6 Completeness of Metric Spaces . . . . . . 275
5.7 Pointwise and Uniform Convergence . . . . . . 283
5.8 The Stone-Weierstrass Theorem . . . . . . 286
5.9 Totally Bounded Metric Spaces . . . . . . 291
5.10 Contractions and Fixed Points . . . . . . 295
5.11 The Hausdorff Metric Hyperspace of Compact Subsets . . . . . . 300
5.12 The Topological Space (R, O) . . . . . . 303
5.13 Series and Schauder Bases . . . . . . 307
5.14 Equicontinuity . . . . . . 315
Exercises and Supplements . . . . . . 318
Bibliographical Comments . . . . . . 327

6. Topological Linear Spaces 329
6.1 Introduction . . . . . . 329
6.2 Topologies of Linear Spaces . . . . . . 329
6.3 Topologies on Inner Product Spaces . . . . . . 337
6.4 Locally Convex Linear Spaces . . . . . . 338
6.5 Continuous Linear Operators . . . . . . 340
6.6 Linear Operators on Normed Linear Spaces . . . . . . 341
6.7 Topological Aspects of Convex Sets . . . . . . 348
6.8 The Relative Interior . . . . . . 351
6.9 Separation of Convex Sets . . . . . . 356
6.10 Theorems of Alternatives . . . . . . 366
6.11 The Contingent Cone . . . . . . 370
6.12 Extreme Points and Krein-Milman Theorem . . . . . . 373
Exercises and Supplements . . . . . . 375
Bibliographical Comments . . . . . . 381

Part III. Measure and Integration 383

7. Measurable Spaces and Measures 385
7.1 Introduction . . . . . . 385
7.2 Measurable Spaces . . . . . . 385
7.3 Borel Sets . . . . . . 388
7.4 Measurable Functions . . . . . . 392
7.5 Measures and Measure Spaces . . . . . . 398
7.6 Outer Measures . . . . . . 417
7.7 The Lebesgue Measure on Rⁿ . . . . . . 427
7.8 Measures on Topological Spaces . . . . . . 450
7.9 Measures in Metric Spaces . . . . . . 453
7.10 Signed and Complex Measures . . . . . . 456
7.11 Probability Spaces . . . . . . 464
Exercises and Supplements . . . . . . 470
Bibliographical Comments . . . . . . 484

8. Integration 485
8.1 Introduction . . . . . . 485
8.2 The Lebesgue Integral . . . . . . 485
8.2.1 The Integral of Simple Measurable Functions . . . . . . 486
8.2.2 The Integral of Non-negative Measurable Functions . . . . . . 491
8.2.3 The Integral of Real-Valued Measurable Functions . . . . . . 500
8.2.4 The Integral of Complex-Valued Measurable Functions . . . . . . 505
8.3 The Dominated Convergence Theorem . . . . . . 508
8.4 Functions of Bounded Variation . . . . . . 512
8.5 Riemann Integral vs. Lebesgue Integral . . . . . . 517
8.6 The Radon-Nikodym Theorem . . . . . . 525
8.7 Integration on Products of Measure Spaces . . . . . . 533
8.8 The Riesz-Markov-Kakutani Theorem . . . . . . 540
8.9 Integration Relative to Signed Measures and Complex Measures . . . . . . 547
8.10 Indefinite Integral of a Function . . . . . . 549
8.11 Convergence in Measure . . . . . . 551
8.12 Lᵖ and ℓᵖ Spaces . . . . . . 556
8.13 Fourier Transforms of Measures . . . . . . 565
8.14 Lebesgue-Stieltjes Measures and Integrals . . . . . . 569
8.15 Distributions of Random Variables . . . . . . 572
8.16 Random Vectors . . . . . . 577
Exercises and Supplements . . . . . . 582
Bibliographical Comments . . . . . . 593
Part IV. Functional Analysis and Convexity 595

9. Banach Spaces 597
9.1 Introduction . . . . . . 597
9.2 Banach Spaces — Examples . . . . . . 597
9.3 Linear Operators on Banach Spaces . . . . . . 603
9.4 Compact Operators . . . . . . 610
9.5 Duals of Normed Linear Spaces . . . . . . 612
9.6 Spectra of Linear Operators on Banach Spaces . . . . . . 616
Exercises and Supplements . . . . . . 619
Bibliographical Comments . . . . . . 623

10. Differentiability of Functions Defined on Normed Spaces 625
10.1 Introduction . . . . . . 625
10.2 The Fréchet and Gâteaux Differentiation . . . . . . 625
10.3 Taylor’s Formula . . . . . . 649
10.4 The Inverse Function Theorem in Rⁿ . . . . . . 658
10.5 Normal and Tangent Subspaces for Surfaces in Rⁿ . . . . . . 663
Exercises and Supplements . . . . . . 666
Bibliographical Comments . . . . . . 675

11. Hilbert Spaces 677
11.1 Introduction . . . . . . 677
11.2 Hilbert Spaces — Examples . . . . . . 677
11.3 Classes of Linear Operators in Hilbert Spaces . . . . . . 679
11.3.1 Self-Adjoint Operators . . . . . . 681
11.3.2 Normal and Unitary Operators . . . . . . 683
11.3.3 Projection Operators . . . . . . 684
11.4 Orthonormal Sets in Hilbert Spaces . . . . . . 686
11.5 The Dual Space of a Hilbert Space . . . . . . 703
11.6 Weak Convergence . . . . . . 704
11.7 Spectra of Linear Operators on Hilbert Spaces . . . . . . 707
11.8 Functions of Positive and Negative Type . . . . . . 712
11.9 Reproducing Kernel Hilbert Spaces . . . . . . 722
11.10 Positive Operators in Hilbert Spaces . . . . . . 733
Exercises and Supplements . . . . . . 736
Bibliographical Comments . . . . . . 745
12. Convex Functions 747
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 747
12.2 Convex Functions — Basics . . . . . . . . . . . . . . . . . 748
12.3 Constructing Convex Functions . . . . . . . . . . . . . . . 756
12.4 Extrema of Convex Functions . . . . . . . . . . . . . . . . 759
12.5 Differentiability and Convexity . . . . . . . . . . . . . . . 760
12.6 Quasi-Convex and Pseudo-Convex Functions . . . . . . . 770
12.7 Convexity and Inequalities . . . . . . . . . . . . . . . . . . 775
12.8 Subgradients . . . . . . 780
Exercises and Supplements . . . . . . 793
Bibliographical Comments . . . . . . 815
Part V. Applications 817
13. Optimization 819
13.1 Introduction . . . . . . 819
13.2 Local Extrema, Ascent and Descent Directions . . . . . . 819
13.3 General Optimization Problems . . . . . . 826
13.4 Optimization without Differentiability . . . . . . 827
13.5 Optimization with Differentiability . . . . . . 831
13.6 Duality . . . . . . 843
13.7 Strong Duality . . . . . . 849
Exercises and Supplements . . . . . . 854
Bibliographical Comments . . . . . . 863

14. Iterative Algorithms 865
14.1 Introduction . . . . . . 865
14.2 Newton’s Method . . . . . . 865
14.3 The Secant Method . . . . . . 869
14.4 Newton’s Method in Banach Spaces . . . . . . 871
14.5 Conjugate Gradient Method . . . . . . 874
14.6 Gradient Descent Algorithm . . . . . . 879
14.7 Stochastic Gradient Descent . . . . . . 882
Exercises and Supplements . . . . . . 884
Bibliographical Comments . . . . . . 892
15. Neural Networks 893
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 893
15.2 Neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
15.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 895
15.4 Neural Networks as Universal Approximators . . . . . . . 896
15.5 Weight Adjustment by Back Propagation . . . . . . 899
Exercises and Supplements . . . . . . 902
Bibliographical Comments . . . . . . 907
16. Regression 909
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 909
16.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . 909
16.3 A Statistical Model of Linear Regression . . . . . . . . . . 912
16.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 914
16.5 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . 916
16.6 Lasso Regression and Regularization . . . . . . 917
Exercises and Supplements . . . . . . 920
Bibliographical Comments . . . . . . 924
17. Support Vector Machines 925
17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 925
17.2 Linearly Separable Data Sets . . . . . . . . . . . . . . . . 925
17.3 Soft Support Vector Machines . . . . . . . . . . . . . . . . 930
17.4 Non-linear Support Vector Machines . . . . . . . . . . . . 933
17.5 Perceptrons . . . . . . 939
Exercises and Supplements . . . . . . 941
Bibliographical Comments . . . . . . 947
Bibliography 949
Index 957
PART I

Set-Theoretical and Algebraic Preliminaries
1.1 Introduction

This introductory chapter contains a mix of preliminary results and notations that we use in further chapters, ranging from set theory and combinatorics to metric spaces.

The membership of x in a set S is denoted by x ∈ S; if x is not a member of the set S, we write x ∉ S. Throughout this book, we use standardized notations for certain important sets of numbers:

C      the set of complex numbers
R      the set of real numbers
R≥0    the set of non-negative real numbers
R>0    the set of positive real numbers
R≤0    the set of non-positive real numbers
R<0    the set of negative real numbers
Q      the set of rational numbers
I      the set of irrational numbers
Z      the set of integers
N      the set of natural numbers
R̂      the set R ∪ {−∞, +∞}
R̂≥0    the set R≥0 ∪ {+∞}
R̂≤0    the set R≤0 ∪ {−∞}
Ĉ      the set C ∪ {∞}

The usual order of real numbers is extended to the set R̂ by −∞ < x < +∞ for every x ∈ R. Addition is extended by x + ∞ = ∞ + x = +∞ and x − ∞ = −∞ + x = −∞ for every x ∈ R. Also, if x ≠ 0 we assume that

x · ∞ = ∞ · x = +∞ if x > 0 and −∞ if x < 0,
x · (−∞) = (−∞) · x = −∞ if x > 0 and +∞ if x < 0.

Additionally, we assume that 0 · ∞ = ∞ · 0 = 0 and 0 · (−∞) = (−∞) · 0 = 0. Note that ∞ − ∞ and −∞ + ∞ are undefined.

Division is extended by x/∞ = x/(−∞) = 0 for every x ∈ R. The set of complex numbers C is extended by adding a single “infinity” element ∞. The sum ∞ + ∞ is not defined in the complex case. If S is a finite set, we denote by |S| the number of elements of S.
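These conventions can be mimicked with IEEE floating-point infinities, except that the convention 0 · (±∞) = 0 adopted here must be special-cased, since IEEE arithmetic evaluates it to NaN; a minimal sketch:

```python
import math

def ext_mul(x, y):
    """Multiplication on the extended reals, with the convention
    0 * (+/-inf) = 0 used in the text (IEEE floats would give nan)."""
    if (x == 0 and math.isinf(y)) or (y == 0 and math.isinf(x)):
        return 0.0
    return x * y

def ext_add(x, y):
    """Addition on the extended reals; inf + (-inf) is undefined."""
    if math.isinf(x) and math.isinf(y) and x != y:
        raise ValueError("inf - inf is undefined")
    return x + y

print(ext_mul(0.0, math.inf))    # 0.0, by convention
print(ext_mul(-3.0, math.inf))   # -inf
print(ext_add(5.0, -math.inf))   # -inf
print(1.0 / math.inf)            # 0.0: division is extended by x/inf = 0
```

The special case for 0 · ∞ matters later for integration theory, where this convention makes the integral of the zero function over a set of infinite measure equal to zero.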
1.2 Sets and Collections

We assume that the reader is familiar with elementary set operations: union, intersection, difference, etc., and with their properties. The empty set is denoted by ∅.

We give, without proof, several properties of union and intersection of sets:
(1) S ∪ (T ∪ U) = (S ∪ T) ∪ U (associativity of union),
(2) S ∪ T = T ∪ S (commutativity of union),
(3) S ∪ S = S (idempotency of union),
(4) S ∪ ∅ = S,
(5) S ∩ (T ∩ U) = (S ∩ T) ∩ U (associativity of intersection),
(6) S ∩ T = T ∩ S (commutativity of intersection),
(7) S ∩ S = S (idempotency of intersection),
(8) S ∩ ∅ = ∅,
for all sets S, T, U. The associativity of union and intersection allows us to denote unambiguously the union of three sets S, T, U by S ∪ T ∪ U and the intersection of three sets S, T, U by S ∩ T ∩ U.
Definition 1.1. The sets S and T are disjoint if S ∩ T = ∅.
Sets may contain other sets as elements. For example, the set
C = {∅, {0}, {0, 1}, {0, 2}, {1, 2, 3}}
contains the empty set ∅ and {0}, {0, 1}, {0, 2}, {1, 2, 3} as its elements. We refer to such sets as collections of sets or simply collections. In general, we use calligraphic letters C, D, . . . to denote collections of sets.
If C and D are two collections, we say that C is included in D, or that C is a subcollection of D, if every member of C is a member of D. This is denoted by C ⊆ D.
Two collections C and D are equal if we have both C ⊆ D and D ⊆ C. This is denoted by C = D.
Definition 1.2. Let C be a collection of sets. The union of C, denoted by ⋃C, is the set defined by
⋃C = {x | x ∈ S for some S ∈ C}.
If C is a non-empty collection, its intersection is the set given by
⋂C = {x | x ∈ S for every S ∈ C}.
If C = {S, T}, we have x ∈ ⋃C if and only if x ∈ S or x ∈ T, and x ∈ ⋂C if and only if x ∈ S and x ∈ T. The union and the intersection of this two-set collection are denoted by S ∪ T and S ∩ T and are referred to as the union and the intersection of S and T, respectively.
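For finite collections, the union and intersection of Definition 1.2 can be computed directly; a small sketch:

```python
from functools import reduce

# A collection of sets, as in Definition 1.2
C = [{1, 2}, {2, 3}, {2, 4}]

union_C = set().union(*C)                 # {x | x in S for some S in C}
inter_C = reduce(lambda a, b: a & b, C)   # {x | x in S for every S in C}

print(union_C)  # {1, 2, 3, 4}
print(inter_C)  # {2}
```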
The difference of two sets S, T is denoted by S − T. When T is a subset of S we write T̄ for S − T, and we refer to the set T̄ as the complement of T with respect to S, or simply the complement of T. The relationship between set difference and set union and intersection is well-known: for every set S and non-empty collection C of sets, we have
S − ⋃C = ⋂{S − C | C ∈ C} and S − ⋂C = ⋃{S − C | C ∈ C}.
For any sets S, T, U, we have S − (T ∪ U) = (S − T) ∩ (S − U) and S − (T ∩ U) = (S − T) ∪ (S − U). With the notation previously introduced for the complement of a set, the above equalities state that the complement of T ∪ U is T̄ ∩ Ū, and the complement of T ∩ U is T̄ ∪ Ū.
(U ∪ V ) ∩ T = (U ∩ T ) ∪ (V ∩ T ) and (U ∩ V ) ∪ T = (U ∪ T ) ∩ (V ∪ T ).
Note that if C and D are two collections such that C ⊆ D, then
⋃C ⊆ ⋃D and ⋂D ⊆ ⋂C.
We initially excluded the empty collection from the definition of the intersection of a collection. However, within the framework of collections of subsets of a given set S, we extend the previous definition by taking ⋂∅ = S for the empty collection of subsets of S. This is consistent with the fact that ∅ ⊆ C implies ⋂C ⊆ S.

The symmetric difference of sets, denoted by ⊕, is defined by
U ⊕ V = (U − V) ∪ (V − U)
for all sets U, V. We leave it to the reader to verify that for all sets U, V, T we have
(i) U ⊕ U = ∅;
(ii) U ⊕ V = V ⊕ U;
(iii) (U ⊕ V) ⊕ T = U ⊕ (V ⊕ T).
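The properties of ⊕ can be checked on sample sets using Python's built-in symmetric difference operator `^`:

```python
# Verify the three properties of the symmetric difference on sample sets.
U, V, T = {1, 2, 3}, {2, 3, 4}, {3, 5}

assert U ^ U == set()               # (i)   U ⊕ U = ∅
assert U ^ V == V ^ U               # (ii)  commutativity
assert (U ^ V) ^ T == U ^ (V ^ T)   # (iii) associativity
print(U ^ V)  # {1, 4}
```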
The next theorem allows us to introduce a type of set collection of fundamental importance.

Theorem 1.1. Let {{x, y}, {x}} and {{u, v}, {u}} be two collections such that {{x, y}, {x}} = {{u, v}, {u}}. Then, we have x = u and y = v.

Proof. Suppose that {{x, y}, {x}} = {{u, v}, {u}}.
If x = y, the collection {{x, y}, {x}} consists of a single set, {x}, so the collection {{u, v}, {u}} also consists of a single set. This means that {u, v} = {u}, which implies u = v. Therefore, x = u, which gives the desired conclusion because we also have y = x = u = v.
If x ≠ y, then neither {x, y} nor {u, v} is a singleton. However, both collections contain exactly one singleton, namely {x} and {u}, respectively, so x = u. They also contain the two-element sets {x, y} and {u, v}, which must therefore be equal. Since v ∈ {x, y} and v ≠ u = x, we conclude that v = y.

Definition 1.3. An ordered pair is a collection of sets {{x, y}, {x}}.
Theorem 1.1 implies that for an ordered pair {{x, y}, {x}}, x and y are uniquely determined. This justifies the following definition.
Definition 1.4. Let p = {{x, y}, {x}} be an ordered pair. Then x is the first component of p and y is the second component of p.

From now on, an ordered pair {{x, y}, {x}} is denoted by (x, y). If both x, y ∈ S, we refer to (x, y) as an ordered pair on the set S.

Definition 1.5. Let X, Y be two sets. Their product is the set X × Y that consists of all pairs of the form (x, y), where x ∈ X and y ∈ Y.
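The encoding of ordered pairs as collections {{x, y}, {x}} can be sketched with nested frozensets; `kpair` and `components` below are illustrative names, not notation from the text:

```python
def kpair(x, y):
    """Encode the ordered pair (x, y) as the collection {{x, y}, {x}}."""
    return frozenset([frozenset([x, y]), frozenset([x])])

def components(p):
    """Recover (first, second) from an encoded pair, as in Theorem 1.1."""
    sets = sorted(p, key=len)
    if len(sets) == 1:            # the x = y case: the collection is {{x}}
        (x,) = sets[0]
        return x, x
    (x,) = sets[0]                # the singleton {x} yields the first component
    (y,) = sets[1] - sets[0]      # the remaining element of {x, y} is the second
    return x, y

assert kpair(1, 2) != kpair(2, 1)        # order matters, unlike for {1, 2}
assert components(kpair(1, 2)) == (1, 2)
assert components(kpair(3, 3)) == (3, 3)
```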
The set product is often referred to as the Cartesian product of sets.

Example 1.1. Let X = {a, b, c} and let Y = {1, 2}. The Cartesian product X × Y is given by
X × Y = {(a, 1), (a, 2), (b, 1), (b, 2), (c, 1), (c, 2)}.
Definition 1.6. Let C and D be two collections of sets such that ⋃C = ⋃D. D is a refinement of C if, for every D ∈ D, there exists C ∈ C such that D ⊆ C. This is denoted by C ⊑ D.

Example 1.2. Consider the collections C = {(a, ∞) | a ∈ R} and D = {(a, b) | a, b ∈ R, a < b}. It is clear that ⋃C = ⋃D = R. Since we have (a, b) ⊆ (a, ∞) for every a, b ∈ R such that a < b, it follows that D is a refinement of C.
Definition 1.7. A collection of sets C is hereditary if U ∈ C and W ⊆ U implies W ∈ C.

Example 1.3. Let S be a set. The collection of subsets of S, denoted by P(S), is a hereditary collection of sets since a subset of a subset T of S is itself a subset of S.

The set of subsets of S that contain k elements is denoted by P_k(S). Clearly, for every set S, we have P_0(S) = {∅} because there is only one subset of S that contains 0 elements, namely the empty set. The set of all finite subsets of a set S is denoted by P_fin(S). It is clear that P_fin(S) = ⋃_{k ∈ N} P_k(S).
Example 1.4. If S = {a, b, c}, then P(S) consists of the following eight sets:
∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}.
For the empty set, we have P(∅) = {∅}.
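For a finite S, the sets P_k(S) and P(S) can be enumerated with `itertools.combinations`; a sketch:

```python
from itertools import combinations

def P_k(S, k):
    """The k-element subsets of S, i.e., the collection P_k(S)."""
    return [set(c) for c in combinations(S, k)]

def powerset(S):
    """P(S): all subsets of the finite set S, grouped by cardinality."""
    return [s for k in range(len(S) + 1) for s in P_k(S, k)]

S = {'a', 'b', 'c'}
print(len(powerset(S)))  # 8 subsets, as in Example 1.4
print(P_k(S, 0))         # [set()] -- P_0(S) = {∅}
```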
Definition 1.8. Let C be a collection of sets and let U be a set. The trace of the collection C on the set U is the collection C_U = {U ∩ C | C ∈ C}.
We conclude this presentation of collections of sets with two more operations on collections of sets.

Definition 1.9. Let C and D be two collections of sets. The collections C ∨ D, C ∧ D, and C − D are given by
C ∨ D = {C ∪ D | C ∈ C and D ∈ D},
C ∧ D = {C ∩ D | C ∈ C and D ∈ D},
C − D = {C − D | C ∈ C and D ∈ D}.

Example 1.5. Let C and D be the collections of sets defined by
C = {{x}, {y, z}, {x, y}, {x, y, z}},
D = {{y}, {x, y}, {u, y, z}}.
We have
C ∨ D = {{x, y}, {y, z}, {x, y, z}, {u, y, z}, {u, x, y, z}},
C ∧ D = {∅, {x}, {y}, {x, y}, {y, z}},
C − D = {∅, {x}, {z}, {x, z}},
D − C = {∅, {u}, {x}, {y}, {u, z}, {u, y, z}}.
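The operations of Definition 1.9 translate directly into set comprehensions; the collections below are small samples built over the symbols used in Example 1.5:

```python
def join(C, D):
    """C ∨ D = {C ∪ D | C in C, D in D} (Definition 1.9)."""
    return {c | d for c in C for d in D}

def meet(C, D):
    """C ∧ D = {C ∩ D | C in C, D in D}."""
    return {c & d for c in C for d in D}

# frozensets are used so that sets can themselves be elements of a set
C = {frozenset(s) for s in [{'x'}, {'y', 'z'}, {'x', 'y'}, {'x', 'y', 'z'}]}
D = {frozenset(s) for s in [{'y'}, {'x', 'y'}, {'u', 'y', 'z'}]}

# Unlike union and intersection, ∨ and ∧ are not idempotent on collections:
print(join(D, D) == D)   # False
print(meet(C, C) == C)   # False
```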
Unlike “∪” and “∩”, the operations “∨” and “∧” between collections of sets are not idempotent. Indeed, we have, for example,
D ∨ D = {{y}, {x, y}, {u, y, z}, {u, x, y, z}} ≠ D.
The trace C_K of a collection C on a set K can be written as C_K = C ∧ {K}.
We conclude this section by introducing a special type of collection of subsets of a set.

Definition 1.10. A partition of a non-empty set S is a collection π of non-empty subsets of S that are pairwise disjoint and whose union equals S. The members of π are referred to as the blocks of the partition π.

The collection of partitions of a set S is denoted by PART(S). A partition is finite if it has a finite number of blocks. The set of finite partitions of S is denoted by PART_fin(S).

If π ∈ PART(S), then a subset T of S is π-saturated if it is a union of blocks of π.
Example 1.6. Let π = {{1, 3}, {4}, {2, 5, 6}} be a partition of S = {1, 2, 3, 4, 5, 6}. The set {1, 3, 4} is π-saturated because it is the union of the blocks {1, 3} and {4}.
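For finite sets, Definition 1.10 and π-saturation can be checked mechanically; a sketch using the partition of Example 1.6:

```python
def is_partition(pi, S):
    """Check that the blocks of pi are non-empty, pairwise disjoint,
    and that their union equals S (Definition 1.10)."""
    blocks = [set(b) for b in pi]
    union = set().union(*blocks)
    return (all(blocks)                                     # non-empty blocks
            and sum(len(b) for b in blocks) == len(union)   # pairwise disjoint
            and union == set(S))                            # covers S

def is_saturated(T, pi):
    """T is pi-saturated iff it equals the union of the blocks it contains."""
    return set(T) == set().union(*[set(b) for b in pi if set(b) <= set(T)])

pi = [{1, 3}, {4}, {2, 5, 6}]
S = {1, 2, 3, 4, 5, 6}
assert is_partition(pi, S)
assert is_saturated({1, 3, 4}, pi)       # union of blocks {1, 3} and {4}
assert not is_saturated({1, 4, 5}, pi)   # 5 arrives without the rest of its block
```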
1.3 Relations and Functions

Definition 1.11. Let X, Y be two sets. A relation on X, Y is a subset ρ of the set product X × Y. If X = Y = S, we refer to ρ as a relation on S.

The relation ρ on S is:
- reflexive if (x, x) ∈ ρ for every x ∈ S;
- irreflexive if (x, x) ∉ ρ for every x ∈ S;
- symmetric if (x, y) ∈ ρ implies (y, x) ∈ ρ for all x, y ∈ S;
- antisymmetric if (x, y) ∈ ρ and (y, x) ∈ ρ imply x = y for all x, y ∈ S;
- transitive if (x, y) ∈ ρ and (y, z) ∈ ρ imply (x, z) ∈ ρ for all x, y, z ∈ S.
Denote by REFL(S), SYMM(S), ANTISYMM(S), and TRAN(S) the set of reflexive relations, the set of symmetric relations, the set of antisymmetric relations, and the set of transitive relations on S, respectively.
A partial order on S is a relation ρ that belongs to REFL(S) ∩ ANTISYMM(S) ∩ TRAN(S), that is, a relation that is reflexive, antisymmetric, and transitive.
Example 1.7. Let δ be the relation that consists of those pairs (p, q) of natural numbers such that q = pk for some natural number k. We have (p, q) ∈ δ if p evenly divides q. Since (p, p) ∈ δ for every p, it is clear that δ is reflexive.
Suppose that we have both (p, q) ∈ δ and (q, p) ∈ δ. Then q = pk and p = qh for some h, k ∈ N. If either p or q is 0, then the other number is clearly 0. Assume that neither p nor q is 0. Then 1 = hk, which implies h = k = 1, so p = q, which proves that δ is antisymmetric.
Finally, if (p, q), (q, r) ∈ δ, we have q = pk and r = qh for some k, h ∈ N, which implies r = p(hk), so (p, r) ∈ δ, which shows that δ is transitive.

Example 1.8. Define the relation λ on R as the set of all ordered pairs (x, y) such that y = x + t, where t is a non-negative number. We have (x, x) ∈ λ because x = x + 0 for every x ∈ R, so λ is reflexive. If (x, y) ∈ λ and (y, x) ∈ λ, we have y = x + t and x = y + s for two non-negative numbers t, s, which implies 0 = t + s, so t = s = 0. This means that x = y, so λ is antisymmetric. Finally, if (x, y), (y, z) ∈ λ, we have y = x + u and z = y + v for two non-negative numbers u, v, which implies z = x + u + v, so (x, z) ∈ λ, which shows that λ is transitive.
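For finite restrictions, the properties verified in Examples 1.7 and 1.8 can be checked by brute force; a sketch for the divisibility relation δ restricted to {1, …, 12}:

```python
def is_reflexive(rho, S):
    return all((x, x) in rho for x in S)

def is_antisymmetric(rho, S):
    return all(not ((x, y) in rho and (y, x) in rho) or x == y
               for x in S for y in S)

def is_transitive(rho, S):
    return all((x, z) in rho
               for x in S for y in S for z in S
               if (x, y) in rho and (y, z) in rho)

# The divisibility relation of Example 1.7, restricted to {1, ..., 12}
S = range(1, 13)
delta = {(p, q) for p in S for q in S if q % p == 0}

assert is_reflexive(delta, S)
assert is_antisymmetric(delta, S)
assert is_transitive(delta, S)   # so delta is a partial order on {1, ..., 12}
```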
In current mathematical practice, we often write xρy instead of (x, y) ∈ ρ, where ρ is a relation on S and x, y ∈ S. Thus, we write pδq and xλy instead of (p, q) ∈ δ and (x, y) ∈ λ. Furthermore, we shall use the standard notations “|” and “≤” for δ and λ, that is, we shall write p | q and x ≤ y if p divides q and x is less than or equal to y, respectively. This alternative way to denote the fact that (x, y) belongs to ρ is known as the infix notation.
Example 1.9. Let P(S) be the set of subsets of S. It is easy to verify that the inclusion between subsets “ ⊆” is a partial order relation on P(S). If
U, V ∈ P(S), we denote the inclusion of U in V by U ⊆ V using the infix notation.
Functions are special relations that enjoy the property described in the next definition.
Definition 1.12. Let X, Y be two sets. A function (or a mapping) from X to Y is a relation f on X, Y such that (x, y), (x, y′) ∈ f implies y = y′.
In other words, the first component of a pair (x, y) ∈ f determines uniquely the second component of the pair. We denote the second component of a pair (x, y) ∈ f by f(x) and say, occasionally, that f maps x to y. If f is a function from X to Y we write f : X −→ Y.
Definition 1.13. Let X, Y be two sets and let f : X −→ Y.
The domain of f is the set Dom(f) = {x ∈ X | y = f(x) for some y ∈ Y}. The range of f is the set Ran(f) = {y ∈ Y | y = f(x) for some x ∈ X}.

Definition 1.14. Let S be a set and let L be a subset of S. The characteristic function of L is the function 1_L : S −→ {0, 1} defined by

1_L(x) = 1 if x ∈ L, and 1_L(x) = 0 otherwise,

for x ∈ S. The indicator function of L is the function I_L : S −→ R̂ defined by

I_L(x) = 0 if x ∈ L, and I_L(x) = ∞ otherwise,

for x ∈ S.
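As a computational aside (not from the text), the two functions of Definition 1.14 can be sketched in Python, with `math.inf` standing in for ∞; the set names below are arbitrary:

```python
# Characteristic function 1_L and indicator function I_L of a subset L of S.
import math

S = [0, 1, 2, 3, 4]
L = {1, 3}

def char_L(x):
    """1_L: equals 1 on L, 0 elsewhere."""
    return 1 if x in L else 0

def ind_L(x):
    """I_L: equals 0 on L, infinity elsewhere."""
    return 0 if x in L else math.inf

print([char_L(x) for x in S])  # [0, 1, 0, 1, 0]
print([ind_L(x) for x in S])   # [inf, 0, inf, 0, inf]
```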
It is easy to see that:

1_{P∩Q}(x) = 1_P(x) · 1_Q(x),
1_{P∪Q}(x) = 1_P(x) + 1_Q(x) − 1_P(x) · 1_Q(x),
1_{P̄}(x) = 1 − 1_P(x),

for every P, Q ⊆ S and x ∈ S.

Theorem 1.2. Let X, Y, Z be three sets and let f : X −→ Y and g : Y −→ Z be two functions. The relation gf : X −→ Z that consists of all pairs (x, z) such that y = f(x) and g(y) = z for some y ∈ Y is a function.
Proof. Let (x, z_1), (x, z_2) ∈ gf. There exist y_1, y_2 ∈ Y such that y_1 = f(x), y_2 = f(x), g(y_1) = z_1, and g(y_2) = z_2. The first two equalities imply y_1 = y_2; the last two yield z_1 = z_2, so gf is indeed a function.
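To make the relational definition of gf concrete, here is a small Python sketch (ours, with arbitrary choices of f and g) that builds gf as a set of pairs and checks that it is single-valued, i.e., a function:

```python
# Build the relation gf = {(x, z) | z = g(f(x))} for concrete f and g,
# and verify it is a function: each first component determines the second.
X = range(4)
f = {x: x + 1 for x in X}            # f : X -> Y, as a value table
g = {y: y * y for y in f.values()}   # g : Y -> Z, as a value table

gf = {(x, g[f[x]]) for x in X}

# single-valued: the number of distinct x equals the number of pairs
assert len({x for x, _ in gf}) == len(gf)
print(sorted(gf))  # [(0, 1), (1, 4), (2, 9), (3, 16)]
```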
Note that the composition of the functions f and g has been denoted in Theorem 1.2 by gf rather than by the relation product fg. This manner of denoting function composition is applied throughout this book.
Definition 1.15. Let X, Y be two sets and let f : X −→ Y. If U is a subset of X, the restriction of f to U is the function g : U −→ Y defined by g(u) = f(u) for u ∈ U. The restriction of f to U is denoted by f ↿ U.
Example 1.10. Let f be the function defined by f(x) = |x| for x ∈ R. Its restriction to R_{<0} is given by (f ↿ R_{<0})(x) = −x.
Definition 1.16. A function f : X −→ Y is:
(i) injective or one-to-one if f(x_1) = f(x_2) implies x_1 = x_2 for x_1, x_2 ∈ Dom(f);
(ii) surjective or onto if Ran(f) = Y;
(iii) total if Dom(f) = X.
If f is both injective and surjective, then it is a bijective function.
Theorem 1.3. A function f : X −→ Y is injective if and only if there exists a function g : Y −→ X such that g(f(x)) = x for every x ∈ Dom(f).
Proof. Suppose that f is an injective function. For x ∈ Dom(f) and y = f(x) define g(y) = x. Note that g is well defined, for if y = f(x_1) = f(x_2) then x_1 = x_2 due to the injectivity of f. It follows immediately that g(f(x)) = x for x ∈ Dom(f).
Conversely, suppose that there exists a function g : Y −→ X such that g(f(x)) = x for every x ∈ Dom(f). If f(x_1) = f(x_2), then x_1 = g(f(x_1)) = g(f(x_2)) = x_2, which proves that f is indeed injective.
The function g whose existence is established by Theorem 1.3 is said to be the left inverse of f . Thus, a function f : X −→ Y is injective if and only if it has a left inverse.
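The construction used in the proof of Theorem 1.3 is easy to carry out for a function given by a finite value table; the Python sketch below (illustrative only, with ad hoc names) builds the left inverse g by reversing the pairs of an injective f:

```python
# For an injective f on a finite domain, reversing its value table
# yields a left inverse g with g(f(x)) = x on Dom(f).
dom = range(5)
f = {x: 2 * x + 3 for x in dom}   # injective: distinct x give distinct values

# Reversing the pairs is well defined precisely because f is injective:
# no value y appears for two different arguments x.
g = {y: x for x, y in f.items()}

print(all(g[f[x]] == x for x in dom))  # True
```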
To prove a similar result concerning surjective functions we need to state a basic axiom of set theory.
The Axiom of Choice: Let C = {C_i | i ∈ I} be a collection of non-empty sets indexed by a set I. There exists a function φ : I −→ ⋃_{i∈I} C_i (known as a choice function) such that φ(i) ∈ C_i for each i ∈ I.
Theorem 1.4. A function f : X −→ Y is surjective if and only if there exists a function h : Y −→ X such that f(h(y)) = y for every y ∈ Y.
Proof. Suppose that f is a surjective function. The collection {f^{-1}(y) | y ∈ Y} indexed by Y consists of non-empty sets. By the Axiom of Choice there exists a choice function for this collection, that is, a function h : Y −→ ⋃_{y∈Y} f^{-1}(y) such that h(y) ∈ f^{-1}(y), or f(h(y)) = y, for y ∈ Y.
Conversely, suppose that there exists a function h : Y −→ X such that f(h(y)) = y for every y ∈ Y. Then f(x) = y for x = h(y), which shows that f is surjective.
The function h whose existence is established by Theorem 1.4 is said to be the right inverse of f. Thus, a function has a right inverse if and only if it is surjective.
Corollary 1.1. A function f : X −→ X is a bijection if and only if there exists a function k : X −→ X that is both a left inverse and a right inverse of f.
Proof. This statement follows from Theorems 1.3 and 1.4. Indeed, if f is a bijection, there exists a left inverse g : X −→ X such that g(f(x)) = x and a right inverse h : X −→ X such that f(h(x)) = x for every x ∈ X. Since h(x) ∈ X we have g(f(h(x))) = h(x), which implies g(x) = h(x) because f(h(x)) = x for x ∈ X. This yields g = h.
The relationship between the subsets of a set and characteristic functions defined on that set is discussed next.
Theorem 1.5. There is a bijection Ψ : P(S) −→ (S −→ {0, 1}) between the set of subsets of S and the set of characteristic functions defined on S.
Proof. For P ∈ P(S) define Ψ(P) = 1_P. The mapping Ψ is one-to-one. Indeed, assume that 1_P = 1_Q, where P, Q ∈ P(S). We have x ∈ P if and only if 1_P(x) = 1, which is equivalent to 1_Q(x) = 1. This happens if and only if x ∈ Q; hence, P = Q, so Ψ is one-to-one.
Let f : S −→ {0, 1} be an arbitrary function. Define the set T_f = {x ∈ S | f(x) = 1}. Then f is the characteristic function of the set T_f; hence, Ψ(T_f) = f, which shows that the mapping Ψ is also onto; hence, it is a bijection.
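For a finite S the bijection Ψ can be exhibited directly. In the Python sketch below (an illustration of ours, not from the text), a characteristic function is represented as a tuple of 0/1 values indexed by the elements of S:

```python
# Psi maps each subset P of S to its characteristic function 1_P,
# represented here as a tuple of 0/1 values over S; for finite S this
# correspondence is a bijection between P(S) and {0,1}^S.
from itertools import combinations

S = (0, 1, 2)
subsets = [frozenset(c) for r in range(len(S) + 1)
           for c in combinations(S, r)]

def psi(P):
    return tuple(1 if x in P else 0 for x in S)

images = {psi(P) for P in subsets}
# injectivity: distinct subsets yield distinct tuples, so no collisions;
# surjectivity: all 2^|S| tuples over {0,1} are obtained.
print(len(subsets), len(images), 2 ** len(S))  # 8 8 8
```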
Definition 1.17. A set S is indexed by a set I if there exists a surjection f : I −→ S. In this case we refer to I as an index set.
If S is indexed by the function f : I −→ S we write the element f(i) simply as s_i, if there is no risk of confusion.
Definition 1.18. A sequence of length n on a set X is a function x : {0, 1, . . . , n − 1} −→ X. At times we will use the same term to designate a function x : {1, . . . , n} −→ X.
Sequences are denoted by bold letters. If x is a sequence of length n we refer to x(i) as the i-th element of x; this element of X is usually denoted by x_i. We