
  Machine Learning

A Constraint-Based Approach


Marco Gori
Università di Siena

Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

Copyright © 2018 Elsevier Ltd. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

  Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-08-100659-7

For information on all Morgan Kaufmann publications, visit our website.

Publisher: Katey Birtcher
Acquisition Editor: Steve Merken
Editorial Project Manager: Peter Jardim
Production Project Manager: Punithavathy Govindaradjane
Designer: Miles Hitchen
Typeset by VTeX

  

To the Memory of My Father

who provided me with enough examples

to appreciate the importance of hard work

to achieve goals and to disclose

the beauty of knowledge.

Contents

Preface
Notes on the Exercises

CHAPTER 1  The Big Picture
  1.1  Why Do Machines Need to Learn?
       1.1.1  Learning Tasks
       1.1.2  Symbolic and Subsymbolic Representations of the Environment
       1.1.3  Biological and Artificial Neural Networks
       1.1.4  Protocols of Learning
       1.1.5  Constraint-Based Learning
  1.2  Principles and Practice
       1.2.1  The Puzzling Nature of Induction
       1.2.2  Learning Principles
       1.2.3  The Role of Time in Learning Processes
       1.2.4  Focus of Attention
  1.3  Hands-on Experience
       1.3.1  Measuring the Success of Experiments
       1.3.2  Handwritten Character Recognition
       1.3.3  Setting Up a Machine Learning Experiment
       1.3.4  Test and Experimental Remarks
  1.4  Challenges in Machine Learning
       1.4.1  Learning to See
       1.4.2  Speech Understanding
       1.4.3  Agents Living in Their Own Environment
  1.5  Scholia

CHAPTER 2  Learning Principles
  2.1  Environmental Constraints
       2.1.1  Loss and Risk Functions
       2.1.2  Ill-Position of Constraint-Induced Risk Functions
       2.1.3  Risk Minimization
       2.1.4  The Bias-Variance Dilemma
  2.2  Statistical Learning
       2.2.1  Maximum Likelihood Estimation
       2.2.2  Bayesian Inference
       2.2.3  Bayesian Learning
       2.2.4  Graphical Models
       2.2.5  Frequentist and Bayesian Approach
  2.3  Information-Based Learning
       2.3.1  A Motivating Example
       2.3.2  Principle of Maximum Entropy
       2.3.3  Maximum Mutual Information
  2.4  Learning Under the Parsimony Principle
       2.4.1  The Parsimony Principle
       2.4.2  Minimum Description Length
       2.4.3  MDL and Regularization
       2.4.4  Statistical Interpretation of Regularization
  2.5  Scholia

CHAPTER 3  Linear Threshold Machines
  3.1  Linear Machines
       3.1.1  Normal Equations
       3.1.2  Undetermined Problems and Pseudoinversion
       3.1.3  Ridge Regression
       3.1.4  Primal and Dual Representations
  3.2  Linear Machines With Threshold Units
       3.2.1  Predicate-Order and Representational Issues
       3.2.2  Optimality for Linearly-Separable Examples
       3.2.3  Failing to Separate
  3.3  Statistical View
       3.3.1  Bayesian Decision and Linear Discrimination
       3.3.2  Logistic Regression
       3.3.3  The Parsimony Principle Meets the Bayesian Decision
       3.3.4  LMS in the Statistical Framework
  3.4  Algorithmic Issues
       3.4.1  Gradient Descent
       3.4.2  Stochastic Gradient Descent
       3.4.3  The Perceptron Algorithm
       3.4.4  Complexity Issues
  3.5  Scholia

CHAPTER 4  Kernel Machines
  4.1  Feature Space
       4.1.1  Polynomial Preprocessing
       4.1.2  Boolean Enrichment
       4.1.3  Invariant Feature Maps
       4.1.4  Linear-Separability in High-Dimensional Spaces
  4.2  Maximum Margin Problem
       4.2.1  Classification Under Linear-Separability
       4.2.2  Dealing With Soft-Constraints
       4.2.3  Regression
  4.3  Kernel Functions
       4.3.1  Similarity and Kernel Trick
       4.3.2  Characterization of Kernels
       4.3.3  The Reproducing Kernel Map
       4.3.4  Types of Kernels
  4.4  Regularization
       4.4.1  Regularized Risks
       4.4.2  Regularization in RKHS
       4.4.3  Minimization of Regularized Risks
       4.4.4  Regularization Operators
  4.5  Scholia

CHAPTER 5  Deep Architectures
  5.1  Architectural Issues
       5.1.1  Digraphs and Feedforward Networks
       5.1.2  Deep Paths
       5.1.3  From Deep to Relaxation-Based Architectures
       5.1.4  Classifiers, Regressors, and Auto-Encoders
  5.2  Realization of Boolean Functions
       5.2.1  Canonical Realizations by and-or Gates
       5.2.2  Universal nand Realization
       5.2.3  Shallow vs Deep Realizations
       5.2.4  LTU-Based Realizations and Complexity Issues
  5.3  Realization of Real-Valued Functions
       5.3.1  Computational Geometry-Based Realizations
       5.3.2  Universal Approximation
       5.3.3  Solution Space and Separation Surfaces
       5.3.4  Deep Networks and Representational Issues
  5.4  Convolutional Networks
       5.4.1  Kernels, Convolutions, and Receptive Fields
       5.4.2  Incorporating Invariance
       5.4.3  Deep Convolutional Networks
  5.5  Learning in Feedforward Networks
       5.5.1  Supervised Learning
       5.5.2  Backpropagation
       5.5.3  Symbolic and Automatic Differentiation
       5.5.4  Regularization Issues
  5.6  Complexity Issues
       5.6.1  On the Problem of Local Minima
       5.6.2  Facing Saturation
       5.6.3  Complexity and Numerical Issues
  5.7  Scholia

CHAPTER 6  Learning and Reasoning With Constraints
  6.1  Constraint Machines
       6.1.1  Walking Through Learning and Inference
       6.1.2  A Unified View of Constrained Environments
       6.1.3  Functional Representation of Learning Tasks
       6.1.4  Reasoning With Constraints
  6.2  Logic Constraints in the Environment
       6.2.1  Formal Logic and Complexity of Reasoning
       6.2.2  Environments With Symbols and Subsymbols
       6.2.3  T-Norms
       6.2.4  Łukasiewicz Propositional Logic
  6.3  Diffusion Machines
       6.3.1  Data Models
       6.3.2  Diffusion in Spatiotemporal Environments
       6.3.3  Recurrent Neural Networks
  6.4  Algorithmic Issues
       6.4.1  Pointwise Content-Based Constraints
       6.4.2  Propositional Constraints in the Input Space
       6.4.3  Supervised Learning With Linear Constraints
       6.4.4  Learning Under Diffusion Constraints
  6.5  Life-Long Learning Agents
       6.5.1  Cognitive Action and Temporal Manifolds
       6.5.2  Energy Balance
       6.5.3  Focus of Attention, Teaching, and Active Learning
       6.5.4  Developmental Learning
  6.6  Scholia

CHAPTER 7  Epilogue

CHAPTER 8  Answers to Exercises
  Section 1.1
  Section 1.2
  Section 1.3
  Section 2.1
  Section 2.2
  Section 3.1
  Section 3.2
  Section 3.3
  Section 3.4
  Section 4.1
  Section 4.2
  Section 4.3
  Section 4.4
  Section 5.1
  Section 5.2
  Section 5.3
  Section 5.4
  Section 5.5
  Section 5.7
  Section 6.1
  Section 6.2
  Section 6.3
  Section 6.4

Appendix A  Constrained Optimization in Finite Dimensions
Appendix B  Regularization Operators
Appendix C  Calculus of Variations
  C.1  Functionals and Variations
  C.2  Basic Notion on Variations
  C.3  Euler-Lagrange Equations
  C.4  Variational Problems With Subsidiary Conditions
Appendix D  Index to Notation

Bibliography
Index

  Preface

Machine learning and information-based laws of cognition.

Machine Learning projects our ultimate desire to understand the essence of human intelligence onto the space of technology. As such, while it cannot be fully understood within the restricted field of computer science, it is not necessarily the search for clever emulations of human cognition. While digging into the secrets of neuroscience might stimulate refreshing ideas on the computational processes behind intelligence, most of today's advances in machine learning rely on models rooted in mathematics and on their corresponding computer implementations. Although brain science will likely continue its path towards intriguing connections with artificial computational schemes, one might reasonably conjecture that the basis for the emergence of cognition should not necessarily be sought in the astonishing complexity of biological solutions, but mostly in higher-level computational laws. The biological solutions that support different forms of cognition are in fact cryptically interwoven with the parallel need to support other fundamental life functions, like metabolism, growth, body weight regulation, and stress response. However, most human-like intelligent processes might emerge regardless of this complex environment. One might reasonably suspect that those processes are the outcome of information-based laws of cognition that hold regardless of biology. There is clear evidence of such an invariance in specific cognitive tasks, but the challenge of artificial intelligence is daily enriching the range of those tasks. While no one is surprised anymore to see the computer's power in math and logic operations, the layman is not yet very well aware of the outcome of the challenges on games. Games are in fact commonly regarded as a distinctive sign of intelligence, and it is striking to realize that they are already mostly dominated by computer programs! Sam Loyd's 15 puzzle and the Rubik's cube are nice examples of the successes of computer programs in classic puzzles. Chess and, more recently, Go clearly indicate that machines have undermined the long-lasting reign of human intelligence. However, many cognitive skills in language, vision, and motor control, which likely rely strongly on learning, are still very hard to achieve.

Looking inside the book.

This book drives the reader into the fascinating field of machine learning by offering a unified view of the discipline that relies on modeling the environment as an appropriate collection of constraints that the agent is expected to satisfy. Nearly every task that has been faced in machine learning can be modeled under this mathematical framework. Linear and threshold linear machines, neural networks, and kernel machines are mostly regarded as adaptive models that need to softly satisfy a set of pointwise constraints corresponding to the training set. The classic risk, in both its functional and empirical forms, can be regarded as a penalty function to be minimized in a soft-constrained system. Unsupervised learning can be given a similar formulation, where the penalty function somewhat offers an interpretation of the data probability distribution. Information-based indexes can be used to extract unsupervised features, and they can clearly be thought of as a way of enforcing soft constraints. An intelligent agent, however, can also strongly benefit from the acquisition of abstract granules of knowledge given in some logic formalism. While



artificial intelligence has achieved a remarkable degree of maturity in the topics of knowledge representation and automated reasoning, the foundational theories, which are mostly rooted in logic, lead to models that cannot be tightly integrated with machine learning. By regarding symbolic knowledge bases as collections of constraints, this book draws a path towards a deep integration with machine learning that relies on the idea of adopting multivalued logic formalisms, like in fuzzy systems. Special attention is reserved for deep learning, which nicely fits the constraint-based approach followed in this book. Some recent foundational achievements on representational issues and learning, joined with the appropriate exploitation of parallel computation, have been a fantastic catalyst for the growth of high-tech companies in related fields all around the world. In this book I do my best to jointly disclose the power of deep learning and its interpretation in the framework of constrained environments, while warning against uncritical blessing. In so doing, I hope to stimulate the reader to acquire the background needed to quickly grasp future innovations as well.

Throughout the book, I expect the reader to become fully involved in the discipline, so as to mature his or her own view, rather than settling into frameworks served up by others. The book gives a refreshing approach to the basic models and algorithms of machine learning, where the focus on constraints nicely leads to dismissing the classic distinction between supervised, unsupervised, and semi-supervised learning. Here are some of the book's features:

• It is an introductory book for all readers who love in-depth explanations of fundamental concepts.
• It is intended to stimulate questions and to help the reader gradually master basic methods, rather than to offer “recipes for cooking.”
• It proposes the adoption of the notion of constraint as a truly unified treatment of today's most common machine learning approaches, while incorporating the strength of the logic formalisms that dominate in the AI community.
• It contains a lot of exercises along with their answers, according to a slight modification of Donald Knuth's difficulty ranking.
• It comes with a companion website to assist with practical issues.

  The reader.

The book has been conceived for readers with a basic background in mathematics and computer science. More advanced topics are supported by appropriate appendices. The reader is strongly invited to act critically and to complement the acquisition of the concepts with the proposed exercises, anticipating solutions and checking them later in the part “Answers to the Exercises.” My major aim while writing this book has been to present concepts and results in such a way that the reader feels the same excitement as those who discovered them. More than a passive reading, the reader is expected to become fully involved in the discipline and play a truly active role. Nowadays, one can quickly access the basic ideas and begin working with most common machine learning topics thanks to great web resources based on nice illustrations and wonderful simulations. They offer prompt, yet effective support for everybody who wants to access the field. A book on machine learning can hardly


compete with the explosive growth of similar web resources in presenting good recipes for the fast development of applications. However, if you look for an in-depth understanding of the discipline, then you must shift the focus to foundations and spend more time on the basic principles that are likely to hold for many algorithms and technical solutions used in real-world applications. The most important goal in writing this book was to present foundational ideas and provide a unified view centered around information-based laws of learning. It grew out of material collected during Master's and PhD courses given mostly at the University of Siena, and it was gradually enriched by my own viewpoint of interpreting learning under the unifying notion of environmental constraint. Considering the important role of web resources, this is an appropriate textbook for Master's courses in machine learning, and it can also complement courses on pattern recognition, data mining, and related disciplines. Some parts of the book are more appropriate for courses at the PhD level. In addition, some of the proposed exercises, which are properly identified, are in fact a careful selection of research problems that represent a challenge for PhD students. While the book has been primarily conceived for students in computer science, its overall organization and the way the topics are covered will likely also stimulate the interest of students in physics and mathematics.

While writing the book I was constantly stimulated by the need to quench my own thirst for knowledge in the field, and by the challenge of passing through the main principles in a unified way. I got in touch with the immense literature in the field and discovered my ignorance of remarkable ideas and technical developments. I learned a lot and enjoyed the act of retracing results and their discovery. It was really a pleasure, and I hope the reader experiences the same feeling while reading this book.

  Siena

  Marco Gori

  July 2017

ACKNOWLEDGMENTS

As usually happens, it's hard not to forget people who have played a role in this book. An overall thanks goes to all who taught me, in different ways, how to find the reasons and the logic within the scheme of things. It's hard to make a list, but they definitely contributed to the growth of my desire to understand human intelligence and to study and design intelligent machines; that desire is likely to be the seed of this book. Most of what I've written comes from lecturing Master's and PhD courses on machine learning, and from re-elaborating ideas and discussions with colleagues and students at the AI lab of the University of Siena over the last decade. Many insightful discussions with C. Lee Giles, Ah Chung Tsoi, Paolo Frasconi, and Alessandro Sperduti contributed to shaping the view of recurrent neural networks as diffusion machines presented in this book. My viewpoint of learning from constraints was gradually given the shape that you can find in this book also thanks to the interaction with Marcello Sanguineti, Giorgio Gnecco, and Luciano Serafini. The criticisms of benchmarks, along with the proposal of crowdsourcing evaluation schemes, emerged thanks to the contribution of Marcello Pelillo and Fabio Roli, who collaborated with me in the organization of a few events on the topic. I'm indebted to Patrick Gallinari, who invited me to spend part of the 2016 summer at LIP6, Université Pierre et Marie Curie, Paris, where I found a very stimulating environment for writing this book. The follow-up of my seminars gave rise to insightful discussions with colleagues and students in the lab. The collaboration with Stefan Knerr significantly influenced my view on the role of learning in natural language processing. Most of the advanced topics covered in this book benefited from his long-term vision of the role of machine learning in conversational agents. I also benefited from the accurate checks and suggestions by Beatrice Lazzerini and Francesco Giannini on some parts of the book.

Alessandro Betti deserves a special mention. His careful and in-depth reading gave rise to remarkable changes in the book. Not only did he discover errors, but he also came up with proposals for alternative presentations, as well as related interpretations of basic concepts. A number of research-oriented exercises were included in the book after our long daily stimulating discussions. Finally, his advice and support on LaTeX typesetting have been extremely useful.

I thank Lorenzo Menconi and Agnese Gori for their artwork on the cover and in the opening-chapter pictures, respectively. Finally, thanks to Cecilia, Irene, and Agnese for having tolerated my elsewhere-mind during the weekends of work on the book, and for their continuous support to a Cyborg who was hovering from one room to another with his inseparable laptop.


READING GUIDELINES

Most of the book chapters are self-contained, so that one can profitably start reading Chapter 4 on kernel machines or Chapter 5 on deep architectures without having read the first three chapters. Even though Chapter 6 is on more advanced topics, it can be read independently of the rest of the book. The big picture given in Chapter 1 offers the reader a quick discussion of the main topics of the book, while Chapter 2, which could also be omitted at a first reading, provides a general framework of learning principles that surely facilitates an in-depth analysis of the subsequent topics. Finally, Chapter 3 on linear and linear-threshold machines is perhaps the simplest way to start the acquisition of machine learning foundations. It is not only of historical interest; it is extremely important for appreciating what a deep understanding of architectural and learning issues means, which is very hard to achieve for other more complex models. Advanced topics in the book are indicated by the “dangerous-bend” and “double dangerous-bend” symbols, while research topics are denoted by the “work-in-progress” symbol.

Notes on the Exercises

While reading the book, the reader is stimulated to retrace and rediscover the main principles and results. The acquisition of new topics challenges the reader to fill in some missing pieces to compose the final puzzle. This is supported by exercises at the end of each section that are designed for self-study as well as for classroom study. Following the organization of Donald Knuth's books, this way of presenting the material relies on the belief that “we all learn best the things that we have discovered for ourselves.” The exercises are properly classified and also rated, so as to convey the expected degree of difficulty. A major distinction concerns exercises versus research problems. Throughout the book, the reader will find exercises that have been mostly conceived for the deep acquisition of main concepts and for completing the view proposed in the book. However, there are also a number of research problems that I think can be interesting especially for PhD students. Those problems are properly framed in the book's discussion; they are precisely formulated and are selected because of their scientific relevance; in principle, solving one of them is the objective of a research paper.

Exercises and research problems are assessed by following the scheme below, which is mostly based on Donald Knuth's ratings.¹

Rating  Interpretation

00  An extremely easy exercise that can be answered immediately if the material of the text has been understood; such an exercise can almost always be worked “in your head.”

10  A simple problem that makes you think over the material just read, but is by no means difficult. You should be able to do this in one minute at most; pencil and paper may be useful in obtaining the solution.

20  An average problem that tests basic understanding of the text material, but you may need about 15 or 20 minutes to answer it completely.

30  A problem of moderate difficulty and/or complexity; this one may involve more than two hours' work to solve satisfactorily, or even more if the TV is on.

40  Quite a difficult or lengthy problem that would be suitable for a term project in classroom situations. A student should be able to solve the problem in a reasonable amount of time, but the solution is not trivial.

50  A research problem that has not yet been solved satisfactorily, as far as the author knew at the time of writing, although many people have tried. If you have found an answer to such a problem, you ought to write it up for publication; furthermore, the author of this book would appreciate hearing about the solution as soon as possible (provided that it is correct).

¹ The rating interpretation is verbatim from [198].



Roughly speaking, this is a sort of “logarithmic” scale, so that an increment of the score reflects an exponential increment of difficulty. We also adhere to an interesting rule of Knuth's on the balance between the amount of work required and the degree of creativity needed to solve an exercise. The idea is that the remainder of the rating number divided by 5 gives an indication of the amount of work required. “Thus, an exercise rated 24 may take longer to solve than an exercise that is rated 25, but the latter will require more creativity.” As already pointed out, research problems are clearly identified by the rating 50. Regardless of my efforts to provide an appropriate ranking of the exercises, the reader might disagree with the attached rating, but I hope that the numbers offer at least a good preliminary idea of the difficulty of the exercises. Readers of this book might have remarkably different degrees of mathematical and computer science training. A rating preceded by an M indicates that the exercise is oriented more to students with a good background in math and, especially, to PhD students. A rating preceded by a C indicates that the exercise requires computer development. Most of these exercises can be term projects in Master's and PhD courses on machine learning (see the website of the book). Exercises marked as recommended are expected to be especially instructive and are especially recommended.

Solutions to most of the exercises appear in the answer chapter. To meet the challenge, the reader should refrain from using this chapter or, at least, should turn to the answers only when he/she cannot figure out what the solution is. One reason for this recommendation is that he/she might come up with a different solution, and can then check the answer later and appreciate the difference.

Summary of codes:

00  Immediate
10  Simple (one minute)
20  Medium (quarter hour)
30  Moderately hard
40  Term project
50  Research problem

Recommended
C   Computer development
M   Mathematically oriented
HM  Requiring “higher math”

CHAPTER 1  The Big Picture

Let's start!

This chapter gives the big picture of the book. Reading it offers an overall view of current machine learning challenges, after a discussion of principles and their concrete application to real-world problems. The chapter introduces the intriguing topic of induction, showing its puzzling nature as well as its necessity in any task that involves perceptual information.

1.1 WHY DO MACHINES NEED TO LEARN?

The metalevel of machine learning.

Why do machines need to learn? Don't they just run the program, which simply solves a given problem? Aren't programs only the fruit of human creativity, so that machines simply execute them efficiently? No one should start reading a machine learning book without having answered these questions. Interestingly, we can easily see that the classic way of thinking about computer programming, namely algorithms that express our own solutions by linguistic statements, isn't adequate to face many challenging real-world problems. We do need to introduce a metalevel where, more than formalizing our own solutions by programs, we conceive algorithms whose purpose is to describe how machines learn to execute the task.

Handwritten characters: The 2^d warning!

As an example, let us consider the case of handwritten character recognition. To make things easy, we assume that an intelligent agent is expected to recognize characters that are generated using black and white pixels only, as shown in the figure. We will show that even this dramatic simplification doesn't significantly reduce the difficulty of facing this problem with algorithms based on our own understanding of regularities. One early realizes that human-based decision processes are very difficult to encode into precise algorithmic formulations. How can we provide a formal description of the character “2”? The instance in the above picture suggests how tentative algorithmic descriptions of the class can become brittle. A possible way of getting rid of this difficulty is to try a brute force approach, where all possible pictures on the retina with the chosen resolution are stored in a table, along with the corresponding class code. The above 8 × 8 resolution character is converted into a Boolean string of 64 bits by scanning the picture by rows:

0001100000100100000000100000001000000010100001000111110000000011.    (1.1.1)

Char recognition by searching a table.

Of course, we can construct tables with similar strings, along with the associated class code. In so doing, handwritten char recognition would simply be reduced to the problem of searching a table. Unfortunately, we are in front of a table with 2^64 = 18446744073709551616 items, and each of them will occupy 8 bytes, for a total of approximately 147 quintillion (147 × 10^18) bytes, which makes the adoption of such a plain solution totally unreasonable. Even a resolution as small as 5 × 6 requires storing 1 billion records, but just the increment to 6 × 7 would require storing about 4 trillion records! For all of them, the programmer would be expected to be patient enough to complete the table with the associated class code. This simple example is a sort of 2^d warning message: as d grows towards values that are ordinarily used for the retina resolution, the space of the table becomes prohibitive.
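To get a concrete feel for these numbers, here is a minimal Python sketch (illustrative only, not from the book): it performs the row scanning behind Eq. (1.1.1) on the 8 × 8 bitmap encoded by that very string, and estimates the size of a brute-force lookup table for a few retina resolutions.

```python
# Illustrative sketch of the 2^d warning: row-scan encoding and lookup-table sizes.

def row_scan(bitmap):
    """Flatten a 2D black/white bitmap into a Boolean string by scanning its rows."""
    return "".join(str(pixel) for row in bitmap for pixel in row)

# The 8x8 bitmap whose row scanning yields the string of Eq. (1.1.1).
char_2 = [
    [0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
print(row_scan(char_2))  # prints the 64-bit string of Eq. (1.1.1)

# A brute-force table needs 2^d entries; assume 8 bytes per entry, as in the text.
for rows, cols in [(5, 6), (6, 7), (8, 8)]:
    d = rows * cols
    entries = 2 ** d
    print(f"{rows}x{cols}: d = {d}, entries = {entries}, about {8.0 * entries:.2e} bytes")
```

The last loop reproduces the figures quoted above: roughly a billion records at 5 × 6, about 4 trillion at 6 × 7, and on the order of 1.5 × 10^20 bytes at 8 × 8.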

Segmentation might be as difficult as recognition!

There is more:
– We have made the tacit assumption that the characters are provided by a reliable segmentation program, which extracts them properly from a given form. While this might be reasonable in simple contexts, in others segmenting the characters might be as difficult as recognizing them. In vision and speech perception, nature seems to have fun in making segmentation hard. For example, the word segmentation of speech utterances cannot rely on thresholding analyses to identify low levels of the signal; unfortunately, those analyses are doomed to fail. The sentence “computers are attacking the secret of intelligence”, quickly pronounced, would likely be segmented as

com / pu / tersarea / tta / ckingthesecre / tofin / telligence.

The signal is nearly null before the explosion of the voiceless plosives p, t, k, whereas, because of phoneme coarticulation, no level-based separation between contiguous words is reliable. Something similar happens in vision. Overall, it looks like segmentation is a truly cognitive process that in most interesting tasks does require understanding the information source.

1.1.1 LEARNING TASKS

Agent: χ : E → O.

Intelligent agents interact with the environment, from which they are expected to learn, with the purpose of solving assigned tasks. In many interesting real-world problems we can make the reasonable assumption that the intelligent agent interacts with the environment through distinct segmented elements e ∈ E of the learning environment, on which it is expected to take a decision. Basically, we assume somebody else has already faced and solved the segmentation problem, and that the agent only processes single elements from the environment. Hence, the agent can be regarded as a function χ : E → O, where the decision result is an element of O. For example, when performing optical character recognition in plain text, the character segmentation can take place by algorithms that must locate the row/column transitions from the text to the background. This is quite simple, unless the level of noise in the image document is pretty high.

χ = h ◦ f ◦ π, where π is the input encoding, f is the learning function, and h is the output encoding.

In general, the agent requires an opportune internal representation of elements in E and O, so that we can think of χ as the composition χ = h ◦ f ◦ π. Here π : E → X is a preprocessing map that associates every element of the environment e with a point x = π(e) in the input space X, f : X → Y is the function that takes the decision y = f(x) on x, while h : Y → O maps y onto the output o = h(y).
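As a minimal illustration (a sketch with placeholder functions, not code from the book), the composition χ = h ◦ f ◦ π can be wired up as follows; π is the row-scanning input encoding, f stands in for the learned function returning ten class scores, and h decodes the scores into the output label:

```python
# Sketch of the agent chi = h o f o pi; f is only a stand-in for a learned model.

def pi(bitmap):
    """Input encoding pi: E -> X, row-scanning a 2D bitmap into a flat vector."""
    return [pixel for row in bitmap for pixel in row]

def f(x):
    """Placeholder for the learned function f: X -> Y.
    A real machine would compute the ten class scores from x; here we return a
    fixed vector that votes for the class "2", just to close the loop."""
    return [0.0, 0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

def h(y):
    """Output encoding h: Y -> O, mapping the score vector to the decided digit."""
    return max(range(len(y)), key=lambda i: y[i])

def chi(e):
    """The agent chi = h o f o pi acting on an environmental element e."""
    return h(f(pi(e)))

print(chi([[0] * 8 for _ in range(8)]))  # -> 2, since the placeholder f votes for "2"
```

Called on the 8 × 8 bitmap of the handwritten “2”, such an agent returns 2, mirroring the chain discussed next; in a real system only f would be learned from data, while π and h are fixed encodings.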

In the above handwritten character recognition task we assume that we are given a low resolution camera, so that the picture can be regarded as a point in the environment space E. This element can be represented, as suggested by Eq. (1.1.1), as an element of a 64-dimensional Boolean hypercube (i.e., X ⊂ R^64). Basically, in this case π simply maps the Boolean matrix to a Boolean vector by row scanning, in such a way that there is no information loss when passing from e to x. As will be shown later, on the other hand, the preprocessing function π typically returns a pattern representation with information loss with respect to the original environmental representation e ∈ E. Function f maps this representation onto the one-hot encoding of the number 2 and, finally, h transforms this code into a representation of the same number that is more suitable for the task at hand:

e  −π→  (0, 0, 0, 1, 1, 0, 0, 0, . . . , 0, 0, 0, 0, 0, 0, 1, 1)  −f→  (0, 0, 1, 0, 0, 0, 0, 0, 0, 0)  −h→  2,

where e is the picture of the handwritten “2”. Overall, the action of χ can be nicely written as χ(e) = 2. In many learning machines, the output encoding function h plays a more important role, which consists of converting real-valued representations y = f(x) ∈ R^10 into the corresponding one-hot representation. For example, in this case, one could simply choose h such that h_i(y) = δ(i, arg max_κ y_κ), where δ denotes the Kronecker delta. In doing so, the