
Table of Contents
Mastering Java Machine Learning
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Machine Learning Review

Machine learning – history and definition
What is not machine learning?
Machine learning – concepts and terminology
Machine learning – types and subtypes
Datasets used in machine learning
Machine learning applications
Practical issues in machine learning
Machine learning – roles and process
Roles
Process
Machine learning – tools and datasets
Datasets
Summary
2. Practical Approach to Real-World Supervised Learning
Formal description and notation
Data quality analysis
Descriptive data analysis
Basic label analysis
Basic feature analysis


Visualization analysis
Univariate feature analysis
Categorical features
Continuous features
Multivariate feature analysis
Data transformation and preprocessing
Feature construction
Handling missing values
Outliers
Discretization
Data sampling
Is sampling needed?
Undersampling and oversampling
Stratified sampling
Training, validation, and test set
Feature relevance analysis and dimensionality reduction
Feature search techniques
Feature evaluation techniques
Filter approach
Univariate feature selection

Information theoretic approach
Statistical approach
Multivariate feature selection
Minimal redundancy maximal relevance (mRMR)
Correlation-based feature selection (CFS)
Wrapper approach
Embedded approach
Model building
Linear models
Linear Regression
Algorithm input and output
How does it work?
Advantages and limitations
Naïve Bayes
Algorithm input and output
How does it work?
Advantages and limitations
Logistic Regression
Algorithm input and output
How does it work?

Advantages and limitations
Non-linear models
Decision Trees

Algorithm inputs and outputs
How does it work?
Advantages and limitations
K-Nearest Neighbors (KNN)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Support vector machines (SVM)
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Ensemble learning and meta learners
Bootstrap aggregating or bagging
Algorithm inputs and outputs
How does it work?
Random Forest

Advantages and limitations
Boosting
Algorithm inputs and outputs
How does it work?
Advantages and limitations
Model assessment, evaluation, and comparisons
Model assessment
Model evaluation metrics
Confusion matrix and related metrics
ROC and PRC curves
Gain charts and lift curves
Model comparisons
Comparing two algorithms
McNemar's Test
Paired-t test
Wilcoxon signed-rank test
Comparing multiple algorithms
ANOVA test
Friedman's test
Case Study – Horse Colic Classification

Business problem
Machine learning mapping
Data analysis
Label analysis
Features analysis
Supervised learning experiments
Weka experiments

Sample end-to-end process in Java
Weka experimenter and model selection
RapidMiner experiments
Visualization analysis
Feature selection
Model process flow
Model evaluation metrics
Evaluation on Confusion Metrics
ROC Curves, Lift Curves, and Gain Charts
Results, observations, and analysis
Summary
References

3. Unsupervised Machine Learning Techniques
Issues in common with supervised learning
Issues specific to unsupervised learning
Feature analysis and dimensionality reduction
Notation
Linear methods
Principal component analysis (PCA)
Inputs and outputs
How does it work?
Advantages and limitations
Random projections (RP)
Inputs and outputs
How does it work?
Advantages and limitations
Multidimensional Scaling (MDS)
Inputs and outputs
How does it work?
Advantages and limitations
Nonlinear methods
Kernel Principal Component Analysis (KPCA)

Inputs and outputs
How does it work?
Advantages and limitations
Manifold learning
Inputs and outputs
How does it work?
Advantages and limitations
Clustering
Clustering algorithms
k-Means
Inputs and outputs

How does it work?
Advantages and limitations
DBSCAN
Inputs and outputs
How does it work?
Advantages and limitations
Mean shift
Inputs and outputs

How does it work?
Advantages and limitations
Expectation maximization (EM) or Gaussian mixture modeling (GMM)
Input and output
How does it work?
Advantages and limitations
Hierarchical clustering
Input and output
How does it work?
Advantages and limitations
Self-organizing maps (SOM)
Inputs and outputs
How does it work?
Advantages and limitations
Spectral clustering
Inputs and outputs
How does it work?
Advantages and limitations
Affinity propagation
Inputs and outputs

How does it work?
Advantages and limitations
Clustering validation and evaluation
Internal evaluation measures
Notation
R-Squared
Dunn's Indices
Davies-Bouldin index
Silhouette's index
External evaluation measures
Rand index
F-Measure
Normalized mutual information index
Outlier or anomaly detection
Outlier algorithms

Statistical-based
Inputs and outputs
How does it work?
Advantages and limitations

Distance-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Density-based methods
Inputs and outputs
How does it work?
Advantages and limitations
Clustering-based methods
Inputs and outputs
How does it work?
Advantages and limitations
High-dimensional-based methods
Inputs and outputs
How does it work?
Advantages and limitations
One-class SVM
Inputs and outputs
How does it work?
Advantages and limitations
Outlier evaluation techniques
Supervised evaluation
Unsupervised evaluation
Real-world case study
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Feature analysis and dimensionality reduction
PCA
Random projections
ISOMAP
Observations on feature analysis and dimensionality reduction
Clustering models, results, and evaluation
Observations and clustering analysis
Outlier models, results, and evaluation
Observations and analysis

Summary
References
4. Semi-Supervised and Active Learning
Semi-supervised learning
Representation, notation, and assumptions
Semi-supervised learning techniques
Self-training SSL
Inputs and outputs
How does it work?
Advantages and limitations
Co-training SSL or multi-view SSL
Inputs and outputs
How does it work?
Advantages and limitations
Cluster and label SSL
Inputs and outputs
How does it work?
Advantages and limitations
Transductive graph label propagation
Inputs and outputs
How does it work?
Advantages and limitations
Transductive SVM (TSVM)
Inputs and outputs
How does it work?
Advantages and limitations
Case study in semi-supervised learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data quality analysis
Data sampling and transformation
Datasets and analysis
Feature analysis results
Experiments and results
Analysis of semi-supervised learning
Active learning
Representation and notation
Active learning scenarios
Active learning approaches
Uncertainty sampling
How does it work?

Least confident sampling
Smallest margin sampling
Label entropy sampling
Advantages and limitations
Version space sampling
Query by disagreement (QBD)
How does it work?
Query by Committee (QBC)
How does it work?
Advantages and limitations
Data distribution sampling
How does it work?
Expected model change
Expected error reduction
Variance reduction
Density weighted methods
Advantages and limitations
Case study in active learning
Tools and software
Business problem
Machine learning mapping
Data Collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Pool-based scenarios
Stream-based scenarios
Analysis of active learning results
Summary
References
5. Real-Time Stream Machine Learning
Assumptions and mathematical notations
Basic stream processing and computational techniques
Stream computations
Sliding windows
Sampling
Concept drift and drift detection
Data management
Partial memory
Full memory
Detection methods
Monitoring model evolution
Widmer and Kubat

Drift Detection Method or DDM
Early Drift Detection Method or EDDM
Monitoring distribution changes
Welch's t test
Kolmogorov-Smirnov's test
CUSUM and Page-Hinckley test
Adaptation methods
Explicit adaptation
Implicit adaptation
Incremental supervised learning
Modeling techniques
Linear algorithms
Online linear models with loss functions
Inputs and outputs
How does it work?
Advantages and limitations
Online Naïve Bayes
Inputs and outputs
How does it work?
Advantages and limitations
Non-linear algorithms
Hoeffding trees or very fast decision trees (VFDT)
Inputs and outputs
How does it work?
Advantages and limitations
Ensemble algorithms
Weighted majority algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Bagging algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Online Boosting algorithm
Inputs and outputs
How does it work?
Advantages and limitations
Validation, evaluation, and comparisons in online setting
Model validation techniques
Prequential evaluation
Holdout evaluation
Controlled permutations

Evaluation criteria
Comparing algorithms and metrics
Incremental unsupervised learning using clustering
Modeling techniques
Partition based
Online k-Means
Inputs and outputs
How does it work?
Advantages and limitations
Hierarchical based and micro clustering
Inputs and outputs
How does it work?
Advantages and limitations
Inputs and outputs
How does it work?
Advantages and limitations
Density based
Inputs and outputs
How does it work?
Advantages and limitations
Grid based
Inputs and outputs
How does it work?
Advantages and limitations
Validation and evaluation techniques
Key issues in stream cluster evaluation
Evaluation measures
Cluster Mapping Measures (CMM)
V-Measure
Other external measures
Unsupervised learning using outlier detection
Partition-based clustering for outlier detection
Inputs and outputs
How does it work?
Advantages and limitations
Distance-based clustering for outlier detection
Inputs and outputs
How does it work?
Exact Storm
Abstract-C
Direct Update of Events (DUE)
Micro Clustering based Algorithm (MCOD)
Approx Storm

Advantages and limitations
Validation and evaluation techniques
Case study in stream learning
Tools and software
Business problem
Machine learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Supervised learning experiments
Concept drift experiments
Clustering experiments
Outlier detection experiments
Analysis of stream learning results
Summary
References
6. Probabilistic Graph Modeling
Probability revisited
Concepts in probability
Conditional probability
Chain rule and Bayes' theorem
Random variables, joint, and marginal distributions
Marginal independence and conditional independence
Factors
Factor types
Distribution queries
Probabilistic queries
MAP queries and marginal MAP queries
Graph concepts
Graph structure and properties
Subgraphs and cliques
Path, trail, and cycles
Bayesian networks
Representation
Definition
Reasoning patterns
Causal or predictive reasoning
Evidential or diagnostic reasoning
Intercausal reasoning
Combined reasoning
Independencies, flow of influence, D-Separation, I-Map
Flow of influence

D-Separation
I-Map
Inference
Elimination-based inference
Variable elimination algorithm
Input and output
How does it work?
Advantages and limitations
Clique tree or junction tree algorithm
Input and output
How does it work?
Advantages and limitations
Propagation-based techniques
Belief propagation
Factor graph
Messaging in factor graph
Input and output
How does it work?
Advantages and limitations
Sampling-based techniques
Forward sampling with rejection
Input and output
How does it work?
Advantages and limitations
Learning
Learning parameters
Maximum likelihood estimation for Bayesian networks
Bayesian parameter estimation for Bayesian network
Prior and posterior using the Dirichlet distribution
Learning structures
Measures to evaluate structures
Methods for learning structures
Constraint-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Search and score-based techniques
Inputs and outputs
How does it work?
Advantages and limitations
Markov networks and conditional random fields
Representation
Parameterization

Gibbs parameterization
Factor graphs
Log-linear models
Independencies
Global
Pairwise Markov
Markov blanket
Inference
Learning
Conditional random fields
Specialized networks
Tree augmented network
Input and output
How does it work?
Advantages and limitations
Markov chains
Hidden Markov models
Most probable path in HMM
Posterior decoding in HMM
Tools and usage
OpenMarkov
Weka Bayesian Network GUI
Case study
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Analysis of results
Summary
References
7. Deep Learning
Multi-layer feed-forward neural network
Inputs, neurons, activation function, and mathematical notation
Multi-layered neural network
Structure and mathematical notations
Activation functions in NN
Sigmoid function
Hyperbolic tangent ("tanh") function
Training neural network
Empirical risk minimization
Parameter initialization
Loss function

Gradients
Gradient at the output layer
Gradient at the Hidden Layer
Parameter gradient
Feed forward and backpropagation
How does it work?
Regularization
L2 regularization
L1 regularization
Limitations of neural networks
Vanishing gradients, local optimum, and slow training
Deep learning
Building blocks for deep learning
Rectified linear activation function
Restricted Boltzmann Machines
Definition and mathematical notation
Conditional distribution
Free energy in RBM
Training the RBM
Sampling in RBM
Contrastive divergence
Inputs and outputs
How does it work?
Persistent contrastive divergence
Autoencoders
Definition and mathematical notations
Loss function
Limitations of Autoencoders
Denoising Autoencoder
Unsupervised pre-training and supervised fine-tuning
Deep feed-forward NN
Input and outputs
How does it work?
Deep Autoencoders
Deep Belief Networks
Inputs and outputs
How does it work?
Deep learning with dropouts
Definition and mathematical notation
Inputs and outputs
How does it work?
Learning Training and testing with dropouts
Sparse coding

Convolutional Neural Network
Local connectivity
Parameter sharing
Discrete convolution
Pooling or subsampling
Normalization using ReLU
CNN Layers
Recurrent Neural Networks
Structure of Recurrent Neural Networks
Learning and associated problems in RNNs
Long Short Term Memory
Gated Recurrent Units
Case study
Tools and software
Business problem
Machine learning mapping
Data sampling and transformation
Feature analysis
Models, results, and evaluation
Basic data handling
Multi-layer perceptron
Parameters used for MLP
Code for MLP
Convolutional Network
Parameters used for ConvNet
Code for CNN
Variational Autoencoder
Parameters used for the Variational Autoencoder
Code for Variational Autoencoder
DBN
Parameter search using Arbiter
Results and analysis
Summary
References
8. Text Mining and Natural Language Processing
NLP, subfields, and tasks
Text categorization
Part-of-speech tagging (POS tagging)
Text clustering
Information extraction and named entity recognition
Sentiment analysis and opinion mining
Coreference resolution
Word sense disambiguation

Machine translation
Semantic reasoning and inferencing
Text summarization
Automating question and answers
Issues with mining unstructured data
Text processing components and transformations
Document collection and standardization
Inputs and outputs
How does it work?
Tokenization
Inputs and outputs
How does it work?
Stop words removal
Inputs and outputs
How does it work?
Stemming or lemmatization
Inputs and outputs
How does it work?
Local/global dictionary or vocabulary?
Feature extraction/generation
Lexical features
Character-based features
Word-based features
Part-of-speech tagging features
Taxonomy features
Syntactic features
Semantic features
Feature representation and similarity
Vector space model
Binary
Term frequency (TF)
Inverse document frequency (IDF)
Term frequency-inverse document frequency (TF-IDF)
Similarity measures
Euclidean distance
Cosine distance
Pairwise-adaptive similarity
Extended Jaccard coefficient
Dice coefficient
Feature selection and dimensionality reduction
Feature selection
Information theoretic techniques
Statistical-based techniques

Frequency-based techniques
Dimensionality reduction
Topics in text mining
Text categorization/classification
Topic modeling
Probabilistic latent semantic analysis (PLSA)
Input and output
How does it work?
Advantages and limitations
Text clustering
Feature transformation, selection, and reduction
Clustering techniques
Generative probabilistic models
Input and output
How does it work?
Advantages and limitations
Distance-based text clustering
Non-negative matrix factorization (NMF)
Input and output
How does it work?
Advantages and limitations
Evaluation of text clustering
Named entity recognition
Hidden Markov models for NER
Input and output
How does it work?
Advantages and limitations
Maximum entropy Markov models for NER
Input and output
How does it work?
Advantages and limitations
Deep learning and NLP
Tools and usage
Mallet
KNIME
Topic modeling with mallet
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Feature analysis and dimensionality reduction
Models, results, and evaluation
Analysis of text processing results

Summary
References
9. Big Data Machine Learning – The Final Frontier
What are the characteristics of Big Data?
Big Data Machine Learning
General Big Data framework
Big Data cluster deployment frameworks
Hortonworks Data Platform
Cloudera CDH
Amazon Elastic MapReduce
Microsoft Azure HDInsight
Data acquisition
Publish-subscribe frameworks
Source-sink frameworks
SQL frameworks
Message queueing frameworks
Custom frameworks
Data storage
HDFS
NoSQL
Key-value databases
Document databases
Columnar databases
Graph databases
Data processing and preparation
Hive and HQL
Spark SQL
Amazon Redshift
Real-time stream processing
Machine Learning
Visualization and analysis
Batch Big Data Machine Learning
H2O as Big Data Machine Learning platform
H2O architecture
Machine learning in H2O
Tools and usage
Case study
Business problem
Machine Learning mapping
Data collection
Data sampling and transformation
Experiments, results, and analysis
Feature relevance and analysis

Evaluation on test data
Analysis of results
Spark MLlib as Big Data Machine Learning platform
Spark architecture
Machine Learning in MLlib
Tools and usage
Experiments, results, and analysis
k-Means
k-Means with PCA
Bisecting k-Means (with PCA)
Gaussian Mixture Model
Random Forest
Analysis of results
Real-time Big Data Machine Learning
SAMOA as a real-time Big Data Machine Learning framework
SAMOA architecture
Machine Learning algorithms
Tools and usage
Experiments, results, and analysis
Analysis of results
The future of Machine Learning
Summary
References
A. Linear Algebra
Vector
Scalar product of vectors
Matrix
Transpose of a matrix
Matrix addition
Scalar multiplication
Matrix multiplication
Properties of matrix product
Linear transformation
Matrix inverse
Eigendecomposition
Positive definite matrix
Singular value decomposition (SVD)
B. Probability
Axioms of probability
Bayes' theorem
Density estimation
Mean
Variance

Standard deviation
Gaussian standard deviation
Covariance
Correlation coefficient
Binomial distribution
Poisson distribution
Gaussian distribution
Central limit theorem
Error propagation
Index

Mastering Java Machine Learning
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: June 2017
Production reference: 1290617
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-051-3
www.packtpub.com

Credits
Authors
Uday Kamath
Krishna Choppella
Reviewers
Samir Sahli
Prashant Verma
Commissioning Editor
Veena Pagare
Acquisition Editor
Divya Poojari
Content Development Editor
Mayur Pawanikar
Technical Editor
Vivek Arora
Copy Editors
Vikrant Phadkay
Safis Editing
Project Coordinator
Nidhi Joshi
Proofreaders
Safis Editing
Indexer

Francy Puthiry
Graphics
Tania Dutta
Production Coordinator
Arvindkumar Gupta
Cover Work
Arvindkumar Gupta

Foreword
Dr. Uday Kamath is a volcano of ideas. Every time he walked into my office, we had fruitful
and animated discussions. I have been a professor of computer science at George Mason
University (GMU) for 15 years, specializing in machine learning and data mining. I have
known Uday for five years, first as a student in my data mining class, then as a colleague
and co-author of papers and projects on large-scale machine learning. While a chief data
scientist at BAE Systems Applied Intelligence, Uday earned his PhD in evolutionary
computation and machine learning. As if having two high-demand jobs was not enough,
Uday was unusually prolific, publishing extensively with four different people in the computer
science faculty during his tenure at GMU, something you don't see very often. Given this
pedigree, I am not surprised that less than four years since Uday's graduation with a PhD, I
am writing the foreword for his book on mastering advanced machine learning techniques
with Java. Uday's thirst for new stimulating challenges has struck again, resulting in this
terrific book you now have in your hands.
This book is the product of his deep interest and knowledge in sound and well-grounded
theory, and at the same time his keen grasp of the practical feasibility of proposed
methodologies. Several books on machine learning and data analytics exist, but Uday's
book closes a substantial gap—the one between theory and practice. It offers a
comprehensive and systematic analysis of classic and advanced learning techniques, with a
focus on their advantages and limitations, practical use and implementations. This book is a
precious resource for practitioners of data science and analytics, as well as for
undergraduate and graduate students keen to master practical and efficient
implementations of machine learning techniques.
The book covers the classic techniques of machine learning, such as classification,
clustering, dimensionality reduction, anomaly detection, semi-supervised learning, and active
learning. It also covers advanced and recent topics, including learning with stream data,
deep learning, and the challenges of learning with big data. Each chapter is dedicated to a
topic and includes an illustrative case study, which covers state-of-the-art Java-based tools
and software, and the entire knowledge discovery cycle: data collection, experimental
design, modeling, results, and evaluation. Each chapter is self-contained, providing great
flexibility of usage. The accompanying website provides the source code and data. This is
truly a gem for both students and data analytics practitioners, who can experiment firsthand with the methods just learned or deepen their understanding of the methods by
applying them to real-world scenarios.
As I was reading the various chapters of the book, I was reminded of the enthusiasm Uday
has for learning and knowledge. He communicates the concepts described in the book with
clarity and with the same passion. I am positive that you, as a reader, will feel the same. I
will certainly keep this book as a personal resource for the courses I teach, and strongly

recommend it to my students.
Dr. Carlotta Domeniconi
Associate Professor of Computer Science, George Mason University

About the Authors
Dr. Uday Kamath is the chief data scientist at BAE Systems Applied Intelligence. He
specializes in scalable machine learning and has spent 20 years in domains such as AML,
fraud detection in financial crime, cyber security, and bioinformatics, to name a few. Dr.
Kamath is responsible for key products in areas focusing on the behavioral, social
networking and big data machine learning aspects of analytics at BAE AI. He received his
PhD at George Mason University, under the able guidance of Dr. Kenneth De Jong, where
his dissertation research focused on machine learning for big data and automated sequence
mining.
I would like to thank my friend, Krishna Choppella, for accepting the offer to co-author
this book and being an able partner on this long but satisfying journey.
Heartfelt thanks to our reviewers, especially Dr. Samir Sahli for his valuable comments,
suggestions, and in-depth review of the chapters. I would like to thank Professor
Carlotta Domeniconi for her suggestions and comments that helped us shape various
chapters in the book. I would also like to thank all the Packt staff, especially Divya
Poojari, Mayur Pawanikar, and Vivek Arora, for helping us complete the tasks in time.
This book required making a lot of sacrifices on the personal front and I would like to
thank my wife, Pratibha, and our nanny, Evelyn, for their unconditional support. Finally,
thanks to all my lovely teachers and professors for not only teaching the subjects, but
also instilling the joy of learning.
Krishna Choppella builds tools and client solutions in his role as a solutions architect for
analytics at BAE Systems Applied Intelligence. He has been programming in Java for 20
years. His interests are data science, functional programming, and distributed computing.

About the Reviewers
Samir Sahli was awarded a BSc degree in applied mathematics and information sciences
from the University of Nice Sophia-Antipolis, France, in 2004. He received MSc and PhD
degrees in physics (specializing in optics/photonics/image science) from University Laval,
Quebec, Canada, in 2008 and 2013, respectively. During his graduate studies, he worked
with Defence Research and Development Canada (DRDC) on the automatic detection and
recognition of targets in aerial imagery, especially in the context of uncontrolled environment
and sub-optimal acquisition conditions. He has worked since 2009 as a consultant for
several companies based in Europe and North America specializing in the area of
Intelligence, Surveillance, and Reconnaissance (ISR) and in remote sensing.
Dr. Sahli joined McMaster Biophotonics in 2013 as a postdoctoral fellow. His research was
in the field of optics, image processing, and machine learning. He was involved in several
projects, such as the development of a novel generation of gastrointestinal tract imaging
device, hyperspectral imaging of skin erythema for individualized radiotherapy treatment,
and automatic detection of the precancerous Barrett's esophageal cell using fluorescence
lifetime imaging microscopy and multiphoton microscopy.
Dr. Sahli joined BAE Systems Applied Intelligence in 2015. He has since worked as a data
scientist to develop analytics models to detect complex fraud patterns and money
laundering schemes for insurance, banking, and governmental clients using machine
learning, statistics, and social network analysis tools.
Prashant Verma started his IT career in 2011 as a Java developer in Ericsson, working in
the telecom domain. After a couple of years of Java EE experience, he moved into the big
data domain and has worked on almost all of the popular big data technologies such as
Hadoop, Spark, Flume, Mongo, Cassandra, and so on. He has also played with Scala.
Currently, he works with QA Infotech as a lead data engineer, working on solving e-learning
problems with analytics and machine learning.
Prashant has worked for many companies, such as Ericsson and QA Infotech, with domain
knowledge of telecom and e-learning. He has also worked as a freelance consultant in his
free time.
I want to thank Packt Publishing for giving me the chance to review the book, as well as
my employer and my family for their patience while I was busy working on this book.

www.PacktPub.com
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as
a print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us at for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for
a range of free newsletters and receive exclusive discounts and offers on Packt books and
eBooks.

https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.

Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/1785880519.
If you'd like to join our team of regular reviewers, you can e-mail us at
customerreviews@packtpub.com. We award our regular reviewers with free eBooks and
videos in exchange for their valuable feedback. Help us be relentless in improving our
products!
Dedicated to my parents, Krishna Kamath and Bharathi Kamath, my wife, Pratibha
Shenoy, and the kids, Aaroh and Brandy
--Dr. Uday Kamath.
To my parents
--Krishna Choppella

Preface
There are many notable books on machine learning, from pedagogical tracts on the theory
of learning from data; to standard references on specializations in the field, such as
clustering and outlier detection or probabilistic graph modeling; to cookbooks that offer
practical advice on the use of tools and libraries in a particular language. The books that
tend to be broad in coverage are often short on theoretical detail, while those with a focus
on one topic or tool may not, for example, have much to say about the difference in
approach in a streaming as opposed to a batch environment. Besides, for the non-novices
with a preference for tools in Java who wish to reach for a single volume that will extend
their knowledge—simultaneously, on the essential aspects—there are precious few options.
Finding in one place:
The pros and cons of different techniques given any data availability scenario—when
data is labeled or unlabeled, streaming or batch, local or distributed, structured or
unstructured
A ready reference for the most important mathematical results related to those very
techniques for a better appreciation of the underlying theory
An introduction to the most mature Java-based frameworks, libraries, and visualization
tools with descriptions and illustrations on how to put these techniques into practice is
not possible today, as far as we know
The core idea of this book, therefore, is to address this gap while maintaining a balance
between treatment of theory and practice with the aid of probability, statistics, basic linear
algebra, and rudimentary calculus in the service of one, and emphasizing methodology,
case studies, tools and code in support of the other.
According to the KDnuggets 2016 software poll, Java, at 16.8%, has the second highest
share in popularity among languages used in machine learning, after Python. What's more,
this marks a 19% increase from the year before! Clearly, Java remains an important
and effective vehicle to build and deploy systems involving machine learning, despite claims
of its decline in some quarters. With this book, we aim to reach professionals and motivated
enthusiasts with some experience in Java and a beginner's knowledge of machine learning.
Our goal is to make Mastering Java Machine Learning the next step on their path to
becoming advanced practitioners in data science. To guide them on this path, the book
covers a veritable arsenal of techniques in machine learning—some which they may already
be familiar with, others perhaps not as much, or only superficially—including methods of
data analysis, learning algorithms, evaluation of model performance, and more in
supervised and unsupervised learning, clustering and anomaly detection, and semi-supervised and active learning. It also presents special topics such as probabilistic graph
modeling, text mining, and deep learning. Not forgetting the increasingly important topics in
enterprise-scale systems today, the book also covers the unique challenges of learning

from evolving data streams and the tools and techniques applicable to real-time systems,
as well as the imperatives of the world of Big Data:
How does machine learning work in large-scale distributed environments?
What are the trade-offs?
How must algorithms be adapted?
How can these systems interoperate with other technologies in the dominant Hadoop
ecosystem?
This book explains how to apply machine learning to real-world data and real-world
domains with the right methodology, processes, applications, and analysis. Accompanying
each chapter are case studies and examples of how to apply the newly learned techniques
using some of the best available open source tools written in Java. This book covers more
than 15 open source Java tools supporting a wide range of techniques between them, with
code and practical usage. The code, data, and configurations are available for readers to
download and experiment with. We present more than ten real-world case studies in
Machine Learning that illustrate the data scientist's process. Each case study details the
steps undertaken in the experiments: data ingestion, data analysis, data cleansing, feature
reduction/selection, mapping to machine learning, model training, model selection, model
evaluation, and analysis of results. This gives the reader a practical guide to using the tools
and methods presented in each chapter for solving the business problem at hand.

What this book covers
Chapter 1, Machine Learning Review, is a refresher of basic concepts and techniques that
the reader would have learned from Packt's Machine Learning in Java or a
similar text. This chapter is a review of concepts such as data, data transformation,
sampling and bias, features and their importance, supervised learning, unsupervised
learning, big data learning, stream and real-time learning, probabilistic graph models, and
semi-supervised learning.
Chapter 2, Practical Approach to Real-World Supervised Learning, cobwebs dusted, dives
straight into the vast field of supervised learning and the full spectrum of associated
techniques. We cover the topics of feature selection and reduction, linear modeling, logistic
models, non-linear models, SVM and kernels, ensemble learning techniques such as
bagging and boosting, validation techniques and evaluation metrics, and model selection.
Using WEKA and RapidMiner, we carry out a detailed case study, going through all the
steps from data analysis to analysis of model performance. As in each of the other
chapters, the case study is presented as an example to help the reader understand how the
techniques introduced in the chapter are applied in real life. The dataset used in the case
study is UCI HorseColic.
Chapter 3, Unsupervised Machine Learning Techniques, presents many advanced
methods in clustering and outlier techniques, with applications. Topics covered are feature
selection and reduction in unsupervised data, clustering algorithms, evaluation methods in
clustering, and anomaly detection using statistical, distance, and distribution techniques. At
the end of the chapter, we perform a case study for both clustering and outlier detection
using a real-world image dataset, MNIST. We use the Smile API to do feature reduction
and ELKI for learning.
Chapter 4, Semi-supervised Learning and Active Learning, gives details of algorithms and
techniques for learning when only a small amount of labeled data is present. Topics covered
are self-training, generative models, transductive SVMs, co-training, active learning, and
multi-view learning. The case study involves both learning systems and is performed on the
real-world UCI Breast Cancer Wisconsin dataset. The tools introduced are
JKernelMachines, KEEL, and JCLAL.
Chapter 5, Real-Time Stream Machine Learning, covers how data streams in real time
present unique circumstances for the problem of learning from data. This chapter broadly covers the
need for stream machine learning and applications, supervised stream learning,
unsupervised cluster stream learning, unsupervised outlier learning, evaluation techniques in
stream learning, and metrics used for evaluation. A detailed case study is given at the end
of the chapter to illustrate the use of the MOA framework. The dataset used is Electricity
(ELEC).

Chapter 6, Probabilistic Graph Modeling, shows that many real-world problems can be
effectively represented by encoding complex joint probability distributions over multidimensional spaces. Probabilistic graph models provide a framework to represent, draw
inferences, and learn effectively in such situations. The chapter broadly covers probability
concepts, PGMs, Bayesian networks, Markov networks, Graph Structure Learning, Hidden
Markov Models, and Inferencing. A detailed case study on a real-world dataset is
performed at the end of the chapter. The tools used in this case study are OpenMarkov and
WEKA's Bayes network. The dataset is UCI Adult (Census Income).
Chapter 7, Deep Learning. If there is one superstar of machine learning in the popular
imagination today, it is deep learning, which has attained dominance among the techniques
used to solve the most complex AI problems. Topics broadly covered are neural networks,
issues in neural networks, deep belief networks, restricted Boltzmann machines,
convolutional networks, long short-term memory units, denoising autoencoders, recurrent
networks, and others. We present a detailed case study showing how to implement deep
learning networks, tuning the parameters and performing learning. We use DeepLearning4J
with the MNIST image dataset.
Chapter 8, Text Mining and Natural Language Processing, details the techniques,
algorithms, and tools for performing various analyses in the field of text mining. Topics
broadly covered are areas of text mining, components needed for text mining,
representation of text data, dimensionality reduction techniques, topic modeling, text
clustering, named entity recognition, and deep learning. The case study uses real-world
unstructured text data (the Reuters-21578 dataset) highlighting topic modeling and text
classification; the tools used are MALLET and KNIME.
Chapter 9, Big Data Machine Learning – the Final Frontier, discusses some of the most
important challenges of today. What learning options are available when data is either big
or available at a very high velocity? How is scalability handled? Topics covered are big data
cluster deployment frameworks, big data storage options, batch data processing, batch
data machine learning, real-time machine learning frameworks, and real-time stream
learning. In the detailed case study covering both big data batch and real-time learning, we
select the UCI Covertype dataset and the machine learning libraries H2O, Spark MLlib, and SAMOA.
Appendix A, Linear Algebra, covers concepts from linear algebra, and is meant as a brief
refresher. It is by no means complete in its coverage, but contains a whirlwind tour of some
important concepts relevant to the machine learning techniques featured in the book. It
includes vectors, matrices and basic matrix operations and properties, linear
transformations, matrix inverse, eigen decomposition, positive definite matrix, and singular
value decomposition.
Appendix B, Probability, provides a brief primer on probability. It includes the axioms of
probability, Bayes' theorem, density estimation, mean, variance, standard deviation,
Gaussian standard deviation, covariance, correlation coefficient, binomial distribution,

Poisson distribution, Gaussian distribution, central limit theorem, and error propagation.

What you need for this book
This book assumes you have some experience of programming in Java and a basic
understanding of machine learning concepts. If that doesn't apply to you, but you are
curious nonetheless and self-motivated, fret not, and read on! For those who do have some
background, it means that you are familiar with simple statistical analysis of data and
concepts involved in supervised and unsupervised learning. Those who may not have the
requisite math or must poke the far reaches of their memory to shake loose the odd
formula or funny symbol, do not be disheartened. If you are the sort that loves a challenge,
the short primer in the appendices may be all you need to kick-start your engines—a bit of
tenacity will see you through the rest! For those who have never been introduced to
machine learning, the first chapter was equally written for you as for those needing a
refresher—it is your starter-kit to jump in feet first and find out what it's all about. You can
augment your basics with any number of online resources. Finally, for those innocent of
Java, here's a secret: many of the tools featured in the book have powerful GUIs. Some
include wizard-like interfaces, making them quite easy to use, and do not require any
knowledge of Java. So if you are new to Java, just skip the examples that need coding and
learn to use the GUI-based tools instead!

Who this book is for
The primary audience of this book is professionals who work with data and whose
responsibilities may include data analysis, data visualization or transformation, the training,
validation, testing and evaluation of machine learning models—presumably to perform
predictive, descriptive or prescriptive analytics using Java or Java-based tools. The choice
of Java may imply a personal preference and therefore some prior experience programming
in Java. On the other hand, perhaps circumstances in the work environment or company
policies limit the use of third-party tools to only those written in Java and a few others. In
the second case, the prospective reader may have no programming experience in Java.
This book is aimed at this reader just as squarely as it is at their colleague, the Java expert
(who came up with the policy in the first place).
A secondary audience can be defined by a profile with two attributes alone: an intellectual
curiosity about machine learning and the desire for a single comprehensive treatment of the
concepts, the practical techniques, and the tools. A specimen of this type of reader can opt
to skip the math and the tools and focus on learning the most common supervised and
unsupervised learning algorithms alone. Another might skim over Chapters 1, 2, 3, and 7,
skip the others entirely, and jump headlong into the tools—a perfectly reasonable strategy if
you want to quickly make yourself useful analyzing that dataset the client said would be
here any day now. Importantly, too, with some practice reproducing the experiments from
the book, it'll get you asking the right questions of the gurus! Alternatively, you might want
to use this book as a reference to quickly look up the details of the algorithm for affinity
propagation (Chapter 3, Unsupervised Machine Learning Techniques), or remind yourself
of an LSTM architecture with a brief review of the schematic (Chapter 7, Deep Learning),
or dog-ear the page with the list of pros and cons of distance-based clustering methods for
outlier detection in stream-based learning (Chapter 5, Real-Time Stream Machine
Learning). All specimens are welcome and each will find plenty to sink their teeth into.

Conventions
In this book, you will find a number of text styles that distinguish between different kinds of
information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The
algorithm calls the eliminate function in a loop, as shown here."
A block of code is set as follows:
DataSource source = new DataSource(trainingFile);
Instances data = source.getDataSet();
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);
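As an aside for readers who want to try the snippet right away, here is a minimal, self-contained sketch of how it could be embedded in a complete program. It assumes Weka is on the classpath, that the training file is in ARFF format, and that the last attribute is the class label; the class name TrainingExample and the choice of the J48 decision tree are illustrative assumptions, not prescribed by the book.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainingExample {
    public static void main(String[] args) throws Exception {
        // Path to a training file passed on the command line (hypothetical)
        String trainingFile = args[0];

        // Load the dataset; if no class attribute is set, default to the last one
        DataSource source = new DataSource(trainingFile);
        Instances data = source.getDataSet();
        if (data.classIndex() == -1)
            data.setClassIndex(data.numAttributes() - 1);

        // Train and evaluate a J48 decision tree with 10-fold cross-validation
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Running it prints an evaluation summary containing lines such as the correctly and incorrectly classified instance counts shown next.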

Any command-line input or output is written as follows:
Correctly Classified Instances        53        77.9412 %
Incorrectly Classified Instances      15        22.0588 %

New terms and important words are shown in bold.

Note
Warnings or important notes appear in a box like this.

Tip
Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book
—what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of.
To send us general feedback, simply e-mail , and mention the
book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
You can download the code files by following these steps:
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's
webpage at the Packt Publishing website. This page can be accessed by entering the
book's name in the Search box. Please note that you need to be logged in to your Packt
account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/mjmlbook/mastering-java-machine-learning. We also have other code
bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books—maybe a mistake in the text or the code
—we would be grateful if you could report this to us. By doing so, you can save other
readers from frustration and help us improve subsequent versions of this book. If you find
any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section

of that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable
content.

Questions
If you have a problem with any aspect of this book, you can contact us at
, and we will do our best to address the problem.

Chapter 1. Machine Learning Review
Recent years have seen the revival of artificial intelligence (AI) and machine learning in
particular, both in academic circles and the industry. In the last decade, AI has seen
dramatic successes that eluded practitioners in the intervening years since the original
promise of the field gave way to relative decline until its re-emergence in the last few years.
What made these successes possible, in large part, was the impetus provided by the need
to process the prodigious amounts of ever-growing data, key algorithmic advances by
dogged researchers in deep learning, and the inexorable increase in raw computational
power driven by Moore's Law. Among the areas of AI leading the resurgence, machine
learning has seen spectacular developments, and continues to find the widest applicability in
an array of domains. The us