Today’s developers often hear about using machine learning algorithms to build more intelligent applications, but many don’t know where to start.
One of the most important aspects of developing smart applications is to understand the underlying machine learning models, even if you aren’t the person building them. Whether you are integrating a recommendation system into your app or building a chat bot, this guide will help you get started in understanding the basics of machine learning.
This introduction to machine learning and list of resources is adapted from my October 2016 talk at ACT-W, a women’s tech conference.
As a brief definition, machine learning means using statistical models and probabilistic algorithms to answer questions, so that we can make informed decisions based on our data.
An excerpt from Rob Schapire’s Theoretical Machine Learning lecture in 2008 sums up machine learning very nicely:
Machine learning studies computer algorithms for learning to do stuff. We might, for instance, be interested in learning to complete a task, or to make accurate predictions, or to behave intelligently. The learning that is being done is always based on some sort of observations or data, such as examples…direct experience, or instruction. So in general, machine learning is about learning to do better in the future based on what was experienced in the past.
The two main types of machine learning algorithms are supervised and unsupervised learning. Unsupervised algorithms work on unlabeled data, so they are great for exploring your dataset and are used for pattern detection, grouping similar objects in images, and other clustering problems such as recommendations based on similar items.
The k-means algorithm is a popular unsupervised algorithm that needs no labels for the data: it starts from randomly chosen seed centroids and uses an iterative process that eventually converges. This clustering algorithm uses a distance metric, with the goal of minimizing the squared Euclidean distance from each data point to its centroid. On each iteration it reassigns every data point to its nearest centroid and then recomputes the centroids.
The algorithm partitions n observations into k clusters, with each observation belonging to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means is used in market segmentation, computer vision, geostatistics, astronomy and agriculture.
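To make the assign-and-update loop concrete, here is a minimal sketch of k-means in plain Python. The six 2-D points, the `kmeans` function name and the fixed seed are assumptions made for illustration; a library implementation would normally be the better choice for real data.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100, seed=0):
    """Return (labels, centroids) after iterating until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random seed centroids taken from the data
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: nothing was reassigned
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Made-up data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
```

Which cluster gets index 0 depends on the random seeds, so only the grouping is meaningful, not the label numbers themselves.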
A model is a supervised algorithm if it relies on training data that already contains the correct label for each input and uses that relationship to make predictions for new, unseen data. Supervised algorithms are often used for classification problems such as sentiment labeling, object detection in images, credit card fraud detection, and spam filtering, to name just a few use cases.
The two main types of supervised machine learning are regression and classification. For instance, a regression model predicts continuous values, such as housing prices based on historical data points and trends. A classification model predicts categorical data, for example assigning discrete class labels in an image classification model that labels an image as a person or a landscape.
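The difference shows up in the type of value each model returns. The sketch below is a toy illustration only: the house sizes, prices, width/height ratios and both helper functions are made up, and a least-squares line and a 1-nearest-neighbor rule stand in for whatever regression and classification models you would actually use.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def nearest_label(labeled_examples, x):
    """1-nearest-neighbor: return the label of the closest known example."""
    return min(labeled_examples, key=lambda example: abs(example[0] - x))[1]

# Regression: made-up house sizes (square meters) and prices (thousands).
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept  # a continuous number

# Classification: made-up width/height ratios, each with a class label.
images = [(0.6, "person"), (0.75, "person"), (1.5, "landscape"), (1.8, "landscape")]
predicted_label = nearest_label(images, 1.6)  # one of a fixed set of labels
```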
There are many types of supervised algorithms available. One of the most popular is the Naive Bayes model, which is often a good starting point for developers since the underlying probabilistic model is fairly easy to understand and easy to run.
Decision trees are also predictive models and come in two types: regression trees, which predict continuous values, and classification trees, which predict values from a finite set of classes. Both use a divide-and-conquer strategy that recursively splits the data to generate the tree.
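The divide-and-conquer idea can be sketched as a small recursive function. This is a simplified classification tree under stated assumptions: numeric features only, Gini impurity as the split criterion (one common choice among several), no pruning, and six made-up data points. The function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (impurity, feature index, threshold) of the best split, or None."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(rows, labels):
    """Recursively split the data; a leaf is just a class label."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels)
    if split is None:  # identical rows with different labels: use the majority
        return Counter(labels).most_common(1)[0][0]
    _, feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return (feature, threshold, build_tree(*zip(*left)), build_tree(*zip(*right)))

def predict(tree, row):
    """Walk from the root to a leaf, following the thresholds."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree

# Made-up 2-D points belonging to three classes.
rows = [(1, 1), (2, 2), (1, 8), (2, 9), (8, 1), (9, 2)]
labels = ["a", "a", "b", "b", "c", "c"]
tree = build_tree(rows, labels)
```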
Neural networks are models inspired by how biological neural networks solve problems, and they can be either supervised or unsupervised. Supervised neural networks are trained on examples with known outputs and are built in layers of interconnected, weighted nodes, ending in an output layer that produces a prediction such as an image label.
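The supervised part, adjusting weights until the output matches a known label, can be sketched with a single weighted node (a perceptron). This is deliberately not a layered network: it is one node trained with the classic perceptron update rule on a made-up task (the logical AND of two inputs), and the function names and settings are assumptions for illustration.

```python
def predict(weights, bias, inputs):
    """Weighted sum of the inputs followed by a step activation."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(examples, epochs=20, learning_rate=1):
    """Nudge the weights whenever the prediction differs from the known output."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Training data: each input pair comes with its known output (logical AND).
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
```

Real networks stack many such nodes in layers and use smoother activations and gradient-based training, but the loop of predict, compare with the known output, and adjust is the same idea.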
Naïve Bayes classification is an algorithm that attempts to make predictions based on previously labeled data using a probabilistic model. Features are assumed to be independent of each other, meaning that one feature doesn’t affect the value of another, and the set of possible labels is decided and assigned in advance.
Some examples of labels used in classifiers are sentiment scores (which can be strings, integers, or floats for a scaled score) or, for object detection, labels such as chair, table or desk that describe objects in images. The features are also decided in advance, such as the appearance of key words or the email length in spam detection.
The example, modified from the NLTK book chapter “Learning to Classify Text”, walks through the steps to train the model on known data, with the last letter of a name as the feature.
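A minimal sketch of that last-letter classifier, written from scratch in plain Python rather than with NLTK’s own classes. The ten names, the `train` and `classify` helpers and the add-one smoothing are assumptions for illustration; the NLTK book trains on a much larger names corpus.

```python
import math
from collections import Counter, defaultdict

def last_letter(name):
    """The single feature used here: the final letter of the name."""
    return name[-1].lower()

def train(labeled_names):
    """Count how often each label and each (label, last letter) pair occurs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for name, label in labeled_names:
        label_counts[label] += 1
        feature_counts[label][last_letter(name)] += 1
    return label_counts, feature_counts

def classify(model, name, alphabet_size=26):
    """Pick the label with the highest log P(label) + log P(letter | label)."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    letter = last_letter(name)
    scores = {}
    for label, count in label_counts.items():
        prior = count / total
        # Add-one smoothing so an unseen letter never gets probability zero.
        likelihood = (feature_counts[label][letter] + 1) / (count + alphabet_size)
        scores[label] = math.log(prior) + math.log(likelihood)
    return max(scores, key=scores.get)

# Tiny made-up training set of (name, label) pairs.
labeled_names = [
    ("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
    ("Emma", "female"), ("Sophie", "female"),
    ("John", "male"), ("Mark", "male"), ("Peter", "male"),
    ("David", "male"), ("Kevin", "male"),
]
model = train(labeled_names)
prediction = classify(model, "Olivia")
```

With more than one feature, the independence assumption means the per-feature likelihoods are simply multiplied together (or their logs added).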
The basic steps for using a classification model when you have a large dataset:
- Training Set: Fit the model on known data
- Validation Set: Used for parameter tuning, i.e. choosing the model complexity
  - Hyperparameters: can be tuned by trying different values and choosing whichever tests better, or via statistical methods
    - Number of clusters in k-means: in our k-means example we used the elbow method
    - Number of leaves in a decision tree
- Test Set: Assess the final model after it has been fit on the training set; build a confusion matrix to find errors and to compare models
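The three roles above can be sketched in a few lines. Everything here is an assumption for illustration: the 60/20/20 split, the synthetic one-feature dataset, and a k-nearest-neighbor classifier whose hyperparameter `k` is chosen on the validation set.

```python
import random
from collections import Counter

def split_dataset(examples, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle, then cut the data into training, validation and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_validation],
            shuffled[n_train + n_validation:])

def knn_predict(train, x, k):
    """Majority label among the k training examples closest to x."""
    nearest = sorted(train, key=lambda example: abs(example[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, evaluation, k):
    """Fraction of evaluation examples that are labeled correctly."""
    correct = sum(knn_predict(train, x, k) == label for x, label in evaluation)
    return correct / len(evaluation)

# Synthetic data: one numeric feature, labeled by which side of 50 it falls on.
examples = [(x, "high" if x >= 50 else "low") for x in range(100)]
train, validation, test = split_dataset(examples)

# Validation set: pick the hyperparameter value that scores best.
best_k = max([1, 3, 5], key=lambda k: accuracy(train, validation, k))

# Test set: assess the chosen model once, on data it has not seen.
test_accuracy = accuracy(train, test, best_k)
confusion = Counter((label, knn_predict(train, x, best_k)) for x, label in test)
```

`confusion` maps each (actual, predicted) pair to a count, which is a confusion matrix in dictionary form.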
Cross-validation methods help you understand how a model will generalize to unseen data, and they are especially useful for smaller datasets. For example, k-fold cross-validation follows these steps:
- Split the training data into k subsets. In each round, one subset is held out as the test set and the remaining subsets are used for training, so every subset serves as the test set exactly once.
- Calculate the error rate on the held-out subset in each round.
- Average the error rates over all rounds to estimate model performance; their standard deviation shows how much that estimate varies.
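A small sketch of k-fold cross-validation in plain Python. The ten data points (including one deliberately odd "low" example at 20), the nearest-class-mean classifier and the helper names are all made up for illustration. Real data would usually be shuffled before it is split into folds.

```python
from statistics import mean, pstdev

def k_fold_splits(examples, k):
    """Yield (training, held_out) pairs; each example is held out exactly once."""
    fold_size, remainder = divmod(len(examples), k)
    start = 0
    for fold in range(k):
        stop = start + fold_size + (1 if fold < remainder else 0)
        yield examples[:start] + examples[stop:], examples[start:stop]
        start = stop

def nearest_mean_classifier(training):
    """Fit by storing each class mean; return a function that predicts a label."""
    by_label = {}
    for value, label in training:
        by_label.setdefault(label, []).append(value)
    class_means = {label: mean(values) for label, values in by_label.items()}
    return lambda value: min(class_means, key=lambda label: abs(class_means[label] - value))

def cross_validate(examples, k):
    """Return the mean and standard deviation of the per-fold error rates."""
    error_rates = []
    for training, held_out in k_fold_splits(examples, k):
        predict = nearest_mean_classifier(training)
        errors = sum(predict(value) != label for value, label in held_out)
        error_rates.append(errors / len(held_out))
    return mean(error_rates), pstdev(error_rates)

examples = [(1, "low"), (21, "high"), (2, "low"), (22, "high"), (3, "low"),
            (23, "high"), (4, "low"), (24, "high"), (20, "low"), (25, "high")]
average_error, spread = cross_validate(examples, k=5)
```

Only the fold that holds out the odd example makes a mistake, so the average error stays low while the spread shows that the folds did not all behave the same way.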
R is great for statistical/data analysis and machine learning, but not as good for production systems or utility functions due to performance and security issues.
Regression diagnostics: Outlier Tests (p-value), Influential Observations, Evaluating nonlinearity, Correlations, descriptive stats.
All the things statistics: ANOVA, Resampling Techniques, Clustering, PCA for unsupervised ML, Decision Trees and more.
Pandas is a Python library that uses data frames, as R does. While it is slower in production than working with NumPy arrays directly, Pandas is a favorite for data analysis and machine learning in a Python environment.
The benefit of using Pandas is that it can reduce your code by two-thirds or more, and it gives you really cool SQL-like features such as joins, merges, pivots and aggregation functions.
There are also many I/O methods available that make importing and exporting your data easy, such as pandas.read_csv for reading and DataFrame.to_excel, .to_json and .to_csv for writing.
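A short sketch of those SQL-like features, assuming pandas is installed. The two small tables and their column names are invented for the example.

```python
import pandas as pd

# Made-up data: orders and the customers who placed them.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "product": ["book", "pen", "pen", "book"],
    "amount": [12.0, 3.0, 4.5, 20.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Seattle", "Portland", "Seattle"],
})

# Join: like a SQL LEFT JOIN on customer_id.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate: like SELECT city, SUM(amount) ... GROUP BY city.
totals = merged.groupby("city")["amount"].sum()

# Pivot: cities as rows, products as columns, summed amounts as values.
pivot = merged.pivot_table(index="city", columns="product",
                           values="amount", aggfunc="sum")

# Export: with no path given, to_csv and to_json return the text as a string.
csv_text = totals.reset_index().to_csv(index=False)
json_text = merged.to_json(orient="records")
```

Passing a file path as the first argument to `to_csv` or `to_json` writes to disk instead of returning a string.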
Scikit-learn is another favorite Python library and is a great place to find machine learning models with tutorials and documentation that have been vetted by many Python developers. It has everything from image classification algorithms to natural language processing ones.
Here is a list of clickable links of the above slide that lists tools, tutorials and videos: