Machine Learning for beginners

Introduction

Khouloud Alkhammassi
Jan 26, 2020

When we talk about Machine Learning, we are talking about part of a larger field called data science, a relatively new area of work that extends classical analysis capabilities to help companies make informed decisions. Machine learning is actually only one part of a data scientist’s job.

The questions here are:

  1. Where does Machine Learning sit within data science?
  2. What is Machine Learning?
  3. How does it work?
  4. Where can we find Machine Learning in real life?

I. Data scientist working cycle

The data scientist’s work cycle can be summarized by the diagram below. Put simply, we start from reality, collect the data, clean it, explore it, and then use our algorithms to create an Artificial Intelligence that supports decision-making:

Data scientist working cycle

1. Recovery

Once you’ve decided to tackle a problem, the first thing to do is explore all possible avenues for collecting the data. Data constitutes the experience, that is, the examples you will provide to your machine learning algorithm so that it can learn and perform better.

2. Cleaning

Cleaning the data means making sure that it is consistent, and dealing with outliers and missing values.
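
As a rough illustration of this step, here is a minimal sketch in plain Python. The sample records, the field names (`age`, `city`) and the rule of simply dropping rows with a missing or implausible age are all assumptions made for the example, not a general recipe; in practice you might repair such values instead of discarding them.

```python
# Hypothetical raw records: one has a missing value, one has an implausible age.
raw_rows = [
    {"age": 34, "city": "Tunis"},
    {"age": None, "city": "Sfax"},      # missing value
    {"age": 29, "city": "tunis "},      # inconsistent formatting
    {"age": 420, "city": "Sousse"},     # outlier (assumed to be a typing error)
]

def clean(rows, min_age=0, max_age=120):
    """Drop rows with missing or out-of-range ages and normalize city names."""
    cleaned = []
    for row in rows:
        age = row["age"]
        if age is None or not (min_age <= age <= max_age):
            continue  # assumption: we discard the row rather than repair it
        cleaned.append({"age": age, "city": row["city"].strip().title()})
    return cleaned

clean_rows = clean(raw_rows)
print(clean_rows)
```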

3. Exploration

Once the data is clean, it can be explored. This step helps you better understand the different behaviors in the data and the underlying phenomenon.
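
Exploration often starts with a few summary statistics and a look at how variables move together. The sketch below assumes a small made-up dataset (hours studied and exam scores) and computes a mean, a standard deviation and a Pearson correlation by hand; the variable names and the numbers are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical cleaned data: hours studied and exam score for six students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print("mean score:", mean(scores))
print("score spread (std dev):", round(stdev(scores), 2))
correlation = pearson(hours, scores)
print("correlation hours/score:", round(correlation, 3))
```

A correlation close to 1 here would suggest that the score tends to rise with the hours studied, which is the kind of behavior this step is meant to surface.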

4. Model data using machine learning

We can finally get into the most interesting part of the job, that is to say the creation of the statistical model associated with the data that interests us! This is what we call machine learning; we will explore this part later.

5. Evaluate and interpret the results

Once a first round of modeling has been carried out, the study continues with an evaluation of the quality of our model, meaning its ability to represent our phenomenon accurately, or at least its ability to solve our problem.
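
One simple way to put a number on this quality, for a classification task, is accuracy: the share of predictions that match the known answers. The sketch below assumes hypothetical true labels and predictions (`y_true`, `y_pred`) and also counts which kinds of mistakes were made; accuracy is only one possible measure and is not necessarily the right one for every problem.

```python
# Hypothetical true labels and model predictions on data the model has not seen.
y_true = ["red", "blue", "red", "red", "blue", "blue", "red", "blue"]
y_pred = ["red", "blue", "blue", "red", "blue", "red", "red", "blue"]

def accuracy(truth, predictions):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(truth, predictions) if t == p)
    return correct / len(truth)

def error_counts(truth, predictions):
    """Count each (true label, predicted label) pair that is a mistake."""
    counts = {}
    for t, p in zip(truth, predictions):
        if t != p:
            counts[(t, p)] = counts.get((t, p), 0) + 1
    return counts

score = accuracy(y_true, y_pred)
errors = error_counts(y_true, y_pred)
print("accuracy:", score)
print("errors:", errors)
```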

6. Deploy the model in production

Once we are satisfied with the performance of our model, we can move on to the next step, which is the presentation of our results and the potential deployment of the model in production.

II. Different modeling steps

Machine learning is a sub-field of Artificial Intelligence (AI). Generally, the goal of machine learning is to understand the structure of data and fit that data into models that can be understood and used by people.

Imagine that you are a data scientist. You are now comfortable with all the data collected for your analyses. You know the main objectives of the company, which has helped you to summarize the different variables involved, as well as to visualize the different behaviors and correlations present within this data.

Machine learning is the next step: it allows a computer to model the data provided to it.

“Modeling” in this case means representing the behavior of a phenomenon in order to be able to help solve a concrete business problem.

In machine learning, the algorithm builds an “internal representation” in order to be able to perform the task requested of it (prediction, identification, etc.). To do this, we first have to give it a set of example data so that it can train and improve, hence the word learning. This dataset is called the training set. An entry in the dataset is called an instance or an observation.

The data scientist’s job in machine learning consists of selecting the right training data, then choosing and training the right algorithm while verifying through error analysis that the model is becoming more and more accurate and robust. If performance improves as training data is provided, then the machine is said to be “learning”.

Once the model is correctly configured on the training data, the data scientist can then deploy it so that it processes new data, to accomplish the specific task pursued (prediction, recommendation, decision …).

=> A machine learning problem has several specific elements:

  1. Data
  2. A task to accomplish
  3. A learning algorithm
  4. A measure of performance
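
To make these four elements concrete, here is a minimal sketch in plain Python. Everything specific in it is an assumption chosen for illustration: the made-up flat prices, the task of predicting a price from a surface, the choice of a least-squares line as the learning algorithm, and mean squared error as the measure of performance.

```python
# 1. Data: hypothetical (surface in m2, price in thousands) pairs.
train = [(30.0, 90.0), (50.0, 150.0), (70.0, 205.0), (90.0, 270.0)]
test = [(40.0, 118.0), (80.0, 242.0)]

# 2. Task: predict the price of a flat from its surface.

# 3. Learning algorithm: ordinary least squares for a line y = a * x + b.
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# 4. Measure of performance: mean squared error on data not used for training.
def mean_squared_error(points, a, b):
    return sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)

a, b = fit_line(train)
mse = mean_squared_error(test, a, b)
print(f"price ~ {a:.2f} * surface + {b:.2f}")
print(f"test MSE: {mse:.2f}")
```

The model is fitted on `train` only and scored on `test`, which mirrors the idea that a trained model is ultimately judged on new data.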

III. Types of machine learning

  1. Supervised learning

In supervised learning, you collect data that is already annotated with its expected outputs in order to train the model. In other words, each example already has a target label or class associated with it, and you want the algorithm, once trained, to be able to predict this target on new, non-annotated data.

2. Semi-supervised learning

Semi-supervised learning takes as input some data that is annotated and some that is not.

3. Unsupervised learning

In this case, the data is not annotated. The learning algorithm tries to find the similarities and distinctions within the data alone, and to group together the observations that share common characteristics.
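
The grouping idea can be sketched with k-means, one common clustering method (the text above does not name a specific algorithm, so this choice is an assumption). The example works on plain numbers, and the data, the starting centers and the function name `k_means_1d` are hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Tiny k-means on numbers: assign to nearest center, then recompute centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical unlabeled data: customer basket sizes with two visible clumps.
baskets = [1.0, 2.0, 3.0, 2.0, 20.0, 22.0, 21.0, 19.0]
centers, groups = k_means_1d(baskets, centers=[0.0, 10.0])
print(centers)
print(groups)
```

No labels are given anywhere: the two groups emerge only from how close the values are to each other.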

4. Reinforcement learning

Reinforcement learning is based on a cycle of experience and reward, and aims to improve performance with each iteration. An analogy often cited is that of the dopamine cycle: a “good” experience increases dopamine and therefore increases the probability that the agent repeats the experience.
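
A very small sketch of this experience and reward loop is the epsilon-greedy strategy on a toy problem with a few possible actions. This is only one simple reinforcement learning setting, chosen here as an assumption; the fixed rewards, the function name `epsilon_greedy` and its parameters are hypothetical.

```python
import random

def epsilon_greedy(rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn which action pays best by trying actions and averaging rewards."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)   # current value estimate per action
    counts = [0] * len(rewards)        # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))          # explore
        else:
            action = max(range(len(rewards)), key=lambda i: estimates[i])  # exploit
        reward = rewards[action]       # the environment's response
        counts[action] += 1
        # Incremental average: move the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical environment: three actions with fixed rewards.
estimates, counts = epsilon_greedy([0.2, 0.8, 0.5])
best_action = max(range(len(estimates)), key=lambda i: estimates[i])
print(best_action, counts)
```

Most of the time the agent repeats the action it currently believes is best, and occasionally it tries another one, which is how a better action can be discovered.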

IV. The learning algorithm

The learning algorithm is the method by which the statistical model configures itself from the example data. There are many different algorithms, such as:

  1. Linear regression
  2. K-NN
  3. Support Vector Machine (SVM)
  4. Neural networks
  5. Random forests

Let’s take the example of the K-NN (k-nearest neighbors) algorithm. How does it work?

Below we have represented a training dataset, with two classes, red and blue. The input is therefore two-dimensional here, and the target is the color to classify.

Test point cloud

If we have a new entry whose class we want to predict, how could we do it?

The white point is a new entry.

We simply look at the k nearest neighbors of this point and see which class makes up the majority of them, in order to deduce the class of the new point. For example, if we use 5-NN here, we can predict that the new entry belongs to the red class, since 3 of its 5 nearest neighbors are red and 2 are blue.

The 5 points closest to the point we are trying to classify
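
The voting procedure described above can be sketched in a few lines of plain Python. The coordinates of the training points and of the new point are made up for this example (they are not the ones from the figures), and `knn_predict` is a hypothetical helper name.

```python
from collections import Counter
from math import dist

# Hypothetical training set: (x, y) coordinates with a color label.
training_set = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "blue"),
    ((3.0, 3.5), "blue"), ((3.5, 2.5), "blue"),
    ((4.0, 4.0), "red"), ((4.5, 3.0), "red"), ((5.0, 5.0), "red"),
    ((5.5, 4.0), "red"), ((6.0, 5.5), "red"),
]

def knn_predict(training, point, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort the training points by Euclidean distance to the new point.
    by_distance = sorted(training, key=lambda item: dist(item[0], point))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k neighbors wins the vote.
    return Counter(nearest_labels).most_common(1)[0][0], nearest_labels

new_point = (4.0, 3.2)
predicted, neighbors = knn_predict(training_set, new_point, k=5)
print(predicted, neighbors)
```

Note that K-NN has no real training phase: the "model" is the training set itself, and all the work happens when a new point has to be classified.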

The Euclidean distance was used here as the measure of similarity. From it we can deduce a red zone and a blue zone: any new point that falls in one of these zones will be classified as red or blue, respectively.

The two areas that separate the space for the decision on the classification of new entries with the 5-NN model
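
One way to see such zones without a plot is to classify every cell of a small grid and print the result as text. The sketch below assumes a made-up set of ten points and an 8 by 8 grid; `classify` and `region_map` are hypothetical helper names, and the resulting map only illustrates the idea, it does not reproduce the figure.

```python
from collections import Counter
from math import dist

# Hypothetical training points, as ((x, y), label) pairs.
points = [
    ((1, 1), "blue"), ((1, 3), "blue"), ((2, 2), "blue"), ((2, 4), "blue"), ((3, 1), "blue"),
    ((5, 5), "red"), ((6, 4), "red"), ((6, 6), "red"), ((7, 5), "red"), ((5, 7), "red"),
]

def classify(point, k=5):
    """5-NN with Euclidean distance, majority vote."""
    nearest = sorted(points, key=lambda item: dist(item[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def region_map(size=8):
    """Label every grid cell: 'b' for the blue zone, 'r' for the red zone."""
    rows = []
    for y in range(size, -1, -1):          # print the highest y first
        rows.append("".join(classify((x, y))[0] for x in range(size + 1)))
    return rows

grid = region_map()
print("\n".join(grid))
```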

V. Machine Learning in real-life examples

Extra resources:
