The aim of this post is to present some of the tools data scientists use to solve their everyday projects, not to give a theoretical lecture about machine learning. This post is for non data scientists who want to know what a data scientist does. You will discover what a data scientist has to do, not how to do it.

Most of the knowledge of a data scientist is machine learning, but most of the work consists in preprocessing data in order to get a clean input table with the right variables for the model. The first part of this post will be about machine learning techniques and the second part about getting clean data.

This domain is called supervised learning, because we know what we have to predict. The data scientist has the inputs $X$ for a number of patients, say $n$, with $p$ pieces of information for each one; thus $X$ is a matrix with $n$ rows and $p$ columns. He also has the corresponding output for each patient, $Y$, which is a vector of length $n$. The purpose is to find the best function $f$ such that $f(X)$ is close to $Y$; thereby, if a new patient comes and we know the $p$ pieces of information we need, applying the function $f$ gives an estimation of the output for this person.

loss function | formula | comments |
---|---|---|
0/1 loss | $\ell(y,f(x)) = \sum_{i=1}^n 1(y_i f(x_i)<0)$ | Only for problems where $y = \pm 1$. The error is $1$ if the sign of $f$ differs from the sign of $y$, and $0$ otherwise |
quadratic loss | $\ell(y,f(x)) = \sum_{i=1}^n ||y_i - f(x_i)||^2$ | Gives much more weight to big errors since the difference is squared |
log loss | $\ell(y,f(x)) = \sum_{i=1}^n \log (1 + \exp(-y_i f(x_i)))$ | Only for problems where $y = \pm 1$ |

- The 0/1 loss is very particular and makes no distinction between a huge mistake and a small one. Moreover, a small variation of the output can completely change the error: if the prediction is a small positive number, adding some noise can make it a small negative number, and the error would go from 0 to 1. This loss function is not continuous.
- The quadratic loss is minimum when $y=f(x)$, is symmetric, and increases really quickly as $f$ gets far from $y$. Saying $3$ to predict $1$ is as bad as saying $1$ to predict $-1$. This can be argued, because in one case we get the right sign ...
- The log loss behaves the other way: if $f$ and $y$ have the same sign, the higher $|f|$, the smaller the error. The idea is that if we said $100$, it means we are really sure that the result is positive, thus is $1$
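To make the comparison concrete, here is a minimal sketch of the three losses above in Python (the function names and toy values are illustrative, not part of the original post):

```python
import numpy as np

def zero_one_loss(y, fx):
    # counts predictions whose sign disagrees with the label (y must be +/-1)
    return int(np.sum(y * fx < 0))

def quadratic_loss(y, fx):
    # sum of squared differences; big errors dominate
    return float(np.sum((y - fx) ** 2))

def log_loss(y, fx):
    # small when y * f(x) is large and positive
    return float(np.sum(np.log1p(np.exp(-y * fx))))

y = np.array([1.0, -1.0])
fx = np.array([0.5, 2.0])          # the second prediction has the wrong sign
print(zero_one_loss(y, fx))        # 1
print(quadratic_loss(y, fx))       # 9.25
print(round(log_loss(y, fx), 3))   # 2.601
```

Note how the single sign error counts for 1 in the 0/1 loss, but dominates the quadratic loss because its magnitude is large.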

The objective of supervised learning is to find the function $f$ such that $f(X)$ is closest to $Y$, i.e. we want to find the function $f$ such that $\ell(Y,f(X))$ is minimum. In maths, we write : $$\underset{f}{\arg \min} \quad \ell(Y,f(X))$$

- for each function $f$ we get the value of $\ell(Y,f(X))$
- we look for the minimum
- we return the function $f$ for which we get the minimum (that's what "arg" means : we don't want the value of the minimum but the function that reaches this minimum)

The only thing left to define is $f$. We are now facing an optimisation problem. This cannot be solved for any type of function $f$ or loss $\ell$, and the way to find the solution also depends on the "form" of $f$ and $\ell$.

For instance,

- $f$ can be linear : if the vector $x = (x_1,...,x_p)$, there exists a vector $a = (a_1,...,a_p)$ such that $f(x) = a_1 x_1+ ... + a_p x_p = \sum_{i=1}^p a_i x_i$
- $f$ can be a transformation of a linear product : $f(x) = \frac{1}{1+\exp(-\sum_{i=1}^p a_i x_i)}$
- $f$ can be a polynomial of degree $q$ higher than 1 (degree 1 = linear) : $f(x) = \sum_{i=1}^p a_i x_i^q$
- $f$ can be a combination of different functions. For instance, let's note $f_q(x)$ the function $\sum_{i=1}^p a_i x_i^q$. We could have $f(x) = \sum_{q=1}^Q f_q(x)$
- ...

$f$ | Loss function | model |
---|---|---|
$f(x) = \sum_{i=1}^p a_i x_i$ | $\ell(y,f(x)) = \sum_{i=1}^n ||y_i - f(x_i)||^2$ | Linear regression |
$f(x) = \sum_{i=1}^p a_i x_i$ | $\ell(y,f(x)) = \sum_{i=1}^n ||y_i - f(x_i)||^2 + \lambda \sum_{i=1}^p ||a_i||^2$ | Ridge regression |
$f(x) = \sum_{i=1}^p a_i x_i$ | $\ell(y,f(x)) = \sum_{i=1}^n ||y_i - f(x_i)||^2 + \lambda \sum_{i=1}^p |a_i|$ | LASSO regression |
$f(x) = \frac{2}{1+\exp(-\sum_{i=1}^p a_i x_i)}-1$ | $\ell(y,f(x)) = \sum_{i=1}^n \log (1 + \exp(-y_i \sum_{j=1}^p a_j x_{i,j}))$ | Logistic regression |
$f(x) = \sum_{i=1}^p a_i x_i$ | $\ell(y,f(x)) = \sum_{i=1}^n \max(0,1 - y_i f(x_i)) + \lambda \sum_{i=1}^p ||a_i||^2$ | Linear Support Vector Machine |
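Each row of this table corresponds to an off-the-shelf estimator in common libraries. As a hedged sketch, assuming scikit-learn is installed (the synthetic data and coefficient values are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, LogisticRegression, Ridge
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_reg = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)  # continuous output
y_clf = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)                  # binary output

models = {
    "linear regression": LinearRegression().fit(X, y_reg),
    "ridge regression": Ridge(alpha=1.0).fit(X, y_reg),  # alpha plays the role of lambda
    "LASSO regression": Lasso(alpha=0.1).fit(X, y_reg),
    "logistic regression": LogisticRegression().fit(X, y_clf),
    "linear SVM": LinearSVC().fit(X, y_clf),
}
for name, model in models.items():
    print(name, np.round(np.ravel(model.coef_), 2))
```

The regression estimators recover coefficients close to the true $(3, -2)$; the classifiers recover the sign of $x_1 + x_2$.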

For simplicity let's consider $n=2$ and $p=2$. $$ X = \left( \begin{array}{cc} 2 & 4 \\ -3 & -0.5 \end{array} \right) \text{ and } Y = \left( \begin{array}{c} 1 \\ -1 \end{array} \right) $$ Furthermore let's say we want to compare those two functions, $f_1(x) = 10x_1+4x_2$ and $f_2(x) = 0.2x_1+0.6x_2$.

Let's compute each error.

- For quadratic loss
$$err_{quadratic}^1 = \sum_{i=1}^2 ||y_i-f_1(x_i)||^2 = (1-(10 \times 2 + 4 \times 4))^2 + ((-1)-(10 \times (-3) + 4 \times (-0.5)))^2 = 35^2 + 31^2 = 2186$$
$$err_{quadratic}^2 = \sum_{i=1}^2 ||y_i-f_2(x_i)||^2 = (1-(0.2 \times 2 + 0.6 \times 4))^2 + ((-1)-(0.2 \times (-3) + 0.6 \times (-0.5)))^2 = 1.8^2 + 0.1^2 = 3.25$$

- For log loss
$$err_{log}^1 = \sum_{i=1}^2 \log (1 + \exp(-y_i f_1(x_i))) = \log(1+\exp(-(10 \times 2 + 4 \times 4))) + \log(1+\exp(-(-1)(10 \times (-3) + 4 \times (-0.5)))) = \log(1+\exp(-36)) + \log(1+\exp(-32)) \approx 10^{-14}$$
$$err_{log}^2 = \sum_{i=1}^2 \log (1 + \exp(-y_i f_2(x_i))) = \log(1+\exp(-(0.2 \times 2 + 0.6 \times 4))) + \log(1+\exp(-(-1)(0.2 \times (-3) + 0.6 \times (-0.5)))) = \log(1+\exp(-2.8)) + \log(1+\exp(-0.9)) \approx 0.4$$
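These hand computations can be checked numerically; a small sketch with NumPy (the variable names are illustrative):

```python
import numpy as np

X = np.array([[2.0, 4.0], [-3.0, -0.5]])
Y = np.array([1.0, -1.0])
a1 = np.array([10.0, 4.0])    # coefficients of f_1
a2 = np.array([0.2, 0.6])     # coefficients of f_2

def quadratic(a):
    # sum over patients of (y_i - f(x_i))^2
    return float(np.sum((Y - X @ a) ** 2))

def logloss(a):
    # sum over patients of log(1 + exp(-y_i * f(x_i)))
    return float(np.sum(np.log1p(np.exp(-Y * (X @ a)))))

print(quadratic(a1))            # 2186.0
print(round(quadratic(a2), 2))  # 3.25
print(round(logloss(a2), 2))    # 0.4
```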

So according to you, which one is best? $f_1$ says $36$ to predict $+1$ and $-32$ to predict $-1$, while $f_2$ says $2.8$ to predict $+1$ and $-0.9$ to predict $-1$.

If you want to say $f_2$, it's because I cheated a little. If you look at the previous table, when using the log loss, $f$ is not purely linear : $f$ is supposed to be a transformation of a linear function. This transformation's role is to scale the function between 0 and 1. We then multiply this transformation by 2 and subtract 1 to get something between -1 and 1. The first transformation is called the sigmoid : $g(t) = \frac{1}{1+\exp(-t)}$

- $f_1(x) = \frac{2}{1+\exp(-(10x_1+4x_2))}-1$. The predicted values are then $1$ and $-1$.
- $f_2(x) = \frac{2}{1+\exp(-(0.2x_1+0.6x_2))}-1$. The predicted values are then $0.885$ and $-0.422$.
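Applying this rescaled sigmoid to the raw linear scores computed above reproduces those predicted values; a minimal sketch:

```python
import math

def scaled_sigmoid(t):
    # 2 * sigmoid(t) - 1 maps a raw linear score into (-1, 1)
    return 2 / (1 + math.exp(-t)) - 1

# raw linear scores of f_2 on the two patients
print(round(scaled_sigmoid(2.8), 3))   # 0.885
print(round(scaled_sigmoid(-0.9), 3))  # -0.422
```

For $f_1$, the huge scores $36$ and $-32$ saturate the sigmoid, giving predictions indistinguishable from $1$ and $-1$.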

In the end, the choice of the loss function really matters and depends on the problem. If you want to predict a binary output, you'd better use the log loss.

For instance, let's imagine a binary classification problem : some dots are blue, others are red. We want to find a function that cuts the space so as to have the blue dots on one side and the red dots on the other.

There are two kinds of behaviours that can lead to overfitting :

- Adding too many variables that have no explanatory power. One should use statistical methods to select the "good" variables
- Choosing too complex a model

The most common method to guard against it is called cross-validation.

But in fact, unsupervised learning is much more than just clustering. It is also about discovering latent factors, matrix completion, ...

If we denote by $\mu_k$ the centroid of each of the $K$ clusters, and by $m$ the function that, for each data point $x_i$, returns the number of the cluster to which $x_i$ belongs (so $m(i) \in \{1, \cdots, K\}$), the K-means objective is to minimize : $$\sum_{i=1}^n ||x_i-\mu_{m(i)}||^2$$ The algorithm works as follows:

- Choose K points at random from the n points in the dataset to be the centroids
- Compute to which cluster each of the data points belongs
- Compute the mean of the points in each cluster. Those means become the new centroids
- Go back to step 2 until the objective doesn't change much (meaning that the difference in the value of the objective between two iterations is lower than a given threshold like $10^{-3}$)
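The steps above can be sketched in a few lines of NumPy (a naive illustrative implementation, not a production one; the blob data is made up):

```python
import numpy as np

def kmeans(X, K, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    prev_obj = np.inf
    while True:
        # Step 2: assign each point to the cluster of its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the mean of each cluster becomes the new centroid.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 4: stop when the objective barely changes between iterations.
        obj = np.sum((X - centroids[labels]) ** 2)
        if prev_obj - obj < tol:
            return labels, centroids
        prev_obj = obj

# two well-separated blobs of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
labels, centroids = kmeans(X, K=2)
print(np.bincount(labels))  # roughly 50 points per cluster
```

On well-separated blobs like these, the algorithm recovers the two groups; on harder data the random initialisation matters, which is why libraries restart it several times.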

K-means can also lead to very unbalanced clusters: one cluster can contain 10% of the data, while another contains 70% of it.

- Expectation-Maximisation (EM) algorithm
- Hierarchical clustering. For instance, start with the n points and glue together the two closest ones. Compute their mean to form a new data point. Iterate until you reach the top.
- Clara (Clustering LARge Applications)
- ...

The most common dimensionality reduction technique is PCA (Principal Component Analysis), which consists in selecting the most important directions. For instance, if the points form a very wide ellipse, we can keep the projection of the points on the wide direction as a lower dimensional representation of the points.
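The ellipse example can be sketched with a singular value decomposition, which is one standard way to compute PCA (the data here is synthetic, built to be wide in one direction):

```python
import numpy as np

rng = np.random.default_rng(0)
# 2-D points forming a wide ellipse: a lot of spread along x, little along y
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.3])

Xc = X - X.mean(axis=0)                 # PCA works on centred data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
wide_direction = Vt[0]                  # first principal direction
X_1d = Xc @ wide_direction              # 1-D representation of the points

explained = s[0] ** 2 / np.sum(s ** 2)  # share of variance kept
print(round(explained, 2))              # close to 0.99
```

Here the one-dimensional projection keeps almost all the variance, which is exactly the situation where dropping the other direction costs little.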

- The agent perceives state $s_t$
- The agent performs action $a_t$
- The environment evolves to state $s_{t+1}$
- The agent receives reward $r_t$
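This interaction loop can be sketched as follows, with a made-up toy environment (the class and its reward scheme are purely illustrative, not from the post):

```python
import random

random.seed(0)

# Toy chain environment: states 0..4, the action moves left (-1) or right (+1),
# and the agent gets reward 1 whenever it lands on state 4.
class ChainEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state = min(4, max(0, self.state + action))
        reward = 1 if self.state == 4 else 0
        return self.state, reward

env = ChainEnv()
total_reward = 0
for t in range(50):
    s_t = env.state                  # 1. the agent perceives state s_t
    a_t = random.choice([-1, 1])     # 2. the agent performs action a_t (random policy)
    s_next, r_t = env.step(a_t)      # 3. the environment evolves to s_{t+1}
    total_reward += r_t              # 4. the agent receives reward r_t
print(total_reward)
```

A random policy collects little reward; the whole point of reinforcement learning is to replace `random.choice` with a learned policy.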

An example of a Markov chain is a walking process to go from a point A to a point B, where you want to find the quickest way to reach B. From time to time, you check your map to see if you're walking in the right direction. The direction you should take to reach B only depends on where you are right now and not on where you've been before.

A Markov Decision Process is formed of :

- The state space $X$. It is all the possible states of the context the agent is living in
- The action space $A$. It is all the possible actions the agent can perform at each time step
- The transition probability $p(y|x,a)$. It models the probability of ending in state $y$ if the agent performs action $a$ in state $x$
- The reward $r(x,a,y)$, which represents the outcome received when performing action $a$ while the state evolves from $x$ to $y$

The aim is to find the policy $\pi$ that will tell us what action to perform in state $x$ at time $t$ in order to maximize the sum of the rewards. So the action $a$ will be $\pi_t(x)$. Formally, the aim is to find the policy that will maximise the following quantity, called the value function : $$V^{\pi}(x) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r(x_t,\pi(x_t))|x_0=x; \pi]$$ Let's stop a moment on this formula.

- Why an expectation? The process is probabilistic, meaning that taking the same action twice in the same state won't necessarily lead to the same next state. So the estimation must be averaged over the many times action $a$ is taken in state $x$. That is what the expectation models
- Why $\gamma$? Mathematically, we need the infinite sum to exist, meaning it must be finite. To do so, we weight the reward by a factor $\gamma$ between 0 and 1 (excluded). If $\gamma$ is close to 0, the value function focuses on short term rewards; if close to 1, on long term rewards
- What is $x_0$? $x_0$ is the starting state, which conditions the whole strategy for the problem being studied. Going from Atlanta to Chicago is not the same problem as going from New York to Chicago
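The effect of $\gamma$ is easy to check numerically. For a constant reward stream $r=1$, the discounted sum $\sum_t \gamma^t$ converges to $1/(1-\gamma)$; a small sketch:

```python
# Truncating the infinite sum at a large horizon is enough to see convergence.
def discounted_value(gamma, horizon=10_000):
    return sum(gamma ** t for t in range(horizon))

print(round(discounted_value(0.5), 2))   # 2.0   -> short-sighted agent
print(round(discounted_value(0.99), 2))  # 100.0 -> far-sighted agent
```

The closer $\gamma$ is to 1, the more future rewards weigh in the value function.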

The two best-known MAB algorithms are :

- $\epsilon$-greedy : $\epsilon$ per cent of the time, choose an arm at random and observe what happens. The rest of the time, choose the best action observed so far. The higher $\epsilon$, the more you explore. This strategy is obviously not optimal in most cases, but it addresses the issue. Indeed, you keep testing, from time to time, new arms or previously suboptimal arms. You want to keep testing suboptimal arms because maybe the few times you tested them you were unlucky, so you want to be sure they are really suboptimal.
- Upper Confidence Bound (UCB) : The UCB strategy performs exploration and exploitation at the same time! The idea is to first compute the average performance of each arm, and instead of pulling the best arm, as you would do if you were purely exploiting, you compute a confidence interval around that average performance. The upper bound of this confidence interval is a function of the number of times you have pulled this particular arm and the total number of pulls. Indeed, the more often you pull an arm, the more confident you are about its performance. Let's suppose you have pulled arm $j$ $n_j$ times, out of $n$ pulls in total. Moreover, let's note $r_j^n$ the average reward of arm $j$ after $n$ trials. The upper bound is computed as follows :
$$ub_j^n = r_j^n +\sqrt{\frac{ 3 \log n}{n_j}}$$
This bound increases with $n$ and decreases with $n_j$. It increases with $n$ slower than it decreases with $n_j$, because otherwise the bound would keep increasing unless you pulled the same arm over and over. Thus, applying this formula, you get a box around the mean which is wide the first time you pull an arm and shrinks over time.

Still, we haven't answered the question: which arm should I pull? Let's imagine that we have 4 arms, and the corresponding graph where each red line corresponds to the average (expected) reward of each arm, and the blue box represents the confidence bound around it. What arm would you pull? (The wrong answer is A ...) The UCB tells you to choose arm C, because it is the one with the highest upper confidence bound. The idea is that, in the best case scenario, it is the one that will perform the best. If it is not the best, at least choosing this arm will reduce its uncertainty, because we will gain knowledge about it. In both cases we win something. The catchphrase of UCB is "Optimism in Face of Uncertainty", meaning exactly that: when facing uncertainty, we choose the most optimistic performance among the available arms. Of course, if the upper bound of arm A were above C's, we would pick A. We don't pick C because its uncertainty is wide, but because its upper bound is the highest.
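Putting the formula to work, here is a minimal UCB simulation on three Bernoulli arms (the success probabilities are made up for illustration):

```python
import math
import random

random.seed(0)

# Bernoulli arms with success probabilities unknown to the agent.
probs = [0.2, 0.5, 0.8]
counts = [0] * len(probs)    # n_j: number of pulls of arm j
sums = [0.0] * len(probs)    # cumulative reward of arm j

for n in range(1, 2001):
    if 0 in counts:
        j = counts.index(0)  # pull each arm once to initialise the averages
    else:
        # pull the arm with the highest upper confidence bound ub_j^n
        ucb = [sums[k] / counts[k] + math.sqrt(3 * math.log(n) / counts[k])
               for k in range(len(probs))]
        j = max(range(len(probs)), key=lambda k: ucb[k])
    reward = 1 if random.random() < probs[j] else 0
    counts[j] += 1
    sums[j] += reward

print(counts)  # the best arm (p = 0.8) ends up pulled most often
```

The suboptimal arms still get pulled a few hundred times each, which is the price of shrinking their confidence boxes, but the bulk of the pulls goes to the best arm.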

visitor id | session id | timestamp | url | ... | action | page | custom |
---|---|---|---|---|---|---|---|
12345 | 7484567 | 1410189398 | /mysite/homepage.com | ... | impressions | homepage | {'list':NULL,'product':NULL,'price':NULL,'qt':NULL} |
784655 | 134324 | 1410194321 | /mysite/shirts/redshirts.com | ... | view | product | {'list':NULL,'product':1242,'price':45,'qt':4} |
12345 | 959575 | 1410245876 | /mysite/dress/flowerdresses.com | ... | view | product | {'list':NULL,'product':784,'price':12,'qt':1} |
6574 | 5749 | 1410374676 | /mysite/dress.com | ... | impressions | list | {'list':[7486,1453,897424,178461,1873486,186749],'product':NULL,'price':NULL,'qt':NULL} |

The last thing to do is : ALWAYS SCALE THE DATA. For now the variables can take any values: some are binary, some are rates (so between 0 and 1), some are continuous (price, time spent, ...). All those variables are not comparable with one another, so you need to transform them. There are mainly two possibilities :

- Divide each variable by its maximum absolute value, so that all variables are between -1 and 1 (between 0 and 1 if positive)
- Subtract the mean and divide by the standard deviation, so that all variables are spread around 0 with unit standard deviation
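Both options are one-liners; a small sketch on a made-up table mixing a price, a rate and a binary flag:

```python
import numpy as np

# columns: a price, a rate in [0, 1], a binary flag -- not comparable as-is
X = np.array([[100.0, 0.2, 1.0],
              [250.0, 0.9, 0.0],
              [ 40.0, 0.5, 1.0]])

# Option 1: divide each column by its max absolute value -> values in [-1, 1]
X_scaled = X / np.abs(X).max(axis=0)

# Option 2: subtract the mean and divide by the standard deviation
# -> each column has zero mean and unit standard deviation
X_standardised = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.max())                               # 1.0
print(np.allclose(X_standardised.mean(axis=0), 0))  # True
```

After either transformation, the price no longer dwarfs the rate and the flag, so a model weights them on comparable scales.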

It is now time to run the machine learning algorithm, while paying attention to the relevance of the variables.