This article is not fully finished yet; I'll try to find time to complete it, but I think the methodology is already interesting.

I have always had the feeling that, with a good algorithm, it's possible to be reasonably confident about who to bet on in sport. Sure, it won't be a risk-free strategy; if I had found such an algorithm, you wouldn't be reading about it right now! But if you find one, feel free to contact me ;)

Even keeping in mind that "if the bookmakers do this, it's because they win in the end", I may find an algorithm that lowers my risk by giving me good predictions, given the history, with a confidence interval, sparing me from having to remember the whole history of every player, and maybe it will make me a little money! For several reasons, I think it may be possible.

The first one is that I think the bookmakers and I don't have the same output to predict. If I were designing an algorithm for a betting institution, I wouldn't predict who will win the match, but what the distribution of the bets on each team will be. Let's take an example from the last football World Cup. I live in France. The quarter-final was France vs Germany. Around me everybody bet on France, and I assume that in Germany everybody bet on Germany. So if, as a betting institution, I only predicted the winner to compute my odds, and I predicted Germany, then if France won I would lose money, because most people around me bet on France. On the other hand, if I know the distribution of the bets, I can choose my odds so as to come out ahead regardless of who wins the match.
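To make this concrete, here is a toy sketch of that balanced-book idea. The 80/20 split of the stakes and the 5% margin are numbers I made up for illustration; the point is simply that odds derived from the bet distribution, rather than from the predicted winner, guarantee the bookmaker a cut whatever happens.

```python
def balanced_odds(bet_fractions, margin=0.05):
    """Given the fraction of total stakes placed on each outcome,
    return decimal odds such that the bookmaker pays out at most
    (1 - margin) of the pool, whichever outcome occurs."""
    return {team: (1 - margin) / f for team, f in bet_fractions.items()}

# France vs Germany: suppose 80% of the money is on France.
odds = balanced_odds({"France": 0.8, "Germany": 0.2})
print(odds)  # {'France': 1.1875, 'Germany': 4.75}
# If France wins:  payout = 0.8 * 1.1875 = 0.95 of the pool -> 5% kept.
# If Germany wins: payout = 0.2 * 4.75   = 0.95 of the pool -> 5% kept.
```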

The second reason is that I can use the odds computed by the betting institutions as an input to my model, and those odds reflect non-linear and non-trivial information about the game.

The third reason is that I can construct a betting strategy from the overall precision of my model. Let's take a simplified example and suppose I have a good algorithm: if I'm wrong I give them y, and if I'm right they give me (1+x)*y. Moreover, suppose that I have tested my algorithm and I know I'm right 4 times out of 5. Then I only have to bet on matches such that the money I win the 4 times I'm right exceeds the money I lose the fifth time:

4*(1+x)*y - 4*y > y, so if x > 0.25, I'm winning! Of course, the precision of my model will be lower on tight matches, and when my precision is good, x will be closer to 0.01, I think. So I believe a strategy exists if and only if I have a really, really good model for tight matches.
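The same break-even computation, generalised to an arbitrary precision p, as a quick sketch (the function name is mine): the expected net profit per bet is p*x*y - (1-p)*y, which is positive exactly when x > (1-p)/p.

```python
def breakeven_return(precision):
    """Minimum net return x (you stake y and win (1+x)*y) needed for a
    positive expected profit given the model's precision p:
    p * x * y - (1 - p) * y > 0  <=>  x > (1 - p) / p."""
    return (1 - precision) / precision

print(breakeven_return(0.8))   # 0.25, the 4-out-of-5 example above
print(breakeven_return(0.99))  # ~0.0101: odds near x = 0.01 need ~99% precision
```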

#### But why tennis?

First, I like tennis, so it's easier for me to think of good variables that would impact the result. Second, I have to limit the bias, and I think the bias is lower in tennis than in other sports like football, for instance. To me it's a question of convergence and personal motivation. The bias is smaller in individual sports because during a match the pressure is always on you, and to keep winning you must always be at your best. Regarding convergence, scoring in tennis is much more frequent than in football, where a single mistake can cost you a goal and the match, whereas in tennis you can always come back.

I got all the data from here. I took the matches from 2006 to 2014. They have data since 2000, but for 2000 they don't have the betting odds and for 2005 they don't have the number of points per match, so I started in 2006, which gives me 23 942 matches. Then I dropped the rows where one of the players was ranked outside the top 100, because too few observations are available for such players, and a top-10 player against a top-100 player needs no modelling to predict the winner with good confidence... I ended up with 15 338 matches.
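As a sketch, the filtering step could look like this in pandas; the file names and column names are hypothetical and must be adapted to the actual files of the data source.

```python
import pandas as pd

# Load one CSV per season (hypothetical file names).
matches = pd.concat(
    [pd.read_csv(f"matches_{year}.csv") for year in range(2006, 2015)],
    ignore_index=True,
)

# Keep only matches where both players are in the top 100
# (hypothetical column names).
top100 = (matches["rank_p1"] <= 100) & (matches["rank_p2"] <= 100)
matches = matches[top100]
print(len(matches))  # 15 338 matches on the 2006-2014 data described above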

The goal of this article is not to try hundreds of algorithms to find the best one, as we could for a Kaggle challenge (I have no time for this), but to present different machine learning techniques applied to a real-world and fun project. Maybe later I will post a second article on the same dataset to try and compare different methodologies. I'm writing this article while coding the algorithm, so I don't know what it will yield, but I have in mind a methodology that I want to test, mixing the EM algorithm, non-linear dimensionality reduction and kernel methods. I will thus give the steps and functions used for the prediction, and the precision obtained for different variations (size of the dimensionality reduction and choice of kernel). Feel free to adapt and improve the code, and also to extract a strategy based on the precision as a function of the match uncertainty, as described above. Tell me if you get rich ;)

The key points I think about are:

- The relative strength of the players
- The relative dynamics of the players (winning trend)
- Who has the psychological edge?
- Tiredness
- The context advantage (who is best on this specific surface, indoor/outdoor, in this tournament, ...)

A visualisation of those key points for each match in my dataset will be available in the viz part here.

This gives me, for each match, two relative-strength vectors s of dimension 9 (player 1 relative to player 2, and player 2 relative to player 1). From this dataset, I performed dimensionality reduction by applying an EM algorithm with K clusters, in order to take non-linear effects into account. To choose the number K of clusters, I ran the EM algorithm for K from 3 to 15 and plotted the winning rate per cluster for each K. My goal was to have at least one cluster with a high winning rate, one with a low one, and one around 50%; it doesn't matter if several clusters end up with a similar winning rate, as they may still help to differentiate once crossed with the other key points. See here for the plot for each key point. To narrow down the value of K, I also performed several K-means clusterings for K from 3 to 20, used the elbow method to determine the region of optimal K (around 11-12), and then chose the best value in this region graphically. I ended up with, for each match, two vectors s' of dimension 11, where the k-th value is the probability that this relative strength belongs to cluster k.
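The text doesn't pin down an implementation, but as a minimal sketch, scikit-learn's GaussianMixture runs exactly this kind of EM and exposes the per-cluster membership probabilities; the `strengths` array below is a placeholder for the real 9-dimensional vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for the (n_vectors, 9) relative-strength data.
strengths = np.random.rand(1000, 9)

K = 11
gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0)
gmm.fit(strengths)

# Soft assignment: row i gives the probability of vector i belonging
# to each of the K clusters -- the 11-dimensional s' of the text.
s_prime = gmm.predict_proba(strengths)
print(s_prime.shape)  # (1000, 11); each row sums to 1
```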

From these key points I build two datasets. The first one is obtained by pasting all the key points side by side, which gives me for each match a vector of length 46 = 11 (strength) + 9 (dynamic) + 8 (psychological edge) + 7 (tiredness) + 11 (context).

The second one is obtained by computing the outer product between all my key points, which gives me for each match a vector of length 95 040 = 11 (strength) * (9+1) (dynamic) * (8+1) (psychological edge) * (7+1) (tiredness) * (11+1) (context). The +1 comes from the fact that I append a constant 1 to a key point so that every variable is also kept as a standalone term once crossed (the strength probabilities already sum to 1, so no constant is needed there). This dataset gave me a lot of trouble; it's hard to manage that many columns!
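Both constructions fit in a few lines of numpy; this is a sketch under my reading of the counts above (constant appended to every key point except strength, which matches the 95 040 total), with random placeholders for the real key-point vectors.

```python
import numpy as np
from functools import reduce

def cross_features(vectors):
    """Flatten the successive outer products of a list of 1-D vectors."""
    return reduce(np.outer, vectors).ravel()

# Hypothetical key-point vectors with the dimensions used in the text.
strength = np.random.rand(11)   # cluster probabilities, already sum to 1
dynamic, psych = np.random.rand(9), np.random.rand(8)
tired, context = np.random.rand(7), np.random.rand(11)

# First dataset: simple concatenation.
concat = np.concatenate([strength, dynamic, psych, tired, context])
print(concat.shape)  # (46,)

# Second dataset: cross everything; a constant 1 is appended to each
# key point except strength so each factor also appears standalone.
with_const = [np.append(v, 1.0) for v in (dynamic, psych, tired, context)]
crossed = cross_features([strength] + with_const)
print(crossed.shape)  # (95040,) = 11 * 10 * 9 * 8 * 12
```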

Then, for each dataset, in order to reduce the dimension, I implemented the Laplacian Eigenmaps method introduced by Mikhail Belkin and Partha Niyogi. Finally, I projected onto subspaces of dimension 3 to 19, and for each of them trained an SVM classifier with different kernels (linear, polynomial and Gaussian).
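As a sketch of this reduction-plus-classification loop (my own implementation isn't shown here), scikit-learn's SpectralEmbedding implements Laplacian Eigenmaps. Note that SpectralEmbedding has no out-of-sample transform, so in this simplified version the embedding is computed on the full dataset before cross-validating the SVM, which is slightly optimistic; X and y below are placeholders for one of the two datasets and the match outcomes.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 46)          # placeholder feature matrix
y = np.random.randint(0, 2, 500)     # placeholder outcomes

for dim in range(3, 20):
    # SpectralEmbedding is scikit-learn's implementation of the
    # Laplacian Eigenmaps of Belkin & Niyogi.
    X_low = SpectralEmbedding(n_components=dim, n_neighbors=10).fit_transform(X)
    for kernel in ("linear", "poly", "rbf"):
        score = cross_val_score(SVC(kernel=kernel), X_low, y, cv=5).mean()
        print(f"dim={dim:2d} kernel={kernel:6s} precision={score:.3f}")
```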

| Dataset | - |
|---|---|
| Kernel | - |
| Dimension | - |
| Precision | - |
| Reversed | - |

Confusion matrix:

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP = ? | FN = ? |
| Actual negative | FP = ? | TN = ? |