Research article - (2006)05, 480 - 487
Predicting the Match Outcome in One Day International Cricket Matches, while the Game is in Progress
Michael Bailey¹, Stephen R. Clarke²
Key words: Linear regression, live prediction, market efficiency, betting
In ODI cricket the aim of the team batting first is to score as many runs as possible in the allotted time (usually 50 six-ball overs). If the first team scores more runs than the second team, the margin of victory (MOV) can readily be expressed as the difference in runs between the two teams. The aim of the side batting second is to score more runs than the first team. Because the game is deemed finished as soon as the team batting second achieves its target, the MOV in that case is usually expressed in terms of resources (wickets and balls) remaining, rather than runs. In order to develop a predictive process for match outcomes, a consistent measure of the MOV is required.

This can be achieved by following the work of Duckworth and Lewis. Frank Duckworth and Tony Lewis developed a now well-known system for resetting targets in ODI matches shortened due to rain. Although this system has undergone several refinements in recent years, the general way in which the Duckworth-Lewis (D-L) method is calculated has not changed: wickets and balls remaining are expressed as resources available and converted to runs. Whilst the D-L approach was specifically designed to improve ‘fairness’ in interrupted one-day matches, (de Silva et al.,

When an ODI match is won by the team batting first, the MOV is readily determined by the difference in runs scored. When the match is won by the team batting second, the MOV can be found by multiplying the first innings run total by the corresponding modified percentage of resources remaining as given by (1). By referencing the MOV so that a ‘home’ win has a positive value and an ‘away’ win has a negative value, it can be seen from
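The MOV convention described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function and parameter names are hypothetical, and the modified percentage of resources remaining (equation (1), not reproduced here) is assumed to be supplied by the caller.

```python
def margin_of_victory(first_total, second_total, home_batted_first,
                      second_team_won, resources_remaining_pct=0.0):
    """Return the MOV in runs, signed so that a home win is positive.

    resources_remaining_pct is the (modified) D-L percentage of resources
    the chasing side still had when it passed the target. It is an input
    here; the formula that produces it is not part of this sketch.
    """
    if second_team_won:
        # Chasing side won: convert its unused resources into runs by
        # scaling the first innings total.
        mov_runs = first_total * resources_remaining_pct / 100.0
        home_won = not home_batted_first
    else:
        # Side batting first won: MOV is the plain difference in runs.
        mov_runs = first_total - second_total
        home_won = home_batted_first
    return mov_runs if home_won else -mov_runs


# Home side bats first and defends its total: positive MOV.
home_defends = margin_of_victory(250, 230, True, False)
# Away side chases the target with 25% of resources unused: negative MOV.
away_chases = margin_of_victory(200, 201, True, True, 25.0)
```

Expressing both kinds of result in runs, with a single sign convention, is what allows one regression model to cover every match.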
Statistical analysis
All analysis was performed using SAS version 8.2 (SAS Institute Inc., Cary, NC, USA). Multiple linear regression models were constructed using a stepwise selection procedure and validated using a backward elimination procedure. To increase the robustness of the prediction models, a stricter threshold for statistical significance was applied, with all variables in the models achieving a level of significance below p = 0.005. Comparisons between continuous, normally distributed variables were made using Student's t-tests.
Prediction models for MOV
Using match and player information from 1800 ODIs played prior to Jan 2002, (Bailey,

Prediction variables of experience, quality and form were derived by developing separate measures for both teams and then subtracting the away team values from the home team values. This effectively references the final result in terms of the home team. Indicator variables were created to identify matches played at a neutral venue and matches where the two competing teams were clearly from different class structures (established nation versus developing nation). From

Because the MOV in the regression model is nominally structured in favour of the home team, the intercept term in the regression equation reflects home advantage (HA). It can be seen from

The difference in quality, as measured by the difference in averages between the two teams for all past matches, was by far the strongest predictor, explaining 20.7% of the variation in the updated model. The best measure of current form was the difference in averages for the past 10 matches, whilst the difference in overall experience (games played by the country) between the home and away team was also statistically significant.

Whilst no statistically significant difference could be found in parameter estimates, the effect of a difference in class (when a developing cricket nation played host to an established cricket nation) declined (29.6 runs vs. 25.1 runs) as developing nations gained more experience. Similarly, the effect of HA rose slightly (13.4 runs vs. 13.9 runs) with more data, while the effect of a neutral venue was slightly lower (8.6 runs vs. 8.2 runs). Not surprisingly, all variables in the model achieved a higher level of statistical significance when additional data were used.
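The home-minus-away construction of the predictors can be illustrated as follows. This is a sketch under stated assumptions: the class, field and key names are hypothetical, the quantity being averaged is left generic, and the coefficients passed to `predict_mov` are placeholders chosen for illustration rather than the fitted parameter values, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class TeamProfile:
    overall_average: float  # quality: average over all past matches
    recent_average: float   # form: average over the past 10 matches
    games_played: int       # experience: games played by the country


def mov_features(home, away, neutral_venue, class_mismatch):
    """Build home-minus-away differences plus the two indicator variables."""
    return {
        "quality_diff": home.overall_average - away.overall_average,
        "form_diff": home.recent_average - away.recent_average,
        "experience_diff": home.games_played - away.games_played,
        "neutral_venue": 1 if neutral_venue else 0,
        "class_mismatch": 1 if class_mismatch else 0,
    }


def predict_mov(features, coefficients, intercept):
    """Linear predictor: the intercept plays the role of home advantage."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())


home = TeamProfile(overall_average=10.0, recent_average=5.0, games_played=300)
away = TeamProfile(overall_average=4.0, recent_average=7.0, games_played=100)
features = mov_features(home, away, neutral_venue=False, class_mismatch=False)

# Placeholder coefficients (NOT the paper's estimates).
placeholder_coefficients = {
    "quality_diff": 1.0,
    "form_diff": 0.5,
    "experience_diff": 0.0,
    "neutral_venue": 0.0,
    "class_mismatch": 0.0,
}
predicted = predict_mov(features, placeholder_coefficients, intercept=13.9)
```

Because every predictor is a difference referenced to the home team, a match between two identical teams at a non-neutral venue is predicted to finish with an MOV equal to the intercept.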
Prediction model for team totals
Using past averages and exponential smoothing, prediction variables relating to past performance were created. Using multiple linear regression, a six-variable model was constructed. The resulting parameter values are given in

Interestingly, when using a stepwise selection procedure, the strongest predictor of the total scored by the team batting first was in fact the average of the past MOV between the two teams. The next strongest predictors in the model were derived from the past first innings scores achieved by the batting team as well as scores conceded by the bowling team. HA was the next most important predictor, with a team playing in its home country scoring an additional 15 runs. A second surrogate marker for the quality of the batting team was given by the average past MOV for the batting team. The final variable that was found to be highly statistically significant (p = 0.0004) was derived from all past first innings played at the venue. This helped account for pitch conditions and venue size.

Whilst over 23% of the variation in MOV could be explained by the multivariate model, the total of the team batting first was not as predictable, with an R-square statistic of 19.1%. Using a holdout sample of 100 completed matches played in the year 2005, the regression model successfully predicted the winning team 71% of the time and had an Absolute Average Error (AAE) between the predicted and actual margin of 55.8 ± 4.1 runs. These results compare favourably against the original prediction model of (Bailey,

Using the same holdout sample of 100 matches, the AAE for the difference between the predicted and actual totals of the team batting first was 42.5 ± 3.2 runs.

By referencing the MOV in terms of the team batting first rather than the home team, a predicted total for the team batting second can be given by

From the chosen holdout sample of 100 matches, the AAE for the difference between the predicted and actual totals of the team batting second was 47.1 ± 4.0 runs.
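The holdout measures reported above can be computed along the following lines. This is a minimal sketch with hypothetical function names. It assumes the ± figure is the standard error of the mean absolute error, which the text does not state explicitly, and it counts a winner as correctly predicted when the predicted and actual MOV share the same sign.

```python
import math
import statistics


def absolute_average_error(predicted, actual):
    """Return (AAE, standard error of the AAE) for paired predictions."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    mean_error = statistics.mean(errors)
    # Assumption: the reported +/- value is a standard error of the mean.
    std_error = statistics.stdev(errors) / math.sqrt(len(errors))
    return mean_error, std_error


def winner_accuracy(predicted_mov, actual_mov):
    """Share of matches where the predicted and actual MOV agree in sign."""
    correct = sum(1 for p, a in zip(predicted_mov, actual_mov) if p * a > 0)
    return correct / len(predicted_mov)


# Toy numbers for illustration only, not data from the study.
toy_predicted = [10, -20, 30, 5]
toy_actual = [20, -10, -10, 15]
aae, se = absolute_average_error(toy_predicted, toy_actual)
accuracy = winner_accuracy(toy_predicted, toy_actual)
```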
With the use of the D-L method to convert available resources into runs, an updated total for the team batting first is calculated at the completion of each over by combining the actual total with the predicted total for the remainder of the innings. Using complete over-by-over information for the 100 match holdout sample, it can be seen from

By subtracting the pre-match predicted total from the updated prediction of the total, a performance indicator can be derived for whether each batting team is performing above or below expectation. With the use of the performance indicator, an updated prediction for the MOV can then be readily obtained. From

As shown by (Bailey,

Using prediction models for the team total and MOV, the predicted probability for Australia to win was calculated both before and during the match, and compared with the market price offered by Betfair (market probabilities included 5% for commission). Where the predicted probability can be seen to exceed the market probability, the ‘in play’ market can be thought to be inefficient. From

Chasing 323 runs to win the match, New Zealand started slowly. With some big hitting towards the end of the innings, the Black Caps clawed their way into contention and started the final over as favourites, only requiring six runs to win. Unfortunately, two wickets falling in the final over gave victory to Australia by 2 runs.
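The over-by-over update and the performance indicator described at the start of this section can be sketched as below. The text does not spell out how the predicted total for the remainder of the innings is obtained, so this sketch makes a simple assumption: the pre-match predicted total is scaled by the D-L percentage of resources still available. All names are hypothetical.

```python
def updated_total(runs_so_far, pre_match_total, resources_remaining_pct):
    """Combine the actual score with a prediction for the rest of the innings.

    Assumption: expected remaining runs are the pre-match predicted total
    scaled by the D-L percentage of resources still available.
    """
    expected_remaining = pre_match_total * resources_remaining_pct / 100.0
    return runs_so_far + expected_remaining


def performance_indicator(updated, pre_match_total):
    """Positive when the batting side is ahead of its pre-match expectation."""
    return updated - pre_match_total


# Toy example: pre-match prediction of 240; the side has 130 runs with
# 50% of its resources remaining.
current_projection = updated_total(130, 240, 50.0)
indicator = performance_indicator(current_projection, 240)
```

In this toy case the projection is 250 runs, so the indicator is +10: the batting side is running ten runs ahead of expectation.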
In July 2005 the International Cricket Council (ICC) announced a new set of rules to be applicable to ODI matches. An increase in fielding restrictions and the introduction of a substitute player (super-sub) significantly increased the total achieved by the team batting first, by more than 20 runs (252.7 ± 8.0 vs. 229.7 ± 1.2, p = 0.002). As these changes occurred within the holdout sample of the data used, it is unclear how these modifications would affect the prediction process.

Whilst the price and volume of bets traded are available through Betfair (see

In Australia, federal laws prevent Australian citizens from placing bets over the internet after a sporting event has commenced. Paradoxically, Australian citizens can place bets ‘in the run’ provided the bets are placed over the phone. This inconvenience causes a greater delay between observing an inefficient price and actually placing a bet.
Conclusions
Multiple linear regression provides a useful way to assign winning probabilities to the competing teams in ODI matches. With the use of the D-L approach, this process can be readily modified to produce ‘in the run’ predictions. Whilst a definitive analysis of the efficiency of the betting market is yet to be conducted, preliminary evidence suggests punters may be prone to over- or underestimate the true probability of the competing teams as the game progresses.
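One way to turn a predicted MOV into a winning probability, and to compare it with an exchange price, is sketched below. Both steps rest on assumptions that are not taken from the text: prediction errors are treated as normally distributed with a known standard deviation (`residual_sd`, a hypothetical parameter), and the 5% commission is applied to net winnings before the decimal odds are inverted. Function names are illustrative.

```python
import math


def win_probability(predicted_mov, residual_sd):
    """P(actual MOV > 0), assuming normally distributed prediction errors."""
    z = predicted_mov / (residual_sd * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def market_probability(decimal_odds, commission=0.05):
    """Break-even probability of a back bet after commission on winnings."""
    net_odds = 1.0 + (decimal_odds - 1.0) * (1.0 - commission)
    return 1.0 / net_odds


def market_looks_inefficient(model_prob, decimal_odds, commission=0.05):
    """True when the model's probability exceeds the market's."""
    return model_prob > market_probability(decimal_odds, commission)


# Toy numbers for illustration only.
even_match = win_probability(0.0, 70.0)       # no predicted advantage
favourite = win_probability(70.0, 70.0)       # one standard deviation ahead
flagged = market_looks_inefficient(0.55, 2.0)
```

A predicted MOV of zero maps to a probability of one half, and the probability rises towards one as the predicted margin grows relative to the spread of prediction errors.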