Predicting the Match Outcome in One Day International Cricket Matches, while the Game is in Progress

Research article - (2006)05, 480 - 487

Michael Bailey¹, Stephen R. Clarke²

¹Department of Epidemiology & Preventive Medicine, Monash University, Australia
²Swinburne University of Technology, Melbourne, Australia

Michael Bailey
✉ Department of Epidemiology & Preventive Medicine, Monash University, Australia.
Email: Michael.Bailey@med.monash.edu.au

Received: -- -- Accepted: --
Published (online): 15-12-2006

ABSTRACT

Millions of dollars are wagered on the outcome of one day international (ODI) cricket matches, with a large percentage of bets occurring after the game has commenced. Using match information gathered from all 2200 ODI matches played prior to January 2005, a range of variables that could independently explain statistically significant proportions of variation associated with the predicted run totals and match outcomes was created. Such variables include home ground advantage, past performances, match experience, performance at the specific venue, performance against the specific opposition, experience at the specific venue and current form. Using a multiple linear regression model, prediction variables were numerically weighted according to statistical significance and used to predict the match outcome. With the use of the Duckworth-Lewis method to determine resources remaining, at the end of each completed over, the predicted run total of the batting team could be updated to provide a more accurate prediction of the match outcome. By applying this prediction approach to a holdout sample of matches, the efficiency of the “in the run” wagering market could be assessed. Preliminary results suggest that the market is prone to overreact to events occurring throughout the course of the match, thus creating brief inefficiencies in the wagering market.

Key words: Linear regression, live prediction, market efficiency, betting

Key Points

In excess of 80% of monies wagered on the outcome of ODI matches are placed after the match has commenced.
Using all past data from ODI matches, multiple linear regression models are constructed to predict team totals and margin of victory.
By combining match information with prediction models, an ‘in the run’ prediction process is created for ODI matches.

INTRODUCTION

The first official one day international (ODI) match was played in 1971 between Australia and England at the Melbourne Cricket Ground. Whilst ODI cricket has developed over the past 35 years (2300 matches), the general principles have remained the same. Both sides bat once for a limited time (maximum 50 overs), with the aim in the first innings to score as many runs as possible, and in the second innings to score more than the target set in the first innings. The high scoring nature of ODI matches ensures that team totals and differences between scores can be well approximated by a normal distribution. As shown by Bailey (2005), this facilitates the use of multiple linear regression to predict a margin of victory (MOV) prior to the commencement of the match. Using a similar approach, a multiple linear regression is also used to predict the number of runs scored by the team batting first. With the use of the Duckworth and Lewis (1999) approach of converting resources available into runs, as each over is bowled, the current total and the predicted total for the remaining overs are combined to produce an updated predicted total for the batting team. The difference between the pre-match predicted total and the updated predicted total provides a measure of how the batting team is performing through the course of its innings. This difference is then used to provide an updated prediction for the MOV.

METHODS

In ODI cricket the aim of the team batting first is to score as many runs as possible in the allotted time (usually 50 six-ball overs). If the first team scores more runs than the second team, the MOV can readily be expressed in terms of the run difference between the two teams. The aim of the side batting second is to score more runs than the first team. Because the game is deemed to be finished if the team batting second achieves its target, the MOV is then usually expressed in terms of resources (wickets and balls) remaining, rather than runs. In order to develop a predictive process for match outcomes, a consistent measure of the MOV is required. This can be achieved by following the work of Duckworth and Lewis (1999) to convert resources available into runs.

Frank Duckworth and Tony Lewis developed a now well-known system for resetting targets in ODI matches that were shortened due to rain. Although this system has undergone several refinements in recent years, the general way in which the Duckworth-Lewis (D-L) method is calculated has not changed, with wickets and balls remaining expressed as resources available and converted to runs. Table 1 shows an abbreviated version of the remaining resources (R) for wickets lost and balls remaining. A complete table and a detailed account of its derivation are given by Duckworth and Lewis (1999).

Whilst the D-L approach was specifically designed to improve ‘fairness’ in interrupted one-day matches, de Silva et al. (2001) found that when used to quantify the MOV, the D-L approach sometimes overestimated the available resources when the second team to bat won easily, and underestimated the available resources when the second team to bat only just won. By minimizing the Cramer-von Mises statistic for the differences between actual and predicted runs, de Silva et al. derived a formula, equation (1), that reduces this bias by modifying the remaining resources.

When an ODI match is won by the team batting first, the MOV is readily determined by the difference in runs scored. When the match is won by the team batting second, the MOV can be found by multiplying the first innings run total by the corresponding modified percentage of resources remaining as given by equation (1). By referencing the MOV so that a ‘home’ win has a positive value and an ‘away’ win has a negative value, it can be seen from Figure 1 that the underlying distribution for MOV can be well approximated by a Normal distribution.
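The conversion of a result into a signed MOV in runs can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name and arguments are hypothetical, and the modified percentage of resources remaining is passed in by the caller because neither the resource table nor the modification formula is reproduced here.

```python
def margin_of_victory(first_total, second_total, home_batted_first,
                      resources_remaining_pct=0.0):
    """Return the MOV in runs: positive for a home win, negative for an away win.

    resources_remaining_pct is the (modified) percentage of resources the
    chasing team still held when it passed the target. It is only used when
    the team batting second wins. All names here are illustrative.
    """
    if second_total > first_total:
        # Team batting second won: convert its unused resources into runs
        # by scaling the first innings total.
        margin_for_first = -first_total * resources_remaining_pct / 100.0
    else:
        # Team batting first won (or the scores were level): run difference.
        margin_for_first = first_total - second_total
    # Reference the margin to the home team.
    return margin_for_first if home_batted_first else -margin_for_first


# Home side bats first and wins by 20 runs.
print(margin_of_victory(250, 230, home_batted_first=True))
# Away side chases 200 with 20% of its resources unused.
print(margin_of_victory(200, 201, home_batted_first=True,
                        resources_remaining_pct=20.0))
```

Keeping the sign convention in one place means every downstream model sees a single, consistently referenced response variable.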

Statistical analysis

All analysis was performed using SAS version 8.2 (SAS Institute Inc., Cary, NC, USA). Multiple linear regression models were constructed using a stepwise selection procedure and validated using a backward elimination procedure. To increase the robustness of the prediction models, a stricter threshold for statistical significance was adopted, with all variables achieving a level of significance below p = 0.005. Comparisons between continuous, normally distributed variables were made using Student's t-tests.

Prediction models for MOV

Using match and player information from 1800 ODIs played prior to January 2002, Bailey (2005) combined measures of recent form, experience, overall quality and home ground advantage (HA) to produce a prediction model that was successfully used to identify inefficiencies in the betting market for ODI matches. Using 2200 matches played prior to January 2005, an updated version of this model was created and compared to the original.

Prediction variables of experience, quality and form were derived by developing separate measures for both teams and then subtracting the away team values from the home team values. This effectively references the final result in terms of the home team. Indicator variables were created to identify matches played at a neutral venue and matches where the two competing teams were clearly from different class structures (established nation versus developing nation).

From Table 2 it can be seen that the results of ODI matches are becoming more predictable, with the updated model explaining about 3.8 percentage points more of the variation in ODI outcomes (R-square: 23.4% vs. 19.6%, p < 0.0001).

Because the MOV in the regression model is nominally structured in favour of the home team, the intercept term in the regression equation reflects HA. It can be seen from Table 2 that HA is equivalent to about 14 runs and is highly statistically significant (p < 0.0001). Because one third of all ODIs have been played at neutral venues, a binary indicator variable was created to negate the HA for these games. As the regression process requires a ‘Home’ and an ‘Away’ team, when playing at a neutral venue the team with the most experience at the venue was assigned to be the ‘Home’ team. If all matches played at neutral venues were devoid of HA, then the coefficient of the neutral venue indicator would be the exact negative of the intercept term. This was not quite the case, with the neutral variable equivalent to about eight runs, suggesting a HA in neutral matches equivalent to about six runs. This six run difference could be thought of as a surrogate marker for the difference in familiarity with the venue between the competing teams.
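The arithmetic behind the residual advantage at neutral venues can be written out as a short worked example. The values 13.9 and 8.2 are the updated-model estimates of HA and of the neutral venue effect quoted in the text; the layout is only an illustration.

```latex
\begin{align*}
\text{advantage at a home venue}    &\approx 13.9 \text{ runs} \\
\text{advantage at a neutral venue} &\approx 13.9 - 8.2 = 5.7 \approx 6 \text{ runs}
\end{align*}
```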

The difference in quality, as measured by the difference in averages between the two teams for all past matches, was by far the strongest predictor, explaining 20.7% of the variation in the updated model. The best measure of current form was the difference in averages for the past 10 matches, whilst the difference in overall experience (games played by the country) between the home and away team was also statistically significant. Whilst no statistically significant difference could be found in parameter estimates between the original and updated models, the difference in class (when a developing cricket nation played host to an established cricket nation) declined (29.6 runs vs. 25.1 runs) as developing nations gained more experience. Similarly, the effect of HA rose slightly (13.4 runs vs. 13.9 runs) with more data, while the effect of a neutral venue was slightly lower (8.6 runs vs. 8.2 runs). Not surprisingly, all variables in the model achieved a higher level of statistical significance when additional data were used.

Prediction model for team totals

Figure 2 shows that the total of the team batting first can be well approximated by a normal distribution (mean = 229.7, standard error = 1.2). When the innings of the team batting first was shortened due to rain (about 13% of matches), the D-L method was once again incorporated to determine a projected total.

Using past averages and exponential smoothing, prediction variables relating to past performance were created. Using a multiple linear regression, a six variable model was constructed. The resulting parameter values are given in Table 3.
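As an illustration of how a form variable might be built with exponential smoothing, the sketch below updates a running value one match at a time. The smoothing constant `alpha` and the function name are assumptions made for this example; the values actually used in the models are not stated in the text.

```python
def exponentially_smoothed_average(scores, alpha=0.2):
    """Exponentially smooth a sequence of past scores (oldest first).

    alpha is a hypothetical smoothing constant in (0, 1]; larger values
    give more weight to recent matches.
    """
    if not scores:
        raise ValueError("at least one past score is required")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = float(scores[0])
    for score in scores[1:]:
        # Blend the newest score with the running smoothed value.
        smoothed = alpha * score + (1 - alpha) * smoothed
    return smoothed


past_first_innings_totals = [210, 245, 198, 260, 231]  # made-up example data
print(exponentially_smoothed_average(past_first_innings_totals))
```

A plain past average treats every match equally, whereas the smoothed value tracks recent form, which is why the two can enter a regression as separate predictors.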

Interestingly, when using a stepwise selection procedure, the strongest predictor of the total scored by the team batting first was in fact the average of the past MOV between the two teams. The next strongest predictors in the model were derived from the past first innings scores achieved by the batting team as well as scores conceded by the bowling team. HA was the next predictor of importance, with a team playing in its home country scoring an additional 15 runs. A second surrogate marker for the quality of the batting team was given by the average past MOV for the batting team. The final variable that was found to be highly statistically significant (p = 0.0004) was derived from all past first innings played at the venue. This helped account for pitch conditions and venue size.

Whilst over 23% of the variation in MOV could be explained by the multivariate model, the total of the team batting first was not as predictable, with an R-square statistic of 19.1%.

Using a holdout sample of 100 completed matches played in the year 2005, the regression model successfully predicted the winning team 71% of the time and had an Absolute Average Error (AAE) between the predicted and actual margin of 55.8 ± 4.1 runs. These results compare favourably against the original prediction model of Bailey (2005), which accurately identified the winning team 69.6% of the time and had an AAE of 54.6 ± 0.9 runs for a sample of 336 matches played between 2002 and 2004.
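The two holdout measures quoted here, the proportion of winners correctly identified and the AAE, can be computed along the following lines. This is a sketch with hypothetical names and made-up numbers, assuming predicted and actual margins share the same sign convention.

```python
def evaluate_predictions(predicted_margins, actual_margins):
    """Return (AAE, proportion of winners correctly predicted).

    A winner counts as correctly predicted when the predicted and actual
    margins have the same sign; an actual margin of zero counts as a miss.
    """
    if not predicted_margins or len(predicted_margins) != len(actual_margins):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predicted_margins, actual_margins))
    aae = sum(abs(p - a) for p, a in pairs) / len(pairs)
    correct = sum(1 for p, a in pairs if p * a > 0)
    return aae, correct / len(pairs)


# Made-up margins for three matches, in runs.
aae, accuracy = evaluate_predictions([10, -20, 30], [50, -10, -5])
print(aae, accuracy)
```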

Using the same holdout sample of 100 matches, the AAE for the difference between the predicted and actual totals of the team batting first was 42.5 ± 3.2 runs. By referencing the MOV in terms of the team batting first rather than the home team, a predicted total for the team batting second can be obtained by subtracting the predicted MOV from the predicted total of the team batting first.
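With the MOV referenced to the team batting first, so that it equals the first innings total minus the second innings total in runs, the predicted second innings total can be sketched in symbols as follows. The symbols are introduced here for illustration and are an assumption about notation, not a quotation of the original equation.

```latex
\[
\widehat{T}_2 = \widehat{T}_1 - \widehat{\mathrm{MOV}}_{\text{first}},
\qquad
\widehat{\mathrm{MOV}}_{\text{first}} =
\begin{cases}
\phantom{-}\widehat{\mathrm{MOV}}_{\text{home}} & \text{if the home team bats first,} \\
-\widehat{\mathrm{MOV}}_{\text{home}} & \text{if the away team bats first,}
\end{cases}
\]
```

Here \(\widehat{T}_1\) and \(\widehat{T}_2\) denote the predicted totals of the teams batting first and second, and \(\widehat{\mathrm{MOV}}_{\text{home}}\) is the predicted margin referenced to the home team.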

From the chosen holdout sample of 100 matches, the AAE for the difference between the predicted and actual totals of the team batting second was 47.1 ± 4.0 runs.

RESULTS

With the use of the D-L method to convert available resources into runs, at the completion of each over, an updated total for the team batting first is calculated by combining the actual total with the predicted total for the remainder of the innings.
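One plausible reading of this update is sketched below: the runs still to come are estimated by scaling the pre-match predicted total by the percentage of D-L resources remaining. That scaling rule, the function name and the example figures are assumptions for illustration; the resource percentage itself would come from the D-L table.

```python
def updated_predicted_total(runs_so_far, prematch_predicted_total,
                            resources_remaining_pct):
    """Combine the actual score with a prediction for the rest of the innings.

    Assumes the expected remaining runs equal the pre-match predicted total
    multiplied by the proportion of resources still available.
    """
    if not 0 <= resources_remaining_pct <= 100:
        raise ValueError("resources_remaining_pct must be between 0 and 100")
    expected_remaining = prematch_predicted_total * resources_remaining_pct / 100.0
    return runs_so_far + expected_remaining


# Illustrative figures: 100 runs scored with half the resources still in hand.
print(updated_predicted_total(100, 240, 50.0))
```

Under this rule the update equals the pre-match prediction before a ball is bowled and equals the actual total once all resources are used, which matches the behaviour described for the start and end of an innings.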

Using complete over-by-over information for the 100 match holdout sample, it can be seen from Figure 3 that the accuracy with which the total of the batting team can be predicted progressively improves throughout the course of the innings, with first innings totals predicted significantly more accurately than those of the second innings.

By subtracting the pre-match predicted total from the updated prediction of the total, a performance indicator can be derived for whether each batting team is performing above or below expectation.

With the use of the performance indicator, an updated prediction for the MOV can then be readily obtained.
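A minimal sketch of these two steps follows. The performance indicator is as defined in the text; the way the indicators are combined with the pre-match MOV (added for the team batting first, subtracted for the team batting second, with the MOV referenced to the team batting first) is an assumption made for this example, and all names are hypothetical.

```python
def performance_indicator(updated_total, prematch_total):
    """Runs above (positive) or below (negative) pre-match expectation."""
    return updated_total - prematch_total


def updated_mov(prematch_mov, indicator_first, indicator_second=0.0):
    """Update the MOV, referenced to the team batting first.

    Assumed rule: over-performance by the team batting first raises the
    margin, over-performance by the team batting second lowers it. During
    the first innings indicator_second is left at zero.
    """
    return prematch_mov + indicator_first - indicator_second


# First innings: batting side tracking 20 runs above expectation.
pi_first = performance_indicator(260, 240)
print(updated_mov(10, pi_first))
# Second innings: chasing side tracking 15 runs above its own expectation.
print(updated_mov(10, pi_first, performance_indicator(245, 230)))
```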

From Figure 4 it can be seen that during the course of the first innings, the AAE for the difference between the predicted and actual MOV reduces by about 10 runs. In the second innings the reduction in AAE is much greater as the game draws nearer to its conclusion.

As shown by Bailey (2005), by dividing the predicted MOV by its standard error and comparing with a standard Normal distribution, the approximate probability that either side will win the match can be readily calculated.
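This calculation can be sketched with the standard Normal cumulative distribution function, written here using `math.erf` from the Python standard library. The standard error is supplied by the caller, since its value is not stated in the text, and the example figures are purely illustrative.

```python
import math


def win_probability(predicted_mov, standard_error):
    """Approximate probability that the reference team wins.

    Divides the predicted MOV by its standard error and evaluates the
    standard Normal CDF at the resulting z-score.
    """
    if standard_error <= 0:
        raise ValueError("standard_error must be positive")
    z = predicted_mov / standard_error
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative values only: a 30 run predicted margin with a 60 run standard error.
p_reference = win_probability(30, 60)
print(p_reference, 1 - p_reference)
```

A predicted margin of zero gives each side a probability of one half, and the opposing team's probability is simply the complement.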

Example: On December 7 2005, Australia played New Zealand in a day/night match at Westpac Stadium in Wellington. After winning the toss and electing to bat, Australia proceeded to score a very respectable total of 322. The betting exchange Betfair fielded a betting market for this match, with just over $1,000,000 AUD of matched bets occurring before the start of the game. As betting on this match remained open for the duration of the game, by the completion of the Australian innings just over $4,000,000 AUD of matched bets had been placed. Figure 5 shows both the volume of bets placed and the price matched. From Figure 5 it can be seen that the opening price for Australia was about $1.38, with the price dropping to $1.30 after Australia won the toss. After Australia lost 3 early wickets, the price drifted out to $1.70, but as Australia rallied, the price continued to drop, and by the completion of the 50th over the best price available for Australia to win was $1.08.

Using prediction models for the team total and MOV, the predicted probability for Australia to win was calculated both before and during the match, and compared with the market price offered by Betfair (market probabilities included 5% for commission). Where the predicted probability can be seen to exceed the market probability, the ‘in play’ market can be thought to be inefficient. From Figure 6 it can be seen that while Australia was batting, the predicted probability for Australia to win was consistently below the market probability, with only one inefficiency occurring throughout the course of the Australian innings.
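The comparison between model and market can be sketched as below. How the 5% commission entered the market probabilities is not spelled out in the text, so the treatment here, in which commission is deducted from winnings before the break-even probability is taken, is an assumption, as are the function names and the predicted probability used in the example.

```python
def market_probability(decimal_price, commission=0.05):
    """Break-even win probability implied by a decimal price after commission.

    Assumes commission is charged on net winnings, so a price of 2.00 with
    5% commission returns 1.95 per unit staked.
    """
    if decimal_price <= 1.0:
        raise ValueError("decimal_price must exceed 1.0")
    net_price = 1.0 + (decimal_price - 1.0) * (1.0 - commission)
    return 1.0 / net_price


def is_inefficient(predicted_probability, decimal_price, commission=0.05):
    """Flag a price where the model probability exceeds the market probability."""
    return predicted_probability > market_probability(decimal_price, commission)


# The opening price of $1.38 quoted in the text, with a made-up model probability.
print(market_probability(1.38))
print(is_inefficient(0.80, 1.38))
```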

Chasing 323 runs to win the match, New Zealand started slowly. With some big hitting towards the end of the innings, the Black Caps clawed their way into contention and started the final over as favourites, requiring only six runs to win. However, two wickets falling in the final over gave victory to Australia by 2 runs. Figure 7 shows that several inefficiencies were present in the betting market, with the predicted probability of success often exceeding the market probability. By the completion of the 100th over, more than $9,000,000 AUD had been wagered on the outcome of the match.

DISCUSSION

In July 2005 the International Cricket Council (ICC) announced a new set of rules to be applicable to ODI matches. An increase in fielding restrictions and the introduction of a substitute player (super-sub) significantly increased the total achieved by the team batting first by more than 20 runs (252.7 ± 8.0 vs. 229.7 ± 1.2, p = 0.002). As these changes occurred within the holdout sample of the data used, it is unclear how these modifications would impact upon the prediction process.

Whilst the price and volume of bets traded are available through Betfair (see Figure 5), this information is not time coded by over, meaning that if the efficiency of the market is to be accurately determined, information must be gathered manually at the completion of each over. This would undoubtedly prove time consuming should a definitive appraisal of the market inefficiency be required.

In Australia, federal laws prevent Australian citizens from placing bets over the internet after a sporting event has commenced. Paradoxically, Australian citizens can place bets ‘in the run’ provided the bets are placed over the phone. This inconvenience causes a greater delay between observing an inefficient price and actually placing a bet.

Conclusions

Multiple linear regression provides a useful way to assign winning probabilities to the competing teams in ODI matches. With the use of the D-L approach, this process can be readily modified to produce ‘in the run’ predictions. Whilst a definitive analysis of the efficiency of the betting market is yet to be conducted, preliminary evidence suggests punters may be prone to over- or underestimate the true probability of the competing teams as the game progresses.

AUTHOR BIOGRAPHY


Michael Bailey
Employment: Statistician, Department of Epidemiology & Preventive Medicine, Monash University, Australia.
Degree: PhD, MSc (Statistics), BSc (Hons)
Research interests: Health, sport, gambling.
E-mail: Michael.Bailey@med.monash.edu.au

Stephen R. Clarke
Employment: Professor, Swinburne University of Technology, Australia.
Degree: PhD, MA, BSc (Hons), Dip Ed.
Research interests: Modelling in sport, gambling.
E-mail: sclarke@swin.edu.au

REFERENCES

Bailey M. (2005) Predicting sporting outcomes: A statistical approach, 212-.

de Silva B., Pond G., Swartz T. (2001) Estimation of the magnitude of victory in one-day cricket. Australian & New Zealand Journal of Statistics, -.

Duckworth F., Lewis T. (1999) Your comprehensive guide to the Duckworth/Lewis method for resetting targets in one-day cricket, -.
