Research article - (2006)05, 525 - 532 |
Statistical Analysis of Notational AFL Data Using Continuous Time Markov Chains |
Denny Meyer, Don Forbes, Stephen R. Clarke |
Key words: Homogeneity in time, sequential dependency, semi-Markov process, football |
CTMC Assumptions |
A continuous time Markov chain (CTMC) rests on several assumptions. The Markov property implies that transitions are independent of both the timing and the type of previous transitions. In addition, the characteristics of the transitions are assumed to have exponential distributions for each state. In animal behaviour it is commonly found that the durations of behaviours (bouts) do have an exponential distribution.
In our analysis of AFL football we refer to the states of the Markov chain as events, such as a Kick. We have the time for each transition between events as well as the distance and speed associated with each transition, so we shall endeavour to include all three of these transition characteristics in our CTMC model. These variables are unlikely to have exponential distributions, because they are constrained by field size and shape and because each event groups several behaviours (e.g. a Kick may be long, short, a ground kick, a clanger, a kick to advantage or an ineffective kick). It may also be that we do not have a first-order Markov model, in that the transition probabilities may depend on the previous sequence of events. This paper will investigate these issues in detail.
Processes which do not have exponentially distributed transition times are called semi-Markov chains. A common distribution in the animal behaviour literature is the displaced exponential distribution, which allows for a non-zero minimum value. The gamma distribution has also been used to describe the duration of animal behaviours, allowing for a mixture of exponential distributions. Log-normal distributions are also used, and even normal distributions censored at zero. All these candidate distributions can be tested for our time, distance and speed variables. Of course, a multivariate distribution allowing for correlations between these three variables should also be considered.
We will follow the process described by Haccou and Meelis in the analysis below, using a data set derived from four AFL matches during the 2004 season. |
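The difference between an exponential and a displaced exponential distribution can be illustrated with a minimal sketch. It assumes synthetic durations (not the AFL data) with a hypothetical minimum of 1.5 s, fits both distributions by maximum likelihood and compares their log-likelihoods. The function names and all parameter values are illustrative assumptions.

```python
import math
import random


def exp_loglik(xs, rate, shift=0.0):
    """Log-likelihood of an exponential with the given rate, displaced by shift."""
    if any(x < shift for x in xs):
        return float("-inf")
    return sum(math.log(rate) - rate * (x - shift) for x in xs)


def fit_exponential(xs):
    """Maximum likelihood estimate for the plain exponential: rate = 1 / mean."""
    return len(xs) / sum(xs)


def fit_displaced_exponential(xs):
    """Maximum likelihood estimates: shift = minimum, rate = 1 / (mean - minimum)."""
    shift = min(xs)
    rate = 1.0 / (sum(xs) / len(xs) - shift)
    return rate, shift


# Hypothetical durations (sec): a minimum of 1.5 s plus an exponential tail.
rng = random.Random(0)
durations = [1.5 + rng.expovariate(0.8) for _ in range(500)]

rate0 = fit_exponential(durations)
rate1, shift1 = fit_displaced_exponential(durations)
ll_plain = exp_loglik(durations, rate0)
ll_displaced = exp_loglik(durations, rate1, shift1)

print(f"plain exponential:     rate={rate0:.3f}, loglik={ll_plain:.1f}")
print(f"displaced exponential: rate={rate1:.3f}, shift={shift1:.3f}, loglik={ll_displaced:.1f}")
```

Because the plain exponential is the special case with zero displacement, the displaced fit can never have a lower maximised log-likelihood; a formal comparison would still need a goodness-of-fit test.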
The data |
The data was collected by Champion Data, the official provider of AFL statistics, for four matches during the 2004 season. These matches, the venues and the results are described in the accompanying table. In the following analysis we start with an exploratory analysis in which we examine the assumption of an exponential distribution for time, distance and speed for each type of event. Thereafter we test for time inhomogeneity in our data and then test the nature of any time dependencies. |
Exploratory data analysis |
A transition matrix was derived using the above event codes, and the average times (sec), distances (m) and speeds (m·sec⁻¹) were calculated for each event, as shown in the accompanying table. The goodness of fit for a set of four common survival distributions was studied using the Anderson-Darling statistic. This statistic measures the area between the fitted distribution function and the nonparametric empirical distribution function. |
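As an illustration of the Anderson-Darling statistic, the sketch below computes one standard form of A² for a sample against a fitted exponential distribution. The samples are synthetic and the function names are hypothetical; the software used for the original analysis may apply adjustments that are not reproduced here.

```python
import math
import random


def anderson_darling(xs, cdf):
    """Anderson-Darling A^2 statistic of a sample against a given CDF."""
    xs = sorted(xs)
    n = len(xs)
    eps = 1e-12  # keep CDF values strictly inside (0, 1) before taking logs
    total = 0.0
    for i, x in enumerate(xs, start=1):
        f_low = min(max(cdf(x), eps), 1.0 - eps)
        # xs[n - i] is the (n + 1 - i)-th order statistic in 1-based notation
        f_high = min(max(cdf(xs[n - i]), eps), 1.0 - eps)
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n


def fitted_exponential_cdf(xs):
    """CDF of the exponential whose rate is the maximum likelihood estimate."""
    rate = len(xs) / sum(xs)
    return lambda x: 1.0 - math.exp(-rate * x)


rng = random.Random(1)
# Hypothetical samples: one truly exponential, one gamma with shape 2.
exp_sample = [rng.expovariate(0.5) for _ in range(300)]
gamma_sample = [rng.gammavariate(2.0, 1.0) for _ in range(300)]

a2_exp = anderson_darling(exp_sample, fitted_exponential_cdf(exp_sample))
a2_gamma = anderson_darling(gamma_sample, fitted_exponential_cdf(gamma_sample))

print(f"A^2, exponential data vs exponential fit: {a2_exp:.2f}")
print(f"A^2, gamma data vs exponential fit:       {a2_gamma:.2f}")
```

A smaller A² indicates a closer fit, so the same function can rank several candidate distributions for one variable by passing each fitted CDF in turn.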
Analysis for time inhomogeneity in the case of abrupt changes |
Visual methods can be used for detecting inhomogeneity in time. Our time plots give some indication of inhomogeneity in time and between matches. If the number of change points is known, a Kruskal-Wallis test can be used to test whether the distribution of values for a specific event differs between periods. This test makes no assumption about the distribution of values for a specific event. We compared the time, distance and speed distributions for each event between the quarters in any match and found no significant differences for any event when the Bonferroni correction was applied (α = 0.05/28). This indicates that there is no time inhomogeneity in the time, distance and speed distributions.
Change points in the transition matrix can be tested using multinomial logistic regression. In animal behaviour studies it is not usual to allow a transition from a state to itself; however, we shall allow this in AFL football so that we can track the passage of the ball from player to player. On the other hand, there are some transitions that are not possible in AFL football (e.g. a Kick-In is the only event that can follow a Behind), so we will ignore all transitions with a frequency of zero in the transition matrix. For the sake of simplicity we again consider the end of each quarter as a possible change point for each of the four matches. Our multinomial logistic regression analysis shows no significant match or quarter effect, suggesting that the transition matrix, like the transition variables, is homogeneous in time. As a result we shall use our complete data set for all four matches to test for sequential dependency. |
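The quarter-by-quarter Kruskal-Wallis comparison can be sketched as follows. This is a minimal illustration on made-up kick times, not the authors' code: it computes the H statistic with average ranks and a tie correction, then uses the closed-form chi-squared survival function for 3 degrees of freedom (four quarters) and the Bonferroni threshold of 0.05/28 quoted above.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic using average ranks and a tie correction."""
    pooled = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return h / (1.0 - tie_term / (n**3 - n))


def chi2_sf_df3(x):
    """Upper tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Hypothetical kick times (sec) for the four quarters of one match.
quarters = [
    [2.1, 3.4, 1.8, 2.9, 3.1, 2.5],
    [2.4, 3.0, 2.2, 3.6, 2.7, 1.9],
    [2.0, 2.8, 3.3, 2.6, 3.5, 2.3],
    [3.2, 1.7, 2.95, 2.45, 3.05, 2.15],
]
alpha = 0.05 / 28  # Bonferroni-corrected level from the text

h = kruskal_wallis_h(quarters)
p = chi2_sf_df3(h)
print(f"H = {h:.3f}, p = {p:.4f}, significant at corrected level: {p < alpha}")
```

The chi-squared approximation to the distribution of H is a large-sample result, so with very few observations per quarter an exact or permutation-based p-value would be more appropriate.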
Tests of sequential dependency |
In a continuous time Markov chain (CTMC) a first-order dependency in the sequence of states is assumed. This means that the probability of a transition from state A to state B in a given time is independent of the sequence of preceding states. This implies that the transition durations are independent for a given sequence of states. Dependencies may be short-term, long-term, or periodic in nature. They may relate to the sequence of states or dependencies between transition values and preceding and/or following states, or they may relate to correlations with transition values in subsequent transitions. In the case of animal behaviour, transitions from state A to itself cannot occur, but as mentioned above this is not true in the case of AFL football. Instead there are several other transitions that are impossible, as exhibited in the transition matrix.
Deviation from first-order dependency in a sequence of states is commonly tested with a chi-squared test. This test has reasonable power; however, it does not necessarily detect dependencies of higher than second order. Multinomial logistic regression was therefore used to model the occurrence of event Y based on the two previous events (X and A). It was found that only the most recent event had a significant influence [χ²(36) = 762.0, p < 0.001], while the effect of the event before it was not significant [χ²(42) = 44.6, p = 0.384].
The next form of dependency occurs when the transition value distributions depend on the preceding state. This can be tested using a Kruskal-Wallis test, making no assumptions regarding the nature of the value distributions. Not surprisingly, there was a strong relationship between the type of previous event and the values for time [χ²(6) = 20.7, p = 0.002], distance [χ²(6) = 186.7, p < 0.001] and speed [χ²(6) = 210.6, p < 0.001]. Relations between subsequent transitions for the same and for different states produce a further form of dependency found in (semi-)Markov models, which can be measured using autocorrelation.
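To make the idea of sequential dependency concrete, the sketch below estimates first-order transition probabilities from a coded event sequence and computes a Pearson chi-squared statistic for whether the next event depends on the current one. It is a simplified illustration, not the multinomial logistic regression used in the paper: the sequence is made up, the function names are hypothetical, and structurally impossible transitions are not excluded from the table.

```python
from collections import Counter


def transition_counts(seq):
    """Count transitions between consecutive events in a sequence."""
    return Counter(zip(seq, seq[1:]))


def transition_probs(counts):
    """Row-normalise the counts into estimated transition probabilities."""
    row_totals = Counter()
    for (current, _), c in counts.items():
        row_totals[current] += c
    return {(a, b): c / row_totals[a] for (a, b), c in counts.items()}


def chi_squared_independence(counts):
    """Pearson chi-squared statistic and degrees of freedom for the count table."""
    rows, cols = Counter(), Counter()
    for (a, b), c in counts.items():
        rows[a] += c
        cols[b] += c
    n = sum(counts.values())
    stat = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            stat += (counts.get((a, b), 0) - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(cols) - 1)


# Hypothetical event sequence using the codes HB (handball) and KK (kick).
sequence = ["KK", "HB", "KK", "KK", "HB", "HB", "KK", "HB", "KK", "KK", "HB", "KK"]
counts = transition_counts(sequence)
probs = transition_probs(counts)
stat, df = chi_squared_independence(counts)

for (a, b), p_ab in sorted(probs.items()):
    print(f"P({b} | {a}) = {p_ab:.2f}")
print(f"chi-squared = {stat:.2f} on {df} degree(s) of freedom")
```

Extending the key from the current event to the pair of the two preceding events would give the counts needed to examine second-order dependency in the same way.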
Autocorrelations were initially calculated for all types of events simultaneously. For the time variable there was a very weak but significant positive autocorrelation of 0.05 for every second transition (lag 2), suggesting that shorter events, such as handballs, alternate with other types of event such as kicks. This theory is supported by the transition matrix. When autocorrelations are considered for each type of event separately, only in the case of Kicks do we obtain any significant autocorrelations. The time taken for consecutive kicks has a weak but significant negative correlation of −0.15, suggesting that short duration kicks alternate with longer duration kicks. However, the speed for consecutive kicks has a weak but significant positive correlation of 0.10. Although weak, these correlations probably need to be incorporated in the modelling process. |
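The autocorrelations discussed above can be computed with a few lines of code. The sketch below uses the usual sample autocorrelation at a given lag and the approximate large-sample bound of ±1.96/√n for a series with no dependency; the alternating durations are a made-up extreme case chosen to show the sign pattern, not AFL data.

```python
import math


def autocorrelation(xs, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(xs)
    mean = sum(xs) / n
    variance_sum = sum((x - mean) ** 2 for x in xs)
    covariance_sum = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return covariance_sum / variance_sum


def significance_bound(n):
    """Approximate 95% bound for the autocorrelation of an independent series."""
    return 1.96 / math.sqrt(n)


# Hypothetical durations (sec) in which short and long transitions alternate.
durations = [1.0, 3.0] * 50

r1 = autocorrelation(durations, 1)
r2 = autocorrelation(durations, 2)
bound = significance_bound(len(durations))
print(f"lag 1: {r1:.2f}, lag 2: {r2:.2f}, bound: ±{bound:.3f}")
```

Strict alternation gives a strongly negative lag-1 value and a strongly positive lag-2 value, which is the pattern that the much weaker correlations reported above point towards.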
Discussion and Conclusion |
Our analysis of four 2004 AFL football matches has shown that inhomogeneity is unlikely to be a problem within an AFL football match. There were similar processes for all four matches, perhaps on account of the similar scores for the four matches. However, for our definition of events there were marked differences in time, distance and speed, requiring a separate analysis for each type of event. The distributions for the time, distance and speed variables varied for the different types of event; however, the 3-parameter LogLogistic and the 3-parameter LogNormal distributions tended to give the best fit. There were strong correlations between these variables for most of the events. Finally, it was confirmed that a first-order sequential dependency existed for the events, and that for successive kicks there was a weak correlation for the speed and time variables.
These results suggest that a semi-Markov model is appropriate since the distributions are not usually exponential or mixtures of exponential or gamma distributions, but there is first-order sequential dependency. This model could be used for simulation purposes. An initial centre bounce (CEBO) would result in a ball-up bounce (BUBO), a handball (HB) or a kick (KK) with respective probabilities of 13%, 48% and 39%. The associated time, distance and speed could be generated using the appropriate CEBO three-parameter log-normal distributions, allowing appropriate correlations between the times, distances and speeds. Similarly, results for all ensuing game events could be simulated. Through changes to the transition matrix and/or other model parameters, the resulting model could be used to predict the effect of rule changes and changes in play strategy. However, although the total number of goals and behinds would be known, the final score and the winner would not be known.
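The simulation described above can be sketched in a few lines. Only the CEBO transition probabilities (13%, 48%, 39%) come from the text; every other transition probability and all three-parameter log-normal parameters below are hypothetical placeholders, and the sketch generates times only, ignoring distance, speed and the correlations between them.

```python
import random

# Transition probabilities: the CEBO row is from the text, the rest are hypothetical.
TRANSITIONS = {
    "CEBO": [("BUBO", 0.13), ("HB", 0.48), ("KK", 0.39)],
    "BUBO": [("HB", 0.50), ("KK", 0.50)],
    "HB": [("HB", 0.40), ("KK", 0.60)],
    "KK": [("HB", 0.50), ("KK", 0.50)],
}

# Hypothetical three-parameter log-normal time parameters: (threshold, mu, sigma).
TIME_PARAMS = {
    "CEBO": (1.0, 0.5, 0.4),
    "BUBO": (1.0, 0.6, 0.4),
    "HB": (0.5, 0.2, 0.5),
    "KK": (1.0, 0.9, 0.5),
}


def next_event(state, rng):
    """Draw the next event from the transition row of the current event."""
    events, weights = zip(*TRANSITIONS[state])
    return rng.choices(events, weights=weights, k=1)[0]


def sample_time(state, rng):
    """Draw a transition time (sec): threshold plus a log-normal variate."""
    threshold, mu, sigma = TIME_PARAMS[state]
    return threshold + rng.lognormvariate(mu, sigma)


def simulate(n_transitions, seed=0):
    """Simulate a semi-Markov path of (event, next event, time) from a centre bounce."""
    rng = random.Random(seed)
    state = "CEBO"
    path = []
    for _ in range(n_transitions):
        duration = sample_time(state, rng)
        following = next_event(state, rng)
        path.append((state, following, duration))
        state = following
    return path


for current, following, duration in simulate(5):
    print(f"{current} -> {following} after {duration:.2f} s")
```

Changing a row of `TRANSITIONS` or an entry of `TIME_PARAMS` and re-running the simulation is the kind of experiment that could be used to explore a rule change or a change in play strategy.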
In order to develop a more useful model all that is needed is a split of the events to identify the teams involved in each transition. The current work suggests that a semi-Markov model would be appropriate for this extended model, allowing a simulation similar to that described above, from which scores and the winning team could be determined for each simulated game. In the above analysis we have associated distances, speeds and times with each transition in time. The addition of directions for each transition would make it possible for a spatial simulation to be performed. In this case it would make sense to define the events according to spatial zone (within the field) as well as activity. An alternative approach would have been to use the quarters of the field as the events, again using time, distance, speed and direction to describe each transition. This approach would also not allow the simulation of match outcomes but it would help coaches and players to better understand the spatial patterns of play. A further extension to this work could allow continuous changes in the model parameters over time with the possible inclusion of covariates in the models for the transition probabilities. |
ACKNOWLEDGEMENTS |
The authors wish to thank Champion Data for access to the data on which this paper is based and they wish to thank a referee for helpful comments. |