REVISTA DE ENGENHARIA CIVIL IMED
Limitations of classification tree models in investigating road accident severity

The objective of this study is to discuss the main limitations identified when classifying traffic accident severity with Classification and Regression Trees (CART) models. For this purpose, CART was applied to an unbalanced database of road accidents, with injury severity as the dependent variable, categorized as accidents without victims and accidents with victims (fatal and non-fatal). Variables associated with accident characteristics, road infrastructure and environmental conditions were used to identify the influence of these factors on accident severity. Although CART achieved a high overall classification accuracy, its accuracy was low for accidents with victims, which are the rarest observations in the database. In addition, a large number of decision rules was obtained, given the number of categories of the independent variables used to predict the target variable. The results indicate that CART is not efficient for studying multiple-effect phenomena such as road accidents, since it cannot associate a large number of parameters and restricts the analysis and interpretation of the results to the binary structure of the tree. Its use is therefore suggested for exploratory analysis of the database, when the influence of a single variable on the occurrence of traffic accidents is analyzed.


Introduction
Research on road accident prevention aims to investigate the main factors associated with road accidents. This research is fundamental to developing proactive safety management in the road environment. Although the number of studies covering this theme has increased in recent years, there are still aspects to be investigated, especially in developing countries, where the majority of road accidents resulting in fatalities occur.
Traditional statistical tests are criticized for their inability to analyze the variables in relation to the observed phenomenon, in this case traffic accidents (HAUER, 1997). Regression models require relations between the dependent and independent variables to be established beforehand, which can lead to misleading estimates of the probability of injury severity (ABELLÁN et al., 2013; WANG, 2006; GRISELDA et al., 2012). Although those techniques are still in use, they are inadequate for treating a large number of variables and are poor at detecting outliers, noise and missing data, characteristics inherent to accident databases (CALIENDO et al., 2007; KARLAFTIS; VLAHOGIANNI, 2011).
Due to this problem, non-parametric data mining techniques have been explored, in which there is no need to establish prior relationships between the target (dependent) variable and the predictor (independent) variables in the classification and forecasting process (ABELLÁN et al., 2013; KASHANI et al., 2011; PAKGOHAR et al., 2011). These analytical techniques allow automatic detection of the best predictor variables and their respective thresholds by extracting "if-then" decision rules that can be used to discover patterns that occur within a specific data set (ABELLÁN et al., 2013; DE OÑA et al., 2013, 2014; KASHANI et al., 2011; LÓPEZ et al., 2013). Among the data mining techniques that have provided efficient results in relation to the classic regression models, the decision tree (CART, Classification and Regression Trees) is worth mentioning.
Some studies on road safety use CART for assessing the severity caused by different motor vehicles (WANG, 2006; SAVOLAINEN et al., 2011; CHIEN, 2013), for identifying factors associated with traffic severity and accident patterns (ABELLÁN et al., 2013; DE OÑA et al., 2013, 2014; GRISELDA et al., 2012; MONTELLA et al., 2011, 2012; KASHANI et al., 2011), and for analyzing the effects of road geometry on road accidents (RUSSO et al., 2016). However, most studies found in the literature do not discuss the challenges encountered in mining data from traffic accident databases.
In addition, road accident databases are generally not balanced, that is, they are composed of classes with just a few elements and classes with a large number of elements, since they comprise events of a random nature whose parameters are not constant over time and space. In general, this leads to overly optimistic results that do not reflect reality. In most cases, classifications with high accuracy are obtained for the majority classes and low accuracy for the minority classes. The most frequent class in road accident databases is accidents without victims, while the rarest is accidents with victims (fatal and non-fatal).
However, the essential variables for reducing injury severity are linked to the minority classes of the database, associated with the number of accidents with fatalities. In addition, the classifications obtained with CART are restricted to binary trees, which in most cases makes it difficult to interpret and represent certain classes of the database (ABELLÁN et al., 2013).
Thus, this paper aims to discuss the limitations of using CART for road accident severity prediction.
To this end, this paper uses four years of data from an unbalanced road accident database for a segment of Dom Pedro I Highway (SP-065), aiming to identify the limitations of the decision rules obtained with CART in the study of the relationship between the severity of the driver's injury and the variables associated with the road environment (weather, visibility and road surface condition) and accident characteristics (type of accident, probable cause and time of day).

Road accident database
The data set used in this work was provided by the "Rota das Bandeiras" concessionaire and includes the database of individual accidents recorded between km 125 and km 145 + 500 m of Dom Pedro I Highway (SP-065), in the urban area of the city of Campinas, Brazil. Four years of data were used (2009 to 2012).
The SP-065 highway is 145.5 km long and ranks third among the best highways in the national ranking. The city of Campinas ranks eighth in the national ranking of traffic deaths, with a rate of 19.4 deaths per 100 thousand inhabitants, a Human Development Index (HDI) of 0.805 in 2010 and a population of 1,098,630 inhabitants (ONSV, 2014).
To identify the main factors contributing to road accident severity on the SP-065 highway, eight variables were used, as described in Table 1. A decision tree was built whose root node contains the target variable, accident severity, which was categorized into accidents with no injured victims (NI) and accidents with victims (WI), fatal and non-fatal.
Table 1 legend: Accident Type (ACT); Weather Condition (WTC); Sight Condition (SGC); Road Profile (PFR); Road Geometry (GER); Pavement Condition (PAV); Period (PER) and Accident Cause (ACC).
The variables selected for the analysis were: type of accident (rear-end collision, frontal collision, transversal collision, lateral collision, pile-up, rollover, run over, overturning, crash with fixed or mobile object, and fall of motorbike or bicycle); weather conditions (good, rain, cloudy, haze and drizzle); road profile (level, ascending and descending slope); road geometry (straight, smooth curve and sharp curve); pavement condition (dry, wet and oily); visibility condition (good, partial and poor); period of day (morning, afternoon and night); and probable cause (driver, vehicle, road/environment, others). Those variables are present in several traffic accident analyses (SAVOLAINEN et al., 2011; DE OÑA et al., 2013, 2014; GRISELDA et al., 2012; MONTELLA et al., 2011, 2012).
Initially, the quality of the data records was analyzed. The observations with inconsistent, questionable or missing information were excluded from the analysis (86 observations in total). Thus, the most prevalent conditions of the occurrences were maintained, yielding 2,824 accidents, corresponding to 97.04% of the original data set. Subsequently, the data were divided into two subsets, a test sample (10% of the data) and a training sample (90% of the data), so that the CART algorithm was trained and validated through cross-validation (KOHAVI, 1995; LÜ; ZHOU, 2010).
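The data preparation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_train_test`, the fixed seed and the use of integer ids as stand-in records are hypothetical, and the paper does not say whether the split was stratified by severity class.

```python
import random

def split_train_test(records, test_fraction=0.10, seed=0):
    """Shuffle the records and hold out `test_fraction` of them as the test sample."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Counts reported in the text: 2,824 accidents kept, 86 excluded.
kept, excluded = 2824, 86
retained_share = kept / (kept + excluded)
print(f"retained: {retained_share:.2%}")

# Stand-in records: integer ids instead of the real accident rows.
train, test = split_train_test(range(kept))
print(len(train), len(test))  # 2542 282
```

A stratified split, which preserves the class proportions in both samples, is a common alternative when the target classes are unbalanced.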
Table 1 shows that accidents with victims represent a small portion of the database (23.87%) in relation to accidents without victims (76.13%), meaning that this database is unbalanced.
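To make the effect of this imbalance concrete, the short sketch below recomputes the class shares from the counts reported in this paper (2,150 NI and 674 WI accidents) and the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative only.

```python
# Class counts reported in the paper.
counts = {"NI": 2150, "WI": 674}
total = sum(counts.values())

# Share of each class in the database.
shares = {label: n / total for label, n in counts.items()}

# A classifier that always answers with the majority class is right
# exactly as often as that class occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print(f"always predicting {majority_label}: {baseline_accuracy:.2%} accuracy")
```

Under these counts, always predicting NI already reaches about 76% accuracy, so an overall accuracy figure should be read against that baseline.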

CART construction principles
The structure of the CART tree is constructed recursively, with each node representing a variable and the branches representing its respective attributes, according to a threshold or decision rule. Each terminal node, or leaf, specifies the expected value of the target variable. To build the tree, a metric is used to select, among the possible input variables, the split that maximizes the purity of each node. In this work, the Gini metric, which seeks to maximize the homogeneity of the nodes in relation to the dependent variable, was used (PANDE; ABDEL-ATY, 2006). This metric reaches its minimum value (zero) when all cases in a node fall into a single category (LÓPEZ et al., 2016; WEI et al., 2017). The routine was implemented in the SPSS software (Statistical Package for the Social Sciences) (MONTELLA et al., 2011; SINGH et al., 2016).
Thus, based on a binary structure or binary decision, the variables are grouped according to their importance in describing the target variable. In this case, a single rule is fired when an observation is classified. In practice, a decision rule is an implication such as: if "A" then "B", where A represents a set of conditions. Each condition is defined by a relation of the type attribute equal to value, attribute greater than or equal to value, or attribute less than or equal to value, where the value belongs to the domain of the attribute under analysis.
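The Gini metric can be illustrated with a minimal sketch. The code below is not the SPSS routine used in this study; it is a plain implementation of the standard Gini impurity, 1 minus the sum of squared class proportions, with hypothetical function names. The example split at the end is invented for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a binary split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

pure_node = ["NI"] * 10
mixed_node = ["NI"] * 5 + ["WI"] * 5
root_node = ["NI"] * 2150 + ["WI"] * 674  # class counts reported in the paper

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
print(round(gini_impurity(root_node), 4))

# Invented split: a candidate split is preferred when it lowers the weighted impurity.
print(weighted_gini(["NI"] * 8 + ["WI"] * 1, ["NI"] * 2 + ["WI"] * 9))
```

A pure node gives zero, and a two-class node split evenly gives 0.5, the maximum for two classes.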
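An "if A then B" rule of this kind can be written down as a list of conditions plus a conclusion. The sketch below is illustrative only: the rule shown is hypothetical and is not one of the rules induced in this study, and `hour` is an assumed numeric attribute that does not exist in the database (all eight study variables are categorical).

```python
import operator

# The three relation types named in the text.
OPERATORS = {"==": operator.eq, ">=": operator.ge, "<=": operator.le}

def rule_fires(conditions, record):
    """Return True when every (attribute, relation, value) condition holds."""
    return all(OPERATORS[rel](record[attr], value) for attr, rel, value in conditions)

# Hypothetical rule: IF accident type is "run over" AND hour >= 18 THEN WI.
rule = {
    "if": [("ACT", "==", "run over"), ("hour", ">=", 18)],
    "then": "WI",
}

accident = {"ACT": "run over", "hour": 21}
predicted = rule["then"] if rule_fires(rule["if"], accident) else None
print(predicted)  # WI
```

Each path from the root of the tree to a terminal node corresponds to one such conjunction of conditions.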
In this study, the grouped variables relate to the dependent variable, road accident severity, classified into accidents without victims (No Injury, NI) and with victims (With Injury, WI). The CART structure makes it possible to detect the nodes that contain the largest possible number of occurrences related to each category of the independent variables. Through cross-validation between the test and training samples, it is possible to identify the structure that best fits the data set, pruning the tree when necessary and excluding the nodes or branches that add little to the classification of the dependent variable, that is, those with little predictive value.
Table 2 shows that the accuracy obtained in the classification process using CART for the database used in this study was 78.6%, which is in agreement with the literature (KASHANI et al., 2011; DE OÑA et al., 2013; PAKGOHAR et al., 2011; CHIEN, 2013). However, CART presented a low accuracy rate in the classification of accidents involving fatal and non-fatal victims (27%) and a high accuracy rate for accidents without victims (94.8%).

Results and discussion
The results presented in Table 2 show that, of the total of 2,150 NI accidents, 2,039 were classified correctly (94.8%) and 111 were erroneously classified as WI accidents (5.2%). Of the total of 674 WI accidents, 73% were classified as NI accidents and only 27% were properly classified as accidents involving fatal and non-fatal victims.
In this study, the most important independent variable for characterizing road accident severity is ACT, with 100% importance (Table 3), followed by PER (70.4%), ACC (58.6%) and GER (40.4%). The other variables had less importance; the variable PAV (2.9%) had the least influence in the prediction process and was therefore eliminated in the CART adjustment process.
The results obtained with CART are shown in Figure 1 and the decision rules in Table 3. The number of induced decision rules equals the number of terminal nodes of the binary tree. The larger the CART structure, the greater the number of both terminal nodes and decision rules. This characteristic of CART limits its application to large data sets, which have a large number of variables and multiple possible combinations.
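The figures above can be recomputed from the counts given in the text. In the sketch below, the number of correctly classified WI accidents is not stated in the paper and is derived from the reported 27% rate, which is an assumption; the balanced accuracy (the mean of the two per-class rates) is likewise not reported in the paper and is included only as one way of summarizing the gap between classes.

```python
# Counts stated in the text.
ni_total, ni_correct = 2150, 2039
wi_total = 674

# Assumption: the paper reports only the 27% rate, so the count is derived from it.
wi_correct = round(0.27 * wi_total)

ni_rate = ni_correct / ni_total
wi_rate = wi_correct / wi_total
overall_accuracy = (ni_correct + wi_correct) / (ni_total + wi_total)
balanced_accuracy = (ni_rate + wi_rate) / 2  # not reported in the paper

print(f"NI correctly classified: {ni_rate:.1%}")
print(f"WI correctly classified: {wi_rate:.1%}")
print(f"overall accuracy:        {overall_accuracy:.1%}")
print(f"balanced accuracy:       {balanced_accuracy:.1%}")
```

The overall accuracy is dominated by the NI class because it contributes about three quarters of the observations.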
Node 33 indicates that collisions (frontal, transversal, lateral), pile-ups, rollovers, overturnings, run-overs, crashes with fixed objects and falls were the accident types that resulted in WI accidents with a probability of 61.10%, on straight or smooth-curve road segments, under good visibility. The probable cause of these accidents was associated with driver and vehicle factors.
Node 34 considers the same types of WI accidents as node 33, but only those that occurred in the morning, on straight or smooth-curve road segments, under partial or poor visibility. These accidents had a probability of 44.70% and the probable cause was, again, associated with the driver and the vehicle.
Node 36 contains the same types of WI accidents as nodes 33 and 34, but those attributed exclusively to the probable causes road/environment and others (presence of cyclist, pedestrian or animal on the road, congestion, previous accident and suicide), in the morning and on straight or smooth-curve road segments, with a probability of 51.60%.
Node 38 shows the WI accidents of the rollover, run over, pile-up, crash with fixed object and fall types, which occurred in the afternoon, under good, rain or drizzle weather conditions, with a probability of 59.70%.
Node 39 presents the same types of WI accidents as node 38, but those that occurred at night and on a straight segment, with a probability of 30.60%. Node 14 concentrates the WI accidents caused by frontal collisions and run-overs in the afternoon and at night, with a probability of 86.80%.
The main factors contributing to the occurrence of WI accidents (nodes 14, 33, 34, 36, 38, and 39) are associated with road characteristics such as road profile, environmental conditions (weather and visibility condition), time of day and the characteristics of accidents (type of accident and probable cause). Note that the accidents most likely to generate victims were those due to frontal collision and run over (node 14). WI accidents on road segments with a sharp curve are less recurrent, with a probability of 21.40% (node 12). These accidents were of the types collision (frontal, transversal, lateral), overturning, rollover, run over, crash with fixed object and fall (bicycle and motorcycle), and were more frequent in the morning. These results are in accordance with previous studies (GRISELDA et al., 2012).
The pavement condition variable was pruned from the final tree structure, as it had little relevance to the model (Table 3). However, it is directly associated with environmental conditions in the segment, with the exception of accidents that occurred on oily pavement.
NI accidents were identified with a higher likelihood of occurrence and frequency of data at nodes 12, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38 and 39. This indicates that NI accidents recur in all segments of the highway and can result from all types of accidents. They are also associated with environmental variables (weather and visibility condition), road characteristics (road geometry and road profile) and probable cause. The most recurrent NI accidents are observed at nodes 28 and 29, with probabilities of occurrence of 79.30% and 82.80%, respectively. These accidents were of the collision type (rear-end, transversal and lateral) and crash with fixed object, with probable cause associated with the driver and with other causes (congestion and previous accident), at different periods of the day (morning, afternoon and night) and under good or poor visibility. The finding for the driver variable is in accordance with the previous study of Abdel-Aty and Radwan (2000).

Conclusions
The results obtained with the CART algorithm express the influence of the variables related to road infrastructure and environmental conditions on road accident severity classification. In the classification process, although an accuracy compatible with the values found in the literature (78.6%) was obtained, the accuracy rate for WI accidents (27%) was much lower than the accuracy rate for NI accidents (94.8%). This is due to the smaller number of WI observations in the road accident database, that is, to the database being unbalanced. For this reason, it is recommended that, in road accident severity classification studies, databases be balanced so that the classifier is properly trained, whichever classification approach is employed.
Since the classification process with CART is based on a set of attributes, where each internal node corresponds to a test on the values of the attributes of a given variable, it is expected that a balanced database will produce more consistent results with good accuracy.
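The paper does not prescribe a particular balancing method. As one possible illustration, the sketch below applies random undersampling of the majority class, so that every class ends up with as many records as the rarest one. The function name `undersample`, the seed and the tuple-based stand-in records are hypothetical; oversampling the minority class or weighting the classes are alternatives.

```python
import random

def undersample(records, label_of, seed=0):
    """Randomly keep, for every class, as many records as the rarest class has."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Stand-in records built from the class counts reported in the paper.
records = [("NI", i) for i in range(2150)] + [("WI", i) for i in range(674)]
balanced = undersample(records, label_of=lambda record: record[0])

n_ni = sum(1 for label, _ in balanced if label == "NI")
n_wi = sum(1 for label, _ in balanced if label == "WI")
print(len(balanced), n_ni, n_wi)  # 1348 674 674
```

Undersampling discards majority-class observations, so it trades information about NI accidents for equal class sizes; balancing would normally be applied to the training sample only.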
In the context of road safety, CART can be efficient when analyzing the impact of a specific category of a particular independent variable on road accident severity, such as the driver's profile (age, gender), type of vehicle (motorcycles, trucks, cars) or level of intoxication (high, low, medium), among others. However, CART is not efficient for the analysis of the multi-causal effects associated with the study of road accidents. For these cases, building one CART for each independent variable is suggested, with a smaller and more meaningful number of decision rules.
In the spatial scope of the problem, it would also be interesting to add neighborhood relations based on measures of similarity, since the location of accidents can affect the probability of road accident severity.
For future work, more refined data mining techniques based on network structures are recommended, in order not only to extract exploratory and visual information from the database, but also to identify more efficiently the main factors that, when combined, contribute to the occurrence of road accidents.