Where is the perfect crystal ball?

With advances in the management and processing of information (the hackneyed concept of Big Data), new voices have emerged claiming that traffic can be forecast with close to 100% reliability, thanks to new ways of gathering information that have recently appeared.

But is this possible?

Some data processing companies say it is, but the evidence so far seems to contradict them. The reason is that engineers and Big Data companies work on completely different time scales.

What’s up, Doc?

A clarification: nothing new has been invented. It has only been varnished (and, why not say it, improved), but Big Data and data analysis have been in use for a long time under more mundane names.

Globalvia has been doing this for a long time. The clearest example is Metro de Sevilla: when we budget its demand, we take into account the dates of Easter Week, the Feria de Abril and even the days when there are football matches (we go as far as estimating how relevant Sevilla FC's participation in international competitions will be).

As said before, the new data analysis tools do not replace the old calculation methods; they let us cover more information, which makes the characterization of our traffic quicker and more precise.

This improved characterization is based on finding new correlations between traffic and new variables that may affect it, opening up a range of possibilities that was unthinkable until very recently.

Until now, correlating the evolution of traffic with an explanatory variable was restricted to a handful of well-known variables, perhaps out of habit, but mainly because there was not enough information to link traffic with other data.

The fields of data science
Source: Dahl Winters.

But correlation does not mean causation.

An example: a correlation has been detected between the increase in sales of ice axes in a store in Chamonix and the increase in the number of deaths among people climbing Mont Blanc. This could lead one to think that whoever buys an ice axe in this store is going to die on Mont Blanc. However, even though the two series correlate, one does not (a priori) cause the other. It is more likely that good weather has attracted more mountaineers to the area, so both the store's sales and the chances of an accident have increased.
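The ice axe example can be reproduced with a small simulation. This is only an illustrative sketch with made-up numbers: it assumes that the weekly number of climbers is the hidden common cause, and that sales and accidents each depend on it but not on each other. The names `pearson` and `partial_corr` are helpers defined here, not part of any library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical data: climbers per week, driven by the weather (the common cause).
climbers = [rng.uniform(50, 500) for _ in range(200)]
# Both series depend only on the number of climbers plus independent noise.
axe_sales = [0.10 * c + rng.gauss(0, 5) for c in climbers]
accidents = [0.02 * c + rng.gauss(0, 1) for c in climbers]

raw = pearson(axe_sales, accidents)
controlled = partial_corr(axe_sales, accidents, climbers)
print(f"raw correlation: {raw:.2f}")
print(f"correlation once climbers are accounted for: {controlled:.2f}")
```

The raw correlation between sales and accidents comes out high, while the correlation that remains after accounting for the number of climbers is close to zero, which is what we would expect when a third variable drives both.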

The job of the data analyst (the Traffic Engineer) is to detect inconsistencies in each correlation to avoid misusing information, and to validate which variables actually imply causation.

Models and models

The new, accessible information about users (meaning anyone who travels for a specific purpose) lets us look for new or different causes of their movement. As a result, we can improve what we know about them and, supposedly, better predict the future behavior of similar users.

Here we find the first conceptual error: calling something a “predictive model” when in reality it may not be one.

Our internal use of Big Data produces a descriptive model: we look for the reasons why people move and classify them into groups. Having analyzed their behavior, if we find a new user who fits one of those categories, we assume that this user will behave like the previous ones.

On the other hand, predictive models, although similar, evaluate the probability that a person in a different area exhibits a behavior like those we have analyzed previously in another environment. The scope for this is completely different.

Four types of analytics
Source: Intellipaat.

For example, a descriptive model helps us better understand the users of Tranvía de Parla, and we can use it to estimate potential new users of that tram. A predictive model, by contrast, would use the behavior detected in Parla to forecast the ridership of the Trams of Barcelona.
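The descriptive use of the data can be sketched as a simple grouping exercise. The code below is only an illustration under invented assumptions: the two features (trips per week and share of trips in peak hours), the group names and the yearly trip figures are all hypothetical, and nearest-centroid assignment is just one of many possible ways to place a user in a group.

```python
from statistics import mean

# Hypothetical observed users of one tram line:
# (trips per week, share of trips made in peak hours), by observed group.
observed = {
    "commuter": [(10, 0.9), (9, 0.8), (11, 0.85)],
    "occasional": [(2, 0.3), (1, 0.2), (3, 0.4)],
}
# Hypothetical average yearly trips per user in each group.
trips_per_year = {"commuter": 450, "occasional": 90}

def centroid(points):
    """Average of each feature over the users of a group."""
    return tuple(mean(axis) for axis in zip(*points))

centroids = {group: centroid(points) for group, points in observed.items()}

def classify(user):
    """Return the group whose centroid is closest to the user.

    Features are not rescaled here, so trips per week dominates the distance.
    """
    def squared_distance(center):
        return sum((u - c) ** 2 for u, c in zip(user, center))
    return min(centroids, key=lambda group: squared_distance(centroids[group]))

new_user = (8, 0.7)  # a new user of the same line
group = classify(new_user)
print(group, trips_per_year[group])
```

Applied to new users of the same line, this is the descriptive case. Reusing the same groups and figures for a different network would be the predictive case, and it rests on the much stronger assumption that behavior observed in one city carries over to another.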

The utility: a great unknown

If someone asks me why there is no system to predict future demand, my answer is that there is a mismatch between the usefulness of the prediction and the time scale over which it works.

The typical descriptive models predict the demand that will use a highway with 100% reliability … over the next 72 hours. If the time period is extended, the reliability decreases. That does not mean that the model is not useful, only that we have to find the right purpose for this information.

From the point of view of O&M, knowing the demand over the next two or three days can help improve the service provided to users, for example by increasing the number of trains on a tram line or the number of staff assisting users at the toll booths.

For Traffic Engineering, this time horizon is very short. Our predictions span more than one year, and normally the whole life of an asset, which means several decades.

In addition, we have another problem: how do you predict the variables that determine your demand? Another model has to be built for those variables, which in turn will depend on yet another model: an infinite loop that would make the methodology useless.

Is the prediction of these variables reliable? Any error or deviation in the prediction of the base variables inevitably carries over into the projection of traffic or demand, so the issue is not easy.
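A small worked example shows how such a deviation grows over the life of an asset. It is only a sketch under simplifying assumptions that are not taken from any real project: traffic growth is assumed to equal a constant elasticity times GDP growth, and the base traffic, the elasticity and the growth rates are hypothetical figures.

```python
def project_traffic(base_traffic, gdp_growth, elasticity, years):
    """Compound traffic, assuming yearly traffic growth = elasticity * GDP growth."""
    return base_traffic * (1 + elasticity * gdp_growth) ** years

base = 20_000      # hypothetical vehicles per day today
elasticity = 1.2   # hypothetical traffic-to-GDP elasticity
years = 30         # hypothetical remaining life of the asset

central = project_traffic(base, 0.020, elasticity, years)  # GDP forecast: +2.0% per year
low = project_traffic(base, 0.015, elasticity, years)      # GDP turns out +1.5% per year

gap = (central - low) / central
print(f"Forecast with GDP +2.0%/yr: {central:,.0f} vehicles/day")
print(f"Outcome with GDP +1.5%/yr:  {low:,.0f} vehicles/day")
print(f"Traffic overestimated by {gap:.0%} after {years} years")
```

With these figures, a deviation of only half a percentage point in yearly GDP growth leaves the 30-year traffic forecast roughly 16% too high, even though the traffic model itself is assumed to be perfect.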

Therefore, the choice of variables that correlate with traffic may be limited: how reliably a variable can be predicted is as important as its correlation with traffic. In many cases these variables must be rejected, and we have to go back to the “classic” ones (population, GDP, employment, etc.), whose future projections are better established, although not necessarily infallible.


The new data analysis tools allow us to improve the way we understand the users of our highways, metros and trams, and to analyze more variables that explain why they travel.

With this new knowledge we can confirm or expand the set of variables that explain mobility and determine future demand more precisely. But there are caveats: the time scale is limited to the very short term if we want complete reliability, and we must acknowledge that supporting variables may not be valid because they are not independent, predictable or reliable.

Therefore, claiming that a traffic prediction model, a new crystal ball, can be created to replace the Traffic Engineer is risky for now, since scientific and critical methods still have to be applied to validate the models that help estimate the demand for our assets.

Data scientist
Source: blogs.sas.com.

Carlos Rol Rúa – Traffic Manager of Globalvia