Using graphs to reconstruct the events of the Catalan crisis from tweets.
In 2017, over 330 million people around the world used Twitter each month to comment on and react instantly to events. Over the same year, around 180 billion tweets were sent by users! In autumn 2017, Spain experienced one of its most important social events since the Spanish Civil War: the Catalan independence referendum. Throughout the crisis, the population widely used Twitter to react to events in real time. This major event in the recent history of Spain received massive media coverage as well.
At Bleckwen, we believe data and analytics can be used to answer challenging questions in many fields. On a daily basis, we use these techniques to fight fraud and protect our clients. So we decided to apply similar techniques to a completely different field: we asked ourselves whether we could reconstruct the timeline of the Catalan crisis and correctly identify the main events using only Twitter metadata, i.e. without analyzing the content of tweets.
Our goal: use the power of analytics to answer two questions:
- Can we identify major events during the crisis from the metadata of the tweets, and then reconstruct the timeline of events?
- Are we able to detect important events that have not been covered by traditional media?
On the morning of 27th October 2017, the Spanish Senate gives full powers to the head of government, Mariano Rajoy, who can now put Catalonia under direct rule. On the same day, in the afternoon, Carles Puigdemont, President of the Generalitat of Catalonia, proclaims the independence of the region, following the results of the referendum. Spain is experiencing one of the major crises in its history. Could we reconstruct the events that marked this crisis using only Twitter metadata?
To study the Catalan crisis, we used Twitter’s Streaming API to collect all tweets sent from October 3rd to November 6th, 2017, written in Spanish, Catalan, Galician or Basque. We kept only tweets containing the words [catalogne, catalunia, catalunya, etc.]. From these, we built a data set of 824 influential tweets and their 1 million retweets.
At the same time, we manually listed the dates of the 18 major events of the Catalan crisis (see image below) covered by 8 major traditional media outlets: BBC, The Independent, The Local, Fox News, NBC, Euronews, US News, and Politico.eu.
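The language and keyword filter described above could look roughly like the sketch below. It is only an illustration: it assumes each collected tweet is available as a dictionary with hypothetical `text` and `lang` keys, it uses ISO 639-1 codes for the four languages, and it contains only the three keywords quoted in the text, since the rest of the list is not given.

```python
# Illustrative filter only, not the code used for the study.
# ASSUMPTION: each tweet is a dict with "text" and "lang" keys.
LANGUAGES = {"es", "ca", "gl", "eu"}  # Spanish, Catalan, Galician, Basque (ISO 639-1)

# Only the three keywords quoted in the article; the original list is longer.
KEYWORDS = ("catalogne", "catalunia", "catalunya")


def is_relevant(tweet):
    """Return True if the tweet is in a tracked language and mentions a keyword."""
    text = tweet.get("text", "").lower()
    in_language = tweet.get("lang") in LANGUAGES
    has_keyword = any(keyword in text for keyword in KEYWORDS)
    return in_language and has_keyword


def filter_tweets(tweets):
    """Keep only the tweets that pass the language and keyword checks."""
    return [tweet for tweet in tweets if is_relevant(tweet)]
```

Matching on lowercased text keeps the check case-insensitive; a production filter would probably also need to handle hashtags, accents and spelling variants.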
Tweets are reactions to real life events
People use Twitter in different ways: to share their moments or their ideas, or just to post a photo of their cat! During important events such as social movements, the World Cup or US elections, tweets are the voice of people reacting to what they see, feel or experience in real life.
Based on this assumption, we collected all tweets from a given population over a delimited period of time. Then we tried to group the tweets reacting to the same event into one cluster. Each resulting cluster is what we call “an event abstraction”. Example of tweet clustering during the Catalan crisis:
However, in order to group tweets together with analytical methods, we need to measure how close a tweet is to another. So we have to define a similarity metric. One could analyze the content of the tweets and assess if they talk about the same event. But as we like challenges, we tried to do this without content analysis!
Defining similarity of tweets with no content analysis
In order to compute the similarity between tweets we first need to understand two important concepts:
- Co-occurrence: tweet A and tweet B were sent close together in time
- Co-retweeting: both tweets were retweeted by the same people
We can now state that the similarity between two tweets is defined by the product of these measures as illustrated in the figure below:
In other words, the closer in time two tweets were sent and the more users retweeted both of them, the more similar they are.
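To make the product of the two measures concrete, here is a minimal sketch. The text only states that similarity is the product of co-occurrence and co-retweeting, so the two formulas below are assumptions chosen for illustration: an exponential decay over the time gap (with a hypothetical time scale `tau`) and the Jaccard overlap of the two sets of retweeters. The `Tweet` class and its field names are hypothetical as well.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Tweet:
    tweet_id: str
    sent_at: float  # posting time, in seconds
    retweeters: set = field(default_factory=set)  # ids of users who retweeted it


def co_occurrence(a, b, tau=3600.0):
    """Temporal closeness in (0, 1]: 1.0 for simultaneous tweets, lower as the gap grows.

    ASSUMPTION: exponential decay with a hypothetical time scale `tau` (seconds).
    """
    return math.exp(-abs(a.sent_at - b.sent_at) / tau)


def co_retweeting(a, b):
    """Overlap between the two sets of retweeters, in [0, 1].

    ASSUMPTION: Jaccard index (shared retweeters / all retweeters of either tweet).
    """
    union = a.retweeters | b.retweeters
    if not union:
        return 0.0
    return len(a.retweeters & b.retweeters) / len(union)


def similarity(a, b, tau=3600.0):
    """Similarity as the product of the two measures, as described in the text."""
    return co_occurrence(a, b, tau) * co_retweeting(a, b)


# Two tweets sent ten minutes apart with shared retweeters, and one sent a day later.
t1 = Tweet("t1", 0.0, {"u1", "u2", "u3"})
t2 = Tweet("t2", 600.0, {"u2", "u3", "u4"})
t3 = Tweet("t3", 86400.0, {"u2", "u3", "u4"})
print(similarity(t1, t2) > similarity(t1, t3))  # True: same overlap, but t2 is closer in time
```

Because the two measures are multiplied, a pair of tweets scores high only when both conditions hold: tweets with no common retweeter get a similarity of zero however close in time they are.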
Clustering tweets with a graph approach
Now that we have a good way to measure how similar one tweet is to another, it is time to group them together and discover the event each group is correlated with. For that, we use a data structure called a graph.
Graphs are a powerful concept, widely used in many fields such as chemistry, security, fraud prevention and, most famously, social networks.
According to Wikipedia’s definition:
“A graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense “related”. The objects correspond to mathematical abstractions called nodes and each of the related pairs of nodes is called an edge.”
In our case, the nodes of the graph are the tweets we collected, and the edges between them are weighted by the similarity of the two tweets, computed according to the definition above.
We then applied a community detection algorithm called Walktrap to the graph we built for the Catalan crisis. The assumption behind this approach is that each detected community corresponds to a cluster of homogeneous tweets linked to a specific event that happened during the crisis. That is what we have called “the abstraction of an event”.
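The sketch below illustrates the overall shape of this step: tweets become nodes, similarities become edge weights, and groups of strongly linked tweets become clusters. It is deliberately not Walktrap. To stay dependency-free it swaps in a much cruder stand-in, the connected components of the graph after weak edges are dropped, and the `min_weight` cutoff is a hypothetical parameter. In practice, a library implementation such as `Graph.community_walktrap` in python-igraph would take the place of the second function.

```python
def build_graph(similarities, min_weight=0.1):
    """Build an undirected weighted graph as a dict of {node: {neighbour: weight}}.

    `similarities` maps a pair of tweet ids (a, b) to their similarity score.
    ASSUMPTION: edges lighter than the hypothetical `min_weight` are dropped.
    """
    graph = {}
    for (a, b), weight in similarities.items():
        # Register both tweets as nodes even if the edge itself is dropped.
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if weight >= min_weight:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def connected_components(graph):
    """Group nodes joined by a chain of kept edges (a simple stand-in for Walktrap)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Made-up similarity scores between five tweets, for illustration only.
similarities = {
    ("t1", "t2"): 0.8,
    ("t2", "t3"): 0.6,
    ("t3", "t4"): 0.02,  # too weak: this edge is dropped
    ("t4", "t5"): 0.7,
}
clusters = connected_components(build_graph(similarities))
print(sorted(sorted(cluster) for cluster in clusters))
# [['t1', 't2', 't3'], ['t4', 't5']]
```

Unlike this stand-in, Walktrap uses short random walks on the weighted graph to decide which nodes belong together, so it can split a connected graph into several communities without a hard cutoff on edge weights.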
Is our model able to identify the events covered by the media?
We applied our approach to tweets sent between October 3 and November 6, 2017 and found 34 event abstractions. Are these 34 abstractions relevant? Do they match events that actually happened in the real world?
Remember that we assumed tweets inside the same cluster should be homogeneous because they were sent in reaction to the same event. However, assessing the homogeneity of an event abstraction is quite a subjective task. It means looking at the content of each tweet and judging whether the majority of the tweets that compose the abstraction relate to the same subject.
We manually reviewed the content of the tweets of each abstraction.
Here are the results:
Here is an example to visualize an “event abstraction”. Abstraction e34: 100% of the 18 tweets that compose this abstraction are linked to the event “Demonstration for the Union on October 8th, 2017”.
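A percentage like the one quoted for e34 can be read as a simple purity score: the share of tweets in an abstraction that were manually linked to its most frequent subject. The sketch below is one way to compute it once the manual review has assigned a label to each tweet. The function name and the label strings are hypothetical, and the scoring rule itself is an assumption about how such a figure could be obtained.

```python
from collections import Counter


def homogeneity(labels):
    """Return the dominant subject of an abstraction and the share of tweets linked to it.

    `labels` holds one manually assigned subject per tweet in the abstraction.
    """
    if not labels:
        raise ValueError("an abstraction must contain at least one tweet")
    subject, count = Counter(labels).most_common(1)[0]
    return subject, count / len(labels)


# Abstraction e34 as described in the text: all 18 tweets linked to the same event.
e34 = ["Demonstration for the Union"] * 18
print(homogeneity(e34))  # ('Demonstration for the Union', 1.0)
```

The score only summarizes the labels; the subjective part, deciding which event each tweet refers to, still has to be done by a human reviewer.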
Now that we have a reasonable way to assess the relevance of the events found by our model, we are able to answer our first question. Again, our model uses only tweet metadata to find events. The figure below shows that our model correctly identifies 12 of the 18 events covered by traditional media:
Of the 6 events not found by our model, 3 happened on the same day as a very large event. For example, the model does not identify two small events listed on October 3rd. However, we note that on that same day there was a massive demonstration in Barcelona “against police violence” that had taken place the day before.
Partially recovered events are events that the model does not identify directly. For example, on 3rd and 5th November, two events were covered by the media:
- the arrest warrant sought against Carles Puigdemont
- Carles Puigdemont’s submission to the Belgian authorities
We noticed that the model combined these two events into a single “three-day” event that could be described as “Puigdemont’s flight”.
Is our model able to find significant events not covered by the media?
Here again, our results are quite interesting: the model identified reactions to 11 events that are mostly less intense than those reported by the media. Among the events detected, we find for example:
– the publication of a press article;
– a media debate about the real or supposed indoctrination of children in Catalan schools;
– the agreement in principle between the PP (Partido Popular – People’s party) and the PSOE (Partido Socialista Obrero Español – Spanish Socialist Workers Party) on the organisation of new elections;
– the announcement of a demonstration in Brussels;
– the broadcast of the YouTube video HELP CATALONIA which denounces a “fascist Spanish state”.
The detection of minor events not mentioned by the media allows a more accurate understanding of what happened. In this way, our model complements traditional media coverage of events.
Our model detects the abstractions of 12 of the 18 events mentioned by the 8 media outlets in the first month of the Catalan crisis. It also detects 11 other “minor” events.
We have shown that it is possible to apply analytical methods to Twitter’s metadata to reconstruct a timeline of events occurring in the real world. Our approach is based on the analysis of tweet metadata (i.e. no content analysis) and on graph models.
The presented model allowed us to correctly recover 12 of the 18 main events of the Catalan crisis covered by the 8 media outlets. The 6 undetected events are relatively minor.
In addition, our graph-based approach identifies 11 additional events of relatively low intensity, which are therefore more difficult to detect. This adds to the understanding of the chronology of events reported by the media.
As a next step, we would like to develop a real-time version of our model. We look forward to seeing you for our next challenge!
On October 20th, 2017, an agreement in principle was signed between the PP and the PSOE concerning new elections to be held in Catalonia. This event was not listed by any of the 8 media outlets, nor even included in Wikipedia’s Catalan Crisis article. Our model detected this event!