Twitter Geo Stream
Currently, approximately 500 million tweets are sent per day worldwide via the open social short-message service Twitter [cf. InternetLiveStats2016], which corresponds to about 6,000 tweets per second. Although each tweet consists of at most 140 characters, tweets are enriched with numerous additional pieces of metadata. In particular, the timestamp and the geographic position of the user at the time of posting are valuable information and open up approaches for numerous analyses.
Within the research lab nextPlace of the University of Applied Sciences Ostwestfalen-Lippe, the authors implemented a web application for the analysis and visualization of geo-referenced tweets in order to work out the technical challenges in the analysis of Twitter data: the data to be processed are voluminous and are continuously provided as a stream in real time. Twitter data thus meet some of the requirements of common definitions of big data [cf. e.g. Samuel2015].
Accordingly, the development of the prototype focuses on the processing of real-time data in the data stream, with a correspondingly high processing speed and suitable communication patterns in a distributed system.
There are several web pages dealing with the analysis and visualization of Twitter data. For example, [OneMillionTweetMap2016] displays a map of georeferenced tweets in real time. In [NCStateUniversity2016], tweets matching a given keyword are analyzed, and the presumed sentiment of their users is classified and visualized accordingly.
Scientific discussions of technical challenges in the processing of Twitter tweets can be found, for example, in [Kulkarni2015; Feng2015].
In [Amazon2016], a prototypical implementation similar to the one described here is presented.
The server application reads data from the Twitter Stream API [Twitter2016] and performs various work steps with the tweet data. The client application presents the result on a 3D map with the web library Cesium [Cesium2016].
An essential non-functional requirement for the web application is that every tweet is available for exactly 140 seconds for processing and visualization. On the one hand, this avoids a theoretically ever-growing data volume; on the other hand, the focus is to lie only on the current section of the data. Furthermore, only georeferenced tweets in German or English are considered.
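The 140-second requirement can be sketched as a time-ordered sliding window. This is a minimal illustration with hypothetical names, not the prototype's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch: a sliding window that keeps each tweet for exactly 140 seconds,
// as demanded by the non-functional requirement above.
class TweetWindow {
    static final long WINDOW_MILLIS = 140_000;

    // Each entry stores the arrival time alongside the tweet text.
    private record Entry(long arrivalMillis, String tweet) {}

    private final Deque<Entry> window = new ArrayDeque<>();

    public void add(String tweet, long nowMillis) {
        evictExpired(nowMillis);
        window.addLast(new Entry(nowMillis, tweet));
    }

    // Drop all tweets older than 140 seconds; since tweets arrive in order,
    // the deque head is always the oldest entry, so eviction can stop at
    // the first survivor.
    public void evictExpired(long nowMillis) {
        while (!window.isEmpty()
                && nowMillis - window.peekFirst().arrivalMillis() >= WINDOW_MILLIS) {
            window.removeFirst();
        }
    }

    public List<String> current(long nowMillis) {
        evictExpired(nowMillis);
        List<String> result = new ArrayList<>();
        for (Entry e : window) result.add(e.tweet());
        return result;
    }
}
```

Because arrival order equals time order, eviction is amortized O(1) per tweet.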
The prototype offers the user four different functionalities: Tweet-Bubbles, Tweet-Connections, Tag-Cloud and Tag-Cluster. Essential for the provision of these functionalities is an adequate communication of the individual components within the distributed system.
Communication in the distributed system
Figure 2 shows the UML sequence diagram of the communication in the distributed system with the actors client browser (presentation), server application (application) and Twitter API (data).
The browser performs a handshake via HTTP with the server application (1) in order to switch to the WebSocket protocol. If this is the first client to connect to the server application, an HTTP connection is established between the server and the Twitter API (2). This connection remains active until the server actively closes it (9). When the connection is established, a separate thread is started in the server application which continuously processes the incoming data on the data stream.
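The connection lifecycle described above (open the Twitter stream for the first client, close it after the last one leaves) amounts to reference counting. A sketch with hypothetical class and method names:

```java
// Sketch: the server opens the Twitter stream when the first WebSocket
// client registers (2) and closes it again once the last client
// disconnects (9). Thread-safe via synchronization, since WebSocket
// events may arrive on different threads.
class StreamLifecycle {
    private int clientCount = 0;
    private boolean twitterConnected = false;

    public synchronized void onClientConnected() {
        clientCount++;
        if (clientCount == 1) {
            // First client: establish the connection to the Twitter API
            // and start the stream-processing thread.
            twitterConnected = true;
        }
    }

    public synchronized void onClientDisconnected() {
        clientCount--;
        if (clientCount == 0) {
            // No client left: tear down the Twitter connection.
            twitterConnected = false;
        }
    }

    public synchronized boolean isTwitterConnected() {
        return twitterConnected;
    }
}
```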
According to the strategy design pattern [Gamma1995], dedicated algorithms and data structures are implemented for the server-side processing of the individual functionalities (Tweet-Bubbles, Tag-Cloud, Tag-Cluster) (4a, 4b). After each processing step, a tweet is transmitted via the existing WebSocket connection to all connected clients (5a, 5b), processed there (6a, 6b), and displayed on the map. When the client application is terminated, the WebSocket connection to the server is closed with a dedicated message (7). If no connection to at least one client remains active, the connection to the Twitter API is terminated (9).
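The use of the strategy pattern can be outlined as follows; the interface and class names are illustrative, not taken from the prototype:

```java
import java.util.List;

// Sketch of the strategy pattern [Gamma1995] as applied server-side:
// one interchangeable processing strategy per functionality.
interface TweetProcessingStrategy {
    String process(String tweetText);
}

class TweetBubblesStrategy implements TweetProcessingStrategy {
    public String process(String tweetText) { return "bubble:" + tweetText; }
}

class TagCloudStrategy implements TweetProcessingStrategy {
    public String process(String tweetText) { return "tagcloud:" + tweetText; }
}

class TweetProcessor {
    private final List<TweetProcessingStrategy> strategies;

    TweetProcessor(List<TweetProcessingStrategy> strategies) {
        this.strategies = strategies;
    }

    // Every incoming tweet is passed through all registered strategies
    // (4a, 4b) before the results are pushed to the clients (5a, 5b).
    List<String> process(String tweetText) {
        return strategies.stream().map(s -> s.process(tweetText)).toList();
    }
}
```

New functionalities can then be added by registering another strategy, without touching the stream-processing thread.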
The processing within the client and server applications is described below for the individual functional features.
Every tweet processed in the application is transferred to all registered browser clients. For this purpose, the relevant tweet data (id, text, user name, and position as longitude and latitude) are extracted from the JSON format of the Twitter API and transformed into the Cesium-specific CZML format.
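The transformation can be sketched as follows. The CZML property names (`id`, `position`, `cartographicDegrees`, `point`) follow the public CZML format; the mapper class itself and its method name are illustrative:

```java
// Sketch: the extracted tweet fields are turned into a minimal CZML
// point packet. CZML expects positions as [longitude, latitude, height].
class CzmlMapper {
    public static String toCzmlPacket(long id, String text, String user,
                                      double longitude, double latitude) {
        return "{"
            + "\"id\":\"tweet-" + id + "\","
            + "\"name\":\"" + user + "\","
            + "\"description\":\"" + text + "\","
            + "\"position\":{\"cartographicDegrees\":["
            + longitude + "," + latitude + ",0]},"
            + "\"point\":{\"pixelSize\":10}"
            + "}";
    }
}
```

A production implementation would use a JSON library and escape the user-supplied text; plain concatenation is shown only for brevity.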
In CZML, temporal behavior can also be modeled on the Twitter data. To this end, the authors distinguish tweets without relevance (no hashtags, no references to other users) from tweets with relevance (hashtags and/or references to users are present). In the latter case, the relevance is weighted by the number of occurrences in the currently available data; this information is provided by the Tag-Cloud functionality (see below). Following the metaphor of bubbles, the tweets rise as points from their georeferenced position on the map. As duration and height increase, the points turn from green to red. The more relevant the tweet, the slower its climb speed and the larger its point diameter.
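The relevance weighting and its mapping to the visual parameters can be sketched as follows. The concrete thresholds and formulas are illustrative assumptions, since the text only states the qualitative relationship (more relevant: slower climb, larger diameter):

```java
// Sketch of the relevance weighting: hashtags (#...) and user references
// (@...) in a tweet determine its relevance score.
class Relevance {
    public static int score(String tweetText) {
        int count = 0;
        for (String token : tweetText.split("\\s+")) {
            if (token.startsWith("#") || token.startsWith("@")) count++;
        }
        return count;
    }

    // Illustrative mapping: base diameter 5 px, plus 2 px per relevance unit.
    public static int pointDiameter(int relevance) {
        return 5 + 2 * relevance;
    }

    // Illustrative mapping: climb speed decreases with growing relevance.
    public static double climbSpeed(int relevance) {
        return 1.0 / (1 + relevance);
    }
}
```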
The visualization takes place in near real time. If a tweet is posted by a user, e.g. via the Twitter web page, it appears within a very short time t (several measurements with different users: t < 2 sec) on the application's map.
For every relevant tweet on the map, the client checks whether another tweet contains the same relevance information. If a suitable one exists, exactly one colored connecting line between the two tweets is drawn in order to emphasize their substantive cohesion. Connections based on hashtags are shown as a green line, those based on users as a blue line.
Because each client maintains a stateless connection to the server, the check must be performed on the tweets available in the client's web browser. The check is implemented in JavaScript as a simple iteration over all currently visible tweets. The search has an upper bound of O(n) per tweet and is suboptimal with several thousands of visible elements.
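The linear scan can be outlined as follows; it is transcribed to Java for consistency with the other listings, although the prototype implements it in JavaScript, and the names are illustrative:

```java
import java.util.List;
import java.util.Set;

// Sketch of the client-side check: for a newly arrived relevant tweet,
// iterate once over all visible tweets and return the first one sharing
// a hashtag; a plain O(n) linear scan, as noted above.
class ConnectionFinder {
    record VisibleTweet(long id, Set<String> hashtags) {}

    // Returns the id of the first matching tweet, or -1 if none matches.
    public static long findPartner(Set<String> newHashtags,
                                   List<VisibleTweet> visible) {
        for (VisibleTweet t : visible) {
            for (String tag : newHashtags) {
                if (t.hashtags().contains(tag)) return t.id();
            }
        }
        return -1;
    }
}
```

An inverted index from hashtag to tweet ids would reduce the lookup to near O(1), at the cost of maintaining the index as tweets expire.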
The tweets received from the Twitter Stream API that carry relevance are additionally processed in the web application. The keywords marked with the hash sign (hashtags) are extracted from the tweets and transferred into a search index for a full-text search. The index is implemented using the open-source component Apache Lucene [Lucene2016]. For the three regions Germany, the UK, and the USA, a separate index is established, and the hashtags are stored according to their region. By querying the index, the number of occurrences as well as an ordering by frequency can be obtained. By means of an additional timer, the hashtags are periodically rendered into a keyword-cloud (tag-cloud) graphic using the open-source component Kumo [Kumo2016]. The number of occurrences per hashtag determines the weighting, and thus the size and weight of the font used to represent it. The arrangement of the words in the graphic is random. Figure 5 shows the keyword clouds of all tweets of the last 140 seconds sent by Twitter users in the United Kingdom and/or Germany. The image is transferred in CZML and delivered to the web browser as a textured polygon within the corresponding country borders.
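The per-region frequency query can be illustrated with a greatly simplified in-memory stand-in for the Lucene index (the actual prototype uses Lucene; this sketch only shows the counting and ordering step that drives the tag-cloud font sizes):

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the per-region index: hashtags are counted per
// region, and the most frequent ones are returned in descending order.
class RegionTagIndex {
    private final Map<String, Map<String, Integer>> countsByRegion = new HashMap<>();

    public void add(String region, String hashtag) {
        countsByRegion
            .computeIfAbsent(region, r -> new HashMap<>())
            .merge(hashtag, 1, Integer::sum);
    }

    // Hashtags of one region, most frequent first; the frequency drives
    // the font size and weight in the rendered tag cloud.
    public List<Map.Entry<String, Integer>> topTags(String region) {
        return countsByRegion.getOrDefault(region, Map.of())
            .entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue(
                Comparator.reverseOrder()))
            .toList();
    }
}
```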
In contrast to the client-side processing of the Tweet-Connections outlined above, processing for the tag cloud is significantly optimized, since the data structure of the index provides random access. Overall, however, there is the disadvantage that the index must be periodically adjusted programmatically, since, according to the non-functional requirements, each hashtag must be removed from the index again after 140 seconds.
If the communication between the participating applications of the distributed system is stream-oriented, then the functions outlined above can only be processed to a limited extent. Fundamentally, the processing of (real-time) data streams differs in several respects from the processing of conventional data [Lui2014]: among other things, new data must constantly be taken into account, and obsolete data must be discarded. Typical scalable batch processing of big data, for example using MapReduce [Dean2008], is only conditionally suitable for these requirements. Instead, software components that propagate the paradigm shift from batch to stream processing for big data are becoming established.
For the Tag-Cluster functionality, such an algorithm is implemented prototypically (on one node) as an iterative algorithm. In this way, geographic clusters are formed from the tweets and visualized on the map. For this purpose, a k-means algorithm [cf. Kanungo2002] is applied to the tweets currently available in the 140-second time window. The open-source component Apache Spark [Meng2016] is used for the implementation; it provides a dedicated implementation for the processing of data streams. The user can set the number k of clusters to be determined in the web interface. The calculation takes place dynamically on the data stream; only the visualization is updated periodically (polling) and is requested by the client from the server. A cluster contains a center point as well as the georeferenced tweets on which it is based. These coordinates are connected into a geometric polygon and displayed on the map, as shown in Figure 6.
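The clustering step can be illustrated with a plain Lloyd-style k-means iteration over the tweet coordinates. The prototype uses Apache Spark's streaming k-means instead; this self-contained sketch only shows the underlying idea:

```java
// Sketch: k-means on the (longitude, latitude) pairs of the tweets in
// the current 140-second window. Assumes k <= number of points; the
// first k points serve as naive initial centers.
class GeoKMeans {
    // points: [n][2] arrays of (longitude, latitude); returns k centers.
    public static double[][] cluster(double[][] points, int k, int iterations) {
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) centers[i] = points[i].clone();

        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            // Assignment step: each point goes to its nearest center.
            for (double[] p : points) {
                int best = nearest(centers, p);
                sums[best][0] += p[0];
                sums[best][1] += p[1];
                counts[best]++;
            }
            // Update step: move each center to the mean of its points.
            for (int i = 0; i < k; i++) {
                if (counts[i] > 0) {
                    centers[i][0] = sums[i][0] / counts[i];
                    centers[i][1] = sums[i][1] / counts[i];
                }
            }
        }
        return centers;
    }

    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.length; i++) {
            double dx = centers[i][0] - p[0], dy = centers[i][1] - p[1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }
}
```

Note that squared Euclidean distance on raw longitude/latitude values is a simplification; for large regions, a proper geodesic distance would be more accurate.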
These polygons show geographic areas which differ in part from the real country borders. Such newly created geographic areas can, for example, be used for further analyses.
[Amazon2016]; https://blogs.aws.amazon.com/bigdata/post/Tx7CTJ1GEWSI57/Visualizing-Real-time-Geotagged-Data-with-Amazon-Kinesis; Amazon Big Data Blog; Visualizing Real-time, Geotagged Data with Amazon Kinesis; Access date 11.07.2016
[Dean2008]; Dean et al.; 2008; MapReduce: Simplified Data Processing on Large Clusters; Communications of the ACM; Volume 51, Issue 1, January 2008
[Feng2015]; Feng et al.; 2015; STREAMCUBE: Hierarchical spatio-temporal hashtag clustering for event exploration over the Twitter stream; 2015 IEEE 31st International Conference on Data Engineering
[Gamma1995]; Gamma et al.; 1995; Design Patterns: Elements of Reusable Object-Oriented Software; Addison-Wesley
[InternetLiveStats2016]; http://www.internetlivestats.com/twitter-statistics/; Twitter Usage Statistics; Access date 11.07.2016
[Kanungo2002]; Kanungo et al.; 2002; An efficient k-means clustering algorithm: analysis and implementation; IEEE Transactions on Pattern Analysis and Machine Intelligence
[Kulkarni2015]; Kulkarni et al.; 2015; Twitter Heron: Stream Processing at Scale; Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
[Kumo2016]; https://github.com/kennycason/kumo; Kumo – Java Word Cloud; Access date 08.07.2016
[Lucene2016]; https://lucene.apache.org/core/; Ultra-Fast Search Library; Access date 06.07.2016
[Lui2014]; Lui et al.; 2014; Survey of Real-time Processing Systems for Big Data; IDEAS '14: Proceedings of the 18th International Database Engineering & Applications Symposium; Pages 356–361
[Meng2016]; Meng et al.; 2016; MLlib: Machine Learning in Apache Spark; The Journal of Machine Learning Research; Volume 17, Issue 1, January 2016, Pages 1235–1241
[NCStateUniversity2016]; https://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/; Sentiment Viz; Access date 12.07.2016
[OneMillionTweetMap2016]; http://onemilliontweetmap.com/; The one million tweet map; Access date 11.07.2016
[Samuel2015]; Samuel et al.; 2015; A survey on Big Data and its research challenges; ARPN Journal of Engineering and Applied Sciences; Vol. 10, No. 8, May 2015
[Twitter2016]; https://dev.twitter.com/streaming/overview; Twitter Streaming APIs; Access date 11.07.2016