About
How can one effectively identify relevant messages among the hundreds of millions of Twitter messages posted every day? Twinder, a scalable search engine for Twitter streams, was built to solve this problem. This site introduces Twinder, covering its architecture, some implementation details, and our scientific results.
1. Overview
Twinder (Twitter Finder) is a search engine for Twitter streams that aims to improve search for Twitter messages by going beyond keyword-based matching. To predict the relevance of tweets for a given search query, Twinder considers different types of features, ranging from syntactical to contextual ones. The figure below shows the core components of the Twinder architecture. Several components are concerned with extracting features from the incoming messages of a Twitter stream. Given the huge number of Twitter messages published every day, the system is designed to be scalable. For this reason, Twinder uses cloud computing infrastructures for processing-intensive tasks such as feature extraction and indexing.
- Feature Extraction: The Feature Extraction component generates and indexes the features used for relevance estimation. It receives Twitter messages from Social Web streams and implements a suite of functions for representing the tweets.
- Feature Extraction Task Broker: The Feature Extraction Task Broker dispatches feature extraction and indexing tasks to cloud computing infrastructures so that they are processed quickly.
- Relevance Estimation: The Relevance Estimation component is the most crucial part of Twinder: for a given topic, it determines the relevance of the tweets, which are represented by means of the extracted features.
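To make the Feature Extraction component more concrete, the sketch below derives a few of the syntactical features reported in the results section (hasHashtag, hasURL, isReply, length) from a raw tweet. The function name and the exact rules are our own simplification for illustration, not Twinder's actual implementation, which extracts a much richer feature set.

```python
import re

def extract_syntactical_features(tweet: str) -> dict:
    """Derive simple syntactical features from a raw tweet text.

    Illustrative only: the feature names follow those reported in the
    results section, but the real Twinder extractors are richer.
    """
    return {
        "hasHashtag": "#" in tweet,                           # contains a hashtag
        "hasURL": bool(re.search(r"https?://\S+", tweet)),    # contains a link
        "isReply": tweet.startswith("@"),                     # reply to another user
        "length": len(tweet),                                 # length in characters
    }

features = extract_syntactical_features("@bob check http://t.co/x #twinder")
```

Each such feature later receives a per-topic weight in the relevance estimation step.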
2. Implementation
Twinder can leverage cloud computing infrastructures to process Twitter streams more efficiently. On infrastructures such as Amazon's, large-volume data processing is facilitated by implementing the processing procedure in the MapReduce programming framework. As a result, users who need large volumes of data processed can focus on implementing their algorithms, while a MapReduce service such as Amazon EMR takes care of resource management. Moreover, since the MapReduce framework is designed for large-scale data processing tasks, scalability comes naturally.
To show that Twinder indeed processes data more efficiently with this approach, we present pseudo code for building the index, which is the most time-consuming task in a search engine, and we evaluate the performance of this approach against indexing on a mainstream server.
2.1 Pseudo Code
The pseudo code for generating the inverted index with the MapReduce programming framework is given below.
Mapper(LongWritable tweetId, Text tweet) {
    String[] tokens = Tokenize(tweet);
    // Count the frequency of each token and store the counts in a HashMap hm
    foreach (token in hm) {
        Emit((token, tweetId), frequency);  // posting: token with tweet id, number of occurrences
        Emit(indicator(token), 1);          // one special indicator per token, used as a counter
    }
}

Partitioner(Object tokenWithTweetId, Object emittedFrequency, int numberOfInstances) {
    // Assign the reducer id according to the content of the token,
    // so that all pairs for one token reach the same reducer
}

Reducer(Text key, Iterable values) {
    // The key is either a token (with a tweet id) or the indicator of a distinct token
    if (key is the indicator of a distinct token) {
        Output a signal (key, value) pair that starts a new line in the index file;
    } else {
        Split the token and the tweetId from the key;
        if (the token is new)
            Output the token as key; the token count, the tweetId, and the
            frequency of the token in the tweet as value;
        else
            Output a signal as key; the tweetId and the frequency of the
            token in the tweet as value;
    }
}
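To make the map/shuffle/reduce data flow concrete, the same inverted-index construction can be simulated in plain Python. This is a single-process sketch under our own simplifications (whitespace tokenization, no indicator/counter trick, shuffle simulated by sorting), not the Hadoop job itself.

```python
from collections import Counter, defaultdict

def map_phase(tweet_id, tweet):
    """Mapper: tokenize one tweet and emit ((token, tweet_id), frequency) pairs."""
    counts = Counter(tweet.lower().split())
    for token, frequency in counts.items():
        yield (token, tweet_id), frequency

def reduce_phase(pairs):
    """Shuffle + reduce: group the emitted postings by token into an inverted index."""
    index = defaultdict(list)
    for (token, tweet_id), frequency in sorted(pairs):  # sorting stands in for the shuffle
        index[token].append((tweet_id, frequency))
    return dict(index)

# Tiny example stream: tweet id -> text.
tweets = {1: "twinder search twitter", 2: "twitter search search"}
pairs = [p for tid, text in tweets.items() for p in map_phase(tid, text)]
index = reduce_phase(pairs)
# e.g. index["search"] == [(1, 1), (2, 2)]: "search" occurs once in tweet 1, twice in tweet 2
```

In the real job, the Partitioner routes all pairs of one token to the same reducer, so each reducer writes a disjoint slice of this index.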
2.2 Performance
We compared the performance of indexing on Amazon EMR with that on a local server. Some results are listed below.
Corpus size | Mainstream server | Amazon EMR instances (x10)
---|---|---
100k tweets (13 MB) | 0.4 min | 5 min
1m tweets (122 MB) | 5 min | 8 min
10m tweets (1.3 GB) | 48 min | 19 min
32m tweets (3.9 GB) | 283 min | 47 min
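Expressed as speedup factors, the table shows the expected trade-off: for small corpora the job-startup overhead of EMR dominates, while for large corpora the ten instances pay off. A back-of-the-envelope computation from the runtimes above:

```python
# Runtimes in minutes, copied from the table above: (corpus, server, EMR x10).
runs = [("100k", 0.4, 5), ("1m", 5, 8), ("10m", 48, 19), ("32m", 283, 47)]

# Speedup of the ten EMR instances over the mainstream server (>1 means EMR wins).
speedups = {corpus: round(server / emr, 2) for corpus, server, emr in runs}
```

For the 32m-tweet corpus the ten EMR instances are roughly six times faster than the single server.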
3. Scientific results
3.1 Publication
We have just submitted a paper about Twinder to the 12th International Conference on Web Engineering. Please stay tuned!
3.2 Additional results and work in progress
In the submitted paper, we investigated the influence of topic characteristics in terms of three aspects: popularity, temporal persistence, and whether a topic is global or local. Beyond this, we also studied the problem from another point of view: the topics were classified as belonging to one of five themes, namely business, entertainment, sports, politics, or technology. While 2022 FIFA soccer (MB002) is a sports topic, Cuomo budget cuts (MB019), for instance, is considered a political topic. Since there are not many topics in each category, we cannot draw strong conclusions, but some indicators in the results may stimulate interesting ideas; one could derive hypotheses from them and then test them experimentally. For example, negative sentiment expressed by users appears to be a less indicative feature for technology topics than for the other themes. Also, URLs are not very important features for tweets about technology. The detailed results are given in the tables below.
Measure | business | entertainment | sports | politics | technology
---|---|---|---|---|---
precision | 0.4533 | 0.3811 | 0.2615 | 0.3410 | 0.5055
recall | 0.7855 | 0.5791 | 0.1298 | 0.4286 | 0.4600
F-measure | 0.5765 | 0.4597 | 0.1735 | 0.3798 | 0.4817
Feature Category | Feature | business | entertainment | sports | politics | technology
---|---|---|---|---|---|---
keyword-based | keyword-based | 0.2141 | 0.2193 | 0.1308 | 0.1725 | 0.2222
semantic-based | semantic-based | 0.2377 | 0.2196 | 0.0576 | 0.0456 | 0.0255
semantic-based | semantic overlap | 1.6723 | 0.4456 | 0.9628 | 1.0571 | 1.9869
syntactical | hasHashtag | -0.7988 | 0.4924 | 0.2769 | -0.0721 | -0.1558
syntactical | hasURL | 1.9821 | 1.2488 | 1.3416 | 1.1434 | 0.2866
syntactical | isReply | -0.2586 | -0.4847 | -0.7376 | -0.8343 | -0.5945
syntactical | length | 0.0042 | 0.0003 | 0.0064 | -0.0003 | 0.0036
semantics | #entities | -0.2630 | -0.1460 | 0.0515 | 0.0432 | 0.0965
semantics | diversity | -0.0168 | 0.3047 | -0.0431 | 0.0923 | -0.0865
semantics | negative sentiment | 1.0360 | 0.5576 | 1.1343 | 0.5368 | 0.0687
semantics | neutral sentiment | 0.4097 | -0.0993 | 0.2685 | 0.3741 | 0.4480
semantics | positive sentiment | -0.9105 | -0.1835 | -1.1752 | -1.0252 | -1.0383
contextual | #followers | 0.0000 | 0.0000 | -0.0001 | 0.0000 | -0.0001
contextual | #lists | 0.0002 | 0.0004 | 0.0010 | 0.0001 | 0.0022
contextual | Twitter age | 0.0588 | 0.3066 | 0.2764 | 0.1963 | 0.3265
contextual | #tweets | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
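One way to read these weights is as coefficients in a linear, logistic-regression-style relevance model. The sketch below scores a hypothetical tweet with a handful of the technology-topic weights from the table; the model form, the bias-free scoring, and the example feature values are our assumptions for illustration, not necessarily the exact model Twinder uses.

```python
import math

# A few of the technology-topic weights from the table above.
weights = {
    "hasURL": 0.2866,
    "isReply": -0.5945,
    "negative sentiment": 0.0687,
    "positive sentiment": -1.0383,
    "Twitter age": 0.3265,
}

def relevance_score(features, weights):
    """Logistic-regression-style score: sigmoid of the weighted feature sum.
    Illustrative only; the bias term and the remaining features are omitted."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

# Hypothetical technology tweet: has a URL, positive sentiment,
# posted by an account that is two (arbitrary units of) Twitter age old.
tweet_features = {"hasURL": 1, "isReply": 0, "negative sentiment": 0,
                  "positive sentiment": 1, "Twitter age": 2.0}
score = relevance_score(tweet_features, weights)
```

Under this reading, the large negative weight on positive sentiment for technology pulls the score of this tweet below the 0.5 midpoint.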