About

How can one effectively identify relevant messages among the hundreds of millions of Twitter messages that are posted every day? Twinder, a scalable search engine for Twitter streams, was developed to solve this problem. This site gives an introduction to Twinder, including its architecture, some implementation details and our scientific results.

1. Overview

Twinder (Twitter Finder) is a search engine for Twitter streams that aims to improve search for Twitter messages by going beyond keyword-based matching. Different types of features ranging from syntactical to contextual features are considered by Twinder in order to predict the relevance of tweets for a given search query. The figure below shows the core components of the Twinder architecture. Different components are concerned with extracting features from the incoming messages of a Twitter stream. Given the huge amount of Twitter messages that are published every day, the system is designed to be scalable. For this reason, Twinder makes use of cloud computing infrastructures for processing-intensive tasks such as feature extraction and indexing.
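
As a rough illustration of what syntactical feature extraction could look like, the sketch below computes four features whose names appear in the results of Section 3.2 (hasHashtag, hasURL, isReply, length). The exact definitions used by Twinder are not given on this page, so the rules in the code (a reply starts with "@", length is counted in characters, the two regular expressions) and the function name are assumptions made for illustration only.

```python
import re

# Assumed patterns; Twinder's own extraction rules are not documented here.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"(?:^|\s)#\w+")


def syntactical_features(tweet):
    """Return a dict of simple syntactical features for one tweet."""
    text = tweet.strip()
    return {
        "hasHashtag": bool(HASHTAG_PATTERN.search(text)),
        "hasURL": bool(URL_PATTERN.search(text)),
        "isReply": text.startswith("@"),  # assumption: replies start with a mention
        "length": len(text),              # assumption: length in characters
    }


features = syntactical_features("@alice check http://example.org #twinder")
print(features)
```

Each incoming message would be mapped to such a feature record before indexing, which makes this step easy to run in parallel because it needs no information from other tweets.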

2. Implementation

Twinder can leverage cloud computing infrastructure to process Twitter streams more efficiently. On cloud computing infrastructures such as Amazon's, one can facilitate the processing of large data volumes by implementing the processing procedure in the MapReduce programming framework. As a result, users who need large volumes of data processed can focus on implementing the algorithms, while a MapReduce infrastructure such as Amazon EMR takes care of resource management. Moreover, because the MapReduce framework is designed for large-scale data processing tasks, the approach scales naturally.

To show that Twinder can indeed process data more efficiently with the method described above, we provide pseudo code for creating the index, which is the most time-consuming task in a search engine. Moreover, we evaluated the performance of this approach and compared it with indexing on a mainstream server.

2.1 Pseudo Code

The pseudo code for generating an inverted index with the MapReduce programming framework is given below.
Mapper(LongWritable tweetId, Text tweet) {
 String[] tokens = Tokenize(tweet);
 Count the frequency of each token and store the counts in a HashMap (hm)

 foreach token in hm {
  Emit the pair (the token with the id of the tweet, the number of occurrences).
  Emit one special indicator for each token (used as a counter).
 }
}

Partitioner(Object tokenWithTweetId, Object emittedFrequency, int numberOfInstances) {
 // Assign the reducer id according to the content of the token,
 // so that all records of one token reach the same reducer
}

Reducer(Text key, Iterable values) {
 // the key is either a token (with the tweet id) or the indicator of a distinct token
 if the key is the indicator of a distinct token
  Output a signal (key, value) pair for starting a new line in the index file
 else
  Split the key into the token and the tweet id
  if the token is new
   Output the token as key; the number of tokens, the tweet id, and the number of occurrences of this token in the tweet as value
  else
   Output a signal as key; the tweet id and the number of occurrences of this token in the tweet as value
}
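
The same idea can be tried out without a Hadoop cluster. The following is a minimal single-process sketch in Python that mimics the map, partition and reduce steps. It is a simplification and not the Twinder implementation: it uses the plain token as key instead of the composite key and indicator records of the pseudo code, and the tokenizer, the function names and the sample tweets are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict
from zlib import crc32


def tokenize(tweet):
    # Assumption: lowercase alphanumeric tokens; the real tokenizer is not specified.
    return re.findall(r"[a-z0-9]+", tweet.lower())


def mapper(tweet_id, tweet):
    """Emit (token, (tweet_id, term_frequency)) for each distinct token."""
    for token, freq in Counter(tokenize(tweet)).items():
        yield token, (tweet_id, freq)


def partitioner(token, num_reducers):
    """Send all records of one token to the same reducer (stable hash)."""
    return crc32(token.encode("utf-8")) % num_reducers


def reducer(token, postings):
    """Build one index entry: (number of tweets with the token, sorted postings)."""
    postings = sorted(postings)
    return len(postings), postings


def build_index(tweets, num_reducers=2):
    # Shuffle phase: group the mapper output per reducer and per token.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for tweet_id, tweet in tweets.items():
        for token, posting in mapper(tweet_id, tweet):
            buckets[partitioner(token, num_reducers)][token].append(posting)
    # Reduce phase: each bucket could be processed on a separate instance.
    index = {}
    for bucket in buckets:
        for token in sorted(bucket):
            index[token] = reducer(token, bucket[token])
    return index


tweets = {1: "Twinder finds tweets", 2: "tweets about tweets"}
index = build_index(tweets)
print(index["tweets"])  # (2, [(1, 1), (2, 2)])
```

Because the partitioner depends only on the token, every posting of a token ends up in the same bucket, so each reducer can write its part of the index independently of the others.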

2.2 Performance

We compared the performance of indexing on Amazon EMR with indexing on a local mainstream server. Some of the results are listed below.

Corpus size               Mainstream server   Amazon EMR instances (x10)
100k tweets (13MBytes)    0.4 min             5 min
1m tweets (122MBytes)     5 min               8 min
10m tweets (1.3GBytes)    48 min              19 min
32m tweets (3.9GBytes)    283 min             47 min
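
To make the trend in these numbers easier to read, the short Python sketch below derives the speedup (server time divided by EMR time) for each corpus size. The timings are copied from the table above; the variable and function names are only illustrative.

```python
# Indexing times in minutes, copied from the table above:
# (corpus, mainstream server, Amazon EMR)
timings = [
    ("100k tweets", 0.4, 5),
    ("1m tweets", 5, 8),
    ("10m tweets", 48, 19),
    ("32m tweets", 283, 47),
]


def speedup(server_minutes, emr_minutes):
    """A ratio above 1 means Amazon EMR finished faster than the server."""
    return server_minutes / emr_minutes


speedups = {corpus: speedup(server, emr) for corpus, server, emr in timings}
for corpus, ratio in speedups.items():
    print(f"{corpus}: {ratio:.2f}x")
```

For the two smaller corpora the ratio is below 1, so the single server is faster. For the two larger corpora Amazon EMR is faster by a factor of roughly 2.5 and 6 respectively.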

3. Scientific results

3.1 Publication

We have just submitted a paper about Twinder to the 12th International Conference on Web Engineering. Please stay tuned!

3.2 Additional results and work in progress

In the submitted paper, we investigated the influence of topic characteristics in terms of three aspects: popularity, temporal persistence, and whether a topic is global or local. In addition, we studied the problem from another point of view. The topics were classified as belonging to one of five themes: business, entertainment, sports, politics or technology. For instance, 2022 FIFA soccer (MB002) is a sports topic, while Cuomo budget cuts (MB019) is considered a political topic. As there are only a few topics in each theme, we cannot draw strong conclusions from the results. However, some indicators in the results may stimulate interesting ideas, from which one could derive hypotheses and then test them experimentally. For example, the negative sentiment that users express might be a weaker indicator for technology topics than for the other themes. Also, URLs do not appear to be a very important feature for tweets about technology. The detailed results are given below.

Performance Measure   business   entertainment   sports    politics   technology
precision             0.4533     0.3811          0.2615    0.3410     0.5055
recall                0.7855     0.5791          0.1298    0.4286     0.4600
F-measure             0.5765     0.4597          0.1735    0.3798     0.4817

Feature Category   Feature              business   entertainment   sports    politics   technology
keyword-based      keyword-based         0.2141     0.2193          0.1308    0.1725     0.2222
semantic-based     semantic-based        0.2377     0.2196          0.0576    0.0456     0.0255
semantic-based     semantic overlap      1.6723     0.4456          0.9628    1.0571     1.9869
syntactical        hasHashtag           -0.7988     0.4924          0.2769   -0.0721    -0.1558
syntactical        hasURL                1.9821     1.2488          1.3416    1.1434     0.2866
syntactical        isReply              -0.2586    -0.4847         -0.7376   -0.8343    -0.5945
syntactical        length                0.0042     0.0003          0.0064   -0.0003     0.0036
semantics          #entities            -0.2630    -0.1460          0.0515    0.0432     0.0965
semantics          diversity            -0.0168     0.3047         -0.0431    0.0923    -0.0865
semantics          negative sentiment    1.0360     0.5576          1.1343    0.5368     0.0687
semantics          neutral sentiment     0.4097    -0.0993          0.2685    0.3741     0.4480
semantics          positive sentiment   -0.9105    -0.1835         -1.1752   -1.0252    -1.0383
contextual         #followers            0.0000     0.0000         -0.0001    0.0000    -0.0001
contextual         #lists                0.0002     0.0004          0.0010    0.0001     0.0022
contextual         Twitter age           0.0588     0.3066          0.2764    0.1963     0.3265
contextual         #tweets               0.0000     0.0000          0.0000    0.0000     0.0000
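
For readers who want to relate the three performance measures to each other, the sketch below recomputes the F-measure from the reported precision and recall. It assumes that the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which this page does not state explicitly. The numbers are copied from the performance results above.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) per theme, from the results above.
reported = {
    "business": (0.4533, 0.7855, 0.5765),
    "entertainment": (0.3811, 0.5791, 0.4597),
    "sports": (0.2615, 0.1298, 0.1735),
    "politics": (0.3410, 0.4286, 0.3798),
    "technology": (0.5055, 0.4600, 0.4817),
}

for theme, (p, r, f) in reported.items():
    print(f"{theme}: computed {f_measure(p, r):.4f}, reported {f:.4f}")
```

Under this assumption the computed values agree with the reported ones to four decimals for entertainment, sports, politics and technology. For business the computed value is about 0.575 against a reported 0.5765; the page does not say why, so the gap may come from rounding or from averaging over topics.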