About
How can one effectively identify relevant messages among the hundreds of millions of Twitter messages posted every day? Twinder, a scalable search engine for Twitter streams, was built to solve this problem. This site introduces Twinder, covering its architecture, some implementation details, and our scientific results.
1. Overview
Twinder (Twitter Finder) is a search engine for Twitter streams that aims to improve search for Twitter messages by going beyond keyword-based matching. To predict the relevance of tweets for a given search query, Twinder considers different types of features, ranging from syntactical to contextual ones. The figure below shows the core components of the Twinder architecture. Several components are concerned with extracting features from the incoming messages of a Twitter stream. Given the huge number of Twitter messages published every day, the system is designed to be scalable. For this reason, Twinder uses cloud computing infrastructures for processing-intensive tasks such as feature extraction and indexing.
- Feature Extraction: The Feature Extraction component generates and indexes the features used for relevance estimation. It receives Twitter messages from Social Web streams and implements a suite of functions for representing the tweets.
- Feature Extraction Task Broker: The Feature Extraction Task Broker dispatches feature extraction and indexing tasks to cloud computing infrastructures so that they are processed quickly.
- Relevance Estimation: The Relevance Estimation component is the most crucial part of Twinder: for a given topic, it determines the relevance of the tweets, which are represented by means of the extracted features.
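To make the Feature Extraction component more concrete, the sketch below derives a few of the syntactical features reported in the results section (hasHashtag, hasURL, isReply, length) from a raw tweet. The function name and the exact rules are our own simplification for illustration, not Twinder's actual implementation, which extracts a much richer feature set.

```python
import re

def extract_syntactical_features(tweet: str) -> dict:
    """Derive simple syntactical features from a raw tweet text.

    Illustrative only: the feature names follow those reported in the
    results section, but the real Twinder extractors are richer.
    """
    return {
        "hasHashtag": "#" in tweet,                           # contains a hashtag
        "hasURL": bool(re.search(r"https?://\S+", tweet)),    # contains a link
        "isReply": tweet.startswith("@"),                     # reply to another user
        "length": len(tweet),                                 # length in characters
    }

features = extract_syntactical_features("@bob check http://t.co/x #twinder")
```

Each such feature later receives a per-topic weight in the relevance estimation step.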
2. Implementation
Twinder can leverage cloud computing infrastructures to process Twitter streams more efficiently. On infrastructures such as Amazon's, large-volume data processing is facilitated by implementing the processing procedure in the MapReduce programming framework. As a result, users who need large volumes of data processed can focus on implementing their algorithms, while a MapReduce service such as Amazon EMR takes care of resource management. Moreover, since the MapReduce framework is designed for large-scale data processing tasks, scalability comes naturally.
To show that Twinder indeed processes data more efficiently with this approach, we present pseudo code for building the index, which is the most time-consuming task in a search engine, and we evaluate the performance of this approach against indexing on a mainstream server.
2.1 Pseudo Code
The pseudo code for generating the inverted index with the MapReduce programming framework is given below.
Mapper(LongWritable tweetId, Text tweet) {
    String[] tokens = Tokenize(tweet);
    // Count the frequency of each token and store the counts in a HashMap hm
    foreach (token in hm) {
        Emit((token, tweetId), frequency);  // posting: token with tweet id, number of occurrences
        Emit(indicator(token), 1);          // one special indicator per token, used as a counter
    }
}

Partitioner(Object tokenWithTweetId, Object emittedFrequency, int numberOfInstances) {
    // Assign the reducer id according to the content of the token,
    // so that all pairs for one token reach the same reducer
}

Reducer(Text key, Iterable values) {
    // The key is either a token (with a tweet id) or the indicator of a distinct token
    if (key is the indicator of a distinct token) {
        Output a signal (key, value) pair that starts a new line in the index file;
    } else {
        Split the token and the tweetId from the key;
        if (the token is new)
            Output the token as key; the token count, the tweetId, and the
            frequency of the token in the tweet as value;
        else
            Output a signal as key; the tweetId and the frequency of the
            token in the tweet as value;
    }
}
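To make the map/shuffle/reduce data flow concrete, the same inverted-index construction can be simulated in plain Python. This is a single-process sketch under our own simplifications (whitespace tokenization, no indicator/counter trick, shuffle simulated by sorting), not the Hadoop job itself.

```python
from collections import Counter, defaultdict

def map_phase(tweet_id, tweet):
    """Mapper: tokenize one tweet and emit ((token, tweet_id), frequency) pairs."""
    counts = Counter(tweet.lower().split())
    for token, frequency in counts.items():
        yield (token, tweet_id), frequency

def reduce_phase(pairs):
    """Shuffle + reduce: group the emitted postings by token into an inverted index."""
    index = defaultdict(list)
    for (token, tweet_id), frequency in sorted(pairs):  # sorting stands in for the shuffle
        index[token].append((tweet_id, frequency))
    return dict(index)

# Tiny example stream: tweet id -> text.
tweets = {1: "twinder search twitter", 2: "twitter search search"}
pairs = [p for tid, text in tweets.items() for p in map_phase(tid, text)]
index = reduce_phase(pairs)
# e.g. index["search"] == [(1, 1), (2, 2)]: "search" occurs once in tweet 1, twice in tweet 2
```

In the real job, the Partitioner routes all pairs of one token to the same reducer, so each reducer writes a disjoint slice of this index.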
2.2 Performance
We compared the performance of indexing on Amazon EMR with that on a local server. Some results are listed below.
Corpus size | Mainstream server | Amazon EMR instances (x10)
---|---|---
100k tweets (13 MB) | 0.4 min | 5 min
1m tweets (122 MB) | 5 min | 8 min
10m tweets (1.3 GB) | 48 min | 19 min
32m tweets (3.9 GB) | 283 min | 47 min
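Expressed as speedup factors, the table shows the expected trade-off: for small corpora the job-startup overhead of EMR dominates, while for large corpora the ten instances pay off. A back-of-the-envelope computation from the runtimes above:

```python
# Runtimes in minutes, copied from the table above: (corpus, server, EMR x10).
runs = [("100k", 0.4, 5), ("1m", 5, 8), ("10m", 48, 19), ("32m", 283, 47)]

# Speedup of the ten EMR instances over the mainstream server (>1 means EMR wins).
speedups = {corpus: round(server / emr, 2) for corpus, server, emr in runs}
```

For the 32m-tweet corpus the ten EMR instances are roughly six times faster than the single server.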
3. Scientific results
3.1 Publication
We have just submitted a paper about Twinder to the 12th International Conference on Web Engineering. Please stay tuned!
3.2 Additional results and work in progress
In the submitted paper, we investigated the influence of topic characteristics in terms of three aspects: popularity, temporal persistence, and whether a topic is global or local. Beyond this, we also studied the problem from another point of view: the topics were classified as belonging to one of five themes, namely business, entertainment, sports, politics, or technology. While 2022 FIFA soccer (MB002) is a sports topic, Cuomo budget cuts (MB019), for instance, is considered a political topic. Since there are not many topics in each category, we cannot draw strong conclusions, but some indicators in the results may stimulate interesting ideas; one could derive hypotheses from them and then test them experimentally. For example, negative sentiment expressed by users appears to be a less indicative feature for technology topics than for the other themes. Also, URLs are not very important features for tweets about technology. The detailed results are given in the tables below.
Measure | business | entertainment | sports | politics | technology
---|---|---|---|---|---
precision | 0.4533 | 0.3811 | 0.2615 | 0.3410 | 0.5055
recall | 0.7855 | 0.5791 | 0.1298 | 0.4286 | 0.4600
F-measure | 0.5765 | 0.4597 | 0.1735 | 0.3798 | 0.4817
Feature Category | Feature | business | entertainment | sports | politics | technology
---|---|---|---|---|---|---
keyword-based | keyword-based | 0.2141 | 0.2193 | 0.1308 | 0.1725 | 0.2222
semantic-based | semantic-based | 0.2377 | 0.2196 | 0.0576 | 0.0456 | 0.0255
semantic-based | semantic overlap | 1.6723 | 0.4456 | 0.9628 | 1.0571 | 1.9869
syntactical | hasHashtag | -0.7988 | 0.4924 | 0.2769 | -0.0721 | -0.1558
syntactical | hasURL | 1.9821 | 1.2488 | 1.3416 | 1.1434 | 0.2866
syntactical | isReply | -0.2586 | -0.4847 | -0.7376 | -0.8343 | -0.5945
syntactical | length | 0.0042 | 0.0003 | 0.0064 | -0.0003 | 0.0036
semantics | #entities | -0.2630 | -0.1460 | 0.0515 | 0.0432 | 0.0965
semantics | diversity | -0.0168 | 0.3047 | -0.0431 | 0.0923 | -0.0865
semantics | negative sentiment | 1.0360 | 0.5576 | 1.1343 | 0.5368 | 0.0687
semantics | neutral sentiment | 0.4097 | -0.0993 | 0.2685 | 0.3741 | 0.4480
semantics | positive sentiment | -0.9105 | -0.1835 | -1.1752 | -1.0252 | -1.0383
contextual | #followers | 0.0000 | 0.0000 | -0.0001 | 0.0000 | -0.0001
contextual | #lists | 0.0002 | 0.0004 | 0.0010 | 0.0001 | 0.0022
contextual | Twitter age | 0.0588 | 0.3066 | 0.2764 | 0.1963 | 0.3265
contextual | #tweets | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
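One way to read these weights is as coefficients in a linear, logistic-regression-style relevance model. The sketch below scores a hypothetical tweet with a handful of the technology-topic weights from the table; the model form, the bias-free scoring, and the example feature values are our assumptions for illustration, not necessarily the exact model Twinder uses.

```python
import math

# A few of the technology-topic weights from the table above.
weights = {
    "hasURL": 0.2866,
    "isReply": -0.5945,
    "negative sentiment": 0.0687,
    "positive sentiment": -1.0383,
    "Twitter age": 0.3265,
}

def relevance_score(features, weights):
    """Logistic-regression-style score: sigmoid of the weighted feature sum.
    Illustrative only; the bias term and the remaining features are omitted."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

# Hypothetical technology tweet: has a URL, positive sentiment,
# posted by an account that is two (arbitrary units of) Twitter age old.
tweet_features = {"hasURL": 1, "isReply": 0, "negative sentiment": 0,
                  "positive sentiment": 1, "Twitter age": 2.0}
score = relevance_score(tweet_features, weights)
```

Under this reading, the large negative weight on positive sentiment for technology pulls the score of this tweet below the 0.5 midpoint.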