About
This site provides supplemental material and information about the paper Building a Microblog Corpus for Search Result Diversification.
1. Datasets
Tweets2013 corpus: For this work, we use the same corpus as the TREC 2013 Microblog Track. According to the Guidelines, the tweet collection, which we refer to as the Tweets2013 corpus, can be obtained using the tool provided by Jimmy Lin. The collection contains tweets gathered via the Twitter Streaming API over a two-month period (February 1st to March 31st, 2013, inclusive) on an EC2 instance based in Europe, resulting in 291,922,201 tweets. Unfortunately, we cannot distribute the corpus due to the Terms of Service of the Twitter API. However, you can interact with this corpus via a track-as-a-service search API (not always up); please see the specification in the Guidelines and the discussions on the TREC Microblog mailing list hosted on Google Groups. Given the way the tweets were obtained, the corpus can be considered a representative sample of the whole Twitter stream, which makes it suitable for investigating search result diversification on microblogs.
Search topics: To select a part of Tweets2013 for subtopic annotation, we checked the archive of the Current events entries on the English Wikipedia portal page and manually created 50 topics, evenly distributed over the two-month period. You can find these 50 topics in TREC format via this link. Since we do not consider query time, the querytime and querytweettime fields in the file are intentionally set to Sun Mar 31 23:59:59 +0000 2013 and 318513034039529472, respectively.
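For reference, a topic entry in the file looks roughly like the sketch below, following the usual TREC Microblog topic layout. The topic number and query text here are illustrative placeholders (the actual IDs follow the AIRSxxx scheme mentioned under Subtopic Annotation), while the querytime and querytweettime values are the fixed ones noted above.

```xml
<top>
<num> Number: AIRS001 </num>
<query> hypothetical query text </query>
<querytime> Sun Mar 31 23:59:59 +0000 2013 </querytime>
<querytweettime> 318513034039529472 </querytweettime>
</top>
```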
Manual Run: Given the search topics, we have to create a pool of documents to judge for each topic. One of the paper's authors manually created complex Indri queries for each topic. You may download the search results (top 10,000 tweets per topic) of querying the corpus with these manually created queries here. After filtering out near-duplicates (tweets with cosine similarity > 0.9 to a tweet ranked higher), we obtained the input for the subtopic annotation process.
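As an illustration of this filtering step, here is a minimal Python sketch. The whitespace tokenization and raw term-frequency weighting are our assumptions; the description above only fixes the cosine threshold of 0.9 and the higher-ranked-tweet-wins policy.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def filter_near_duplicates(ranked_tweets, threshold=0.9):
    """Walk the ranking top-down and drop any tweet whose similarity to an
    already-kept, higher-ranked tweet exceeds the threshold."""
    kept, kept_vecs = [], []
    for text in ranked_tweets:
        vec = Counter(text.lower().split())  # naive whitespace tokenization (assumption)
        if all(cosine(vec, v) <= threshold for v in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept
```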
Subtopic Annotation: The annotation task was distributed to two annotators. Each of them judged the top 500 tweets for 25 topics from the input file. Later, topics AIRS023, AIRS025, and AIRS035 were dropped due to a lack of relevant tweets. The annotation results, in MySQL dump format, may be downloaded via this link.
Automatic Run: The automatic run is a standard query-likelihood retrieval run (language modeling with Dirichlet smoothing, μ = 1000) as implemented in the Lemur Toolkit for IR. It serves as the input for the various diversification strategies. You may download this run via this link.
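For clarity, the scoring function behind such a run can be sketched as follows. The run itself was produced with the Lemur Toolkit, so this Python fragment (with variable names of our own choosing) only illustrates the Dirichlet-smoothed query-likelihood formula with μ = 1000; it is not the code used for the paper.

```python
from collections import Counter
from math import log

MU = 1000  # Dirichlet smoothing parameter used for the automatic run

def ql_score(query_terms, doc_terms, collection_tf, collection_len):
    """Dirichlet-smoothed query likelihood:
    log P(Q|D) = sum_q log( (tf(q, D) + MU * P(q|C)) / (|D| + MU) )."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p_c = collection_tf.get(q, 0) / collection_len  # background model P(q|C)
        if tf[q] == 0 and p_c == 0.0:
            continue  # term unseen everywhere; skip instead of taking log(0)
        score += log((tf[q] + MU * p_c) / (dlen + MU))
    return score

# Toy usage with a hypothetical two-document "collection":
docs = [["obama", "visits", "berlin"], ["berlin", "wall", "anniversary"]]
ctf = Counter(t for d in docs for t in d)
print(ql_score(["obama", "berlin"], docs[0], ctf, sum(ctf.values())))
```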
Semantics: To better understand the semantics of tweets, and in particular to detect duplicates, we extract entities (via DBpedia Spotlight, OpenCalais, and Wikipedia Miner) and topics (via OpenCalais) from the tweet content. The resulting datasets are listed in the table below, followed by a sketch of querying one of these services.
| name | number of records | description |
|------|-------------------|-------------|
| source_semantics_tweets_db.sql.bz2 | 601,890 | entity assignments (DBpedia Spotlight) extracted from tweets; 35,885 distinct entities |
| source_semantics_tweets_oc_entity.sql.bz2 | 370,396 | entity assignments (OpenCalais) extracted from tweets; 117,893 distinct entities |
| source_semantics_tweets_oc_topic.sql.bz2 | 154,938 | topic assignments (18 distinct topics) extracted from tweets |
| source_semantics_tweets_wpm.sql.bz2 | 896,702 | entity assignments (Wikipedia Miner) extracted from tweets; 64,065 distinct entities |
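As an illustration of the entity-extraction step, the minimal sketch below queries the public DBpedia Spotlight REST endpoint. The endpoint URL and the confidence parameter reflect the current public API and are assumptions on our part; they are not necessarily the service version or settings used to build the tables above.

```python
import requests

# Current public DBpedia Spotlight endpoint (assumption; the paper predates it)
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    """Return (surface form, DBpedia URI) pairs spotted in a piece of text."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])  # absent when nothing is spotted
    return [(r["@surfaceForm"], r["@URI"]) for r in resources]

print(annotate("Obama met Merkel in Berlin"))
```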
2. Further Findings and Updates
We will give further updates on the results in this section.
Acknowledgement
Special thanks to SARA: For this project we made use of the BiG Grid Hadoop cluster at SARA.