About

This site provides supplemental material and information about the paper Groundhog: Near Duplicate Detection on Twitter.

1. Datasets

Tweets2011 corpus: The corpus is a representative sample of tweets posted during a period of roughly 2 weeks (January 23rd to February 8th, 2011, inclusive). As the corpus is designed to be a reusable test collection for investigating Twitter search and ranking, it is used for the experiments on duplicate detection and search result diversification in the paper. Under Twitter's terms of service (TOS), we are not allowed to make this dataset publicly available. However, a list of tweet IDs and the corresponding Twitter user names can be downloaded after signing an agreement with TREC (see details on the TREC website).

External Resources: To enrich the semantics of the tweets, we also crawled the external resources by following the links mentioned in the tweets. You can download the SQL dump for the external resources via this link.

Enriched Semantics: Given the content of Twitter messages and external resources, we extract entities (using DBpedia Spotlight and OpenCalais) and topics (using OpenCalais) to better understand the semantics of tweets in order to detect duplicates.

name | number of records | description
dbpediaEntity_www2013.sql.gz | 6,995 | entity assignments (DBpedia Spotlight) extracted from tweets; 2,102 distinct entities
semanticsTweetsEntity_www2013.sql.gz | 6,292 | entity assignments (OpenCalais) extracted from tweets; 3,492 distinct entities
semanticsTweetsTopic_www2013.sql.gz | 2,811 | topic assignments (18 distinct topics) extracted from tweets
externalResources_www2013.sql.gz | 1,661 | external resources crawled by retrieving the links mentioned in the tweets
dbpediaEntityExternalResources_www2013.sql.gz | 56,081 | entity assignments (DBpedia Spotlight) extracted from external resources; 7,906 distinct entities
semanticsExternalResourcesEntity_www2013.sql.gz | 35,774 | entity assignments (OpenCalais) extracted from external resources; 11,277 distinct entities
semanticsExternalResourcesTopic_www2013.sql.gz | 1,688 | topic assignments (18 distinct topics) extracted from external resources
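
The dumps listed above are gzip-compressed SQL files. As a quick sanity check before importing one into a database, the following minimal sketch counts the INSERT statements in a dump without decompressing it to disk. It assumes the dump is plain-text SQL in which each statement starts with INSERT INTO at the beginning of a line, which this page does not confirm; the function name count_insert_statements is hypothetical.

```python
import gzip


def count_insert_statements(path):
    """Count lines that begin an INSERT statement in a gzip-compressed SQL dump."""
    count = 0
    # errors="replace" keeps the scan going if the dump contains stray non-UTF-8 bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            # Assumption: statements start at the beginning of a line.
            if line.lstrip().upper().startswith("INSERT INTO"):
                count += 1
    return count
```

For example, count_insert_statements("dbpediaEntity_www2013.sql.gz") scans the first dump in the table. Note that a single INSERT statement may carry many rows, so this count need not equal the number of records given above.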

Ground Truth: For each pair of tweet IDs, the ground truth gives the duplicate scale, following the paper's definition of duplicates on 5 different scales. We manually labeled 55,362 pairs, assigning each either one of the duplicate scales or the label non-duplicate. You may download this dataset.

2. Further Findings

In the submitted paper, we investigated the influence of topic characteristics along 3 aspects: popularity, temporal persistence, and whether a topic is global or local. Beyond this, we also studied the problem from another point of view: the theme of the topic. Each topic was classified as belonging to one of five themes: business, entertainment, sports, politics, or technology. For instance, 2022 FIFA soccer (MB002) is a sports topic, while Cuomo budget cuts (MB019) is considered a political topic. As some themes contain only a few topics, we cannot draw strong conclusions from these results. The extreme coefficients for the themes business, sports, and technology are an artifact of the small training size, in particular the small number of positive instances.

Performance Measure | business | entertainment | sports | politics | technology
#topics | 6 | 12 | 5 | 21 | 2
#samples | 11,445 | 7,678 | 1,722 | 30,037 | 1,622
precision | 0.6865 | 0.6844 | 0.5000 | 0.4399 | 0.6383
recall | 0.6615 | 0.7153 | 0.6071 | 0.4713 | 0.7143
F-measure | 0.6737 | 0.6995 | 0.5484 | 0.4551 | 0.6742
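
The F-measure values in the table above can be cross-checked against the reported precision and recall. The sketch below assumes the F-measure is the balanced harmonic mean of precision and recall (F1); the page does not state the weighting, but the recomputed values agree with the reported ones up to rounding. The function name f_measure is introduced here for illustration only.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) per theme, copied from the table above.
reported = {
    "business": (0.6865, 0.6615, 0.6737),
    "entertainment": (0.6844, 0.7153, 0.6995),
    "sports": (0.5000, 0.6071, 0.5484),
    "politics": (0.4399, 0.4713, 0.4551),
    "technology": (0.6383, 0.7143, 0.6742),
}

for theme, (p, r, f) in reported.items():
    print(f"{theme:13s} recomputed={f_measure(p, r):.4f} reported={f:.4f}")
```

Small differences in the last digit are expected, because the precision and recall values in the table are themselves rounded to four decimals.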
Feature Category | Feature | business | entertainment | sports | politics | technology
syntactical | Levenshtein distance | -4.3846 | -5.3021 | -19.7525 | -2.8661 | -3.7745
syntactical | overlap in terms | 8.4204 | 4.6977 | 15.6092 | 1.5605 | 20.1287
syntactical | overlap in hashtags | -49.8329 | -4.1458 | -75.3741 | 0.6000 | 13.5441
syntactical | overlap in URLs | 29.9630 | 0.7779 | 0.2776 | -0.2451 | -0.7363
syntactical | overlap in expanded URLs | 3.4109 | -0.3927 | 17.5296 | 1.1813 | 4.9237
syntactical | length difference | 3.3496 | 1.5237 | 4.0178 | 1.1982 | 3.3533
semantics | overlap in entities | 0.5382 | -0.2384 | -0.7858 | -2.1629 | 0.1235
semantics | overlap in entity types | -0.2645 | -0.5131 | 6.0619 | 1.2680 | 1.9349
semantics | overlap in topics | 2.8981 | 1.6343 | 11.8274 | 1.4209 | 7.3317
semantics | overlap in WordNet concepts | 11.4964 | 3.6632 | -0.7459 | 8.5163 | 0.0574
semantics | overlap in WordNet Synset concepts | -7.1660 | 3.4901 | -4.9551 | -3.0406 | -9.1508
semantics | WordNet similarity | 0.3735 | 0.6025 | -2.2303 | -0.6915 | 3.0477
enriched semantics | overlap in entities | -2.7019 | 1.8265 | 4.9434 | -2.4143 | 0.6195
enriched semantics | overlap in entity types | 1.9580 | 0.0012 | -0.3730 | 0.9447 | 14.4670
enriched semantics | overlap in topics | 0.1529 | 0.4116 | 7.0104 | -0.2560 | -11.9908
enriched semantics | overlap in WordNet concepts | -4.8839 | 1.7720 | -6.9916 | -2.1296 | 3.0021
enriched semantics | overlap in WordNet Synset concepts | 5.3933 | -1.5413 | 7.7378 | 3.0700 | -11.9908
enriched semantics | WordNet similarity | 2.8634 | -0.0972 | 7.7378 | 0.7149 | 3.0021
contextual | temporal difference | -0.2223 | -6.1550 | 6.0158 | -15.5547 | -0.7180
contextual | difference in #followees | -2.3410 | -2.0069 | -6.3268 | 0.8683 | 0.1782
contextual | difference in #followers | 2.8036 | 0.4106 | 1.2073 | -1.0078 | 3.7630
contextual | same client | 0.3290 | 0.8414 | -6.7151 | -0.2858 | -4.2674
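
To make the syntactical features in the table above more concrete, here is a minimal sketch of how three of them could be computed for a pair of tweets. The exact definitions used in the paper are not given on this page, so the following are assumptions: terms are lowercased whitespace tokens, overlap is measured as Jaccard similarity, the Levenshtein distance is left unnormalized, and length difference is the absolute difference in characters. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard_overlap(items_a, items_b):
    """Assumed overlap measure: size of intersection divided by size of union."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def syntactic_features(tweet_a, tweet_b):
    """Compute a few pairwise syntactical features under the assumptions above."""
    terms_a, terms_b = tweet_a.lower().split(), tweet_b.lower().split()
    hashtags_a = [t for t in terms_a if t.startswith("#")]
    hashtags_b = [t for t in terms_b if t.startswith("#")]
    return {
        "levenshtein": levenshtein(tweet_a, tweet_b),
        "term_overlap": jaccard_overlap(terms_a, terms_b),
        "hashtag_overlap": jaccard_overlap(hashtags_a, hashtags_b),
        "length_difference": abs(len(tweet_a) - len(tweet_b)),
    }
```

Calling syntactic_features on two tweet texts returns a dictionary with one value per feature; such values would then be weighted by coefficients like those listed in the table.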

3. Code

Duplicate Detection framework: If you are interested in the code for the Duplicate Detection component in Twinder, please click on this link.

Acknowledgement

Special thanks to SARA: For this project we made use of the BiG Grid Hadoop cluster at SARA.