About
This site provides supplemental material and information about the paper Groundhog: Near Duplicate Detection on Twitter.
1. Datasets
Tweets2011 corpus: The corpus is a representative sample of tweets posted over a period of roughly 2 weeks (January 23rd to February 8th, 2011, inclusive). Because the corpus is designed as a reusable test collection for investigating Twitter search and ranking, we use it for the duplicate detection and search result diversification experiments in the paper. Twitter's terms of service do not allow us to make this dataset publicly available. However, a list of tweet IDs and the corresponding Twitter user names can be downloaded after signing an agreement with TREC (see details on the TREC website).
External Resources: To enrich the semantics of the tweets, we also crawled the external resources that the links mentioned in the tweets point to. You can download the SQL dump for the external resources via this link.
Enriched Semantics: Given the content of Twitter messages and external resources, we extract entities (using DBpedia Spotlight and OpenCalais) and topics (using OpenCalais) to better understand the semantics of tweets and thereby detect duplicates.
name | number of records | description |
---|---|---|
dbpediaEntity_www2013.sql.gz | 6,995 | entity assignments (DBpedia Spotlight) extracted from tweets; 2,102 distinct entities |
semanticsTweetsEntity_www2013.sql.gz | 6,292 | entity assignments (OpenCalais) extracted from tweets; 3,492 distinct entities |
semanticsTweetsTopic_www2013.sql.gz | 2,811 | topic assignments (18 distinct topics) extracted from tweets |
externalResources_www2013.sql.gz | 1,661 | External resources are crawled by retrieving the links mentioned in the tweets. |
dbpediaEntityExternalResources_www2013.sql.gz | 56,081 | entity assignments (DBpedia Spotlight) extracted from external resources; 7,906 distinct entities |
semanticsExternalResourcesEntity_www2013.sql.gz | 35,774 | entity assignments (OpenCalais) extracted from external resources; 11,277 distinct entities |
semanticsExternalResourcesTopic_www2013.sql.gz | 1,688 | topic assignments (18 distinct topics) extracted from external resources |
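Before importing one of the dumps listed above into a database, it can help to peek inside it. The following is a minimal sketch that assumes the archives are gzip-compressed, UTF-8 text SQL dumps containing `INSERT INTO` statements; the function name `count_insert_statements` is hypothetical and not part of the released material.

```python
import gzip
import re
from collections import Counter

# Assumption: each statement starts a line with "INSERT INTO <table>",
# optionally with backticks around the table name (as mysqldump writes it).
INSERT_RE = re.compile(r"^INSERT INTO\s+`?(\w+)`?", re.IGNORECASE)


def count_insert_statements(path):
    """Count INSERT statements per table in a gzip-compressed SQL dump."""
    counts = Counter()
    # Text mode ("rt") decompresses and decodes line by line, so the whole
    # dump never has to fit into memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = INSERT_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

Calling `count_insert_statements("dbpediaEntity_www2013.sql.gz")` returns a counter keyed by table name. The counts are statements, not records: a dump written with multi-row inserts packs many records into one statement, so the numbers need not match the "number of records" column.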
Ground Truth: Each tweet pair, identified by the IDs of its two tweets, is labeled according to our definition of duplicates on 5 different scales. We manually labeled 55,362 pairs, assigning each either one of the duplicate scales or the label non-duplicate. You may download this dataset.
2. Further Findings
In the submitted paper, we investigated the influence of topic characteristics along 3 aspects: popularity, temporal persistence, and whether a topic is global or local. Here we also look at the problem from another point of view, the theme of the topic. Each topic was classified as belonging to one of five themes: business, entertainment, sports, politics, or technology. For instance, 2022 FIFA soccer (MB002) is a sports topic, while Cuomo budget cuts (MB019) is considered a political topic. As some themes contain only a few topics, we cannot draw strong conclusions from these results. The extreme coefficients for the themes business, sports, and technology are an artifact of the small training sets, in particular the small number of positive instances.
Performance Measure | business | entertainment | sports | politics | technology |
---|---|---|---|---|---|
#topics | 6 | 12 | 5 | 21 | 2 |
#samples | 11,445 | 7,678 | 1,722 | 30,037 | 1,622 |
precision | 0.6865 | 0.6844 | 0.5000 | 0.4399 | 0.6383 |
recall | 0.6615 | 0.7153 | 0.6071 | 0.4713 | 0.7143 |
F-measure | 0.6737 | 0.6995 | 0.5484 | 0.4551 | 0.6742 |
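The F-measure values reported above can be cross-checked against the precision and recall values. The sketch below assumes that "F-measure" means the balanced F1 score, i.e. the harmonic mean of precision and recall; the page does not state this explicitly, and the function name `f_measure` is hypothetical. Because the inputs are themselves rounded to four decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# (precision, recall, reported F-measure) per theme, copied from the table.
reported = {
    "business": (0.6865, 0.6615, 0.6737),
    "entertainment": (0.6844, 0.7153, 0.6995),
    "sports": (0.5000, 0.6071, 0.5484),
    "politics": (0.4399, 0.4713, 0.4551),
    "technology": (0.6383, 0.7143, 0.6742),
}

for theme, (precision, recall, reported_f) in reported.items():
    recomputed = f_measure(precision, recall)
    print(f"{theme:14s} reported={reported_f:.4f} recomputed={recomputed:.4f}")
```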
Feature Category | Feature | business | entertainment | sports | politics | technology |
---|---|---|---|---|---|---|
syntactical | Levenshtein distance | -4.3846 | -5.3021 | -19.7525 | -2.8661 | -3.7745 |
syntactical | overlap in terms | 8.4204 | 4.6977 | 15.6092 | 1.5605 | 20.1287 |
syntactical | overlap in hashtags | -49.8329 | -4.1458 | -75.3741 | 0.6000 | 13.5441 |
syntactical | overlap in URLs | 29.9630 | 0.7779 | 0.2776 | -0.2451 | -0.7363 |
syntactical | overlap in expanded URLs | 3.4109 | -0.3927 | 17.5296 | 1.1813 | 4.9237 |
syntactical | length difference | 3.3496 | 1.5237 | 4.0178 | 1.1982 | 3.3533 |
semantics | overlap in entities | 0.5382 | -0.2384 | -0.7858 | -2.1629 | 0.1235 |
semantics | overlap in entity types | -0.2645 | -0.5131 | 6.0619 | 1.2680 | 1.9349 |
semantics | overlap in topics | 2.8981 | 1.6343 | 11.8274 | 1.4209 | 7.3317 |
semantics | overlap in WordNet concepts | 11.4964 | 3.6632 | -0.7459 | 8.5163 | 0.0574 |
semantics | overlap in WordNet Synset concepts | -7.1660 | 3.4901 | -4.9551 | -3.0406 | -9.1508 |
semantics | WordNet similarity | 0.3735 | 0.6025 | -2.2303 | -0.6915 | 3.0477 |
enriched semantics | overlap in entities | -2.7019 | 1.8265 | 4.9434 | -2.4143 | 0.6195 |
enriched semantics | overlap in entity types | 1.9580 | 0.0012 | -0.3730 | 0.9447 | 14.4670 |
enriched semantics | overlap in topics | 0.1529 | 0.4116 | 7.0104 | -0.2560 | -11.9908 |
enriched semantics | overlap in WordNet concepts | -4.8839 | 1.7720 | -6.9916 | -2.1296 | 3.0021 |
enriched semantics | overlap in WordNet Synset concepts | 5.3933 | -1.5413 | 7.7378 | 3.0700 | -11.9908 |
enriched semantics | WordNet similarity | 2.8634 | -0.0972 | 7.7378 | 0.7149 | 3.0021 |
contextual | temporal difference | -0.2223 | -6.1550 | 6.0158 | -15.5547 | -0.7180 |
contextual | difference in #followees | -2.3410 | -2.0069 | -6.3268 | 0.8683 | 0.1782 |
contextual | difference in #followers | 2.8036 | 0.4106 | 1.2073 | -1.0078 | 3.7630 |
contextual | same client | 0.3290 | 0.8414 | -6.7151 | -0.2858 | -4.2674 |
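To make the syntactical features in the table more concrete, here is a minimal sketch of how some of them could be computed for a pair of tweets. The definitions are assumptions and not necessarily those used in the paper: set overlaps are computed as Jaccard similarity over whitespace tokens, the length difference is measured in characters, and the Levenshtein distance is left unnormalized. All function names and the two example tweets are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (ch_a != ch_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def jaccard(x, y):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def syntactic_features(tweet_a, tweet_b):
    """Assumed versions of some syntactical features for a tweet pair."""
    raw_a, raw_b = tweet_a.split(), tweet_b.split()
    terms_a = {token.lower() for token in raw_a}
    terms_b = {token.lower() for token in raw_b}
    tags_a = {t for t in terms_a if t.startswith("#")}
    tags_b = {t for t in terms_b if t.startswith("#")}
    # URLs keep their original case, since shortened links are case-sensitive.
    urls_a = {t for t in raw_a if t.startswith(("http://", "https://"))}
    urls_b = {t for t in raw_b if t.startswith(("http://", "https://"))}
    return {
        "levenshtein_distance": levenshtein(tweet_a, tweet_b),
        "overlap_in_terms": jaccard(terms_a, terms_b),
        "overlap_in_hashtags": jaccard(tags_a, tags_b),
        "overlap_in_urls": jaccard(urls_a, urls_b),
        "length_difference": abs(len(tweet_a) - len(tweet_b)),
    }


features = syntactic_features(
    "Cuomo unveils budget cuts #nybudget",
    "Cuomo unveils deep budget cuts #NYbudget",
)
for name, value in features.items():
    print(name, value)
```

The coefficients in the table would then weight feature values like these. The sketch leaves out expanded URLs as well as the semantic and contextual features, which depend on external services and user metadata.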
3. Code
Duplicate Detection framework: If you are interested in the code for the Duplicate Detection component of Twinder, please click on this link.
Acknowledgement
Special thanks to SARA: for this project we made use of the BiG Grid Hadoop cluster at SARA.