Groundhog: Near Duplicate Detection on Twitter

About

This site provides supplemental material and information about the paper Groundhog: Near Duplicate Detection on Twitter.

1. Datasets

Tweets2011 corpus: The corpus is a representative sample of tweets posted during a period of roughly 2 weeks (January 23rd to February 8th, 2011, inclusive). As the corpus is designed to be a reusable test collection for investigating Twitter search and ranking, it is used for the duplicate detection and search result diversification experiments in the paper. Under Twitter's terms of service (TOS), we are not allowed to make this dataset publicly available. However, a list of tweet IDs and the corresponding Twitter user names can be downloaded after signing an agreement with TREC (see the TREC website for details).

External Resources: To enrich the semantics of the tweets, we also crawled the external resources by retrieving the links mentioned in the tweets. You can download the SQL dump for the external resources via this link.

Enriched Semantics: Given the content of the Twitter messages and the external resources, we extract entities (using DBpedia Spotlight and OpenCalais) and topics (using OpenCalais) to better understand the semantics of tweets in order to detect duplicates.

name	number of records	description
dbpediaEntity_www2013.sql.gz	6,995	entity assignments (DBpedia Spotlight) extracted from tweets; 2,102 distinct entities
semanticsTweetsEntity_www2013.sql.gz	6,292	entity assignments (OpenCalais) extracted from tweets; 3,492 distinct entities
semanticsTweetsTopic_www2013.sql.gz	2,811	topic assignments (18 distinct topics) extracted from tweets
externalResources_www2013.sql.gz	1,661	external resources crawled by retrieving the links mentioned in the tweets
dbpediaEntityExternalResources_www2013.sql.gz	56,081	entity assignments (DBpedia Spotlight) extracted from external resources; 7,906 distinct entities
semanticsExternalResourcesEntity_www2013.sql.gz	35,774	entity assignments (OpenCalais) extracted from external resources; 11,277 distinct entities
semanticsExternalResourcesTopic_www2013.sql.gz	1,688	topic assignments (18 distinct topics) extracted from external resources

Ground Truth: For a given pair of tweet IDs, the label states which of the 5 duplicate scales applies, following our definition of duplicates, or that the pair is a non-duplicate. We manually labeled 55,362 pairs in this way. You may download this dataset.

2. Further Findings

In the submitted paper, we investigated the influence of topic characteristics in terms of 3 aspects: popularity, temporal persistence, and whether a topic is global or local. Beyond this, we also studied the problem from another point of view: the theme of the topic. Each topic was classified as belonging to one of five themes: business, entertainment, sports, politics, or technology. For instance, 2022 FIFA soccer (MB002) is a sports topic, while Cuomo budget cuts (MB019) is considered a political topic. Because some themes contain only a few topics, we cannot draw strong conclusions from this analysis. The extreme coefficients for the themes business, sports, and technology are an artifact of the small training sets, in particular the small number of positive instances.

Performance	Measure	business	entertainment	sports	politics	technology
	#topics	6	12	5	21	2
	#samples	11,445	7,678	1,722	30,037	1,622
	precision	0.6865	0.6844	0.5000	0.4399	0.6383
	recall	0.6615	0.7153	0.6071	0.4713	0.7143
	F-measure	0.6737	0.6995	0.5484	0.4551	0.6742
Feature Category	Feature	business	entertainment	sports	politics	technology
syntactical	Levenshtein distance	-4.3846	-5.3021	-19.7525	-2.8661	-3.7745
	overlap in terms	8.4204	4.6977	15.6092	1.5605	20.1287
	overlap in hashtags	-49.8329	-4.1458	-75.3741	0.6000	13.5441
	overlap in URLs	29.9630	0.7779	0.2776	-0.2451	-0.7363
	overlap in expanded URLs	3.4109	-0.3927	17.5296	1.1813	4.9237
	length difference	3.3496	1.5237	4.0178	1.1982	3.3533
semantics	overlap in entities	0.5382	-0.2384	-0.7858	-2.1629	0.1235
	overlap in entity types	-0.2645	-0.5131	6.0619	1.2680	1.9349
	overlap in topics	2.8981	1.6343	11.8274	1.4209	7.3317
	overlap in WordNet concepts	11.4964	3.6632	-0.7459	8.5163	0.0574
	overlap in WordNet Synset concepts	-7.1660	3.4901	-4.9551	-3.0406	-9.1508
	WordNet similarity	0.3735	0.6025	-2.2303	-0.6915	3.0477
enriched semantics	overlap in entities	-2.7019	1.8265	4.9434	-2.4143	0.6195
	overlap in entity types	1.9580	0.0012	-0.3730	0.9447	14.4670
	overlap in topics	0.1529	0.4116	7.0104	-0.2560	-11.9908
	overlap in WordNet concepts	-4.8839	1.7720	-6.9916	-2.1296	3.0021
	overlap in WordNet Synset concepts	5.3933	-1.5413	7.7378	3.0700	-11.9908
	WordNet similarity	2.8634	-0.0972	7.7378	0.7149	3.0021
contextual	temporal difference	-0.2223	-6.1550	6.0158	-15.5547	-0.7180
	difference in #followees	-2.3410	-2.0069	-6.3268	0.8683	0.1782
	difference in #followers	2.8036	0.4106	1.2073	-1.0078	3.7630
	same client	0.3290	0.8414	-6.7151	-0.2858	-4.2674
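
To make the syntactical feature rows of the table above more concrete, the following is a minimal sketch of how such pairwise features could be computed for two tweet texts. It is not taken from the Twinder code base. The function names, the regular expressions, the use of Jaccard similarity for the "overlap" features, and the normalization of the Levenshtein distance by the longer text are all assumptions made for illustration; the paper's exact definitions may differ. The "overlap in expanded URLs" feature is omitted because it requires resolving shortened links.

```python
import re

# Hypothetical patterns for this sketch; the original tokenization is not specified here.
HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")
WORD = re.compile(r"\w+")


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity; defined as 0.0 here when both sets are empty (an assumption)."""
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)


def syntactic_features(tweet_a: str, tweet_b: str) -> dict:
    """Compute illustrative syntactical features for a pair of tweet texts."""
    a, b = tweet_a.lower(), tweet_b.lower()
    longest = max(len(a), len(b)) or 1  # avoid division by zero for two empty texts
    # Remove URLs before extracting terms so that link fragments do not count as words.
    terms_a = set(WORD.findall(URL.sub(" ", a)))
    terms_b = set(WORD.findall(URL.sub(" ", b)))
    return {
        "levenshtein": levenshtein(a, b) / longest,
        "term_overlap": jaccard(terms_a, terms_b),
        "hashtag_overlap": jaccard(set(HASHTAG.findall(a)), set(HASHTAG.findall(b))),
        # URLs are taken from the original text because URL paths are case sensitive.
        "url_overlap": jaccard(set(URL.findall(tweet_a)), set(URL.findall(tweet_b))),
        "length_difference": abs(len(a) - len(b)),
    }


if __name__ == "__main__":
    pair = ("FIFA picks Qatar for 2022 #worldcup http://t.co/abc",
            "Qatar to host 2022 FIFA soccer #WorldCup http://t.co/abc")
    for name, value in syntactic_features(*pair).items():
        print(f"{name}: {value}")
```

Each feature maps a tweet pair to a single number, which is the form in which the features enter the table as one coefficient per theme. The example texts in the usage block are made up and are not drawn from the Tweets2011 corpus.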

3. Code

Duplicate Detection framework: If you are interested in the code for the duplicate detection component of Twinder, please click on this link.

Acknowledgement

Special thanks to SARA: for this project we made use of the BiG Grid Hadoop cluster at SARA.