Efficient semantic-aware detection of near duplicate resources

Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

11 Citations (Scopus)

Abstract

Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it to satisfy their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.

Original language: English
Title of host publication: The Semantic Web
Subtitle of host publication: Research and Applications - 7th Extended Semantic Web Conference, ESWC 2010, Proceedings
Pages: 136-150
Number of pages: 15
Edition: PART 2
Publication status: Published - 14 Jul 2010
Externally published: Yes
Event: 7th Extended Semantic Web Conference, ESWC 2010 - Heraklion, Crete, Greece
Duration: 30 May 2010 to 3 Jun 2010

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 6089 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Extended Semantic Web Conference, ESWC 2010
Country/Territory: Greece
City: Heraklion, Crete
Period: 30/05/10 to 3/06/10

Keywords

  • data integration
  • near duplicate detection
