Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance

Abstract

Automatic plagiarism detection with a reference corpus compares a suspicious text against a set of original documents in order to relate the plagiarised fragments to their potential sources. Publications on this task often assume that the search space (the set of reference documents) is small enough that any search strategy will produce good output in a short time. However, this is not always the case. Reference corpora often comprise a large set of original documents, for which a simple exhaustive search strategy becomes practically infeasible.

Publication
Computational Linguistics and Intelligent Text Processing