Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).
However, you can also apply a general solution: have a look at the Normalized Compression Distance (NCD). Obviously you cannot calculate a text's Kolmogorov complexity exactly, but you can approximate it by simply compressing the text.
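Concretely, if C(x) denotes the length of x after compression, the NCD of two texts x and y is defined as

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

where xy is just the two texts concatenated. That is exactly what the functions below compute.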
A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others (a bzip2 variant is sketched further down). Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:
PHP:
// Normalized Compression Distance: near 0 means very similar, near 1 means unrelated.
function ncd($x, $y) {
    $cx = strlen(gzcompress($x));
    $cy = strlen(gzcompress($y));
    return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}

print(ncd('this is a test', 'this was a test') . "\n");
print(ncd('this is a test', 'this text is completely different') . "\n");
Python:
>>> from zlib import compress as c
>>> def ncd(x, y):
...     # zlib works on bytes, so encode the strings before compressing
...     cx, cy = len(c(x.encode())), len(c(y.encode()))
...     return (len(c((x + y).encode())) - min(cx, cy)) / max(cx, cy)
...
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361
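As noted above, the choice of compressor matters. As a sketch, you could swap in bzip2 via Python's standard bz2 module; the formula stays the same, only the scores it produces will differ from the zlib version:

Python:
import bz2

# Same NCD formula as above, but using bzip2 as the compressor.
def ncd_bz2(x, y):
    cx, cy = len(bz2.compress(x.encode())), len(bz2.compress(y.encode()))
    return (len(bz2.compress((x + y).encode())) - min(cx, cy)) / max(cx, cy)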
Note that for larger texts (read: actual files) the results will be much more
pronounced. Give it a try and report your experiences!
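For example, to compare two actual files with the Python ncd() above, something like this should do (the file names are just placeholders):

Python:
# Read two documents as text and score them with the ncd() defined above.
# 'submission_a.txt' and 'submission_b.txt' are placeholder file names.
with open('submission_a.txt') as f:
    a = f.read()
with open('submission_b.txt') as f:
    b = f.read()
print(ncd(a, b))  # the closer to 0, the more similar the documents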