The Study of Detecting Replicate Documents Using MD5 Hash Function
Pushpendra Singh Tomar, Maneesh Shreevastava
Abstract
A great deal of the Web is replicate or near-replicate content. Documents may be served in different formats (HTML, PDF, and plain text) for different audiences, and they may be mirrored to avoid delays or to provide fault tolerance. Algorithms for detecting replicate documents are critical in applications where data is obtained from multiple sources. Removing replicate documents is necessary not only to reduce runtime but also to improve search accuracy. Today, search engine crawlers retrieve billions of unique URLs, of which hundreds of millions are replicates of some form. Identifying replicates quickly therefore expedites indexing and searching. One vendor's analysis of 1.2 billion URLs found 400 million exact replicates using an MD5 hash. Reducing collection sizes by tens of percentage points yields great savings in indexing time and reduces the amount of hardware required to support the system. Last, and probably most significant, users benefit from the elimination of replicate results. By efficiently presenting only unique documents, user satisfaction is likely to increase.
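The exact-replicate case mentioned in the abstract can be illustrated with a minimal sketch: hash each document's content with MD5 and keep only the first document seen for each digest. This is an illustration under stated assumptions, not the procedure used by the authors or by the vendor cited above. The function names (`md5_digest`, `remove_exact_replicates`) and the `(url, content)` input shape are hypothetical. Note that this approach catches only byte-identical content; near-replicates that differ by even one character produce different digests and are not detected.

```python
import hashlib


def md5_digest(content: str) -> str:
    """Return the MD5 digest of a document's text as a 32-character hex string."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def remove_exact_replicates(documents):
    """Keep the first document for each distinct MD5 digest.

    `documents` is an iterable of (url, content) pairs; this input shape
    is an assumption made for the example.
    """
    seen = set()      # digests already encountered
    unique = []       # first occurrence of each distinct content
    for url, content in documents:
        digest = md5_digest(content)
        if digest in seen:
            continue  # exact replicate of an earlier document
        seen.add(digest)
        unique.append((url, content))
    return unique
```

Storing only the fixed-size digest in the `seen` set, rather than the full text, is what makes this kind of check cheap for large collections: each document is read once and compared through a set lookup.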
Keywords
Unique documents, detecting replicate, replication, search engine.
Cite this article
Pushpendra Singh Tomar, Maneesh Shreevastava. The Study of Detecting Replicate Documents Using MD5 Hash Function. International Journal of Advanced Computer Research. 2011;1(2):14-17.