How to avoid crawling duplicate URLs at Google scale?
What about the HTML content? Would hashing the HTML and then using that hash as the key for the Bloom filter work?
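To make the idea concrete, here is a minimal sketch of content-hash deduplication with a Bloom filter. Everything in it is an assumption for illustration: the `BloomFilter` class, its bit-array size and hash count, and the helper names `content_key` and `seen_before` are hypothetical, and a crawler at Google scale would need a distributed, much larger structure.

```python
import hashlib


class BloomFilter:
    """Tiny in-memory Bloom filter; sizes are arbitrary illustrative choices."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Double hashing: derive all bit positions from two 64-bit
        # slices of a single SHA-256 digest of the key.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def content_key(html: str) -> bytes:
    # Hash the raw HTML bytes. Any byte-level difference gives a new key.
    return hashlib.sha256(html.encode("utf-8")).digest()


def seen_before(bloom: BloomFilter, html: str) -> bool:
    """Return True if this exact HTML was (probably) seen, else record it."""
    key = content_key(html)
    if key in bloom:
        return True
    bloom.add(key)
    return False


bloom = BloomFilter()
first = seen_before(bloom, "<html>a</html>")    # new page, gets recorded
second = seen_before(bloom, "<html>a</html>")   # byte-identical repeat
third = seen_before(bloom, "<html>a </html>")   # one extra space, new key
```

Two caveats with this sketch: an exact hash only matches byte-identical pages, so pages that differ by a timestamp or an ad would count as distinct, and a Bloom filter can return false positives, so a small fraction of genuinely new pages could be skipped.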