Large scale deduplication (Episode 2)
How do you avoid crawling duplicate URLs at Google scale? Option 1: Use a Set data structure to check whether a URL has already been seen. A Set is fast, but it is not space-efficient, because every URL has to be held in memory. Option 2: Store URLs in a database and check whether each new URL is already there. This can work, but the load on the database will be very high, since every URL the crawler discovers triggers a lookup.
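To make the trade-off concrete, here is a minimal sketch of both options in Python. The names `SetDeduper`, `DatabaseDeduper`, `is_new`, and `dedupe` are hypothetical, chosen only for illustration, and SQLite is an assumed stand-in for whatever database a real crawler would use.

```python
import sqlite3


class SetDeduper:
    """Option 1: keep every seen URL in an in-memory set (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url: str) -> bool:
        # True the first time a URL is offered, False on every repeat.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


class DatabaseDeduper:
    """Option 2: keep seen URLs in a database table.

    SQLite is only a stand-in here; the original post does not name a database.
    """

    def __init__(self, path: str = ":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)"
        )

    def is_new(self, url: str) -> bool:
        # One lookup per URL: this is the per-URL database load the text describes.
        row = self._conn.execute(
            "SELECT 1 FROM seen_urls WHERE url = ?", (url,)
        ).fetchone()
        if row is not None:
            return False
        self._conn.execute("INSERT INTO seen_urls (url) VALUES (?)", (url,))
        self._conn.commit()
        return True


def dedupe(urls, deduper):
    """Return the URLs in their original order, dropping any seen before."""
    return [url for url in urls if deduper.is_new(url)]


if __name__ == "__main__":
    discovered = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",  # duplicate, should be skipped
    ]
    print(dedupe(discovered, SetDeduper()))
    print(dedupe(discovered, DatabaseDeduper()))
```

Both classes give the same answer. The difference is where the cost lands: the set keeps the full text of every URL in the crawler's memory, while the database version pays for a query, and for new URLs a write, on every single check.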