Although I'm personally all for open distribution of crawl data like this and all of my personal websites are CC-licensed, isn't there something to be said for the copyright status of the pages in the crawl file?
The crawl file presumably contains the contents of websites and so the owners of those websites could assert that Common Crawl Foundation is distributing their work without permission or license.
There are all sorts of republishing/splog 'opportunities' with this crawl data that go beyond the original expected use.
Surprisingly, I couldn't see anything about this covered in the FAQs.
Hi, you can view our terms of use at http://www.commoncrawl.org/about/terms-of-use/full-terms-of-.... We adhere to the robots.txt standard, try to do all our crawling above board, and (strictly personal opinion here) we are definitely not in the business of diminishing or subverting people's rights with regard to the content they produce. There are many other options available to those who are determined to crawl a site's content, whether the site owner wants them to or not. Our goal is to democratize access to our crawl for the betterment of the Web ecosystem as a whole, and we believe storing the data on S3 and making it accessible to a wide audience is the right way to accomplish this goal.