Although I'm personally all for open distribution of crawl data like this and al...

ahadrana · on Nov 9, 2011

Hi, you can view our terms of use at http://www.commoncrawl.org/about/terms-of-use/full-terms-of-.... We adhere to the robots.txt standard, try to do all our crawling above board, and (strictly personal opinion here) we are definitely not in the business of diminishing or subverting people's rights with regard to the content they produce. There are many other options available to those who are determined to crawl a site's content, whether the site owner wants them to or not. Our goal is to democratize access to our crawl for the betterment of the Web ecosystem as a whole, and we believe storing the data on S3 and making it accessible to a wide audience is the right way to accomplish this goal.

ohashi · on Nov 8, 2011

I see it in the ToS:

http://www.commoncrawl.org/about/terms-of-use/

-Violate other people’s rights (IP, proprietary, etc.)