It is great news in general, but seems to be done in a clumsy and counterproductive manner that may cause the Internet Archive to be banned from crawling some websites.

The problem: when robots.txt for a website is found to have been made more restrictive, the IA retrospectively applies its new restrictions to already-archived pages and hides them from view. This can also cause entire domains to vanish into the deep-archive. No-one outside IA thinks this is sensible.

Their solution: ignore robots.txt altogether. What? That will just annoy many website operators.

My proposed solution: keep parsing robots.txt on each crawl and obey it going forward, without applying the changes to existing archived material. This is actually less work than what they currently do. If the new robots.txt disallows about_iphone.html, you simply skip that page from then on. Previously archived versions aren't affected.
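To make that concrete, here is a rough Python sketch of the "obey it at crawl time only" idea. The `Archive` class, its `snapshots` storage, and the `fetch` callback are hypothetical names invented for illustration, and this is not how the IA crawler works. Only `urllib.robotparser` is a real standard-library module.

```python
from dataclasses import dataclass, field
from urllib.robotparser import RobotFileParser


@dataclass
class Archive:
    # Hypothetical storage: URL -> list of (crawl_date, body) snapshots.
    snapshots: dict = field(default_factory=dict)

    def crawl(self, url, crawl_date, robots_txt, fetch, agent="ia_archiver"):
        """Store a new snapshot only if the robots.txt seen on this crawl allows it."""
        rules = RobotFileParser()
        rules.parse(robots_txt.splitlines())
        if not rules.can_fetch(agent, url):
            # Skip this crawl; snapshots taken earlier are left untouched.
            return False
        self.snapshots.setdefault(url, []).append((crawl_date, fetch(url)))
        return True
```

The point of the design is that robots.txt is consulted only when deciding whether to take a new snapshot, never when deciding whether to show an old one.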

Basically they're switching from being excessively obedient to completely ignoring robots.txt in order to fix a self-made problem. I can only see that antagonising operators.

driverdan · on April 21, 2017

Archive Team is not associated with Internet Archive. AT does not crawl the web at large, it only targets specific sites.

duskwuff · on April 21, 2017

There's some value in allowing site operators to retroactively remove content which was never intended to be public. A common and unfortunate example is backups (like SQL dumps) being stored in web-accessible directories, then subsequently being indexed and archived when a crawler finds the appropriate directory index.

What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.

Spare_account · on April 21, 2017

Here's a slight modification to the GP proposal:

- Respect robots.txt at the time you crawl it.

- If robots.txt appears later, stop archiving from that date forwards.

- Preserve access to old archived copies of the site by default.

- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.

If archive.org have recorded the date that they first observed a robots.txt on the sites currently unavailable, they could even consider applying the above logic today retrospectively. Perhaps after a couple of warning emails to the current Administrative Contact for the domain.
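The four rules above could be sketched roughly as follows. This is only an illustration: `SiteRecord`, `robots_first_seen`, and `owner_removal_requested` are made-up names, and it assumes the archive has in fact recorded the date a blocking robots.txt was first observed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SiteRecord:
    # Date a blocking robots.txt was first observed, if ever (hypothetical field).
    robots_first_seen: Optional[date] = None
    # Set only after a verified owner explicitly requests removal (hypothetical field).
    owner_removal_requested: bool = False


def should_crawl(site: SiteRecord, today: date) -> bool:
    """Stop archiving from the date a blocking robots.txt first appears."""
    return site.robots_first_seen is None or today < site.robots_first_seen


def is_visible(site: SiteRecord, snapshot_date: date) -> bool:
    """Old copies stay visible by default; only an explicit owner request hides them."""
    if site.owner_removal_requested:
        return False
    return site.robots_first_seen is None or snapshot_date < site.robots_first_seen
```

Applying this retrospectively would amount to running `is_visible` over the existing snapshots of each currently hidden site.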

pbhjpbhj · on April 21, 2017

>mechanism that allows a proven site owner to explicitly request retrospective access removal. //

It should be "a proven content owner"; just buying a site shouldn't allow someone to remove it from the archive.

ss64 · on April 21, 2017

How about respecting the robots.txt until the IP address where it is hosted changes? Once the IP has changed, any new robots.txt exclusions apply only to the new pages, not to the archived pages under the old IP, which continue respecting the old archived robots.txt.

The IP address changing is a pretty solid indicator that control of that content has moved to a new organisation. Note this does not always coincide with the domain name owner changing.

A scenario that I can imagine becoming litigious: company owns a domain for promoting some product and they use robots.txt to prevent copies. The product reaches end of life and domain is allowed to expire. Someone else buys the domain and starts hosting content with no robots restriction. Archive.org start to display pages from the old company. Company then sues archive.org for copyright violation.
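A minimal sketch of the IP-based rule, assuming the archive keeps a chronological log of (date, IP, robots.txt) observations per domain. The `Observation` tuple and `governing_robots` function are hypothetical, and the sketch ignores the edge case of a host later returning to an earlier IP.

```python
from typing import List, Optional, Tuple

# Each observation is (date_string, ip_address, robots_txt_text), oldest first.
Observation = Tuple[str, str, str]


def governing_robots(snapshot_ip: str, observations: List[Observation]) -> Optional[str]:
    """Return the last robots.txt seen while the domain was hosted at snapshot_ip."""
    latest = None
    for _date, ip, robots_txt in observations:
        if ip == snapshot_ip:
            latest = robots_txt  # later observations at the same IP win
    return latest
```

In the scenario above, pages captured under the old company's IP would still be governed by that company's restrictive robots.txt, even after a new owner at a different IP publishes a permissive one.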

r721 · on April 21, 2017

>may cause the Internet Archive to be banned from crawling some websites.

It looks like Facebook banned ia_archiver (recently? I recall it worked a few weeks ago):

>User-agent: ia_archiver

>Disallow: /

https://www.facebook.com/robots.txt
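For anyone who wants to check what those two lines mean for a crawler, Python's standard-library `urllib.robotparser` can evaluate them. This sketch parses only the two quoted lines from a string rather than fetching the live file, so it says nothing about the rest of Facebook's robots.txt. `SomeOtherBot` is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
# Only the two lines quoted above, not the full live file.
rules.parse([
    "User-agent: ia_archiver",
    "Disallow: /",
])

# True when ia_archiver is disallowed from the site root.
blocked = not rules.can_fetch("ia_archiver", "https://www.facebook.com/")
# An agent with no matching entry is not restricted by these two lines.
other_allowed = rules.can_fetch("SomeOtherBot", "https://www.facebook.com/")
```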

rz2k · on April 21, 2017

The logic is sound, and I see that it was mostly written in 2011, but I can also see it being harmful.

How about an IETF RFC to clarify?

Libraries operate under a lot of unwritten social conventions, perhaps even more than most other institutions (robots.txt, even if largely ignored, is a popular convention). Aggressive or confrontational wording, regardless of whether they are "right", doesn't seem in libraries' interests.

mushiake · on April 6, 2017

sl [0] is always my favorite.

[0]https://github.com/mtoyoda/sl

sanpan · on April 6, 2017

Is there a way to lolcat sl?

mushiake · on March 27, 2017

FFTW[0] is also written like that (generator written in OCaml emitting C).

[0]http://www.fftw.org/

mushiake · on March 25, 2017

Gtk3 has a native file chooser API (only for Windows, no Mac one yet).

mushiake · on March 25, 2017

Racket uses Cocoa on macOS, Win32 on Windows, and GTK+ on Linux. But it is really minimal and you may find features lacking.

soegaard · on March 25, 2017

For anyone interested in how Racket (Matthew Flatt) pulled it off : http://blog.racket-lang.org/2010/12/rebuilding-rackets-graph...

mushiake · on Sept 26, 2016

smoke/kde was supposed to be a contender for gobject-introspection, however it is barely maintained.

It is only(?) used by Common Lisp[0].

And there was claro[1].

[0]https://github.com/Shinmera/qtools

[1]https://github.com/Araq/Claro

daurnimator · on Sept 26, 2016

It too seems to be stuck on Qt4.

Only python bindings seem to have made it up to Qt5.