> *The nice thing about IA links is you can pretty reasonably assume that they ...

ggreer · on Aug 20, 2014

This is a huge problem. Sites like NASA's NTRS are retroactively blocked.[1] It's not clear which user agents one must allow in robots.txt. NTRS allows archive.org_bot, but apparently ia_archiver is also needed. At some point the allow directive in the NTRS robots.txt[2] no longer matched, nuking all historical data.

1. See http://web.archive.org/web/20121029225832/http://ntrs.nasa.g... for an example.

2. http://ntrs.nasa.gov/robots.txt
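To illustrate why the choice of user agent matters, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content below is hypothetical (it is not the actual NTRS file), and the URL is a placeholder; the sketch only shows that a rule naming `archive.org_bot` does not cover `ia_archiver`, which then falls through to the catch-all block.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, NOT the real NTRS file: it allows one
# Internet Archive user agent and disallows everything else.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Placeholder URL, used only for rule matching; nothing is fetched.
result = allowed_agents(
    ROBOTS_TXT,
    ["archive.org_bot", "ia_archiver"],
    "http://example.org/some/page",
)
print(result)
```

Under these assumed rules, only the agent named explicitly is allowed; any other crawler name is caught by the `*` block.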

db48x · on Aug 21, 2014

It's not been nuked, merely hidden.

jleader · on Aug 20, 2014

A few years ago I talked to an IA engineer, who said they were planning on dealing with this by not crawling sites whose nameservers were known to point to a domain parking company. The idea was that if they never retrieved the robots.txt, they wouldn't retroactively apply it. I don't know if that filtering out of parking nameservers ever happened, and it wouldn't help for parked domains whose robots.txt they'd already retrieved, but it would help with domains that lapse in the future.

stavros · on Aug 20, 2014

But why retroactively remove the data? The original owner was fine with holding it, why should the snapshot be deleted because a completely different person wants his completely different website to not be crawled?

bjt · on Aug 20, 2014

It's hard for a bot to understand concepts of 'owner' and 'completely different person' based on the data they have available. Companies can use this robots.txt feature to un-index old marketing content after a re-branding, for example. Or after an acquisition.

stavros · on Aug 20, 2014

Sure, but, surely, the bot has timestamps saying "robots.txt allowed me to keep these documents last time I spidered them". Why do they have to be retroactively removed? robots.txt only disallows spidering, it doesn't mandate that you should delete all the data you've already spidered.

db48x · on Aug 21, 2014

Because most of the problems come from people who want to hide old material that they didn't realize was being indexed. The automatic behavior is simple and easy to implement, and doesn't require any human intervention.
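The two policies argued over in this subthread can be put side by side in a small sketch. Everything here is an illustrative assumption (the `Snapshot` type, the `ROBOTS_HISTORY` data, the function names); it is not how the Wayback Machine is actually implemented. Both functions only filter what is shown, and neither deletes stored snapshots.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    captured: int  # capture time; a year is enough for this sketch
    url: str


# Hypothetical robots.txt history for one domain, sorted by time:
# (time observed, was crawling allowed?)
ROBOTS_HISTORY = [(2005, True), (2013, False)]


def allowed_at(history, when):
    """Robots state in effect at time `when` (assumed allowed if none seen yet)."""
    times = [t for t, _ in history]
    i = bisect_right(times, when) - 1
    return history[i][1] if i >= 0 else True


def visible_retroactive(snapshots, history):
    """Simple automatic policy: the latest robots.txt governs every snapshot."""
    latest_allowed = history[-1][1] if history else True
    return [s for s in snapshots if latest_allowed]


def visible_by_capture_time(snapshots, history):
    """Alternative policy: each snapshot is judged by the robots.txt of its own time."""
    return [s for s in snapshots if allowed_at(history, s.captured)]


snapshots = [
    Snapshot(2008, "http://example.org/"),
    Snapshot(2012, "http://example.org/about"),
]
print(visible_retroactive(snapshots, ROBOTS_HISTORY))
print(visible_by_capture_time(snapshots, ROBOTS_HISTORY))
```

With this assumed history, the retroactive policy hides both snapshots once the 2013 block appears, while the capture-time policy keeps both visible. The first needs only the current robots.txt, which fits the point about it being simple to automate.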

im3w1l · on Aug 20, 2014

IMO, a better approach than nuking old data would be to keep the data but not show it.

psykovsky · on Aug 20, 2014

I may be mistaken, but I think I read somewhere that's exactly what they do. But don't take my word for it.

ersii · on Aug 20, 2014

You are not mistaken. The Internet Archive does not delete or "nuke" the data that is blocked by a robots.txt, even though *cough* some people believe so (see parent thread).

Source: IA staffers.

pimlottc · on Aug 20, 2014

What you say is true for sites mirrored by IA's Wayback Machine, though my understanding is they retain the data in case the robots.txt is lifted later on.

Linking to media uploaded to the main archive itself should be safer, though.

walterbell · on Aug 20, 2014

Have you heard of other caches/archives (e.g. Google) applying the same retroactive policy? Presumably IA has no way of finding out that domain ownership has changed. I wonder if they are applying this policy to pages referenced by Wikipedia: http://blog.archive.org/2013/10/25/fixing-broken-links/

The safest archive of a web page is a local PDF.

toomuchtodo · on Aug 20, 2014

> The safest archive of a web page is a local PDF.

Minor quibble: The safest archive of a web page is a local WARC archive:

http://www.archiveteam.org/index.php?title=Wget_with_WARC_ou...

walterbell · on Aug 20, 2014

Good point; I would say both are needed. The WARC is only useful if there is a matching web browser and operating system in a VM which could render the HTML+JavaScript and produce the original layout. PDF/A would mostly retain the browser layout.

im3w1l · on Aug 20, 2014

I use virtual notary for this. You give it a URL, and vn fetches it, and gives you a cryptographic certificate of time of retrieval and website's content at that time.
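The underlying idea can be sketched as a record that binds a content digest to a retrieval time. This is not Virtual Notary's actual protocol or API; the function names and record layout are assumptions for illustration, and the record here is unsigned, whereas a real notary would sign it with its own key so that a third party can trust the timestamp.

```python
import hashlib
from datetime import datetime, timezone


def make_record(url, content, retrieved_at):
    """Bind a SHA-256 digest of the fetched bytes to the retrieval time."""
    return {
        "url": url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "retrieved_at": retrieved_at.isoformat(),
    }


def matches(record, content):
    """Check whether `content` is the same bytes the record was made from."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]


# Placeholder page body and URL; a real tool would fetch the bytes itself.
page = b"<html><body>original page</body></html>"
record = make_record(
    "http://example.org/",
    page,
    datetime(2014, 8, 20, tzinfo=timezone.utc),
)
print(record)
print(matches(record, page))
print(matches(record, page + b"<!-- edited -->"))
```

Any later change to the page body produces a different digest, so the record only vouches for the exact bytes seen at retrieval time.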