
> The nice thing about IA links is you can pretty reasonably assume that they won't suffer from link rot, right?

The catch there is that the Internet Archive retroactively respects robots.txt files that forbid crawling, so anyone who gets control of a domain can block the archived pages. This is a big problem with lapsed domains that get swept up by a holding company that points lots of domains at the same content, with a blanket robots.txt.



This is a huge problem. Sites like NASA's NTRS are retroactively blocked.[1] It's not clear which user agents one must allow in robots.txt. NTRS allows archive.org_bot, but apparently ia_archiver is also needed. At some point the allow directive in the NTRS robots.txt[2] no longer matched, nuking all historical data.

1. See http://web.archive.org/web/20121029225832/http://ntrs.nasa.g... for an example.

2. http://ntrs.nasa.gov/robots.txt
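To see how a robots.txt can allow one archive crawler name and still shut out another, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt text and the example.org URL are made up for illustration; they are not the actual NTRS file, and the Internet Archive's own matching logic may differ from Python's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one archive user agent is allowed,
# everything else is blocked by the catch-all rule.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, url, agents):
    """Return {agent: may_fetch} for each user agent name."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "http://example.org/some/page",  # placeholder URL
    ["archive.org_bot", "ia_archiver"],
)
print(result)
```

With this file, archive.org_bot is allowed and ia_archiver falls through to the catch-all Disallow, which is the kind of mismatch described above.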


It's not been nuked, merely hidden.


A few years ago I talked to an IA engineer, who said they were planning on dealing with this by not crawling sites whose nameservers were known to point to a domain parking company. The idea was that if they never retrieved the robots.txt, they wouldn't retroactively apply it. I don't know if that filtering of parking nameservers ever happened, and it wouldn't help for parked domains whose robots.txt they'd already retrieved, but it would help with domains that lapse in the future.


But why retroactively remove the data? The original owner was fine with holding it, why should the snapshot be deleted because a completely different person wants his completely different website to not be crawled?


It's hard for a bot to understand concepts of 'owner' and 'completely different person' based on the data they have available. Companies can use this robots.txt feature to un-index old marketing content after a re-branding, for example. Or after an acquisition.


Sure, but, surely, the bot has timestamps saying "robots.txt allowed me to keep these documents last time I spidered them". Why do they have to be retroactively removed? robots.txt only disallows spidering, it doesn't mandate that you should delete all the data you've already spidered.
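A rough sketch of the difference between the two policies, assuming the archive keeps a dated history of what robots.txt said each time it was fetched. All names and dates here are hypothetical, and this is not how the Internet Archive actually implements it.

```python
from bisect import bisect_right
from datetime import date


def allowed_at(robots_history, when):
    """robots_history: list of (date_observed, allowed), sorted by date.
    Returns the permission in effect on `when` (allowed if nothing was seen yet)."""
    dates = [observed for observed, _ in robots_history]
    i = bisect_right(dates, when) - 1
    return True if i < 0 else robots_history[i][1]


def visible_retroactive(snapshots, robots_history):
    """Current behavior as described: the latest robots.txt governs every snapshot."""
    latest_allowed = robots_history[-1][1] if robots_history else True
    return [s for s in snapshots if latest_allowed]


def visible_by_capture_time(snapshots, robots_history):
    """Alternative: a snapshot stays visible if crawling was allowed when it was taken."""
    return [s for s in snapshots if allowed_at(robots_history, s)]


# Hypothetical history: crawling allowed in 2010, blanket block added in 2014.
history = [(date(2010, 1, 1), True), (date(2014, 6, 1), False)]
snapshots = [date(2012, 10, 29), date(2015, 1, 1)]

retro = visible_retroactive(snapshots, history)
by_time = visible_by_capture_time(snapshots, history)
print(retro, by_time)
```

Under the retroactive policy both snapshots are hidden; under the timestamp-based one the 2012 snapshot stays visible.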


Because most of the problems come from people who want to hide old material that they didn't realize was being indexed. The automatic behavior is simple and easy to implement, and doesn't require any human intervention.


IMO, a better approach than nuking old data would be to keep the data but not show it.


I may be mistaken, but I think I read somewhere that's exactly what they do. But don't take my word for it.


You are not mistaken. The Internet Archive does not delete or "nuke" the data that is blocked by a robots.txt, even though (cough) some people believe it does (see parent thread).

Source: IA staffers.


What you say is true for sites mirrored by IA's Wayback Machine, though my understanding is they retain the data in case the robots.txt is lifted later on.

Linking to media uploaded to the main archive itself should be safer, though.


Have you heard of other caches/archives (e.g. Google) applying the same retroactive policy? Presumably IA has no way of finding out that domain ownership has changed. I wonder if they are applying this policy to pages referenced by Wikipedia: http://blog.archive.org/2013/10/25/fixing-broken-links/

The safest archive of a web page is a local PDF.


> The safest archive of a web page is a local PDF.

Minor quibble: The safest archive of a web page is a local WARC archive:

http://www.archiveteam.org/index.php?title=Wget_with_WARC_ou...


Good point. I would say both are needed. The WARC is only useful if there is a matching web browser and operating system in a VM that can render the HTML+JavaScript and reproduce the original layout. PDF/A would mostly retain the browser layout.


I use Virtual Notary for this. You give it a URL, and it fetches the page and gives you a cryptographic certificate attesting to the time of retrieval and the website's content at that time.



