Hacker News

They also have their own web scraper called ByteSpider that scrapes websites with lots of text very aggressively and ignores robots.txt. I've had to block it by useragent on one of my sites.
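Blocking by user agent can be done at the web-server layer. A minimal sketch for nginx (assuming nginx and that the crawler announces itself as "Bytespider" in the User-Agent header, which is the token ByteDance uses):

```nginx
# Return 403 to any request whose User-Agent contains "Bytespider"
# (~* makes the match case-insensitive).
if ($http_user_agent ~* "Bytespider") {
    return 403;
}
```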


I don't think it ignores robots.txt; I think it just doesn't have a very good parser, and you need to give it its own user-agent block. I had a similar level of frustration.
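The point above can be illustrated with a robots.txt that adds a dedicated block for the crawler instead of relying only on the wildcard rule (a sketch; "Bytespider" is assumed to be the token the crawler matches on):

```
# Wildcard rule that a weak parser may fail to apply:
User-agent: *
Disallow: /private/

# Explicit block addressed to the crawler by name:
User-agent: Bytespider
Disallow: /
```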

https://www.feitsui.com/en/article/32


After all, if they wanted to completely ignore the wishes of the website owners they probably would not announce their spider as such in the user agent. They’d just pretend to be a web browser.


It is trivial to distinguish a spider from human traffic based on request patterns alone. Lying about the UA would just be bad press for them.


If it were really as trivial as you say, Google's reCAPTCHA and similar products like hCaptcha would have no reason to exist.


A bot intentionally trying to look human != a spider.

A spider will generally have a pretty predictable route through a web site.
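That predictability can be checked from server logs. A minimal sketch (illustrative only, not a production detector): spiders tend to pace requests at near-constant intervals and rarely revisit a URL, while humans click irregularly and return to pages.

```python
from statistics import mean, pstdev

def looks_like_spider(timestamps, paths):
    """Crude heuristic: flag clients whose request gaps are nearly
    constant and who never fetch the same path twice."""
    if len(timestamps) < 5:
        return False  # too little traffic to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    regular = pstdev(gaps) < 0.1 * mean(gaps)   # near-constant pacing
    no_revisits = len(set(paths)) == len(paths)  # every URL fetched once
    return regular and no_revisits

# A human clicking around: irregular gaps, repeated pages.
human = looks_like_spider([0, 2.1, 9.8, 10.3, 30.0],
                          ["/", "/a", "/", "/b", "/a"])
# A crawler: one request per second, every URL new.
bot = looks_like_spider([0, 1, 2, 3, 4],
                        ["/", "/a", "/b", "/c", "/d"])
```

Real detectors combine many more signals (IP reputation, header order, asset fetching), but the route-predictability idea is the core of it.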


The various CAPTCHA implementations are primarily designed to prevent bot submissions, not spiders.


Some of them, yes, but not all. Try, for example, browsing a Cloudflare-protected site from Tor: you will be hit with a constant barrage of CAPTCHAs even though you are only making GET requests.


Yes, heuristically, a Tor browser is more likely to be nefarious than a regular browser. Note the use of heuristics, such as IP address, that are not related to the user agent.



