Hacker News

They also have their own web scraper called ByteSpider that scrapes websites with lots of text very aggressively and ignores robots.txt. I've had to block it by useragent on one of my sites.
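Blocking by user agent can be done at the web-server layer. A minimal sketch for nginx (assuming nginx and that the crawler announces itself as "Bytespider" in the User-Agent header, which is the token ByteDance uses):

```nginx
# Return 403 to any request whose User-Agent contains "Bytespider"
# (~* makes the match case-insensitive).
if ($http_user_agent ~* "Bytespider") {
    return 403;
}
```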


I don't think it ignores robots.txt; I think it just doesn't have a very good parser, and you need to give it its own user-agent block. I had a similar level of frustration.
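The point above can be illustrated with a robots.txt that adds a dedicated block for the crawler instead of relying only on the wildcard rule (a sketch; "Bytespider" is assumed to be the token the crawler matches on):

```
# Wildcard rule that a weak parser may fail to apply:
User-agent: *
Disallow: /private/

# Explicit block addressed to the crawler by name:
User-agent: Bytespider
Disallow: /
```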

https://www.feitsui.com/en/article/32


After all, if they wanted to completely ignore the wishes of the website owners they probably would not announce their spider as such in the user agent. They’d just pretend to be a web browser.


It is trivial to distinguish a spider from human traffic based on request patterns alone. Lying about the UA would just be bad press for them.


If it were really as trivial as you say, Google's reCAPTCHA and similar products like hCaptcha would have no reason to exist.


A bot intentionally trying to look human != a spider.

A spider will generally have a pretty predictable route through a web site.
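That predictability can be checked from server logs. A minimal sketch (illustrative only, not a production detector): spiders tend to pace requests at near-constant intervals and rarely revisit a URL, while humans click irregularly and return to pages.

```python
from statistics import mean, pstdev

def looks_like_spider(timestamps, paths):
    """Crude heuristic: flag clients whose request gaps are nearly
    constant and who never fetch the same path twice."""
    if len(timestamps) < 5:
        return False  # too little traffic to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    regular = pstdev(gaps) < 0.1 * mean(gaps)   # near-constant pacing
    no_revisits = len(set(paths)) == len(paths)  # every URL fetched once
    return regular and no_revisits

# A human clicking around: irregular gaps, repeated pages.
human = looks_like_spider([0, 2.1, 9.8, 10.3, 30.0],
                          ["/", "/a", "/", "/b", "/a"])
# A crawler: one request per second, every URL new.
bot = looks_like_spider([0, 1, 2, 3, 4],
                        ["/", "/a", "/b", "/c", "/d"])
```

Real detectors combine many more signals (IP reputation, header order, asset fetching), but the route-predictability idea is the core of it.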


The various CAPTCHA implementations are primarily designed to prevent bot submissions, not spiders.


Some of them, yes, but not all. Try, for example, browsing a Cloudflare-protected site from Tor: you will be hit with a constant barrage of CAPTCHAs even though you are only making GET requests.


Yes, heuristically, a Tor browser is more likely to be nefarious than a regular browser. Note the use of heuristics, such as IP address, that are not related to the user agent.



