I've been working with scrapers quite a lot. I started with python requests, moved to scrapy, then selenium, then selenium via undetected_chromedriver, and once that started being detected during a Chrome update about a year ago, I switched over to seleniumbase. It got by undetected, but to get it working with pre-downloaded drivers, I had to look into the code. I have never, and I mean never, in all my python years, seen such a horrible mess of code. We are talking 1000-line methods, with 20-30 different flags and branches. Just horrible. I have since switched to Playwright, which also seems to be undetected, and offers a much saner interface.
SeleniumBase modifies the webdriver so that it doesn't get detected when used alongside the CDP stealth mode and methods. It'll download chromedriver for you. Not sure what you mean by the multiple branches, as there's just the primary one. What 1000-line methods are you referring to? By "flags", do you mean the different command-line options available? As for Playwright, they aren't undetected: See https://github.com/microsoft/playwright/issues/23884#issueco... - "Playwright is an end-to-end testing framework, where we expect you test on your own environments. Bypassing any form of bot protection is not something we can act on. Thanks for your understanding." On the contrary, SeleniumBase is OK with bypassing bot detection: https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
Not the commenter, but “multiple branches” in this context is referring to if/else statements in the code, not source-control branches. Similarly, “flags” is referring to function arguments like a boolean “is_original.” More generally, they are just saying that the code has long, complicated, bug-prone functions.
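To make the terminology concrete, here is a small made-up sketch of what "flags" and "branches" look like in this sense. The function name, its arguments, and the option strings are all hypothetical and are not taken from SeleniumBase.

```python
def build_options(headless=False, use_proxy=False, is_mobile=False):
    """Hypothetical example: three boolean "flags" steering if/else "branches"."""
    options = []
    if headless:                      # branch controlled by the `headless` flag
        options.append("--headless")
    if use_proxy:                     # branch controlled by the `use_proxy` flag
        if is_mobile:                 # nested branch: behavior depends on two flags
            options.append("--proxy=mobile")
        else:
            options.append("--proxy=desktop")
    elif is_mobile:
        options.append("--mobile")
    return options


# Three flags already give 2 ** 3 = 8 combinations to reason about.
print(build_options(headless=True, use_proxy=True, is_mobile=True))
```

With 20-30 flags instead of three, the number of possible paths through a single function grows very quickly, which is the complaint being made.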
That said, I just spent a few minutes browsing the SeleniumBase repo, and honestly it didn’t seem that unusual to me. Would be interested in seeing a specific example of what the commenter had in mind.
That's not amazing code but that's not that bad. In the grand scheme of things, that's not code debt that would ever seriously make my life any harder.
Yup. At least it's self-contained and easy to step through and modify if something breaks or needs to be changed.
And, as my previous PM would point out, even the copy-pasting and verifying no mistakes were made was a solution that took a fraction of the time a modern "clean" approach would. She had a point; as much as I'm against writing this simple code in the general case, plenty of devs tend to err towards overcomplicating solutions when given a chance.
I mean, the modern, proper, Clean Code™ solution would have this split into multiple files (not counting tests), and across two or three abstraction levels. I've seen this happen enough that I can tell I'd much prefer working with code like this capabilities parser (and hell, it can be beaten into near-perfection in an hour or three).
Call it "legacy code" if you'd like. That specific part is from a less common feature for setting options when running on a Selenium Grid. The new CDP Mode isn't compatible with The Grid (since CDP Mode makes direct CDP API calls without making Selenium API calls).
Maybe I am just a cynic, but I would expect Playwright to be detected when using Chrome. I mean, I would expect it to be to Google's benefit to make that happen, for the sake of making reCaptcha detect bots better.
That's actually why I've been scrapping my Playwright automation (because I expect I will encounter problems even if it hasn't happened yet, cynical and paranoid) and moving towards writing a browser extension to automate Firefox.
Basically my use case is automating tedious things for myself, not running bots at scale, so that's why it is imperative not to get caught being "not human", because then I risk account problems.
well I said when using Chrome, how would they make it happen?
well it's not like it's using AutoHotkey to automate things, it must be using underlying browser apis to move the mouse to mouseover something etc. as opposed to actually using the mouse, as an example
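As a rough illustration of that point, this is what driving the mouse through Playwright's Python API looks like. It is only a sketch: the function name, URL, selector, and coordinates are placeholders, and it assumes Playwright and a Chromium build are installed. As far as I understand, these calls send input through the browser's automation interface rather than moving the operating system's pointer.

```python
def hover_with_playwright(url: str, selector: str) -> None:
    """Sketch: hover an element via the browser's automation API, not the real mouse."""
    # Imported lazily so the sketch can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Moves the virtual pointer over the first element matching the selector.
        page.hover(selector)
        # Moves the virtual pointer to page coordinates (placeholder values).
        page.mouse.move(100, 200)
        browser.close()


# Example call (placeholder URL and selector):
# hover_with_playwright("https://example.com", "a")
```

Whether a site or the browser vendor can tell this apart from a physical mouse is exactly the open question in this thread; the sketch only shows that no OS-level input is involved.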
naive workflow -
I would think the browser sends a message to Google that this instance (unique id) is being automated. recaptcha is detected by chrome on the page, and chrome calls a hidden recaptcha method, .setUniqueId(uniqueID). The uniqueID is sent back to Google, and the response tells recaptcha that this is actually an automated browser being used. recaptcha then reports a 90% chance to the site that the browser is automated, and the site stops the browser's access.
Site happy it uses recaptcha because they stopped automation.
Sure, Playwright can use FF, but most often people just use Chrome.
I meant that some of the code reminds me of enterprise python. The kicker is that code that works > pretty code. People here act as if ugly code is somehow lesser just because it’s ugly. Meanwhile there’s a lot of ugly code making millions of dollars.
Didn’t mean to bash your project. Sorry if it came across that way.
It's OK. No offense was taken. It almost looked like the conversation was expanding into a "Python vs Java" debate, but (thankfully) it did not. I've seen both worlds. I've seen advantages to both. I decided to stay in the Python world.
Not sure if you have explored rolling captcha solving services into your code. It's easy as fuck and you can do it in a few lines of code. Check out DeathByCaptcha or AntiCaptcha. It's like $2.99 per 1,000 successfully solved captchas.
I guess my point is, you don't always have to be undetected or write 1000 lines of code to scrape or do whatever you need to do. Saved me a ton of headaches and time when captchas are involved.