
I've been working with scrapers quite a lot. I started with python requests, then moved to scrapy, then selenium, then selenium via undetected_chromedriver, and once that started being detected during a chrome update about a year ago, I switched over to seleniumbase. It got by undetected, but to get it working with pre-downloaded drivers, I had to look into the code. I have never, and I mean never, in all my python years, seen such a horrible mess of code. We are talking 1000-line-long methods, with 20-30 different flags and branches. Just horrible. I have since switched to Playwright, which also seems to be undetected, and offers a much saner interface.


SeleniumBase modifies the webdriver so that it doesn't get detected when used alongside the CDP stealth mode and methods. It'll download chromedriver for you. Not sure what you mean by the multiple branches, as there's just the primary one. What 1000-line methods are you referring to? By "flags", do you mean the different command-line options available? As for Playwright, they aren't undetected: See https://github.com/microsoft/playwright/issues/23884#issueco... - "Playwright is an end-to-end testing framework, where we expect you test on your own environments. Bypassing any form of bot protection is not something we can act on. Thanks for your understanding." On the contrary, SeleniumBase is OK with bypassing bot detection: https://github.com/seleniumbase/SeleniumBase/blob/master/exa...


Not the commenter, but “multiple branches” in this context is referring to if/else statements in the code, not source-control branches. Similarly, “flags” is referring to function arguments like a boolean “is_original.” More generally, they are just saying that the code has long, complicated, bug-prone functions.
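
To make that concrete, here is a made-up sketch of what "flags" and "branches" look like in Python. It is not taken from SeleniumBase; the function name, the flag names, and the option strings are all hypothetical.

```python
def launch_browser(headless=False, use_proxy=False, is_mobile=False):
    """Hypothetical function: each boolean flag adds another if/else branch."""
    options = []
    if headless:
        options.append("--headless")
    else:
        options.append("--start-maximized")
    if use_proxy:
        # Placeholder proxy address, for illustration only.
        options.append("--proxy-server=127.0.0.1:8080")
    if is_mobile:
        options.append("--window-size=390,844")
    return options


# Three independent flags already allow 2**3 = 8 combinations to reason about.
print(launch_browser(headless=True, is_mobile=True))
```

With 20-30 flags in a single method, the number of possible paths grows quickly, which is presumably what the original complaint was about.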

That said, I just spent a few minutes browsing the SeleniumBase repo, and honestly it didn’t seem that unusual to me. Would be interested in seeing a specific example of what the commenter had in mind.


rather than point-by-point rebuttal as the sibling requests, I think this sums up the coding style pretty well: https://github.com/seleniumbase/SeleniumBase/blob/v4.33.11/s...


That's not amazing code but that's not that bad. In the grand scheme of things, that's not code debt that would ever seriously make my life any harder.


Yup. At least it's self-contained and easy to step through and modify if something breaks or needs to be changed.

And, as my previous PM would point out, even the copy-pasting and verifying no mistakes were made was a solution that took a fraction of the time a modern "clean" approach would. She had a point; as much as I'm against writing this simple code in the general case, plenty of devs tend to err towards overcomplicating solutions when given a chance.

I mean, the modern, proper, Clean Code™ solution would have this split into multiple files (not counting tests), and across two or three abstraction levels. I've seen this happen enough that I can tell I'd much prefer working with code like this capabilities parser (and hell, it can be beaten into near-perfection in an hour or three).


Amen!

I think the more experienced you get in coding, the more you appreciate straightforward code you can immediately look at and understand.


That method came from code that I accepted in a PR from December 31, 2019: https://github.com/seleniumbase/SeleniumBase/pull/459 Not a true representation of most of the code today.


It's not really bad, though.

It's clear, it's intuitive, it's easy to understand on first glance, it's a single purpose function, it's easy to step through.

you don't have anything to defend here.


The code is in the code base. Presumably, it still gets run. It doesn't make a difference if new code doesn't look like that.


Bad old code has been battle tested. Bad new code has not, and is more likely to have the show stopper bugs you want to avoid.


There's actually a lot of examples being used for testing (https://github.com/seleniumbase/SeleniumBase/tree/master/exa...), which are run regularly (locally and in GitHub Actions). Plus, a lot of major companies are using SeleniumBase: https://github.com/seleniumbase/SeleniumBase/blob/master/hel... (if something breaks, I find out quickly)


Call it "legacy code" if you'd like. That specific part is from a less common feature for setting options when running on a Selenium Grid. The new CDP Mode isn't compatible with The Grid (since CDP Mode makes direct CDP API calls without making Selenium API calls).


it's always easier for people today to look at the work of other people in the past and draw stupid conclusions.. don't mind them..


Maybe I am just a cynic but I would expect Playwright to be detected when using Chrome, I mean I would expect it was to the benefit of Google to make that happen for the sake of making reCaptcha detect bots better.

That's actually why I've been scrapping my Playwright automation (because I expect I will encounter problems even if it hasn't happened yet, cynical and paranoid) and moving towards writing a browser extension to automate Firefox.

Basically my use case is automating tedious things for myself, not running bots at scale, so that's why it is imperative not to get caught being "not human", because then I risk account problems.


How can Google make that happen? Playwright's made by Microsoft. It can use Firefox as a browser as well as Chrome.


well I said when using Chrome, how would they make it happen?

well it's not like it's using AutoHotkey to automate things, it must be using underlying browser apis to move the mouse to mouseover something etc. as opposed to actually using the mouse, as an example

naive workflow -

I would think the browser sends a message to Google that this instance (unique id) is being automated; recaptcha is detected by chrome on the page; chrome calls a hidden recaptcha method, .setUniqueId(uniqueID); the uniqueID is sent back to Google; the response tells recaptcha that this is actually an automated browser being used; recaptcha tells the site there is a 90% chance the browser is automated; the site stops browser access.

Site happy it uses recaptcha because they stopped automation.

Sure, Playwright can use FF, but most often people just use Chrome.


Enterprise Python code. Somehow ends up being worse than Java enterprise code. I’m too used to it at this point.


The "Python vs Java" debate is probably one for a different Hacker News post. :)


I meant that some of the code reminds me of enterprise python. The kicker is that code that works > pretty code. People here act as if ugly code is somehow lesser just because it’s ugly. Meanwhile there’s a lot of ugly code making millions of dollars.

Didn’t mean to bash your project. Sorry if it came across that way.


It's OK. No offense was taken. It almost looked like the conversation was expanding into a "Python vs Java" debate, but (thankfully) it did not. I've seen both worlds. I've seen advantages to both. I decided to stay in the Python world.


Same. Although enterprise python is akin to wrestling a boa constrictor.


Not sure if you have explored rolling captcha solving services into your code. It's easy as fuck and you can do it in a few lines of code. Check out DeathByCaptcha or AntiCaptcha. It's like $2.99 per 1,000 successfully solved captchas.

I guess my point is, you don't always have to be undetected or write 1000 lines of code to scrape or do whatever you need to do. Saved me a ton of headaches and time when captchas are involved.


SeleniumBase is free, open-source, can bypass CAPTCHAs with a few lines of code, and it works from the free tier of GitHub Actions.


It can't bypass all captchas, and that's what I'm talking about.


According to live demos seen in https://www.youtube.com/watch?v=Mr90iQmNsKM, it'll bypass Cloudflare, Akamai, Shape Security, DataDome, Incapsula, Kasada, and PerimeterX.


Okay, and? DeathByCaptcha can bypass all of those + all other captchas.

Write a ton of code or just roll in a solving service API. Ez decision and save a ton of time + get to scraping faster.


I feel like what you're saying is you have a vested interest in the services you mentioned with all of this scope creep to your OG argument.


With SeleniumBase, you can bypass CAPTCHAs with one line of code: `sb.uc_gui_click_captcha()`
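
For context, a minimal sketch of where that line would sit, assuming SeleniumBase's `SB` context manager with UC Mode enabled. The helper name, the URL in the usage comment, and the 4-second reconnect time are illustrative assumptions, and whether the click gets past a given CAPTCHA depends on the site.

```python
def open_and_click_captcha(url, reconnect_time=4):
    """Hypothetical helper: open a page in UC Mode, then attempt the CAPTCHA click."""
    # Imported here so this file still loads when seleniumbase is not installed.
    from seleniumbase import SB

    with SB(uc=True) as sb:
        # Open the page, disconnecting chromedriver briefly while it loads.
        sb.uc_open_with_reconnect(url, reconnect_time)
        # The one-liner from the comment above.
        sb.uc_gui_click_captcha()
        return sb.get_title()


# Example call (placeholder URL; needs seleniumbase and Chrome installed):
# print(open_and_click_captcha("https://example.com/"))
```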


okay, but it doesn't solve all captchas, whereas a solving service does with a few more lines of code.

Can your script even do Google CAPTCHA and hCaptcha? What about the captcha from Dread? (ain't no way it can)

There is no need to bypass them when you can just solve them.


There's a reCAPTCHA on the Pokemon website. This SeleniumBase example bypassed it: https://github.com/seleniumbase/SeleniumBase/blob/master/exa...


> There is no need to bypass them when you can just solve them.

There is no need to solve them when you can just bypass them.


the point is you can't bypass them all but you CAN solve them all.


Why pay to solve CAPTCHAs when SeleniumBase can bypass them for free? SeleniumBase can also "solve" CAPTCHAs (such as Cloudflare via click).


It's like you're not even reading what he wrote.



