I've been working with scrapers quite a lot. I started with python requests, the...

seleniumbase · on Dec 17, 2024

SeleniumBase modifies the webdriver so that it doesn't get detected when used alongside the CDP stealth mode and methods. It'll download chromedriver for you. Not sure what you mean by the multiple branches, as there's just the primary one. What 1000-line methods are you referring to? By "flags", do you mean the different command-line options available? As for Playwright, they aren't undetected: See https://github.com/microsoft/playwright/issues/23884#issueco... - "Playwright is an end-to-end testing framework, where we expect you test on your own environments. Bypassing any form of bot protection is not something we can act on. Thanks for your understanding." On the contrary, SeleniumBase is OK with bypassing bot detection: https://github.com/seleniumbase/SeleniumBase/blob/master/exa...

cyanmagenta · on Dec 18, 2024

Not the commenter, but “multiple branches” in this context is referring to if/else statements in the code, not source-control branches. Similarly, “flags” is referring to function arguments like a boolean “is_original.” More generally, they are just saying that the code has long, complicated, bug-prone functions.
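To make the terminology concrete, here's a made-up sketch of what "branches" and "flags" mean in this sense. The function `describe_page` and its parameters are hypothetical and not taken from SeleniumBase; the point is only that each boolean flag adds another if/else path to follow.

```python
def describe_page(title: str, is_original: bool, include_url: bool, url: str = "") -> str:
    """Hypothetical helper: two boolean flags drive several if/else branches."""
    # Branch 1: the is_original flag picks the label prefix.
    if is_original:
        label = f"[original] {title}"
    else:
        label = f"[copy] {title}"

    # Branch 2 (nested): include_url decides whether a URL part is added at all,
    # and the inner if/else handles a missing URL.
    if include_url:
        if url:
            label += f" <{url}>"
        else:
            label += " <no url>"
    return label


if __name__ == "__main__":
    print(describe_page("Home", is_original=True, include_url=False))
    print(describe_page("Home", is_original=False, include_url=True, url="https://example.com"))
```

Two flags already give several paths through the function; a long function with many such flags is what the complaint is about.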

That said, I just spent a few minutes browsing the SeleniumBase repo, and honestly it didn’t seem that unusual to me. Would be interested in seeing a specific example of what the commenter had in mind.

mdaniel · on Dec 17, 2024

rather than point-by-point rebuttal as the sibling requests, I think this sums up the coding style pretty well: https://github.com/seleniumbase/SeleniumBase/blob/v4.33.11/s...

harrall · on Dec 18, 2024

That's not amazing code but that's not that bad. In the grand scheme of things, that's not code debt that would ever seriously make my life any harder.

TeMPOraL · on Dec 18, 2024

Yup. At least it's self-contained and easy to step through and modify if something breaks or needs to be changed.

And, as my previous PM would point out, even the copy-pasting and verifying no mistakes were made was a solution that took a fraction of the time a modern "clean" approach would. She had a point; as much as I'm against writing this simple code in the general case, plenty of devs tend to err towards overcomplicating solutions when given a chance.

I mean, the modern, proper, Clean Code™ solution would have this split into multiple files (not counting tests), and across two or three abstraction levels. I've seen this happen enough that I can tell I'd much prefer working with code like this capabilities parser (and hell, it can be beaten into near-perfection in an hour or three).

the_real_cher · on Dec 19, 2024

Amen!

I think the more experienced you get in coding, the more you appreciate straightforward code you can immediately look at and understand.

seleniumbase · on Dec 17, 2024

That method came from code that I accepted in a PR from December 31, 2019: https://github.com/seleniumbase/SeleniumBase/pull/459 Not a true representation of most of the code today.

the_real_cher · on Dec 19, 2024

It's not really bad, though.

It's clear, it's intuitive, it's easy to understand on first glance, it's a single purpose function, it's easy to step through.

you don't have anything to defend here.

parineum · on Dec 18, 2024

The code is in the code base. Presumably, it still gets run. It doesn't make a difference if new code doesn't look like that.

wisty · on Dec 18, 2024

Bad old code has been battle tested. Bad new code has not, and is more likely to have the show stopper bugs you want to avoid.

seleniumbase · on Dec 18, 2024

There's actually a lot of examples being used for testing (https://github.com/seleniumbase/SeleniumBase/tree/master/exa...), which are run regularly (locally and in GitHub Actions). Plus, a lot of major companies are using SeleniumBase: https://github.com/seleniumbase/SeleniumBase/blob/master/hel... (if something breaks, I find out quickly)

seleniumbase · on Dec 18, 2024

Call it "legacy code" if you'd like. That specific part is from a less common feature for setting options when running on a Selenium Grid. The new CDP Mode isn't compatible with The Grid (since CDP Mode makes direct CDP API calls without making Selenium API calls).

MstWntd · on Dec 18, 2024

it's always easier for people today to look at the work of other people in the past and draw stupid conclusions.. don't mind them..

bryanrasmussen · on Dec 18, 2024

Maybe I am just a cynic but I would expect Playwright to be detected when using Chrome, I mean I would expect it was to the benefit of Google to make that happen for the sake of making reCaptcha detect bots better.

That's actually why I've been scrapping my Playwright automation (because I expect I will encounter problems even if it hasn't happened yet, cynical and paranoid) and moving towards writing a browser extension to automate Firefox.

Basically my use case is automating tedious things for myself, not running bots at scale, so that's why it is imperative not to get caught being "not human", because then I risk account problems.

robertlagrant · on Dec 18, 2024

How can Google make that happen? Playwright's made by Microsoft. It can use Firefox as a browser as well as Chrome.

bryanrasmussen · on Dec 20, 2024

well I said when using Chrome, how would they make it happen?

well it's not like it's using AutoHotkey to automate things, it must be using underlying browser APIs to move the mouse to mouseover something etc., as opposed to actually using the mouse, as an example

naive workflow -

I would think the browser sends a message to Google that the instance (unique id) is being automated; reCAPTCHA is detected by Chrome on the page; Chrome calls a hidden reCAPTCHA method .setUniqueId(uniqueID); the uniqueID is sent back to Google; the response tells reCAPTCHA that an automated browser is actually being used; reCAPTCHA reports a 90% chance that the browser is automated to the site; the site stops browser access.

Site happy it uses recaptcha because they stopped automation.

Sure, Playwright can use FF, but most often people just use Chrome.

pryelluw · on Dec 18, 2024

Enterprise Python code. Somehow ends up being worse than Java enterprise code. I’m too used to it at this point.

seleniumbase · on Dec 18, 2024

The "Python vs Java" debate is probably one for a different Hacker News post. :)

pryelluw · on Dec 18, 2024

I meant that some of the code reminds me of enterprise python. The kicker is that code that works > pretty code. People here act as if ugly code is somehow lesser just because it’s ugly. Meanwhile there’s a lot of ugly code making millions of dollars.

Didn’t mean to bash your project. Sorry if it came across that way.

seleniumbase · on Dec 18, 2024

It's OK. No offense was taken. It almost looked like the conversation was expanding into a "Python vs Java" debate, but (thankfully) it did not. I've seen both worlds. I've seen advantages to both. I decided to stay in the Python world.

pryelluw · on Dec 19, 2024

Same. Although enterprise python is akin to wrestling a boa constrictor.

edm0nd · on Dec 17, 2024

Not sure if you have explored rolling captcha solving services into your code. It's easy as fuck and you can do it in a few lines of code. Check out DeathByCaptcha or AntiCaptcha. It's like $2.99 per 1,000 successfully solved captchas.

I guess my point is, you don't always have to be undetected or write 1000 lines of code to scrape or do whatever you need to do. Saved me a ton of headaches and time when captchas are involved.
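For a rough sense of scale, here's the arithmetic on that price as a quick sketch. It assumes the roughly $2.99 per 1,000 figure quoted above; the function name `solving_cost` and the 50,000 volume are made up for illustration, and actual pricing will vary by service and captcha type.

```python
# Assumed price from the comment above: about $2.99 per 1,000 solved captchas.
PRICE_PER_THOUSAND_USD = 2.99


def solving_cost(num_captchas: int, price_per_thousand: float = PRICE_PER_THOUSAND_USD) -> float:
    """Estimate the cost in USD of solving num_captchas at a flat per-1,000 rate."""
    return round(num_captchas / 1000 * price_per_thousand, 2)


if __name__ == "__main__":
    # Hypothetical volume: 50,000 solved captchas.
    print(f"${solving_cost(50_000):.2f}")
```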

mintzworld · on Dec 17, 2024

SeleniumBase is free, open-source, can bypass CAPTCHAs with a few lines of code, and it works from the free tier of GitHub Actions.

edm0nd · on Dec 17, 2024

It can't bypass all captchas, and that's what I'm talking about.

mintzworld · on Dec 17, 2024

According to live demos seen in https://www.youtube.com/watch?v=Mr90iQmNsKM, it'll bypass Cloudflare, Akamai, Shape Security, DataDome, Incapsula, Kasada, and PerimeterX.

edm0nd · on Dec 17, 2024

Okay, and? DeathByCaptcha can bypass all of those + all other captchas.

Write a ton of code or just roll in a solving service API. Ez decision and save a ton of time + get to scraping faster.

windexh8er · on Dec 18, 2024

I feel like what you're saying is you have a vested interest in the services you mentioned with all of this scope creep to your OG argument.

seleniumbase · on Dec 18, 2024

With SeleniumBase, you can bypass CAPTCHAs with one line of code: `sb.uc_gui_click_captcha()`

edm0nd · on Dec 18, 2024

okay, but it doesn't solve all captchas, while a solving service does with a few more lines of code.

Can your script even do Google CAPTCHA and hCaptcha? What about the captcha from Dread? (ain't no way it can)

There is no need to bypass them when you can just solve them.

seleniumbase · on Dec 18, 2024

There's a reCAPTCHA on the Pokemon website. This SeleniumBase example bypassed it: https://github.com/seleniumbase/SeleniumBase/blob/master/exa...

Funnnny · on Dec 18, 2024

> There is no need to bypass them when you can just solve them.

There is no need to solve them when you can just bypass them.

edm0nd · on Dec 18, 2024

the point is you can't bypass them all but you CAN solve them all.

mintzworld · on Dec 18, 2024

Why pay to solve CAPTCHAs when SeleniumBase can bypass them for free? SeleniumBase can also "solve" CAPTCHAs (such as Cloudflare via click).

parineum · on Dec 18, 2024

It's like you're not even reading what he wrote.