To expand on this... Engineers often love to say you can't do this because regul...

roywiggins · on Oct 26, 2020

This works as long as you're really sure that the language you'll want to parse tomorrow will be regular also. It doesn't take much to accidentally add a new requirement that isn't, and once you've committed to regexps you may be tempted to break out the non-regular extensions that most regexp engines support, and that way lies madness.

Starting with a real HTML parser is a good way to future-proof your code for when someone asks you to add just one more thing.

danpalmer · on Oct 26, 2020

That's true, although I've also seen scraping fail because it was being too precise – looking for something at a particular point in the DOM tree because the parser encourages things like XPaths or CSS selectors, where a regex would have been less brittle _for that use-case_.

For me this just highlights why it's important that engineers understand at some basic what these different things all mean, and what limitations you may have with your solutions, or even those you may want.

bigiain · on Oct 27, 2020

  <h1 class='Rocks Mineral Banana Poison'>Things I won't eat!</h1>

danpalmer · on Oct 27, 2020

I assume this is a counterpoint to my Banana example? It still depends on your language. Maybe this is ok! I wasn't clear on whether I meant it being in the human readable page, or the raw text of the page, but maybe either is sufficient for this contrived hypothetical.

There are definitely cases like this where you have to be careful, but my point still stands that it's important to understand the language you are parsing, and the fact that it might be a regular language. Hell, it could even be Turing complete and then you're out of luck!

bigiain · on Oct 28, 2020

Yep.

In my experience, until you've made the mistakes that not properly parsing html leads to - you mostly jump to naive regex/substring solutions too quickly where you should learn/use well tested html parsing libraries instead. Those mo4re advanced techniques aren't always required, but they're worth knowing and once you know them it's smarter to "over solve" the problem sometimes than "cowboy it" with a regex just because it looks like it'll do the job.