Hacker News

What I do is convert to markdown; that way you still get some semantic structure. I even built an Elixir library for this: https://github.com/agoodway/html2markdown
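
To make the idea concrete, here is a rough sketch of an HTML-to-markdown pass in Python, using only the standard library. It is not the Elixir library linked above and says nothing about how that library works; the names `MarkdownSketch` and `html_to_markdown` are made up for illustration, and only headings, paragraphs, links, emphasis, and list items are handled.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Hypothetical, minimal converter: keeps a little semantic structure
    (headings, paragraphs, links, emphasis, list items) and drops the rest."""

    SKIP = {"script", "style"}  # elements whose text should not reach the output
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, joined at the end
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.href = None     # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag in ("p", "ul", "ol"):
            self.out.append("\n\n")
        elif tag == "li":
            # Simplification: ordered list items also become "-" bullets.
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None
        elif tag in self.INLINE:
            self.out.append(self.INLINE[tag])

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse whitespace runs the way a browser would render them.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    text = re.sub(r"[ \t]+\n", "\n", text)  # strip trailing spaces
    text = re.sub(r"\n[ \t]+", "\n", text)  # strip leading spaces
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


if __name__ == "__main__":
    page = (
        "<h1>Title</h1><p>Hello <b>world</b>, see "
        '<a href="https://example.com">this</a>.</p>'
        "<ul><li>one</li><li>two</li></ul><script>var x = 1;</script>"
    )
    print(html_to_markdown(page))
```

The heading, link, and list markers survive while tags, attributes, and scripts are dropped, which is the "some semantic structure" trade-off. Real pages need much more care (tables, nested lists, code blocks, malformed markup), so treat this as a toy.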


That seems to be the most common method I've seen. It makes sense given how well LLMs understand markdown.


Why do LLMs understand markdown so well (besides its simple, terse, and readable syntax)?

They say "LLMs are trained on the web." Are the web pages converted from HTML into markdown before being fed into training?


I think it says in the Anthropic docs that they use markdown internally (I assume that means the models were trained on it to a significant extent).


I think Anthropic actually uses XML and OpenAI uses markdown.


They're trained on lots of code, and pretty much every public repo has markdown in it, even if it's just the README.


I did that with JSON too, and got better results.



