What I do is convert to markdown, that way you still get some semantic structure...
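
For anyone wondering what that conversion might look like, here is a rough sketch. It assumes the input is HTML (the comment above doesn't say), and it only handles a handful of tags using Python's standard-library `html.parser`. The names `MarkdownSketch` and `html_to_markdown` are made up for illustration; in practice a dedicated converter library would likely cover far more cases.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Hypothetical helper: maps a few common HTML tags to Markdown."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []   # output fragments, joined at the end
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag])
        elif tag == "li":
            self.parts.append("- ")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")  # blank line after block elements
        elif tag == "li":
            self.parts.append("\n")
        elif tag in ("strong", "b"):
            self.parts.append("**")
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Collapse runs of whitespace; HTML treats them as a single space.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Convert a small HTML fragment to Markdown (sketch, not exhaustive)."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    sample = (
        '<h1>Title</h1>'
        '<p>See <a href="https://example.com">the docs</a> '
        'for <strong>details</strong>.</p>'
        '<ul><li>one</li><li>two</li></ul>'
    )
    print(html_to_markdown(sample))
```

Headings, links, emphasis, and list items survive the conversion, which is the "semantic structure" part; layout markup such as `div` and `span` is simply dropped. Tags the sketch does not know about (tables, nested lists, code blocks) pass through as plain text.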

bearjaws · on Sept 6, 2024

Seems to be the most common method I've seen. It makes sense given how well LLMs understand markdown.

wis · on Sept 7, 2024

Why do LLMs understand markdown really well? (besides the simple, terse and readable syntax of markdown)

They say "LLMs are trained on the web". Are the web pages converted from HTML into markdown before being fed into training?

nprateem · on Sept 7, 2024

I think it says in the Anthropic docs that they use markdown internally (I assume that means the models were trained on it to a significant extent).

cpursley · on Sept 7, 2024

I think Anthropic actually uses XML and OpenAI uses markdown.

ascorbic · on Sept 8, 2024

They're trained on lots of code, and pretty much every public repo has markdown in it, even if it's just the README.

audessuscest · on Sept 7, 2024

I did that with JSON too, and got better results.