
Reversed, "lol :Man facepalming: :medium light skintone:" ends up with the skin tone modifier applying to nothing (which might crash?) and a man in the wrong colour. (e + accent, a), which renders as éa, becomes (a + accent, e), which incorrectly renders as áe, or possibly forms an invalid combination. Right-to-left marks[1] and left-to-right marks will change which sections of the text are reversed unless you swap them over.

Codepoints can combine more than once, to the point where, if you're nitpicky enough, you can't validly take a substring either; you can only read a string from the first codepoint onwards. Reversed, they could possibly become invalid sequences?
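To make the failure concrete, here is a small Python sketch of what a codepoint-wise reversal does to both examples. The variable names are mine, and I'm assuming the "man facepalming" emoji is encoded as the usual five-codepoint ZWJ sequence.

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT + 'a', which renders as "éa".
accented = "e\u0301a"
reversed_accent = accented[::-1]  # the accent now follows 'a' instead of 'e'

# Assumed encoding of "man facepalming: medium-light skin tone":
# face palm + skin tone modifier + zero width joiner + male sign + variation selector.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
reversed_emoji = facepalm[::-1]  # now starts with a selector that has nothing to modify

for label, text in (("accent", reversed_accent), ("emoji", reversed_emoji)):
    # Print code point names rather than glyphs so the order is unambiguous.
    print(label, [unicodedata.name(ch) for ch in text])
```

The reversed accent string composes to "áe" under NFC, and the reversed emoji is five stray codepoints rather than one glyph.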

[1] https://en.wikipedia.org/wiki/Right-to-left_mark



Agree. I think reversing non-ASCII text should always be thought of as "per-token", where in English each character is a token. So the reverse of what you gave would be:

":medium light skintone: :Man facepalming: lol"

(with the lol reversed). This is a much harder problem than, say, mystring[::-1] in Python. So "reverse a string" is a different problem from "reverse an array".

Accented characters would be kept as is in my scenario.


The "tokens" you're thinking of are "grapheme clusters" in Unicode.
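A rough Python sketch of reversing by cluster, using only the standard library. Big caveat: this is my own approximation, not the real segmentation rules from UAX #29. It only glues on nonspacing/enclosing marks, emoji skin tone modifiers, and anything joined by a zero width joiner. The function names are made up for illustration.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def is_extender(ch):
    """Approximation: True if ch should stay attached to the preceding code point."""
    return (
        unicodedata.category(ch) in ("Mn", "Me")  # combining marks, variation selectors
        or "\U0001F3FB" <= ch <= "\U0001F3FF"     # emoji skin tone modifiers
        or ch == ZWJ
    )

def rough_clusters(text):
    """Split text into approximate grapheme clusters (NOT full UAX #29)."""
    clusters = []
    for ch in text:
        # Attach extenders, and whatever follows a ZWJ, to the current cluster.
        if clusters and (is_extender(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def reverse_by_cluster(text):
    """Reverse the order of clusters while keeping each cluster intact."""
    return "".join(reversed(rough_clusters(text)))
```

With this, reverse_by_cluster("e\u0301a") keeps the accent on the e, and a facepalm ZWJ sequence stays in one piece. For anything real you'd want a library that implements the actual Unicode segmentation rules.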

Unfortunately, just reversing by grapheme clusters doesn't solve the problem because of directional formatting codes: if you have, e.g., a right-to-left embedding followed by a pop directional formatting, you can't naively reverse them.
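Here's a quick Python sketch of that. The balance checker is a hypothetical helper of mine, not part of any bidi library, and it only looks at the embedding and pop characters. Every codepoint in the sample is its own cluster, so reversing by cluster gives the same result as [::-1].

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def embeddings_balanced(text):
    """True if every embedding is closed and every pop has something to pop."""
    depth = 0
    for ch in text:
        if ch in (LRE, RLE):
            depth += 1
        elif ch == PDF:
            if depth == 0:
                return False  # a pop with no open embedding before it
            depth -= 1
    return depth == 0

# Latin text with an embedded run of three Hebrew letters.
original = "abc " + RLE + "\u05d0\u05d1\u05d2" + PDF + " def"
naive = original[::-1]

print(embeddings_balanced(original))  # True
print(embeddings_balanced(naive))     # False
```

After the naive reversal the pop comes first and the embedding is opened afterwards and never closed, so the formatting no longer brackets the Hebrew run.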


Grapheme clusters are a poor approximation of the vaguely-defined linguistic-level concept you're groping for.


Well, yes, but we gotta stop somewhere or just give up any hope of computers operating on text.

Although I think grapheme clusters are a pretty good approximation in that it's usually what you want to backspace in a word processor.


Is there a better approximation?



