
> The woman golfing uses 4 bytes in UTF-8, UTF-16 and UTF-32.

No. It's made up of 5 code points, and each of those takes 32 bits, or 4 bytes, in UTF-32. So that emoji, which is a single grapheme, uses 20 bytes in UTF-32. Once again, just try it yourself: encode a string containing that single emoji in UTF-32 using your favorite programming language and count the bytes!
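As a concrete sketch of that count, here is the sequence in C. The five code points are the ones Unicode assigns to this ZWJ sequence (names taken from the code charts); the 20-byte total falls straight out of `sizeof`:

```c
#include <stdint.h>
#include <stddef.h>

/* The "woman golfing" ZWJ sequence as UTF-32 code units. */
static const uint32_t woman_golfing[] = {
    0x1F3CC, /* PERSON GOLFING */
    0xFE0F,  /* VARIATION SELECTOR-16 */
    0x200D,  /* ZERO WIDTH JOINER */
    0x2640,  /* FEMALE SIGN */
    0xFE0F   /* VARIATION SELECTOR-16 */
};

/* In UTF-32 every code point is one fixed 4-byte unit, so the whole
   grapheme occupies 5 * 4 = 20 bytes. */
static const size_t woman_golfing_bytes = sizeof woman_golfing;
```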



Let's go to the source:

https://www.unicode.org/faq/utf_bom.html

> Q: Should I use UTF-32 (or UCS-4) for storing Unicode strings in memory?
>
> This depends. It may seem compelling to use UTF-32 as your internal string format because it uses one code unit per code point. (...)

It also confirms what I said about implementation vs interface (or storage vs access, whatever):

> Q: How about using UTF-32 interfaces in my APIs?
>
> Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. (...) While a UTF-32 representation does make the programming model somewhat simpler, the increased average storage size has real drawbacks, making a complete transition to UTF-32 less compelling

Finally:

> Q: What is UTF-32?
>
> Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4 code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character.

So, it does in fact reduce the complexity of implementing it _for storage_, as I suspected. And there is a tradeoff, as I mentioned. And the Unicode documentation explicitly separates the interface from the storage side of things.

That's good enough for me. I mentioned before there might be some edge cases like ligatures, and you came up with a zero-width joiner example. None of this changes these fundamental properties of UTF-32 though.


Reading this thread, I feel bad for the person you’re arguing with. It’s clear you are in the “knowledgeable enough to be dangerous” stage and no amount of trying to guide you will sway you from your mistaken belief that you are right.

Now, to try one last time: you are misreading the spec and not understanding important concepts. Take the “woman golfing” emoji as an example. That emoji is not a Unicode “character”, and that is part of why it can’t be represented by a single UTF-32 code unit. That emoji is a grapheme which combines multiple “characters” together with a zero width joiner, “person golfing” and “female” in this case. Rather than have a single “character” for every supported representation of gender and skin color, modern emoji use ZWJ sequences instead, which means yes, something you incorrectly think is a “character” can in fact take up more than 4 bytes in UTF-32.


I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that?

I said I might be wrong _multiple times_, and it's genuine. I'm glad you appeared with an in-depth explanation that proves me wrong. I asked exactly for that.

The first examples in this thread (á, â) are not zero-width-joiner sequences or complex graphemes. They could all be stored in 4 bytes. It took some time to come up with the woman golfing example.

By the way, one can still implement reading "character" by "character" in fixed 4-byte UTF-32 units, and decide to abstract the grapheme units on top of that. It still saves a lot of lead byte checks.
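A sketch of that layering in C: the loop below strides in fixed 4-byte units with no lead-byte checks, and a grapheme count sits on top. Note the `extends_previous` rule here is my own gross simplification (only ZWJ and VS-16), nowhere near the real segmentation rules of UAX #29:

```c
#include <stdint.h>
#include <stddef.h>

/* Grossly simplified: treat only ZWJ and VARIATION SELECTOR-16 as
   "attaching to the previous code point". Real grapheme segmentation
   (UAX #29) has far more rules than this. */
static int extends_previous(uint32_t cp) {
    return cp == 0x200D || cp == 0xFE0F;
}

static size_t count_graphemes_utf32_naive(const uint32_t *s, size_t n) {
    size_t graphemes = 0;
    for (size_t i = 0; i < n; i++) {
        /* Fixed stride: s[i] IS the i-th code point, no lead bytes. */
        int joins = extends_previous(s[i])
                 || (i > 0 && s[i - 1] == 0x200D); /* after a joiner, attach */
        if (i == 0 || !joins)
            graphemes++;
    }
    return graphemes;
}
```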

Maybe someone else learned a little bit reading through the whole thing as well. If you are afraid I'm framing the subject with the wrong facts, this is me stating, once again, that I never claimed to be a Unicode expert. I don't regret a single thing, I love being actually proven wrong.


> I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that? I said I might be wrong _multiple times_, and it's genuine.

I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

> I don't regret a single thing, I love being actually proven wrong.

Me too!


> I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

To me it sounds ok. I'm not a native English speaker though, so there is a handicap on my side on tone, vocabulary and phrasing.

My intent was to admit I had wrong assumptions. At some point before this whole thread, I _really_ believed all graphemes (which, in my mind, were "character combinations") could be stored in just 4 bytes. I was aware of combining characters, I just assumed all of them could fit in 32 bits. You folks taught me that they can't.

However, there's another subject we're dealing with here as well: storing these code points at a lower level, whether or not they form graphemes at a higher level of abstraction.

The fact that I was wrong about graphemes directly impacts the _example_ that I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation of _why_ on this as well, and if given, that would mean I am wrong once again, and if that happens, I am fine, but it hasn't yet.


> The fact that I was wrong about graphemes directly impacts the _example_ that I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation of _why_ on this as well, and if given, that would mean I am wrong once again, and if that happens, I am fine, but it hasn't yet.

As I said in the other thread, to try to minimize confusion, consider Character and Grapheme as synonymous. They are made up of one or more codepoints. The world you’ve made up, where everything is characters and they all take exactly four bytes in UTF32, is just wrong. Yes, many graphemes are a single codepoint, so yes, they are 4 bytes in UTF32, but not ALL (and it’s not just emojis to blame).

If what you’re on about is the leading zeros, yes, they don’t matter individually. Unicode by rule is limited to 21 bits to represent all codepoints, so the 11 bits left as leading zeros are wasted, which is why folks don’t typically use UTF32: it’s the least efficient storage-wise and doesn’t really have any advantage over UTF-16 outside easy counting of codepoints (but again, codepoints aren’t characters).
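That 21-bit cap is easy to verify in C. A tiny sketch (the constant and helper names here are mine, chosen for illustration):

```c
#include <stdint.h>

/* Unicode scalar values are capped at U+10FFFF, which fits in 21 bits,
   so the top 11 bits of every 32-bit UTF-32 code unit are always zero. */
static const uint32_t UNICODE_MAX = 0x10FFFF;

static int fits_in_21_bits(uint32_t cp) {
    return (cp >> 21) == 0; /* no bits set above bit 20 */
}
```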


I'm talking more about C (or any low level stuff) than Unicode now. The world I'm using as a reference has only byte arrays, smoke and mirrors.

I'm constantly pointing you _to the awk codebase_. It's a relevant context, the title of the post and it matters. Can you please stop ignoring this? There's no Unicode library there, it's an implementation from scratch.

If you are doing it from scratch, there's a part of the code that will deal with raw bytes, way before they are recognized as Unicode things (whatever they might be).

Ever since this entire post was created, the main context was always this: an implementation from scratch, in an environment that does not have Unicode primitives as first-class citizens. Your string functions don't have \u escapes; YOU are writing the string functions that support \u.
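For illustration, here is a minimal from-scratch UTF-8 encoder in C, the kind of routine a \u-handling string function ends up needing in that environment. This is a sketch, not the actual awk code, and it skips validation (it assumes a valid scalar value, e.g. it doesn't reject surrogates):

```c
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8.
   Writes 1-4 bytes into out[] and returns how many were written. */
static int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```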


Ok, now I know you’re just trolling. Enjoy and goodbye!


I'm not kidding. Just look at the code:

https://github.com/onetrueawk/awk/commit/d322b2b5fc16484affb...

I am talking about that environment. You are not.


To make the point clearer, the female golfing Unicode “character” is encoded as follows in the various UTFs:

UTF16 (12 bytes total):

\ud83c\udfcc\ufe0f\u200d\u2640\ufe0f

UTF32 (20 bytes total):

u+0001f3cc u+0000fe0f u+0000200d u+00002640 u+0000fe0f

UTF8 (16 bytes total):

\xf0\x9f\x8f\x8c\xef\xb8\x8f\xe2\x80\x8d\xe2\x99\x80\xef\xb8\x8f
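Those totals follow mechanically from the encoding rules. A small C sketch of the per-code-point sizes (helper names are mine; assumes valid scalar values, i.e. no lone surrogates):

```c
#include <stdint.h>
#include <stddef.h>

/* Encoded size in bytes of one scalar value in each UTF. */
static size_t utf8_len(uint32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
static size_t utf16_len(uint32_t cp) {
    return cp < 0x10000 ? 2 : 4;  /* BMP: one unit; else surrogate pair */
}
static size_t utf32_len(uint32_t cp) {
    (void)cp;
    return 4;                     /* always exactly one 4-byte unit */
}
```

Summing over the golfer's five code points (U+1F3CC, U+FE0F, U+200D, U+2640, U+FE0F) gives 16, 12 and 20 bytes, matching the listings above.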


You are still presenting me with abstractly encoded data. \u and u+ are in a higher level of abstraction. The only raw bytes I am seeing here are in the UTF-8 string, which you decided to serialize as hexadecimals (did you have a choice? why?).

If you had all of these expressed as pure hexadecimals (or octals, or any single-byte unit), how would they be serialized?

Then, once all of them are just hexadecimals, how would you go about parsing one of these sequences of hexadecimals into _characters_? (each hex value representing a raw byte, just like in the awk codebase we are talking about)

Another question: do you need first to parse the raw bytes into characters before recognizing graphemes, or can you do it all at once for both variable-length and fixed-length encodings?


> You are still presenting me with abstractly encoded data.

That’s the actual encoding for that grapheme as specified by the spec for UTF8, UTF16, and UTF32.

> \u and u+ are in a higher level of abstraction.

No, it’s not, it’s how you write escaped 16bit and 32bit hexadecimal strings for UTF-16 and UTF-32 respectively. Notice there’s 4 hex characters after \u and 8 hex after u+. Those are the “raw bytes” in hex.

> The only raw bytes I am seeing here are in the UTF-8 string, which you decided to serialize as hexadecimals

All three forms are “raw bytes” in hex form. \x is how you represent an escaped 8 bit UTF-8 byte in hex.

> Another question: do you need first to parse the raw bytes into characters before recognizing graphemes, or can you do it all at once for both variable-length and fixed-length encodings?

You need to “parse” (more like read for UTF16 and UTF32, as there’s not much actual parsing outside byte order handling) the raw bytes into codepoints. To try to minimize confusion, consider Character and Grapheme as synonymous. They are made up of one or more codepoints. It really doesn’t matter if it’s variable length or fixed length, you still have to get the codepoints before you can determine character/graphemes.
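A from-scratch sketch of both steps in C (not awk's actual code): UTF-8 needs lead-byte dispatch, UTF-16 only needs the surrogate-pair combine, and UTF-32 units already are code points. No validation of continuation bytes or ranges here, so it's a sketch, not production code:

```c
#include <stdint.h>

/* Decode one code point from raw UTF-8 bytes via the lead byte.
   Returns how many bytes were consumed. Assumes well-formed input. */
static int utf8_decode(const unsigned char *p, uint32_t *cp) {
    if (p[0] < 0x80) {                     /* 1-byte sequence */
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {           /* 2-byte sequence */
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {           /* 3-byte sequence */
        *cp = ((uint32_t)(p[0] & 0x0F) << 12)
            | ((uint32_t)(p[1] & 0x3F) << 6)
            |  (p[2] & 0x3F);
        return 3;
    }
    /* 4-byte sequence */
    *cp = ((uint32_t)(p[0] & 0x07) << 18)
        | ((uint32_t)(p[1] & 0x3F) << 12)
        | ((uint32_t)(p[2] & 0x3F) << 6)
        |  (p[3] & 0x3F);
    return 4;
}

/* UTF-16: combine a surrogate pair, e.g. \ud83c \udfcc -> U+1F3CC. */
static uint32_t utf16_combine(uint16_t hi, uint16_t lo) {
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```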


I am asking you "how do you purify water"? And you're holding a bottle of Fiji and telling me "look, it's simple".

You're absolutely right about what is a character and what is a grapheme. I already said that, this subject is done. You're right, no need to come back to it. You win, I already yielded several comments ago.

Now, to the other subject: I would very much prefer if we talked only about bytes. Yes, talking only about bytes makes things harder. Somewhere down the line, there must be an implementation that deals with the byte sequence. I'm talking about those layers, just above the assembler (in awk's case). There IS confusion at this level, and no way to avoid it except abstracting it yourself, byte by byte (or, 4 bytes by 4 bytes in UTF-32).



