Hacker News

I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that?

I said I might be wrong _multiple times_, and it's genuine. I'm glad you appeared with an in-depth explanation that proves me wrong. I asked exactly for that.

The first examples in this thread (á, â) are not zero-width-joiner sequences or complex graphemes. They could all be stored in 4 bytes. It took some time to come up with the "woman golfing" example.

By the way, one can still implement reading "character" by "character" over fixed 4-byte UTF-32 units, and decide to abstract the grapheme units on top of that. It still saves a lot of leading-byte checks.

Maybe someone else learned a little bit reading through the whole thing as well. If you're afraid I'm framing the subject with the wrong facts, this is me stating, once again, that I never claimed to be a Unicode expert. I don't regret a single thing, I love being actually proven wrong.



> I am reading the spec, discussing online and trying to understand the subject better, what is wrong with that? I said I might be wrong _multiple times_, and it's genuine.

I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

> I don't regret a single thing, I love being actually proven wrong.

Me too!


> I encourage you to reread the comment I replied to again and see if it has the tone of someone “trying to learn” or rather something different.

To me it sounds ok. I'm not a native English speaker though, so I'm at a disadvantage when it comes to tone, vocabulary and phrasing.

My intent was to admit I had wrong assumptions. At some point before this whole thread, I _really_ believed all graphemes (which, in my mind, were "character combinations") could be stored in just 4 bytes. I was aware of combining characters, I just assumed all of them could fit in 32 bits. You folks taught me that they can't.

However, there's another subject we're dealing with here as well: storing these characters at a lower level, whether or not they form graphemes at a higher level of abstraction.

The fact that I was wrong about graphemes directly impacts the _example_ I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation of _why_ on this point as well. If one is given, that would mean I'm wrong once again, and I'm fine with that, but it hasn't happened yet.


> The fact that I was wrong about graphemes directly impacts the _example_ I gave about the length of a string, but not the _fundamental principle_ of UTF-32 I was trying to convey (that you don't need to count lead bytes at the character level). Can we agree on that? If we can't, I need a more in-depth explanation of _why_ on this point as well. If one is given, that would mean I'm wrong once again, and I'm fine with that, but it hasn't happened yet.

As I said in the other thread, to try to minimize confusion, consider Character and Grapheme as synonymous. They are made up of one or more codepoints. The picture you've built up, where everything is a character and every character takes exactly four bytes in UTF-32, is just wrong. Yes, many graphemes are a single codepoint, so yes, those are 4 bytes in UTF-32, but not ALL (and it's not just emojis to blame).

If what you're on about is the leading zeros: yes, individually they don't matter. Unicode is limited by rule to 21 bits to represent all codepoints, so the 11 bits left as leading zeros are wasted. That's why folks typically don't use UTF-32: it's the least efficient storage-wise and doesn't have any real advantage over UTF-16 outside easy counting of codepoints (but again, codepoints aren't characters).


I'm talking more about C (or any low-level stuff) than Unicode now. The world I'm using as a reference has only byte arrays, smoke and mirrors.

I'm constantly pointing you _to the awk codebase_. It's the relevant context: it's the title of the post, and it matters. Can you please stop ignoring this? There's no Unicode library there, it's an implementation from scratch.

If you are doing it from scratch, there's a part of the code that will deal with raw bytes, way before they are recognized as Unicode things (whatever they might be).

Ever since this entire post was created, the main context has always been this: an implementation from scratch, in an environment that does not have Unicode primitives as first-class citizens. Your string functions don't have \u escapes; YOU are writing the string functions that support \u.


Ok, now I know you’re just trolling. Enjoy and goodbye!


I'm not kidding. Just look at the code:

https://github.com/onetrueawk/awk/commit/d322b2b5fc16484affb...

I am talking about that environment. You are not.



