Use String instead of char or code points in Java

spullara · on Oct 9, 2023

I had an interview question at Twitter that no one ever got 100% right because no one ever asked what the entity indexes in a tweet actually were, code point indexes, everyone assumed they were whatever their language used in substring(). Here is an example correct solution using the Java code point API:

https://github.com/spullara/interviewcode/blob/master/src/ma...

Really unfortunate that Java/Windows standardized on Unicode-16 a little early.

josefx · on Oct 9, 2023

> And as an aside to the aside: Check out this race condition that Wouter Coekaerts found in the String(char[]) constructor as a result of this optimization.

Quite sure neither the char array nor the String constructor pretend to be thread safe and messing with the input of a constructor in parallel would break most java code in existence in weird and unexpected ways.

dzaima · on Oct 9, 2023

The linked problem goes far past the result object being unstable though - namely, with it you can affect any string in any class file that the JVM loads in the future. That is, you can get yourself into a situation where the following function returns false:

    boolean foo() {
      System.out.println("hello world".startsWith("hello"));
    }

distract8901 · on Oct 9, 2023

Yeah, I'd expect that almost any code would react badly to modifying a ctor argument in parallel. That seems like a very bad thing to do in general.

zb3 · on Oct 9, 2023

I'll stop if I want to.

Zondartul · on Oct 9, 2023

You know what, just stop coding alltogether. Go touch grass.

Code points are Unicode's closest equivalent to "single graphical character" and everything you've been doing when iterating ASCII strings can be done using code points. It's also the closest thing to "a single key-stroke" and "how much text should disappear when I press backspace once".

The linked article suggests basically doing nothing because a few edge cases exist.

pie_flavor · on Oct 9, 2023

No, Unicode's closest equivalent to 'single graphical character' is extended grapheme clusters, and they are explicitly defined as what should disappear when you press backspace once.

Neither code points nor graphemes are equivalent to keystrokes in CJK languages; each character, precomposed or not, is written as a combination of other characters.

I appreciate the desire to round off a few edge cases, but remember that this is the majority of the world's population here (not counting English/commonly-used-in-English words like jalapeño - hope that ñ is precomposed!).

The only thing you can iterate ASCII-style is English minus all the diacritics in loanwords, same as ASCII itself.

dwaite · on Oct 9, 2023

> Code points are Unicode's closest equivalent to "single graphical character" and everything you've been doing when iterating ASCII strings can be done using code points.

I would have thought the closest equivalent would be a grapheme cluster.

dmazzoni · on Oct 9, 2023

If your code never needs to handle Arabic, Hindi, or Thai, and you're 100% sure that a user will never paste an Emoji into your text box, then sure - go ahead and use unicode code points.

Seriously, just try inserting an emoji into your app, then see what happens if you iterate over code points. It doesn't work, your app is broken.

Also, note that in many languages it takes more than one keystroke to type a single character, so your idea of "a single key-stroke" as an abstraction is definitely wrong.

Your abstraction of "how much text should disappear when I press backspace once" is reasonable. One unicode code point is definitely NOT the right answer, though, it's one extended grapheme cluster. Kind of annoying, but it's the only way that works.

invalidname · on Oct 10, 2023

What's the problem with Arabic, Hindi or Thai?

Do you mean RTL?

That should work fine even with char's. Emojis and some special characters do have a problem but the base languages should be fine with the 16 bit unicode.

msie · on Oct 9, 2023

Yeah, it just seemed foolish to allow codepoints to be combined to form a "character". Some 'smart' people don't know when to stop being clever.

dmazzoni · on Oct 9, 2023

You'd prefer instead that there's a separate code point for every possible permutation of base characters and modifiers (like accents)?

So would you add z̃ as a character even though it isn't actually used in any known languages, because a tilde is used over 12 other letters already, so you might as well make characters for all of them?

Or would you skip it and only pick ones in known use...and then constantly update the tables every time you discover an obscure use of diacriticals?

My understanding is that there are some languages / writing systems where the problem is far more challenging - the possible combination of different modifiers could result in millions of possible characters, many of which are plausible even though most aren't known to be used in practice.

I think the reality is that human written language is extremely complex, and any attempt to simplify it too much will end up leaving out some languages.

msie · on Oct 10, 2023

Fair enough.