Unicode is indeed insanely complex. Beyond asking for its length in bytes, there is almost no query or transform of a Unicode string that is at all straightforward. I suspect that very few pieces of software that 'support' Unicode and do anything nontrivial with text actually do so fully correctly. It would be nice if there were a well-defined 'simple' subset that handled the 80% case and could be a reasonable target for the average app to support fully.


Perl 6 has invented its own normalisation called NFG which normalises at the grapheme level by creating synthetic code points for multi-char graphemes where necessary. This vastly simplifies operations on Unicode strings and gives semantics that are intuitive - producing results you would expect from the visual appearance of a string.
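
For comparison, here is a minimal Python sketch (not Perl 6, and not NFG itself) of why standard normalisation alone doesn't give you one code point per grapheme: NFC can only merge a base letter and a combining mark when Unicode already has a precomposed code point for that pair. The sample strings are my own illustrations.

```python
import unicodedata

composed = "\u00e9"        # é as a single precomposed code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT
no_precomposed = "q\u0301" # q + acute: Unicode has no precomposed form for this

# The two spellings of é look identical but compare unequal until normalised.
same_before = composed == decomposed
same_after = unicodedata.normalize("NFC", decomposed) == composed

# NFC cannot shrink q + acute to one code point, so len() still reports 2
# for something a reader sees as a single character.
nfc_length = len(unicodedata.normalize("NFC", no_precomposed))

print(same_before, same_after, nfc_length)
```

As I understand the description above, NFG closes exactly this gap by inventing a synthetic code point for cases like the last one.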


It feels to me like Unicode was designed for font renderers and other such software rather than programs that have to deal with Unicode input and output.

If you're a font renderer it makes sense to have separate codepoints for each grapheme, and it'd be more complex to split a single codepoint cluster into the individual components that need to be drawn. Having separate graphemes also allows reuse (though as the article shows, there's plenty of visual, non-semantic duplication).

But as a result, the operation "length of string in terms of what a user would consider as separate characters or grapheme clusters" is a hard problem that basically requires all the core aspects of a font renderer other than the actual display code.

Which is fine, and probably reasonable, but dear lord does it make it difficult to use.
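
To make the gap concrete, here is a rough Python sketch that compares three notions of "length" for the same text. The helper `approx_grapheme_count` is hypothetical and only skips combining marks; it is not the real grapheme cluster segmentation algorithm and will miscount things like emoji joined with zero-width joiners or Hangul jamo sequences.

```python
import unicodedata

def approx_grapheme_count(text: str) -> int:
    """Rough approximation: count code points that are not combining marks.

    Assumption: every mark (general category Mn, Mc, Me) attaches to the
    preceding base character. Real segmentation has many more rules.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

# "élève" spelled with combining accents instead of precomposed letters.
word = "e\u0301le\u0300ve"

byte_length = len(word.encode("utf-8"))       # bytes in UTF-8
code_point_length = len(word)                 # what Python's len() reports
grapheme_length = approx_grapheme_count(word) # closer to what a user sees

print(byte_length, code_point_length, grapheme_length)
```

For this sample the three numbers are 9, 7, and 5, and only the last one matches what a reader would count.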


> It would be nice if there was a well defined 'simple' subset that handled the 80% case that could be a reasonable target for the average app to support fully.

Isn't that ASCII?


Well it ends up that if you don't want to go down the Unicode rabbit hole too far then yeah, your best bet is probably to stick with ASCII. As an example from my industry though, it would be nice if I could implement user name entry and display for a high score table in a simple game and support names in common European languages without needing to handle all the edge cases of e.g. mixed left to right and right to left text, combining characters, surrogates, etc.

I'm far from a Unicode or a languages expert but I'm familiar with one language with right to left non Latin characters and aware of just enough Unicode madness to know I don't know enough to handle many edge cases properly. It would be nice if a regular developer like me could support something more than plain ASCII but less than the full insanity of Unicode to accommodate at least some non English users.


You're basically throwing people with "inconvenient" character sets (i.e. everything that doesn't use something strongly resembling Latin characters) under the bus. Sure, you might be able to support Spanish, French, and German, but you're basically disregarding Japanese, Chinese, and Hindi when doing so (and possibly even ASCII, since you'd trade some symbols for accents).


That's not at all what I'm advocating. My point is that the extreme difficulty of fully supporting Unicode with all of its complex edge cases (like my examples of mixing left to right and right to left languages) means that the two most common outcomes are throwing up your hands and giving up supporting anything but English and just using ASCII or having broken, partial Unicode support.

I'm suggesting it would be nice to have another option where you could provide some level of support for non English languages with something you have some hope of implementing correctly. Applications that correctly handle editing of mixed left to right and right to left text are rare for example but you could support Farsi speakers reasonably well in many applications without handling that scenario.


There’s Latin-1, which gets you full coverage of many European languages and almost complete coverage of several others (e.g. French uses œ, which Latin-1 lacks, but French readers will understand the text if oe is substituted).

The problem with Latin-1 and other 1-byte options is that, unlike ASCII, they aren’t forward compatible with UTF-8, which is the emerging de facto string exchange encoding. For a standalone video game, maybe that doesn’t matter, but for anything network enabled it can be a big issue.
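
A small Python sketch of what "not forward compatible" means in practice. The name is a made-up sample and `is_valid_utf8` is a hypothetical helper, not a library function.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

name = "M\u00fcller"  # "Müller", a hypothetical sample name

ascii_bytes = "Muller".encode("ascii")
latin1_bytes = name.encode("latin-1")  # ü becomes the single byte 0xFC
utf8_bytes = name.encode("utf-8")      # ü becomes the two bytes 0xC3 0xBC

# Pure ASCII is already valid UTF-8; Latin-1 with any non-ASCII byte is not.
print(is_valid_utf8(ascii_bytes), is_valid_utf8(latin1_bytes))

# Going the other way "works" but silently produces mojibake.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)
```

So ASCII data can be handed to a UTF-8 consumer unchanged, while Latin-1 data has to be transcoded first, and a consumer that guesses the wrong encoding either fails or garbles the text.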


No. Computers are in use by about 3 billion people right now. Only a minority of them use only ASCII characters in their day-to-day writing. Turns out the US and UK comprise only a small fraction of the world’s population.


But for many of the companies that people on HN happen to work for, they make up the majority or even the overwhelming majority of customers.


And therein lies a terrible misconception: the world does not speak English or German or Mandarin or French, but a horrible mix of all these languages. Eventually, almost any system will have to deal with that.

Simple example and a current pet peeve of mine while staying in the US: my name is spelled with an ü. I will likely try to enter that in your web form, because it is part of my legal name. A lot of systems happen to "sanitize" that input when it is passed across some invisible internal boundaries and it becomes a u. Now that system has actually changed my name. ü and u are completely different letters. The proper conversion actually is the transcription ue - two letters!

If that were - say - the address for a letter to Germany, it might well be returned to sender because there is no one with the (altered) name living at the given address.
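
Here is a Python sketch of the difference. I can only guess at what those systems do internally; `strip_accents` shows one common way such "sanitizing" ends up implemented, and `transcribe_german` with its mapping table is a hypothetical alternative that only makes sense for German text. The sample name is made up.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose, then drop combining marks: turns ü into plain u."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Conventional German transcription; other languages need other rules.
GERMAN_TRANSCRIPTION = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",  # ä ö ü
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",  # Ä Ö Ü
    "\u00df": "ss",                                  # ß
}

def transcribe_german(text: str) -> str:
    """Replace umlauts and ß with their two-letter transcriptions."""
    text = unicodedata.normalize("NFC", text)  # ensure precomposed letters
    return "".join(GERMAN_TRANSCRIPTION.get(ch, ch) for ch in text)

name = "M\u00fcller"  # "Müller", a hypothetical sample name
print(strip_accents(name))      # the altered name
print(transcribe_german(name))  # the conventional transcription
```

The first call gives "Muller" and the second "Mueller". Better still is to store and pass along the original string untouched and only transcribe at a boundary that truly can't carry it.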


The Unicode encoding formats are relatively simple and quite elegant (UTF-8 in particular manages to get an impressive set of capabilities with a relatively simple format).

> I suspect that very few pieces of software that 'support' Unicode and do anything non trivial with text actually do so fully correctly.

Why do you suspect this? Nearly all software that works with Unicode does so using pre-existing libraries (either a language's standard library, or something like libiconv).


I've read so many stories over the years of Unicode edge cases being broken in major applications and frameworks. I don't know the current state of all of these, but Chrome had all kinds of Unicode bugs for years, standard Windows text boxes didn't correctly support combining characters, and many consoles and edit boxes don't support mixed right to left and left to right text properly. Many widely used languages also seem to lack good standard library support for correctly manipulating strings with combining characters, although it's hard to find clear explanations of what the right thing to do even is.


Most of what you describe sounds like applications that don't bother to use Unicode-aware string manipulation routines, period, rather than applications that use buggy Unicode handling.


A lot of the problems I'm describing revolve around text input / editing / display rather than straight string manipulation.

Let's say I want to type 'My name is Matt.' in Farsi which is right to left. Transliterated that is 'esme (name) man (my) Matt ast (is)' and you might think I'd type that by typing the words in that order.

اسم من Matt است

The above is what I get if I type the words in that order while switching between Persian / English keyboards but it's not really what I'd expect as a user. In terms of right to left word order the above says 'ast Matt esme man' which is not the order I typed and is not correct. Now I'm not an expert in Farsi, Unicode or multi-lingual text entry and I don't know if the problem here is user error, browser implementation, Windows, or what. I know there's something about directional override characters in Unicode and a complex algorithm for dealing with text with mixed directions. I discovered while trying to write this post that it's been exploited in file names as a malware vector. I just know that this stuff is more complicated than I have time to deal with as a programmer on most projects I've worked on, so if Unicode is supported I don't even know if it's working correctly or how to test it in many cases.

Try to select, copy and paste the above text into a text editor and then move the cursor around with the cursor keys in it. Is the behavior what you'd expect? Is it correct? I don't know enough to be able to tell honestly.
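
To at least see what the bidirectional algorithm has to work with, here is a Python sketch that prints the bidirectional class of each word in the string above, in the order it was typed (the stored, logical order). The helper `bidi_classes_per_word` is hypothetical, and the string is written with escapes so the source doesn't get visually reordered by an editor.

```python
import unicodedata

def bidi_classes_per_word(text: str) -> list[tuple[str, list[str]]]:
    """Return each space-separated word with the sorted set of its
    characters' bidirectional classes (e.g. 'L', 'R', 'AL')."""
    result = []
    for word in text.split(" "):
        classes = sorted({unicodedata.bidirectional(ch) for ch in word})
        result.append((word, classes))
    return result

# The sentence from above in typing order: esm, man, Matt, ast.
sentence = "\u0627\u0633\u0645 \u0645\u0646 Matt \u0627\u0633\u062a"

words = bidi_classes_per_word(sentence)
for word, classes in words:
    print(classes, ascii(word))
```

The Persian words come out as class AL (Arabic letter, right to left) and "Matt" as L (left to right), so the stored order matches the typing order and the surprise happens at display time. My guess, and it is only a guess, is that the base direction of the paragraph plays a part: the same stored string is laid out differently depending on whether the surrounding text is treated as left to right or right to left.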



