The Unicode Character Set and Encodings
Posted over 3 years ago
in Character Encodings.
This is the second post in my series on Character Encodings. Please see the table of contents for the series if you have not yet read the previous posts.

Since the rise of the various character encodings, there has been a quest to find the one perfect encoding we could all use. It's hard to get everyone to agree about whether or not this has truly been accomplished, but most of us agree that Unicode is as close as it gets.

The goal of Unicode was literally to provide a character set that includes all characters in use today. That's letters and numbers for all languages, all the images needed by pictographic languages, and all symbols. As you can imagine, that's quite a challenging task, but they've done very well. Take a moment to browse all the characters in the current Unicode specification to see for yourself. The Unicode Consortium often reminds us that they still have room for more characters as well, so we will be all set when we start meeting alien races.

Now in order to really understand what Unicode is, I need to clear up a point I've played pretty loose with so far: a character set and a character encoding aren't necessarily the same thing. Unicode is one character set, and has multiple character encodings. Allow me to explain.

A character set is just the mapping of symbols to their magic number representations inside the computer. Unicode calls these numbers code points and they are usually written in the form U+0061, where the U+ means Unicode and the digits that follow (at least four of them) are the code point in hexadecimal. Thus 0061 is 97 in decimal. That happens to be the Unicode code point for a and, if you remember my previous post well, you will recognize that it matches up with US-ASCII. We'll talk more about that in a bit. It is worth noting though that Ruby 1.8 and 1.9 can show you these code points:
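Here is a minimal sketch of asking Ruby for code points. It assumes Ruby 1.9 or later (for the `\u` escapes), and the helper names `code_points_of` and `u_plus` are my own, made up for illustration. The `"U*"` directive of `String#unpack` decodes UTF-8 bytes into one integer per character.

```ruby
# Returns the Unicode code points of a UTF-8 string.
# The "U" directive of String#unpack decodes one UTF-8 character,
# and "*" repeats it for the rest of the string.
def code_points_of(text)
  text.unpack("U*")
end

# Formats a code point in the conventional U+XXXX notation
# (hexadecimal, padded to at least four digits).
def u_plus(code_point)
  format("U+%04X", code_point)
end

sample = "a\u00E9\u2026" # "a", "é", and the ellipsis character

p code_points_of(sample)
# prints [97, 233, 8230]

puts code_points_of(sample).map { |cp| u_plus(cp) }.join(" ")
# prints U+0061 U+00E9 U+2026
```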
The U pattern for unpack() asks for Unicode code points.

Code points aren't what actually gets recorded in a file; they are just abstract numbers for each character. How those numbers get written into an actual data stream is an encoding. There are multiple encodings for Unicode, that is, multiple ways to record those abstract numbers into files. Different encodings have different strengths.

For example, one possible encoding of Unicode is UTF-32, where 32 bits (or four bytes) are reserved for each code point. This has the advantage that you can always count on four bytes being used (unlike variable length encodings, which we will discuss shortly). An obvious downside though is the wasted space. If you have all ASCII data, you only really need one byte each, but UTF-32 will use four without exception.

You do need to be very careful how you work with multibyte encodings. UTF-32 is a good example of one that can be pretty tricky, because parts of the data can look normal. For example, look at this simple string encoded in UTF-32:
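As a rough sketch, assuming Ruby 1.9 or later and the big-endian flavor of UTF-32 (`"UTF-32BE"`), the bytes of a three letter string look like this. The helper name `utf32_bytes` is hypothetical.

```ruby
# Transcodes text to big-endian UTF-32 and returns the raw bytes.
# Every character takes exactly four bytes in this encoding.
def utf32_bytes(text)
  text.encode("UTF-32BE").bytes.to_a
end

p utf32_bytes("abc")
# prints [0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
```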
There are a lot of null bytes in there, but notice how there are also normal "a", "b", and "c" bytes. I'm not going to show how this could happen, to avoid encouraging bad habits, but if you replaced just the "a" byte with two bytes like "ab", your encoding would be broken and would eventually cause you problems. You also have to be careful anytime you slice up a string, so that you don't cut a character in half.

Another possible encoding of Unicode is UTF-8. It has become pretty popular for things like email and web pages in recent years for several reasons. First, UTF-8 is 100% compatible with US-ASCII. The lowest 128 code points match their US-ASCII equivalents and UTF-8 encodes these in a single byte. Ruby 1.9 can show us this:
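One way to see this, sketched here under the assumption of Ruby 1.9 or later, is to transcode the same text into both encodings and compare the bytes. The helper name `bytes_in` is mine, for illustration only.

```ruby
# Transcodes text into the named encoding and returns its bytes.
def bytes_in(text, encoding_name)
  text.encode(encoding_name).bytes.to_a
end

%w[US-ASCII UTF-8].each do |name|
  encoded = "abc".encode(name)
  # Show the encoding attached to the string and the bytes behind it.
  p [encoded.encoding.name, bytes_in("abc", name)]
end
# prints ["US-ASCII", [97, 98, 99]]
# prints ["UTF-8", [97, 98, 99]]
```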
I've used several new Ruby 1.9 features here. I don't want to go too deeply into these at this point. For now, the key point to notice about this example is that US-ASCII and UTF-8 are the same all the way down to the bytes.

Of course, 128 characters isn't enough to contain the super large Unicode character set. Eventually you need more bytes. UTF-8 is a variable length encoding that uses more bytes to represent larger code points as needed. It does this with a simple set of rules:
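As a rough sketch, the standard UTF-8 bit patterns are these: a single byte character starts with a 0 bit; the first byte of a multibyte character starts with 110, 1110, or 11110 to announce two, three, or four bytes in total; and every byte after the first starts with 10. The helper below, whose name `utf8_byte_role` is hypothetical, classifies one byte by those leading bits. It only inspects the bit pattern and does not validate a whole sequence.

```ruby
# Classifies a single byte (0..255) by its leading bits under the
# standard UTF-8 layout:
#   0xxxxxxx  single byte character (same as US-ASCII)
#   110xxxxx  first byte of a two byte character
#   1110xxxx  first byte of a three byte character
#   11110xxx  first byte of a four byte character
#   10xxxxxx  continuation byte inside a multibyte character
def utf8_byte_role(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then :single
  elsif (byte & 0b1100_0000) == 0b1000_0000 then :continuation
  elsif (byte & 0b1110_0000) == 0b1100_0000 then :lead_of_2
  elsif (byte & 0b1111_0000) == 0b1110_0000 then :lead_of_3
  elsif (byte & 0b1111_1000) == 0b1111_0000 then :lead_of_4
  else :invalid
  end
end

p "\u00E9".bytes.map { |byte| utf8_byte_role(byte) }
# prints [:lead_of_2, :continuation]
```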
Again, we can ask Ruby 1.9 to show this:
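A minimal sketch of such a check, assuming Ruby 1.9 or later, prints each byte of a few characters in binary. The helper name `utf8_binary` and the choice of sample characters are mine. Note that `Integer#to_s(2)` drops leading zero bits, so a single byte character shows only seven digits.

```ruby
# Returns the UTF-8 bytes of a character as binary strings.
# Integer#to_s(2) omits leading zeros, so "a" shows seven digits.
def utf8_binary(char)
  char.encode("UTF-8").bytes.map { |byte| byte.to_s(2) }
end

# One, two, three, and four byte characters.
["a", "\u00E9", "\u2026", "\u{1D11E}"].each do |char|
  puts utf8_binary(char).join(" ")
end
# prints 1100001
# prints 11000011 10101001
# prints 11100010 10000000 10100110
# prints 11110000 10011101 10000100 10011110
```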
Notice how different characters are different lengths and how the byte patterns show what to expect, as I just described. This makes UTF-8 a little safer to manipulate, because you won't see a bare "a" byte that isn't really an "a" in the data. You do still have to be careful how you slice up a string, though, so that you don't split a multibyte character.

All of these facts combine to make UTF-8 a very good choice for a universal character encoding, in my opinion. The characters you need will be there. Simple ASCII content will be unchanged. Most software has at least some support for UTF-8 now as well.

Is Unicode perfect? No, it's not. Some characters have multiple representations. For example, the Unicode code points are actually a superset of Latin-1 and thus include single code point versions of accented characters like é. Unicode also has the concept of combining marks though, where the accent would have one code point and the letter another. Those are combined into one character when displayed. This creates some oddities where two strings can display the same text yet hold different code points.

Asian cultures have also been slow to adopt Unicode for a few reasons. First, Unicode usually makes their data larger. For example, Shift JIS can represent all the Japanese characters in two bytes while most of them will be three bytes in UTF-8. Hard drive space is pretty cheap these days, but a 1.5x multiplier on most of your data can be a factor in some cases.

The Unicode Consortium also had to make some hard choices when specifying all of these characters. One such choice, known as Han Unification, was heavily debated for a while. I think many people recognize why the decision was made these days, but the debate definitely slowed Unicode adoption, especially in Japan.

Finally, there's a lot of data out there not in a Unicode encoding. Unfortunately, there are issues that can make it hard to convert this data to Unicode flawlessly. All of these factors combine to keep the Unicode-as-a-one-encoding-fits-all philosophy from being totally flawless.
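To make the point about multiple representations concrete, here is a small sketch. It assumes a much newer Ruby than the ones discussed in this post (2.2 or later, for `String#unicode_normalize`), and the helper name `same_text?` is hypothetical.

```ruby
precomposed = "\u00E9"  # é as a single code point
combining   = "e\u0301" # "e" followed by COMBINING ACUTE ACCENT

# Compares two strings after normalizing both to the composed form (NFC),
# so that equivalent representations compare equal.
def same_text?(left, right)
  left.unicode_normalize(:nfc) == right.unicode_normalize(:nfc)
end

p precomposed.unpack("U*")           # prints [233]
p combining.unpack("U*")             # prints [101, 769]
p precomposed == combining           # prints false
p same_text?(precomposed, combining) # prints true
```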
Still, it's absolutely your best bet for support of a wide audience in a single encoding. Key take-away points:
While you indirectly say so, I think it is worth putting emphasis on the fact that UTF-8 data implicitly carry a checksum in the multi-byte sequences.
This is nice because plain text files are normally not tagged with an encoding (as they have no natural place for such a tag), but the checksum can be used instead.
For example, a user who has been using CP-1252 for all of his text files can in practice move to UTF-8 file by file by performing a UTF-8 validity check when loading a file. Should the sequence fail to be valid UTF-8, then it is one of his old CP-1252 files.
Allan has a great point there. You can use code like the following in Ruby 1.8 to validate UTF-8:
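One approach that works without any Ruby 1.9 features is to lean on `String#unpack`: its `"U*"` directive raises an `ArgumentError` when it hits malformed UTF-8. This is only a sketch of that idea, and the method name `valid_utf8_by_unpack?` is my own. It is a fairly loose check, so treat it as a quick filter rather than a strict validator.

```ruby
# Returns true if the bytes in str decode as UTF-8 according to unpack.
# String#unpack("U*") raises ArgumentError on malformed UTF-8 input.
def valid_utf8_by_unpack?(str)
  str.unpack("U*")
  true
rescue ArgumentError
  false
end

p valid_utf8_by_unpack?("caf\xC3\xA9") # "café" in UTF-8, prints true
p valid_utf8_by_unpack?("caf\xE9")     # "café" in CP-1252, prints false
```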
In Ruby 1.9, you could check that a String is UTF-8 with the simple code:
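A minimal sketch of that check uses `force_encoding` and `valid_encoding?`, both of which arrived in Ruby 1.9. The second helper follows Allan's suggestion of falling back to CP-1252. The method names `utf8_string?` and `read_as_utf8` and the default legacy encoding are assumptions of mine. Note that a few byte values are undefined in Windows-1252, and the fallback will raise on those.

```ruby
# Returns true if the bytes form valid UTF-8.
# dup keeps force_encoding from relabeling the caller's string.
def utf8_string?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

# Returns the content as a UTF-8 string. Valid UTF-8 is kept as is;
# anything else is assumed to be in the legacy encoding and transcoded.
def read_as_utf8(bytes, legacy_encoding = "Windows-1252")
  return bytes.dup.force_encoding("UTF-8") if utf8_string?(bytes)

  bytes.dup.force_encoding(legacy_encoding).encode("UTF-8")
end

p utf8_string?("caf\xC3\xA9")  # prints true
p utf8_string?("caf\xE9")      # prints false
p read_as_utf8("caf\xE9").bytes.to_a
# prints [99, 97, 102, 195, 169]
```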
Great article! It made the difference between charset and encoding clear to me.
I think it would be great to have two links per page which take the user to the next and previous topics. Instead of showing the link text as 'next' and 'previous', each link should show the title of the post directly, such as 'ruby 1.8 encoding' if that is the title of the next topic. It would be even better if you had the table of contents on each page.
I've been working on a rewrite of this blog which I will get finished with eventually. It will handle series much better, I promise. I write a lot of them so the blog needs to be more tuned to that.
Hi James,
First off, thanks for a great article. In fact thanks for the whole series. I'm not through yet but I'm sure I'll love the rest as much as the first ones.
On to my question: It's about the part where you show the binary representation of UTF-8 encoded bytes. Specifically this part:
Should this mentally be read as
Otherwise, it somehow confuses me as per the rules above it is a single byte and "1100001" is the binary representation for "a".
Correct. Ruby left out the most significant bit since it's unset. It has to be a 0 though by the rules. You've got it.
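If you would rather see the leading zero, one option is to pad each byte to eight digits. This is a small sketch assuming Ruby 1.9 or later; the helper name `padded_binary` is hypothetical.

```ruby
# Formats a byte as exactly eight binary digits, keeping leading zeros.
def padded_binary(byte)
  format("%08b", byte)
end

puts "a".ord.to_s(2)         # prints 1100001
puts padded_binary("a".ord)  # prints 01100001
```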
Thanks James for the excellent tutorial! Thanks Allan for giving the straight solution to my problem! :-)