Bytes and Characters in Ruby 1.8
Posted 2 months ago in Character Encodings and Ruby Tutorials.
This is the fourth post in my series on Character Encodings. Please see the table of contents for the series if you have not yet read the previous posts.

Gregory Brown said, in a training session at the Lone Star Rubyconf, "Ruby 1.8 works in bytes. Ruby 1.9 works in characters." The truth of Ruby 1.9 is maybe a little more complicated and we will discuss all of that eventually, but Greg is dead right about Ruby 1.8.

In Ruby 1.8, a string is just a collection of bytes.

The important question is, how does that one golden rule relate to all that we've learned about character encodings? Essentially, it puts all the responsibility on you as the developer. Ruby 1.8 leaves it to you to determine what to do with those bytes and it doesn't provide a lot of encoding-savvy help. That's why knowing at least the basics of encodings is so important when working with Ruby 1.8.

There are plusses and minuses to every system and this one is no exception. On the side of plusses, Ruby 1.8 can pretty much support any encoding you can imagine. After all, a character encoding is just some bytes that somehow map to a set of characters, and bytes are all Ruby 1.8 sees.

I won't lie to you though, there are more minuses than plusses to this approach. Latin-1 is a pretty simple case since each byte is a character. With many other encodings though, like the UTF-8 encoding I've recommended we rely on, things get a lot more complicated. Slicing up a Ruby 1.8 string by byte positions can cut a multi-byte character in half.

Ruby 1.8 is also never going to police the contents of a string.

This may be starting to sound a little bleak and it probably is. However, Ruby 1.8 throws one major exception into the works that can help you in many cases: the regex engine is aware of four character encodings. Often we can use this simple fact to work with characters.

What encodings does Ruby 1.8 know? Here's the full list:

None (n or N)
EUC (e or E)
SHIFT_JIS (s or S)
UTF-8 (u or U)
The None encoding is the default in Ruby 1.8. It's just the golden rule I've already mentioned: treat everything as bytes. If your encoding isn't on this list, you will need to use None and be darn sure you don't do anything to the data that could damage the encoding. That's very hard, and the fact is that doing significant work with an encoding not on the above list in Ruby 1.8 will be quite a challenge for you.

Both EUC (Extended Unix Code) and SHIFT_JIS are primarily Asian character encodings. SHIFT_JIS is a Japanese encoding and EUC is mainly used for Japanese, Korean, and simplified Chinese. You can tell Ruby comes from Japan, can't you? Obviously these are very helpful if you work with text in those languages, but the rest of us won't need them much.

Now we get to the good news: our champion UTF-8 made the list! Yes, this means Ruby 1.8 has limited support for working with UTF-8 data. It's not comprehensive, but we get some help.

The letters listed after each encoding are used in multiple places inside Ruby 1.8 to tell it which encoding you need to work with. I'll point those places out as we get into the details.

What does it mean to have a character encoding on the above list? It means that the regex engine can recognize characters in that encoding, even if they are multi-byte. That assures us that regular expression constructs that target characters, like character classes, work on whole characters instead of single bytes.

Let's look at some examples of this, so you can see how it works. I'll play around with a simple UTF-8 string.

A common task in working with characters in Ruby 1.8 is to convert a string into a list of its characters.
Done naively, that goes wrong: the regex engine stays in the default None mode, because we didn't tell it to do otherwise, and it treats every byte as a character. However, if we throw the regex engine into UTF-8 mode, we will get actual characters:
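Here is a minimal sketch of that idea. The sample string is my own choice for illustration, and I'm assuming the u option on a regex literal is how you select UTF-8 mode for a single expression. The string is written with byte escapes so the source file stays pure ASCII.

```ruby
# "r\xC3\xA9sum\xC3\xA9" is the UTF-8 encoding of "résumé" (sample data,
# chosen for this sketch).
str = "r\xC3\xA9sum\xC3\xA9"

# The u option puts this one regex into UTF-8 mode, and m lets . match
# newlines too, so each match is one complete character.
chars = str.scan(/./mu)

p chars
```

Each "é" comes back as a single two-byte element instead of two separate one-byte pieces.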
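Notice how the bytes needed for a multi-byte character now stay together. I chose UTF-8 mode by adding the u option to the end of the regex.

Using this one simple trick, we can fix some of the unsafe string operations. Sizes are still measured in bytes, but we can now count characters, if desired: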
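A small sketch of the difference, using a sample string of my own. I count bytes with unpack("C*") rather than calling size directly, so the numbers come out the same on Ruby 1.8 and on later versions; in Ruby 1.8 itself, size reports the byte count.

```ruby
str = "r\xC3\xA9sum\xC3\xA9"   # UTF-8 bytes for "résumé" (sample data)

byte_count = str.unpack("C*").size   # one entry per byte
char_count = str.scan(/./mu).size    # one match per UTF-8 character

puts "#{byte_count} bytes, #{char_count} characters"
# prints: 8 bytes, 6 characters
```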
We can also fix the dangerous operations that rearrange raw bytes:
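Reversing is one operation where working on raw bytes does damage, so I'll use it as the illustration here. That choice is mine, as is the sample string. The byte-wise version is spelled out with unpack and pack so the sketch behaves the same on Ruby 1.8 and later versions.

```ruby
str = "r\xC3\xA9sum\xC3\xA9"   # UTF-8 bytes for "résumé" (sample data)

# Reversing the raw bytes also flips the two bytes inside each "é",
# which leaves invalid UTF-8 behind.
byte_reversed = str.unpack("C*").reverse.pack("C*")

# Splitting into characters first keeps each byte pair intact.
char_reversed = str.scan(/./mu).reverse.join

p byte_reversed.unpack("C*")   # => [169, 195, 109, 117, 115, 169, 195, 114]
p char_reversed.unpack("C*")   # => [195, 169, 109, 117, 115, 195, 169, 114]
```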
This time we use the regex engine to divide the string into characters first.

Really study these examples above until you understand what's going on here. This is all the support Ruby 1.8 provides for working with characters, so you need to understand how to use it.

Here's one last set of examples showing the other regex change I mentioned:
In the default None mode, Ruby 1.8 doesn't provide a whole lot of additional encoding support outside the regex engine. There is one magic variable and some helpful standard libraries we will discuss in future posts, but the main part of Ruby 1.8's character encoding support is just this.

One other small feature that may be worth a quick mention is that you can get Unicode code points out of UTF-8 data.
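One way I know to do this is String#unpack with the U* directive, which reads the bytes as UTF-8 and returns one Integer per character. Treat this as a sketch with sample data of my own.

```ruby
str = "r\xC3\xA9sum\xC3\xA9"   # UTF-8 bytes for "résumé" (sample data)

# U* decodes the whole string as UTF-8, one code point per character.
code_points = str.unpack("U*")

p code_points   # => [114, 233, 115, 117, 109, 233]
```

Note that "é" shows up as the single code point 233, not as its two bytes.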
I don't find myself needing to work with code points often, but you can use this for one interesting cheat. The Unicode code points are a superset of the byte values used in Latin-1, so you can actually convert between the two encodings just by reading the data as one and writing the same numbers back out as the other.
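A sketch of that cheat, assuming String#unpack and Array#pack with the C* (bytes) and U* (UTF-8 code points) directives. The variable names and sample data are mine.

```ruby
latin1 = "r\xE9sum\xE9"   # "résumé" in Latin-1, one byte per character

# Latin-1 to UTF-8: read plain bytes, write them back as code points.
utf8 = latin1.unpack("C*").pack("U*")

# UTF-8 to Latin-1: read code points, write them back as plain bytes.
back = utf8.unpack("U*").pack("C*")

p utf8.unpack("C*")   # => [114, 195, 169, 115, 117, 109, 195, 169]
p back.unpack("C*")   # => [114, 233, 115, 117, 109, 233]
```

The second direction is only safe when every code point is 255 or lower, since anything larger has no Latin-1 byte to map to.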
However, I'll show you a superior way to handle encoding conversions in a future post.

It's important to remember that this is not full character encoding support. For example, there is a long list of rules about how to correctly convert some Unicode characters to upper case, but Ruby 1.8 does not implement them.