The $KCODE Variable and jcode Library

This is the fifth post in my series on Character Encodings. Please see the table of contents for the series if you have not yet read the previous posts.

All of the Ruby files I create start with the same shebang line:

#!/usr/bin/env ruby -wKU

It's not really needed for every file since it generally only matters if the file is executed. However, I tend to go ahead and add it to all Ruby files I build for several reasons:

  • You never know when a file may be executed (if __FILE__ == $PROGRAM_NAME; end sections are often added to libraries, for example)
  • It makes it obvious the file is Ruby code
  • It shows the rules this code expects: -w and -KU

The rules I mention here, specified by command-line switches, are the main point of interest. -w turns on Ruby's warnings, which are very handy. I recommend doing that whenever you can. But that doesn't have anything to do with character encodings. -KU does.

-KU sets a magic Ruby variable: $-K or $KCODE. You can do the same in your code if you aren't in a position to control the command-line arguments:

$KCODE = "U"

You probably recognize the U as a name for Ruby 1.8's UTF-8 encoding, from my earlier list of encodings. It can also be set to N (the default), E, or S. Modern versions of Rails do set $KCODE = "U" for you.

So what does changing this magic variable do? First, it has the tiny effect of changing what Ruby escapes in inspect() output. Have a look:

$ ruby -e 'p "Résumé"'
"R\303\251sum\303\251"
$ ruby -KUe 'p "Résumé"'
"Résumé"

It's nice to be able to see your data as it actually is, assuming your terminal correctly handles UTF-8. However, that's really just a side-effect of setting $KCODE.
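
If the escaped form looks cryptic, it's just the UTF-8 bytes of each é written in octal. Here's a minimal sketch that rebuilds the escapes by hand. The octal_escape name is made up for illustration; it is not something Ruby provides:

```ruby
# Hypothetical helper: show each byte of a string as a backslash-octal escape,
# similar to what inspect() prints for non-ASCII bytes when $KCODE is unset.
def octal_escape(string)
  string.unpack("C*").map { |byte| "\\%o" % byte }.join
end

bytes   = "é".unpack("C*")   # the UTF-8 bytes behind a single character
escaped = octal_escape("é")
puts escaped
```

One character turning into more than one byte is why the escaped output is longer than the word itself.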

The main purpose of $KCODE is that it changes the default encoding of all regular expressions that do not specify otherwise. Thus we can split up UTF-8 data by characters without adding a /u to the end of our expression:

$ ruby -e 'p "Résumé".scan(/./m)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]
$ ruby -KUe 'p "Résumé".scan(/./m)'
["R", "é", "s", "u", "m", "é"]
$ ruby -KUe 'p "Résumé".scan(/./mn)'
["R", "\303", "\251", "s", "u", "m", "\303", "\251"]

Notice that the default encoding for that second example was switched to UTF-8. However, I can still override this with an explicit encoding, as I did in example three by adding the /n option for None.

Now, I tend to prefer $KCODE over $-K because the former seems more common in Ruby literature. In fact, Ruby 1.8 uses the term in another place, providing a method to get the encoding used in a Regexp:

$ ruby -e 'p /./.kcode'
nil
$ ruby -e 'p /./u.kcode'
"utf8"

Beware of that harmless-looking kcode() method though, as it hides quite a few gotchas. First, you can see that it has its own names for the options that don't really match up with what we've seen elsewhere. It also doesn't seem to be aware of the $KCODE variable, in an ironic twist of naming:

$ ruby -e '$KCODE = "U"; re = /./m; p "Résumé".scan(re); p re.kcode'
["R", "é", "s", "u", "m", "é"]
nil

As you can see, the encoding of the expression was clearly set correctly, but kcode() didn't report the change. If you really want to know the encoding of a Regexp in Ruby 1.8, I suggest using code like the following:

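class Regexp
  # Returns the one-letter encoding in effect for this expression:
  # "n", "e", "s", or "u".
  def encoding
    if kcode
      # An explicit option such as /u wins; kcode() reports "utf8", "none", etc.
      kcode[0, 1]
    else
      # Reading $KCODE returns a normalized name such as "UTF8" or "NONE".
      letter = $KCODE[0, 1].downcase
      %w[n e s u].include?(letter) ? letter : "n"
    end
  end
end

Using just the first letter of kcode() should get us back to a standard set of letters. If kcode() isn't set, we can use $KCODE. However, do note that I reduce it to its first letter and make sure that letter is an expected value: Ruby 1.8 reports a normalized name such as "UTF8" or "NONE" when you read the variable, not the letter you assigned. You can set $KCODE to any junk value and Ruby will just silently ignore it (defaulting back to N), so it's good to reality check the contents when you rely on it. Finally, we just return the default if neither appears to be set.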
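
The same decision can also be sketched as a plain function that takes both values as arguments, which makes the fallback order easy to exercise without touching Regexp or the global. Treat this as an illustration under assumptions: effective_kcode and KCODE_LETTERS are hypothetical names, and the sample inputs mimic the kind of strings kcode() and $KCODE hand back.

```ruby
# Hypothetical names throughout; this only models the fallback logic.
KCODE_LETTERS = %w[n e s u]

# regexp_kcode: what Regexp#kcode returned (nil when no option was given)
# global_kcode: what $KCODE held at the time
def effective_kcode(regexp_kcode, global_kcode)
  [regexp_kcode, global_kcode].each do |name|
    next if name.nil? || name.empty?
    letter = name[0, 1].downcase           # "utf8" and "UTF8" both become "u"
    return letter if KCODE_LETTERS.include?(letter)
  end
  "n"                                      # nothing usable, so assume None
end

explicit  = effective_kcode("utf8", "NONE")  # option on the expression wins
inherited = effective_kcode(nil, "UTF8")     # falls back to the global
fallback  = effective_kcode(nil, "junk")     # unexpected value is ignored
```

Passing the values in, rather than reading them inside the method, keeps the check honest: you can see at a glance which source supplied the answer.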

That's really all there is to know about $KCODE, but Ruby 1.8 ships with a simple standard library called jcode that combines well with everything we've been discussing in these last two posts.

To use the jcode library, set $KCODE and then require the library. Setting $KCODE first is important, and you will receive a warning if you require jcode without setting it (as long as you took my advice and turned warnings on):

$ ruby -r jcode -e 'p "Résumé".jsize'
8
$ ruby -w -r jcode -e 'p "Résumé".jsize'
Warning: $KCODE is NONE.
8

See, I told you -w was important.

As long as you do have $KCODE set properly, jcode adds a bunch of methods to String that work in characters. These methods are just simple wrappers over the techniques I showed you in my last post, so you get methods like jsize() which returns a count of characters instead of bytes:

$ ruby -KU -r jcode -e 'p "Résumé".jsize'
6
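
To see the byte-versus-character difference without jcode, here's a rough approximation built on an explicit /u expression, so it doesn't depend on $KCODE at all. This is not jcode's actual implementation, and utf8_char_count is a hypothetical name:

```ruby
# Hypothetical helper, not part of jcode: count UTF-8 characters by
# scanning with an explicitly UTF-8 regular expression.
def utf8_char_count(string)
  string.scan(/./mu).size
end

word       = "Résumé"
byte_count = word.unpack("C*").size   # every byte counts as one
char_count = utf8_char_count(word)    # each é counts once
puts "#{byte_count} bytes, #{char_count} characters"
```

The two numbers line up with the 8 and 6 that jsize() reported above.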

Probably the most useful method jcode adds is each_char():

$ ruby -KU -r jcode -e '"Résumé".each_char { |c| p c }'
"R"
"é"
"s"
"u"
"m"
"é"

See the documentation for the full method list.
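
For a feel of what a character iterator like each_char() involves, here's a minimal sketch using the same regular expression trick with an explicit /u. The each_utf8_char name is hypothetical, and this is an approximation rather than jcode's own code:

```ruby
# Hypothetical helper, not part of jcode: yield each UTF-8 character in turn.
def each_utf8_char(string)
  string.scan(/./mu) { |char| yield char }
  string
end

collected = []
each_utf8_char("Résumé") { |char| collected << char }
puts collected.join(" ")
```

Because the expression carries its own /u, this version behaves the same whether or not $KCODE has been set.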

Tim Morgan added about 5 hours later:

This is the best post yet. I was afraid the only way to work with Unicode strings properly in 1.8 was with Regexps. I'll be taking a peek at jcode.

BTW, thanks for this series of posts. I'm not sure there is anything this comprehensive anywhere else. If there is, I haven't found it.

James Edward Gray II added about 17 hours later:

jcode is far from comprehensive, but it can save you a few trips to regular expressions for some simple cases, yes. For real character-savvy manipulations, see Ruby 1.9.
