MySQL

The world's most popular open source database

Contact a MySQL Representative


  • MySQL.com
  • Developer Zone
  • Partners & Solutions
  • Customer Login
  • DevZone
  • Downloads
  • Documentation
  • Articles
  • Forums
  • Bugs
  • Forge
  • Blogs
 
  • Pages

    • Profile
  • Archives

    • October 2008
  • Categories

    • encoding (1)



Yoshinori Matsunobu’s blog

Unicode is not a silver bullet

Recently many applications have used Unicode(UTF-8) as server/client character sets, not latin1.

This is a good trend. Unicode is much better than latin1/ISO-8859-1 from internationalization(i18n) perspective. Multi-byte characters (i.e. Chinese/Japanese/Korean characters: called CJK) are corrupted on latin1. CJK characters are handled correctly on Unicode (in most cases).

But Unicode doesn’t solve all provlems. Good enough in many cases, but not the best especially from data size and performance perspective.

The biggest problem is that UTF-8 uses 3 bytes per single Japanese/Korean/Chinese character.  UTF-16 uses 2 bytes, but it also uses 2 bytes for ASCII characters.
On the other hand, Japanese local encodings (Shift_JIS/EUC_JP) use one byte for ASCII characters, 2 bytes for Japanese characters.

If your text data has x ASCII characters and y Japanese characters, you can calculate total bytes as follows:

  • UTF-8: x + 3y
  • UTF-16: 2x + 2y
  • CP932(Shift_JIS, local encoding): x + 2y
  • EUCJP-MS (EUC-JP, local encoding): x + 2y

Not only multi-byte characters, but also A-Z, a-z, 0-9, [, ], (, ), ?, !, white space etc are frequently used in Japan.  So almost all cases UTF-8 and UTF-16 requires many more spaces than local encodings.

Note that there are some exceptions (i.e. 4 bytes special kanji for UTF-8/UTF-16, 1-byte Half-width katakana for Shift_JIS, 3-bytes special kanji for EUC-JP), but they can be negligible for roughly calculating space requirements.

To verify the above elementary school formula, I calculated the size of Wikipedia Japan’s contents, whose data file is published.

$ time wget http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
real    26m36.267s
user    0m1.130s
sys     0m9.953s
$ time bunzip2 -c jawiki-latest-pages-articles.xml.bz2 | /usr/local/bin/xml2sql -v
real    8m6.294s
user    7m56.311s
sys     0m21.586s
$ ls -l *.txt
-rw-r--r--  1 ymatsunobu ymatsunobu   85057296 Oct  2 21:58 page.txt
-rw-r--r--  1 ymatsunobu ymatsunobu  100036735 Oct  2 21:58 revision.txt
-rw-r--r--  1 ymatsunobu ymatsunobu 2869843281 Oct  2 21:58 text.txt

text.txt has actual data of Wikipedia Japan (This can be loaded into MySQL by LOAD DATA INFILE). Encoding is UTF-8.  To see size differences by encodings, iconv is useful.

$ mv text.txt text-utf8.txt
$ time iconv -f UTF-8 -t UTF-16 text-utf8.txt > text-utf16.txt
real    1m28.138s
user    1m10.088s
sys     0m14.627s

$ time iconv -f UTF-8 -t CP932 -c text-utf8.txt > text-cp932.txt
real    1m53.932s
user    1m41.135s
sys     0m11.646s

“iconv -c” is skipping conversion errors. Since Unicode is a superset of local encodings, there are some words (i.e. Chinese) which can not be converted to Japanese encodings. The number of these words are small enough so can be negligible to see size differences.

$ time iconv -f UTF-8 -t EUCJP-MS -c text-utf8.txt > text-eucjpms.txt
real    1m41.212s
user    1m27.993s
sys     0m10.791s
$ ls -l *.txt
-rw-r--r--  1 ymatsunobu ymatsunobu 2141663168 Oct  2 22:09 text-cp932.txt
-rw-r--r--  1 ymatsunobu ymatsunobu 2144718476 Oct  2 22:07 text-eucjpms.txt
-rw-r--r--  1 ymatsunobu ymatsunobu 2869843281 Oct  2 21:58 text-utf8.txt
-rw-r--r--  1 ymatsunobu ymatsunobu 2848619232 Oct  2 22:03 text-utf16.txt

As you see, UTF-8 uses 34% more spaces than CP932(Japanese local encoding) in Wikipedia Japan. Although it is often said that UTF-16 is good for handling East Asian characters (because uses only 2 bytes per multi-byte character), UTF-16 uses 33% more spaces than CP932.  This is obviously because ASCII characters are used many times in this text file.  The difference between EUCJP-MS and CP932 is only 0.14% (caused by half-width katakana and vendor specific kanji).

The size differences highly depend on how many ASCII or multi-byte characters are used, as described above formula. For ascii-mostly-values, UTF-8 fits well.
For multibyte-mostly-values, UTF-16 fits well.  I felt Wikipedia’s text file contains more ASCII characters([, ], etc) than other typical Japanese sites’ text data, but ASCII characters are frequently used anyways. So neither UTF-8 nor UTF-16 fits well in many cases.

This is one of the biggest reasons why local encodings are still widely used in Japan. Not many people want to buy additional disks/servers to implement same functionality.

Another approach to reduce data size is compression.

$ time gzip --fast text-utf8.txt

real    1m52.792s
user    1m41.835s
sys     0m7.268s

$ time gzip --fast text-utf16.txt

real    1m56.880s
user    1m42.211s
sys     0m7.961s

$ time gzip --fast text-cp932.txt
real    1m39.633s
user    1m32.007s
sys     0m6.433s

$ time gzip --fast text-eucjpms.txt
real    1m29.093s
user    1m21.081s
sys     0m6.078s
$ ls -l *.gz
-rw-r--r--  1 ymatsunobu ymatsunobu 1000234325 Oct  2 22:09 text-cp932.txt.gz
-rw-r--r--  1 ymatsunobu ymatsunobu  997667856 Oct  2 22:07 text-eucjpms.txt.gz
-rw-r--r--  1 ymatsunobu ymatsunobu 1121780050 Oct  2 22:03 text-utf16.txt.gz
-rw-r--r--  1 ymatsunobu ymatsunobu 1187938525 Oct  2 21:58 text-utf8.txt.gz

In this case, compressed UTF-8 is 19% larger than compressed CP932, but 45% smaller than uncompressed CP932. This would be acceptable in many cases though local encodings are still better.

This entry was posted on Thursday, October 2nd, 2008 at 5:33 pm and is filed under encoding. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

2 Responses to “Unicode is not a silver bullet”

  1. Tim Bray Says:
    October 3rd, 2008 at 12:59 am

    Quoting: “Since Unicode is a superset of local encodings, there are some words (i.e. Chinese) which can not be converted to Japanese encodings.” and that’s the problem. If you never have to mix text, local encodings are great. But in general, usually you do.

    Also, my disk is almost full of pictures and audio and video, the text is in the rounding error :)

  2. yoshinori Says:
    October 3rd, 2008 at 1:31 am

    Tim,
    I understand. Requirements highly depend on countries. In Japan, many applications are fine if both Japanese and English characters are correctly stored and retrieved.

    In general, pictures, audio and video should not be stored on database servers, but on file servers because database servers are not easier to scale out. Then size difference between encodings is relatively going bigger.

Leave a Reply

Yoshinori Matsunobu’s blog is proudly powered by WordPress MU running on Blogs.mysql.com.
Entries (RSS) and Comments (RSS).