Unicode is not a silver bullet
Recently many applications have used Unicode(UTF-8) as server/client character sets, not latin1.
This is a good trend. Unicode is much better than latin1/ISO-8859-1 from internationalization(i18n) perspective. Multi-byte characters (i.e. Chinese/Japanese/Korean characters: called CJK) are corrupted on latin1. CJK characters are handled correctly on Unicode (in most cases).
But Unicode doesn’t solve all provlems. Good enough in many cases, but not the best especially from data size and performance perspective.
The biggest problem is that UTF-8 uses 3 bytes per single Japanese/Korean/Chinese character. UTF-16 uses 2 bytes, but it also uses 2 bytes for ASCII characters.
On the other hand, Japanese local encodings (Shift_JIS/EUC_JP) use one byte for ASCII characters, 2 bytes for Japanese characters.
If your text data has x ASCII characters and y Japanese characters, you can calculate total bytes as follows:
- UTF-8: x + 3y
- UTF-16: 2x + 2y
- CP932(Shift_JIS, local encoding): x + 2y
- EUCJP-MS (EUC-JP, local encoding): x + 2y
Not only multi-byte characters, but also A-Z, a-z, 0-9, [, ], (, ), ?, !, white space etc are frequently used in Japan. So almost all cases UTF-8 and UTF-16 requires many more spaces than local encodings.
Note that there are some exceptions (i.e. 4 bytes special kanji for UTF-8/UTF-16, 1-byte Half-width katakana for Shift_JIS, 3-bytes special kanji for EUC-JP), but they can be negligible for roughly calculating space requirements.
To verify the above elementary school formula, I calculated the size of Wikipedia Japan’s contents, whose data file is published.
$ time wget http://download.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 real 26m36.267s user 0m1.130s sys 0m9.953s
$ time bunzip2 -c jawiki-latest-pages-articles.xml.bz2 | /usr/local/bin/xml2sql -v real 8m6.294s user 7m56.311s sys 0m21.586s
$ ls -l *.txt -rw-r--r-- 1 ymatsunobu ymatsunobu 85057296 Oct 2 21:58 page.txt -rw-r--r-- 1 ymatsunobu ymatsunobu 100036735 Oct 2 21:58 revision.txt -rw-r--r-- 1 ymatsunobu ymatsunobu 2869843281 Oct 2 21:58 text.txt
text.txt has actual data of Wikipedia Japan (This can be loaded into MySQL by LOAD DATA INFILE). Encoding is UTF-8. To see size differences by encodings, iconv is useful.
$ mv text.txt text-utf8.txt
$ time iconv -f UTF-8 -t UTF-16 text-utf8.txt > text-utf16.txt real 1m28.138s user 1m10.088s sys 0m14.627s $ time iconv -f UTF-8 -t CP932 -c text-utf8.txt > text-cp932.txt real 1m53.932s user 1m41.135s sys 0m11.646s
“iconv -c” is skipping conversion errors. Since Unicode is a superset of local encodings, there are some words (i.e. Chinese) which can not be converted to Japanese encodings. The number of these words are small enough so can be negligible to see size differences.
$ time iconv -f UTF-8 -t EUCJP-MS -c text-utf8.txt > text-eucjpms.txt real 1m41.212s user 1m27.993s sys 0m10.791s
$ ls -l *.txt -rw-r--r-- 1 ymatsunobu ymatsunobu 2141663168 Oct 2 22:09 text-cp932.txt -rw-r--r-- 1 ymatsunobu ymatsunobu 2144718476 Oct 2 22:07 text-eucjpms.txt -rw-r--r-- 1 ymatsunobu ymatsunobu 2869843281 Oct 2 21:58 text-utf8.txt -rw-r--r-- 1 ymatsunobu ymatsunobu 2848619232 Oct 2 22:03 text-utf16.txt
As you see, UTF-8 uses 34% more spaces than CP932(Japanese local encoding) in Wikipedia Japan. Although it is often said that UTF-16 is good for handling East Asian characters (because uses only 2 bytes per multi-byte character), UTF-16 uses 33% more spaces than CP932. This is obviously because ASCII characters are used many times in this text file. The difference between EUCJP-MS and CP932 is only 0.14% (caused by half-width katakana and vendor specific kanji).
The size differences highly depend on how many ASCII or multi-byte characters are used, as described above formula. For ascii-mostly-values, UTF-8 fits well.
For multibyte-mostly-values, UTF-16 fits well. I felt Wikipedia’s text file contains more ASCII characters([, ], etc) than other typical Japanese sites’ text data, but ASCII characters are frequently used anyways. So neither UTF-8 nor UTF-16 fits well in many cases.
This is one of the biggest reasons why local encodings are still widely used in Japan. Not many people want to buy additional disks/servers to implement same functionality.
Another approach to reduce data size is compression.
$ time gzip --fast text-utf8.txt real 1m52.792s user 1m41.835s sys 0m7.268s $ time gzip --fast text-utf16.txt real 1m56.880s user 1m42.211s sys 0m7.961s $ time gzip --fast text-cp932.txt real 1m39.633s user 1m32.007s sys 0m6.433s $ time gzip --fast text-eucjpms.txt real 1m29.093s user 1m21.081s sys 0m6.078s
$ ls -l *.gz -rw-r--r-- 1 ymatsunobu ymatsunobu 1000234325 Oct 2 22:09 text-cp932.txt.gz -rw-r--r-- 1 ymatsunobu ymatsunobu 997667856 Oct 2 22:07 text-eucjpms.txt.gz -rw-r--r-- 1 ymatsunobu ymatsunobu 1121780050 Oct 2 22:03 text-utf16.txt.gz -rw-r--r-- 1 ymatsunobu ymatsunobu 1187938525 Oct 2 21:58 text-utf8.txt.gz
In this case, compressed UTF-8 is 19% larger than compressed CP932, but 45% smaller than uncompressed CP932. This would be acceptable in many cases though local encodings are still better.

October 3rd, 2008 at 12:59 am
Quoting: “Since Unicode is a superset of local encodings, there are some words (i.e. Chinese) which can not be converted to Japanese encodings.” and that’s the problem. If you never have to mix text, local encodings are great. But in general, usually you do.
Also, my disk is almost full of pictures and audio and video, the text is in the rounding error
October 3rd, 2008 at 1:31 am
Tim,
I understand. Requirements highly depend on countries. In Japan, many applications are fine if both Japanese and English characters are correctly stored and retrieved.
In general, pictures, audio and video should not be stored on database servers, but on file servers because database servers are not easier to scale out. Then size difference between encodings is relatively going bigger.