MySQL

The world's most popular open source database

Contact a MySQL Representative


  • MySQL.com
  • Developer Zone
  • Partners & Solutions
  • Customer Login
  • DevZone
  • Downloads
  • Documentation
  • Articles
  • Forums
  • Bugs
  • Forge
  • Blogs
 
  • You are currently browsing the New Features In MySQL 6.x weblog archives for April, 2008.

  • Pages

    • About
    • Find and store the error return value in procedures or functions
  • Archives

    • August 2009
    • July 2009
    • June 2009
    • May 2009
    • April 2009
    • March 2009
    • February 2009
    • January 2009
    • December 2008
    • November 2008
    • October 2008
    • August 2008
    • July 2008
    • June 2008
    • May 2008
    • April 2008
  • Categories

    • MySQL 5.1 Features (3)
    • MySQL 5.4 New Features (2)
    • MySQL 6.0 New Features (5)
    • MySQL 6.x New Features (5)
    • News (8)
    • Personal Opinion (4)
    • Tiny Tweaks (10)
    • Uncategorized (19)



New Features In MySQL 6.x

Archive for April, 2008

Starting new blog on MySQL 6.x

Thursday, April 24th, 2008

MySQL Version 6.x is alpha. This blog is about its new features.

Expect a new article every week or two.

The first article,
MySQL 6.0 Feature #1: Supplementary Characters
is out now.
http://blogs.mysql.com/peterg/2008/04/

Posted in News | No Comments »

MySQL 6.0 Feature #1: Supplementary Characters

Sunday, April 20th, 2008

Supplementary Characters
MySQL 6.0 New Features Document#1
2008-04-20

MySQL 6.0.5 is about to appear on the “MySQL 6.0 Downloads” page http://dev.mysql.com/downloads/mysql/6.0.html

This is MySQL’s alpha version, not to be confused with MySQL 5.1, which has been ‘release candidate’ for a while. The big 6.0 features are “online backup”, “Falcon”, and “supplementary characters”. This document is about the new supplementary characters feature.

What

Since MySQL 4.1, MySQL has supported two Unicode character sets, ucs2 and utf8. Luckily we don’t have to explain all about “Unicode” because there’s good definition and explanation. So if you need help with Unicode in general, read here:
http://www.unicode.org (Unicode reference)
or here:
http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html (MySQL manual)
We can concentrate in this document on just those aspects that affect the MySQL 6.0 implementation.

Specifically MySQL has supported in the past the characters of Unicode Version 3.0, in the Basic Multilingual Plane (BMP). The BMP characters have these characteristics:
* their code values are between 0 and 65535 (or U+0000 … U+FFFF)
* they can be encoded with a fixed 16-bit word, as in ucs2
* they can be encoded with variable 8, 16, or 24 bits, as in utf8
* they’re good enough for almost all characters in major languages.

So almost all MySQL users are almost always happy, and that’s no longer good enough. In the latest Unicode version 5.0, there are 99089 defined characters. They can’t all fit in the BMP. Characters that aren’t in the BMP are called “Supplementary Characters”. Most of the new characters are rare ideographs in Chinese or Japanese, some others are for languages that weren’t in Unicode before.

To support supplementary characters, MySQL has to introduce new character sets which have a broader range than, and therefore can take more space than, the old ucs2 and utf8.

The task description for supplementary characters went up on the MySQL Forge site in January 2007, with the title
WL#1213: Implement 4-byte UTF8, UTF16 and UTF32
http://forge.mysql.com/worklog/task.php?id=1213
(MySQL Forge is where MySQL occasionally publishes specifications for public review. It shows some but not all of the “Work Log” tasks that MySQL may or may not be introducing in coming versions.) The “Architecture Review” stage was passed in July. The coding is now complete except for fixing the inevitable bugs.

Comparing old and new versions, see these major differences:

MySQL 5.1                                MySQL 6.0

---------                                ---------

all Unicode 3.0 characters               all Unicode 5.0 characters

no supplementary characters              with supplementary characters

ucs2 character set                       no change

utf8 character set for up to 3 bytes     utf8 character set for up to 4 bytes

                                         utf16 character set

                                         utf32 character set

                                         utf8mb3 character set for up to 3 bytes

The changes are upward compatible, except that the maximum length of a utf8 column or index is, in terms of characters, smaller.

Why

MySQL always pays attention to Unicode improvements, so it was inevitable that this change would occur eventually. The reason that it has to happen now is: Microsoft Windows Vista. Vista supports supplementary characters for display, and the Windows API has the necessary functionality for manipulating them.

In particular, Vista allows all the characters defined by JIS X 0213:2004 (”Japanese Industrial Standard 0213 as revised in 2004″), which is often referred to as “JIS:2004″. The new characters of JIS:2004, although rare, might appear in Japanese texts created in the future. Some of the JIS:2004 characters are in the BMP, but many are supplementary. For a quick introduction to JIS:2004 considerations, see
http://en.wikipedia.org/wiki/JIS_X_0213
paying especial attention to the link about “Mapping Tables between JIS X 0213 and Unicode”
http://x0213.org/codetable/index.en.html
Similar considerations apply for the Chinese Industrial Standard, gb18030, which supersedes gb2312 and gbk. See
http://en.wikipedia.org/wiki/GB18030
It’s important to emphasize here that the MySQL 6.0 “Supplementary Characters” feature is adding support for more Japanese and Chinese characters, but not adding support for new Japanese and Chinese character sets. The new characters will only exist in MySQL’s Unicode character sets. Getting into Unicode is an essential first step.

Particularly, some users objected to this behaviour:
User defines a MySQL table with a UTF8 column.
User enters a string containing two characters: an invalid supplementary character, and a valid BMP character.
MySQL does not recognize the supplementary-character code.
But MySQL doesn’t know where the ‘invalid’ character ends. (UTF8 characters are variable length.)
So MySQL skips the rest of the string too.
The second character, the valid BMP character, is lost.
See Bug#9337 and Bug#12600 in MySQL’s bugs database, bugs.mysql.com. Users had to avoid trying to insert supplementary characters, or redefine the column as a BLOB.

Apart from East Asian language considerations, MySQL’s new feature will also interest users of languages or writing systems which were not in Unicode 3.0, such as: Gothic, Tagalog, Glagolitic, Runic, Balinese, and Cuneiform. See also
http://babelstone.blogspot.com/2005/11/whats-new-in-unicode-50.html
Sorry, folks, no Egyptian hieroglyphs until Unicode 5.2 arrives.

How Much

There is a Unicode page which shows when characters first got Unicode values
http://www.unicode.org/Public/UNIDATA/DerivedAge.txt
By counting the items in “Newly assigned” lists on this page, we see that there are 2816 BMP characters which are “new” since version 3.0 and 47046 supplementary characters which are “new” since version 3.0. (Prior to Unicode 3.0, there were no supplementary characters.)

                      BMP        Supplementary Total

                      ---------  ------------- -----

In 3.1: March 2001       2               44976 44978

In 3.2: March 2002    1014                   2  1016

In 4.0: April 2003     449                 777  1226

In 4.1: March 2005    1011                 262  1273

In 5.0: July 2006      340                1029  1369

                      ----               ----- -----

                      2816               47046 49862

Most of the “new BMP” characters are already okay in MySQL 5.1, and all of them will be okay in all Unicode character sets in MySQL 6.0.

None of the “new supplementary” characters are okay in MySQL 5.1, but all of them will be okay in most of the Unicode character sets in MySQL 6.0.

Relating the Unicode Standard to the SQL Standard

MySQL’s goal is to “follow” the 2003 version of the SQL standard (SQL:2003), like the MySQL Reference Manual says. But the next version of the SQL standard, tentatively known as SQL:2008, is already circulating in draft form and is unlikely to change much between now and its introduction, which is imminent.

SQL:2008 refers many times to the Unicode standard, and depends on it for definitions, repertoires, and names. It does not name the Unicode version that vendors will have to support, but does refer to the “Latest Unicode Version”, which means it refers to Unicode 5.0
http://www.unicode.org/versions/enumeratedversions.html

A subtle point is that Unicode has an “encoding” named “UTF-8″, while SQL:2008 has a “character set” (as well as an encoding) named “UTF8″. MySQL of course follows the SQL standard for definitions and names. In documentation, we try to remember to put the “-” in when we specifically mean Unicode, and not Unicode as seen through SQL lenses.

The ucs2 character set is not in SQL:2008.

The utf16 and utf32 and utf8 character sets are in SQL:2008, and correspond of course to Unicode encodings UTF-16, UTF-32, and UTF-8.

Old Character set ucs2

The ucs2 character set does not change in essence.

There are some Unicode-5.0 characters which are inside the BMP, which MySQL will support with ucs2. But ucs2 remains what it always has been: a fixed 16-bit encoding for Unicode BMP characters.

New Character set utf16

The utf16 character set is the ucs2 character set, with an extension that’s necessary for supplementary characters.

For a BMP character, utf16 and ucs2 have identical storage characteristics — same code values, same encoding, same length.

For a supplementary character, utf16 has a special sequence, reminiscent of an ‘escape sequence’, for representing the character using 32 bits. This is called the “surrogate” mechanism. Briefly it works like this:
for a number greater than 0xffff, take 10 bits and add them to 0xd800 and put them in the first 16-bit word, take 10 more bits and add them to 0xdc00 and put them in the next 16-bit word.
So all supplementary characters require 32 bits in all, and the first 16 bits are a number between 0xd800 and 0xdbff, and the last 16 bits are a number between 0xdc00 and 0xdfff. Examples are in “15.5 Surrogates Area” in the Unicode 4.0 document
http://unicode.org/versions/Unicode4.0.0/ch15.pdf

For example, suppose you have the three supplementary characters illustrated in Figure 1.


Supplementary

As the figure shows, 100CC becomes D800 DCCC, and so on. You can confirm this by doing the calculation as we have described, or by using a handy online converter which lets you input code points and get UTF16 or UTF8 values:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=ff9d&mode=hex

Because utf16 supports surrogates and ucs2 doesn’t, there is a validity check which applies only in utf16: you can’t insert a top surrogate without a bottom surrogate, or vice versa. For example:

INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */INSERT INTO t (utf16_column)VALUES (0xd800); /* illegal */

There is no validity check for characters which are technically valid but are not true Unicode (that is, Unicode considers them to be ‘unassigned code points’ or ‘private use’ characters or even ‘illegals’ like 0xffff). For example, since U+F8FF is the Apple Logo:

INSERT INTO t (utf16_column)VALUES (0xf8ff); /* legal */

Don’t expect such characters to mean the same thing to everyone.

Since MySQL must allow for the worst case — that 1 character requires 4 bytes — the maximum length of a utf16 column or index is only one half of the maximum length for a ucs2 column or index. For example, since at the time of writing the maximum length of a Falcon index key is 1100 bytes, this is legal:

CREATE TABLE tf (s1 VARCHAR(550) CHARACTER SET ucs2) engine=falcon;

CREATE INDEX i ON tf (s1);

CREATE TABLE tg (s1 VARCHAR(275) CHARACTER SET utf16) engine=falcon;

CREATE INDEX i ON tg (s1);

You cannot use utf16 as a client character set.

Revised character set utf8

We changed the characteristics of the utf8 character set, from (maximum 3 bytes per character) to (maximum 4 bytes per character). We had reasons:
* utf8 is the ‘SQL:2003 standard’ name for the full-repertoiree character set
* because the new is ‘upward compatible’ with the old, with respect to repertoire, switching to MySQL 6.0 will involve no conversion of existing data or definitions, unless columns are very long.

Let’s not minimize that subordinate clause “unless columns are very long”. The catch, as with utf16 versus ucs2, is that the maximum length of a column or index key is unchanged in terms of bytes. Therefore it is smaller in terms of characters, because the maximum length of a character is 4 bytes instead of 3. The three things to watch for are:
(1) look at all application programs, and make sure that buffer lengths aren’t calculated with an algorithm like “number of characters times 3″
(2) look at all indexes on utf8 columns, and make sure they won’t exceed the maximum length for the storage engine. Sometimes the maximum can change due to storage engine enhancements.
(3) look at the definitions of utf8 columns, and make sure they won’t exceed the maximum length for the storage engine.
(4) do the same for TEXT as well as CHAR or VARCHAR.
The MySQL Reference Manual will call this an INCOMPATIBLE CHANGE.

If the above situations apply for you, then you will have to reduce the defined length, or you will have to change the column definition from “CHARACTER SET utf8″ to “CHARACTER SET utf8mb3″ (see utf8mb3 description in next section). Other than that, we currently expect that the transition from 5.1 to 6.0 will be painless.

For a BMP character, utf8 (5.1) and utf8 (6.0) have identical storage characteristics — same code values, same encoding, same length.

For a supplementary character, utf8 (5.1) can’t store the character at all, while utf8 (6.0) requires 4 bytes to store. Since utf8 (5.1) can’t store the character at all, you don’t have any supplementary characters in UTF8 columns in your 5.1 database, so you don’t have to worry about converting characters or losing data.

Renamed character set utf8mb3

For the few cases where it’s desirable to have complete compatibility with the old utf8 character set, you can define with “CHARACTER SET utf8mb3″, that is, “utf8 with multibyte 3″, with the maximum number of bytes = 3. The utf8mb3 character set has these characteristics:
* no support for supplementary characters
* maximum 3 bytes per character
In other words, version-6.0 utf8mb3 has exactly the same characteristics as version-5.1 utf8. We expect that utf8mb3 will only be needed when updating from MySQL version 5.1, as described in the previous section. In a few years we’ll probably deprecate utf8mb3.

We know that there will be some people who say: “MySQL should allow utf8mb3 as an alias for utf8 in MySQL 5.1, to make replication easier.” We won’t accommodate those people; sorry.

Exactly the same set of characters is available in utf8mb3 as in ucs2, that is, they have the same repertoire.

New character set utf32

The utf32 character set is fixed length (like ucs2), but allows 32 bits for every character (unlike ucs2 which uses 16 bits for every character).

Probably most people will avoid utf32 for space reasons. But it does have two advantages: it’s predictable for storage (number of bytes = number of characters times 4), and there are no tricks in encoding so the stored value = the code value.

To show how the latter advantage is useful, here’s a trick that shows how a user can figure out the utf8 value given the code value.

/* Assume code value = 100cc LINEAR B CHARIOT WHEEL */

CREATE TABLE tmp (utf32 CHAR(1) CHARACTER SET utf32,utf8 CHAR(1) CHARACTER SET utf8);

INSERT INTO tmp VALUES (0x000100cc,NULL);

UPDATE tmp SET utf8 = utf32;

SELECT HEX(utf32),HEX(utf8) FROM tmp;

Storage is big-endian, that is, the highest byte comes first and there is no need for a mark to specify the byte ordering.

MySQL is very forgiving about additions of unassigned Unicode characters, private-use-area characters, and other code values that you won’t find in the official Unicode 5.0 Character Database (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt). There is in fact only one validity check for utf32: no code value may be greater than 0×10ffff. For example this is illegal:

INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */

You cannot use utf32 as a client character set.

It should be especially useful in prepared statements.

New collations, same as the old collations

With all these changes to character sets, it would be disruptive to change all the collations too. We made the absolute minimum of changes. To wit:

The new character sets all have collations with the same naming conventions as the old ones. The complete list of Unicode collations is:

ucs2_bin           utf16_bin           utf8_bin           utf8mb3_bin_ci        utf32_bin

ucs2_czech_ci      utf16_czech_ci      utf8_czech_ci      utf8mb3_czech_ci      utf32_czech_ci

ucs2_danish_ci     utf16_danish_ci     utf8_danish_ci     utf8mb3_danish_ci     utf32_danish_ci

ucs2_esperanto_ci  utf16_esperanto_ci  utf8_esperanto_ci  utf8mb3_esperanto_ci  utf32_esperanto_ci

ucs2_estonian_ci   utf16_estonian_ci   ut8_estonian_ci    utf8mb3_estonian_ci   utf32_estonian_ci

ucs2_general_ci    utf16_general_ci    utf8_general_ci    utf8mb3_general_ci    utf32_general_ci

ucs2_hungarian_ci  utf16_hungarian_ci  utf8_hungarian_ci  utf8mb3_hungarian_ci  utf32_hungarian_ci

ucs2_icelandic_ci  utf16_icelandic_ci  utf8_icelandic_ci  utf8mb3_icelandic_ci  utf32_icelandic_ci

ucs2_latvian_ci    utf16_latvian_ci    utf8_latvian_ci    utf8mb3_latvian_ci    utf32_latvian_ci

ucs2_lithuanian_ci utf16_lithuanian_ci utf8_lithuanian_ci utf8mb3_lithuanian_ci ut32_lithuanian_ci

ucs2_persian_ci    utf16_persian_ci    utf8_persian_ci    utf8mb3_persian_ci    utf32_persian_ci

ucs2_polish_ci     utf16_polish_ci     utf8_polish_ci     utf8mb3_polish_ci     utf32_polish_ci

ucs2_roman_ci      utf16_roman_ci      utf8_roman_ci      utf8mb3_roman_ci      utf32_roman_ci

ucs2_romanian_ci   utf16_romanian_ci   utf8_romanian_ci   utf8mb3_romanian_ci   utf32_romanian_ci

ucs2_slovak_ci     utf16_slovak_ci     utf8_slovak_ci     utf8mb3_slovak_ci     utf32_slovak_ci

ucs2_slovenian_ci  utf16_slovenian_ci  utf8_slovenian_ci  utf8_slovenian_ci     utf32_slovenian_ci

ucs2_spanish_ci    utf16_spanish_ci    utf8_spanish_ci    utf8mb3_spanish_ci    utf32_spanish_ci

ucs2_spanish2_ci   utf16_spanish2_ci   utf8_spanish2_ci   utf8mb3_spanish2_ci   utf32_spanish2_ci

ucs2_swedish_ci    utf16_swedish_ci    utf8_swedish_ci    utf8mb3_swedish_ci    utf32_swedish_ci

ucs2_turkish_ci    utf16_turkish_ci    utf8_turkish_ci    utf8mb3_turkish_ci    utf32_turkish_ci

ucs2_unicode_ci    utf16_unicode_ci    utf8_unicode_ci    utf8mb3_unicode_ci    utf32_unicode_ci

The ‘general’ collations are still default but Unicode collations are recommended.

How MySQL collates new characters

For all collations except the “binary” (_bin) collations, MySQL performs a table lookup to find a character’s collating weight. This weight is visible with the new weight_string() function. (See here for description of weight_string: http://forge.mysql.com/worklog/task.php?id=3716)
But what if a character isn’t in the table, for example because it’s a “new” character? In that case things get more complex.

For BMP characters in general collations, e.g. utf8_general_ci, weight = code point.

For BMP characters in uca collations, e.g. utf8_unicode_ci,
if (code >= 0×3400 && code <= 0×4DB5)
base= 0xFB80; /* CJK Ideograph Extension */
else if (code >= 0×4E00 && code <= 0×9FA5)
base= 0xFB40; /* CJK Ideograph */
else
base= 0xFBC0; /* All other characters */
aaaa= base + (code >> 15);
bbbb= (code & 0×7FFF) | 0×8000;
For example,

mysql> SELECT HEX(weight_string(_ucs2 0x04CF COLLATE ucs2_unicode_ci));

+-----------------------------------------------------------+

| hex(weight_string(_ucs2 0x04CF collate ucs2_unicode_ci))  |

+-----------------------------------------------------------+

| FBC084CF                                                  |

+-----------------------------------------------------------+

1 row in set (0.00 sec)

Thus, U+04cf CYRILLIC SMALL LETTER PALOCHKA is, with all uca collations, greater than U+04c0 CYRILLIC LETTER PALOCHKA. Only later, after MySQL upgrades the collations according to WL#2673, will all palochkas sort together.

For supplementary characters:
the weight is the weight for 0xfffd REPLACEMENT CHARACTER.
That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.

An example with Deseret characters and COUNT(DISTINCT):

CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);

INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */

INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */

INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */

SELECT COUNT(DISTINCT s1) FROM t;

The result is: 1, because Deseret Bee = Deseret Tee = Replacement Character, in the MySQL Unicode collation.

An example with cuneiform characters and WEIGHT_STRING():

/*

The four characters in the INSERT string are

00000041 # LATIN CAPITAL LETTER A

0001218F # CUNEIFORM SIGN KAB

000121A7 # CUNEIFORM SIGN KISH

00000042 # LATIN CAPITAL LETTER B

*/

CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);

INSERT INTO t VALUES (0x000000410001218f000121a700000042);

SELECT HEX(WEIGHT_STRING(s1)) FROM t;

/*

The result is:

0E33 FFFD FFFD 0E4A

*/

where 0E33 and 0E4a are primary weights as in UCA 4.0.0
ftp://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt
where FFFD is the weight KAB and also the weight for KISH.

The “supplementary character collation” rule is non-wonderful. It was necessary, so that we wouldn’t have to rewrite the collation algorithms at the same time that we were making possibly disruptive changes to the character sets. We don’t expect the rule to cause trouble, because the characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user doesn’t care what order they’re in anyway. However, if you really want to get rows sorted by MySQL’s rule and secondarily by code point value, it’s easy:

ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin

The utf16_bin collation

There is a difference between “ordering by the character’s code value” and “ordering by the character’s binary representation”, a difference which only appears with utf16_bin, because of surrogates.

Suppose utf16_bin was a binary comparison “byte by byte” rather than “character by character”. If that were so, then the order of characters in utf16_bin (binary collation for utf16) would differ from the order in utf8_bin. For example:

Here is a chart showing two rare characters. The first character is in the range E000-FFFF, so it is greater than a surrogate but less than a supplementary. The second character is a supplementary.

Code point  Character                    utf8         utf16

----------  ---------                    ----         -----

0FF9D       HALFWIDTH KATAKANA LETTER N  EF BE 9D     FF 9D

10384       UGARITIC LETTER DELTA        F0 90 8E 84  D8 00 DF 84

The two characters in the chart are in order by code point value, because 0xff9d < 0×10384. And they are in order by utf8 value, because 0xef < 0xf0. But they are not in order by utf16 value, if we use byte-by-byte comparison, because 0xff > 0xd8.

So MySQL’s utf16_bin collation is not “byte by byte”. It is “code point by code point”. When MySQL sees a supplementary-character encoding in utf16, it converts to the character’s code-point value, and then compares. Therefore utf8_bin and utf16_bin are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation: “UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11 — The Unicode scalar value of a character is its code point treated as an unsigned integer.”

If the character set is ucs2, then comparison is byte by byte, but ucs2 strings shouldn’t contain surrogates anyway.

Comparing Unicode to Unicode

Generally, if you try to compare two character columns and the character sets are different and neither character set is a proper superset of the other one, this error message appears:

"ERROR 1267 (HY000): Illegal mix of collations (...,IMPLICIT) and (...,IMPLICIT) for operation '...'")

But the mix may be legal if one character set is a subset of the other. For example:

CREATE TABLE t1 (

  latin1_column VARCHAR(5) CHARACTER SET latin1,

  utf8_column VARCHAR(5) CHARACTER SET utf8);

...

SELECT * FROM t1 WHERE latin1_column = utf8_column; /* legal */

The superset (utf8 in this case) takes precedence.

Now, it is clear that the set of characters in utf8mb3/ucs2 is a proper subset of the set of characters in utf8/utf16/utf32. So the intent is that comparisons will be legal for all collations except binary collations. However, that part of the implementation has not begun.

Not to worry. Just cast to the character set whose rules you prefer:

CREATE TABLE t1 (

  utf16_column VARCHAR(5) CHARACTER SET utf16,

  utf8_column VARCHAR(5) CHARACTER SET utf8);

...

SELECT *

FROM t1

WHERE CAST(utf16_column AS CHAR(2) CHARACTER SET utf8) = utf8_column;

Identifiers

Currently all identifiers (names of tables or columns or triggers or variables and so on) are UTF8. In the new version they will also be UTF8. If you’ve followed so far, you know that means that, technically, identifiers are in the “UTF8 with maximum 4 bytes” character set.

However, that does not mean that MySQL will allow use of supplementary characters for identifiers. If you try to use such a character in a name, you will get
an error message. For compatibility reasons, MySQL will restrict such use, at least in version 6.0.

For the moment, because the identifier character set is restricted even though it’s officially termed utf8, the maximum length of a name is the same as it used to be.
For example,

SELECT CHARACTER_MAXIMUM_LENGTH

FROM INFORMATION_SCHEMA.COLUMNS

WHERE column_name='TABLE_NAME'

AND table_schema = 'information_schema'

AND table_name='columns';

gives us ‘64′, in both MySQL 5.1 and MySQL 6.0.

That’s another step that we took to make upgrade easier, although we think it is a trifle odd.

Update

You should run mysql_upgrade whenever you change versions. That is still true when you move from MySQL 5.1 to MySQL 6.0. However, you won’t have problems that are due to existing data containing too-long index keys or existing metadata containing too-long identifiers. (You might have to look at any scripts that have long index keys or long identifiers or mention UTF8, as mentioned elsewhere, but that’s not strictly speaking an ‘upgrade’ problem.)

It’s possible to go from MySQL 5.1 to 6.0 and back again to 5.1 if you do not run mysql_upgrade. You might have these problems:
* Your database contains a UTF8 column value with supplementary characters. The supplementary characters, and everything following them in the same column, will be truncated.
* Your database contains utf8mb3, utf32, or utf16 columns. The version-5.1 DBMS will not recognize those character sets and will return an error.
It means that, if you’re using MySQL Replication and you have a 5.1 master for a 6.0 slave, failure can happen.

Update, Names Change but Numbers Stay the Same

You can see the collation number by selecting ID from INFORMATION_SCHEMA.COLLATIONS. For example (with MySQL 5.1):

mysql> select * from information_schema.collations

-> where collation_name = 'utf8_general_ci'\G

*************************** 1. row ***************************

COLLATION_NAME: utf8_general_ci

CHARACTER_SET_NAME: utf8

ID: 33

IS_DEFAULT: Yes

IS_COMPILED: Yes

SORTLEN: 1

1 row in set (0.00 sec)

But in MySQL 6.0, you will see different numbers:

mysql> SELECT * FROM information_schema.collations

-> WHERE collation_name = 'utf8_general_ci'

-> OR collation_name = 'utf8mb3_general_ci'\G

*************************** 1. row ***************************

COLLATION_NAME: utf8mb3_general_ci

CHARACTER_SET_NAME: utf8mb3

ID: 33

IS_DEFAULT: Yes

IS_COMPILED: Yes

SORTLEN: 1

*************************** 2. row ***************************

COLLATION_NAME: utf8_general_ci

CHARACTER_SET_NAME: utf8

ID: 45

IS_DEFAULT: Yes

IS_COMPILED: Yes

SORTLEN: 1

2 rows in set (0.01 sec)

What’s going on? Has the ID changed?
No, the COLLATION_NAME has changed.

Numbering for utf8 collations does not change. So when MySQL sees (for example) ‘collation id = 33′ in internal tables, it considers the column to be utf8mb3. And, since MySQL-5.1 utf8/utf8_general_ci columns were created with id=33, that means MySQL will see any ‘created-by-5.1′ utf8 columns as utf8mb3.

For example:

/* in MySQL 5.1 */

CREATE TABLE t (s1 CHAR(5) CHARACTER SET UTF8);

SHOW CREATE TABLE t;

mysql> SHOW CREATE TABLE t;

+-------+------------------------------------------+

| Table | Create Table |

+-------+------------------------------------------+

| t | CREATE TABLE `t` (

`s1` char(5) CHARACTER SET utf8 DEFAULT NULL

) ENGINE=MyISAM DEFAULT CHARSET=latin1 |

+-------+------------------------------------------+

1 row in set (0.01 sec)/* in MySQL 6.0 */

mysql> SHOW CREATE TABLE t;

+-------+------------------------------------------+

| Table | Create Table |

+-------+------------------------------------------+

| t | CREATE TABLE `t` (

`s1` char(5) CHARACTER SET utf8mb3 DEFAULT NULL

) ENGINE=MyISAM DEFAULT CHARSET=latin1 |

+-------+------------------------------------------+

1 row in set (0.01 sec)

To change any column to utf8, the user will have to say:

ALTER TABLE t MODIFY COLUMN s1 CHAR(5) CHARACTER SET UTF8;

If you don’t ALTER any columns to utf8, or create new ones with utf8, downgrade is still possible.

But there is a downside:
* Outside column definitions, ‘utf8′ remains unchanged.
So there will be _utf8 introducers, utf8 stored-procedure variables, utf8 function returns, utf8 defaults, utf8 targets in “SET NAMES” scripts, utf8 in mysqldump, etc.
* Users won’t be able to get the switch over with in a few key strokes.

So some things may change while 6.0 is still alpha.

Update, conclusion, two concepts

Coming to a conclusion about all I’ve just said re updating: conceptually there are two ways to go from 5.1 to 6.0.

1. If you start 6.0 over an existing 5.1 database, it will work without changes in keys!
However SHOW commands will report “utf8mb3″.
So will INFORMATION_SCHEMA queries.
In short:
* there are no key differences
* there are name differences.

2. If you dump a 5.1 database and then start 6.0 and load the dump using 6.0, character set data will be reported as utf8, as in 5.1.
However it will already mean “new” utf8, and that’s why I said earlier there are no big problems “unless columns are very long”. The “unless columns are very
long” constraint is applicable only in this case.
In short:
* there are key differences
* there are no name differences.

Performance

It is well known that sorting with Unicode collations takes longer than sorting with 8-bit-byte collations. However, our tests so far indicate that:
* sorting with 6.0 is slower than sorting with 5.0 anyway, as one would expect in an early product version
* sorting with utf8_unicode_ci is only about 10% slower than sorting the same values with utf8mb3_unicode_ci.

At this stage, we can only say: test before converting, but, well, you won’t be converting immediately anyway.

Converting between Unicode character sets

Here the word “conversion” includes the explicit functions CONVERT() and CAST(), and the implicit conversion that happens when you assign a source
string with CHARACTER SET x to a target with CHARACTER SET y.

CONVERTING TO UTF16:
MySQL does little validity checking. Users are able to insert 0xd800 (half a surrogate) in ucs2 columns. That is illegal in utf16 columns, where surrogates have meaning. So conversion from ucs2 to utf16 can fail. For example:

CREATE TABLE t1 (s1 VARCHAR(1) CHARACTER SET ucs2);

CREATE TABLE t2 (s1 VARCHAR(1) CHARACTER SET utf16);

NSERT INTO t1 VALUES (0xd800);     /* legal */

INSERT INTO t2 VALUES (0xd800);    /* illegal */

INSERT INTO t2 SELECT * FROM t1;   /* fails */

But valid BMP characters are okay (for valid BMP characters the repertoire of ucs2 and utf16 is the same). And valid surrogate characters are okay. For example:

CREATE TABLE t1 (s1 VARCHAR(1) CHARACTER SET ucs2);

CREATE TABLE t2 (s1 VARCHAR(1) CHARACTER SET utf16);

INSERT INTO t1 VALUES (0xD801DC12);     /* legal */

INSERT INTO t2 VALUES (0xD801DC12);     /* legal */

INSERT INTO t2 SELECT * FROM t1;        /* succeeds */

In general: bad values (values which would make a partial surrogate) are legal in utf8 and utf8mb3 and utf32, but not in utf16. Therefore conversion to utf16 may fail if there are bad values.

CONVERTING TO UCS2 OR UTF8MB3:
Since utf8 and utf16 and utf32 strings may contain supplementary characters, and since ucs2 or utf8mb3 characters may not contain them, conversion to ucs2
or utf8mb3 may fail if there are supplementary values.

Converting to other character sets

You can put supplementary characters in Unicode columns, but not in non-Unicode columns. If you convert to sjis or gb2312 or gbk or etc., the supplementary characters become ‘?’s and you see a warning. For example:

mysql> CREATE TABLE linear_b (s1 VARCHAR(5) CHARACTER SET utf8);

Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO linear_b VALUES (0xf090838c); /* wheeled chariot, again */

Query OK, 1 row affected (0.01 sec)

mysql> ALTER TABLE linear_b MODIFY COLUMN s1 CHAR(5) CHARACTER SET gbk;

Query OK, 1 row affected, 1 warning (0.03 sec)

Records: 1  Duplicates: 0  Warnings: 0

mysql> SHOW WARNINGS;

+---------+------+---------------------------------------------------------------------+

| Level   | Code | Message                                                             |

+---------+------+---------------------------------------------------------------------+

| Warning | 1366 | Incorrect string value: '\xF0\x90\x83\x8C' for column 's1' at row 1 |

+---------+------+---------------------------------------------------------------------+

1 row in set (0.00 sec)

It would be nice to say that we hope to support gb18030 and jis2004 soon, but that depends on lots of things. We’ll say nothing re roadmaps here.

Some Examples

Because your browser might not display supplementary characters correctly, we’ll use hex encoding for these examples and we won’t try to show results.

CREATE TABLE t1 (s1 CHAR(1) CHARACTER SET utf32);

INSERT INTO t1 VALUES ('a');

SELECT OCTET_LENGTH(s1) FROM t1; /* result = 4 */

SELECT _utf8mb3 'x'; /* introducers are still legal */

/* Create table, use all Unicode character sets */

CREATE TABLE t (ucs2 VARCHAR(2) CHARACTER SET ucs2,

                utf8mb3 VARCHAR(2) CHARACTER SET utf8mb3,

                utf8 VARCHAR(2) CHARACTER SET utf8,

                utf16 VARCHAR(2) CHARACTER SET utf16,

                utf32 VARCHAR(2) CHARACTER SET utf32);

/* Insert using a literal with an introducer. */

INSERT INTO t (utf8mb3) VALUES (_utf8mb3 'A');

/* Update with hex value of 'A', specifying lead 0s. */

UPDATE t SET utf32 = 0x00000041;

/* Update with hex value of 'BB', assuming lead 0s. */

UPDATE t SET utf32 = 0x4200000042;

/* Update, expecting no change of encoded values. */

UPDATE t SET utf8 = utf8mb3;

/* Update, putting half-surrogate in ucs2 column. */

UPDATE t SET ucs2 = 0xd801;

/* Update, with invalid character, so this will fail. */

UPDATE t SET utf16 = ucs2;

/* Update, with invalid character, but it won't fail. */

UPDATE t SET utf8 = ucs2;

/* Update, putting full surrogate in utf16 column.

   It stands for code point value 0x20000, which is

   the first "CJK Ideograph Extension B". */

UPDATE t SET utf16 = 0xD840DC00;

/* Select, converting to utf32. Result = 0x00020000. */

SELECT HEX(CAST(utf16 AS CHAR(1) CHARACTER SET utf32)) FROM t;

/* Select, converting to utf8mb3. Result = '?'.

   (For a conversion failure, result is always '?'. */

SELECT CONVERT(utf16 USING utf8mb3) FROM t;

/* Select, if utf16 value is exactly equal to code point

   value 0x20000. Result: 1 row. */

SELECT * FROM t WHERE

 CAST(utf16 AS CHAR(1) CHARACTER SET utf32) COLLATE utf32_bin

 = 0x00020000;

/* Select, if utf16 value is 0xfffd with general collation. */

SELECT * FROM t WHERE utf16 = 0xfffd;

Changes to associated products

Not everything happens at once. Different MySQL products work on different schedules. Some products are from other companies so they’ll naturally take a bit longer. On the other hand, features closely connected with the server are all finished / available starting with this version. Here is a partial list of MySQL products, associated products, storage engines, and dates or version numbers when they will support supplementary characters.

Connector / Java                       NOT READY YET

Connector / NET                        NOT READY YET

Connector / PHP                        NOT READY YET

Documentation / Reference Manual       NOT READY YET

Documentation / Expert Guides          NOT READY YET

Front End: mysql                       OKAY

Front End: Monitor and Advisors        NOT READY YET

Front End: MySQL Workbench             NOT READY YET

Front End: SQLyog                      NOT READY YET

Storage Engine: Falcon                 OKAY

Storage Engine: InnoDB                 OKAY

Storage Engine: Memory                 OKAY

Storage Engine: MyISAM                 OKAY

Storage Engine: Solid                  NOT READY YET

We will update this list as products change. In the meantime …
You risk data loss if you have utf8 storage with engine=myisam and you change to a storage engine which knows nothing of the change, e.g. with

ALTER TABLE t ENGINE=soliddb;

Compatibility with other DBMSs

When migrating to or from DBMS products of other vendors, keep in mind that:

DB2 may support supplementary characters with an encoding known as CESU-8. This, and the very similar “Modified UTF-8″ encodings of Sun and Java, are not true UTF-8 so MySQL will not recognize hex-encoded INSERT statements containing such supplementary characters.

SQL Server 2005, using a Microsoft employee’s words, is “surrogate safe” but not “surrogate aware”. This should mean that it’s possible to transfer Unicode data from MySQL to SQL Server without error, but doesn’t necessarily mean that SQL Server will understand it.

Oracle 10g supports Unicode well. The only thing to watch out for is the naming. These character sets approximately correspond:

MySQL 6.0        Oracle 10g

---------        ----------

UTF8             AL32UTF8 (not "UTF8", Oracle UTF8 is CESU-8!!)

UTF16            AL16UTF16UTF32 [ we think there's no exact equivalent ]

Questions or Comments

Before contacting us about a problem with supplementary characters in MySQL 6.0, check that your operating system and browser and application program handle supplementary characters and associated fonts. If the problem comes from outside MySQL, please contact the appropriate provider.

To comment about this feature, use the MySQL “Character Sets, Collation, Unicode” Forum:
http://forums.mysql.com/list.php?103

To report bugs in a alpha product, or comment on feature requests, use the MySQL Bugs Database:
http://bugs.mysql.com

This article was written before all bugs were fixed, and they may still be unfixed at the time you read this. To find the latest active bugs according to bugs.mysql.com, click Advanced Search and ask for category = Character Sets, sorted by = ID, in descending order = yes.

Posted in MySQL 6.0 New Features | No Comments »

New Features In MySQL 6.x is proudly powered by WordPress MU running on Blogs.mysql.com.
Entries (RSS) and Comments (RSS).