| John Bafford @ 2006-11-02 12:48:00 |
| Current location: | Zend Conference, Doubletree, San Jose, CA |
| Entry tags: | php, unicode, zendconference2006 |
ZendCon Session Notes - Unicoding With PHP6
Presented by Andrei Zmievski (Yahoo!)
http://www.gravitonic.com/talks/
Today is Andrei's birthday, so his birthday present is getting to present an 8:30 session. Coincidentally (or not), this is also session number 2-11.
Tower of Babel
Dealing with multiple languages and encodings is a pain, but it can't be avoided.
In the past, PHP has always been a binary processor; the string type is byte-oriented and used for everything from text to images. The core language doesn't know anything about text encodings and multilingual data. And while they're a help, the iconv and mbstring extensions are not completely sufficient.
Andrei spent some time talking about some of the features of Unicode. Unicode by itself doesn't mean internationalization. I18N (internationalization) and L10N (localization) rely on consistent and correct locale data. A locale is an identifier (like en_US) that records characteristics like date/time formats, number/currency formats, sorting order, character direction, etc. PHP uses the Unicode Common Locale Data Repository, which contains 360 locales covering 121 languages and 142 territories.
Goals for Unicode in PHP 6
Have a native Unicode string type and a distinct binary string type (that works like PHP's existing string type); update the language semantics to work correctly with Unicode strings; maintain backwards compatibility.
PHP 6 uses ICU, the International Components for Unicode (provided by IBM), which provides encoding conversions, collation, Unicode text processing, and a large number of other features.
Introduced in PHP 6 is a new configuration option, unicode.semantics. Program behavior doesn't change unless it's enabled, though you can still use Unicode when it's disabled. When it's enabled, PHP converts strings into an internal Unicode representation.
With unicode.semantics off, 1 character in a string is 1 byte. With it on, 1 character may be more than 1 byte, and strlen() would return the number of characters rather than the number of bytes. To determine the size in bytes of a Unicode string, you need to use a different function. (I'm wondering if this means that, for binary safety, you can no longer rely on strlen() when you need to pass a sequence of bytes and a length to an API.)
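The character/byte distinction is easy to see in any language that already has a native Unicode string type. Here is a minimal sketch in Python (not PHP 6 code; the helper name char_and_byte_lengths is made up for illustration) showing that the character count and the byte count diverge as soon as the text leaves ASCII:

```python
def char_and_byte_lengths(text, encoding="utf-8"):
    """Return (number of code points, number of bytes once encoded)."""
    return len(text), len(text.encode(encoding))


# Plain ASCII: one byte per character in UTF-8.
print(char_and_byte_lengths("abc"))  # (3, 3)

# Four Hebrew letters (shin, lamed, vav, final mem): two bytes each in UTF-8.
hebrew = "\u05e9\u05dc\u05d5\u05dd"
print(char_and_byte_lengths(hebrew))  # (4, 8)
```

This is the concern in the parenthetical above: an API that wants a byte buffer and a length needs the second number, not the first.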
In strings, you can use \u or \U and specify the code point (e.g. \u05D0), or \C{HEBREW LETTER ALEF} when you don't know the code point but do know the Unicode character name.
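For comparison, Python has the same two ways of writing a character, although it spells the by-name escape \N{...} rather than the \C{...} described in the talk. This sketch only illustrates the idea and says nothing about PHP 6's final syntax:

```python
import unicodedata

# The same character written by code point and by Unicode character name.
by_code_point = "\u05D0"
by_name = "\N{HEBREW LETTER ALEF}"

print(by_code_point == by_name)         # True
print(unicodedata.name(by_code_point))  # HEBREW LETTER ALEF
print(hex(ord(by_name)))                # 0x5d0
```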
PHP can automatically change the data encoding for different input and output sources. By default it interprets string literals as UTF-8; if a file contains declare(encoding="iso-8859-1"), that code file is interpreted in that character set instead.
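Why the declared encoding matters: the same bytes on disk mean different things depending on which character set the reader assumes. A small Python sketch, assuming a literal that was saved by an ISO-8859-1 editor (the function name read_literal is hypothetical):

```python
def read_literal(raw, declared_encoding="utf-8"):
    """Decode the bytes of a string literal using the file's declared encoding."""
    return raw.decode(declared_encoding)


# "café" as an ISO-8859-1 editor would save it: é is the single byte 0xE9.
raw = b"caf\xe9"

text = read_literal(raw, "iso-8859-1")
print(text == "caf\u00e9")   # True
print(text.encode("utf-8"))  # b'caf\xc3\xa9'

# Under the default UTF-8 assumption the same bytes are not valid text.
try:
    read_literal(raw)
except UnicodeDecodeError:
    print("not valid UTF-8")
```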
Processing data retrieved from the browser poses a special problem: GET requests have no encoding at all, and POST only rarely comes marked with an encoding. However, browsers are supposed to submit data in the same encoding as the page the form was on, so PHP will attempt to decode based on the unicode.output_encoding setting. If decoding fails, PHP will populate the request arrays with raw binary data, and applications can then use the filter extension to decode the text.
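The decode-or-fall-back behavior can be sketched in a few lines. This is Python, not PHP's actual request handling, and decode_request_value is a hypothetical helper: it tries the encoding the page was served in and hands back the untouched bytes when that fails, so the application can decide what to do with them.

```python
from typing import Union


def decode_request_value(raw: bytes, page_encoding: str = "utf-8") -> Union[str, bytes]:
    """Decode a submitted value using the page's encoding, or return the raw bytes."""
    try:
        return raw.decode(page_encoding)
    except UnicodeDecodeError:
        # Decoding failed: leave the value as binary for the application.
        return raw


good = decode_request_value(b"caf\xc3\xa9")  # valid UTF-8, becomes text
bad = decode_request_value(b"caf\xe9")       # not valid UTF-8, stays bytes

print(type(good).__name__, type(bad).__name__)  # str bytes

# The application can then retry with an encoding it knows about.
if isinstance(bad, bytes):
    print(bad.decode("iso-8859-1") == good)  # True
```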
When there is a conversion error to or from Unicode, you can specify how PHP is to handle the error, and even provide an error function so that you can handle the error via PHP code.
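Python's codecs module offers a comparable hook, which gives a feel for what a user-supplied error function looks like. The sketch below is an analogy only, not the PHP 6 interface; the handler name hex_escape is made up. It replaces each undecodable byte with a visible marker instead of aborting the conversion.

```python
import codecs


def hex_escape(err):
    """Error handler: replace undecodable bytes with a <XX> marker."""
    if isinstance(err, UnicodeDecodeError):
        bad_bytes = err.object[err.start:err.end]
        replacement = "".join("<%02X>" % b for b in bad_bytes)
        # Return the replacement text and the position to resume decoding at.
        return replacement, err.end
    raise err


codecs.register_error("hex_escape", hex_escape)

# 0xE9 is not a valid UTF-8 sequence here, so the handler is called for it.
print(b"caf\xe9!".decode("utf-8", errors="hex_escape"))  # caf<E9>!
```

The built-in strategies ("strict", "ignore", "replace") correspond to the "specify how to handle the error" half of the sentence above; the registered function corresponds to the "handle it in your own code" half.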
Also new is the TextIterator, which allows for fast iteration, forwards and backwards, over text. It allows you to iterate by code point, character, word, line, or even sentence.
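The difference between iterating by code point and by character shows up with combining marks. The following Python sketch is not TextIterator; combining_clusters is a hypothetical helper, and it uses a deliberately simplified rule (a base character plus any combining marks that follow it) rather than the full Unicode segmentation rules that ICU implements.

```python
import unicodedata


def combining_clusters(text):
    """Yield each base character together with the combining marks after it."""
    cluster = ""
    for ch in text:
        # A non-combining character starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'

by_code_point = list(sample)
by_character = list(combining_clusters(sample))

print(len(by_code_point))  # 3
print(len(by_character))   # 2

# Backwards iteration must keep each cluster intact, not just reverse code points.
print(by_character[::-1] == ["a", "e\u0301"])  # True
```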
To date, about 40% of PHP's 3070 built-in functions have been upgraded to handle Unicode text.
There should be a preview release of PHP 6 in December.