Migrating to Unicode

These are some notes from a talk I attended at the International PHP Conference in 2005. I don’t think I have blogged them yet. (You find all kinds of interesting things when you try to “clean up” your HDD.)

Case study
Survey center is an online survey generator written in PHP. It is used to run multi-country panel portals and has interfaces to third-party applications.

Why migrate to Unicode? Before the switch, non-Western-European languages were handled with HTML entities, which caused a lot of trouble.

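UTF-8 is simple to use, backwards compatible with ASCII, and uses a variable number of bytes per character. It can be slower to process than UTF-16, and characters that fit in a single byte in a legacy encoding such as Latin1 take two bytes in UTF-8.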
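To make the variable byte length concrete, here is a minimal sketch comparing byte counts with character counts. It assumes the mbstring extension is loaded; the sample strings are my own and are written as byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// Sample strings given as raw UTF-8 bytes (illustrative values, not from the talk).
$samples = [
    'ascii'   => "a",             // U+0061, one byte, identical to ASCII
    'e-acute' => "\xC3\xA9",      // U+00E9, two bytes (one byte in Latin1)
    'euro'    => "\xE2\x82\xAC",  // U+20AC, three bytes
];

foreach ($samples as $name => $text) {
    // strlen() counts bytes; mb_strlen() counts characters in the given encoding.
    printf("%-8s bytes=%d characters=%d\n", $name, strlen($text), mb_strlen($text, 'UTF-8'));
}
```

Every sample is a single character, but only the ASCII one is a single byte, which is why byte-oriented string functions need a second look during a migration.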

PCRE supports UTF-8 through the /u pattern modifier.
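A small sketch of what the /u modifier changes, using a sample word of my own choosing. Without the modifier the dot matches one byte; with it the subject is treated as UTF-8 and the dot matches one character.

```php
<?php
// "naive" with a diaeresis on the i, as raw UTF-8: 6 bytes, 5 characters.
$word = "na\xC3\xAFve";

// Without /u every byte counts as a match for the dot.
$byteMatches = preg_match_all('/./s', $word);

// With /u the two bytes of the accented letter are matched as one character.
$charMatches = preg_match_all('/./su', $word);

printf("without /u: %d matches, with /u: %d matches\n", $byteMatches, $charMatches);

// /u also validates the subject: a string that is not valid UTF-8 makes the
// call return false instead of a match count.
$invalid = "\xC3\x28";
var_dump(preg_match('/./u', $invalid));
```

The validation behaviour is worth knowing about during a migration: data that was never converted shows up as failed matches rather than as wrong matches.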

Iconv and mbstring provide functionality missing in PHP’s core. Mbstring can overload some of PHP’s native string functions, but overloading them breaks any binary handling that relies on byte counts. Mbstring is slower but safer than iconv. MySQL has good UTF-8 database support as of 4.1, and that warranted an upgrade.
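As a rough illustration, both extensions can convert a Latin1 string to UTF-8, and the result shows why overloading is risky for binary data: a character-aware length no longer equals the number of bytes. The sample string is my own, and both extensions are assumed to be loaded. (The overloading feature itself, the mbstring.func_overload setting, was removed in PHP 8.0.)

```php
<?php
// "cafe" with an acute accent, encoded as ISO-8859-1: 4 bytes.
$latin1 = "caf\xE9";

// The same conversion through each extension.
$viaIconv    = iconv('ISO-8859-1', 'UTF-8', $latin1);
$viaMbstring = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($viaIconv === $viaMbstring); // both produce the same bytes
echo bin2hex($viaIconv), "\n";        // hex dump of the UTF-8 result

// Byte length versus character length of the converted string.
$byteLength = strlen($viaIconv);
$charLength = mb_strlen($viaIconv, 'UTF-8');
printf("bytes=%d characters=%d\n", $byteLength, $charLength);
```

If strlen() were overloaded to behave like mb_strlen(), any code using it to compute a byte count (a Content-Length header, a binary file offset) would get the smaller number.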

The Migration: Grepped through the code to find which string functions were being used. Some functions worked with UTF-8; others had to be replaced with mb_* functions or custom code.
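The grep step could look something like the sketch below. This is not the tool used in the talk; the function name findByteStringCalls and the list of functions to flag are hypothetical, and a simple regular expression like this will also match inside comments and strings.

```php
<?php
// Hypothetical list of byte-oriented functions worth reviewing; adjust to the code base.
const BYTE_FUNCTIONS = ['strlen', 'substr', 'strpos', 'strtolower', 'strtoupper', 'ucfirst', 'wordwrap', 'str_pad', 'strrev'];

/**
 * Return the line number and name of every call to a listed function.
 * The lookbehind skips mb_* variants, method calls and static calls.
 */
function findByteStringCalls(string $source): array
{
    $pattern = '/(?<![\w>:$])(' . implode('|', BYTE_FUNCTIONS) . ')\s*\(/';
    $hits = [];
    foreach (explode("\n", $source) as $index => $line) {
        if (preg_match_all($pattern, $line, $matches)) {
            foreach ($matches[1] as $name) {
                $hits[] = ['line' => $index + 1, 'function' => $name];
            }
        }
    }
    return $hits;
}

// Usage on a small made-up snippet.
$sample = <<<'PHP'
$n = strlen($title);
$m = mb_strlen($title, 'UTF-8');
$t = strtoupper(substr($title, 0, 1));
PHP;

print_r(findByteStringCalls($sample));
```

Each hit still needs a human decision: some calls are fine on UTF-8 as they are, others need an mb_* replacement.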

1. Convert all files, scripts, and templates to UTF-8.
2. Enable mbstring and iconv in PHP.
3. Make sure all PCRE functions use the /u modifier. Get rid of the ereg regular expressions.
4. Change all the string functions.
5. Implement on-the-fly character set conversions for I/O, and make sure that file uploads and downloads have the right character sets. Convert GET/POST data to UTF-8.
6. Send the HTTP Content-Type header for every page. IE doesn’t bother reading the meta tags on SSL pages.
7. Update MySQL from 4.0 to 4.1 and decide what the best collation is. The most suitable turned out to be utf8_general_ci.
8. Update the SQL queries that no longer worked.
9. Convert all tables to UTF-8 (set everything to Latin1 first).
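Converting request data and sending the charset in the header might look roughly like this. It is a sketch under assumptions of my own: the helper names toUtf8 and normalizeInput are hypothetical, and ISO-8859-1 is assumed as the legacy encoding of anything that is not already valid UTF-8. A Latin1 string that happens to be valid UTF-8 would slip through unchanged, so treat the check as a heuristic.

```php
<?php
// Convert one value to UTF-8 unless it already is valid UTF-8 (ASCII included).
function toUtf8(string $value, string $fallbackEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', $fallbackEncoding);
}

// Walk a request array recursively, since fields like name[] arrive as nested arrays.
function normalizeInput(array $input): array
{
    $clean = [];
    foreach ($input as $key => $value) {
        $clean[$key] = is_array($value) ? normalizeInput($value) : toUtf8((string) $value);
    }
    return $clean;
}

// Declare the charset in the HTTP header instead of relying on a meta tag.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

$_GET  = normalizeInput($_GET);
$_POST = normalizeInput($_POST);
```

Doing the conversion once at the entry point keeps the rest of the application free to assume UTF-8 everywhere.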

Most of the third-party code wasn’t compatible. Serialized data in the database broke because the stored string lengths no longer matched the converted strings. To fix this, all data had to be unserialized, converted, and then serialized again.
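The serialization problem is easy to reproduce, because PHP's serialize() stores the byte length of every string. The sketch below shows the breakage and one possible repair; the helper names convertDeep and reserializeAsUtf8 are hypothetical, Latin1 is assumed as the source encoding, and objects are deliberately not handled.

```php
<?php
// Recursively convert all strings (and string keys) in a value to UTF-8.
function convertDeep($value, string $from = 'ISO-8859-1')
{
    if (is_string($value)) {
        return mb_convert_encoding($value, 'UTF-8', $from);
    }
    if (is_array($value)) {
        $out = [];
        foreach ($value as $key => $item) {
            $newKey = is_string($key) ? mb_convert_encoding($key, 'UTF-8', $from) : $key;
            $out[$newKey] = convertDeep($item, $from);
        }
        return $out;
    }
    return $value; // integers, floats, booleans, null
}

// Unserialize in the old encoding, convert, then serialize again so the
// stored lengths are recomputed.
function reserializeAsUtf8(string $serialized, string $from = 'ISO-8859-1'): string
{
    $data = unserialize($serialized, ['allowed_classes' => false]);
    if ($data === false && $serialized !== 'b:0;') {
        throw new InvalidArgumentException('Input is not valid serialized data');
    }
    return serialize(convertDeep($data, $from));
}

// A Latin1 city name with an o-umlaut: stored as s:4 because it is 4 bytes.
$legacy = serialize(['city' => "K\xF6ln"]);

// Converting the raw blob changes the byte count but not the stored length,
// so unserialize() fails.
$broken = mb_convert_encoding($legacy, 'UTF-8', 'ISO-8859-1');
var_dump(@unserialize($broken));

// Converting the values and serializing again gives a consistent result.
$fixed = reserializeAsUtf8($legacy);
var_dump(unserialize($fixed));
```

This has to run before the column itself is converted, while the stored lengths still match the stored bytes.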

Everything was much more complex than expected. Don’t do this just because you think UTF-8 is cool: it’s difficult and not well supported in PHP, so don’t do it unless you need it. And don’t do it without version control such as CVS.

2 Responses to “Migrating to Unicode”

  1. Filda Says:

    What tools did you use to convert the script and SQL files?

