converting php/mysql/apache app from latin-1 to utf-8
April 27, 2009
these are the notes i wrote to myself as i was preparing to port a big and old app to utf-8. i do not claim they are correct but they worked for me. most of this is not original but derived and condensed from other web pages as noted below. the purpose of this list is as a cheat sheet or to-do list. feel free to leave comments but try to be polite and don’t yell at me if i got something wrong.
wordpress insists on displaying simple single quote and simple double quote characters in random open/close forms in the following. sorry. please ignore and imagine they were all just the simple vertical versions.
useful web sites
- http://www.phpwact.org/php/i18n/utf-8
- http://www.phpwact.org/php/i18n/charsets
- http://www.phpwact.org/php/i18n/utf-8/mysql
- http://devlog.info/2008/08/24/php-and-unicode-utf-8/
- http://www.sitepoint.com/blogs/2006/08/09/scripters-utf-8-survival-guide-slides/
- http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet
- http://www.cs.tut.fi/~jkorpela/chars.html
immediately after opening a mysql connection, either:
- SET NAMES 'utf8';
- or mysql_set_charset('utf8', $connection_handle);
use <form accept-charset="utf-8"> on every form
convert html, php, js, css and other text files
declare css files as utf-8: @charset "UTF-8";
declare linked js files as utf-8 in the html script tag: <script src="..." charset="utf-8">
if using htmlspecialchars, use htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
- use ENT_COMPAT mode so that, e.g., an attribute value containing " doesn't break the html tag when a script writes it.
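a small wrapper keeps those three arguments in one place so nobody forgets them. this is only a sketch: the name h() is made up here and is not part of php or any library, and it assumes the source file itself is saved as utf-8.

```php
<?php
// h() is a hypothetical helper name, not a php built-in.
// it escapes a string for html output, always as utf-8 and with ENT_COMPAT
// (double quotes are escaped, single quotes are left alone).
function h($s) {
    return htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
}

// typical use inside a double-quoted attribute value
$title = 'Iñtërnâtiônàlizætiøn "quoted" <b>';
$html  = '<input type="text" value="' . h($title) . '">';
```

the point of the wrapper is just that every call site gets the same flags and charset.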
add to top of every script ?
- $default_locale = setlocale(LC_ALL, 'en_US.UTF-8');
- ini_set('default_charset', 'UTF-8');
and just before page output PHPLIBtemplates.inc.php:
- header('Content-Type: text/html; charset=utf-8');
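the per-script settings above could be gathered into one function that runs before any output. a minimal sketch, assuming the en_US.UTF-8 locale is installed on the server; the function name utf8_bootstrap() is hypothetical.

```php
<?php
// utf8_bootstrap() is a hypothetical name; call it before any output is sent.
function utf8_bootstrap() {
    // setlocale() returns false when the requested locale is not installed
    $locale = setlocale(LC_ALL, 'en_US.UTF-8');
    ini_set('default_charset', 'UTF-8');
    // header() only has an effect before output has started
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    return $locale;
}

$default_locale = utf8_bootstrap();
```

checking the return value is worthwhile: if it is false, the locale was not set and locale-dependent functions keep their old behaviour.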
in apache config
- AddDefaultCharset utf-8
in php.ini
- mbstring.func_overload=7
- default_charset=UTF-8
- mbstring.internal_encoding=UTF-8
mbstring.func_overload=7 covers ereg and some string functions as listed in mbstring functions and detailed below. many string functions are still not safe.
PCRE
- all pregs need the utf8 u modifier: preg_match('/myregex/u', $str)
- avoid pcre i modifier
- avoid \w \W \b \B
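to see what the u modifier changes, compare how many times the dot matches with and without it. a small sketch, assuming the source file is saved as utf-8:

```php
<?php
$s = 'Iñtërnâtiônàlizætiøn';   // 20 characters, 27 bytes when stored as utf-8

// without /u the dot matches one byte at a time
$bytes = preg_match_all('/./s', $s, $m1);

// with /u the dot matches one whole utf-8 character
$chars = preg_match_all('/./us', $s, $m2);

// with /u, preg functions return false (not 0) when the subject is malformed utf-8
$bad = preg_match('/./u', "\xC3\x28");
```

that false return is why code using /u should compare results with === rather than treating false and 0 alike.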
to find the byte count of a utf-8 string when mbstring.func_overload includes the string-function bit (2), so that strlen() counts characters instead of bytes:
- mb_strlen($utf8_string, 'latin1');
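the same string gives two different lengths depending on the encoding you pass. a sketch, assuming the source file is saved as utf-8; it uses mbstring's '8bit' encoding name rather than 'latin1', which is my substitution, since both treat every byte as one character.

```php
<?php
$s = 'Iñtërnâtiônàlizætiøn';

// character count: decode the string as utf-8
$chars = mb_strlen($s, 'UTF-8');

// byte count: treat every byte as one character
$bytes = mb_strlen($s, '8bit');
```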
to validate form input as utf8, http://devlog.info/2008/08/24/php-and-unicode-utf-8 says
- (strlen($str) AND !preg_match('/^.{1}/us', $str)) // true means bad utf-8
but http://www.phpwact.org/php/i18n/charsets says this cannot be trusted. so use mb_check_encoding() to get a true/false answer
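a quick sketch of the mb_check_encoding() check with one good and two bad inputs. the wrapper name is_valid_utf8() is hypothetical, and the test strings are written as byte escapes so the example does not depend on how the file is saved.

```php
<?php
// is_valid_utf8() is a hypothetical wrapper name, not a php built-in
function is_valid_utf8($s) {
    return mb_check_encoding($s, 'UTF-8');
}

$good = "I\xC3\xB1t\xC3\xABr";   // "Iñtër" written out as utf-8 bytes
$bad  = "I\xC3(t";               // 0xC3 starts a 2-byte sequence but "(" cannot continue it
$old  = "I\xF1t";                // "Iñt" in latin-1: a lone 0xF1 byte is not valid utf-8
```

the latin-1 case is the one to expect during a migration, when old forms or old data still send single-byte accented characters.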
to quietly sanitize utf8 input strings (http://blog.liip.ch/archive/2005/01/24/how-to-get-rid-of-invalid-utf-8-characters.html):
- $s = iconv("UTF-8","UTF-8//IGNORE",$s);
this drops the invalid byte sequences, so the result is safe to use and you don't need to add code to send the form back to the user for re-entry.
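how //IGNORE behaves depends on the iconv library php was built with, and i am not certain it is the same everywhere. as an alternative sketch (this is not the method from the linked page), mb_convert_encoding() can do the scrubbing. the name scrub_utf8() is hypothetical.

```php
<?php
// scrub_utf8() is a hypothetical name. it swaps iconv for mb_convert_encoding,
// which handles each invalid sequence according to the mbstring substitute character.
function scrub_utf8($s) {
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;                      // already valid, return untouched
    }
    $old = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');    // drop invalid sequences instead of inserting "?"
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($old);      // restore the setting
    return $clean;
}
```

exactly which bytes get dropped around a broken sequence may differ between php versions, so only rely on the result being valid utf-8.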
test strings
$strs = array( 'Iñtërnâtiônàlizætiøn', 'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית', 'ايران لا ترى تغييرا في الموقف الأمريكي', '独・米で死傷者を出した銃の乱射事件', '國會預算處公布驚人的赤字數據後', '이며 세계 경제 회복에 걸림돌이 되고 있다', 'В дагестанском лесном массиве южнее села Какашура', 'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่', 'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ', 'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་', 'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το', 'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը', 'რუსეთი ასევე გეგმავს სამხედრო');
to be lazy, sanitize $_GET and $_POST input with
function clean_input(&$a) {
    // recurse into arrays; strip invalid byte sequences from any string that fails the utf-8 check
    if ( isset($a) && is_array($a) && !empty($a) ) {
        foreach ($a as $k => &$v)
            clean_input($v);
    } elseif ( is_string($a) && !mb_check_encoding($a, 'UTF-8') ) {
        $a = iconv('UTF-8', 'UTF-8//IGNORE', $a);
    }
    return true;
}
replacement for strtr()
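function mystrtr($s, $p1, $p2=false) {
    // 3-param form: map each character of $p1 to the character at the same position in $p2
    if ( is_string($p1) && is_string($p2)
        && mb_strlen($p1, 'UTF-8') == mb_strlen($p2, 'UTF-8') ) {
        $t = '';
        $n = mb_strlen($s, 'UTF-8');
        for ( $i=0; $i < $n; $i++ ) {
            // mb_substr, not substr: plain substr returns a single byte unless func_overload is on
            $c = mb_substr($s, $i, 1, 'UTF-8');
            $j = mb_strpos($p1, $c, 0, 'UTF-8');
            $t .= $j === false ? $c : mb_substr($p2, $j, 1, 'UTF-8');
        }
        return $t;
    } elseif ( $p2 === false && is_array($p1) ) {
        // 2-param form: byte-wise strtr() is ok with valid utf-8
        return strtr($s, $p1);
    }
    // only measure parameters that are strings, so the warning itself cannot fail
    trigger_error('mystrtr() called with bad parameters strlen(p1)='
        . (is_string($p1) ? mb_strlen($p1, 'UTF-8') : 'n/a')
        . ' strlen(p2)=' . (is_string($p2) ? mb_strlen($p2, 'UTF-8') : 'n/a'), E_USER_WARNING);
    return $s;
}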
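since the 2-param strtr() is ok with valid utf-8, another way to get the 3-param behaviour is to build the replacement array from the two strings and let strtr() do the work. a sketch, with the hypothetical name utf8_strtr(), assuming the source file is saved as utf-8:

```php
<?php
// utf8_strtr() is a hypothetical name. it splits $from and $to into utf-8
// characters and hands the resulting map to the 2-param strtr().
function utf8_strtr($s, $from, $to) {
    $f = preg_split('//u', $from, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $to, -1, PREG_SPLIT_NO_EMPTY);
    if ($f === false || $t === false || count($f) !== count($t) || count($f) === 0) {
        return $s;   // invalid utf-8, empty map or unequal lengths: leave the input alone
    }
    return strtr($s, array_combine($f, $t));
}

$ascii = utf8_strtr('Iñtërnâtiônàlizætiøn', 'ñëâôàø', 'neaoao');
```

this avoids the character-by-character loop, at the cost of silently returning the input when the parameters are bad instead of raising a warning.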
notes on specific functions learned from own tests, links noted above and in the table
| addcslashes | DO NOT USE |
| addslashes | DO NOT USE |
| chop | see rtrim |
| chr | only use for ascii |
| chunk_split | SUSPECT, probably works on byte strings |
| count_chars | operates on byte strings, use only on ascii or 8859 |
| crc32 | see md5 |
| crypt | see md5 |
| echo | presumably mb-safe? |
| explode | SAFE, but can use preg_split |
| fprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| fscanf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| html_entity_decode | DO NOT USE, see htmlspecialchars |
| htmlentities | DO NOT USE, see htmlspecialchars |
| htmlspecialchars | OK but use htmlspecialchars($s, ENT_COMPAT, 'UTF-8') |
| implode | probably OK? |
| join | same as implode |
| lcfirst | DO NOT USE, mb_convert_case |
| levenshtein | SUSPECT, testing needed |
| localeconv | ? |
| ltrim | OK without a $charlist 2nd param. or use preg_replace('/^\s+/u', '', $s); |
| mb_strtolower | DO NOT USE, confirmed buggy! mb_convert_case($s, MB_CASE_LOWER, "UTF-8") |
| mb_strtoupper | DO NOT USE, confirmed buggy! mb_convert_case($s, MB_CASE_UPPER, "UTF-8") |
| md5_file | probably ok |
| md5 | probably ok, i guess it returns the MD5 of the byte string, as one would want |
| metaphone | SUSPECT |
| money_format | ? |
| nl2br | DO NOT USE, preg_replace('/\n/u', '<br>', $s); |
| number_format | ? |
| ord | only use for ascii |
| parse_str | Use mb_parse_str |
| print | presumably mb-safe? |
| printf | RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020 |
| quotemeta | SUSPECT, preg_replace |
| rtrim | OK without a $charlist 2nd param. or use preg_replace('/\s+$/u', '', $s); |
| setlocale | ALWAYS USE |
| sha1_file | see md5 |
| sha1 | see md5 |
| similar_text | SUSPECT |
| soundex | SUSPECT |
| sprintf | RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020 |
| sscanf | RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020 |
| str_getcsv | OK if locale and LANG set correctly |
| str_ireplace | DO NOT USE, preg_replace |
| str_pad | DO NOT USE |
| str_repeat | SUSPECT |
| str_replace | SAFE, or use preg_replace |
| str_rot13 | DO NOT USE except on 7-bit ascii only |
| str_shuffle | DO NOT USE |
| str_split | > mb_split or use preg_split instead |
| str_word_count | SUSPECT |
| strcasecmp | DO NOT USE |
| strchr | SUSPECT, alias of strstr, use mb_strpos or mb_strstr |
| strcmp | according to comments on php.net, ok if locale is set right |
| strcoll | according to bug reports, ok on posix systems, not windows. but set locale |
| strcspn | DO NOT USE |
| strip_tags | DO NOT USE |
| stripcslashes | DO NOT USE |
| stripos | > mb_stripos |
| stripslashes | DO NOT USE, preg_replace(array('/\x5C(?!\x5C)/u', '/\x5C\x5C/u'), array('','\\'), $s) |
| stristr | > mb_stristr |
| strlen | > mb_strlen, OK unless you need byte length, e.g. to save a file, then use mb_strlen($s, 'latin1'); |
| strnatcasecmp | SUSPECT |
| strnatcmp | SUSPECT |
| strncasecmp | SUSPECT |
| strncmp | SUSPECT |
| strpbrk | SUSPECT, use preg |
| strpos | > mb_strpos |
| strrchr | SUSPECT, use |
| strrev | DO NOT USE |
| strripos | > mb_strripos |
| strrpos | > mb_strrpos |
| strspn | DO NOT USE, use preg_match |
| strstr | > mb_strstr |
| strtok | DO NOT USE |
| strtolower | DO NOT USE. mb_strtolower fails on some cases when mb_convert_case($str, MB_CASE_LOWER, "UTF-8") does not |
| strtoupper | DO NOT USE. mb_strtoupper fails on some cases when mb_convert_case($str, MB_CASE_UPPER, "UTF-8") does not |
| strtr | DO NOT USE with 3-params. 2-param version ok with valid utf-8. |
| substr_compare | DO NOT USE |
| substr_count | > mb_substr_count, or preg_match_all? |
| substr_replace | DO NOT USE |
| substr | > mb_substr, see also mb_strcut & mb_strimwidth |
| trim | OK without a $charlist 2nd param. or preg_replace('/(^\s+)|(\s+$)/', '', $s); |
| ucfirst | DO NOT USE |
| ucwords | DO NOT USE, mb_convert_case($str, MB_CASE_TITLE, "UTF-8") |
| vfprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| vprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| vsprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| wordwrap | SUSPECT |
| urlencode | OK |
| rawurlencode | OK |
| urldecode | SUSPECT |
| rawurldecode | SUSPECT |
| utf8_encode | only use on ascii or 8859-1 |
| utf8_decode | ? |
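
a few of the DO NOT USE rows have no replacement listed. here is a sketch of what replacements for ucfirst and strrev might look like, plus the mb_convert_case calls the table recommends. both helper names are hypothetical, and both assume valid utf-8 input and a source file saved as utf-8.

```php
<?php
// both helper names below are hypothetical, not php built-ins.

// uppercase only the first character (ucfirst is marked DO NOT USE in the table)
function utf8_ucfirst($s) {
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, mb_strlen($s, 'UTF-8'), 'UTF-8');
    return mb_convert_case($first, MB_CASE_UPPER, 'UTF-8') . $rest;
}

// reverse by character rather than by byte (strrev is marked DO NOT USE)
function utf8_strrev($s) {
    return implode('', array_reverse(preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY)));
}

$upper = mb_convert_case('ñandú', MB_CASE_UPPER, 'UTF-8');
$title = mb_convert_case('élan vital', MB_CASE_TITLE, 'UTF-8');
```

run the input through mb_check_encoding() first, because preg_split() with the u modifier returns false on malformed utf-8.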