converting php/mysql/apache app from latin-1 to utf-8

April 27, 2009

these are the notes i wrote to myself as i was preparing to port a big and old app to utf-8. i do not claim they are correct but they worked for me. most of this is not original but derived and condensed from other web pages as noted below. the purpose of this list is as a cheat sheet or to-do list. feel free to leave comments but try to be polite and don’t yell at me if i got something wrong.

wordpress likes to display simple single quote and simple double quote characters in curly open/close forms. sorry. if you see any in the code below, please imagine they are just the simple vertical versions.

useful web sites

immediately after opening a mysql connection, either:

  • SET NAMES 'utf8';
  • or mysql_set_charset('utf8', $connection_handle);

use <form accept-charset="utf-8"> on every form

convert html, php, js, css and other text files

declare css files as utf-8: @charset "UTF-8";

declare linked js files as utf-8 with the charset attribute of the html script tag

if using htmlspecialchars, use htmlspecialchars($s, ENT_COMPAT, 'UTF-8');

  • use ENT_COMPAT mode, e.g. so that if putting attribute values with " into html tags from a script, it won’t screw up.
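a minimal sketch of the call above, wrapped in a helper so the flags and charset are not forgotten. the name h() is my own shorthand for illustration, not something php provides:

```php
<?php
// h() is a hypothetical helper name; it only wraps htmlspecialchars()
// with the flags and charset recommended above.
function h($s) {
    return htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
}

// ENT_COMPAT escapes double quotes (not single quotes), so the value
// is safe inside a double-quoted attribute.
// "\xC3\xB1" is the utf-8 encoding of ñ, written as bytes so this
// example does not depend on the encoding of the source file.
$title = "Se\xC3\xB1or \"Q\" & co";
$attr  = h($title);

echo '<input type="text" value="' . $attr . '">', "\n";
```

the multi-byte ñ passes through untouched; only the quote and the ampersand are escaped.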

add to top of every script ?

  • $default_locale = setlocale(LC_ALL, 'en_US.UTF-8');
  • ini_set('default_charset', 'UTF-8');

and just before page output PHPLIBtemplates.inc.php:

  • header('Content-Type: text/html; charset=utf-8');
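the per-script settings and the header call can be gathered in one place. this is only a sketch: utf8_bootstrap() is a hypothetical name, and whether en_US.UTF-8 is installed depends on the host.

```php
<?php
// utf8_bootstrap() is a hypothetical name for illustration.
function utf8_bootstrap() {
    // setlocale() returns false if the locale is not installed on this host
    $locale = setlocale(LC_ALL, 'en_US.UTF-8');

    ini_set('default_charset', 'UTF-8');

    // headers can only be set before any output has been sent
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    return $locale;
}

$default_locale = utf8_bootstrap();
```

checking the return value of setlocale() is worthwhile, because a missing locale fails silently otherwise.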

in apache config

  • AddDefaultCharset utf-8

in php.ini

  • mbstring.func_overload=7
  • default_charset=UTF-8
  • mbstring.internal_encoding=UTF-8

mbstring.func_overload=7 covers ereg and some string functions as listed in mbstring functions and detailed below. many string functions are still not safe.

PCRE

  • all pregs need the utf8 u modifier: preg_match('/myregex/u', $str)
  • avoid pcre i modifier
  • avoid \w \W \b \B
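a small sketch of what the u modifier changes. the test string is my own; the bytes are written as escapes so the example does not depend on the file encoding.

```php
<?php
// "\xC3\xB1" is the two-byte utf-8 encoding of ñ, so $s is "año"
$s = "a\xC3\xB1o";

// without /u the dot matches one byte, so 3 characters look like 4
$bytes = preg_match_all('/./s', $s);

// with /u the dot matches one whole utf-8 character
$chars = preg_match_all('/./su', $s);

// splitting into characters also needs /u, or the ñ is torn in two
$parts = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
```

here $bytes is 4, $chars is 3, and $parts holds the three characters with the ñ intact.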

to find the byte count of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings:

  • mb_strlen($utf8_string, 'latin1');
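a sketch of the character count versus the byte count. i use the name 'ISO-8859-1' here, which is the encoding that latin1 refers to; the sample string is my own.

```php
<?php
// "naïve": 5 characters, 6 bytes, because ï is encoded as \xC3\xAF
$utf8_string = "na\xC3\xAFve";

// characters, counted as utf-8
$char_count = mb_strlen($utf8_string, 'UTF-8');

// counting in a single-byte encoding gives one "character" per byte
$byte_count = mb_strlen($utf8_string, 'ISO-8859-1');
```

the trick works because every byte is a complete character in a single-byte encoding.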

to validate form input as utf8, http://devlog.info/2008/08/24/php-and-unicode-utf-8 says

  • (strlen($str) AND !preg_match('/^.{1}/us', $str)) // true means bad utf-8

but http://www.phpwact.org/php/i18n/charsets says this cannot be trusted. so use mb_check_encoding() to get a true/false answer
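a minimal sketch of the mb_check_encoding() test. is_valid_utf8() is a hypothetical wrapper name and the sample strings are my own:

```php
<?php
// is_valid_utf8() is a hypothetical wrapper name for illustration
function is_valid_utf8($str) {
    return mb_check_encoding($str, 'UTF-8');
}

$good = "Gr\xC3\xBC\xC3\x9Fe";  // "Grüße" as well-formed utf-8
$bad  = "Gr\xFC\xDFe";          // the same word as latin-1 bytes

$good_ok = is_valid_utf8($good);  // true
$bad_ok  = is_valid_utf8($bad);   // false
```

latin-1 bytes arriving in a form field are exactly the kind of input this catches during a conversion.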

to quietly sanitize utf8 input strings (http://blog.liip.ch/archive/2005/01/24/how-to-get-rid-of-invalid-utf-8-characters.html):

  • $s = iconv("UTF-8", "UTF-8//IGNORE", $s);

this quietly drops bad utf-8 input. the result is safe to use, and you don’t need to add code to send the form back to the user for re-entry.
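iconv’s //IGNORE behavior has varied across php versions and iconv libraries, so test it on your platform. as an alternative, here is a sketch that does the same job with mbstring. this is a swapped-in technique, not the iconv call above, and strip_invalid_utf8() is a hypothetical name.

```php
<?php
// strip_invalid_utf8() is a hypothetical name. it uses mbstring rather
// than iconv, so treat it as an alternative, not a drop-in copy.
function strip_invalid_utf8($s) {
    $previous = mb_substitute_character();
    // drop invalid bytes instead of replacing them with "?"
    mb_substitute_character('none');
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    // restore the caller's setting
    mb_substitute_character($previous);
    return $clean;
}

// \xFF can never appear in well-formed utf-8
$dirty = "caf\xC3\xA9 \xFF ok";
$clean = strip_invalid_utf8($dirty);
```

the valid é survives and the result passes mb_check_encoding().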

test strings

$strs = array(
		'Iñtërnâtiônàlizætiøn',
		'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
		'ايران لا ترى تغييرا في الموقف الأمريكي',
		'独・米で死傷者を出した銃の乱射事件',
		'國會預算處公布驚人的赤字數據後',
		'이며 세계 경제 회복에 걸림돌이 되고 있다',
		'В дагестанском лесном массиве южнее села Какашура',
		'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
		'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
		'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
		'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
		'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
		'რუსეთი ასევე გეგმავს სამხედრო');

to be lazy, sanitize $_GET and $_POST input with

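function clean_input(&$a) {
    if ( is_array($a) ) {
        // recurse into nested arrays such as $_POST['field'][]
        foreach ($a as $k => &$v)
            clean_input($v);
        unset($v);  // break the reference left behind by foreach
    } elseif ( is_string($a) && !mb_check_encoding($a, 'UTF-8') ) {
        // drop invalid byte sequences
        $a = iconv('UTF-8', 'UTF-8//IGNORE', $a);
    }
    return true;
}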

replacement for strtr()

function mystrtr($s, $p1, $p2=false) {
  if ( is_string($p1) && is_string($p2) 
        && mb_strlen($p1, 'UTF-8') == mb_strlen($p2, 'UTF-8') ) {
    // 3-param form: map each character of $p1 to the character
    // at the same position in $p2
    $t = '';
    $n = mb_strlen($s, 'UTF-8');
    for ( $i=0; $i < $n; $i++ ) {
      // mb_substr, not substr: $i counts characters, not bytes
      $c = mb_substr($s, $i, 1, 'UTF-8');
      $j = mb_strpos($p1, $c, 0, 'UTF-8');
      $t .= $j === false ? $c : mb_substr($p2, $j, 1, 'UTF-8');
    }
    return $t;
  } elseif ( $p2 === false && is_array($p1) ) {
    // 2-param form: plain strtr is ok with valid utf-8
    return strtr($s, $p1);
  }
  trigger_error('mystrtr() called with bad parameters strlen(p1)=' . mb_strlen($p1, 'UTF-8') 
    . ' strlen(p2)=' . mb_strlen($p2, 'UTF-8'), E_USER_WARNING);
  return $s;
}

notes on specific functions learned from own tests, links noted above and in the table

addcslashes DO NOT USE
addslashes DO NOT USE
chop see rtrim
chr only use for ascii
chunk_split SUSPECT, probably works on byte strings
count_chars operates on byte strings, use only on ascii or 8859
crc32 see md5
crypt see md5
echo presumably mb-safe?
explode SAFE, but can use preg_split
fprintf DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020
fscanf DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020
html_entity_decode DO NOT USE, see htmlspecialchars
htmlentities DO NOT USE, see htmlspecialchars
htmlspecialchars OK but use htmlspecialchars($s, ENT_COMPAT, 'UTF-8')
implode probably OK?
join same as implode
lcfirst DO NOT USE, mb_convert_case
levenshtein SUSPECT, testing needed
localeconv ?
ltrim OK without a $charlist 2nd param. or use preg_replace('/^\s+/u', '', $s);
mb_strtolower DO NOT USE, confirmed buggy! mb_convert_case($s, MB_CASE_LOWER, "UTF-8")
mb_strtoupper DO NOT USE, confirmed buggy! mb_convert_case($s, MB_CASE_UPPER, "UTF-8")
md5_file probably ok
md5 probably ok, i guess it returns the MD5 of the byte string, as one would want
metaphone SUSPECT
money_format ?
nl2br DO NOT USE, preg_replace('/\n/u', '<br>', $s);
number_format ?
ord only use for ascii
parse_str Use mb_parse_str
print presumably mb-safe?
printf RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020
quotemeta SUSPECT, preg_replace
rtrim OK without a $charlist 2nd param. or use preg_replace('/\s+$/u', '', $s);
setlocale ALWAYS USE
sha1_file see md5
sha1 see md5
similar_text SUSPECT
soundex SUSPECT
sprintf RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020
sscanf RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020
str_getcsv OK if locale and LANG set correctly
str_ireplace DO NOT USE, preg_replace
str_pad DO NOT USE
str_repeat SUSPECT
str_replace SAFE, or use preg_replace
str_rot13 DO NOT USE except on 7-bit ascii only
str_shuffle DO NOT USE
str_split > mb_split or use preg_split instead
str_word_count SUSPECT
strcasecmp DO NOT USE
strchr SUSPECT, use mb_strpos or mb_strstr
strcmp according to comments on php.net, ok if locale is set right
strcoll according to bug reports, ok on posix systems, not windows. but set locale
strcspn DO NOT USE
strip_tags DO NOT USE
stripcslashes DO NOT USE
stripos > mb_stripos
stripslashes DO NOT USE, preg_replace(array('/\x5C(?!\x5C)/u', '/\x5C\x5C/u'), array('','\\'), $s)
stristr > mb_stristr
strlen > mb_strlen, OK unless you need byte length, e.g. to save a file, then use mb_strlen($s, 'latin1');
strnatcasecmp SUSPECT
strnatcmp SUSPECT
strncasecmp SUSPECT
strncmp SUSPECT
strpbrk SUSPECT, use preg
strpos > mb_strpos
strrchr SUSPECT, use
strrev DO NOT USE
strripos > mb_strripos
strrpos > mb_strrpos
strspn DO NOT USE, use preg_match
strstr > mb_strstr
strtok DO NOT USE
strtolower DO NOT USE. mb_strtolower fails on some cases when mb_convert_case($str, MB_CASE_LOWER, "UTF-8") does not
strtoupper DO NOT USE. mb_strtoupper fails on some cases when mb_convert_case($str, MB_CASE_UPPER, "UTF-8") does not
strtr DO NOT USE with 3-params. 2-param version ok with valid utf-8.
substr_compare DO NOT USE
substr_count > mb_substr_count, or preg_match_all?
substr_replace DO NOT USE
substr > mb_substr, see also mb_strcut & mb_strimwidth
trim OK without a $charlist 2nd param. or preg_replace('/(^\s+)|(\s+$)/u', '', $s);
ucfirst DO NOT USE
ucwords DO NOT USE, mb_convert_case($str, MB_CASE_TITLE, "UTF-8")
vfprintf DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020
vprintf DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020
vsprintf DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020
wordwrap SUSPECT
urlencode OK
rawurlencode OK
urldecode SUSPECT
rawurldecode SUSPECT
utf8_encode only use on ascii or 8859-1
utf8_decode ?
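
several rows above point to mb_convert_case for case changes. a small sketch of those calls, with a sample string of my own written as byte escapes:

```php
<?php
// "école normale"; é is \xC3\xA9 in utf-8
$s = "\xC3\xA9cole normale";

// upper case: é becomes É (\xC3\x89)
$upper = mb_convert_case($s, MB_CASE_UPPER, 'UTF-8');

// title case: first letter of each word, including the accented one
$title = mb_convert_case($s, MB_CASE_TITLE, 'UTF-8');
```

passing the encoding explicitly means the result does not depend on mbstring.internal_encoding.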


10 Responses to “converting php/mysql/apache app from latin-1 to utf-8”

  1. Marinkina said

    Time to rename the blog and give it a name related to domains 🙂 maybe that’s enough about them?

  2. thefsb said

    Marinkina, fsb has been my nickname for a very long time, since before the security agency i think you refer to was given that name and i don’t intend to let them steal my name away, nor any Federal Savings Bank, or the FöreningsSparbanken, Fulbright Scholarship Board or any design of front side bus.

  3. Cederash said

    Can’t even find anything to nitpick!

  4. Ferinannnd said

    Probably inspired by conventional thinking? Keep it simple ))

  5. Avertedd said

    I agree that the post turned out well. Good work!

  6. Marc said

    Hi, I found your page thru google. I have big trouble with UTF-8 encoding, php and mysql. Whenever I try to insert $_POST data with accent, they get transformed to latin1 in mysql. The thing is that my database is in utf8_unicode_ci, every mysql variable is set to utf8 encoding.

    What I find also weird, if I update manually data thru phpmyadmin with accent they are ok in mysql and the webpage.

    Any clue?

    • thefsb said

      @Marc: i cannot debug your problem on the basis of the information you gave. but I’m fairly confident that you will find the answer if you systematically work through the checklist on this page. in particular, be sure that your web pages are delivered as utf-8 (check the charset header sent to the user agent), that php’s default_charset is set to utf-8, and that php’s connection to mysql is set to utf-8.

  7. osteombef said

    Amazing, very interesting theme. I am going to write about it also.

  8. Antony said

    I found this page very useful.
    Thanks 🙂

    for full backward compatibility trim should trim off new lines too?

    ltrim
    preg_replace('/^(\s|\r\n|\n)+/u', '', $s);

    rtrim
    preg_replace('/(\s|\r\n|\n)+$/u', '', $s);

    trim
    preg_replace('/(^(\s|\r\n|\n)+)|((\s|\r\n|\n)+$)/', '', $s);

    • thefsb said

      you may well be right. but “for full backward compatibility” with PHP’s trim() the simple thing is to just use PHP’s trim() which, without a second parameter, is safe on utf8 strings.

      a more interesting project would be to write the RE that would trim those fancy Unicode whitespace characters.

      btw: both your REs and my REs for trim could be optimized: there’s no need to capture. and i’m not sure so many parens are really required.
