Example intermediate collation file and Sphinx charset_table file
December 2, 2010
Here’s an example of a human-editable collation table. This is a manually edited version of the file that the first script generated for MySQL’s utf8_general_ci. The second script uses it to generate a Sphinx charset_table:
!	0021
#	0023
%	0025
',’	0027,2019
0	0030
1	0031
2	0032
3	0033
4	0034
5	0035
6	0036
7	0037
8	0038
9	0039
@	0040
A,a,À,Á,Â,Ã,Ä,Å,à,á,â,ã,ä,å,Ā,ā,Ă,ă,Ą,ą	0041,0061,00c0,00c1,00c2,00c3,00c4,00c5,00e0,00e1,00e2,00e3,00e4,00e5,0100,0101,0102,0103,0104,0105
B,b	0042,0062
C,c,Ç,ç,Ć,ć,Ĉ,ĉ,Ċ,ċ,Č,č	0043,0063,00c7,00e7,0106,0107,0108,0109,010a,010b,010c,010d
D,d,Ď,ď	0044,0064,010e,010f
E,e,È,É,Ê,Ë,è,é,ê,ë,Ē,ē,Ĕ,ĕ,Ė,ė,Ę,ę,Ě,ě	0045,0065,00c8,00c9,00ca,00cb,00e8,00e9,00ea,00eb,0112,0113,0114,0115,0116,0117,0118,0119,011a,011b
F,f	0046,0066
G,g,Ĝ,ĝ,Ğ,ğ,Ġ,ġ,Ģ,ģ	0047,0067,011c,011d,011e,011f,0120,0121,0122,0123
H,h,Ĥ,ĥ	0048,0068,0124,0125
I,i,Ì,Í,Î,Ï,ì,í,î,ï,Ĩ,ĩ,Ī,ī,Ĭ,ĭ,Į,į,İ,ı	0049,0069,00cc,00cd,00ce,00cf,00ec,00ed,00ee,00ef,0128,0129,012a,012b,012c,012d,012e,012f,0130,0131
J,j,Ĵ,ĵ	004a,006a,0134,0135
K,k,Ķ,ķ	004b,006b,0136,0137
L,l,Ĺ,ĺ,Ļ,ļ,Ľ,ľ	004c,006c,0139,013a,013b,013c,013d,013e
M,m	004d,006d
N,n,Ñ,ñ,Ń,ń,Ņ,ņ,Ň,ň	004e,006e,00d1,00f1,0143,0144,0145,0146,0147,0148
O,o,Ò,Ó,Ô,Õ,Ö,ò,ó,ô,õ,ö,Ō,ō,Ŏ,ŏ,Ő,ő	004f,006f,00d2,00d3,00d4,00d5,00d6,00f2,00f3,00f4,00f5,00f6,014c,014d,014e,014f,0150,0151
P,p	0050,0070
Q,q	0051,0071
R,r,Ŕ,ŕ,Ŗ,ŗ,Ř,ř	0052,0072,0154,0155,0156,0157,0158,0159
S,s,Ś,ś,Ŝ,ŝ,Ş,ş,Š,š,ſ	0053,0073,015a,015b,015c,015d,015e,015f,0160,0161,017f
T,t,Ţ,ţ,Ť,ť	0054,0074,0162,0163,0164,0165
U,u,Ù,Ú,Û,Ü,ù,ú,û,ü,Ũ,ũ,Ū,ū,Ŭ,ŭ,Ů,ů,Ű,ű,Ų,ų	0055,0075,00d9,00da,00db,00dc,00f9,00fa,00fb,00fc,0168,0169,016a,016b,016c,016d,016e,016f,0170,0171,0172,0173
V,v	0056,0076
W,w,Ŵ,ŵ	0057,0077,0174,0175
X,x	0058,0078
Y,y,Ý,ý,ÿ,Ŷ,ŷ,Ÿ	0059,0079,00dd,00fd,00ff,0176,0177,0178
Z,z,Ź,ź,Ż,ż,Ž,ž	005a,007a,0179,017a,017b,017c,017d,017e
~	007e
Æ,æ	00c6,00e6
Ð,ð	00d0,00f0
Ø,ø	00d8,00f8
Þ,þ	00de,00fe
ß	00df
Đ,đ	0110,0111
Ħ,ħ	0126,0127
Ĳ,ĳ	0132,0133
ĸ	0138
Ŀ,ŀ	013f,0140
Ł,ł	0141,0142
ŉ	0149
Ŋ,ŋ	014a,014b
Œ,œ	0152,0153
Ŧ,ŧ	0166,0167
µ	00b5
And here’s a corresponding Sphinx charset_table from the second script:
charset_table = U+021, U+023, U+025, U+027, U+030..U+039, U+040..U+05a, U+07e, U+0b5, U+0c6, \
U+0d0, U+0d8, U+0de, U+0df, U+110, U+126, U+132, U+138, U+13f, U+141, U+149, U+14a, \
U+166, U+2019->U+027, U+061->U+041, U+0c0->U+041, U+0c1->U+041, U+0c2->U+041, \
U+0c3->U+041, U+0c4->U+041, U+0c5->U+041, U+0e0->U+041, U+0e1->U+041, U+0e2->U+041, \
U+0e3->U+041, U+0e4->U+041, U+0e5->U+041, U+100->U+041, U+101->U+041, U+102->U+041, \
U+103->U+041, U+104->U+041, U+105->U+041, U+062->U+042, U+063->U+043, U+0c7->U+043, \
U+0e7->U+043, U+106->U+043, U+107->U+043, U+108->U+043, U+109->U+043, U+10a->U+043, \
U+10b->U+043, U+10c->U+043, U+10d->U+043, U+064->U+044, U+10e->U+044, U+10f->U+044, \
U+065->U+045, U+0c8->U+045, U+0c9->U+045, U+0ca->U+045, U+0cb->U+045, U+0e8->U+045, \
U+0e9->U+045, U+0ea->U+045, U+0eb->U+045, U+112->U+045, U+113->U+045, U+114->U+045, \
U+115->U+045, U+116->U+045, U+117->U+045, U+118->U+045, U+119->U+045, U+11a->U+045, \
U+11b->U+045, U+066->U+046, U+067->U+047, U+11c->U+047, U+11d->U+047, U+11e->U+047, \
U+11f->U+047, U+120->U+047, U+121->U+047, U+122->U+047, U+123->U+047, U+068->U+048, \
U+124->U+048, U+125->U+048, U+069->U+049, U+0cc->U+049, U+0cd->U+049, U+0ce->U+049, \
U+0cf->U+049, U+0ec->U+049, U+0ed->U+049, U+0ee->U+049, U+0ef->U+049, U+128->U+049, \
U+129->U+049, U+12a->U+049, U+12b->U+049, U+12c->U+049, U+12d->U+049, U+12e->U+049, \
U+12f->U+049, U+130->U+049, U+131->U+049, U+06a->U+04a, U+134->U+04a, U+135->U+04a, \
U+06b->U+04b, U+136->U+04b, U+137->U+04b, U+06c->U+04c, U+139->U+04c, U+13a->U+04c, \
U+13b->U+04c, U+13c->U+04c, U+13d->U+04c, U+13e->U+04c, U+06d->U+04d, U+06e->U+04e, \
U+0d1->U+04e, U+0f1->U+04e, U+143->U+04e, U+144->U+04e, U+145->U+04e, U+146->U+04e, \
U+147->U+04e, U+148->U+04e, U+06f->U+04f, U+0d2->U+04f, U+0d3->U+04f, U+0d4->U+04f, \
U+0d5->U+04f, U+0d6->U+04f, U+0f2->U+04f, U+0f3->U+04f, U+0f4->U+04f, U+0f5->U+04f, \
U+0f6->U+04f, U+14c->U+04f, U+14d->U+04f, U+14e->U+04f, U+14f->U+04f, U+150->U+04f, \
U+151->U+04f, U+070->U+050, U+071->U+051, U+072->U+052, U+154->U+052, U+155->U+052, \
U+156->U+052, U+157->U+052, U+158->U+052, U+159->U+052, U+073->U+053, U+15a->U+053, \
U+15b->U+053, U+15c->U+053, U+15d->U+053, U+15e->U+053, U+15f->U+053, U+160->U+053, \
U+161->U+053, U+17f->U+053, U+074->U+054, U+162->U+054, U+163->U+054, U+164->U+054, \
U+165->U+054, U+075->U+055, U+0d9->U+055, U+0da->U+055, U+0db->U+055, U+0dc->U+055, \
U+0f9->U+055, U+0fa->U+055, U+0fb->U+055, U+0fc->U+055, U+168->U+055, U+169->U+055, \
U+16a->U+055, U+16b->U+055, U+16c->U+055, U+16d->U+055, U+16e->U+055, U+16f->U+055, \
U+170->U+055, U+171->U+055, U+172->U+055, U+173->U+055, U+076->U+056, U+077->U+057, \
U+174->U+057, U+175->U+057, U+078->U+058, U+079->U+059, U+0dd->U+059, U+0fd->U+059, \
U+0ff->U+059, U+176->U+059, U+177->U+059, U+178->U+059, U+07a->U+05a, U+179->U+05a, \
U+17a->U+05a, U+17b->U+05a, U+17c->U+05a, U+17d->U+05a, U+17e->U+05a, U+0e6->U+0c6, \
U+0f0->U+0d0, U+0f8->U+0d8, U+0fe->U+0de, U+111->U+110, U+127->U+126, U+133->U+132, \
U+140->U+13f, U+142->U+141, U+14b->U+14a, U+153->U+152, U+167->U+166
Creating a Sphinx charset_table from a MySQL Collation
December 2, 2010
I have moved this text and the two associated PHP scripts to tom--/Collation-to-Charset-Table on GitHub.
I will not maintain this blog post any longer so please refer to GitHub if you want the latest. (Writing a README file in markdown is so much easier than dealing with effing WordPress anyhow :P)
I have an application that deals with music metadata from all over the world and I therefore use Unicode. I want a search function that native English speakers can use without understanding accents and diacriticals from other languages. MySQL’s utf8_general_ci is ideal. For example the letter “A” in a search key matches any of these:
A,a,À,Á,Â,Ã,Ä,Å,à,á,â,ã,ä,å,Ā,ā,Ă,ă,Ą,ą
The search uses SphinxSearch so I want to configure it to use character matching tables that are compatible with utf8_general_ci. Sphinx’s charset_table allows any character folding to be configured but it isn’t going to be trivial to write down all the rules by hand.
How can this be automated?
The basic idea is that you can dump out any of MySQL’s collations by populating a CHAR(1) column with every character you care about and running
SELECT GROUP_CONCAT(mychar) FROM mytable GROUP BY mychar;
The output can then be processed into charset_table rules for a Sphinx config file.
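As a rough illustration of the populate-and-group idea (not the actual script from the repository), here is a minimal sketch that only builds the SQL statements for a range of code points. The function name collation_dump_sql is hypothetical, mb_chr() needs PHP 7.2 or later, and passing each character as a hex literal with a _utf8 introducer is my assumption for avoiding quoting problems; mytable and mychar are the names used in the query above.

```php
<?php
// Hypothetical helper: build the statements that load every code point in a
// range into a one-column table, then group them by the column's collation.
function collation_dump_sql(int $first, int $last, string $collation = 'utf8_general_ci'): array
{
    $sql = [
        "CREATE TABLE mytable (mychar CHAR(1) CHARACTER SET utf8 COLLATE $collation)",
    ];
    for ($cp = $first; $cp <= $last; $cp++) {
        // mb_chr() turns a code point into its UTF-8 bytes; bin2hex() lets us
        // send those bytes as a hex literal instead of a quoted string.
        $hex = bin2hex(mb_chr($cp, 'UTF-8'));
        $sql[] = "INSERT INTO mytable VALUES (_utf8 X'$hex')";
    }
    $sql[] = 'SELECT GROUP_CONCAT(mychar) FROM mytable GROUP BY mychar';
    return $sql;
}

$statements = collation_dump_sql(0x20, 0x17f);
echo count($statements), " statements\n";
echo $statements[1], "\n";
```

Running the statements against a real server is left out on purpose; the sketch only shows what has to be sent.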
I broke the process into three steps:
- A script generates a human-readable file describing the collation rules
- Manually edit the file to define the exact rules I want Sphinx to use
- A second script turns the edited file into a charset_table definition
The first script takes as input the specification of a MySQL utf8 collation and a numeric range of Unicode code points. It creates the table, populates it, and runs the SELECT query (in the style above) to generate the human-readable output file. For example, if it is working on utf8_general_ci from 0x20 to 0x17f then the output would look like this:
 	0020
!	0021
"	0022
#	0023
$	0024
%	0025
…
=	003d
>	003e
?	003f
@	0040
A,a,À,Á,Â,Ã,Ä,Å,à,á,â,ã,ä,å,Ā,ā,Ă,ă,Ą,ą	0041,0061,00c0,00c1,00c2,00c3,00c4,00c5,00e0,00e1,00e2,00e3,00e4,00e5,0100,0101,0102,0103,0104,0105
B,b	0042,0062
C,c,Ç,ç,Ć,ć,Ĉ,ĉ,Ċ,ċ,Č,č	0043,0063,00c7,00e7,0106,0107,0108,0109,010a,010b,010c,010d
D,d,Ď,ď	0044,0064,010e,010f
…
W,w,Ŵ,ŵ	0057,0077,0174,0175
X,x	0058,0078
Y,y,Ý,ý,ÿ,Ŷ,ŷ,Ÿ	0059,0079,00dd,00fd,00ff,0176,0177,0178
Z,z,Ź,ź,Ż,ż,Ž,ž	005a,007a,0179,017a,017b,017c,017d,017e
[	005b
\	005c
]	005d
^	005e
…
Ł,ł	0141,0142
ŉ	0149
Ŋ,ŋ	014a,014b
Œ,œ	0152,0153
Ŧ,ŧ	0166,0167
µ	00b5
Each line in the file represents a set of characters the collation treats as equivalent. A line has one or more characters (comma separated) followed by a tab followed by those characters’ respective Unicode code points.
With an understanding of how the second script works, I can edit the file to get the Sphinx charset_table rules I want.
A line with only one character will be translated to a singleton (i.e. terminal) character in the charset_table. For example in the last line above, µ will become “U+00b5” standing on its own in the charset_table with “->” neither before nor after it.
A line with two or more characters does two things. First, the leftmost character in the set will become a singleton. Then all the characters to the right of the first character will be folded to that first character. For example, take the line:
D,d,Ď,ď 0044,0064,010e,010f
This will produce the following charset_table rules:
U+0044, U+0064->U+0044, U+010e->U+0044, U+010f->U+0044
When I was editing my file I first reviewed the folding rules (the lines with more than one character) to see that they made sense. Then I carefully thought about all the characters I didn’t want Sphinx to index at all and deleted those lines from the file. For example, in a song’s artist name field I want !, ' and ’ indexed but not ", $ or ?. Finally, thinking about the equivalence of “O'brien” and “O’brien”, I replaced the singleton ' line with this:
',’ 0027,2019
The second script then reads the edited file and generates the rules as I described.
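To make the per-line rule concrete, here is a minimal sketch of it. This is not the script from the repository: the function name line_to_rules is hypothetical, it assumes the tab-separated format described above, and it does not collapse consecutive singletons into ranges such as U+030..U+039.

```php
<?php
// Hypothetical helper: turn one "characters<TAB>codepoints" line into rules.
// The first code point becomes a singleton, the rest are folded onto it.
function line_to_rules(string $line): array
{
    $fields = explode("\t", rtrim($line, "\r\n"));
    if (count($fields) < 2) {
        return [];                       // not a collation line
    }
    // Only the code point column matters; the characters are there for humans.
    $codes = array_map('trim', explode(',', end($fields)));
    $first = array_shift($codes);
    $rules = ['U+' . $first];
    foreach ($codes as $code) {
        $rules[] = sprintf('U+%s->U+%s', $code, $first);
    }
    return $rules;
}

echo implode(', ', line_to_rules("D,d,Ď,ď\t0044,0064,010e,010f")), "\n";
// prints: U+0044, U+0064->U+0044, U+010e->U+0044, U+010f->U+0044
```

Reading the code point column rather than the characters means a line for the comma character itself cannot confuse the parser.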
The two scripts (in PHP) are here and here at tom--/Collation-to-Charset-Table on GitHub. Feel free to play with them. The output of the first can be piped into the second but I feel that the manual editing step of the process is important.
MySQL collation to Sphinx charset_table conversion by thefsb/tom-- is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
converting php/mysql/apache app from latin-1 to utf-8
April 27, 2009
these are the notes i wrote to myself as i was preparing to port a big and old app to utf-8. i do not claim they are correct but they worked for me. most of this is not original but derived and condensed from other web pages as noted below. the purpose of this list is as a cheat sheet or to-do list. feel free to leave comments but try to be polite and don’t yell at me if i got something wrong.
wordpress tends to display simple single quote and simple double quote characters in curly open/close forms. sorry. if you see any in the code below, please imagine they were the simple vertical versions.
useful web sites
- http://www.phpwact.org/php/i18n/utf-8
- http://www.phpwact.org/php/i18n/charsets
- http://www.phpwact.org/php/i18n/utf-8/mysql
- http://devlog.info/2008/08/24/php-and-unicode-utf-8/
- http://www.sitepoint.com/blogs/2006/08/09/scripters-utf-8-survival-guide-slides/
- http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet
- http://www.cs.tut.fi/~jkorpela/chars.html
immediately after opening a mysql connection, either:
- SET NAMES 'utf8';
- or mysql_set_charset('utf8', $connection_handle); (with the mysqli extension: mysqli_set_charset($connection_handle, 'utf8');)
use <form accept-charset="utf-8"> on every form
convert html, php, js, css and other text files
declare css files as utf-8: @charset "UTF-8";
declare linked js files in html tag as utf-8
if using htmlspecialchars, use htmlspecialchars($s, ENT_COMPAT, 'UTF-8');
- use ENT_COMPAT mode, e.g. so that if putting attribute values with " into html tags from a script, it won’t screw up.
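a small sketch of why ENT_COMPAT matters for attribute values. the variable name and the sample string are made up for illustration.

```php
<?php
// a value containing a double quote and a tag-like fragment
$title = 'say "hi" <b>';

// ENT_COMPAT escapes double quotes (but not single quotes), so the value
// cannot break out of a double-quoted html attribute
$safe = htmlspecialchars($title, ENT_COMPAT, 'UTF-8');

echo '<input value="' . $safe . '">', "\n";
// prints: <input value="say &quot;hi&quot; &lt;b&gt;">
```

multi-byte utf-8 characters pass through unchanged as long as the third parameter is 'UTF-8' and the input is valid.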
add to top of every script ?
- $default_locale = setlocale(LC_ALL, 'en_US.UTF-8');
- ini_set('default_charset', 'UTF-8');
and just before page output PHPLIBtemplates.inc.php:
- header('Content-Type: text/html; charset=utf-8');
in apache config
- AddDefaultCharset utf-8
in php.ini
- mbstring.func_overload=7
- default_charset=UTF-8
- mbstring.internal_encoding=UTF-8
mbstring.func_overload=7 covers ereg and some string functions as listed in mbstring functions and detailed below. many string functions are still not safe.
PCRE
- all pregs need the utf8 u modifier: preg_match('/myregex/u', $str)
- avoid pcre i modifier
- avoid \w \W \b \B
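a quick sketch of what the u modifier changes. without it pcre matches bytes, with it pcre matches whole utf-8 characters and refuses invalid input. the sample strings are only for illustration.

```php
<?php
$e_acute = "\u{00E9}";        // é, two bytes in utf-8

// with /u the dot matches one whole character
var_dump(preg_match('/^.$/u', $e_acute));   // int(1)

// without /u the dot matches a single byte, so a 2-byte character fails
var_dump(preg_match('/^.$/', $e_acute));    // int(0)

// with /u, invalid utf-8 makes preg_match return false, not 0
$bad = "\xC3\x28";
var_dump(preg_match('/./u', $bad));         // bool(false)
```

because a failed match (0) and bad input (false) are different results, compare with === when the distinction matters.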
to find the byte count of a multi-byte string when you are using mbstring.func_overload 2 and UTF-8 strings:
- mb_strlen($utf8_string, 'latin1');
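a minimal sketch of character count versus byte count. it uses the '8bit' encoding name, which is my substitution for 'latin1' above; both treat every byte as one character.

```php
<?php
$s = "na\u{00EF}ve";                 // naïve: 5 characters, ï takes 2 bytes

$chars = mb_strlen($s, 'UTF-8');     // counts characters
$bytes = mb_strlen($s, '8bit');      // counts bytes, overload or not

echo "$chars characters, $bytes bytes\n";
// prints: 5 characters, 6 bytes
```

asking mbstring explicitly for a single-byte encoding gives the byte length whether or not strlen is overloaded.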
to validate form input as utf8, http://devlog.info/2008/08/24/php-and-unicode-utf-8 says
- (strlen($str) AND !preg_match('/^.{1}/us', $str)) // true means bad utf-8
but http://www.phpwact.org/php/i18n/charsets says this cannot be trusted. so use mb_check_encoding() to get a true/false answer
to quietly sanitize utf8 input strings (http://blog.liip.ch/archive/2005/01/24/how-to-get-rid-of-invalid-utf-8-characters.html):
- $s = iconv("UTF-8", "UTF-8//IGNORE", $s);
which quietly deals with bad utf-8 input. it’s safe to use the result but it doesn’t require adding code to send the form back to the users for re-entry.
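for the strict alternative, here is a minimal sketch of the mb_check_encoding() test. the wrapper name is_valid_utf8 and the sample strings are hypothetical.

```php
<?php
// hypothetical wrapper: true when $s is well-formed utf-8
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

$good = "caf\u{00E9}";   // é encoded as utf-8 (C3 A9)
$bad  = "caf\xE9";       // é encoded as latin-1, not valid utf-8

var_dump(is_valid_utf8($good));   // bool(true)
var_dump(is_valid_utf8($bad));    // bool(false)
```

a false result is the point where you either send the form back or fall through to the quiet iconv cleanup.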
test strings
$strs = array(
    'Iñtërnâtiônàlizætiøn',
    'החמאס: רוצים להשלים את עסקת שליט במהירות האפשרית',
    'ايران لا ترى تغييرا في الموقف الأمريكي',
    '独・米で死傷者を出した銃の乱射事件',
    '國會預算處公布驚人的赤字數據後',
    '이며 세계 경제 회복에 걸림돌이 되고 있다',
    'В дагестанском лесном массиве южнее села Какашура',
    'นายประสิทธิ์ รุ่งสะอาด ปลัดเทศบาล รักษาการแทนนายกเทศมนตรี ต.ท่าทองใหม่',
    'ભારતીય ટીમનો સુવર્ણ યુગ : કિવીઝમાં પણ કમાલ',
    'ཁམས་དཀར་མཛེས་ས་ཁུལ་དུ་རྒྱ་གཞུང་ལ་ཞི་བའི་ངོ་རྒོལ་',
    'Χιόνια, βροχές και θυελλώδεις άνεμοι συνθέτουν το',
    'Հայաստանում սկսվել է դատական համակարգի ձեւավորումը',
    'რუსეთი ასევე გეგმავს სამხედრო'
);
to be lazy, sanitize $_GET and $_POST input with
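function clean_input(&$a) {
    if ( isset($a) && is_array($a) && !empty($a) ) {
        // recurse by reference so nested arrays are cleaned in place
        foreach ($a as $k => &$v)
            clean_input($v);
        unset($v);
    } elseif ( is_string($a) && !mb_check_encoding($a, 'UTF-8') ) {
        // drop the byte sequences that are not valid utf-8
        $a = iconv('UTF-8', 'UTF-8//IGNORE', $a);
    }
    return true;
}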
replacement for strtr()
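function mystrtr($s, $p1, $p2 = false) {
    // 3-param form: map the i-th character of $p1 to the i-th character of $p2
    if ( is_string($p1) && is_string($p2)
        && mb_strlen($p1, 'UTF-8') == mb_strlen($p2, 'UTF-8') ) {
        $t = '';
        $n = mb_strlen($s, 'UTF-8');
        for ( $i = 0; $i < $n; $i++ ) {
            // mb_substr, not substr: plain substr() only picks whole
            // characters here when mbstring.func_overload is switched on
            $c = mb_substr($s, $i, 1, 'UTF-8');
            $j = mb_strpos($p1, $c, 0, 'UTF-8');
            $t .= $j === false ? $c : mb_substr($p2, $j, 1, 'UTF-8');
        }
        return $t;
    } elseif ( $p2 === false && is_array($p1) ) {
        // 2-param form: strtr() with replacement pairs is ok on valid utf-8
        return strtr($s, $p1);
    }
    // report types, not lengths: mb_strlen() cannot take an array argument
    trigger_error('mystrtr() called with bad parameters p1=' . gettype($p1)
        . ' p2=' . gettype($p2), E_USER_WARNING);
    return $s;
}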
notes on specific functions learned from own tests, links noted above and in the table
| addcslashes | DO NOT USE |
| addslashes | DO NOT USE |
| chop | see rtrim |
| chr | only use for ascii |
| chunk_split | SUSPECT, probably works on byte strings |
| count_chars | operates on byte strings, use only on ascii or 8859 |
| crc32 | see md5 |
| crypt | see md5 |
| echo | presumably mb-safe? |
| explode | SAFE, but can use preg_split |
| fprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| fscanf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| html_entity_decode | DO NOT USE, see htmlspecialchars |
| htmlentities | DO NOT USE, see htmlspecialchars |
| htmlspecialchars | OK but use htmlspecialchars($s, ENT_COMPAT, 'UTF-8') |
| implode | probably OK? |
| join | same as implode |
| lcfirst | DO NOT USE, mb_convert_case |
| levenshtein | SUSPECT, testing needed |
| localeconv | ? |
| ltrim | OK without a $charlist 2nd param. or use preg_replace('/^\s+/u', '', $s); |
| mb_strtolower | DO NOT USE, confirmed buggy! mb_convert_case($s, MB_CASE_LOWER, "UTF-8") |
| mb_strtoupper | DO NOT USE, confirmed buggy! mb_convert_case($s, MB_CASE_UPPER, "UTF-8") |
| md5_file | probably ok |
| md5 | probably ok, i guess it returns the MD5 of the byte string, as one would want |
| metaphone | SUSPECT |
| money_format | ? |
| nl2br | DO NOT USE, preg_replace('/\n/u', '<br>', $s); |
| number_format | ? |
| ord | only use for ascii |
| parse_str | Use mb_parse_str |
| print | presumably mb-safe? |
| printf | RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020 |
| quotemeta | SUSPECT, preg_replace |
| rtrim | OK without a $charlist 2nd param. or use preg_replace('/\s+$/u', '', $s); |
| setlocale | ALWAYS USE |
| sha1_file | see md5 |
| sha1 | see md5 |
| similar_text | SUSPECT |
| soundex | SUSPECT |
| sprintf | RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020 |
| sscanf | RISKY. ONLY use on 7-bit ascii, http://www.php.net/manual/en/function.sprintf.php#89020 |
| str_getcsv | OK if locale and LANG set correctly |
| str_ireplace | DO NOT USE, preg_replace |
| str_pad | DO NOT USE |
| str_repeat | SUSPECT |
| str_replace | SAFE, or use preg_replace |
| str_rot13 | DO NOT USE except on 7-bit ascii only |
| str_shuffle | DO NOT USE |
| str_split | > preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY). note mb_split splits on a regex pattern, it is not a drop-in replacement |
| str_word_count | SUSPECT |
| strcasecmp | DO NOT USE |
| strchr | SUSPECT, alias of strstr so use mb_strstr (or mb_strpos if you only need the position) |
| strcmp | according to comments on php.net, ok if locale is set right |
| strcoll | according to bug reports, ok on posix systems, not windows. but set locale |
| strcspn | DO NOT USE |
| strip_tags | DO NOT USE |
| stripcslashes | DO NOT USE |
| stripos | > mb_stripos |
| stripslashes | DO NOT USE, preg_replace(array('/\x5C(?!\x5C)/u', '/\x5C\x5C/u'), array('', '\\'), $s) |
| stristr | > mb_stristr |
| strlen | > mb_strlen, OK unless you need byte length, e.g. to save a file, then use mb_strlen($s, 'latin1'); |
| strnatcasecmp | SUSPECT |
| strnatcmp | SUSPECT |
| strncasecmp | SUSPECT |
| strncmp | SUSPECT |
| strpbrk | SUSPECT, use preg |
| strpos | > mb_strpos |
| strrchr | SUSPECT, use |
| strrev | DO NOT USE |
| strripos | > mb_strripos |
| strrpos | > mb_strrpos |
| strspn | DO NOT USE, use preg_match |
| strstr | > mb_strstr |
| strtok | DO NOT USE |
| strtolower | DO NOT USE. mb_strtolower fails on some cases when mb_convert_case($str, MB_CASE_LOWER, "UTF-8") does not |
| strtoupper | DO NOT USE. mb_strtoupper fails on some cases when mb_convert_case($str, MB_CASE_UPPER, "UTF-8") does not |
| strtr | DO NOT USE with 3-params. 2-param version ok with valid utf-8. |
| substr_compare | DO NOT USE |
| substr_count | > mb_substr_count, or preg_match_all? |
| substr_replace | DO NOT USE |
| substr | > mb_substr, see also mb_strcut & mb_strimwidth |
| trim | OK without a $charlist 2nd param. or preg_replace('/(^\s+)|(\s+$)/u', '', $s); |
| ucfirst | DO NOT USE |
| ucwords | DO NOT USE, mb_convert_case($str, MB_CASE_TITLE, "UTF-8") |
| vfprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| vprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| vsprintf | DO NOT USE, http://www.php.net/manual/en/function.sprintf.php#89020 |
| wordwrap | SUSPECT |
| urlencode | OK |
| rawurlencode | OK |
| urldecode | SUSPECT |
| rawurldecode | SUSPECT |
| utf8_encode | only use on ascii or 8859-1 |
| utf8_decode | ? |
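the table says not to use ucfirst and ucwords, so here is a minimal sketch of mb_convert_case-based stand-ins. the helper name utf8_ucfirst is hypothetical and the sample strings are only for illustration.

```php
<?php
// hypothetical replacement for ucfirst(): upper-case the first character
// (not the first byte) and leave the rest of the string untouched
function utf8_ucfirst(string $s): string
{
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, null, 'UTF-8');
    return mb_convert_case($first, MB_CASE_UPPER, 'UTF-8') . $rest;
}

$word = "\u{00E9}cole";                       // école

echo utf8_ucfirst($word), "\n";               // École
echo mb_convert_case("$word de musique", MB_CASE_TITLE, 'UTF-8'), "\n";
// École De Musique
```

MB_CASE_TITLE also lower-cases the rest of each word, which is not quite what ucwords does, so check that it suits your data.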