Creating a Sphinx charset_table from a MySQL Collation

December 2, 2010

I have moved this text and the two associated PHP scripts to tom--/Collation-to-Charset-Table on GitHub.

I will not maintain this blog post any longer, so please refer to GitHub if you want the latest. (Writing a README file in Markdown is so much easier than dealing with effing WordPress anyhow :P)

I have an application that deals with music metadata from all over the world, so I use Unicode. I want a search function that native English speakers can use without understanding accents and diacriticals from other languages. MySQL’s utf8_general_ci collation is ideal for this. For example, the letter “A” in a search key matches any of these:

A,a,À,Á,Â,Ã,Ä,Å,à,á,â,ã,ä,å,Ā,ā,Ă,ă,Ą,ą

The search uses SphinxSearch, so I want to configure it to use character matching tables that are compatible with utf8_general_ci. Sphinx’s charset_table allows any character folding to be configured, but it isn’t trivial to write down all the rules by hand.

How can this be automated?

The basic idea is that you can dump out any of MySQL’s collations by populating a CHAR(1) column with every character you care about and then running:

SELECT GROUP_CONCAT(mychar) FROM mytable GROUP BY mychar;

The output can then be processed into charset_table rules for a Sphinx config file.
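Spelled out a little more, that could look like the following. The table and column names are only placeholders, and HEX(CONVERT(... USING ucs2)) is one way to recover the Unicode code points alongside the characters:

CREATE TABLE mytable (
    mychar CHAR(1) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL
);

-- ...populate mychar with one row per code point in the range of interest, then:

SELECT GROUP_CONCAT(mychar) AS chars,
       GROUP_CONCAT(LOWER(HEX(CONVERT(mychar USING ucs2)))) AS codes
FROM mytable
GROUP BY mychar;

Because mychar carries the utf8_general_ci collation, GROUP BY puts all the characters the collation treats as equal into one group.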

I broke the process into three steps:

  • A first script generates a human-readable file describing the collation rules
  • I manually edit the file to define the exact rules I want Sphinx to use
  • A second script turns the edited file into a charset_table definition

The first script takes as input a specification of a MySQL utf8 collation and a numeric range of Unicode code points. It creates the table, populates it, and runs the SELECT query (in the style above) to generate the human-readable output file. For example, working on utf8_general_ci from 0x20 to 0x17f, the output looks like this:

0020
!        0021
"        0022
#        0023
$        0024
%        0025
…
=        003d
>        003e
?        003f
@        0040
A,a,À,Á,Â,Ã,Ä,Å,à,á,â,ã,ä,å,Ā,ā,Ă,ă,Ą,ą 0041,0061,00c0,00c1,00c2,00c3,00c4,00c5,00e0,00e1,00e2,00e3,00e4,00e5,0100,0101,0102,0103,0104,0105
B,b      0042,0062
C,c,Ç,ç,Ć,ć,Ĉ,ĉ,Ċ,ċ,Č,č               0043,0063,00c7,00e7,0106,0107,0108,0109,010a,010b,010c,010d
D,d,Ď,ď  0044,0064,010e,010f
…
W,w,Ŵ,ŵ  0057,0077,0174,0175
X,x      0058,0078
Y,y,Ý,ý,ÿ,Ŷ,ŷ,Ÿ        0059,0079,00dd,00fd,00ff,0176,0177,0178
Z,z,Ź,ź,Ż,ż,Ž,ž        005a,007a,0179,017a,017b,017c,017d,017e
[        005b
\        005c
]        005d
^        005e
…
Ł,ł      0141,0142
ʼn        0149
Ŋ,ŋ      014a,014b
Œ,œ      0152,0153
Ŧ,ŧ      0166,0167
µ        00b5

Each line in the file represents a set of characters that the collation treats as equivalent. A line has one or more characters (comma separated), followed by a tab, followed by those characters’ respective Unicode code points.
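For the curious, here is a minimal sketch of that first step. It is not the script from the repo, just the shape of it, with made-up connection details and table name:

<?php
// Minimal sketch of the first step (not the script in the repo): create a
// one-column table with the chosen collation, fill it with a range of code
// points, and let GROUP BY cluster the characters the collation treats as equal.

$collation = 'utf8_general_ci';
$first = 0x20;    // first code point to dump
$last  = 0x17f;   // last code point to dump

$db = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("SET NAMES 'utf8' COLLATE '$collation'");   // set connection charset

$db->exec('DROP TABLE IF EXISTS collation_dump');
$db->exec("CREATE TABLE collation_dump (
    mychar CHAR(1) CHARACTER SET utf8 COLLATE $collation NOT NULL
)");

// one row per code point in the range
$insert = $db->prepare('INSERT INTO collation_dump (mychar) VALUES (?)');
for ($i = $first; $i <= $last; $i++) {
    $insert->execute(array(iconv('UCS-4BE', 'UTF-8', pack('N', $i))));
}

// one output line per equivalence class: characters, a tab, then the code points
// (for big ranges you may need to raise MySQL's group_concat_max_len)
$rows = $db->query(
    "SELECT GROUP_CONCAT(mychar) AS chars,
            GROUP_CONCAT(LOWER(HEX(CONVERT(mychar USING ucs2)))) AS codes
     FROM collation_dump
     GROUP BY mychar"
);
foreach ($rows as $row) {
    echo $row['chars'], "\t", $row['codes'], "\n";
}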

With an understanding of how the second script works, I can edit the file to get the Sphinx charset_table rules I want.

A line with only one character will be translated to a singleton (i.e. terminal) character in the charset_table. For example, in the last line above, µ will become “U+00b5” standing on its own in the charset_table, with “->” neither before nor after it.

A line with two or more characters does two things. First, the leftmost character in the set becomes a singleton. Then all the characters to the right of it are folded to that first character. For example, take the line:

D,d,Ď,ď  0044,0064,010e,010f

This will produce the following charset_table rules:

U+0044, U+0064->U+0044, U+010e->U+0044, U+010f->U+0044
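As a sketch (again, not the repo’s code, just the translation it performs), turning one such line into rules might look like this:

<?php
// Sketch of the line-to-rules translation: the first code point on a line
// becomes a singleton, every later code point folds onto it.

function line_to_rules($line)
{
    list(, $codes) = explode("\t", rtrim($line, "\n"));  // keep only the code point list
    $codes = explode(',', $codes);
    $target = 'U+' . $codes[0];

    $rules = array($target);                          // singleton for the first character
    foreach (array_slice($codes, 1) as $code) {
        $rules[] = 'U+' . $code . '->' . $target;     // fold the rest onto it
    }
    return implode(', ', $rules);
}

echo line_to_rules("D,d,Ď,ď\t0044,0064,010e,010f"), "\n";
// prints: U+0044, U+0064->U+0044, U+010e->U+0044, U+010f->U+0044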

When I was editing my file, I first reviewed the folding rules (the lines with more than one character) to check that they made sense. Then I thought carefully about all the characters I didn’t want Sphinx to index at all and deleted those lines from the file. For example, in a song’s artist name field there is punctuation I want indexed (such as !) and punctuation I don’t (such as $ or ?). Finally, thinking about the equivalence of “O'brien” typed with an ASCII apostrophe (U+0027) and “O’brien” typed with a right single quotation mark (U+2019), I replaced the apostrophe’s singleton line with this:

',’   0027,2019
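Given the translation described above, that line comes out as:

U+0027, U+2019->U+0027

so both kinds of apostrophe are indexed as the same character.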

The second script then reads the edited file and generates the rules as I described.
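For context, the generated list ends up as the charset_table setting of an index in sphinx.conf, roughly like the excerpt below. The index name, source and path are placeholders, only the first few rules are shown, and charset_type = utf-8 applies to Sphinx versions of that era:

index tracks
{
    source        = tracks
    path          = /var/data/tracks
    charset_type  = utf-8
    charset_table = U+0027, U+2019->U+0027, U+0041, U+0061->U+0041, \
                    U+00c0->U+0041, U+00c1->U+0041, U+0042, U+0062->U+0042
}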

The two scripts (in PHP) are at tom--/Collation-to-Charset-Table on GitHub. Feel free to play with them. The output of the first can be piped into the second, but I feel that the manual editing step is important. The scripts (“MySQL collation to Sphinx charset_table conversion” by thefsb/tom--) are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


8 Responses to “Creating a Sphinx charset_table from a MySQL Collation”

  1. jojo said

    Hi thefsb.
    I’ve tried your first script. And I’ve got an error on line 33.

    —————

    for ( $i = $first; $i query(
    “SET NAMES ‘utf8’ COLLATE ‘$collation’;”); // set conection charset

    —————-

    Are there any lines missing?
    Thanks

    • thefsb said

      jojo, thanks for pointing that out. the bit you quote is obviously wrong. i had a horrible time trying to make wordpress display code in some sensible manner. i gave up in the end and moved the scripts to pasteboard.com. i edited the post and the links are at the end of it.

      in the mean time i also modified the first script so that it would handle much bigger character ranges without running out of memory. the code is simpler now too. but you have to specify the ranges in the script rather than on the command line.

      • jojo said

        Hi thefsb.

        the 2 scripts that you posted at pasteboard.com are the same file. Would you mind checking them, please?

        Thanks.

  2. thefsb said

    jojo, I think I have fixed the link. I expect you’ll find some more troubles so please do mention them here.

  3. Clay said

    I don’t quite understand how this works. Am I dumping the mysql column to a file and piping it to the php?
    e.g.:
    echo "SELECT GROUP_CONCAT(msg) FROM logs GROUP BY msg" | mysql mydatabase > outfile

    cat outfile | ./1.php

    That doesn’t produce anything…
    I have a tool that uses Sphinx but it is used by people all around the world. I was trying to see if I could let them generate a charset table based on messages in their database.

    P.S. Thanks for writing and posting this!

  4. arjones said

    Hi thefsb, it’s wonderful to find information about charset folding for Sphinx. It seems everybody ran into the same problem but wasn’t satisfied with the solutions.

    I’m trying to use your scripts and I think there is a problem with the pastie reference: both files seem quite similar and neither of them connects to MySQL to obtain the charset.

    If the scripts are correct, could you include the arguments we have to use, as an example?

    Thanks a lot!

    • thefsb said

      Hi arjones,

      You are quite right. And I was so embarrassed that I had failed to fix this more than once already that I went and did it right. Please take a look at tom--/Collation-to-Charset-Table on GitHub.

      I will update the blog post now to refer to GitHub. I converted this blog post to the repo’s README.mkd which I can maintain much more easily than the blog.
