CosmoCode is a Berlin-based IT service provider focusing on CMS, Wikis and Web 2.0
This post is basically a survival guide for myself, should I ever again be up against the evil called “Unicode Support” in Perl 5.8+.
Perl's idea of UTF-8 support is that scalars now have an internal flag that determines whether their content is UTF-8 or not. When the flag is off, the content is nominally assumed to be 7-bit ASCII, but basically it is just treated as bytes.
Now, if you concatenate two differently flagged scalars, for example, Perl will first convert the “off-flagged” scalar to UTF-8, treating each of its bytes as a Latin-1 character. And this is when it gets ugly, because the off-flagged string may very well contain UTF-8 encoded data already; the flag just wasn't set correctly. The result is double-encoded text.
But it doesn't stop there. No Perl script is an island: there's always input and output. How does Perl know if it is UTF-8 or not? The sad thing is that it tries to guess. And as we all know, things get really ugly when software tries to guess.
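The concatenation problem is easy to reproduce. The following is a minimal sketch, not taken from a real application: the variable names are made up, and the byte values are simply the UTF-8 encodings of “ä” and “ö”.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the raw UTF-8 bytes for "ä" (0xC3 0xA4), as they
# would arrive from a source that was read without any decoding.
my $bytes = "\x{C3}\x{A4}";

# A properly decoded character string holding the single character "ö".
my $chars = decode('UTF-8', "\x{C3}\x{B6}");

# Concatenation upgrades $bytes as if it were Latin-1: each byte becomes
# its own character instead of the two bytes forming one "ä".
my $mixed = $bytes . $chars;

# Three characters instead of the expected two.
printf "length: %d\n", length $mixed;
printf "code points: %s\n",
    join ' ', map { sprintf('U+%04X', ord $_) } split //, $mixed;
```

Decoding `$bytes` before the concatenation would give a two-character string instead.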
So here is the main trick to stay sane with Perl's UTF-8 support: never let it guess, always make sure everything is encoded and flagged as UTF-8 internally!
You sometimes need a simple string containing non-ASCII chars. The simplest way to achieve that is to write your code in UTF-8. Use a UTF-8 capable editor and terminal and just write your string in your natural language. But tell Perl about it!
To do so, just use the “use utf8” pragma:
    use utf8; # this script is written in UTF-8
    my $string = "Äußerst süße Töchter aus Ölüberschußländern in Übergröße";

Again, in theory Perl should guess whether your terminal provides UTF-8 or not and recode input and output accordingly. For me that never works reliably. So just tell Perl what encoding your streams use with binmode:

    # treat all input and output as UTF-8 and set the flags correctly
    binmode STDOUT, ":utf8";
    binmode STDERR, ":utf8";
    binmode STDIN,  ":utf8";

The above should of course also work with other encodings. Perl will then recode them to UTF-8 internally.
Even if your terminal, editor and script are in UTF-8, the files you read might not be. Telling Perl the correct encoding will again automatically recode them and set the UTF-8 flag:

    # read from a latin1 encoded file
    open my $fh, "<:encoding(iso-8859-1)", 'test.latin1.txt' or die "open failed: $!";
    my $latin1 = <$fh>;
    close $fh;

    # read from a UTF-8 encoded file
    open $fh, "<:encoding(utf8)", 'test.utf8.txt' or die "open failed: $!";
    my $utf8 = <$fh>;
    close $fh;

Both scalars, $utf8 and $latin1, now contain valid UTF-8 encoded text with Perl's internal flag enabled.
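To check that the encoding layer really recodes the data, here is a small self-contained sketch. It uses a temporary file instead of the test files named above, and the sample text is an assumption chosen only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumption: sample text "süße Töchter", written with escapes so the
# example does not depend on the encoding of this source file.
my $text = "s\x{FC}\x{DF}e T\x{F6}chter";

# Write the text as Latin-1 bytes into a temporary file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(iso-8859-1)';
print {$out} $text;
close $out or die "close failed: $!";

# Read it back, letting the layer decode Latin-1 into characters.
open my $in, '<:encoding(iso-8859-1)', $path or die "Cannot open $path: $!";
my $read = <$in>;
close $in;

# The string read back has the same characters as the original.
print length($read), "\n";
print $read eq $text ? "round trip ok\n" : "mismatch\n";
```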
You might think you know the answer for MySQL: SET NAMES utf8. Yes and no. Sending this switches your MySQL connection to UTF-8, and when you pass Perl scalars with the UTF-8 flag enabled, their values will (usually) be inserted correctly. However, everything you read from the database will be UTF-8 encoded but missing the UTF-8 flag.
Luckily DBD::mysql has a cure for that – an option called mysql_enable_utf8. You need to pass it in the connect method.
    use DBI;

    my $DBH = DBI->connect(
        "DBI:mysql:database=foo;host=localhost",
        'user', 'password',
        { mysql_enable_utf8 => 1 },
    );

The flag will also take care of sending SET NAMES utf8 for you.
If you have some data from other sources (e.g. non-MySQL DBs), you can set the UTF-8 flag with Encode::decode_utf8. The name “decode” is a bit confusing, but it decodes the UTF-8 bytes into Perl's internal format.

    use Encode;

    # $line contains UTF-8 encoded text, but the flag isn't set yet
    $line = Encode::decode_utf8($line); # set the flag

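The effect of the decode step shows up in length. This is a minimal sketch in which the content of `$line` is an assumed sample value, not data from a real source.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Assumption: $line holds the raw UTF-8 bytes for "Töchter", as fetched
# from a source that does not decode for us ("ö" is 0xC3 0xB6).
my $line = "T\x{C3}\x{B6}chter";

my $before = length $line;           # counted as bytes
my $flag_before = utf8::is_utf8($line) ? 1 : 0;

$line = decode_utf8($line);

my $after = length $line;            # counted as characters
my $flag_after = utf8::is_utf8($line) ? 1 : 0;

print "before: $before units, flag $flag_before\n";
print "after:  $after units, flag $flag_after\n";
```

`utf8::is_utf8` is used here for inspection only; normal code should not need to look at the flag.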
That's it. Once you figure it all out, it is somewhat bearable. Personally I prefer PHP's UTF-8 support: just treat everything as single bytes and provide a library for multibyte operations.
Moritz' article (the perlgeek.de one) is also available in English: http://perlgeek.de/en/article/encodings-and-unicode
LeSpocky
2009/12/11 08:54
The treatment seems right, but it lacks an understanding of what Perl does here and what it is called. Perl distinguishes between text/character strings and byte strings and marks this with a flag. So far correct. But the internal representation of these strings is irrelevant. In fact it may be Unicode or even UTF-8, but this doesn't matter. At least this flag should not be called the UTF-8 flag, because that implies something which is not its actual meaning.
Second point: 'use utf8;' is only for telling Perl that your source is UTF-8, nothing more!
Third: you're somewhat right in what you're doing when you decode all data you read from somewhere else into Perl's internal text string format. The rule here is: decode when reading from outside, encode when writing to outside, regardless of which charset you're reading from and which you're writing to, and regardless of Perl's internal coding of its text strings. It's much more understandable if you don't call this the UTF-8 flag, even if this is the way Perl handles it internally. So the point you forgot is: it's not enough to decode byte strings when reading files or databases or user input or whatever. You must also encode them back to the target charset when you do output again. Of course you can set all these input and output encodings to UTF-8 if you know they actually are, but as you correctly said: this is no different from any other encoding.
Last thing: you can set this flag manually, but not the way you showed here, because this is just decoding a byte string into a text string. And it is usually not recommended to set this flag manually, because Perl's internal representation may be UTF-8 but you can't be 100% sure.
For further reading I recommend: perlunitut in perldoc [1] and if you speak German: Zeichenkodierungen oder „Warum funktionieren meine Umlaute nicht?” [2].
[1] http://perldoc.perl.org/perlunitut.html
[2] http://perlgeek.de/de/artikel/charsets-unicode
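The rule from the comment above (decode when reading, encode when writing) can be sketched roughly as follows. The input value and the choice of Latin-1 in and UTF-8 out are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: input arrives as Latin-1 bytes; "ö" is the single byte 0xF6.
my $input_bytes = "T\x{F6}chter";

# Decode at the boundary: bytes in the source charset -> character string.
my $text = decode('iso-8859-1', $input_bytes);

# Inside the program, work on characters only.
my $char_count = length $text;

# Encode at the boundary: character string -> bytes in the target charset.
my $output_bytes = encode('UTF-8', $text);

printf "%d characters, %d bytes after encoding\n",
    $char_count, length $output_bytes;
```

The same pattern applies to any pair of charsets; only the names passed to `decode` and `encode` change.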