This post is basically a survival guide for myself, should I ever again be up against the evil called “Unicode Support” in Perl 5.8+.
Perl's idea of UTF-8 support is that every scalar carries an internal flag that says whether its contents are UTF-8 or not. When the flag is off, the content is often assumed to be 7-bit ASCII, but basically it is just treated as bytes (and read as Latin-1 whenever Perl has to convert it).
What happens when you, for example, concatenate two differently flagged scalars is that Perl first converts the “off-flagged” scalar to UTF-8. And this is when it gets ugly, because the off-flagged string may very well contain UTF-8 encoded text already, just without the flag set correctly. The bytes then get encoded a second time and you end up with garbage.
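A small sketch of that double-encoding trap, as far as I understand it. The variable names are made up for the illustration, and U+263A is only there to force Perl into upgrading the byte string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "ä" as two raw UTF-8 bytes; the UTF-8 flag is off
my $bytes = "\xC3\xA4";
# a character string with the flag on
my $chars = "\x{263A}";

# Perl upgrades $bytes as if it were Latin-1: two characters, not one
my $mixed = $bytes . $chars;
printf "mixed: %d characters\n", length $mixed;    # 3 characters

# Decoding first yields the intended single "ä"
my $clean = decode('UTF-8', $bytes) . $chars;
printf "clean: %d characters\n", length $clean;    # 2 characters
```

The first result holds U+00C3 and U+00A4 where a single U+00E4 was meant, which is exactly the "Ã¤" mess you see on broken web pages.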
But it doesn't stop there. No Perl script is an island - there's always input and output. How does Perl know if it is UTF-8 or not? The sad thing is that it tries to guess. And as we all know, things get really ugly when software tries to guess.
So here is the main trick to stay sane with Perl's UTF-8 support: never let it guess, always make sure everything is encoded and flagged as UTF-8 internally!
You sometimes need a simple string containing non-ASCII chars. The simplest way to achieve that is to write your code in UTF-8. Use a UTF-8-capable editor and terminal and just write your string in your natural language. But tell Perl about it!
To do so, just use the “use utf8” pragma:
use utf8; #this script is written in UTF-8
my $string = "Äußerst süße Töchter aus Ölüberschußländern in Übergröße";
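To see what the pragma actually changes, you can compare the length of the same literal with and without it. This is only a sketch and assumes the script file itself is saved as UTF-8; the variable names are mine:

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;                 # literals in this block are read as UTF-8 characters
    $with = length "süß";
}
{
    no utf8;                  # literals in this block stay raw bytes
    $without = length "süß";
}
print "with utf8: $with, without: $without\n";
```

With the pragma Perl counts characters, without it the two umlaut-ish letters count as two bytes each.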
Again, in theory Perl should guess whether your terminal provides UTF-8 or not and recode input and output accordingly. For me that never works reliably. So just tell Perl what encoding your streams use with
# treat all input and output as UTF-8 and set the flags correctly
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
binmode STDIN, ":utf8";
The above should of course also work with other encodings if you use a layer like ":encoding(iso-8859-1)" instead. Perl will then recode them to UTF-8 internally.
Even if your terminal, editor and script are all in UTF-8, the files you read might not be. Telling Perl the correct encoding will again automatically recode them and set the UTF-8 flag:
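Here is a rough sketch of how the layers recode on the way out. In-memory file handles stand in for STDOUT so the resulting bytes can be inspected; that substitution and all names are my own assumptions for the demo:

```perl
use strict;
use warnings;

my $text = "\x{E4}\x{F6}\x{FC}";   # "äöü" as three characters

# Write the same characters through two different encoding layers
open my $as_latin1, '>:encoding(iso-8859-1)', \my $latin1_bytes or die $!;
print {$as_latin1} $text;
close $as_latin1;

open my $as_utf8, '>:encoding(UTF-8)', \my $utf8_bytes or die $!;
print {$as_utf8} $text;
close $as_utf8;

printf "latin1: %d bytes, utf8: %d bytes\n",
    length $latin1_bytes, length $utf8_bytes;
```

Same string inside Perl, different bytes outside, depending only on the layer attached to the handle.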
# read from a latin1 encoded file
open my $fh, "<:encoding(iso-8859-1)", 'test.latin1.txt' or die "test.latin1.txt: $!";
my $latin1 = <$fh>;
close $fh;
# read from a UTF-8 encoded file
open $fh, "<:encoding(utf8)", 'test.utf8.txt' or die "test.utf8.txt: $!";
my $utf8 = <$fh>;
close $fh;
$latin1 and $utf8 now both contain valid UTF-8 encoded text with Perl's internal flag enabled.
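If you want to convince yourself, something like the following should do. It is a self-contained sketch that writes its own throwaway Latin-1 file via File::Temp instead of relying on the test files above:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway Latin-1 file containing the single byte 0xE4 ("ä")
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xE4\n";
close $tmp;

# Read it back through the encoding layer
open my $in, '<:encoding(iso-8859-1)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "chars: %d, flagged: %s, codepoint: U+%04X\n",
    length $line, (utf8::is_utf8($line) ? 'yes' : 'no'), ord $line;
```

utf8::is_utf8 is only used here to peek at the internal flag; normal code should not need to care about it once everything is decoded at the borders.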
You might think you know the answer here: SET NAMES utf8. Yes and no. Sending this will switch your MySQL connection to UTF-8, and when you pass Perl scalars with the UTF-8 flag enabled, their values will be inserted correctly (usually). However, everything you read from the database will be UTF-8 encoded but missing the UTF-8 flag.
DBD::mysql has a cure for that: an option called mysql_enable_utf8. You need to pass it in the connect call:
# $user and $password are placeholders for your own credentials
$DBH = DBI->connect("DBI:mysql:database=foo;host=localhost",
                    $user, $password,
                    { mysql_enable_utf8 => 1 });
The flag will also take care of sending SET NAMES utf8 for you.
If you have some data from other sources (e.g. non-MySQL DBs), you can switch on the UTF-8 flag with Encode::decode_utf8. The name “decode” is a bit confusing, but it means the UTF-8 bytes are decoded into Perl's internal character format (which happens to be UTF-8 with the flag set).
use Encode;
# $line contains UTF-8 encoded text but the flag isn't set yet
$line = Encode::decode_utf8($line); # set the flag
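A slightly longer sketch of the same idea, with made-up sample bytes standing in for whatever your external source hands you:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Raw bytes as they might arrive from a socket or a non-MySQL database:
# "süß" in UTF-8, flag off
my $raw = "s\xC3\xBC\xC3\x9F";
printf "before: %d units, flag %s\n",
    length $raw, (utf8::is_utf8($raw) ? 'on' : 'off');

my $text = decode_utf8($raw);
printf "after:  %d units, flag %s\n",
    length $text, (utf8::is_utf8($text) ? 'on' : 'off');

# encode_utf8 is the inverse, for handing bytes back to the outside world
my $back = encode_utf8($text);
print "round trip ok\n" if $back eq $raw;
```

Before decoding Perl sees five bytes, afterwards three characters. Decode on the way in, encode on the way out, and never decode the same data twice.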
That's it. Once you figure it all out, it is somewhat bearable. Personally I prefer PHP's UTF-8 support: just treat everything as single bytes and provide a library for multibyte operations.