Andreas Gohr
2009/12/10 14:01

Surviving the Perl/UTF-8 Madness

This post is basically a survival guide for myself, should I ever again be up against the evil called “Unicode Support” in Perl 5.8+.

Perl's idea of UTF-8 support is that every scalar now carries an internal flag that says whether its content is stored as UTF-8 or not. When the flag is off, the content is assumed to be 7-bit ASCII, but basically it is just treated as bytes.

Now if you, for example, concatenate two differently flagged scalars, Perl will first convert the “off-flagged” scalar to UTF-8, reading each of its bytes as one ISO-8859-1 character. And this is when it gets ugly: the off-flagged string may very well contain UTF-8 encoded text already, just without the flag set, and in that case it gets encoded a second time.
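To make the problem concrete, here is a minimal sketch of that double encoding. The variable names are only for illustration; the raw bytes C3 A4 are the UTF-8 encoding of “ä”.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "ä" as two raw UTF-8 bytes; the flag is off for this scalar.
my $bytes = "\xC3\xA4";

# The same character properly decoded: one character, flag on.
my $chars = decode('UTF-8', $bytes);

# Concatenating upgrades $bytes byte by byte, as if it were ISO-8859-1,
# so the result has three characters: "ä", "Ã" and "¤".
my $joined = $chars . $bytes;

# Encoding for output reveals the damage: the second "ä" is double-encoded.
my $out = encode('UTF-8', $joined);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $out;

print length($joined), "\n";   # prints 3
print "$hex\n";                # prints C3 A4 C3 83 C2 A4
```

Decoding `$bytes` before the concatenation would have produced the expected two characters.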

But it doesn't stop there. No Perl script is an island - there's always input and output. How does Perl know if it is UTF-8 or not? The sad thing is that it tries to guess. And as we all know, things get really ugly when software tries to guess.

So here is the main trick to stay sane with Perl's UTF-8 support: never let it guess, always make sure everything is encoded and flagged as UTF-8 internally!

Internal Scalars

You sometimes need a simple string containing non-ASCII chars. The simplest way to achieve that is to write your code in UTF-8. Use a UTF-8 capable editor and terminal and just write your string in your natural language. But tell Perl about it!

To do so, just use the “use utf8” pragma:

use utf8; #this script is written in UTF-8
 
my $string = "Äußerst süße Töchter aus Ölüberschußländern in Übergröße";
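As a small sketch of what the pragma changes (assuming the script file really is saved as UTF-8), the literal below is counted in characters rather than bytes. The sample word is just an illustration.

```perl
use strict;
use warnings;
use utf8;                      # the source file itself is UTF-8
use Encode qw(encode);

my $string = "süß";

# With "use utf8" the literal holds three characters ...
my $char_count = length $string;

# ... which take five bytes once encoded as UTF-8 (ü and ß need two each).
my $byte_count = length encode('UTF-8', $string);

print "$char_count characters, $byte_count bytes\n";   # prints "3 characters, 5 bytes"
```

Without the pragma, Perl would see the same literal as five separate bytes.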

Standard File Handles

Again, in theory Perl should guess whether your terminal provides UTF-8 or not and recode input and output accordingly. For me that never works reliably. So just tell Perl what encoding your streams use with binmode:

# treat all input and output as UTF-8 and set the flags correctly
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
binmode STDIN,  ":utf8";

The above should of course also work with other encodings. Perl will then recode them to UTF-8 internally.
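A minimal sketch of binmode with a different encoding follows. It writes to an in-memory buffer instead of a real terminal so the resulting bytes can be inspected; `$buffer` is a name made up for this example. Note that `:encoding(UTF-8)` is the stricter, validating variant of the `:utf8` layer.

```perl
use strict;
use warnings;
use utf8;                      # the literal below is written in UTF-8

# An in-memory file stands in for STDOUT here.
my $buffer = '';
open my $fh, '>', \$buffer or die "Cannot open in-memory file: $!";

# Declare the encoding of the stream; Perl recodes on the way out.
binmode $fh, ':encoding(iso-8859-1)';
print {$fh} "Töchter";
close $fh or die "Cannot close in-memory file: $!";

# The buffer now holds seven Latin-1 bytes; "ö" became the single byte F6.
printf "%d bytes, second byte is %02X\n", length $buffer, ord substr($buffer, 1, 1);
```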

Reading from Files

Even if your terminal, editor and script all use UTF-8, the files you read might not. Telling Perl the correct encoding will again automatically recode them and set the UTF-8 flag:

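# read the first line of a Latin-1 encoded file
open my $fh, '<:encoding(iso-8859-1)', 'test.latin1.txt'
    or die "Cannot open test.latin1.txt: $!";
my $latin1 = <$fh>;
close $fh;
 
# read the first line of a UTF-8 encoded file
open $fh, '<:encoding(UTF-8)', 'test.utf8.txt'
    or die "Cannot open test.utf8.txt: $!";
my $utf8 = <$fh>;
close $fh;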

Both scalars, $utf8 and $latin1, now contain valid UTF-8 encoded text with Perl's internal flag enabled.
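To check that claim without hand-made test files, here is a self-contained sketch that writes the same word in both encodings to temporary files and reads them back. The helpers `write_bytes` and `read_line_as` are hypothetical names for this example, not part of any library.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write raw bytes to a temporary file, return its name.
sub write_bytes {
    my ($bytes) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;               # no recoding, write the bytes as given
    print {$fh} $bytes;
    close $fh or die "Cannot close $name: $!";
    return $name;
}

# Hypothetical helper: read the first line of a file in a given encoding.
sub read_line_as {
    my ($file, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $file
        or die "Cannot open $file: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# "süße" followed by a newline, once as ISO-8859-1 and once as UTF-8.
my $latin1_file = write_bytes("s\xFC\xDFe\n");
my $utf8_file   = write_bytes("s\xC3\xBC\xC3\x9Fe\n");

my $latin1 = read_line_as($latin1_file, 'iso-8859-1');
my $utf8   = read_line_as($utf8_file,   'UTF-8');

# Both lines decode to the same five characters (four letters plus newline).
print $latin1 eq $utf8 ? "same text\n" : "different text\n";   # prints "same text"
```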

MySQL Databases

You might think you know the answer here: SET NAMES utf8. Yes and no. Sending this will switch your MySQL connection to UTF-8, and when you pass Perl scalars with the UTF-8 flag enabled, their values will be inserted correctly (usually). However, everything you read from the database will be UTF-8 encoded but missing the UTF-8 flag.

Luckily DBD::mysql has a cure for that – an option called mysql_enable_utf8. You need to pass it in the connect method.

$DBH = DBI->connect("DBI:mysql:database=foo;host=localhost",
                    'user',
                    'password',
                    {
                         mysql_enable_utf8 => 1
                    });

The flag will also take care of sending SET NAMES utf8 for you.

Manually setting the UTF-8 flag

If you have some data from other sources (e.g. non-MySQL databases), you can switch on the UTF-8 flag with Encode::decode_utf8. The name “decode” is a bit confusing, but it decodes the UTF-8 bytes into Perl's internal UTF-8 format.

use Encode;
 
# $line contains UTF-8 encoded text but the flag isn't set yet
$line = Encode::decode_utf8($line); # decode into characters, which sets the flag
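The effect of the decode step can be sketched as follows, with a hard-coded byte string standing in for data from an external source. `encode_utf8` is the inverse and turns the characters back into bytes for output.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# "süß" as it might arrive from outside: five raw UTF-8 bytes, flag off.
my $line = "s\xC3\xBC\xC3\x9F";
my $bytes_before = length $line;

# Decode into characters; the result has the flag set.
my $text = decode_utf8($line);
my $chars_after = length $text;
my $flagged = utf8::is_utf8($text) ? 'on' : 'off';

# Encode again before handing the data to something that expects bytes.
my $out = encode_utf8($text);

print "$bytes_before bytes in, $chars_after characters, flag $flagged\n";
# prints "5 bytes in, 3 characters, flag on"
```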


That's it. Once you figure it all out, it is somewhat bearable. Personally I prefer PHP's UTF-8 support: just treat everything as single bytes and provide a library for multibyte operations.


Comments

LeSpocky
2009/12/11 08:54

The treatment seems right, but it lacks an understanding of what Perl does here and what it is called. Perl distinguishes between text/character strings and byte strings and marks this with a flag. So far correct. But the internal representation of these strings is irrelevant. In fact it may be Unicode or even UTF-8, but this doesn't matter. At the very least this flag should not be called the UTF-8 flag, because that implies something which is not its actual meaning.

Second point: 'use utf8;' is only for telling Perl that your source is UTF-8, nothing more!

Third: you're somewhat right in what you're doing when you decode all data you read from somewhere else into Perl's internal text string format. The rule here is: decode when reading from outside, encode when writing to outside, regardless of which charset you're reading from and which you're writing to, and regardless of Perl's internal coding of its text strings. It's much more understandable if you don't call this the UTF-8 flag, even if this is the way Perl handles it internally. So the point you forgot is: it's not enough to decode byte strings when reading files or databases or user input or whatever. You must also encode them back to the target charset when you do output again. Of course you can set all these input and output encodings to UTF-8 if you know they actually are, but as you correctly said: this is no different from any other encoding.

Last thing: you can set this flag manually, but not the way you showed here, because this is just decoding a byte string into a text string. And it is usually not recommended to set this flag manually, because Perl's internal representation may be UTF-8 but you can't be 100% sure.

For further reading I recommend: perlunitut in perldoc [1] and if you speak German: Zeichenkodierungen oder „Warum funktionieren meine Umlaute nicht?” [2].

[1] http://perldoc.perl.org/perlunitut.html
[2] http://perlgeek.de/de/artikel/charsets-unicode

Renée Bäcker
2009/12/13 15:11

Moritz' article (the perlgeek.de one) is also available in English: http://perlgeek.de/en/article/encodings-and-unicode
