Unicode text has grown exponentially over the last decade and now
comprises over 95% of the documents retrieved over the web, while
other collections are often 100% Unicode. This tutorial shows
Perl programmers how to manage Unicode data.
Simple patterns like [a-z] or \d no longer cut the mustard, partly because Unicode is such a large character set, and partly because there are multiple ways of writing characters with diacritics. Now that Unicode matters, regular expressions are full of land mines.
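
As a rough illustration of those two problems, here is a minimal sketch in Python using only the standard `re` and `unicodedata` modules. It is not taken from the talk itself; the sample strings and the helper name `same_text` are assumptions chosen for the example, and other languages' regex engines may behave differently.

```python
import re
import unicodedata

# Two ways of writing "é": one precomposed code point (U+00E9),
# or a plain "e" followed by COMBINING ACUTE ACCENT (U+0301).
composed = "\u00e9"
decomposed = "e\u0301"

def same_text(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two spellings look identical on screen but differ code point by code point.
raw_equal = composed == decomposed
normalized_equal = same_text(composed, decomposed)

# [a-z] covers only the 26 ASCII lowercase letters, so it rejects "café".
ascii_class_matches_cafe = re.fullmatch(r"[a-z]+", "caf\u00e9") is not None

# In Python str patterns, \d matches any Unicode decimal digit,
# including ARABIC-INDIC DIGIT THREE and FOUR here.
arabic_indic = "\u0663\u0664"
digit_class_matches = re.fullmatch(r"\d+", arabic_indic) is not None

# Restricting the class explicitly gives ASCII-only digits.
ascii_digits_match = re.fullmatch(r"[0-9]+", arabic_indic) is not None

print(raw_equal, normalized_equal)
print(ascii_class_matches_cafe, digit_class_matches, ascii_digits_match)
```

Normalizing both sides before comparing or matching is one common way around the multiple-spellings problem; whether NFC or NFD suits better depends on the data and on what the patterns need to see.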
How does Unicode support across major platforms, including Java, Perl, Python, Ruby, and more, stack up? Who's doing the best job, and who's failing miserably? Is anyone doing a good job? Does anyone actually implement the standard, and to what extent? I'll compare the major platforms to separate the losers from the not-so-losers.