Flat Earth Catalogue

2023-01-13

Inspecting Unicode strings in Perl

ord $str returns the unsigned integer value of the first character of $str (its code point, provided $str really is a Unicode string; see the caveat below).

The special-case split you're trying to remember is split //, $str (for disintegrating a string into a list of characters, right? Yep, called it).
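Here is a minimal sketch of ord and split // working together. The sample string and the variable names are made up for illustration, and the string is written with an \x{...} escape so the source stays plain ASCII.

```perl
use strict;
use warnings;

# Sample string: U+263A WHITE SMILING FACE, then "a", then "b".
my $str = "\x{263A}ab";

# ord only looks at the first character.
my $first = ord $str;

# split with an empty pattern yields one list element per character.
my @chars = split //, $str;
my @ords  = map { ord $_ } @chars;

# Only ASCII is printed here, so no output layer is needed.
printf "first: U+%04X\n", $first;
printf "all:   %s\n", join ' ', map { sprintf 'U+%04X', $_ } @ords;
```

The first line of output shows only U+263A; the second lists all three code points.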

use charnames (); charnames::viacode(ord $str) returns the "best" name (most recent Name_Alias if any, otherwise Name if any, otherwise any alias you defined, otherwise undef). charnames is ... less un-ergonomic for going the other direction, from names to ordinals or strings.
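For reference, a small sketch of both directions. charnames::vianame and charnames::string_vianame are the documented name-to-ordinal and name-to-string functions; the character used here is just an example.

```perl
use strict;
use warnings;
use charnames ();    # empty import list: only the functions are needed

# Code point to name.
my $name = charnames::viacode(0x263A);

# Name to code point, and name to a one-character string.
my $code = charnames::vianame('WHITE SMILING FACE');
my $char = charnames::string_vianame('WHITE SMILING FACE');

printf "%s is U+%04X\n", $name, $code;
printf "string_vianame gave a string of length %d\n", length($char);
```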

Caveat: Be sure you are handling Unicode strings, not byte strings.

  • -C7 when you write a one-liner
  • use open qw(:std :encoding(UTF-8)); when you write a script, assuming you have been allowed to encode your Unicode string data the sane way
  • use feature qw(unicode_strings); when you are confident you don't need the hinky "guess if it's 8-bit" (ASCII or EBCDIC) backwards compatibility behavior
  • use utf8; if that script also has embedded non-ASCII (think about your __DATA__ section)
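To see why the caveat matters, here is a rough sketch of the same character as a byte string and as a decoded Unicode string. It decodes by hand with the core Encode module, which is an assumption made for the demo; the I/O layers set up by -C7 or use open do the equivalent for you on input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+263A WHITE SMILING FACE, as it would arrive from an
# undecoded filehandle: three separate bytes.
my $bytes = "\xE2\x98\xBA";

# Decoding turns those bytes into a one-character Unicode string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, ord 0x%X\n", length($bytes), ord($bytes);
printf "chars: length %d, ord 0x%X\n", length($chars), ord($chars);
```

On the byte string, ord reports 0xE2 (just the first byte) and split // would hand back three fragments, none of which is the character you meant.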

Bonus code: having a confusing time with trailing whitespace? Make it explain itself.

use strict;
use warnings;
use open qw':std :encoding(UTF-8)';
use feature qw(unicode_strings say);
use utf8;

use charnames ();

while (<>) {
    chomp;
    # Always matches; $1 holds the trailing whitespace (possibly empty).
    /(\s*)$/;
    # Print the line, then one tab-indented name per trailing whitespace character.
    say join "\n\t", $_, map { charnames::viacode(ord $_) } split //, $1;
}


02:04

2023-01-10

Google Drive path in crouton chroots

 /var/host/media/fuse/drivefs-[unique_number_tag]/root
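
Since the tag after drivefs- differs per machine, a wildcard saves looking it up. A minimal sketch, assuming the layout above; drivefs_roots is a hypothetical helper name, not part of crouton or ChromeOS.

```perl
use strict;
use warnings;

# Hypothetical helper: list candidate Drive roots under a FUSE mount directory.
# The default base is the path quoted above; the part after "drivefs-" is
# matched with a wildcard because it is unique to each machine.
sub drivefs_roots {
    my ($base) = @_;
    $base = '/var/host/media/fuse' unless defined $base;
    # Note: glob splits its pattern on whitespace, so $base must not contain spaces.
    return grep { -d $_ } glob("$base/drivefs-*/root");
}

print "$_\n" for drivefs_roots();
```

Run inside the chroot, this prints nothing if no Drive mount is present.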


21:52


(K) 2002-present. All rights reversed, except as noted.

Hard-won technical knowledge, old rants, and broken links from 10 years ago. I should not have to explain this in the 21st century, but no, I do not actually believe the world is flat.
