Strings are complicated
This week, exercism is talking about 14 Ways to Reverse a String! It's surprisingly complicated. I think it's not because reversing is complicated, but because strings themselves are complicated ("Anyone who says differently is selling something").
I couldn't find the reverse-string problem in the Perl track on exercism, but I think Perl can illustrate this beautifully.
Perl's built-in reverse
In scalar context, Perl's built-in reverse will reverse the bytes of an undecoded string.
❯ perl -E 'say scalar reverse shift' 'Larry Wall'
llaW yrraL
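The scalar matters here: in list context, reverse reverses a list of values rather than the characters of a string, so a single argument comes back untouched.
❯ perl -E 'say reverse shift' 'Larry Wall'
Larry Wall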
But reversing the bytes doesn't work for Unicode
❯ perl -E 'say scalar reverse shift' 'Matz (まつもとゆきひろ)'
)��㲁㍁ㆂ㨁も㤁㾁�( ztaM
unless we decode the string first.
❯ perl -CAO -E 'say scalar reverse shift' 'Matz (まつもとゆきひろ)'
)ろひきゆともつま( ztaM
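The A in -CAO decodes @ARGV from UTF-8, and the O encodes STDOUT back to UTF-8 on the way out. Spelled out with Encode, that one-liner is roughly this.
❯ perl -MEncode -E 'say encode("UTF-8", scalar reverse decode("UTF-8", shift))' 'Matz (まつもとゆきひろ)'
)ろひきゆともつま( ztaM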
But reversing Unicode codepoints isn't always what we want either. This works when the é is a LATIN SMALL LETTER E WITH ACUTE
❯ perl -CAO -E 'say scalar reverse shift' 'José Valim'
milaV ésoJ
but not when the é is a LATIN SMALL LETTER E plus a COMBINING ACUTE ACCENT. We could fix that by normalizing first.
❯ perl -MUnicode::Normalize -CAO -E 'say scalar reverse NFC shift' 'Josés'
sésoJ
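Without the NFC, the decomposed form reverses with the combining accent landing on the wrong letter. Writing the combining character as an escape makes that easier to see (exactly how this renders depends on your terminal and fonts, but the acute should end up over the s).
❯ perl -CO -E 'say scalar reverse "Jose\x{301}s"'
śesoJ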
But what if we had, say, Unicode flags?
❯ perl -MUnicode::Normalize -CAO -E 'say scalar reverse NFC shift' '🇧🇷 🇺🇸'
🇸🇺 🇷🇧
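That's because each flag is a pair of REGIONAL INDICATOR SYMBOL codepoints that render as a single glyph, and reversing codepoints swaps the two halves of each pair. If I'm reading the Unicode::GCString docs right, its chars and length methods count codepoints and grapheme clusters respectively, which shows the mismatch.
❯ perl -MUnicode::GCString -CAO -E 'my $g = Unicode::GCString->new(shift); say $g->chars, " codepoints, ", $g->length, " grapheme clusters"' '🇧🇷 🇺🇸'
5 codepoints, 3 grapheme clusters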
Reversing codepoints is not what we want, even if they're normalized. What we really want is to reverse grapheme clusters.
❯ perl -MUnicode::GCString -CAO -E 'say join "", reverse Unicode::GCString->new(shift)->as_array' '🇧🇷 🇺🇸'
🇺🇸 🇧🇷
And this, of course, works for all of the cases.
❯ perl -MUnicode::GCString -CAO -E 'say join "", reverse Unicode::GCString->new($_)->as_array for @ARGV' 'Larry Wall' 'Matz (まつもとゆきひろ)' 'José Valim' '🇧🇷 🇺🇸'
llaW yrraL
)ろひきゆともつま( ztaM
milaV ésoJ
🇺🇸 🇧🇷
Putting all that in a script might look like this.
#!/usr/bin/env perl
use v5.38;
use Encode;             # decode, encode
use Unicode::GCString;
use Unicode::Normalize; # NFC

for my $word ("Larry Wall", "Matz (まつもとゆきひろ)", "José Valim", "Josés", "🇺🇸 🇧🇷") {
    say '';

    # Reverse the bytes.
    my $reversed = reverse $word;
    say "$word => $reversed";

    # Reverse the Unicode codepoints.
    my $decoded = decode 'UTF-8', $word;
    my $decoded_reversed = reverse $decoded;
    say "$word => ", encode 'UTF-8', $decoded_reversed;

    # Reverse the normalized Unicode codepoints.
    my $normalized = NFC $decoded;
    my $normalized_reversed = reverse $normalized;
    say "$word => ", encode 'UTF-8', $normalized_reversed;

    # Reverse the grapheme clusters.
    my $gcstring = Unicode::GCString->new($decoded);
    my $reversed_gcstring = join '', reverse $gcstring->as_array;
    say "$word => ", encode 'UTF-8', $reversed_gcstring;
}
Interestingly, in Perl, strings start out as sequences of bytes; we have to do a little work to decode them. In Rust, strings start out as UTF-8; we have to do a little work to see the bytes.
use unicode_normalization::UnicodeNormalization; // nfc
use unicode_segmentation::UnicodeSegmentation; // graphemes

fn main() {
    for word in ["Larry Wall", "Matz (まつもとゆきひろ)", "José Valim", "Josés", "🇺🇸 🇧🇷"] {
        println!();

        // Reverse the bytes.
        let reversed = word.as_bytes().iter().rev().copied().collect::<Vec<_>>();
        let reversed = String::from_utf8_lossy(&reversed);
        println!("{word:?} => {reversed:?}");

        // Reverse the Unicode codepoints.
        let reversed = word.chars().rev().collect::<String>();
        println!("{word:?} => {reversed:?}");

        // Reverse the normalized Unicode codepoints.
        let normalized = word.nfc().collect::<String>();
        let reversed = normalized.chars().rev().collect::<String>();
        println!("{word:?} => {reversed:?}");

        // Reverse the grapheme clusters.
        let reversed = word.graphemes(true).rev().collect::<String>();
        println!("{word:?} => {reversed:?}");
    }
}
Both the Perl and the Rust versions give something like this (the quoting and \u escapes below are Rust's debug formatting; the Perl script prints the same reversals without them).
"Larry Wall" => "llaW yrraL"
"Larry Wall" => "llaW yrraL"
"Larry Wall" => "llaW yrraL"
"Larry Wall" => "llaW yrraL"
"Matz (まつもとゆきひろ)" => ")��㲁㍁ㆂ㨁も㤁㾁�( ztaM"
"Matz (まつもとゆきひろ)" => ")ろひきゆともつま( ztaM"
"Matz (まつもとゆきひろ)" => ")ろひきゆともつま( ztaM"
"Matz (まつもとゆきひろ)" => ")ろひきゆともつま( ztaM"
"José Valim" => "milaV ��soJ"
"José Valim" => "milaV ésoJ"
"José Valim" => "milaV ésoJ"
"José Valim" => "milaV ésoJ"
"Jose\u{301}s" => "s��esoJ"
"Jose\u{301}s" => "s\u{301}esoJ"
"Jose\u{301}s" => "sésoJ"
"Jose\u{301}s" => "se\u{301}soJ"
"🇺🇸 🇧🇷" => "���𧇟� ���\u{3a1df}�"
"🇺🇸 🇧🇷" => "🇷🇧 🇸🇺"
"🇺🇸 🇧🇷" => "🇷🇧 🇸🇺"
"🇺🇸 🇧🇷" => "🇧🇷 🇺🇸"
Only the last line in each group is correct in every case.
Perhaps it's interesting to note that what we actually see also depends on the terminal or browser we are using, and on which fonts it has available. It's not enough for our programs to produce the bytes we want; the output device also has to show the glyphs we want. Strings are complicated!