How do I remove particular sets of characters and not others?
As a consequence of botched character decoding, I have a set of titles that look like this, with special characters followed by other characters like ñ that were not in the original:
Päivän Jälkeen
Tuuli kääntyä voi
Päivän Jälkeen
Tuuli kääntyä voi
「Eurotrash」
Le Désert N'Est Plus En Afrique
I know that I can use something along the lines of s:g / ... / ... /, but I am unsure of how to match just the special characters (├, ╜) as opposed to capturing apostrophes, spaces, and so on.
When using \W to try and capture "not a character used in a 'word'", I run into the issue of it matching apostrophes, spaces, and so on.
Thus, my question is, what am I missing that would be useful when trying to essentially delete these characters from the words?
Solution 1:[1]
TL;DR Two solutions plus some further discussion.
Simplest solution that works for your examples
<:So>
For example:
say S:g /<:So>// given 'P├ñiv├ñn J├ñlkeen'; # Pñivñn Jñlkeen
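Since you have a set of titles, the same regex can be applied to each of them. What follows is a minimal sketch, not a definitive implementation: the array names @titles and @cleaned are made up for illustration, and the sample strings are assumed mojibake forms of two of your titles.

```raku
# Assumed sample data: mojibake titles containing box-drawing characters.
my @titles = 'P├ñiv├ñn J├ñlkeen', 'Tuuli k├ñ├ñnty├ñ voi';

# .subst with :g returns a new string with every Other_Symbol character
# removed; the original strings are left untouched.
my @cleaned = @titles.map(*.subst(/<:So>/, '', :g));

.say for @cleaned;
# Pñivñn Jñlkeen
# Tuuli kññntyñ voi
```

Using .subst rather than s/// keeps the original titles around, which is handy while you are still checking that the regex removes only what you intend.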
An explanation of the above solution
First, an explanation that looks short but will take you deep fast:
Learn about Unicode character properties. Here's the official Unicode Character Database doc.
Study Raku materials (eg the latest Raku documentation on Unicode properties) and discussions (eg searches of SO: "unicode property", "unicode properties") about use of Unicode properties in Raku.
Learn how the reference Raku compiler (Rakudo) actually uses them. (It gets complicated; see SO What are all the Unicode properties a Raku character will match.)
Now a slightly longer explanation that may make understanding this stuff easier than the approach outlined above:
In Raku regexes, <...> is a general assertion that ... is true. There are many variations on what ... can be.¹
One variation is that the current character is matched against a character class, or some combination of classes, specified by the <...>. There are many variations on what this Raku character class expression ... can be.²
One variation of a character class expression is an explicit Unicode character property value; it will match characters with that Unicode character property value. There are many variations on what a Unicode character property and its values can be; <:So> is an example and works for your scenario.³
Here's the process I went through to select that character class for your scenario:
I used a utility hosted on util.unicode.org to see the properties of the ├ character.⁴
I focused on the General_Category property listed near the top of the left-hand column of the table. This is a common property to specify in a Raku regex. You generally won't specify it by writing <:General_Category...>, with the ... being the category, but instead by just writing <:category>, where category is the category.
I noted the General_Category value (the next column): Other_Symbol. If you read the Unicode or Raku doc related to the General_Category property, you'll see that So is a short alias of Other_Symbol.
To specify a Unicode property in Raku, write it using Raku's colon pair literal syntax. The key can be any of many Unicode properties.⁵ So to match a character that has the So property, write <:So> in a regex.
To remove a character that has that class from a string, one option is to use the s/// construct (shown here in its non-destructive S/// form, which returns the result instead of modifying the string):
say S:g /<:So>// given 'P├ñiv├ñn J├ñlkeen'; # Pñivñn Jñlkeen
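The same lookup can be done from within Raku instead of a web page. Here is a minimal sketch using the built-in uniname and uniprop methods; called with no argument, uniprop reports the General_Category. The choice of characters to inspect is mine, picked to cover the cases you mention.

```raku
# Print code point, Unicode name and General_Category for a few characters.
for '├', 'ñ', "'", ' ' -> $char {
    say $char.ord.fmt('U+%04X'), ' ', $char.uniname, ' => ', $char.uniprop;
}
# U+251C BOX DRAWINGS LIGHT VERTICAL AND RIGHT => So
# U+00F1 LATIN SMALL LETTER N WITH TILDE => Ll
# U+0027 APOSTROPHE => Po
# U+0020 SPACE => Zs
```

Only the first character has the So category, which is why <:So> leaves the ñ, the apostrophe and the space alone.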
Another way to get the same result
You might want to do "arithmetic" with character classes / characters, adding some, and subtracting others:
say S:g / <+ [\W] - [\s] - ['] > // given 'P├ñiv├ñn J├ñlkeen'; # Pñivñn Jñlkeen
Read <+ [\W] - [\s] - ['] > as "Not a word character, but also not a space character and not a ' either".
See Enumerated character classes and ranges in the Character Class section of Raku's doc for further details of using enumerated classes (using <[...]>) and adding/subtracting classes (using + and -), and/or an earlier SO answer I wrote in response to a question pretty similar to yours.
For some use cases this approach may simplify getting precisely the result you want and/or expressing it so it'll be easier for others, or a future you, to maintain. But it all depends on things I'll discuss in the next section.
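To see how the two approaches can differ, here is a small sketch run against a made-up title that also contains ordinary punctuation. The string and the variable names are hypothetical, chosen only to expose the difference.

```raku
# Hypothetical title mixing mojibake with ordinary punctuation.
my $title = "J├ñlkeen - N'Est!";

# Property-based: removes only characters whose General_Category is Other_Symbol.
my $by-property = $title.subst(/<:So>/, '', :g);

# Class arithmetic: removes every non-word character except whitespace and '.
my $by-arithmetic = $title.subst(/ <+ [\W] - [\s] - ['] > /, '', :g);

say $by-property;    # Jñlkeen - N'Est!
say $by-arithmetic;  # Jñlkeen  N'Est
```

The arithmetic version also strips the hyphen and the exclamation mark, so you would need to subtract each character you want to keep, whereas <:So> only ever touches symbols in that one category.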
Further discussion of your question: "and so on"
I am unsure of how to capture just the special characters as opposed to capturing apostrophes, spaces, and so on.
Let's say you use something like one or other of the two solutions I've shown. How do you know they really are what you want? The answer is you need to explore Unicode. Character classes like \w/\W, \s/\S etc. are just convenient shortcuts for Unicode properties, so you still end up needing to explore Unicode if you want to be sure what's going on. Given that it all ultimately boils down to Unicode properties, let's discuss the <:So> solution.
As we saw above, the General_Category property of ├ (as shown via the utility linked above) is a link/value Other_Symbol.
If you click that latter link, you'll see a page corresponding to the Unicode Other_Symbol character class. It's a big jumbled mass of black-and-white symbols and colorful emojis and then an orderly list of other symbols. There are over 6,000 characters!
Does this character class contain characters that could be categorized as "spaces"? I'm near certain it doesn't, but it'll be up to you to figure that out, not me. What about "apostrophes"? I'm slightly less sure about that than "spaces", though I again think it won't include characters that could be called or categorized as "apostrophes". Does the Other_Symbol character class contain "and so on"?!? Maybe toss a coin? Maybe do an in-page browser search for particular characters on the Other_Symbol page?
When I'm not using the tools hosted on util.unicode.org or similar, one approach I've occasionally used to explore Unicode character classes is variations on this Raku code:
say (^0x110000)».chr.grep(/ <:So> /)
The (^0x110000) is a way to specify the integer Range that corresponds to all of the 1,114,112 Unicode code points (0 through 0x10FFFF). The ».chr applies .chr to each integer in the range, producing the Unicode character corresponding to that integer. The .grep(/ <:So> /) then keeps only the characters whose General_Category is Other_Symbol.
That said, that'll be really slow. You'll want to find other ways to explore Unicode.
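One way to keep such a scan quick is to restrict it to a block you care about. The following is a sketch under the assumption that the Box Drawing block (U+2500 to U+257F), where ├ and ╜ live, is the range of interest; the variable names are made up.

```raku
# Scan only the Box Drawing block instead of every code point.
my @box = (0x2500 .. 0x257F).map(*.chr);

# Keep the characters whose General_Category is Other_Symbol.
my @symbols = @box.grep(*.uniprop eq 'So');

say @symbols.elems;         # how many characters in the block are Other_Symbol
say @symbols.head(8).join;  # the first few, for a quick visual check
```

Swapping in a different range lets you spot-check any other block the same way.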
Other options include the util.unicode.org tools and the Raku community's Unicodable, which you can run by visiting the #raku IRC channel and entering u: ... there.
Discussion of "$_ ~~ s:g / \├ / /;"
when I attempt to directly escape the character in my regex via $_ ~~ s:g/ \├ / /;, the result stays the same
I've not been able to reproduce that.
I think you've just gotten confused.
If you are still convinced you are right, please produce an MRE.
Footnotes
¹ A big chunk of Raku's power is in its powered up regexes. And a big chunk of that is expressed via the general form <...>.
² Character class syntaxes include older style ones carried forward from older regex formats, eg \s to match whitespace. But they're all just shortcuts for Unicode properties or characters rather than ASCII ones, and there are now a LOT more variations.
³ If you only skim the doc you might think that Raku regexes can only match against Unicode's General_Category property. But if you look at the code examples you'll see there's <:Script<Latin>> and <:Block('Basic Latin')> too. (But what are they?) And then when you see the vast array of properties displayed by the util.unicode.org property browser you realize there's vastly more that could be matched. Rakudo matches many of these but not all. For gory details, see What are all the Unicode properties a character will match?.
⁴ Perhaps adding links to some of these utilities from the Raku doc would be a good thing. And/or creating/hosting variants of them using Raku.
⁵ I suspect Raku defines some additional things that are of the form <:foo> beyond Unicode properties. For example, I know :space works (it matches an ASCII space) but suspect it's not a Unicode property. Otoh that sounds downright wrong to me and against what I would expect of Raku design. If I find out for sure one way or the other I'll update this footnote.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |