Dart support for using Script Property Values in Regular Expressions
The Unicode regular expression documentation describes how to perform complex text matches. Specifically, I am wondering about matching the different scripts within a string of text, based on the Script property values of its code points.
The Unicode documentation on Using Script Property Values in Regular Expressions describes this possibility:
The script property is useful in regular expression syntax for easy specification of spans of text that consist of a single script or mixture of scripts. In general, regular expressions should use specific Script property values only in conjunction with both Common and Inherited. For example, to distinguish a sequence of characters appropriate for Greek text, one might use
((Greek | Common) (Inherited | Me | Mn)*)*
The preceding expression matches all characters that have a Script property value of Greek or Common and which are optionally followed by characters with a Script property value of Inherited. For completeness, the regular expression also allows any nonspacing or enclosing mark.
Some languages commonly use multiple scripts, so, for example, to distinguish a sequence of characters appropriate for Japanese text one might use:
((Hiragana | Katakana | Han | Latin | Common) (Inherited | Me | Mn)*)*
Is this implemented in Dart? I don't see it described in the Dart RegExp documentation or in the ECMAScript regex specification that Dart regexes are based on.
Solution 1:[1]
Dart added support for Unicode property escapes in version 2.4, back in mid-2019 (see https://github.com/dart-lang/sdk/issues/34935). However, there is a gotcha: for this to work you need to pass the optional named argument unicode: true to the RegExp() constructor, which marks your pattern as a "unicode pattern". I've tested the following (which matches \p{L} letters, \p{N} numbers and \p{M} marks) and it works fine with the latest Dart SDK:
RegExp(r'[\p{L}\p{N}\p{M}]', unicode: true)
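As a rough illustration of the gotcha, the sketch below runs the same character class with and without the flag. The helper name wordRuns and the sample strings are made up for this example, and a + quantifier is added so that whole runs are returned instead of single characters.

```dart
// Same character class as above, with a + so whole runs are returned.
final wordChars = RegExp(r'[\p{L}\p{N}\p{M}]+', unicode: true);

// Hypothetical helper: every run of letters, numbers, and marks in [text].
List<String> wordRuns(String text) =>
    wordChars.allMatches(text).map((m) => m.group(0)!).toList();

void main() {
  print(wordRuns('\u00E9t\u00E9 2019')); // [été, 2019]

  // Without `unicode: true`, \p is not read as a property escape,
  // so the accented letter is not matched.
  final legacy = RegExp(r'[\p{L}\p{N}\p{M}]+');
  print(legacy.hasMatch('\u00E9')); // false
}
```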
To match Greek characters, as per @daxim's example:
RegExp exp = RegExp(r'(\p{Script=Greek})', unicode: true);
Iterable<RegExpMatch> matches;
matches = exp.allMatches('Ελληνικά'); // sample Greek input
for (Match m in matches) {
  print(m.group(1));
}
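The fuller expression quoted in the question, (Greek | Common) followed by any number of (Inherited | Me | Mn), can probably be written the same way. This is only a sketch: the names greekRun and greekRuns are invented here, and the outer quantifier is + rather than * so that empty matches are not reported.

```dart
// Sketch of ((Greek | Common) (Inherited | Me | Mn)*)+ as a Dart pattern.
final greekRun = RegExp(
  r'(?:[\p{Script=Greek}\p{Script=Common}][\p{Script=Inherited}\p{Me}\p{Mn}]*)+',
  unicode: true,
);

// Hypothetical helper: all maximal Greek-or-Common runs in [text].
Iterable<String> greekRuns(String text) =>
    greekRun.allMatches(text).map((m) => m.group(0)!);

void main() {
  // Latin letters end a run; the comma, space, and '!' are Common.
  print(greekRuns('abcαβγ, δ!xyz').toList()); // [αβγ, δ!]
}
```

Because Common is part of the base class, spaces and punctuation next to Greek letters become part of the run; drop \p{Script=Common} if only the letters themselves are wanted.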
Solution 2:[2]
Even the simplest case is not supported, let alone set operations. Tested on https://dartpad.dev/:
void main() {
  RegExp exp = RegExp(r"(\p{Script:Greek})");
  String str = "?";
  Iterable<RegExpMatch> matches = exp.allMatches(str);
  for (Match m in matches) {
    final match = m.group(0);
    print(match);
  }
}
Got: no result
Expect: ?
Use Perl when you don't want to be disappointed.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Suragch |
Solution 2 | Suragch |