Dart support for using Script Property Values in Regular Expressions

The Unicode regular expression documentation describes doing complex matches for text. Specifically, I am wondering about matching various scripts within a string of text based on the script property values of the code points.

The Unicode documentation about Using Script Property Values in Regular Expressions refers to this possibility:

The script property is useful in regular expression syntax for easy specification of spans of text that consist of a single script or mixture of scripts. In general, regular expressions should use specific Script property values only in conjunction with both Common and Inherited. For example, to distinguish a sequence of characters appropriate for Greek text, one might use

((Greek | Common) (Inherited | Me | Mn)*)*

The preceding expression matches all characters that have a Script property value of Greek or Common and which are optionally followed by characters with a Script property value of Inherited. For completeness, the regular expression also allows any nonspacing or enclosing mark.

Some languages commonly use multiple scripts, so, for example, to distinguish a sequence of characters appropriate for Japanese text one might use:

((Hiragana | Katakana | Han | Latin | Common) (Inherited | Me | Mn)*)*

Is this implemented in Dart? I don't see it described in the Dart RegExp documentation or in the ECMAScript (JavaScript) regex specification that Dart regexes are based on.



Solution 1:[1]

Dart added support for Unicode property escapes in version 2.4, released in mid-2019 (see https://github.com/dart-lang/sdk/issues/34935). However, there is a gotcha: for this to work you must pass the optional named argument "unicode: true" to the RegExp() constructor so that your pattern is treated as a "unicode pattern". I've tested the following (which matches {L} letters, {N} numbers and {M} marks) and it works fine with the latest Dart SDK:

RegExp(r'[\p{L}\p{N}\p{M}]', unicode: true)
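To make the gotcha concrete, here is a minimal sketch that compiles the same pattern with and without the flag. The helper name matchesLetter is made up for illustration. As far as I understand, without the flag \p is not treated as a property escape at all, so the pattern silently fails to match letters instead of throwing.

```dart
/// Hypothetical helper: does [text] contain a Unicode letter according to
/// the pattern \p{L}, compiled with or without the unicode flag?
bool matchesLetter(String text, {bool unicode = false}) {
  // Raw string so the backslash reaches the regex engine untouched.
  return RegExp(r'\p{L}', unicode: unicode).hasMatch(text);
}

void main() {
  const letter = '\u00E9'; // LATIN SMALL LETTER E WITH ACUTE

  // Without the flag the pattern is not a property escape: no match.
  print(matchesLetter(letter)); // false

  // With unicode: true, \p{L} matches any letter.
  print(matchesLetter(letter, unicode: true)); // true
}
```

This is most likely why a test without unicode: true produces no result rather than an error.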

To match Greek characters, as per @daxim's example:

RegExp exp = RegExp(r'(\p{Script=Greek})', unicode: true);
Iterable<RegExpMatch> matches;
matches = exp.allMatches('???????????');
for (Match m in matches) {
  print('${m.group(1)}');
}
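
The patterns quoted in the question can be approximated with character classes. The following is only a sketch, assuming the Unicode notation translates as "base character from one of the listed scripts or Common, followed by any number of Inherited, Me, or Mn characters, repeated". The function name scriptRun is hypothetical, and the script names are passed straight into \p{Script=...}.

```dart
/// Hypothetical helper: builds a RegExp that matches runs of text made of the
/// given scripts plus Common, each base character optionally followed by
/// Inherited characters or enclosing/nonspacing marks.
RegExp scriptRun(List<String> scripts) {
  // e.g. ['Greek'] -> \p{Script=Greek}\p{Script=Common}
  final base = [...scripts, 'Common'].map((s) => '\\p{Script=$s}').join();
  return RegExp(
    '(?:[$base][\\p{Script=Inherited}\\p{Me}\\p{Mn}]*)+',
    unicode: true, // required, otherwise \p{...} is not a property escape
  );
}

void main() {
  final greek = scriptRun(['Greek']);
  // Greek alpha, beta, gamma, a space, delta, then Latin "abc".
  // The space has Script=Common, so it stays inside the run.
  print(greek.stringMatch('\u03B1\u03B2\u03B3 \u03B4abc'));

  final japanese = scriptRun(['Hiragana', 'Katakana', 'Han', 'Latin']);
  // Hiragana "konnichiwa", Latin text, then a Greek omega that ends the run.
  for (final m in japanese.allMatches('\u3053\u3093\u306B\u3061\u306F world \u03A9')) {
    print(m.group(0));
  }
}
```

Because spaces, digits, and most punctuation have Script=Common, this kind of pattern also matches runs that contain no Greek (or Japanese) characters at all, so you may want an extra check that the match contains at least one character from the target script.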

Solution 2:[2]

Even the simplest case is not supported, let alone set operations. Tested with https://dartpad.dev/

void main() {
  RegExp exp = new RegExp(r"(\p{Script:Greek})");
  String str = "?";
  Iterable<RegExpMatch> matches = exp.allMatches(str);
  for (Match m in matches) {
    final match = m.group(0);
    print(match);
  }
}

Got: no result

Expect: ?


Use Perl when you don't want to be disappointed.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Suragch
Solution 2 Suragch