Regex to match word delimiters in multilingual text

I have a text box in which a user can enter text in any language, and I need to split that text into words so that I can pass them to the Hunspell spell checker. For splitting, I use a regexp that matches word delimiters.

At first I used \W as the word delimiter to split the text into words, but that works only for Latin letters, as in English. With a non-Latin language, every letter is treated as \W, because \W is defined as any character matching [^a-zA-Z0-9_].

So far, (?![-'])[\pP|\pZ|\pC] seems to tokenize English, Spanish and Russian correctly. It treats all punctuation characters (except the hyphen and the apostrophe), all separator characters and all "other" characters (control, private use, etc.) as word delimiters. I excluded the hyphen and the apostrophe because they usually shouldn't be treated as word delimiters.
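For illustration only, here is a minimal sketch of that delimiter class ported to Java rather than Qt's QRegularExpression. The class name DelimiterSplitSketch and the method splitWords are made up for this example. Two assumptions differ from the pattern as written: the | characters are left out (inside square brackets they are literal pipes, not alternation), and the lookahead is repeated for every character so that a run of delimiters is consumed as one separator. The sample strings use Unicode escapes to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class DelimiterSplitSketch {
    // One or more delimiter characters: punctuation (P), separators (Z) or
    // "other" (C), except the ASCII hyphen and the ASCII apostrophe.
    private static final Pattern DELIMITERS =
            Pattern.compile("(?:(?![-'])[\\p{P}\\p{Z}\\p{C}])+");

    // Hypothetical helper: splits text on delimiter runs and drops empty tokens.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : DELIMITERS.split(text)) {
            if (!token.isEmpty()) { // a leading delimiter yields an empty first token
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // English sample with an apostrophe and a hyphen inside words.
        System.out.println(splitWords("It's a well-known fact, isn't it?"));
        // Russian sample (two Cyrillic words separated by a comma and a space).
        System.out.println(splitWords("\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440!"));
        // Spanish sample starting with an inverted question mark.
        System.out.println(splitWords("\u00bfC\u00f3mo est\u00e1s?"));
    }
}
```

Under these assumptions "It's" and "well-known" stay whole. A hyphen standing on its own between spaces would also survive as a token, so the caller may want to filter such tokens before spell checking.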

I came up with it only today and haven't tested it much, so I thought it wise to ask whether anyone knows of a regex better suited to matching word delimiters in multilingual text.

Note that I'm not concerned with languages that can't be tokenized, such as Japanese, Chinese, Thai, etc.

Update: Since people were asking what language I'm using (though it probably shouldn't matter much), I'm using C++ and Qt5's QRegularExpression class.

Solution 1:^[1]

With Java (for example), you can emulate word boundaries like this (don't forget to double the backslashes inside a Java string literal):

(?<![\p{L}\p{N}_])[\p{L}\p{N}_]+(?![\p{L}\p{N}_])

Here \p{L} matches any letter and \p{N} matches any numeric character.

Thus, you can easily split a string into "words" with: [^\p{L}\p{N}_]+

(I don't know which regex flavor you use, but for single-letter properties you can probably drop the curly brackets, as in \pL.)
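The two patterns from this answer can be tried side by side with a short Java sketch. The class name UnicodeWordSketch and the methods findWords and splitWords are hypothetical names chosen for this example. One consequence worth noticing is that the apostrophe and the hyphen are neither letters nor numbers, so with these patterns they act as delimiters, unlike in the question's own regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordSketch {
    // A maximal run of letters, numbers or underscores, bounded by lookarounds.
    private static final Pattern WORD = Pattern.compile(
            "(?<![\\p{L}\\p{N}_])[\\p{L}\\p{N}_]+(?![\\p{L}\\p{N}_])");

    // A run of anything that is not a letter, a number or an underscore.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}_]+");

    // Hypothetical helper: collects every match of the word pattern.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    // Hypothetical helper: splits on non-word runs and drops empty tokens.
    static List<String> splitWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : NON_WORD.split(text)) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Mixed Latin, digits and Cyrillic, written with Unicode escapes.
        String sample = "caf\u00e9 2024, \u043c\u0438\u0440_1!";
        System.out.println(findWords(sample));
        System.out.println(splitWords(sample));
        // The apostrophe is treated as a delimiter by these patterns.
        System.out.println(splitWords("don't stop"));
    }
}
```

Both helpers should return the same tokens for the same input. Matching words directly avoids the empty leading token that splitting can produce when the text starts with a delimiter.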

Solution 2:^[2]

In PHP this should work:

[\pL]*

In JavaScript you can use (set the "u" flag for Unicode after the closing delimiter):

/[\p{L}]*/u
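Note that these patterns match runs of letters (the words themselves) rather than the delimiters, and that the * quantifier also allows an empty match. A small Java sketch of the difference between * and + follows, kept in Java for consistency with Solution 1. The class name LetterRunSketch and the helper findAll are made up for this illustration, and switching to + is a suggestion of this example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterRunSketch {
    // Hypothetical helper: collects every match of the regex, empty ones included.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String sample = "ab, cd";
        // With '*', positions that hold no letter produce empty matches.
        System.out.println(findAll("\\p{L}*", sample));
        // With '+', only non-empty runs of letters are reported.
        System.out.println(findAll("\\p{L}+", sample));
    }
}
```

If the * form is kept, the empty strings can simply be filtered out before the words are passed to the spell checker.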

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	Joachim Feltkamp
