Why is an underscore (_) not regarded as a non-word character?

Why is an underscore (_) not regarded as a non-word character? The regexp \W matches all non-word characters, but it does not match the underscore.



Solution 1:[1]

According to Jeffrey Friedl's book about Regular Expressions, this originated as a change in Perl's regular expressions. It dates back to 1988 and follows from which characters were allowed in the name of a Perl variable [Page 89]:

Perl 2 was released in June 1988. Larry had replaced the regex code entirely, this time using a greatly enhanced version of the Henry Spencer package mentioned in the previous section. You could still have at most nine sets of parentheses, but now you could use | inside them. Support for \d and \s was added, and support for \w was changed to include an underscore, since then it would match what characters were allowed in a Perl variable name.

Solution 2:[2]

\W is defined as [^A-Za-z0-9_].

It is the opposite of \w which is [A-Za-z0-9_] and means "a word character".
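The two definitions above can be checked directly. The following is a minimal sketch in Python; the sample string is made up for illustration, and the `re.ASCII` flag is used on the assumption that the ASCII-only definitions quoted above are the ones of interest (without it, Python's `\w` also matches non-ASCII letters and digits).

```python
import re

# Made-up sample text containing an underscore, a hyphen, a space and a "!"
sample = "foo_bar-baz 42!"

# With re.ASCII, \w is [A-Za-z0-9_] and \W is its complement
word_chars = re.findall(r"\w", sample, flags=re.ASCII)
non_word_chars = re.findall(r"\W", sample, flags=re.ASCII)

# The explicit negated class from the answer, for comparison
explicit_non_word = re.findall(r"[^A-Za-z0-9_]", sample)

print(non_word_chars)                       # ['-', ' ', '!']
print(non_word_chars == explicit_non_word)  # True
print("_" in word_chars)                    # True: the underscore is a word character
```

The underscore shows up in the `\w` results and is absent from the `\W` results, while the hyphen, the space and the exclamation mark are all reported as non-word characters.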

It is not about words as you perceive them in a spoken language. The "word" here means an identifier: a word that can be used to name a variable or a type in a programming language.

Many programming languages allow only uppercase and lowercase letters, digits and the underscore (_) in identifiers. Some languages allow other characters, but back when regular expressions were invented, fewer languages were that permissive, and most of them allowed only the characters that match \w in identifiers.
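To illustrate the link between \w and identifiers, here is a small sketch in Python. The `IDENTIFIER` pattern and the `is_identifier` helper are hypothetical names introduced for this example. The rule that an identifier may not start with a digit is an assumption borrowed from C-like languages, not something stated above.

```python
import re

# Hypothetical pattern: a letter or underscore, followed by any number of
# \w characters. re.ASCII keeps \w limited to [A-Za-z0-9_].
IDENTIFIER = re.compile(r"[A-Za-z_]\w*", flags=re.ASCII)

def is_identifier(name: str) -> bool:
    """Return True if the whole string looks like a C-style identifier."""
    return IDENTIFIER.fullmatch(name) is not None

candidates = ["my_var", "_private", "snake_case_2", "kebab-case", "2fast"]
results = {name: is_identifier(name) for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```

Names containing underscores are accepted because the underscore is part of \w, while `kebab-case` is rejected because the hyphen is a non-word character. `2fast` is rejected only because of the assumed leading-character rule.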

Solution 3:[3]

The definition of a "word character" is based on the characters that can be used as part of an identifier in many programming languages, that is, [A-Za-z0-9_].

Solution 4:[4]

According to regex101: \W matches any non-word character (equal to [^a-zA-Z0-9_]). This seems to be a design choice.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 revo
Solution 2
Solution 3 Andrew Svietlichnyy
Solution 4