Regex for getting region and lang from URL /xx/xx in two groups

I have a URL structure where the first subdirectory is the region and the second, optional one is the language override:

https://example.com/no/en

I'm trying to capture each of the two parts in its own group. That way, in the JS, I can do the following to get each part of the URL:

const pathname = window.location.pathname // '/no/en/mypage'
const match = pathname.match('xxx')
const region = match[1]   // 'no' or '/no'
const language = match[2] // 'en' or '/en'

I have tried several regexes without managing to meet all of the requirements below. This is the closest I have come, but it is prone to error because it also matches "/do" from /donotmatch:

(\/[a-z]{2})(\/[a-z]{2})?

The problem with this one is that it also matches cases like /noada. I then tried to match the first two a-z characters followed by either a forward slash or no more characters, like this:

(\/[a-z]{2}\/|[^.])([a-z]{2}\/|[^.])?

I think I am not getting the syntax right for the "not" part.

The regex I am trying to create has to meet these criteria in order not to break:

  • /no - group 1 match(no), group 2 undefined
  • /no/ - group 1 match(no), group 2 undefined
  • /nona - no matches
  • /no/en - group 1 match(no), group 2 match(en)
  • /no/en/ - group 1 match(no), group 2 match(en)
  • /no/enen - group 1 match(no), group 2 undefined
  • /no/en/something - group 1 match(no), group 2 match(en)
  • /no/en/jp - group 1 match(no), group 2 match(en) (jp is not going to be matched)

I feel I am really close to a working solution, but all my attempts so far have been slightly off.

If capturing the parts in groups is not possible, I suppose matching /xx/xx and then splitting by / is also an option.
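
A minimal sketch of that split-based fallback, assuming region and language are always exactly two lowercase letters. The helper name splitRegionLang is hypothetical and only for illustration:

```javascript
// Hypothetical helper: split the pathname on "/" and validate the first two segments.
function splitRegionLang(pathname) {
  // Assumption: a valid code is exactly two lowercase letters.
  const isCode = (segment) => typeof segment === 'string' && /^[a-z]{2}$/.test(segment)

  // '/no/en/mypage'.split('/') gives ['', 'no', 'en', 'mypage'];
  // the leading empty string is skipped by the destructuring.
  const [, first, second] = pathname.split('/')

  if (!isCode(first)) return null
  return { region: first, language: isCode(second) ? second : undefined }
}

console.log(splitRegionLang('/no/en/mypage')) // { region: 'no', language: 'en' }
console.log(splitRegionLang('/no/enen'))      // { region: 'no', language: undefined }
console.log(splitRegionLang('/nona'))         // null
```

Because each segment is checked as a whole, /nona and /no/enen are rejected without any lookahead.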



Solution 1:[1]

You may use this regex with an optional 2nd capture group:

\/(\w{2})(?:\/(\w{2}))?(?:\/|$)

RegEx Explanation:

  • \/: Match starting /
  • (\w{2}): First capture group to match 2 word characters
  • (?:\/(\w{2}))?: Optional non-capture group that starts with a / followed by the second capture group, which matches 2 word characters.
  • (?:\/|$): Match closing / or end of line
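
As a rough usage sketch, here is the regex applied to the cases listed in the question. The leading ^ is an added assumption, so the match can only start at the beginning of the pathname and not at a later /xx segment. The helper name parseRegionLang is hypothetical:

```javascript
// The regex from this answer, with ^ added (an assumption) to anchor it
// to the start of the pathname.
const REGION_LANG = /^\/(\w{2})(?:\/(\w{2}))?(?:\/|$)/

// Hypothetical helper, not part of the original answer.
function parseRegionLang(pathname) {
  const match = pathname.match(REGION_LANG)
  if (!match) return null
  // match[2] is undefined when the optional language group did not take part.
  return { region: match[1], language: match[2] }
}

console.log(parseRegionLang('/no/en/mypage')) // { region: 'no', language: 'en' }
console.log(parseRegionLang('/no/enen'))      // { region: 'no', language: undefined }
console.log(parseRegionLang('/nona'))         // null
```

For /no/enen the optional group is first tried and then abandoned, because the closing (?:\/|$) cannot match after "en"; the engine backtracks and matches only /no/.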

Solution 2:[2]

Follow each capture with (?=$|/), a lookahead that asserts that what comes next is either the end of input or a slash.

https?://[^/]+/(\w\w)(?=$|/)(?:/(\w\w)(?=$|/))?

The second capture is wrapped in an optional non-capture group via (?:…)?

To be stricter and allow only letters, replace \w with [a-z], but \w may be enough for your needs.
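
A sketch of how this could look in JavaScript, where each / has to be escaped inside a regex literal. The leading ^ and the helper name parseUrl are assumptions added for illustration:

```javascript
// The pattern from this answer as a JavaScript regex literal.
// ^ is an added assumption so the match starts at the scheme.
const URL_REGION_LANG = /^https?:\/\/[^\/]+\/(\w\w)(?=$|\/)(?:\/(\w\w)(?=$|\/))?/

// Hypothetical helper, not part of the original answer.
function parseUrl(url) {
  const match = url.match(URL_REGION_LANG)
  return match ? { region: match[1], language: match[2] } : null
}

console.log(parseUrl('https://example.com/no/en'))   // { region: 'no', language: 'en' }
console.log(parseUrl('https://example.com/no/enen')) // { region: 'no', language: undefined }
console.log(parseUrl('https://example.com/nona'))    // null
```

Note that the lookahead only accepts a slash or the end of input, so a URL where a query string or hash follows the code directly, such as https://example.com/no?x=1, would not match.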

Solution 3:[3]

I just came across a new way of getting the same result, using the new URL Pattern API.

The API is quite new as of the writing of this answer, but there is a polyfill you can use to add support for it right now.

const pattern = new URLPattern({ pathname: '/:region(\\w{2})/:lang(\\w{2})?' })
const result = pattern.exec('https://example.com/no/en')?.pathname?.groups
const region = result?.region // no or undefined
const lang = result?.lang // en or undefined

To solve the issue with trailing slashes, one could strip them from the URL string before passing it to the exec method.

// ...
const urlWithoutTrailingSlashes = 'https://example.com/no/en/'.replace(/\/+$/, '')
const result = pattern.exec(urlWithoutTrailingSlashes)?.pathname?.groups
// ...

I have not yet found a way to handle the optional trailing slashes in the regex inside the pattern, because of the limitations around lookaheads and end-of-string matching. If anyone finds a way, please edit this answer or add a comment to it.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 anubhava
Solution 2
Solution 3 Sølve Tornøe