Regex finding models from product list
I am trying to retrieve the product models from a list of product titles.
Since it is difficult to extract the model from a title, I decided to start by extracting substrings that contain uppercase letters AND numbers (they may also contain special characters, but those are not required).
Some examples would be:
- Apple iPhone 8 Plus 64GB Tela Retina 5.5" 12MP/7MP iOS 11 - Prata
- Smart TV QLED de 55" Samsung QN55Q7FAMP com HDMI/USB/Wi-Fi Bivolt
- Smart TV QLED de 65" Samsung QN55Q7FAMP com HDMI/USB/Wi-Fi Bivolt
- MEMORIA DDR4 CRUCIAL 16GB/2400 CRUCIAL BLS16G4D240FSE BALLISTIX S
- MEMORIA DDR4 CRUCIAL 16GB/2400 CRUCIAL BLS16G4D240FSB BALLISTIX S
- MEMORIA DDR4 CRUCIAL 16GB/2400 CRUCIAL BLS16G4D240FSC BALLISTIX S
- MEMORIA DDR4 CRUCIAL 16GB/2400 CRUCIAL CT16G4DFD824A (SIN BLISTER
- Projetor LG MiniBeam PW1500G 1500 Lumens WXGA (1280x800) HDMI/USB
I know many of them will be captured incorrectly. To avoid some errors, I am thinking of building a dictionary of strings to ignore (like DDR4, xxGB, etc.).
I started trying with this. I am getting words with uppercase letters AND/OR numbers. How can I get words with BOTH uppercase letters and numbers, optionally with special characters (they are fine if present, but not required)?
This was my first approach to the problem. Of course, other solutions, with or without regex, would be very welcome.
Solution 1:[1]
Maybe try to match blocks that contain at least one capital letter and one number? Something like this ensures that the block contains at least one capital letter immediately followed by at least one number. You would need to add an alternation ('or') to cover the reverse order, where the number comes first.
.+ ([A-Z0-9]*[A-Z]+[0-9]+[A-Z0-9]*) .+
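One way to avoid writing the 'or' by hand is to use two lookaheads, one requiring an uppercase letter and one requiring a digit, so the order no longer matters. The sketch below is only an illustration of that idea in Python. The names `TOKEN`, `IGNORE` and `candidate_models` are hypothetical, and the ignore patterns are an assumption based on the question's idea of a list of strings to skip (DDR4, xxGB, etc.), not part of the original answer.

```python
import re

# A whitespace-delimited token made of uppercase letters, digits and dashes
# that contains at least one uppercase letter AND at least one digit,
# in any order (each lookahead checks one of the two requirements).
TOKEN = re.compile(
    r"(?<!\S)(?=[A-Z0-9-]*[A-Z])(?=[A-Z0-9-]*[0-9])[A-Z0-9-]+(?!\S)"
)

# Hypothetical ignore patterns (assumption): memory generations such as DDR4
# and sizes or resolutions such as 64GB, 1TB or 12MP.
IGNORE = re.compile(r"DDR\d|\d+(?:GB|TB|MP)")


def candidate_models(title):
    """Return tokens that look like model numbers and are not ignored."""
    return [t for t in TOKEN.findall(title) if not IGNORE.fullmatch(t)]


if __name__ == "__main__":
    print(candidate_models(
        "MEMORIA DDR4 CRUCIAL 16GB/2400 CRUCIAL BLS16G4D240FSE BALLISTIX S"
    ))
```

Because the token must be delimited by whitespace, something like 16GB/2400 is skipped entirely; whether that is desirable depends on the data.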
Solution 2:[2]
Though this is an old question, I found it intriguing and tried to come up with a suitable regex:
\s(?![A-Z]*DDR)([A-Z](([A-Z]|-)*[0-9]+([A-Z]|-)*)+)
The main assumption is that model numbers all start with an uppercase letter, so model numbers that start with a digit are inevitably missed. Some breakdown:
\s
- Preceded by a space
(?![A-Z]*DDR)
- Negative lookahead that skips any token in which DDR follows zero or more uppercase letters (e.g. DDR4)
The model number is composed of a HEAD and a TAIL.
HEAD:
[A-Z]
- A single uppercase character
TAIL:
([A-Z]|-)*
- Starts with zero or more of either an uppercase character or a dash
[0-9]+
- Must include at least one numeric digit
([A-Z]|-)*
- Ends with zero or more of either an uppercase character or a dash
(([A-Z]|-)*[0-9]+([A-Z]|-)*)+
- The TAIL can repeat, so more tails can be attached one after another
There are still false positives in there, but over 90% of the model numbers from OP's sample data have been picked up.
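As a rough check, the pattern can be run against a few of the sample titles with Python's `re` module. This is a minimal sketch; the names `MODEL` and `find_models` are hypothetical. Since the pattern contains nested groups, `finditer` with `group(1)` is used to get only the outer capture.

```python
import re

# The regex from this answer, unchanged. Group 1 is the whole model number.
MODEL = re.compile(r"\s(?![A-Z]*DDR)([A-Z](([A-Z]|-)*[0-9]+([A-Z]|-)*)+)")


def find_models(title):
    """Return every group-1 match (HEAD plus TAILs) found in the title."""
    return [m.group(1) for m in MODEL.finditer(title)]


if __name__ == "__main__":
    titles = [
        'Smart TV QLED de 55" Samsung QN55Q7FAMP com HDMI/USB/Wi-Fi Bivolt',
        "MEMORIA DDR4 CRUCIAL 16GB/2400 CRUCIAL CT16G4DFD824A (SIN BLISTER",
        "Projetor LG MiniBeam PW1500G 1500 Lumens WXGA (1280x800) HDMI/USB",
    ]
    for title in titles:
        print(find_models(title))
```

Note that the leading `\s` means a model number at the very start of a title would not be matched, and the iPhone title yields nothing because none of its tokens start with an uppercase letter and contain a digit.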
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | KalenGi |