'How to Check Which Record is non-numeric in a String Column in Delta Table
I am working on Delta table using Databricks on Azure.
The Delta table contains about 100 million records with many columns. One column data type of which is String, but containing large numbers (BIGINT). There are just some records that have some random non-numeric values.
Question: Is there any way to Select these non-numeric records from this table via Databricks SQL notebook?
Data Example:
A | B | C | D | E | ... |
---|---|---|---|---|---|
'838472910' | ... | ... | ... | ... | ... |
'392018' | ... | ... | ... | ... | ... |
'10293849021' | ... | ... | ... | ... | ... |
'NULL' | ... | ... | ... | ... | ... |
'9384038' | ... | ... | ... | ... | ... |
'21314' | ... | ... | ... | ... | ... |
'UNMAPPED' | ... | ... | ... | ... | ... |
'3840594739' | ... | ... | ... | ... | ... |
'UNIDENTIFIED' | ... | ... | ... | ... | ... |
'29380072' | ... | ... | ... | ... | ... |
'592812012' | ... | ... | ... | ... | ... |
'8432178930' | ... | ... | ... | ... | ... |
'EMPTYVALUE' | ... | ... | ... | ... | ... |
Solution 1:[1]
There are several possibilities for that:
- Try to cast value to
bigint
, and select rows where result of cast isnil
:
select * from test where cast(data as bigint) is null
- Use
regexp
operator (doc) to perform check:
select * from test where data not regexp '^[0-9]+$'
Both approaches are giving the same result, although I think that first one would be a bit faster (although it needs testing to confirm that).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Alex Ott |