'How to find position of substring column in a another column using PySpark?
If I have a PySpark DataFrame with two columns, text
and subtext
, where subtext
is guaranteed to occur somewhere within text
. How would I calculate the position of subtext
in text
column?
Input data:
+---------------------------+---------+
| text | subtext |
+---------------------------+---------+
| Where is my string? | is |
| Hm, this one is different | on |
+---------------------------+---------+
Expected output:
+---------------------------+---------+----------+
| text | subtext | position |
+---------------------------+---------+----------+
| Where is my string? | is | 6 |
| Hm, this one is different | on | 9 |
+---------------------------+---------+----------+
Note: I can do this using static text/regex without issue, I have not been able to find any resources on doing this with a row-specific text/regex.
Solution 1:[1]
You can use locate
. You need to subtract 1 because string index starts from 1, not 0.
import pyspark.sql.functions as F
df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))
df2.show(truncate=False)
+-------------------------+-------+--------+
|text |subtext|position|
+-------------------------+-------+--------+
|Where is my string? |is |6 |
|Hm, this one is different|on |9 |
+-------------------------+-------+--------+
Solution 2:[2]
Another way using position
SQL function :
from pyspark.sql.functions import expr
df1 = df.withColumn('position', expr("position(subtext in text) -1"))
df1.show(truncate=False)
#+-------------------------+-------+--------+
#|text |subtext|position|
#+-------------------------+-------+--------+
#|Where is my string? |is |6 |
#|Hm, this one is different|on |9 |
#+-------------------------+-------+--------+
Solution 3:[3]
pyspark.sql.functions.instr(str, substr)
Locate the position of the first occurrence of substr column in the given string. Returns null if either of the arguments are null.
import pyspark.sql.functions as F
df.withColumn('pos',F.instr(df["text"], df["subtext"]))
Solution 4:[4]
You can use locate itself. The problem is first parameter of locate (substr) should be string.
So you can use expr function to convert column to string
Please find the correct code as below:
df=input_df.withColumn("poss", F.expr("locate(subtext,text,1)"))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | mck |
Solution 2 | blackbishop |
Solution 3 | nobody |
Solution 4 | saranya rasamani |