Regex: remove the hyperlink from a column in a pandas DataFrame (imported from CSV)
I read data from a CSV file:
data_frame2 = pd.read_csv("search query.csv")
print(data_frame2['id'][0])
The output is:
HYPERLINK("https://something.com/resource/1308610617","1308610617")'
What I have tried:
data_frame2['id'] = data_frame2['id'].str.replace(r'https?://[^\s<>"]+|www\.[^\s<>"]+', "")
What I want to see is the last number for every row in that column:
1308610617
Solution 1:[1]
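If the number is always the last run of digits in the string, this may help. Note that the sample value ends in ") rather than in a digit, so the pattern takes the last match instead of anchoring on the end of the string with $:
import re
data_frame2['id'] = data_frame2['id'].apply(lambda x: int(re.findall(r"[0-9]+", x)[-1]))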
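A minimal sketch of a standalone helper that takes the last run of digits in a string, which also covers values that end in `")` like the sample in the question. The name `last_number` is hypothetical, and returning `None` for cells without digits is an assumption about how such rows should be handled.

```python
import re

def last_number(text):
    """Return the last run of digits in text as an int, or None if there is none."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None

sample = 'HYPERLINK("https://something.com/resource/1308610617","1308610617")'
print(last_number(sample))  # 1308610617
```

The helper could then be passed to `data_frame2['id'].apply(last_number)` in place of the lambda.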
Solution 2:[2]
To capture only the number at the end of the link location, you can build a regex that starts with a word boundary (\b), followed by one or more digits (\d+) inside a capturing group, followed by the literal characters ", that close the URL argument. Then use the pandas str.extract function to return only the desired part. The capturing group ensures that only the digits are returned by str.extract.
import pandas as pd
regex = r'\b(\d+)",'
df = pd.DataFrame({'id':['HYPERLINK("https://something.com/resource/1308610617","1308610617")']})
df['Num'] = df['id'].str.extract(regex)
print(df)
Output from df:
                                                                    id         Num
0  HYPERLINK("https://something.com/resource/1308610617","1308610617")  1308610617
Similarly, to capture the digits in the "friendly_name" part of the function, you can replace the regex in the code above with \b(\d+)"\) (a word boundary followed by one or more digits, followed by a double quote and the character )).
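As a rough sketch of that variant, the snippet below applies the friendly-name pattern to the same sample row. The variable name `friendly_regex` is illustrative, and `expand=False` is used here so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'id': ['HYPERLINK("https://something.com/resource/1308610617","1308610617")']})

# Digits that sit immediately before the closing `")` of the formula,
# i.e. the friendly_name argument rather than the URL.
friendly_regex = r'\b(\d+)"\)'

# With a single capturing group and expand=False, the result is a Series of strings.
df['Num'] = df['id'].str.extract(friendly_regex, expand=False)
print(df['Num'].iloc[0])
```

The extracted values are strings; if integers are needed, something like `pd.to_numeric(df['Num'])` can convert the column afterwards.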
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Midhilesh Momidi |
Solution 2 | n1colas.m |