Regex to remove the hyperlink from a column in a pandas DataFrame (imported from CSV)

I read data from a CSV file:

data_frame2 = pd.read_csv("search query.csv")

print(data_frame2['id'][0])

prints

HYPERLINK("https://something.com/resource/1308610617","1308610617")

What I have tried:

data_frame2['id'] = data_frame2['id'].str.replace(r'https?://[^\s<>"]+|www\.[^\s<>"]+', "")

What I want to see is the last number for every row in that column:

1308610617


Solution 1:[1]

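If the number you want is always the last run of digits in the cell, this may help. The cell ends with "), so the pattern is not anchored with $; the last match is taken instead:

import re

data_frame2['id'] = data_frame2['id'].apply(lambda x: int(re.findall(r"[0-9]+", x)[-1]))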
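
The same "last number in the cell" idea can also be sketched with the vectorized str.extract accessor instead of apply. This is only a sketch, not part of the original answer: the pattern (\d+)\D*$ is an assumption chosen for illustration, and it assumes every cell contains at least one run of digits.

```python
import pandas as pd

# Sample value and column name copied from the question.
df = pd.DataFrame({'id': ['HYPERLINK("https://something.com/resource/1308610617","1308610617")']})

# (\d+) captures a run of digits; \D*$ requires that only non-digit
# characters follow it up to the end of the string, so the capture is
# the last number in the cell.
last_number = df['id'].str.extract(r'(\d+)\D*$', expand=False)

# str.extract returns strings; convert to integers
# (this raises if any row failed to match and is NaN).
df['id'] = last_number.astype('int64')
print(df['id'][0])
```

Rows without any digits would come back as NaN from str.extract, so it may be worth checking last_number.isna().any() before the integer conversion.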

Solution 2:[2]

To capture only the trailing digits of the link location, you can write a regex that starts with a word boundary, followed by one or more digits (\d+) inside a capturing group, followed by the literal characters ", that close the first argument. Then use the pandas str.extract function to return only the desired part. The capturing group ensures that only the digits are returned by the extract function.

import pandas as pd

regex = r'\b(\d+)",'

df = pd.DataFrame({'id':['HYPERLINK("https://something.com/resource/1308610617","1308610617")']})

df['Num'] = df['id'].str.extract(regex)
print(df)

Output from df

                                                                    id         Num
0  HYPERLINK("https://something.com/resource/1308610617","1308610617")  1308610617

Similarly, to capture the digits in the "friendly_name" part of the function, you can replace the regex in the code above with \b(\d+)"\) (a word boundary followed by one or more digits, followed by a double quote and a closing parenthesis).
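
To make the difference between the two patterns visible, here is a small sketch that applies both to the same cell. The sample value is made up for illustration (the numbers 111 and 222 are hypothetical, chosen so that the link location and the friendly name differ), and the column names link_num and friendly_num are likewise assumptions.

```python
import pandas as pd

# Hypothetical cell where the URL number and the friendly name differ.
df = pd.DataFrame({'id': ['HYPERLINK("https://something.com/resource/111","222")']})

# Digits that end the link location: followed by a double quote and a comma.
link_regex = r'\b(\d+)",'

# Digits in the friendly name: followed by a double quote and ")".
friendly_regex = r'\b(\d+)"\)'

# expand=False returns a Series (one capturing group) instead of a DataFrame.
df['link_num'] = df['id'].str.extract(link_regex, expand=False)
df['friendly_num'] = df['id'].str.extract(friendly_regex, expand=False)

print(df[['link_num', 'friendly_num']])
```

Both columns hold strings; append .astype(int) if numeric values are needed.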

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Midhilesh Momidi
Solution 2 n1colas.m