Scrape data and convert it to dates in BeautifulSoup
I am scraping data from several web pages, and the data contains several dates. The code that extracts the information I want looks like this; I have only included the part that deals with the dates.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# urlsjugement and headers are defined earlier in the script (not shown).
data = []
for url in urlsjugement:
    soup = BeautifulSoup(
        requests.get(url, headers=headers).content, "html.parser"
    )
    title = soup.select_one("#identite_deno").get_text(strip=True)
    try:
        active = soup.select_one('td:-soup-contains("Jugement") + td').get_text(
            strip=True)
    except AttributeError:
        # select_one returned None: the page has no "Jugement" row
        active = "In activity"
    date = soup.select_one('td:-soup-contains("Date création entreprise") + td').get_text(
        strip=True)
    data.append([title, active, date])
df = pd.DataFrame(
    data,
    columns=["Title", "Active", "Date"],
)
print(df.to_markdown())
First, I would like to split the judgment and the judgment date into two separate fields so that I can compare the two dates. Each business has a creation date and a closing date, and I would like to compute how long the business existed. Is that possible?
|    | Title          | Active                                | Date       |
|---:|:---------------|:--------------------------------------|:-----------|
|  0 | 1804 TRANSPORT | Liquidation judiciaire le 07-01-2022- | 28-01-2013 |
The Active column holds two pieces of information, and I want to separate them. After that, I want to calculate the time between the two dates. Thanks for your help!
Solution 1:[1]
I only tried it with your first URL, but inside your for loop, I would make this change:
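# Requires `from datetime import datetime` at the top of the script.
title = soup.select_one("#identite_deno").text
# Creation date: first child of the cell that follows the "Date création entreprise" label
start = list(soup.select_one('td:-soup-contains("Date création entreprise") + td'))[0].text.strip()
# Judgment date: the text after "le " in the first string of the red cell
end = list(soup.select_one('td.red').stripped_strings)[0].split('le ')[1]
# Lifespan as a timedelta between the two dd-mm-YYYY dates
days = datetime.strptime(end, '%d-%m-%Y') - datetime.strptime(start, '%d-%m-%Y')
data.append([title, start, end, days.days])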
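If you would rather work from the strings you already collect in the Active and Date columns, here is a minimal standalone sketch of the same idea. It assumes the judgment text always follows the pattern "<label> le dd-mm-YYYY", as in the sample row from the question. The helper names split_judgment and lifespan_days are hypothetical, not part of BeautifulSoup or pandas.

```python
import re
from datetime import datetime

DATE_FORMAT = "%d-%m-%Y"  # day-month-year, as in the scraped sample row


def split_judgment(active):
    """Split 'Liquidation judiciaire le 07-01-2022-' into (label, date string).

    Returns (active, None) when no 'le dd-mm-YYYY' part is found,
    for example for the fallback value 'In activity'.
    """
    match = re.search(r"^(.*?)\s+le\s+(\d{2}-\d{2}-\d{4})", active)
    if match is None:
        return active, None
    return match.group(1), match.group(2)


def lifespan_days(start, end):
    """Number of days between two dd-mm-YYYY strings, or None if end is missing."""
    if end is None:
        return None
    delta = datetime.strptime(end, DATE_FORMAT) - datetime.strptime(start, DATE_FORMAT)
    return delta.days


# Values taken from the sample row in the question.
judgment, end = split_judgment("Liquidation judiciaire le 07-01-2022-")
print(judgment, end, lifespan_days("28-01-2013", end))
```

Inside the loop you could then append [title, judgment, end, date, lifespan_days(date, end)] and add the matching column names to the DataFrame. Businesses that are still active get None for both the closing date and the lifespan.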
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jack Fleeting |