How to ignore infobox when scraping title from Wikipedia anchor text?
I am trying to scrape the first 20 links on a Wikipedia page but I want to ignore the infobox on the right side. It has a 'table' tag. Here is what I have so far, any help would be greatly appreciated.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Wales")
soup = BeautifulSoup(response.text, "html.parser")
all_links = {}
count = 0
IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
          "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " "]
content = soup.find('div', {'class':'mw-parser-output'})
for link in content.find_all("a"):
    if count <= 20:
        url = link.get("title", "")
        if not any(url.startswith(x) for x in IGNORE) and url != "":
            count = count + 1
            print(url)
    else:
        break
Solution 1:[1]
One approach is to select your elements more specifically, e.g. with CSS selectors and the :not() pseudo-class:
soup.select('div.mw-parser-output a:not(.infobox a)')
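To see what this selector does without fetching the live page, here is a minimal offline sketch. The HTML fragment is made up for illustration and is much simpler than real Wikipedia markup; it only assumes that the infobox is an element with the class "infobox" inside div.mw-parser-output.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; real Wikipedia markup is more complex.
html = """
<div class="mw-parser-output">
  <table class="infobox">
    <tr><td><a href="/wiki/Cardiff" title="Cardiff">Cardiff</a></td></tr>
  </table>
  <p>
    <a href="/wiki/Welsh_language" title="Welsh language">Welsh</a>
    <a href="/wiki/England" title="England">England</a>
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# :not(.infobox a) drops every <a> that has an ancestor with class "infobox".
links = soup.select("div.mw-parser-output a:not(.infobox a)")
titles = [a.get("title", "") for a in links]
print(titles)
```

Only the two links in the paragraph should be printed; the "Cardiff" link inside the infobox table is filtered out by the selector itself, before any title checks run.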
Example
import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Wales")
soup = BeautifulSoup(response.text, "html.parser")
all_links = {}
count = 0
# Title prefixes that mark non-article links
IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
          "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " "]
# Select every link in the article body except those inside the infobox
for link in soup.select('div.mw-parser-output a:not(.infobox a)'):
    # Note: `<= 20` lets 21 titles through (count runs 0..20); use `< 20` for exactly 20
    if count <= 20:
        url = link.get("title", "")
        if not any(url.startswith(x) for x in IGNORE) and url != "":
            count = count + 1
            print(url)
    else:
        break
Output
Wales (disambiguation)
Welsh language
About this sound
Cymru.ogg
Countries of the United Kingdom
United Kingdom
England
Wales–England border
Severn Estuary
Bristol Channel
Irish Sea
Snowdon
Temperateness
Maritime climate
Cardiff
Welsh people
Celtic Britons
Roman withdrawal from Britain
Celtic nations
Llywelyn ap Gruffudd
Edward I of England
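The loop above can also be wrapped in a small function that stops at exactly `limit` titles. This is only a sketch: `first_titles` and `sample` are hypothetical names introduced here, not part of the original answer, and the sample HTML is made up so the function can be tried offline.

```python
from itertools import islice

from bs4 import BeautifulSoup

# Same prefixes as above, as a tuple so str.startswith can take them directly.
IGNORE = ("Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
          "User talk:", "Module:", "Help:", "File:", "Portal:", "#", " ")


def first_titles(html, limit=20):
    """Return up to `limit` link titles found outside the infobox (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("div.mw-parser-output a:not(.infobox a)")
    titles = (a.get("title", "") for a in links)
    # Skip links without a title and titles that start with an ignored prefix.
    wanted = (t for t in titles if t and not t.startswith(IGNORE))
    return list(islice(wanted, limit))


# Made-up fragment for illustration only.
sample = """<div class="mw-parser-output">
<table class="infobox"><tr><td><a title="Cardiff">Cardiff</a></td></tr></table>
<p><a title="Help:IPA">IPA</a> <a title="England">England</a>
<a title="Snowdon">Snowdon</a> <a>no title</a></p></div>"""

print(first_titles(sample))
print(first_titles(sample, limit=1))
```

For the live page, the same function could be called as `first_titles(requests.get("https://en.wikipedia.org/wiki/Wales").text)`, assuming `requests` is imported as in the example above.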
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | HedgeHog |