Python - BeautifulSoup - How to return two different elements or more, with different attributes?
HTML example
<html>
<div book="blue" return="abc">
<h4 class="link">www.example.com</h4>
<p class="author">RODRIGO</p>
</html>
Ex1:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res = page_soup.find_all(attrs={"class": ["author", "link"]})
for each in res:
    print(each)
Result1:
www.example.com RODRIGO
Ex2:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res = page_soup.find_all(attrs={"book": ["blue"]})
for each in res:
    print(each["return"])
Result 2:
abc
!!!puzzle!!!
The question I have is: how can I return all three results in a single query?
Result 3
www.example.com RODRIGO abc
Solution 1:[1]
The example HTML seems to be broken. Assuming the div wraps the other tags and may not be the only book, you can select all books:
for e in soup.find_all(attrs={"book": ["blue"]}):
    print(' '.join(e.stripped_strings), e.get('return'))
Example
from bs4 import BeautifulSoup
html = '''
<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all(attrs={"book": ["blue"]}):
    print(' '.join(e.stripped_strings), e.get('return'))
Output
www.rodrigo.com RODRIGO abc
A more structured example could be:
data = []
for e in soup.select('[book="blue"]'):
    data.append({
        'link': e.h4.text,
        'author': e.select_one('.author').text,
        'return': e.get('return')
    })
print(data)
Output:
[{'link': 'www.rodrigo.com', 'author': 'RODRIGO', 'return': 'abc'}]
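The structured records above can also be flattened back into the single line the question asks for. A minimal self-contained sketch (the closing `</div>` is added here so the sample HTML is well-formed; the parser choice is an assumption):

```python
from bs4 import BeautifulSoup

html = '''
<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</div>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Collect one dict per book element, as in the structured example above
data = []
for e in soup.select('[book="blue"]'):
    data.append({
        'link': e.h4.text,
        'author': e.select_one('.author').text,
        'return': e.get('return'),
    })

# Join each record's values into the single-line form from the question
lines = [' '.join(d.values()) for d in data]
print(lines)  # → ['www.rodrigo.com RODRIGO abc']
```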
Solution 2:[2]
For the case of matching one attribute against many values, a regex approach can be used:
from bs4 import BeautifulSoup
import re
html = """<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
soup = BeautifulSoup(html, 'lxml')
by_clss = soup.find_all(class_=re.compile(r'link|author'))
print(by_clss)
For more flexibility, a custom query function can be passed to find or find_all:
from bs4 import BeautifulSoup
html = """<html>
<div book="blue" return="abc"></div> <!-- div needs a closing tag in an html doc -->
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
def query(tag):
    if tag.has_attr('class'):
        # tag['class'] is a list; here it is assumed to have only one value
        return set(tag['class']) <= {'link', 'author'}
    if tag.has_attr('book'):
        return tag['book'] in {'blue'}
    return False

print(soup.find_all(query))
# [<div book="blue" return="abc"></div>, <h4 class="link">www.rodrigo.com</h4>, <p class="author">RODRIGO</p>]
Notice that your HTML sample has no closing div tag. In my second example I added it; otherwise the soup... will not taste good.
EDIT: To retrieve elements that satisfy simultaneous conditions on several attributes, the query could look like this:
def query_by_attrs(**tag_kwargs):
    # tag_kwargs: {attr: [val1, val2], ...}
    def wrapper(tag):
        for attr, values in tag_kwargs.items():
            if tag.has_attr(attr):
                # check if the tag has a multi-valued attribute (class, ...)
                if not isinstance((tag_attr := tag[attr]), list):  # := needs Python >= 3.8
                    tag_attr = (tag_attr,)  # as tuple
                return bool(set(tag_attr).intersection(values))  # False if empty set
    return wrapper
q_data = {'class': ['link', 'author'], 'book': ['blue']}
results = soup.find_all(query_by_attrs(**q_data))
print(results)
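Run against the second HTML sample above, the combined query matches all three elements in document order. A self-contained sketch (the `html.parser` choice and the `names` helper list are assumptions added for illustration):

```python
from bs4 import BeautifulSoup

html = """<html>
<div book="blue" return="abc"></div>
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""

def query_by_attrs(**tag_kwargs):
    # tag_kwargs: {attr: [val1, val2], ...}
    def wrapper(tag):
        for attr, values in tag_kwargs.items():
            if tag.has_attr(attr):
                tag_attr = tag[attr]
                if not isinstance(tag_attr, list):  # class is multi-valued, book is not
                    tag_attr = (tag_attr,)
                return bool(set(tag_attr).intersection(values))
        return False
    return wrapper

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all(query_by_attrs(**{'class': ['link', 'author'], 'book': ['blue']}))
names = [t.name for t in results]
print(names)  # → ['div', 'h4', 'p']
```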
Solution 3:[3]
Extract all links from a website:
import requests
from bs4 import BeautifulSoup
url = 'https://mixkit.co/free-stock-music/hip-hop/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    urls.append(link.get('href'))
    print(link.get('href'))
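The same extraction pattern can be checked offline. A hedged sketch where the HTML is a made-up stand-in for the network response (a real run would use `requests.get(url).text` instead); anchors without an `href` are filtered out:

```python
from bs4 import BeautifulSoup

# Stand-in page content, not the actual mixkit.co response
html = """<html>
<a href="/free-stock-music/hip-hop/">Hip Hop</a>
<a href="/free-stock-music/jazz/">Jazz</a>
<a>anchor without an href</a>
</html>"""

soup = BeautifulSoup(html, 'html.parser')

# Collect only anchors that actually carry an href attribute
urls = [link.get('href') for link in soup.find_all('a') if link.get('href')]
print(urls)  # → ['/free-stock-music/hip-hop/', '/free-stock-music/jazz/']
```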
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | Rodrigo Moraes |