Python: get strings from an HTML page
I need to create a list containing every value that appears inside a title="" attribute, for example:
title="xxxxx", title="xxx2", title='xxx4', etc.
From these I need to get xxxxx, xxx2, xxx4.
I used this script to get the HTML page:
import requests
import bs4
# URL
URL = "https://en.wikipedia.org/wiki/Main_Page"
# sending the request
response = requests.get(URL)
# parsing the response
soup = bs4.BeautifulSoup(response.text, 'html')
Printing soup shows the complete HTML document. Now I would like to get every value in soup that appears inside a title="" attribute.
Solution 1:[1]
To get all elements with a title attribute, you could use CSS selectors, for example:
soup.select('[title]')
[<link href="/w/api.php?action=featuredfeed&feed=potd&feedformat=atom" rel="alternate" title="Wikipedia picture of the day feed" type="application/atom+xml"/>, <link href="/w/api.php?action=featuredfeed&feed=featured&feedformat=atom" rel="alternate" title="Wikipedia featured articles feed" type="application/atom+xml"/>, <link href="/w/api.php?action=featuredfeed&feed=onthisday&feedformat=atom" rel="alternate" title='Wikipedia "On this day..." feed' type="application/atom+xml"/>, <link href="/w/opensearch_desc.php" rel="search" title="Wikipedia (en)" type="application/opensearchdescription+xml"/>, <a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>, <a href="/wiki/Free_content" title="Free content">free</a>, <a href="/wiki/Encyclopedia" title="Encyclopedia">encyclopedia</a>, <a href="/wiki/Help:Introduction_to_Wikipedia" title="Help:Introduction to Wikipedia">anyone can edit</a>, <a href="/wiki/Special:Statistics" title="Special:Statistics">6,491,366</a>, <a href="/wiki/English_language" title="English language">English</a>, <a href="/wiki/Battle_of_Oroscopa" title="Battle of Oroscopa">Battle of Oroscopa</a>, <a href="/wiki/Ancient_Carthage" title="Ancient Carthage">Carthaginian</a>,...]
To create a list of the title attribute values of these elements:
[t.get('title') for t in soup.select('[title]')]
['Wikipedia picture of the day feed', 'Wikipedia featured articles feed', 'Wikipedia "On this day..." feed', 'Wikipedia (en)', 'Wikipedia', 'Free content', 'Encyclopedia', 'Help:Introduction to Wikipedia', 'Special:Statistics', 'English language', 'Battle of Oroscopa', 'Ancient Carthage', 'Hasdrubal the Boetharch', 'Numidia', 'Masinissa', 'Roman Republic', 'Carthage', 'Third Punic War', 'Battle of Oroscopa', 'Paige Bueckers', 'Uroš Drenović', '1921–22 Cardiff City F.C. season', "Wikipedia:Today's featured article/April 2022", 'mail:daily-article-l', 'Wikipedia:Featured articles', 'Martin Fehérváry', 'Martin Fehérváry', 'Swedish Hockey League', '1917 Odessa City Duma election', 'Odessa', 'List of people from Manchester', 'Geko (rapper)', 'K Koke', 'Elvis Costello', 'My Aim Is True', 'Bab el-Gasus', 'Jaega Wise', 'James Blunt', 'The Persistence of Chaos', 'Dave Frederick', 'Sussex County, Delaware', 'Wikipedia:Recent additions', 'Help:Your first article', 'Template talk:Did you know', 'Elon Musk in 2018', 'Twitter',...]
To avoid duplicates, use a set (note that a set does not preserve document order):
set(t.get('title') for t in soup.select('[title]'))
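If the order of the titles matters, one possible variation is to deduplicate with dict.fromkeys instead of a set. The sketch below is self-contained and runs on a small made-up HTML snippet (the markup and the variable names are assumptions for illustration, not part of the Wikipedia page), so it can be tried without a network request:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<a href="/a" title="xxxxx">first</a>
<a href="/b" title="xxx2">second</a>
<span title='xxx4'>third</span>
<a href="/c" title="xxx2">duplicate</a>
<p>no title here</p>
"""

soup = BeautifulSoup(html, "html.parser")

# All title values in document order, duplicates included.
titles = [tag["title"] for tag in soup.select("[title]")]

# dict.fromkeys keeps the first occurrence of each value and preserves order.
unique_titles = list(dict.fromkeys(titles))

print(titles)         # ['xxxxx', 'xxx2', 'xxx4', 'xxx2']
print(unique_titles)  # ['xxxxx', 'xxx2', 'xxx4']
```

Indexing with tag["title"] is safe here because the [title] selector only returns elements that have the attribute; t.get('title') as used above works equally well.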
Solution 2:[2]
Besides bs4, another common option is regular expressions, using Python's built-in re module.
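As a rough sketch of what the regular expression route might look like (the pattern, the sample markup, and the names TITLE_RE and titles are my own assumptions, not something from the question), the following pulls out title values quoted with either single or double quotes. Regular expressions do not understand HTML structure, so this can also match text that merely looks like an attribute, and it does not decode entities such as &amp;amp;; a parser is usually the safer choice.

```python
import re

# Hypothetical sample markup for illustration.
html = """
<a href="/wiki/Wikipedia" title="Wikipedia">Wikipedia</a>
<link rel="alternate" title='Wikipedia "On this day..." feed'/>
<div data-title="ignored">no plain title attribute</div>
"""

# (?<![\w-])  : "title" must not be the tail of another name such as data-title
# (["'])      : capture the opening quote character
# (.*?)\1     : capture lazily up to the same quote character
TITLE_RE = re.compile(
    r"""(?<![\w-])title\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

titles = [match.group(2) for match in TITLE_RE.finditer(html)]

print(titles)  # ['Wikipedia', 'Wikipedia "On this day..." feed']
```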
To answer your question directly, here is a quote from the Beautiful Soup documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ :
Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup
# html_doc is the "three sisters" HTML string defined earlier in the documentation
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
# <head>
# <title>
# The Dormouse's story
# </title>
# </head>
# <body>
# <p class="title">
# <b>
# The Dormouse's story
# </b>
# </p>
# <p class="story">
# Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">
# Elsie
# </a>
# ,
# <a class="sister" href="http://example.com/lacie" id="link2">
# Lacie
# </a>
# and
# <a class="sister" href="http://example.com/tillie" id="link3">
# Tillie
# </a>
# ; and they lived at the bottom of a well.
# </p>
# <p class="story">
# ...
# </p>
# </body>
# </html>
Here are some simple ways to navigate that data structure:
soup.title
# <title>The Dormouse's story</title>
soup.title.name
# u'title'
soup.title.string
# u'The Dormouse's story'
soup.title.parent.name
# u'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
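Applied to the original question, the same find_all method can also filter on the presence of an attribute: passing title=True matches any tag that has a title attribute, whatever its value. The sketch below uses a shortened version of the "three sisters" markup with title attributes added by me for illustration; they are not in the documentation's sample.

```python
from bs4 import BeautifulSoup

# Shortened "three sisters" markup; the title attributes are hypothetical additions.
html_doc = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1" title="Elsie's page">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3" title="Tillie's page">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# title=True keeps only tags that carry a title attribute.
titles = [tag["title"] for tag in soup.find_all(title=True)]

# .get() returns None for tags without the attribute instead of raising KeyError.
link_titles = [a.get("title") for a in soup.find_all("a")]

print(titles)       # ["Elsie's page", "Tillie's page"]
print(link_titles)  # ["Elsie's page", None, "Tillie's page"]
```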
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | HedgeHog |
Solution 2 | Gerschel |