soup.find() function is not working, how do I find the ID value?
If I have the following HTML that was found with BeautifulSoup, can someone explain why print(soup.find(id="style")) or print(soup.find(id="id")) does not work? I am trying to find the id number specifically in the line
<td style="text-align:center"><a href="?id=6359075900">6359075900</a></td>
</span>
<br/><br/>
<table>
<tr>
<th class="outer">Criteria</th>
<td class="outer">Type: Identity Match: ILIKE Search: 'example.org'</td>
</tr>
</table>
<br/>
<table>
<tr>
<th class="outer">Certificates</th>
<td class="outer">
<table>
<tr>
<th>
<a href="?q=example.org&dir=v&sort=0&group=none">crt.sh ID</a>
</th>
<th style="white-space:nowrap">
<a href="?q=example.org&dir=v&sort=1&group=none">Logged At</a>
⇧ </th>
<th style="white-space:nowrap"><a href="?q=example.org&dir=v&sort=2&group=none">Not Before</a>
</th>
<th style="white-space:nowrap"><a href="?q=example.org&dir=v&sort=4&group=none">Not After</a>
</th>
<th>Common Name</th>
<th>Matching Identities</th>
<th>
<a href="?q=example.org&dir=v&sort=3&group=none">Issuer Name</a>
</th>
</tr>
<tr>
<td style="text-align:center"><a href="?id=6359075900">6359075900</a></td>
<td style="text-align:center;white-space:nowrap">2022-03-17</td>
<td style="text-align:center;white-space:nowrap">2022-03-14</td>
<td style="text-align:center;white-space:nowrap">2023-03-14</td>
<td>www.example.org</td>
<td>example.org<br/>www.example.org</td>
<td><a href="?caid=185756" style="white-space:normal">C=US, O=DigiCert Inc, CN=DigiCert TLS RSA SHA256 2020 CA1</a></td>
</tr>
Solution 1:[1]
This should do it (it will find the displayed number, not the value of the id parameter in the link, but I assume it is the same):
from bs4 import BeautifulSoup
import re
# index.html is your HTML file
with open("index.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
# Raw string so that "\?" is passed to the regex engine unchanged
res = soup.find_all(href=re.compile(r"\?id"))
print(res[0].contents[0]) # 6359075900
This works with your example. If you have more than one link with data to extract, you will need to adjust the regex passed to re.compile and iterate through the results instead of using a hardcoded index such as the [0] in the code above.
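If you need the value of the id parameter itself rather than the link text, one option is to parse each href with the standard library's urllib.parse while iterating over all matches. This is only a sketch: the sample HTML (including the second id, 123) and the helper name extract_ids are illustrative assumptions, not part of the original answer.

```python
import re
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Illustrative sample; in practice this would be the page you downloaded.
html = """
<table>
<tr><td><a href="?id=6359075900">6359075900</a></td></tr>
<tr><td><a href="?caid=185756">C=US, O=DigiCert Inc</a></td></tr>
<tr><td><a href="?id=123">123</a></td></tr>
</table>
"""


def extract_ids(markup):
    """Return the value of the id query parameter for every matching link."""
    soup = BeautifulSoup(markup, "html.parser")
    ids = []
    # Match hrefs that start with "?id=" so that "?caid=..." links are skipped.
    for link in soup.find_all("a", href=re.compile(r"^\?id=")):
        query = parse_qs(urlparse(link["href"]).query)
        ids.extend(query.get("id", []))
    return ids


print(extract_ids(html))
```

Reading the parameter from the href avoids relying on the displayed text being identical to the id in the link.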
Solution 2:[2]
The main issue is that there is no tag with an attribute called id in your soup, so find() won't find anything.
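To illustrate: keyword arguments to find() are matched against attribute names, so find(id="style") looks for a tag whose id attribute equals "style", not for a tag that has a style attribute. A small sketch using the row from the question (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = '<td style="text-align:center"><a href="?id=6359075900">6359075900</a></td>'
soup = BeautifulSoup(html, "html.parser")

# No tag has an attribute named "id", so both lookups return None.
by_id_style = soup.find(id="style")
by_id_id = soup.find(id="id")
print(by_id_style, by_id_id)

# Filtering on attributes that actually exist does find the elements.
cell = soup.find("td", style="text-align:center")
link = soup.find("a", href=True)
print(cell is not None, link.get_text())
```

Here the "id" only appears inside the href value, which is why it has to be matched as part of href rather than as an attribute of its own.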
Try to select your elements more specifically, e.g. with CSS selectors. The following selects all <a> elements whose href contains the parameter ?id:
soup.select('a[href*="?id"]')
Example
from bs4 import BeautifulSoup
html = '''
<tr>
<td style="text-align:center"><a href="?id=6359075900">6359075900</a></td>
<td><a href="?caid=185756" style="white-space:normal">C=US, O=DigiCert Inc, CN=DigiCert TLS RSA SHA256 2020 CA1</a></td>
</tr>
<tr>
<td style="text-align:center"><a href="?id=6359075900">6359075901</a></td>
<td><a href="?caid=185756" style="white-space:normal">C=US, O=DigiCert Inc, CN=DigiCert TLS RSA SHA256 2020 CA1</a></td>
</tr>
<tr>
<td style="text-align:center"><a href="?id=6359075900">6359075902</a></td>
<td><a href="?caid=185756" style="white-space:normal">C=US, O=DigiCert Inc, CN=DigiCert TLS RSA SHA256 2020 CA1</a></td>
</tr>
'''
soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('a[href*="?id"]'):
    print(a.text)
Output
6359075900
6359075901
6359075902
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | evilmandarine |
Solution 2 | HedgeHog |