Pulling company name from webpage within <a> tag
I am trying to streamline my data collection by using Python 3.7 and BeautifulSoup to pull the company name, whether that company is approved or has some other status, and whether it markets to residential and/or business customers, from this CT license list:
http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView
If I can get any help on how to finish the code so that it at least builds a list of company names within Python, that would be a big help. I am learning as I go and am able to connect to the site and pull in the source code. So far, I know this part of the code works, but I am not quite sure where to go from here:
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the raw html with the lxml parser
result = requests.get("http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView")
src = result.content
soup = BeautifulSoup(src, 'lxml')
I can see the company name within the source code but am not sure of the best way to extract it and the others into a list:
<a href="/electric.nsf/c39dc573ab1299538525743b004d4df6/719f088ca20a6a1f85257dfd00480e13?OpenDocument">3Degrees Group, Inc.</a>
I would love to be able to pull all of this into a CSV file at some point, with the companies, the status of the license, and who they market to. But if someone can help me finish the code to put the companies in a list within Python, that would be much appreciated and allow me to learn by example.
Solution 1:[1]
Note (23/04/22): `:contains()` is being deprecated and replaced with `:-soup-contains()`, which was introduced with Soup Sieve 2.1.
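For example, the updated spelling on a recent Soup Sieve looks like this (a minimal, self-contained sketch; the html string is invented for illustration):

# Assumes Soup Sieve 2.1+ (bundled with recent bs4 releases)
from bs4 import BeautifulSoup

html = '<table><tr><td><font>Supplier</font></td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
# :-soup-contains() is the non-deprecated spelling of :contains()
print(soup.select('tr:has(td font:-soup-contains("Supplier"))'))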
With bs4 4.7.1+ you can use `:contains` and `:has` to filter for only the sections pertaining to `Supplier`. You can further subset the df for columns of interest, e.g. as sketched below.
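Subsetting the finished dataframe might look like this (the column names here are assumptions standing in for the page's actual headers):

# Hypothetical column names; the real ones come from the page's th elements
cols_of_interest = ['Status', 'Residential', 'Business', 'Company']
df_subset = df[cols_of_interest]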
tl;dr:
Make the request and read the response into a soup object:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView')
soup = bs(r.content, 'lxml')
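Optionally (my addition, not part of the original answer), you can fail fast on network problems before parsing:

url = 'http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView'
r = requests.get(url, timeout=30)  # the timeout value is an assumption
r.raise_for_status()               # raise on 4xx/5xx instead of parsing an error page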
Now for the next steps (the original answer illustrated these with screenshots of the page). What we have, which may be confusing, is a lot of nested tables, but we can start by thinking of the 'top' level table of interest. If you inspect the actual html in your browser's dev tools, hover over the different `tr`s and observe what is highlighted on the page to follow along; you can enter `p:nth-child(4) > table` in the browser search box to isolate this table. True, the actual visible content is nested, but all content of interest sits within `tr`s that, at a given level, are a series of sibling `tr`s.
This bit:

soup.select('tr:has(td:nth-of-type(1) font:contains(Supplier)) ~ tr:not(:has(td:nth-of-type(1) font:contains(Aggregator)), :has(td:nth-of-type(1) font:contains(Aggregator)) ~ tr)')

gathers the sibling `tr`s of the 'top' level `tr` that has a child `font` node containing `Supplier` (that part is `tr:has(td:nth-of-type(1) font:contains(Supplier))`), then removes the 'top' level `tr` containing `Aggregator` along with its siblings. As what you have in the html is one long flat list of `tr`s, you just want to filter out those that come after the section of interest.
Looking at the very first part (I'm using `select_one`, rather than `select`, to demonstrate the first node matched, and not the multiple siblings that are matched when we add in the general sibling combinator; more on that later):

soup.select_one('tr:has(td:nth-of-type(1) font:contains(Supplier))')

Running the above returns the `tr` holding the `Supplier` label; compare it against the rendered page and you can see it marks the start of the section we want.
You see how we have found a way to start selecting `tr`s at the top level, beginning with the one that (though nested) contains the `Supplier` label. However, as we know that top level is flat, we need to remove everything from (and including) the `Aggregator` label onward. So we combine the general sibling combinator with the `:not` pseudo-class: in a nutshell, we say get all `tr`s that are siblings except those matched inside the `:not()`.
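To make the mechanics concrete, here is a toy, self-contained version of the same idea (the html is invented for illustration; the selector is the one from the answer):

from bs4 import BeautifulSoup

html = '''
<table>
  <tr><td><font>Supplier</font></td></tr>
  <tr><td>row A</td></tr>
  <tr><td>row B</td></tr>
  <tr><td><font>Aggregator</font></td></tr>
  <tr><td>row C</td></tr>
</table>'''
soup = BeautifulSoup(html, 'html.parser')
selector = ('tr:has(td:nth-of-type(1) font:contains(Supplier)) ~ '
            'tr:not(:has(td:nth-of-type(1) font:contains(Aggregator)), '
            ':has(td:nth-of-type(1) font:contains(Aggregator)) ~ tr)')
for tr in soup.select(selector):
    print(tr.td.text)  # prints: row A, row B -- the Aggregator row and everything after are excluded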
Now we have our subset of rows of interest (the green rectangle in the original screenshots). We loop over these, and each time we find a node matched by `'td:nth-of-type(2) font'` we set `status` to the value found, e.g. `Approved`, `Pending`, and so on.
This line:

if node is not None:

checks whether the current `tr` contains a 'status' child node, e.g. `Approved`/`Pending` (which I name `status` for the output), and sets the identifier for the later rows in this category accordingly.
If the `tr` does not have this child node, then we know it is one of the other `tr`s, the ones housing the `td`s with the additional columns of info for output (company name, website, address, and so on).
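The shape of that loop, reduced to a toy example (the data is invented for illustration):

# Each parsed tr is either a status header or a data row; data rows
# inherit the most recent status seen.
parsed = [('Approved', None), (None, ['a1', 'a2']),
          (None, ['b1', 'b2']), ('Pending', None), (None, ['c1', 'c2'])]

status = None
final = []
for header, tds in parsed:
    if header is not None:
        status = header  # remember the section's status label
    else:
        row = [status]
        row.extend(tds)  # same extend pattern as the full code below
        final.append(row)

# final == [['Approved', 'a1', 'a2'], ['Approved', 'b1', 'b2'], ['Pending', 'c1', 'c2']]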
Due to the level of nesting, there are a few empty `td`s in there that we don't want. These we remove using pandas later:
df.drop([1,2,10], axis=1, inplace=True)
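An alternative (my suggestion, not from the original answer) is to drop any column that is empty in every row rather than hard-coding positions; note the header assignment later would then need to match whatever survives:

# Keep only columns with at least one non-empty value
df = df.loc[:, (df != '').any(axis=0)]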
As we want the `status` label and all the `td`s in one list, `row`, I use `extend` to extend the first list with the second. I believe this is quicker than `insert`, but would appreciate a yea/nay on that.
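A quick way to check that claim yourself (a sketch I am adding; absolute numbers will vary by machine):

import timeit

tds = ['', '', 'Yes', 'Yes', '3Degrees Group, Inc.']

def with_extend():
    row = ['Approved']
    row.extend(tds)
    return row

def with_insert():
    row = list(tds)
    row.insert(0, 'Approved')
    return row

print(timeit.timeit(with_extend, number=100_000))
print(timeit.timeit(with_insert, number=100_000))

In general, `list.insert(0, ...)` shifts every existing element one place to the right, so extending a short list that already starts with the status avoids that cost.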
So, we would go, for example, from:
['Approved']
and
['', '', 'Yes', 'Yes', '3Degrees Group, Inc.', 'http://www.3degreesinc.com', '235 Montgomery Street, Suite 320 San Francisco, CA 94104', '(866) 476-9378', '11-11-07 12/14/2011', '']
to
['Approved', '', '', 'Yes', 'Yes', '3Degrees Group, Inc.', 'http://www.3degreesinc.com', '235 Montgomery Street, Suite 320 San Francisco, CA 94104', '(866) 476-9378', '11-11-07 12/14/2011', '']
These rows are added to a list called `final`, so you end up with a list of lists (one per row). You can pass this list to `pandas.DataFrame` to generate a dataframe ready for csv export with the `to_csv` method; use the `columns` argument to specify the headers.
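In isolation that step looks like this (a sketch that assumes the rows already line up with the headers; the full code below drops the empty columns first):

import pandas as pd

df = pd.DataFrame(final)   # final is the list of row lists built in the loop
df.columns = headers       # headers as constructed below
df.to_csv('suppliers.csv', index=False)  # the filename here is illustrative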
When generating the rows with:

tds = [td.text for td in tr.select('td')]

we find occasional additional whitespace and `\n` (newline) characters, e.g.:
['', '', '', '', 'WFM Intermediary New England Energy, LLC', '', '125 Cambridgepark Dr, Cambridge, MA 02140', '', '07-10-08 \n11/28/2007\n9/8/2011', '']
I implement a simple regex to remove this (the raw string avoids invalid-escape warnings on newer Python; the behavior is unchanged):

tds = [re.sub(r'\n+|\s+', ' ', td.text) for td in tr.select('td')]
Resulting in (perhaps not the best example, but it is illustrative):
['', '', '', '', 'WFM Intermediary New England Energy, LLC', '', '125 Cambridgepark Dr, Cambridge, MA 02140', '', '07-10-08 11/28/2007 9/8/2011', '']
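Since `\s` already matches newlines, the pattern can be simplified to a single raw-string `r'\s+'` with identical behavior:

import re

td_text = '07-10-08 \n11/28/2007\n9/8/2011'
print(re.sub(r'\s+', ' ', td_text))  # -> '07-10-08 11/28/2007 9/8/2011'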
Finally (did you make it this far?), we want to add some headers for the dataframe. We can use the ones from the page and, again via `extend`, ensure we include our custom `Status` header:
headers = ['Status']
headers.extend([th.text for th in soup.select('th[align]')])
Py:
from bs4 import BeautifulSoup as bs
import requests, re
import pandas as pd

r = requests.get('http://www.dpuc.state.ct.us/electric.nsf/$FormByElectricApplicantsView?OpenForm&Start=1&Count=1000&ExpandView')
soup = bs(r.content, 'lxml')

final = []
headers = ['Status']
headers.extend([th.text for th in soup.select('th[align]')])

status = ''  # defensive default in case a data row appears before any status row

# Select the sibling trs of the Supplier tr, stopping before the Aggregator section
for tr in soup.select('tr:has(td:nth-of-type(1) font:contains(Supplier)) ~ tr:not(:has(td:nth-of-type(1) font:contains(Aggregator)), :has(td:nth-of-type(1) font:contains(Aggregator)) ~ tr)'):
    node = tr.select_one('td:nth-of-type(2) font')
    if node is not None:
        # A status row, e.g. Approved/Pending; remember it for the rows below
        status = node.text
    else:
        # A data row: prepend the remembered status, then the cleaned td texts
        row = [status]
        tds = [re.sub(r'\n+|\s+', ' ', td.text) for td in tr.select('td')]
        row.extend(tds)
        final.append(row)

df = pd.DataFrame(final)
df.drop([1, 2, 10], axis=1, inplace=True)  # drop the empty columns left by the nesting
df.columns = headers
df.to_csv(r'C:\Users\User\Desktop\Public Utilities.csv', sep=',', encoding='utf-8-sig', index=False)
Solution 2:[2]
company_list = []
for company in soup.find_all('a'):
    company_list.append(company.string)
You can find more details on this in the BeautifulSoup documentation. You will likely have to modify this code to filter out specific tags via conditions inside the loop, but that should be fairly simple. More specifically, for your particular example, you can check whether `company.get('href')` matches ".nsf", via regex or however you prefer.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 |