Web Scraping Google Scholar Author Profiles
I have used the scholarly package and its search-by-author-name method on the author names generated in question 3 to get the author profiles, including all the citation information, for all the professors. I was able to load the data into a final dataframe with NA values for those who do not have a Google Scholar profile. However, there is an issue: for approximately 8 authors the citation information does not match what is shown on the Google Scholar website, because the scholarly package is retrieving the citation information of other authors with the same name. I believe I can fix this by using the search_author_id function, but the question is: how do we get the author_ids of all the professors in the first place?
Any help would be appreciated.
Cheers, Yash
Solution 1:
This solution will possibly not be suitable for the scholarly package; beautifulsoup will be used instead.
The author ID is located under the author's name, inside the href attribute of the <a> tag. Here's how we can grab the ID:
# assumes that the request has already been sent and the soup object created
link = soup.select_one('.gs_ai_name a')['href']

# https://stackoverflow.com/a/6633693/15164646
# split the href into 3 parts around "user="; everything AFTER it is the ID
id_identifier = 'user='
before_keyword, keyword, after_keyword = link.partition(id_identifier)
author_id = after_keyword  # RlANTZEAAAAJ
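A slightly more robust way to pull the ID (a small sketch that is not part of the original answer) is to parse the href as a query string with the standard library and read the user parameter:

from urllib.parse import urlparse, parse_qs

link = '/citations?hl=en&user=RlANTZEAAAAJ'  # example href taken from a profile page
query = parse_qs(urlparse(link).query)       # {'hl': ['en'], 'user': ['RlANTZEAAAAJ']}
author_id = query['user'][0]                 # 'RlANTZEAAAAJ'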
Code that goes a bit beyond the scope of your question (full example in the online IDE under the bs4 folder -> get_profiles.py):
from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
    'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?view_op=view_org&hl=en&org=9834965952280547731', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

for result in soup.select('.gs_ai_chpr'):
    name = result.select_one('.gs_ai_name a').text
    link = result.select_one('.gs_ai_name a')['href']

    # https://stackoverflow.com/a/6633693/15164646
    # extract the author ID from the "user=" parameter of the profile link
    id_identifier = 'user='
    before_keyword, keyword, after_keyword = link.partition(id_identifier)
    author_id = after_keyword

    affiliations = result.select_one('.gs_ai_aff').text
    email = result.select_one('.gs_ai_eml').text

    # not every profile lists interests; select_one() returns None in that case
    try:
        interests = result.select_one('.gs_ai_one_int').text
    except AttributeError:
        interests = None

    cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2]

    print(f'{name}\nhttps://scholar.google.com{link}\n{author_id}\n{affiliations}\n{email}\n{interests}\n{cited_by}\n')
Output:
Jeong-Won Lee
https://scholar.google.com/citations?hl=en&user=D41VK7AAAAAJ
D41VK7AAAAAJ
Samsung Medical Center
Verified email at samsung.com
Gynecologic oncology
107516
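To tie this back to the question: once the IDs are scraped, they can be passed to scholarly's search_author_id so the citation data comes from the exact profile rather than from a same-named author. A minimal sketch, assuming a recent scholarly release where search_author_id() and fill() are module-level functions; the field names shown come from the author dict and may differ between versions:

from scholarly import scholarly

author_id = 'D41VK7AAAAAJ'                      # id scraped with beautifulsoup above
author = scholarly.search_author_id(author_id)  # look up the profile by its unique id
author = scholarly.fill(author)                 # populate citation data for that profile
print(author.get('name'), author.get('citedby'), author.get('hindex'))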
Alternatively, you can do the same thing with the Google Scholar Profiles API from SerpApi, without having to solve CAPTCHAs, find proxies, or maintain the parser over time.
It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),      # your serpapi API key
    "engine": "google_scholar_profiles",  # search engine
    "mauthors": "samsung"                 # search query
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dictionary

for result in results.get('profiles'):
    name = result.get('name')
    email = result.get('email')
    author_id = result.get('author_id')
    affiliation = result.get('affiliations')
    cited_by = result.get('cited_by')
    interests = result['interests'][0]['title']
    interests_link = result['interests'][0]['link']

    print(f'{name}\n{email}\n{author_id}\n{affiliation}\n{cited_by}\n{interests}\n{interests_link}\n')
Part of the output:
Jeong-Won Lee
Verified email at samsung.com
D41VK7AAAAAJ
Samsung Medical Center
107516
Gynecologic oncology
https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:gynecologic_oncology
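If the end goal is still the single dataframe described in the question, the fields collected in either loop can be appended to a list of dicts and converted in one step; authors without a profile simply stay as None/NaN. A small illustrative sketch using pandas (the rows below are made up for demonstration):

import pandas as pd

rows = [
    {'name': 'Jeong-Won Lee', 'author_id': 'D41VK7AAAAAJ', 'cited_by': 107516},
    {'name': 'Professor Without Profile', 'author_id': None, 'cited_by': None},
]
df = pd.DataFrame(rows)  # None values show up as NaN/None in the final dataframe
print(df)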
Disclaimer: I work for SerpApi.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow