Web Scraping Google Scholar Author Profiles
I have used the scholarly package and its search-by-author-name method on the author names generated in question 3 to get the author profiles, including all the citation information, for all the professors. I was able to load the data into a final dataframe with NA values for those who do not have a Google Scholar profile. However, there is an issue: for approximately 8 authors the citation information does not match what is shown on the Google Scholar website, because the scholarly package is retrieving the citation information of other authors with the same name. I believe I can fix this by using the search_author_id function, but the question is: how do we get the author_ids of all the professors in the first place?
Any help would be appreciated.
Cheers, Yash
Solution 1:
This solution will possibly not be suitable for the scholarly package; beautifulsoup will be used instead.
The author ID is located under the author's name, inside the href attribute of the <a> tag. Here's how we can grab the ID:
# assumes that the request has already been sent and the soup object created
link = soup.select_one('.gs_ai_name a')['href']

# https://stackoverflow.com/a/6633693/15164646
# split the href into 3 parts around "user="; everything AFTER it is the ID
id_identifier = 'user='
before_keyword, keyword, after_keyword = link.partition(id_identifier)
author_id = after_keyword  # RlANTZEAAAAJ
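A slightly more robust way to pull the ID (a small sketch that is not part of the original answer) is to parse the href as a query string with the standard library and read the user parameter:

from urllib.parse import urlparse, parse_qs

link = '/citations?hl=en&user=RlANTZEAAAAJ'  # example href taken from a profile page
query = parse_qs(urlparse(link).query)       # {'hl': ['en'], 'user': ['RlANTZEAAAAJ']}
author_id = query['user'][0]                 # 'RlANTZEAAAAJ'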
Code that goes a bit beyond the scope of your question (full example in the online IDE under the bs4 folder -> get_profiles.py):
from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
    'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?view_op=view_org&hl=en&org=9834965952280547731', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

for result in soup.select('.gs_ai_chpr'):
    name = result.select_one('.gs_ai_name a').text
    link = result.select_one('.gs_ai_name a')['href']

    # https://stackoverflow.com/a/6633693/15164646
    # extract the author ID from the "user=" parameter of the profile link
    id_identifier = 'user='
    before_keyword, keyword, after_keyword = link.partition(id_identifier)
    author_id = after_keyword

    affiliations = result.select_one('.gs_ai_aff').text
    email = result.select_one('.gs_ai_eml').text

    # not every profile lists interests; select_one() returns None in that case
    try:
        interests = result.select_one('.gs_ai_one_int').text
    except AttributeError:
        interests = None

    cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2]

    print(f'{name}\nhttps://scholar.google.com{link}\n{author_id}\n{affiliations}\n{email}\n{interests}\n{cited_by}\n')
Output:
Jeong-Won Lee
https://scholar.google.com/citations?hl=en&user=D41VK7AAAAAJ
D41VK7AAAAAJ
Samsung Medical Center
Verified email at samsung.com
Gynecologic oncology
107516
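To tie this back to the question: once the IDs are scraped, they can be passed to scholarly's search_author_id so the citation data comes from the exact profile rather than from a same-named author. A minimal sketch, assuming a recent scholarly release where search_author_id() and fill() are module-level functions; the field names shown come from the author dict and may differ between versions:

from scholarly import scholarly

author_id = 'D41VK7AAAAAJ'                      # id scraped with beautifulsoup above
author = scholarly.search_author_id(author_id)  # look up the profile by its unique id
author = scholarly.fill(author)                 # populate citation data for that profile
print(author.get('name'), author.get('citedby'), author.get('hindex'))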
Alternatively, you can do the same thing with the Google Scholar Profiles API from SerpApi, without having to solve CAPTCHAs, find proxies, or maintain the parser over time.
It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),      # your serpapi API key
    "engine": "google_scholar_profiles",  # search engine
    "mauthors": "samsung"                 # search query
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dictionary

for result in results.get('profiles'):
    name = result.get('name')
    email = result.get('email')
    author_id = result.get('author_id')
    affiliation = result.get('affiliations')
    cited_by = result.get('cited_by')
    interests = result['interests'][0]['title']
    interests_link = result['interests'][0]['link']

    print(f'{name}\n{email}\n{author_id}\n{affiliation}\n{cited_by}\n{interests}\n{interests_link}\n')
Part of the output:
Jeong-Won Lee
Verified email at samsung.com
D41VK7AAAAAJ
Samsung Medical Center
107516
Gynecologic oncology
https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:gynecologic_oncology
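If the end goal is still the single dataframe described in the question, the fields collected in either loop can be appended to a list of dicts and converted in one step; authors without a profile simply stay as None/NaN. A small illustrative sketch using pandas (the rows below are made up for demonstration):

import pandas as pd

rows = [
    {'name': 'Jeong-Won Lee', 'author_id': 'D41VK7AAAAAJ', 'cited_by': 107516},
    {'name': 'Professor Without Profile', 'author_id': None, 'cited_by': None},
]
df = pd.DataFrame(rows)  # None values show up as NaN/None in the final dataframe
print(df)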
Disclaimer: I work for SerpApi.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow