Scraping content from URLs in a dataframe using R

Sorry, I'm relatively new to R and don't know it very well yet. I have also seen that similar questions have been asked before. However, the corresponding solutions did not work for me (or, more likely, I did not manage to make them work). I want to scrape content from a newspaper. In a first step, I need to scrape all articles and their respective URLs from an archive page. That works fine:

library(rvest)

# Read the archive page and extract headlines and article links
Abendblatt <- read_html("https://www.abendblatt.de/archiv/nachrichten-vom-3-3-2016")

headline_ <- Abendblatt %>% 
  html_nodes(".teaser__headline") %>%
  html_text()

url_ <- Abendblatt %>% 
  html_nodes("article") %>%
  html_nodes("a") %>%
  html_attr("href")

df_urls <- data.frame(headline = headline_, url = url_)
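
Two small checks that may save trouble later (a sketch, not strictly required if the output above already looks right): data.frame() will error if the two vectors differ in length, and href attributes scraped this way are sometimes relative paths rather than full links. xml2::url_absolute() can fix the latter; the base URL below is an assumption about the site.

# data.frame() errors if the vectors have different lengths
# (e.g. when an <article> contains more than one <a> tag)
length(headline_) == length(url_)

# Turn possibly relative hrefs into absolute URLs (base URL assumed)
url_ <- xml2::url_absolute(url_, "https://www.abendblatt.de")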

Now I have the URLs of all the articles. Next, I want to scrape specific content from each article. For a single URL that also works fine:

Abendblatt_Article <- read_html("https://www.abendblatt.de/vermischtes/article227980833/Tatort-Muenster-Friederike-Kempter-hoert-als-Ermittlerin-auf.html")

# Headline, intro and body text of a single article
header_ <- Abendblatt_Article %>% 
  html_nodes(".article__header__headline") %>%
  html_text() %>%
  paste(., collapse = "")

intro_ <- Abendblatt_Article %>% 
  html_nodes(".article__header__intro__text") %>%
  html_text() %>%
  paste(., collapse = "")

text_ <- Abendblatt_Article %>% 
  html_nodes("p") %>%
  html_text() %>%
  paste(., collapse = "")

df <- data.frame(heading = header_, intro = intro_, text = text_)
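
One caveat: html_nodes("p") matches every paragraph on the page, not just the article body. If the scraped text ends up containing navigation or footer snippets, the selection can be narrowed first; a sketch, assuming the body paragraphs sit inside the <article> element:

# Sketch: restrict paragraph extraction to the <article> container
# (assumption about the page structure)
text_scoped <- Abendblatt_Article %>% 
  html_nodes("article") %>%
  html_nodes("p") %>%
  html_text() %>%
  paste(., collapse = "")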

However, I would like to loop over all URLs in df_urls$url instead of scraping each article by hand.

Can anyone help me with that?

Many thanks

Jens



Solution 1:[1]

You can try this:

library(rvest)

# Scrape one article; on failure return a row of NAs so the loop keeps going
read_data <- function(url) {

  result <- tryCatch({

    Abendblatt_Article <- read_html(url)

    header_ <- Abendblatt_Article %>% 
      html_nodes(".article__header__headline") %>%
      html_text() %>%
      paste(., collapse = "")

    intro_ <- Abendblatt_Article %>% 
      html_nodes(".article__header__intro__text") %>%
      html_text() %>%
      paste(., collapse = "")

    text_ <- Abendblatt_Article %>% 
      html_nodes("p") %>%
      html_text() %>%
      paste(., collapse = "")

    data.frame(heading = header_, intro = intro_, text = text_)

  }, error = function(e) data.frame(heading = NA, intro = NA, text = NA))

  return(result)
}

result <- purrr::map_df(df_urls$url, read_data)
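
map_df() row-binds the results in the same order as df_urls$url, with failed pages coming back as NA rows from the error handler, so the headlines can be attached afterwards. A small sketch; bind_cols() from dplyr is just one way to combine them:

library(dplyr)

# Attach the original headline/url columns to the scraped article fields;
# row order is preserved by map_df(), including the NA rows from failed pages
articles <- bind_cols(df_urls, result)

If the archive page yields many URLs, adding a short Sys.sleep() inside read_data() keeps the requests polite.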

Sources

[1] Stack Overflow, licensed under CC BY-SA 3.0.