Scraping content from URLs in a dataframe using R
Sorry, I'm relatively new to R and don't know it very well yet. I have also seen that similar questions have been asked before. However, the corresponding solutions did not work for me (or, more likely, I did not make them work). I want to scrape content from a newspaper. Therefore, in a first step, I need to scrape all articles and their respective URLs from an archive URL. That works fine:
library(rvest)

# Read the archive page for one day
Abendblatt <- read_html("https://www.abendblatt.de/archiv/nachrichten-vom-3-3-2016")

# Headline text of every teaser
headline_ <- Abendblatt %>%
  html_nodes(".teaser__headline") %>%
  html_text()

# Link target of every article
url_ <- Abendblatt %>%
  html_nodes("article") %>%
  html_nodes("a") %>%
  html_attr("href")

df_urls <- data.frame(headline = headline_, url = url_)
Now I have the URLs of all the articles. Next, I want to scrape specific content from the articles. For a single URL that also works fine:
Abendblatt_Article <- read_html("https://www.abendblatt.de/vermischtes/article227980833/Tatort-Muenster-Friederike-Kempter-hoert-als-Ermittlerin-auf.html")

header_ <- Abendblatt_Article %>%
  html_nodes(".article__header__headline") %>%
  html_text() %>%
  paste(., collapse = "")

intro_ <- Abendblatt_Article %>%
  html_nodes(".article__header__intro__text") %>%
  html_text() %>%
  paste(., collapse = "")

text_ <- Abendblatt_Article %>%
  html_nodes("p") %>%
  html_text() %>%
  paste(., collapse = "")

df <- data.frame(heading = header_, intro = intro_, text = text_)
However, I would like to loop over all urls in url_ from the dataframe df_urls.
Can anyone help me with that?
Many thanks
Jens
Solution 1:
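You can try this: wrap the single-article code in a function and apply it to every URL. The tryCatch returns a row of NAs for any page that fails to load, so one bad URL does not stop the whole run.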
library(rvest)

# Scrape heading, intro and body text from one article URL.
# Returns a one-row data frame; all columns are NA if the page cannot be read.
read_data <- function(url) {
  result <- tryCatch({
    Abendblatt_Article <- read_html(url)

    header_ <- Abendblatt_Article %>%
      html_nodes(".article__header__headline") %>%
      html_text() %>%
      paste(., collapse = "")

    intro_ <- Abendblatt_Article %>%
      html_nodes(".article__header__intro__text") %>%
      html_text() %>%
      paste(., collapse = "")

    text_ <- Abendblatt_Article %>%
      html_nodes("p") %>%
      html_text() %>%
      paste(., collapse = "")

    data.frame(heading = header_, intro = intro_, text = text_)
  }, error = function(e) data.frame(heading = NA, intro = NA, text = NA))
  return(result)
}

# Apply read_data to every URL and row-bind the results into one data frame
result <- purrr::map_df(df_urls$url, read_data)
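If you also want to know which URL produced which row (useful for spotting the NA rows afterwards) and to pause briefly between requests, something along these lines might work. This is only a sketch in base R, not part of the answer above: `scrape_all`, its `f` argument and the `delay` parameter are hypothetical names, and `f` is assumed to be a function like `read_data` that returns a one-row data frame.

```r
# Hypothetical helper (an assumption, not from the original answer):
# applies a scraper function `f` to each URL, waits `delay` seconds before
# each request, and records the source URL in an extra `url` column.
scrape_all <- function(urls, f, delay = 1) {
  rows <- lapply(urls, function(u) {
    Sys.sleep(delay)   # pause so the server is not hit in rapid succession
    out <- f(u)        # `f` is assumed to return a one-row data frame
    out$url <- u       # remember where this row came from
    out
  })
  do.call(rbind, rows) # stack the one-row data frames
}

# Intended use with the objects from this thread (not run here):
# result <- scrape_all(as.character(df_urls$url), read_data)
```

The `as.character()` call is a cheap safeguard: before R 4.0.0, `data.frame()` converted strings to factors by default, so `df_urls$url` may not be a plain character vector.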
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow