'rselenium clicking on dropdown element to collect data
I am trying to collect some area names from a website and in order to do so I want to click the drop-down box to expand the downwards pointing arrow.
i.e. on the following page if I click on the "distritos" drop down I can see further drop down-availability
https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l
For Ciutat Vella
I see I have 4 additional items Barri Gòtic
, EL Raval
, La Barceloneta
and Sant Pare, Sta...
I would like to collect these names also. I have the following code to collect the following:
library(RSelenium)
library(rvest)
library(tidyverse)
# 1.a) Open URL, click on provincias
rD <- rsDriver(browser="firefox", port=4536L)
remDr <- rD[["client"]]
url2 = "https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l"
remDr$navigate(url2)
remDr$maxWindowSize()
# accept cookies
remDr$findElement(using = "xpath",'/html/body/div[1]/div[4]/div/div/div/footer/div/button[2]')$clickElement()
#click on Distrito
remDr$findElement(using = "xpath", '/html/body/div[1]/div[2]/div[1]/div[3]/div/div[1]/div')$clickElement()
html_distrito_full_page = remDr$getPageSource()[[1]] %>%
read_html()
Distritos_Names = html_distrito_full_page %>%
html_nodes('.re-GeographicSearchNext-checkboxItem') %>%
html_nodes('.re-GeographicSearchNext-checkboxItem-literal') %>%
html_text()
Distritos_Names
Which gives:
[1] "Ciutat Vella" "Eixample" "Gràcia" "Horta - Guinardó" "Les Corts" "Nou Barris" "Sant Andreu" "Sant Martí"
[9] "Sants - Montjuïc" "Sarrià - Sant Gervasi"
However, this is missing the names of the regions in the drop-down boxes.
How can I collect these drop-down links also? i.e. RSelenium
to navigate to the page, expand all "downwards facing arrows" then use rvest
to scrape the whole page once these downwards facing arrows have been expanded.
Solution 1:[1]
You could just use rvest to get the mappings by extracting the JavaScript variable housing the mappings + some other data. Use jsonlite to deserialize the extracted string into a JSON object, then apply a custom function to extract the actual mappings for each dropdown. Wrap that function in a map_dfr() call to get a final combined dataframe of all dropdown mappings.
TODO: Review JSON to see if can remove magic number 4 and dynamically determine the correct item to retrieve from parent list.
library(tidyverse)
library(rvest)
library(jsonlite)
extract_data <- function(x) {
tibble(
location = x$literal,
sub_location = map(x$subLocations, "literal", pluck)
)
}
p <- read_html("https://www.fotocasa.es/es/comprar/viviendas/barcelona-capital/todas-las-zonas/l") %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)"')[, 2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
location_data <- data$initialSearch$result$geographicSearch[4]
df <- map_dfr(location_data, extract_data)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | QHarr |