'Getting the text values using Rvest

The page in question is this: https://tolltariffen.toll.no/tolltariff/headings/03.02?language=en (Click on OPEN ALL LEVELS to get the complete data)

I'm using RSelenium to load the page and then getting the pagesource and using rvest to capture the required field. This is the data I'm trying to capture.

Click on OPEN ALL LEVELS to get the complete data

The code I've come up so far splits some descriptions data into multiple chunks which is not useful for me.

    x <- remdr$getPageSource()
    xpg <- read_html(x[[1]])
    
    # get the HS descriptions
    treeView <- xpg %>%
      html_nodes(xpath = '//*/div[@class="MuiGrid-root MuiGrid-container MuiGrid-wrap-xs-nowrap"]') %>%
      html_nodes(xpath = '//*/p[contains(@class, "MuiTypography-body1")]') %>%
      html_nodes('span') %>%
      html_text(trim = TRUE)

I need all the descriptions in order as a list.

Update: This is the output format. Descriptions and the 8-digit code Output format



Solution 1:[1]

General thoughts:

RSelenium isn't strictly needed, and you can avoid the overhead of launching a browser. There is an API call, you can see in the browser network tab, which supplies the content of interest, and this can be called with no requirement for additional configuration of the request e.g. headers.

The question of how to extract the items you want from the API response, in the format you want, then becomes a fun challenge (at least to me) as we do not know 1) how many levels of nesting there may be in this response (and possible future ones) 2) whether the level of nesting can vary across listings within a given response for the items of interest 3) whether there will be a commodityCode at a given level (though the pattern appears to be that there is one at the deepest level for a given listing); and we need to consider how we generate columns/lists of equal length for output. These are just some starting considerations that I go on to discuss how I handled below.


The API call:

* You can click on many of the smaller images below to enlarge

enter image description here


The API response:

This request returns nested JSON:

enter image description here

The content of interest is a list of named lists, within the response, accessible via the parent "key" $headingItems:

Each of these named lists is nested as per the levels on the webpage:

You can see the repeated accessor key of headingItems (red boxed), with the first shown above as the parent list stored in data in code to follow.

Below that, indicated by level (orange boxed), are the expanded entries you are after; nested within the response JSON.

Finally, we have the descriptions (green boxed) which contains html for the descriptive text you are after, with English and Norwegian versions of the text:

enter image description here

In addition to this, there is, where present, a commodityCode key within the nested headingItems:

enter image description here


Approach and challenges:

Given that the commodityCode can be at different levels and may not be present (unless assumed to always be present at greatest depth of a given listing), and that it is unknown how many levels of headingItem there can be, the approach I chose was to use regex to identify the relevant child named list's names in a boolean mask (though for purposes here we could just say logical vector); one mask for English headers and one for the commodity codes. I processed each child list separately, using purrr::map and applying a custom function to extract data as a data.table/data.frame.

Example mask (descriptions|text):

enter image description here

The TRUE values are for the following chained accessors (chaining dependent on depth):

enter image description here

Notice how some accessor paths are repeated. This means therefore, that I do not use the mask to retrieve the names and extract the associated values. Instead, I keep the TRUE and FALSE values and thereby have equal lengths for both vectors. I combine the two logical vectors as columns within a data.table; along with the entire set of values within the child list:

enter image description here

This work is done within the custom function get_data, where I also then do the following steps:

  1. I filter for only rows where there is a TRUE value i.e. a value I wish to retrieve

enter image description here

  1. Apply a function utilizing gsub(), to remove non-breaking whitespace, and read_html() to convert those descriptions which are actual html to text. N.B. Some entries are not actually html and are handled by the if statement. In those cases, the input value is returned:

enter image description here

  1. At this point the codes and descriptions/text are in a single column:

enter image description here

I use the booleans in commodity_code to update that columns value where TRUE to match the text column, and wrap in if to replace FALSE with NA.

enter image description here

  1. Knowing that there is actually a 1 row offset between description and associated code, where applicable, I then shift the commodity column values down one row to correctly align with descriptions:

enter image description here

  1. I then keep only the rows where description_header_flag is TRUE:

enter image description here

  1. Finally, I remove the now not needed flag column:

enter image description here

This leaves me with a clean data.table to return from the function.


Generating the final output:

As map() applying the custom function above to a list returns a list of data.tables, I then simply call rbindlist() to combine these into a single data.table:

df <- rbindlist(map(data, get_data))

This can then be written to csv for example.

fwrite(df, 'result.csv')

Example rows in df:


N.B. I return a data.table as you showed 2 columns in your desired output.


R:

library(jsonlite)
library(tidyverse)
library(rvest)
library(data.table)

get_data <- function(x) {
  y <- x %>% unlist(recursive = T)
  t <- data.table(text = y, description_header_flag = grepl("(?:headingItems\\.)description\\.en$|^description.en$", names(y)), commodity_code = grepl("*commodityCode$", names(y)))
  t <- t[description_header_flag | commodity_code, ]
  t$text <- map2(t$text, t$description_header_flag, ~ gsub(intToUtf8(160), " ", if (.y & str_detect(.x, pattern = "<div>|<p>")) {
    html_text(read_html(.x))
  } else {
    .x
  }))
  t$commodity_code <- map2(t$commodity_code, t$text, ~ if (.x) {
    .y
  } else {
    NA
  })
  t[, commodity_code := c(NA, commodity_code[.I - 1])]
  t <- t[description_header_flag == T, ]
  t[, description_header_flag := NULL]
  return(t)
}

data <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02") %>% .$headingItems

df <- rbindlist(map(data, get_data))

fwrite(df, "result.csv")

Sample output:

enter image description here


Credits:

  1. gsub solution taken from: @shabbychef here

  2. row shift solution adapted from: @Gary Weissman here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1