R import of Stata file has problems with French accented characters

I have a large Stata file that I think contains some French accented characters that have been saved badly.

When I import the file with the encoding argument left blank, it won't read in. When I set it to latin1 it reads in, but French accented characters are not rendered properly in one variable, and I'm certain in others as well. I had a similar problem with another Stata file and tried to apply the fix here (which actually did not work in that case, but seems on point).

To be honest, this seems to be the real problem here somehow. A lot of the garbled characters are "actual" and they match up with what is "expected", but I have no idea how to go back.
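For reference, the fix_encodings script defines a named vector fixes whose names are the garbled ("actual") sequences and whose values are the intended ("expected") characters, something like this (illustrative values only, not the script's actual contents):

fixes <- c("Ã©" = "é", "Ã¨" = "è", "Ã´" = "ô")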

Reproducible code is here:


library(haven)
library(here)
library(tidyverse)
library(labelled)
#Download file
temp <- tempfile()
temp2 <- tempfile()

download.file("https://github.com/sjkiss/Occupation_Recode/raw/main/Data/CES-E-2019-online_F1.dta.zip", temp)
unzip(zipfile = temp, exdir = temp2)
ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="latin1")

#Try with encoding set to blank; it won't work.
#ces19web <- read_dta(file.path(temp2, "CES-E-2019-online_F1.dta"), encoding="")

unlink(c(temp, temp2))

#### Diagnostic section for accented characters ####
ces19web$cps19_prov_id
#Note: value labels are cut off at accented characters in Quebec.
#I know this occupation has messed up characters
ces19web %>% 
  filter(str_detect(pes19_occ_text,"assembleur-m")) %>% 
  select(cps19_ResponseId, pes19_occ_text)
#Check the encodings of the occupation titles and store in a variable encoding
ces19web$encoding <- Encoding(ces19web$pes19_occ_text)
#Check encoding of problematic characters
ces19web %>% 
  filter(str_detect(pes19_occ_text,"assembleur-m")) %>% 
  select(cps19_ResponseId, pes19_occ_text, encoding) 
#Write out messy occupation titles
ces19web %>% 
  filter(str_detect(pes19_occ_text,"Ã|©")) %>% 
  select(cps19_ResponseId, pes19_occ_text, encoding) %>% 
  write_csv(file=here("Data/messy.csv"))

#Try to fix

source("https://github.com/sjkiss/Occupation_Recode/raw/main/fix_encodings.R")
#Store the messy variable in messy
messy <- ces19web$pes19_occ_text
library(stringi)
#Try to clean with the function fix_encodings
ces19web$pes19_occ_text_cleaned <- stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)

#Examine
ces19web %>% 
  filter(str_detect(pes19_occ_text_cleaned,"Ã|©")) %>% 
  select(cps19_ResponseId, pes19_occ_text, pes19_occ_text_cleaned, encoding) %>% 
  head()



Solution 1:[1]

Your data file is a dta version 113 file (the first byte in the file is 113). That is, it is a Stata 8 file, and in particular a pre-Stata 14 file, hence one using a custom encoding (Stata >= 14 uses UTF-8).
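If you want to check this yourself, here is a minimal sketch (assuming the unzipped .dta file sits in the working directory, as in the rest of this answer); old-style dta files store the format version in their first byte:

readBin("CES-E-2019-online_F1.dta", what = "integer", n = 1, size = 1)

output: 113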

So using the encoding argument of read_dta seems right. But there are a few problems here, as can be seen with a hex editor.

First, the labels truncated at accented letters (like Québec → Qu) are actually not caused by haven: they are stored truncated in the dta file itself.
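You can inspect those stored labels directly, for instance via the "labels" attribute that haven attaches to labelled columns:

attr(ces19web$cps19_prov_id, "labels")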

The pes19_occ_text variable is encoded in UTF-8, as you can check with:

ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="UTF-8")
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)

output: "Producteur télé"

This "é" is characteristic of UTF-8 data (here "é") read as latin1. However, if you try to import with encoding="UTF-8", read_dta will fail: there might be other non-UTF-8 characters in the file, that read_dta can't read as UTF-8. We have to do somthing after the import.

Here, read_dta is doing something nasty: it imports "Producteur tÃ©lÃ©" as if it were latin1 data and converts it to UTF-8, so the imported string really contains the UTF-8 characters "Ã" and "©".

To fix this, you first have to convert the string back to latin1. It will still be "Producteur tÃ©lÃ©", but encoded in latin1.

Then, instead of converting again, you simply force the encoding to UTF-8, without changing the underlying data.

Here is the code:

ces19web <- read_dta("CES-E-2019-online_F1.dta", encoding="")
ces19web$pes19_occ_text <- iconv(ces19web$pes19_occ_text, from = "UTF-8", to = "latin1")
Encoding(ces19web$pes19_occ_text) <- "UTF-8"
grep("^Producteur", unique(ces19web$pes19_occ_text), value = T)

output: "Producteur télé"

You can do the same on other variables with diacritics.


The use of iconv here may be easier to understand if we look at the actual bytes with charToRaw. After importing the data, "tÃ©lÃ©" is the representation of the bytes "74 c3 83 c2 a9 6c c3 83 c2 a9" in UTF-8. The first byte, 0x74 (in hex), is the letter "t", and 0x6c is the letter "l". In between, we have four bytes each time, instead of only two for the letter "é" in UTF-8 ("c3 a9", which is "Ã©" when read as latin1).

Actually, "c3 83" is "Ã" and "c2 a9" is "©".

Therefore, we first have to convert these characters back to latin1, so that they take one byte each. Then "74 c3 a9 6c c3 a9" is the encoding of "tÃ©lÃ©", but this time in latin1. That is, the string now has the same bytes as "télé" encoded in UTF-8, and we just need to tell R that the encoding is not latin1 but UTF-8 (this is a declaration, not a conversion).
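To make this concrete, here is a small self-contained sketch of the same round trip, with the garbled string typed in directly via escape sequences rather than taken from the file:

x <- "t\u00c3\u00a9l\u00c3\u00a9"            # "tÃ©lÃ©", as imported
charToRaw(x)                                 # 74 c3 83 c2 a9 6c c3 83 c2 a9
y <- iconv(x, from = "UTF-8", to = "latin1")
charToRaw(y)                                 # 74 c3 a9 6c c3 a9
Encoding(y) <- "UTF-8"                       # declare, don't convert
y                                            # "télé"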

See also the help pages of Encoding and iconv.


Now a good question may be: how did you end up with such a bad dta file in the first place? It's quite surprising for a Stata 8 file to hold UTF-8 data.

The first idea that comes to mind is a bad use of the saveold command, which lets you save data in a Stata file readable by older versions. But according to the reference manual, in Stata 14 saveold can only store files for Stata >= 11.

Maybe a third-party tool did this, as well as the bad truncation of labels? It might be SAS or SPSS, for instance. I don't know where your data come from, but it's not uncommon for public providers to use SAS for internal work and to publish converted datasets. For instance, datasets from the European Social Survey are provided in SAS, SPSS and Stata formats, but if I remember correctly, initially it was only SAS and SPSS, and Stata came later: the Stata files are probably just converted with another tool.


Answer to the comment: how do you loop over the character variables to do the same thing? There is a smarter way with dplyr (see the sketch after the loop), but here is a simple loop with base R.

ces19web <- read_dta("CES-E-2019-online_F1.dta")

for (n in names(ces19web)) {
  v <- ces19web[[n]]
  if (is.character(v)) {
    # Same two steps as before: convert back to latin1, then declare UTF-8
    v <- iconv(v, from = "UTF-8", to = "latin1")
    Encoding(v) <- "UTF-8"
  }
  ces19web[[n]] <- v
}
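And the dplyr version mentioned above could look like this (a sketch: it assumes dplyr >= 1.0 for across() and where(), and fix_encoding is just a hypothetical helper name):

library(dplyr)

# Hypothetical helper wrapping the two steps above
fix_encoding <- function(v) {
  v <- iconv(v, from = "UTF-8", to = "latin1")
  Encoding(v) <- "UTF-8"
  v
}

ces19web <- ces19web %>%
  mutate(across(where(is.character), fix_encoding))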

Sources

[1] Solution 1, Stack Overflow. This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.