How do I remove Japanese characters?

I have survey data that contains Japanese characters. Some of the survey questions and (multiple-choice) answers are given in both English and Japanese, e.g. "Very rarely かなりまれ". In this case, it is helpful to remove the duplicate Japanese. How does one accomplish this? I only want to remove Japanese, not any other special characters.



Solution 1:[1]

The simplest approach is to keep only ASCII characters: replace every non-ASCII character with an empty string (e.g. str_replace_all("Very rarely かなりまれ", "[^\\x00-\\x7F]", "")), then remove any leftover whitespace. However, this also strips other non-ASCII characters such as æøå, so it does not work if you want to keep special characters in general. In that case, remove only the Japanese characters (including kanji, which share a Unicode block with Chinese characters) by matching on Unicode block ranges. I found the relevant blocks at http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml, and Wikipedia lists them as well, e.g. https://en.wikipedia.org/wiki/Katakana_(Unicode_block).

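Here's a ready-made function (requires tidyverse and assertthat; the tests below also use equals() from magrittr):

library(tidyverse)
library(assertthat)
library(magrittr) #provides equals(), used in the tests

str_rm_jap = function(x) {
  #replace the Japanese Unicode blocks with nothing, then clean up any double whitespace left behind
  #reference at http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
  x %>% 
    #Japanese-style punctuation (includes the full-width space)
    str_replace_all("[\u3000-\u303F]", "") %>% 
    #katakana
    str_replace_all("[\u30A0-\u30FF]", "") %>% 
    #hiragana
    str_replace_all("[\u3040-\u309F]", "") %>% 
    #kanji (range as given in the reference above; the full CJK block runs to \u9FFF)
    str_replace_all("[\u4E00-\u9FAF]", "") %>% 
    #collapse runs of spaces into one, then trim the ends
    str_replace_all("  +", " ") %>% 
    str_trim()
}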

#tests
assert_that(
  #positive tests
  "Very rarely かなりまれ" %>% str_rm_jap() %>% equals("Very rarely"),
  "Comments ????????" %>% str_rm_jap() %>% equals("Comments"),

  #negative tests
  "Danish ok! ÆØÅ" %>% str_rm_jap() %>% equals("Danish ok! ÆØÅ")
)

Solution 2:[2]

You can use this to take out the Hiragana and Katakana:

str = str.replace(/[\u30a0-\u30ff\u3040-\u309f]/g, '');
  1. https://regex101.com/r/O5mfPu/1
  2. https://en.wikipedia.org/wiki/Katakana_(Unicode_block)
  3. https://en.wikipedia.org/wiki/Hiragana_(Unicode_block)
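
If kanji and Japanese punctuation should go as well, the same idea can be extended to the other blocks mentioned in Solution 1. The following is only a minimal sketch: the function name removeJapanese is made up for illustration, and the kanji range is assumed to be the whole CJK Unified Ideographs block (U+4E00 to U+9FFF), which means Chinese text would be removed too.

```javascript
// Hypothetical helper; the name is not from either answer.
function removeJapanese(text) {
  return text
    // CJK punctuation (U+3000-303F), hiragana (U+3040-309F),
    // katakana (U+30A0-30FF), CJK unified ideographs (U+4E00-9FFF)
    .replace(/[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]/g, '')
    // collapse runs of spaces left behind by the removal
    .replace(/ {2,}/g, ' ')
    .trim();
}

console.log(removeJapanese('Very rarely かなりまれ')); // "Very rarely"
console.log(removeJapanese('Danish ok! ÆØÅ'));        // "Danish ok! ÆØÅ"
```

Other non-ASCII characters, such as the Danish letters above, fall outside these ranges and are left untouched.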

See also: JavaScript to replace Chinese characters

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
