Fix string encoding issues
Does anyone know of a .NET library (NuGet package preferably) that I can use to fix strings that are 'messed up' because of encoding issues?
I have Excel* files that are supplied by third parties that contain strings like:
TelefÃ³nica UK Limited
ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia
These entries are simply user-error (e.g. someone copy/pasted wrong or something) because elsewhere in the same file the same entries are correct:
Telefónica UK Limited
Serviços de Comunicações e Multimédia
So I was wondering if there is a library/package/something that takes a string and fixes "common errors" like Ã§Ãµ → çõ and Ã³ → ó. I understand that this won't be 100% fool-proof and may produce some false negatives, but it would sure be nice to have a field-tested library to help me clean up my data a bit. Ideally it would 'autodetect' the issue(s) and 'autofix' them, as I won't always be able to tell what the source encoding (and destination encoding) was at the time the mistake was made.
* The file type is not very relevant; I may have text from other parties in other file formats that have the same issue...
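For background, this kind of corruption (often called mojibake) is typically what happens when UTF-8 bytes are decoded with a single-byte code page such as ISO-8859-1 (Latin-1). A minimal sketch of the round trip in C# (the exact encodings involved are an assumption; the actual culprit in your files may differ):

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        var original = "Telefónica UK Limited";

        // Encode correctly as UTF-8, then decode those bytes with the
        // wrong code page (ISO-8859-1) -> produces the garbled form.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        string garbled = Encoding.GetEncoding(28591).GetString(utf8Bytes);
        Console.WriteLine(garbled); // TelefÃ³nica UK Limited

        // Reversing the round trip restores the original text, provided
        // no byte was lost (ISO-8859-1 maps all 256 byte values).
        byte[] latin1Bytes = Encoding.GetEncoding(28591).GetBytes(garbled);
        string repaired = Encoding.UTF8.GetString(latin1Bytes);
        Console.WriteLine(repaired); // Telefónica UK Limited
    }
}
```

Both solutions below exploit the fact that this round trip is reversible as long as the bad encoding round-trips every byte.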
Solution 1:[1]
My best advice is to start with a list of special characters that are used in the language in question.
I assume you're just dealing with Portuguese or other European languages that have a handful of non-US-ASCII characters.
I also assume you know what the bad encoding was in the first place (i.e. the code page), and it was always the same.
(If you can't assume these things, then it's a bigger problem.)
Then encode each of these characters badly, and look for the results in your source text. If any are found, you can treat it as badly encoded text.
// Requires: using System; using System.Linq; using System.Text;
var specialCharacters = "çõéó";
var goodEncoding = Encoding.UTF8;
var badEncoding = Encoding.GetEncoding(28591); // ISO-8859-1 (Latin-1)

// Produce the mojibake form of each special character: encode it
// correctly as UTF-8, then decode those bytes as Latin-1.
var badStrings = specialCharacters.Select(c =>
    badEncoding.GetString(goodEncoding.GetBytes(c.ToString())));

var sourceText = "ServiÃ§os de ComunicaÃ§Ãµes e MultimÃ©dia";

// If any mojibake sequence appears, reverse the round trip to repair it.
if (badStrings.Any(s => sourceText.Contains(s)))
{
    sourceText = goodEncoding.GetString(badEncoding.GetBytes(sourceText));
}
Solution 2:[2]
The first step in fixing a bad encoding is to find what encoding the text was mis-encoded to, often this is not obvious.
So, start with a bit of text that is mis-encoded, and the corrected version of the same text. Here my badly encoded text ends with Ã¤ rather than ä:
var name = "ViistoperÃ¤";
var target = "Viistoperä";
// Requires: using System; using System.Text;
// On .NET Core / .NET 5+, legacy code pages are only enumerated after
// registering the provider from the System.Text.Encoding.CodePages package:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var encs = Encoding.GetEncodings();
foreach (var encodingType in encs)
{
    // Re-encode the garbled text with the candidate encoding to recover
    // the raw bytes, then decode those bytes as UTF-8.
    var raw = Encoding.GetEncoding(encodingType.CodePage).GetBytes(name);
    var output = Encoding.UTF8.GetString(raw);
    if (output == target)
    {
        Console.WriteLine("{0}, {1}, {2}", encodingType.DisplayName, encodingType.CodePage, output);
    }
}
This will output a number of candidate encodings, and you can pick the most relevant one. Windows-1252 is a better candidate than a Turkish code page in this case.
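Once a candidate encoding has been identified (Windows-1252 here), the repair itself is just the reverse round trip. A small sketch, assuming the text was UTF-8 that got mis-read as Windows-1252:

```csharp
using System;
using System.Text;

class FixEncoding
{
    static void Main()
    {
        // On .NET Core / .NET 5+, Windows-1252 requires the code-pages
        // provider (System.Text.Encoding.CodePages NuGet package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var garbled = "ViistoperÃ¤";

        // Re-encode with the bad encoding to recover the original raw
        // bytes, then decode those bytes as UTF-8.
        byte[] rawBytes = Encoding.GetEncoding(1252).GetBytes(garbled);
        string fixedText = Encoding.UTF8.GetString(rawBytes);

        Console.WriteLine(fixedText); // Viistoperä
    }
}
```

Note that this only works when every character of the garbled text survives the re-encode; if the bad decode already replaced unmappable bytes with '?', the original bytes are gone and cannot be recovered.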
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tim Rogers |
| Solution 2 | Fiach Reid |