Identifying and removing lines with non-UTF-8 characters in files
I have a Python program that parses text files line by line. A few of these lines are corrupt, meaning that they contain non-UTF-8 characters. Once a line contains a corrupt character, the whole content of the line is useless, so solutions that delete or replace single characters won't do. My first priority is to delete any line with non-UTF-8 characters, but saving it to another file for further inspection would also be of interest if possible. All previous solutions I have found only delete or replace the non-UTF-8 characters themselves.
My main language is Python, but since I am working in Linux, bash etc. is also a viable solution.
Solution 1:[1]
I don't know Python well enough to use it for an answer, so here's a Perl version. The logic should be pretty similar:
#!/usr/bin/env perl
use warnings;
use strict;
use Encode;
# One argument: filename to log corrupt lines to. Reads from standard
# input, prints valid lines on standard output; redirect to another
# file if desired.
# Treat input and outputs as binary streams, except STDOUT is marked
# as UTF-8 encoded.
open my $errors, ">:raw", $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
binmode STDIN, ":raw";
binmode STDOUT, ":raw:utf8";
# For each line read from standard input, print it to standard
# output if valid UTF-8, otherwise log it.
while (my $line = <STDIN>) {
    eval {
        # Default decode behavior is to replace invalid sequences with U+FFFD.
        # Raise an error instead.
        print decode("UTF-8", $line, Encode::FB_CROAK);
    } or print $errors $line;
}
close $errors;
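Saved as, say, keep_valid_utf8.pl (the name is just a placeholder), the script reads from standard input, writes valid lines to standard output, and logs corrupt lines to the file named as its only argument, for example:

perl keep_valid_utf8.pl corrupt_lines.txt < input.txt > clean.txt

For the same idea in Python (the question's own language), here is a minimal sketch that is not part of the original answer; the script name and filenames are placeholders taken from the command line. It reads the input as raw bytes, tries to decode each line as UTF-8, writes lines that decode cleanly to one file, and saves the rest to another file for inspection:

#!/usr/bin/env python3
# Minimal sketch: keep lines that decode as UTF-8, log the rest.
# Usage (hypothetical names): python3 split_utf8_lines.py input.txt clean.txt corrupt.txt
import sys

infile, outfile, errfile = sys.argv[1], sys.argv[2], sys.argv[3]

with open(infile, "rb") as src, \
     open(outfile, "wb") as good, \
     open(errfile, "wb") as bad:
    for raw_line in src:
        try:
            # decode() raises UnicodeDecodeError on any invalid UTF-8 sequence.
            raw_line.decode("utf-8")
        except UnicodeDecodeError:
            bad.write(raw_line)   # corrupt line: save for later inspection
        else:
            good.write(raw_line)  # valid line: keep unchanged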
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Shawn