Identifying and removing lines with non-UTF-8 characters in files

I have a Python program that parses text files line by line. A few of these lines are corrupt, meaning they contain non-UTF-8 characters. Once a line contains a corrupt character, the whole line is useless, so solutions that delete or replace individual characters won't do. My first priority is to delete any line containing non-UTF-8 characters, but saving those lines to another file for further inspection would also be useful. All the solutions I have found so far only delete or replace the non-UTF-8 characters themselves.

My main language is Python, but since I am working on Linux, bash or other shell tools are also a viable solution.



Solution 1:[1]

My main language is Python, but since I am working on Linux, bash or other shell tools are also a viable solution.

I don't know Python well enough to use it for an answer, so here's a Perl version. The logic should be pretty similar:

#!/usr/bin/env perl
use warnings;
use strict;
use Encode;

# One argument: filename to log corrupt lines to. Reads from standard
# input, prints valid lines on standard output; redirect to another
# file if desired.

# Treat input and outputs as binary streams, except STDOUT is marked
# as UTF8 encoded.
open my $errors, ">:raw", $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
binmode STDIN, ":raw";
binmode STDOUT, ":raw:utf8";

# For each line read from standard input, print it to standard
# output if valid UTF-8, otherwise log it.
while (my $line = <STDIN>) {
    eval {
        # Default decode behavior is to replace invalid sequences with U+FFFD.
        # Raise an error instead.
        print decode("UTF-8", $line, Encode::FB_CROAK);
    } or print $errors $line;
}

close $errors;
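
Saved as, say, filter.pl (the file names here are only examples, not part of the original answer), the script would be invoked along these lines:

perl filter.pl bad_lines.txt < input.txt > clean.txt

Since the question asks for Python, here is a minimal sketch of the same idea in Python. The function name and file arguments are illustrative assumptions: read the input as raw bytes, try to decode each line as UTF-8, write lines that decode cleanly to the output file, and divert the rest unchanged to a rejects file for later inspection.

#!/usr/bin/env python3
import sys

def split_utf8_lines(in_path, clean_path, rejects_path):
    # Read raw bytes so invalid sequences survive untouched; decode each
    # line individually, keeping valid lines and diverting the rest.
    with open(in_path, "rb") as src, \
         open(clean_path, "w", encoding="utf-8") as clean, \
         open(rejects_path, "wb") as rejects:
        for raw in src:
            try:
                clean.write(raw.decode("utf-8"))
            except UnicodeDecodeError:
                rejects.write(raw)

if __name__ == "__main__":
    split_utf8_lines(sys.argv[1], sys.argv[2], sys.argv[3])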

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
Solution 1: Shawn