Conditional splitting of a HUGE file

I have a really huge file (more than 500 million lines) that I want to split into several smaller files according to the first 3 characters of one of its columns.

It looks like this, where each element of columns 1 and 2 is unique:

A0A023GPI8  A0A023GPI8.1    232300  1027923628
A0A023GPJ0  A0A023GPJ0.2    716541  765680613
A0A023PXA5  A0A023PXA5.1    559292  728048729
A0A023PXB0  A0A023PXB0.1    559292  728048786
A0A023PXB5  A0A023PXB5.1    559292  728048524
A0A023PXB9  A0A023PXB9.1    559292  728048769
A0A023PXC2  A0A023PXC2.1    559292  728050382

I used the following script thinking it would be quite fast, because it seemed to me that it involved a single read of the whole file. However, it has been running for several days and is far from finished. Can anyone explain why, and suggest a faster approach?

while read line
do
    PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
    echo -e "$line" >> ../split_DB/$PREFIX.part
done < $file


Solution 1:[1]

It is potentially as easy as:

$ awk '{s=substr($2,1,3); print >> s}' file

The >> redirects print so that it appends to the file whose name is given by the expression; the name is formed from the first 3 characters of the second column.

This will be monumentally faster than Bash dealing with this file.
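
If you want the output to match the layout in the question (files named PREFIX.part under ../split_DB/), the same one-liner can be adapted; a minimal sketch, assuming that directory already exists:

$ awk '{s=substr($2,1,3); print >> ("../split_DB/" s ".part")}' file

The parentheses around the file-name expression keep the concatenation unambiguous on the right-hand side of >>.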


Note:

Usually an OS has a limit on the number of simultaneously open files. This may be an issue depending on the number of potential character combinations in the first 3 characters of the second column. It will affect any solution where the files with those names remain open while processing the given file -- not just awk.

If you have 000 to 999, that is 1,000 potential files open; if you have AAA to ZZZ, that is 17,576; if you have three alphanumeric characters with upper and lower case, that is 238,328 potential open files... If your data has only a few unique prefixes, you may not need to worry about this; if you state the details of the data, the solutions suggested here may be different.

(You can calculate the potential combinations as the cube of the alphabet allowed in the 3 characters: ('0'..'9') is an alphabet of 10, ('0'..'9','A'..'Z') is 36, ('0'..'9','a'..'z','A'..'Z') is 62, and so on.)
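
A quick sanity check in bash (** is bash's arithmetic exponentiation operator):

$ echo $((10**3)) $((26**3)) $((36**3)) $((62**3))
1000 17576 46656 238328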

You can raise the limit on most Unix-style OSs if need be (within reason), or open and close files as you go. Raising the file limit to 238,328 would be impractical, though. You could also sort the data and close each output file as it goes out of use.
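
To find out how large a limit your own data actually needs, counting the distinct prefixes up front is cheap compared with the split itself; a sketch (8000 is just an example value, and how far you can raise the soft limit depends on your system's hard limit):

$ awk '{print substr($2,1,3)}' file | sort -u | wc -l   # distinct prefixes = output files needed
$ ulimit -n 8000                                        # raise the soft limit for this shell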

Solution 2:[2]

read isn't terribly efficient; it has to read one character at a time to avoid reading past the next newline character. However, a big source of overhead here is calling cut twice on each line. We can avoid that by using read again to do the splitting, and using parameter expansion to extract the first three characters of the second column.

while read -r line; do
    read -r _ col2 _ <<< "$line"
    prefix=${col2:0:3}
    # If the first column has a fixed width, you can forgo the
    # previous two lines and use
    #   prefix=${line:12:3}
    printf '%s\n' "$line" >> ../split_DB/$prefix.part
done < "$file"

However, I wouldn't spend too much time trying to do this efficiently in bash: here's a quick-and-dirty Python script that will do the same thing:

import functools

# Cache the output file handles, so that each is opened only once.
@functools.lru_cache(None)
def open_output_file(path):
    return open(path, 'a')

with open(file) as fh:
    for line in fh:
        cols = line.strip().split()
        prefix = cols[1][0:3]
        open_output_file(f"../split_DB/{prefix}.part").write(line)
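
To try it, the file variable needs to point at the input and the output directory must already exist; something along these lines should work (split_by_prefix.py is just a placeholder name for wherever you save the snippet):

$ mkdir -p ../split_DB
$ ulimit -n 7000    # only needed if there are more distinct prefixes than the default open-file limit
$ python3 split_by_prefix.py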

Solution 3:[3]

Why the shell script is slow

The reason it is slow is that for each of the 500 million lines, you are forcing your shell to create 3 processes, so your kernel is hard at work spawning 1.5 billion processes. Suppose it can handle 10 thousand processes a second; you’re still looking at 150 thousand seconds, which is 2 days. And 10k processes per second is fast; probably a factor of ten or more better than you’re getting. On my 2016 15" MacBook Pro running macOS High Sierra 10.13.1, with a 2.7 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3, and 500 GB Flash storage (about 150 GB free), I was getting around 700 processes per second, so the script would nominally take almost 25 days to run through 500 million records.
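
You can estimate your own machine's process-creation rate with a rough micro-benchmark along these lines (a sketch; the loop spawns roughly 3 processes per iteration, just like the original script, so divide 3,000 by the elapsed seconds to get a processes-per-second figure):

$ line='A0A023GPI8  A0A023GPI8.1    232300  1027923628'
$ time for i in {1..1000}; do x=$(echo "$line" | cut -f2 | cut -c1-3); done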

Ways to speed it up

There are ways to make the code faster. You could use plain shell, or Awk, or Python, or Perl. Note that if you use Awk, it needs to be GNU Awk, or at least not BSD (macOS) Awk; the BSD version simply gives up when it runs out of file descriptors.
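
If you are stuck with BSD Awk or a small descriptor limit, an explicit close() after every write avoids the problem at the cost of reopening the file on each line; a sketch, assuming the split_DB directory exists:

$ awk '{out = "split_DB/" substr($2,1,3) ".part"; print >> out; close(out)}' file

This never holds more than one output file open, so it works within any limit, but the constant open/close traffic makes it noticeably slower than keeping the files open.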

I used a random data generator to create a file with 100,000 random entries somewhat similar to those in the question:

E1E583ZUT9  E1E583ZUT9.9    422255  490991884
Z0L339XJB5  Z0L339XJB5.0    852089  601069716
B3U993YMV8  B3U993YMV8.7    257653  443396409
L2F129EXJ4  L2F129EXJ4.8    942989  834728260
R4G123QWR2  R4G123QWR2.6    552467  744905170
K4Z576RKP0  K4Z576RKP0.9    947374  962234282
Z4R862HWX1  Z4R862HWX1.4    909520  2474569
L5D027SCJ5  L5D027SCJ5.4    199652  773936243
R5R272YFB5  R5R272YFB5.4    329247  582852318
G1I128BMI2  G1I128BMI2.6    359124  404495594

(The command used is a home-brew generator that's about to get a rewrite.) The first two columns have the same 10 leading characters in the pattern X#X###XXX# (X for letter, # for digit); the only difference is in the suffix .#. This isn't exploited in the scripts; it doesn't matter in the slightest. There's also no guarantee that the values in the second column are unique, nor that the .1 entry appears for a key if the .2 entry appears, etc. These details are mostly immaterial to the performance measurements. Because of the letter-digit-letter prefix used for the file names, there are 26 * 10 * 26 = 6,760 possible file prefixes. With the 100,000 randomly generated records, every one of those prefixes is present.
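
The generator itself isn't shown here, but a rough stand-in that produces 100,000 records in the same X#X###XXX# shape is easy to write; this is a sketch, not the author's tool, and the exact column spacing and value ranges are approximations:

$ awk 'BEGIN {
      srand()
      letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      pattern = "XdXdddXXXd"             # X = letter, d = digit
      for (i = 1; i <= 100000; i++) {
          key = ""
          for (j = 1; j <= 10; j++) {
              if (substr(pattern, j, 1) == "X")
                  key = key substr(letters, int(26 * rand()) + 1, 1)
              else
                  key = key int(10 * rand())
          }
          sfx  = int(10 * rand())
          col3 = 100000 + int(900000 * rand())
          col4 = int(1000000000 * rand())
          printf("%s  %s.%d  %d  %d\n", key, key, sfx, col3, col4)
      }
  }' > generated.data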

I wrote a script to time various ways of processing the data. There are 4 shell script variants — the one posted by Lucas A, the OP; two posted by chepner (one as comments), and one I created. There's also the Awk script created by dawg, a mildly modified version of the Python 3 script posted by chepner, and a Perl script I wrote.

Results

The results can be summarized by this table (run-time measured in seconds of elapsed time or wall clock time):

+-----------------+----+---------+---------+---------+---------+
|  Script Variant |  N |    Mean | Std Dev |     Min |     Max |
+-----------------+----+---------+---------+---------+---------+
|   Lucas A Shell | 11 | 426.425 |  16.076 | 408.044 | 456.926 |
| Chepner 1 Shell | 11 |  39.582 |   2.002 |  37.404 |  43.609 |
|         Awk 256 | 11 |  38.916 |   2.925 |  30.874 |  41.737 |
| Chepner 2 Shell | 11 |  16.033 |   1.294 |  14.685 |  17.981 |
|   Leffler Shell | 11 |  15.683 |   0.809 |  14.375 |  16.561 |
|     Python 7000 | 11 |   7.052 |   0.344 |   6.358 |   7.771 |
|        Awk 7000 | 11 |   6.403 |   0.384 |   5.498 |   6.891 |
|       Perl 7000 | 11 |   1.138 |   0.037 |   1.073 |   1.204 |
+-----------------+----+---------+---------+---------+---------+

The original shell script is 2.5 orders of magnitude slower than Perl; Python and Awk have almost the same performance when there are enough file descriptors available (Python simply stops if there aren't enough file descriptors available; so does Perl). The shell script can be made about half as fast as Python or Awk.

The 7000 denotes the number of open files needed (ulimit -n 7000). This is because there are 26 * 10 * 26 = 6,760 different 3-character starting codes in the generated data. If you have more patterns, you'll need more open file descriptors to gain the benefit of keeping them all open, or you will need to write a file-descriptor caching algorithm somewhat like the one that GNU Awk must be using, with a consequent performance loss. Note that if the data were presented in sorted order, so that all the entries for each file appeared in sequence, you'd be able to tweak the algorithms so that only one output file was open at a time. The randomly generated data is not in sorted order, so it hits any caching algorithm hard.
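
If sorting first is an option, the one-file-open-at-a-time variant mentioned above is straightforward; a sketch (sorting 500 million lines is itself a sizeable job, but it removes the descriptor problem entirely):

$ LC_ALL=C sort -k2,2 generated.data |
  awk '{ out = "split_DB/" substr($2, 1, 3) ".part"
         if (out != prev) { if (prev != "") close(prev); prev = out }
         print >> out
       }'

Because all lines sharing a prefix arrive together, each output file is opened once, written, and closed before the next one is touched.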

Scripts

Here are the various scripts tested during this exercise. These, and much of the supporting material, are available on GitHub in soq/src/so-4747-6170. Not all the code used is present on GitHub.

Lucas A Shell — aka opscript.sh

cat "$@" |
while read line
do
    PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
    echo -e "$line" >> split_DB/$PREFIX.part
done

This is a not-entirely useless use of cat (see UUoC, the Useless Use of cat, for the comparison). If no arguments are provided, it copies standard input to the while loop; if any arguments are provided, they're treated as file names, passed to cat, and their contents are copied to the while loop. The original script had a hard-wired < file in it. There is no measurable performance cost to using cat here. A similar change was needed in Chepner's shell scripts.
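
In other words, the modified script can be driven either way; for example (assuming split_DB already exists):

$ sh opscript.sh generated.data     # file name argument(s) handed to cat
$ sh opscript.sh < generated.data   # no arguments: cat copies standard input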

Chepner 1 Shell — aka chepner-1.sh

cat "${@}" |
while read -r line; do
    read -r _ col2 _ <<< "$line"
    prefix=${col2:0:3}
    printf '%s\n' "$line" >> split_DB/$prefix.part
done

Chepner 2 Shell — aka chepner-2.sh

cat "${@}" |
while read -r line; do
    prefix=${line:12:3}
    printf '%s\n' "$line" >> split_DB/$prefix.part
done

Leffler Shell — aka jlscript.sh

sed 's/^[^ ]*  \(...\)/\1 &/' "$@" |
while read key line
do
    echo "$line" >> split_DB/$key.part
done
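
The sed step tags each record with its 3-character key so that read can peel it off without any per-line command substitution; on the generated data it produces lines like:

$ sed 's/^[^ ]*  \(...\)/\1 &/' generated.data | head -n 3
E1E E1E583ZUT9  E1E583ZUT9.9    422255  490991884
Z0L Z0L339XJB5  Z0L339XJB5.0    852089  601069716
B3U B3U993YMV8  B3U993YMV8.7    257653  443396409

read key line then splits off the key while "$line" still holds the original record, so only one external process (sed) is started for the entire file.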

Awk Script - aka awkscript.sh

exec ${AWK:-awk} '{s=substr($2,1,3); print >> "split_DB/" s ".part"}' "$@"

This wins hands down for compactness of script, and has decent performance when run with GNU Awk and with enough available file descriptors.

Python Script — aka pyscript.py

This is a Python 3 script, a mildly modified version of what Chepner posted.

import fileinput

output_files = {}
#with open(file) as fh:
#    for line in fh:
for line in fileinput.input():
    cols = line.strip().split()
    prefix = cols[1][0:3]
    # Cache the output file handles, so that each is opened only once.
    #outfh = output_files.setdefault(prefix, open("../split_DB/{}.part".format(prefix), "w"))
    outfh = output_files.setdefault(prefix, open("split_DB/{}.part".format(prefix), "w"))
    # line still ends with its own newline, so suppress the one print would add
    print(line, end='', file=outfh)

# Close all the output files
for f in output_files.values():
    f.close()

Perl Script — aka jlscript.pl

#!/usr/bin/env perl
use strict;
use warnings;

my %fh;

while (<>)
{
    my @fields = split;
    my $pfx = substr($fields[1], 0, 3);
    open $fh{$pfx}, '>>', "split_DB/${pfx}.part" or die
        unless defined $fh{$pfx};
    my $fh = $fh{$pfx};
    print $fh $_;
}

foreach my $h (keys %fh)
{
    close $fh{$h};
}

Test Script — aka test-script.sh

#!/bin/bash
#
# Test suite for SO 4747-6170

set_num_files()
{
    nfiles=${1:-256}
    if [ "$(ulimit -n)" -ne "$nfiles" ]
    then if ulimit -S -n "$nfiles" 
         then : OK
         else echo "Failed to set num files to $nfiles" >&2
              ulimit -HSa >&2
              exit 1
         fi
     fi
}

test_python_7000()
{
    set_num_files 7000
    timecmd -smr python3 pyscript.py "$@"
}

test_perl_7000()
{
    set_num_files 7000
    timecmd -smr perl jlscript.pl "$@"
}

test_awk_7000()
{
    set_num_files 7000
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript.sh "$@"
}

test_awk_256()
{
    set_num_files 256   # Default setting on macOS 10.13.1 High Sierra
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript-256.sh "$@"
}

test_op_shell()
{
    timecmd -smr sh opscript.sh "$@"
}

test_jl_shell()
{
    timecmd -smr sh jlscript.sh "$@"
}

test_chepner_1_shell()
{
    timecmd -smr bash chepner-1.sh "$@"
}

test_chepner_2_shell()
{
    timecmd -smr bash chepner-2.sh "$@"
}

shopt -s nullglob

# Setup - the test script reads 'file'.
# The SOQ global .gitignore doesn't permit 'file' to be committed.
rm -fr split_DB
rm -f file
ln -s generated.data file

# Ensure cleanup
trap 'rm -fr split_DB; exit 1' 0 1 2 3 13 15

for function in \
    test_awk_256 \
    test_awk_7000 \
    test_chepner_1_shell \
    test_chepner_2_shell \
    test_jl_shell \
    test_op_shell \
    test_perl_7000 \
    test_python_7000
do
    mkdir split_DB
    boxecho "${function#test_}"
    time $function file
    # Basic validation - the same information should appear for all scripts
    ls split_DB | wc -l
    wc split_DB/* | tail -n 2
    rm -fr split_DB
done

trap 0

This script was run using the command line notation:

time (ulimit -n 7000; TRACEDIR=. Trace bash test-script.sh)

The Trace command logs all standard output and standard error to a log file and echoes it to its own standard output, and it reports on the 'environment' in a broad sense (environment variables, ulimit settings, date, time, command, current directory, user/groups, etc.). It took just under 10 minutes to run the complete set of tests, three-quarters of which was spent running the OP's script.
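
timecmd, boxecho and Trace are the author's own helper tools. To reproduce the timing loop without them, crude stand-ins like these will do, and the Trace wrapper can simply be omitted (a sketch; the real tools do more, and the -smr options are just dropped here):

timecmd() { shift; time "$@"; }            # stand-in: ignore the options, use the bash time keyword
boxecho() { printf '** %s **\n' "$*"; }    # stand-in: just print the label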

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   Community
Solution 2   benrg
Solution 3