Conditional splitting of a HUGE file
I have a really huge file (>500 million lines) that I want to split into several smaller files according to the first 3 characters of one of its columns.
It looks like this, where each element of columns 1 and 2 is unique:
A0A023GPI8 A0A023GPI8.1 232300 1027923628
A0A023GPJ0 A0A023GPJ0.2 716541 765680613
A0A023PXA5 A0A023PXA5.1 559292 728048729
A0A023PXB0 A0A023PXB0.1 559292 728048786
A0A023PXB5 A0A023PXB5.1 559292 728048524
A0A023PXB9 A0A023PXB9.1 559292 728048769
A0A023PXC2 A0A023PXC2.1 559292 728050382
I used the following script thinking it would be quite fast, because it seemed to me that it involved a single pass over the whole file. However, it has been running for several days and it is far from finished. Any ideas to explain why, and any solutions to propose?
while read line
do
    PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
    echo -e "$line" >> ../split_DB/$PREFIX.part
done < $file
Solution 1:[1]
It is potentially as easy as:
$ awk '{s=substr($2,1,3); print >> s}' file
The >> redirects the print so that it appends to a file with the given name; the name is formed from the first 3 characters of the second column.
This will be monumentally faster than Bash dealing with this file.
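If you want the output to land in a ../split_DB directory with the same .part suffix the original loop used, the one-liner can build the full path in the same way (this variant mirrors the awkscript.sh used for the timings in Solution 3; the parentheses just make the string concatenation unambiguous):

awk '{s = substr($2, 1, 3); print >> ("../split_DB/" s ".part")}' file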
Note:
Usually an OS has a limit on the number of files open simultaneously. This may be an issue depending on the number of potential character combinations in the first 3 characters of the second column, and it will affect any solution in which the files with those names remain open while processing the given file -- not just awk.
If the prefixes run from 000 to 999, that is 1,000 potential open files; if they run from AAA to ZZZ, that is 26^3 = 17,576; if all three characters can be alphanumeric with upper and lower case letters, that is 62^3 = 238,328 potential open files... If your data has only a few unique prefixes, you may not need to worry about this; if you state the details of the data, the solutions suggested here may be different.
(You can calculate the potential combinations as N^3, where N is the size of the alphabet allowed for the 3 characters: '0'..'9','A'..'Z' is base 36, '0'..'9','a'..'z','A'..'Z' is base 62, and so on.)
You can raise the limit with most Unix-style OSs if need be (within reason), or open and close files as needed; raising the file limit to 238,328 would be impractical. You could also sort the data and close the previous output file as it goes out of use.
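As a concrete illustration of raising the limit (this snippet is mine, not part of the original answer, and the value 8000 is only an example chosen to cover several thousand prefixes):

# Show the current soft and hard limits on open file descriptors.
ulimit -Sn
ulimit -Hn

# Raise the soft limit for this shell session; it cannot exceed the hard limit.
ulimit -S -n 8000

# Run the splitter in the same shell so that it inherits the raised limit.
awk '{s=substr($2,1,3); print >> s}' file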
Solution 2:[2]
read isn't terribly efficient; it has to read one character at a time to avoid reading past the next newline character. However, a big source of overhead here is calling cut twice on each line. We can avoid that by using read again to do the splitting, and using parameter expansion to extract the first three characters of the second column.
while read -r line; do
    read -r _ col2 _ <<< "$line"
    prefix=${col2:0:3}
    # If the first column has a fixed width, you can forgo the
    # previous two lines and use
    #   prefix=${line:12:3}
    printf '%s\n' "$line" >> ../split_DB/$prefix.part
done < "$file"
However, I wouldn't spend too much time trying to do this efficiently in bash: here's a quick-and-dirty Python script that will do the same thing:
import functools

# Cache the output file handles, so that each is opened only once.
@functools.lru_cache(None)
def open_output_file(path):
    return open(path, 'a')

# 'file' holds the path of the input file (the $file from the question).
with open(file) as fh:
    for line in fh:
        cols = line.strip().split()
        prefix = cols[1][0:3]
        open_output_file(f"../split_DB/{prefix}.part").write(line)
Solution 3:[3]
Why the shell script is slow
The reason it is slow is that for each of the 500 million lines, you are forcing your shell to create 3 processes, so your kernel is hard at work spawning 1.5 billion processes. Suppose it can handle 10 thousand processes a second; you’re still looking at 150 thousand seconds, which is 2 days. And 10k processes per second is fast; probably a factor of ten or more better than you’re getting. On my 2016 15" MacBook Pro running macOS High Sierra 10.13.1, with a 2.7 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3, and 500 GB Flash storage (about 150 GB free), I was getting around 700 processes per second, so the script would nominally take almost 25 days to run through 500 million records.
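If you want to gauge that overhead on your own machine, a rough micro-benchmark along these lines (my sketch, not part of the original answer) times 1,000 iterations of the per-line work the original loop performs; multiplying the result by 500,000 gives a ballpark figure for 500 million lines:

# Time 1,000 iterations of the command substitution used in the original loop.
# The extracted value is irrelevant here; the point is the cost of spawning
# the subshell and the two cut processes on every iteration.
line='A0A023GPI8 A0A023GPI8.1 232300 1027923628'
time for i in $(seq 1 1000)
do
    PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
done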
Ways to speed it up
There are ways to make the code faster. You could use plain shell, or Awk, or Python, or Perl. Note that if you use Awk, it needs to be GNU Awk, or at least not BSD (macOS) Awk — the BSD version simply gives up when it decides it does not have enough file descriptors.
I used a random data generator to create a file with 100,000 random entries somewhat similar to those in the question:
E1E583ZUT9 E1E583ZUT9.9 422255 490991884
Z0L339XJB5 Z0L339XJB5.0 852089 601069716
B3U993YMV8 B3U993YMV8.7 257653 443396409
L2F129EXJ4 L2F129EXJ4.8 942989 834728260
R4G123QWR2 R4G123QWR2.6 552467 744905170
K4Z576RKP0 K4Z576RKP0.9 947374 962234282
Z4R862HWX1 Z4R862HWX1.4 909520 2474569
L5D027SCJ5 L5D027SCJ5.4 199652 773936243
R5R272YFB5 R5R272YFB5.4 329247 582852318
G1I128BMI2 G1I128BMI2.6 359124 404495594
(The command used is a home-brew generator that's about to get a rewrite.) The first two columns have the same 10 leading characters in the pattern X#X###XXX# (X for a letter, # for a digit); the only difference is in the suffix .#. This isn't exploited in the scripts; it doesn't matter in the slightest. There's also no guarantee that the values in the second column are unique, nor that the .1 entry appears for a key if the .2 entry appears, and so on. These details are mostly immaterial to the performance measurements. Because of the letter-digit-letter prefix used for the file names, there are 26 * 10 * 26 = 6760 possible file prefixes. With the 100,000 randomly generated records, every one of those prefixes is present.
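The generator itself isn't shown; purely as an illustration (this is my own stand-in, not the author's tool), a bash loop like the following produces 100,000 records of the same shape:

# Hypothetical data generator: records matching the X#X###XXX# key pattern,
# a .# suffix, a 6-digit third column, and a fourth column up to about 1e9.
gen_key()
{
    local L=ABCDEFGHIJKLMNOPQRSTUVWXYZ
    printf -v key '%s%d%s%03d%s%s%s%d' \
        "${L:RANDOM%26:1}" $((RANDOM % 10)) "${L:RANDOM%26:1}" \
        $((RANDOM % 1000)) \
        "${L:RANDOM%26:1}" "${L:RANDOM%26:1}" "${L:RANDOM%26:1}" \
        $((RANDOM % 10))
}

for ((i = 0; i < 100000; i++))
do
    gen_key
    printf '%s %s.%d %06d %d\n' "$key" "$key" $((RANDOM % 10)) \
        $(( (RANDOM * 32768 + RANDOM) % 1000000 )) $((RANDOM * 32768 + RANDOM))
done > generated.data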
I wrote a script to time various ways of processing the data. There are 4 shell script variants — the one posted by Lucas A, the OP; two posted by chepner (one as comments), and one I created. There's also the Awk script created by dawg, a mildly modified version of the Python 3 script posted by chepner, and a Perl script I wrote.
Results
The results can be summarized by this table (run-time measured in seconds of elapsed time or wall clock time):
| Script Variant  |  N |    Mean | Std Dev |     Min |     Max |
|-----------------|----|---------|---------|---------|---------|
| Lucas A Shell   | 11 | 426.425 |  16.076 | 408.044 | 456.926 |
| Chepner 1 Shell | 11 |  39.582 |   2.002 |  37.404 |  43.609 |
| Awk 256         | 11 |  38.916 |   2.925 |  30.874 |  41.737 |
| Chepner 2 Shell | 11 |  16.033 |   1.294 |  14.685 |  17.981 |
| Leffler Shell   | 11 |  15.683 |   0.809 |  14.375 |  16.561 |
| Python 7000     | 11 |   7.052 |   0.344 |   6.358 |   7.771 |
| Awk 7000        | 11 |   6.403 |   0.384 |   5.498 |   6.891 |
| Perl 7000       | 11 |   1.138 |   0.037 |   1.073 |   1.204 |
The original shell script is 2.5 orders of magnitude slower than Perl; Python and Awk have almost the same performance when there are enough file descriptors available (Python simply stops if there aren't enough file descriptors available; so does Perl). The shell script can be made roughly half as fast as Python or Awk.
The 7000 denotes the number of open files needed (ulimit -n 7000). This is because there are 26 * 10 * 26 = 6760 different 3-character starting codes in the generated data. If you have more patterns, you'll need more open file descriptors to gain the benefit of keeping them all open, or you will need to write a file descriptor caching algorithm somewhat like the one that GNU Awk must be using, with the consequential performance loss. Note that if the data were presented in sorted order, so that all the entries for each output file appeared in sequence, you'd be able to tweak the algorithms so that only one output file was open at a time. The randomly generated data is not in sorted order, so it hits any caching algorithm hard.
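As a sketch of that last point (this variant was not among the scripts that were timed), sorting on the second column first means a single pass needs only one output file open at any moment, at the cost of sorting 500 million lines up front:

# Sort on column 2 so that all lines sharing a 3-character prefix are adjacent,
# then split, closing each output file as soon as its prefix goes out of use.
sort -k2,2 file |
awk '{
    s = substr($2, 1, 3)
    if (s != prev) {
        if (prev != "") close(out)
        out = "split_DB/" s ".part"
        prev = s
    }
    print > out
}'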
Scripts
Here are the various scripts tested during this exercise. These, and much of the supporting material, are available on GitHub in soq/src/so-4747-6170. Not all the code used is present on GitHub.
Lucas A Shell — aka opscript.sh
cat "$@" |
while read line
do
PREFIX=$(echo "$line" | cut -f2 | cut -c1-3)
echo -e "$line" >> split_DB/$PREFIX.part
done
This is a not-entirely useless use of cat (see UUoC — Useless Use of cat for the comparison). If no arguments are provided, it copies standard input to the while loop; if any arguments are provided, they're treated as file names and passed to cat, which copies the contents of those files to the while loop. The original script had a hard-wired < file in it. There is no measurable performance cost to using cat here. A similar change was needed in Chepner's shell scripts.
Chepner 1 Shell — aka chepner-1.sh
cat "${@}" |
while read -r line; do
read -r _ col2 _ <<< "$line"
prefix=${col2:0:3}
printf '%s\n' "$line" >> split_DB/$prefix.part
done
Chepner 2 Shell — aka chepner-2.sh
cat "${@}" |
while read -r line; do
prefix=${line:12:3}
printf '%s\n' "$line" >> split_DB/$prefix.part
done
Leffler Shell — aka jlscript.sh
sed 's/^[^ ]* \(...\)/\1 &/' "$@" |
while read key line
do
    echo "$line" >> split_DB/$key.part
done
Awk Script — aka awkscript.sh
exec ${AWK:-awk} '{s=substr($2,1,3); print >> "split_DB/" s ".part"}' "$@"
This wins hands down for compactness of script, and has decent performance when run with GNU Awk and with enough available file descriptors.
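A typical invocation might look like this (assuming GNU Awk is installed as gawk and that the split_DB directory already exists; both assumptions mirror what the test script below sets up):

mkdir -p split_DB
ulimit -S -n 7000             # allow all ~6760 output files to stay open
AWK=gawk sh awkscript.sh generated.data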
Python Script — aka pyscript.py
This is a Python 3 script, a mildly modified version of what Chepner posted.
import fileinput

output_files = {}

#with open(file) as fh:
#    for line in fh:
for line in fileinput.input():
    cols = line.strip().split()
    prefix = cols[1][0:3]
    # Cache the output file handles, so that each is opened only once.
    if prefix not in output_files:
        #output_files[prefix] = open("../split_DB/{}.part".format(prefix), "w")
        output_files[prefix] = open("split_DB/{}.part".format(prefix), "w")
    outfh = output_files[prefix]
    # The line read from fileinput already ends with a newline, so suppress
    # print's own terminator to avoid writing blank lines.
    print(line, end='', file=outfh)

# Close all the output files
for f in output_files.values():
    f.close()
Perl Script — aka jlscript.pl
#!/usr/bin/env perl
use strict;
use warnings;

my %fh;    # Cache of output file handles, keyed by 3-character prefix

while (<>)
{
    my @fields = split;
    my $pfx = substr($fields[1], 0, 3);
    # Open the output file for this prefix on first use only.
    open $fh{$pfx}, '>>', "split_DB/${pfx}.part" or die
        unless defined $fh{$pfx};
    my $fh = $fh{$pfx};
    print $fh $_;
}

foreach my $h (keys %fh)
{
    close $fh{$h};
}
Test Script — aka test-script.sh
#!/bin/bash
#
# Test suite for SO 4747-6170
set_num_files()
{
    nfiles=${1:-256}
    if [ "$(ulimit -n)" -ne "$nfiles" ]
    then if ulimit -S -n "$nfiles"
         then : OK
         else echo "Failed to set num files to $nfiles" >&2
              ulimit -HSa >&2
              exit 1
         fi
    fi
}
test_python_7000()
{
    set_num_files 7000
    timecmd -smr python3 pyscript.py "$@"
}

test_perl_7000()
{
    set_num_files 7000
    timecmd -smr perl jlscript.pl "$@"
}

test_awk_7000()
{
    set_num_files 7000
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript.sh "$@"
}

test_awk_256()
{
    set_num_files 256    # Default setting on macOS 10.13.1 High Sierra
    AWK=/opt/gnu/bin/awk timecmd -smr sh awkscript-256.sh "$@"
}

test_op_shell()
{
    timecmd -smr sh opscript.sh "$@"
}

test_jl_shell()
{
    timecmd -smr sh jlscript.sh "$@"
}

test_chepner_1_shell()
{
    timecmd -smr bash chepner-1.sh "$@"
}

test_chepner_2_shell()
{
    timecmd -smr bash chepner-2.sh "$@"
}
shopt -s nullglob
# Setup - the test script reads 'file'.
# The SOQ global .gitignore doesn't permit 'file' to be committed.
rm -fr split_DB
rm -f file
ln -s generated.data file
# Ensure cleanup
trap 'rm -fr split_DB; exit 1' 0 1 2 3 13 15
for function in \
    test_awk_256 \
    test_awk_7000 \
    test_chepner_1_shell \
    test_chepner_2_shell \
    test_jl_shell \
    test_op_shell \
    test_perl_7000 \
    test_python_7000
do
    mkdir split_DB
    boxecho "${function#test_}"
    time $function file
    # Basic validation - the same information should appear for all scripts
    ls split_DB | wc -l
    wc split_DB/* | tail -n 2
    rm -fr split_DB
done
trap 0
This script was run using the command line notation:
time (ulimit -n 7000; TRACEDIR=. Trace bash test-script.sh)
The Trace command logs all standard output and standard error to a log file and echoes it to its own standard output, and it reports on the 'environment' in a broad sense (environment variables, ulimit settings, date, time, command, current directory, user/groups, etc). It took just under 10 minutes to run the complete set of tests, three-quarters of which was spent running the OP's script.
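Note that timecmd, Trace and boxecho are the author's personal utilities (kept in the same GitHub repository), not standard commands. If you only want to replay the timings without them, minimal stand-ins like the following (my own approximation, not the author's code) can be pasted near the top of test-script.sh, and the Trace wrapper can be replaced by piping the script's output through tee:

# Crude substitutes for the author's personal tools: skip timecmd's option
# flags and just time the command; print a plain banner instead of a box.
timecmd()
{
    while [ "${1#-}" != "$1" ]; do shift; done
    time "$@"
}

boxecho()
{
    echo "** $* **"
}

# Stand-in for the Trace wrapper:
# bash test-script.sh 2>&1 | tee test-script.log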
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Community |
| Solution 2 | benrg |
| Solution 3 | |