Splitting a file based on number of records and file size

I have a file that contains 100 records. The records vary in size: some are 1 KB, others are 1 MB. I want to split the file by number of records, say 5 per sub-file, and each sub-file should also be at most 2 MB. With the split command, -b or -C sets a size limit and -l sets the number of lines (records) per file. But split accepts only one of these criteria at a time, either the size or the number of records, not both.
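
For reference, a minimal sketch of the two single-criterion invocations, assuming GNU split. The input.txt file, the bylines_ and bysize_ prefixes, and the tiny limits are hypothetical, chosen only so the effect is visible on a seven-line file:

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Seven records of 4, 4, 10, 2, 2, 2 and 2 bytes (newline included).
printf '%s\n' aaa bbb ccccccccc d e f g > input.txt

# Line-count limit only: at most 2 lines per piece, sizes unbounded.
split -l 2 input.txt bylines_

# Size limit only (GNU -C): whole lines, at most 10 bytes per piece,
# line counts unbounded.
split -C 10 input.txt bysize_

wc -l bylines_*
wc -c bysize_*
```

Neither run enforces the other limit, which is the gap described here.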

Is there any way, or any alternative to split, to achieve both? I want to split a file into sub-files, each containing at most 5 records (lines) and each no larger than 2 MB.

Scenarios:

  1. file with 5 records and size 2 MB
  2. file with 1 record and size 2 MB
  3. file with 5 records and size 1 MB

absacsacsa......                1 KB
zzsasabsac......                1 MB
absacsacsa......                2 KB
zyasbsacsacsa......             2 MB
cbsacsacsa......                1 B
.
.
.

The real file has almost 3 million lines, so I can't use any manual approach (checking each line, reading the file, etc.), as it takes hours to process. I'm looking for a command like split that can take two parameters. split is quite fast, but unfortunately it accepts only a single criterion.



Solution 1:[1]

The following assumes that "2MB" is 2000000, that "records" are lines, and that the "size" of a "record" is its number of characters plus one for the trailing newline. It has been tested with GNU awk.

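awk -v smax=2000000 -v lmax=5 -v n=0 -v c=0 -v name=xxx000000 '
  # Report the finished file, then switch to the next output name.
  function next_file() {
    print name, n, sum
    close(name)
    c += 1
    name = sprintf("xxx%06d", c)
    n = 0
    sum = 0
  }
  # Size limit: start a new file if this line would not fit. The n > 0
  # test keeps a single oversized line from reporting an empty file and
  # skipping a file name; such a line ends up alone in its own file.
  n > 0 && sum + length($0) + 1 > smax {
    next_file()
  }
  # Write the line and update the counters (+1 for the newline).
  {
    print > name
    n += 1
    sum += length($0) + 1
  }
  # Line limit: the current file is full.
  n == lmax {
    next_file()
  }
  # Report the last, partially filled file.
  END {
    if(n > 0) {
      print name, n, sum
    }
  }
' file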

smax is the maximum size per file. lmax is the maximum number of lines per file. n, sum and c are internal variables: n counts the lines in the current file, sum accumulates its size, and c counts the files created so far.
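
One caveat on smax, offered as a hedged note rather than part of the original answer: with GNU awk in a UTF-8 locale, length($0) counts characters, not bytes, so a file containing multibyte text can end up larger than smax bytes. Running awk under LC_ALL=C makes length count bytes. A quick check, assuming UTF-8 encoded input:

```shell
# "é" in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
bytes=$(printf '\303\251\n' | LC_ALL=C awk '{ print length($0) }')
echo "length under LC_ALL=C: $bytes"
```

If the limit is meant in bytes, prefixing the awk command with LC_ALL=C is one way to get byte-accurate accounting.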

This will create files named xxx000000, xxx000001... It will also print one line per created file to standard output, with the file name, its line count and its size:

xxx000000 5 1569851
xxx000001 4 1965155
...
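
To check the behaviour on a small input, here is a minimal sketch that applies the same two rules with tiny limits and then verifies the pieces with wc. It is a condensed variant, not the script above: the part prefix, input.txt, report.txt and the limits of 3 lines and 10 bytes are all hypothetical.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Seven records of 4, 4, 10, 2, 2, 2 and 2 bytes (newline included).
printf '%s\n' aaa bbb ccccccccc d e f g > input.txt

# Start a new piece when the next line would exceed 10 bytes or when the
# current piece already holds 3 lines. LC_ALL=C makes length count bytes.
LC_ALL=C awk -v smax=10 -v lmax=3 '
  BEGIN { name = sprintf("part%06d", c) }
  n > 0 && (sum + length($0) + 1 > smax || n == lmax) {
    close(name); name = sprintf("part%06d", ++c); n = 0; sum = 0
  }
  { print > name; n++; sum += length($0) + 1 }
' input.txt

# Report name, line count and byte count for every piece.
for f in part*; do
  echo "$f $(wc -l < "$f" | tr -d ' ') $(wc -c < "$f" | tr -d ' ')"
done > report.txt
cat report.txt

# Concatenating the pieces in name order should reproduce the input.
cat part* | cmp -s - input.txt && echo "round trip OK"
```

Each line of report.txt should stay within both limits, and the final cmp confirms that no record was lost or reordered.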

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1