perl: Finding mean and variance of large numbers without overflow
I am using a subroutine (stats) to calculate statistics for a list of numbers. These numbers may be big enough to lose precision if stored as normal Perl numbers. I receive such numbers as JSON-formatted strings. To decode these strings without losing precision, I use a JSON::PP object with allow_nonref and allow_bignum enabled.
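For illustration, a minimal sketch of such a decoder (the literal input here is borrowed from the test data below):

use strict;
use warnings;
use JSON::PP;

my $decoder = JSON::PP->new->allow_nonref->allow_bignum;
my $n = $decoder->decode('919300112739897344');

# With allow_bignum, values too large for the native number types come
# back as Math::BigInt/Math::BigFloat objects; anything that fits stays
# a plain Perl scalar.
my $type = ref($n) || 'plain scalar';
print "$type\n";    # 'plain scalar' on a 64-bit perl; Math::BigInt if it overflows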
I pass the list of decoded numbers to the stats subroutine (see the code below), which calculates some statistics. These statistics are then encoded to JSON and saved to a file.
Most of the time the process works correctly, but for some inputs (see the code for examples) the calculated mean and variance are either clearly wrong, encoded as JSON strings by the encoder, or both. I suspect this is caused by an interaction between the Math::BigInt and Math::BigFloat objects created by the JSON decode and List::Util::sum0.
I am trying to figure out what causes this and how to avoid or fix it, preferably without resorting to big non-core modules. I am willing to accept imprecise calculation of mean and variance, but not wildly inaccurate results or numeric results encoded as strings in the JSON.
A script (stats.pl) to demonstrate the problem:
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Varname = "DUMPED_RAWDATA";
use JSON::PP;
use List::Util;

my $JSON = JSON::PP->new->allow_bignum->utf8->pretty->canonical;

sub stats {
    #TODO fix bug about negative variance. AVOID OVERFLOW
    #TODO use GMP, XS?
    # @_ has decoded numbers (called RAWDATA here)
    my $n    = scalar @_;
    my $sum  = List::Util::sum0(@_);
    my $mean = $sum / $n;
    my $var  = List::Util::sum0( map { $_**2 } @_ ) / $n - $mean**2;
    my $s    = {
        n        => $n,
        sum      => $sum,
        max      => List::Util::max(@_),
        min      => List::Util::min(@_),
        mean     => $mean,
        variance => $var
    };

    # DUMP STATE IF SOME ERROR OCCURS
    print Dumper( \@_ ),
      $JSON->encode( { json_encoded_stats => $s, json_encoded_rawdata => \@_ } )
      if ( '"' eq substr( $JSON->encode($var),  0, 1 )    # VARIANCE ENCODED AS STRING
        or '"' eq substr( $JSON->encode($mean), 0, 1 )    # MEAN ENCODED AS STRING
        or $var < 0 );                                    # VARIANCE IS NEGATIVE!
    $s;
}
my @test = (
    [
        qw( 919300112739897344 919305709216464896 919305709216464896 985592115567603712 959299136196456448 )
    ],
    [qw( 479655558 429035600 3281034608 3281034608 2606592908 3490045576 )],
    [ qw( 914426431563644928 ) x 3142 ]
);

for (@test) {
    print "---\n";
    stats( map { $JSON->decode($_) } @$_ );
}
Below is the curtailed output of perl stats.pl, with problems indicated by <---.
---
$DUMPED_RAWDATA1 = [
    '919300112739897344',
    '919305709216464896',
    '919305709216464896',
    '985592115567603712',
    '959299136196456448'
];
{
   "json_encoded_rawdata" : [
      919300112739897344,
      919305709216464896,
      919305709216464896,
      985592115567603712,
      959299136196456448
   ],
   "json_encoded_stats" : {
      "max" : 985592115567603712,
      "mean" : "9.40560556587377e+17",    <--- ENCODED AS STRING
      "min" : 919300112739897344,
      "n" : 5,
      "sum" : 4702802782936887296,
      "variance" : 7.46903843214008e+32
   }
}
---
$DUMPED_RAWDATA1 = [
    479655558,
    429035600,
    3281034608,
    3281034608,
    2606592908,
    3490045576
];
{
   "json_encoded_rawdata" : [
      479655558,
      429035600,
      3281034608,
      3281034608,
      2606592908,
      3490045576
   ],
   "json_encoded_stats" : {
      "max" : 3490045576,
      "mean" : 2261233143,
      "min" : 429035600,
      "n" : 6,
      "sum" : 13567398858,
      "variance" : "-1.36775568782523e+18"    <--- NEGATIVE VARIANCE, STRING ENCODED
   }
}
---
$DUMPED_RAWDATA1 = [
    '914426431563644928',
    .
    .
    .
    <snip 3140 identical lines>
    '914426431563644928'
];
{
   "json_encoded_rawdata" : [
      914426431563644928,
      .
      .
      .
      <snip 3140 identical lines>
      914426431563644928
   ],
   "json_encoded_stats" : {
      "max" : 914426431563644928,
      "mean" : "9.14426431563676e+17",    <--- STRING ENCODED
      "min" : 914426431563644928,
      "n" : 3142,
      "sum" : 2.87312784797307e+21,
      "variance" : -9.75463826617761e+22    <--- NEGATIVE VARIANCE
   }
}
Solution 1:[1]
@ikegami's answer works correctly, but it is too slow for me because this subroutine is called many times in my program's inner loop. I think that is the cost of ensuring that all numbers are converted to arbitrary-precision ones. I ended up using the following implementation, which avoids converting all numbers to an arbitrary-precision type.
sub stats {
    my $n    = scalar @_;
    my $sum  = List::Util::sum0(@_);
    my $mean = $sum / $n;
    my $var  = List::Util::sum0( map { ( $_ - $mean )**2 } @_ ) / $n;
    $mean += 0;
    $var  += 0;    # TO ENSURE THAT THEY ARE ENCODED AS NUMBERS IN JSON
    {
        n        => $n,
        sum      => $sum,
        max      => List::Util::max(@_),
        min      => List::Util::min(@_),
        mean     => $mean,
        variance => $var,
    };
}
I changed the method of calculating variance to ensure that negative results are avoided (as suggested by @Robert). It may sacrifice precision in $sum (and everything that depends on $sum) due to floating-point addition of large integers, but it completes the job in an acceptable execution time.
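As a minimal sketch of why the two-pass formula cannot go negative while the naive one can, using the second test input from the question:

use strict;
use warnings;
use List::Util qw( sum0 );

my @x = qw( 479655558 429035600 3281034608 3281034608 2606592908 3490045576 );
my $n    = @x;
my $mean = sum0(@x) / $n;

# Naive formula E[x**2] - mean**2: the squares overflow 64-bit integers,
# get silently converted to floats, and subtracting two nearly equal
# large values can produce a negative result.
my $naive = sum0( map { $_**2 } @x ) / $n - $mean**2;

# Two-pass formula: a sum of non-negative terms divided by $n,
# so it can never be negative.
my $two_pass = sum0( map { ( $_ - $mean )**2 } @x ) / $n;

printf "naive:    %g\ntwo-pass: %g\n", $naive, $two_pass;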
The unintended JSON encoding of numbers as strings is explained in https://metacpan.org/pod/JSON::PP#simple-scalars, and it is solved by using the method suggested there to force encoding as numbers:
JSON::PP will encode undefined scalars as JSON null values, scalars that have last been used in a string context before encoding as JSON strings, and anything else as number value
You can force the type to be a JSON number by numifying it:
my $x = "3"; # some variable containing a string $x += 0; # numify it, ensuring it will be dumped as a number $x *= 1; # same thing, the choice is yours. in to force
Solution 2:[2]
None of your inputs are big enough to require JSON::PP to create Math::BigInt objects on a system with 64-bit ints, so it doesn't.
You could do something like the following at the start of your sub.
@_ = map { Math::BigInt->new($_) } @_; # Or ::BigFloat?
Alternatively,
my $zero_B = Math::BigInt->new(0);

sub stats {
    my $n      = @_;
    my $sum_B  = sum($zero_B, @_);
    my $mean_B = $sum_B / $n;
    my $var_B  = sum( map { Math::BigInt->new($_) ** 2 } @_ ) / $n - $mean_B ** 2;
    my ($min, $max) = minmax(@_);
    return {
        n        => $n,
        sum      => $sum_B,
        max      => $max,
        min      => $min,
        mean     => $mean_B,
        variance => $var_B,
    };
}
All together:
use strict;
use warnings;

use Data::Dumper qw( Dumper );
use JSON::PP qw( );
use List::MoreUtils qw( minmax );
use List::Util qw( sum );
use Math::BigInt qw( );

my $zero_B = Math::BigInt->new(0);

my $JSON = JSON::PP->new->allow_bignum->utf8->pretty->canonical;

sub stats {
    my $n      = @_;
    my $sum_B  = sum($zero_B, @_);
    my $mean_B = $sum_B / $n;
    my $var_B  = sum( map { Math::BigInt->new($_) ** 2 } @_ ) / $n - $mean_B ** 2;
    my ($min, $max) = minmax(@_);
    return {
        n        => $n,
        sum      => $sum_B,
        max      => $max,
        min      => $min,
        mean     => $mean_B,
        variance => $var_B,
    };
}

my @test = (
    [qw( 919300112739897344 919305709216464896 919305709216464896 985592115567603712 959299136196456448 )],
    [qw( 479655558 429035600 3281034608 3281034608 2606592908 3490045576 )],
    [ qw( 914426431563644928 ) x 3142 ]
);

for (@test) {
    print "---\n";
    my $s = stats( map { $JSON->decode($_) } @$_ );
    if (
           $JSON->encode($s->{variance}) =~ /"/    # VARIANCE ENCODED AS STRING
        || $JSON->encode($s->{mean})     =~ /"/    # MEAN ENCODED AS STRING
        || $s->{variance} < 0                      # VARIANCE IS NEGATIVE!
    ) {
        local $Data::Dumper::Varname = "DUMPED_RAWDATA";
        print Dumper($_);
        print $JSON->encode({
            json_encoded_rawdata => $_,
            json_encoded_stats   => $s,
        });
    } else {
        print "ok\n";
    }
}
Notes:
- Both approaches will work even if the objects are already Math::* objects.
- I identified the vars that are guaranteed to contain a Math::Big* object using a _B suffix for clarity.
- I moved the testing code to the test harness.
- I used minmax because it's more efficient than calling min and max separately.
- I imported the subs from the modules to avoid having to use their full names.
- No need to force something in scalar context into scalar context.
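As a quick sanity check of the first note (a minimal sketch):

use strict;
use warnings;
use Math::BigInt;

my $a = Math::BigInt->new('919300112739897344');
my $b = Math::BigInt->new($a);    # constructing from an existing object just copies it
print "$b\n";                     # 919300112739897344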
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 |