Simple Statistics 3

Today I released Simple Statistics 3.0.0.

Like other projects, it follows semver - so the jump from 2.5.0 to 3.0.0 was required because I changed something in a non-backwards compatible way. That thing is how Simple Statistics handles invalid input, like what it does if you request the maximum number out of an empty array. Until 3.0.0, I chose to return NaN when given invalid input. In 3.0.0 and beyond, Simple Statistics will throw an Error instead.

There are a few additional improvements in the release:

combineMeans, subtractFromMean, and combineVariances - methods contributed by Guillaume Plique that make online statistics easier to implement, because they let you incrementally calculate new aggregates with new data instead of completely re-calculating them.
Benchmarks that compare simple-statistics performance to that of other libraries. This experiment will continue - I'd love feedback on the methodology, to make sure that it doesn't bias toward any implementation. I'm still researching why jStat is winning in several benchmarks. Along the way I'm having a great time reading other implementations for inspiration, and I've found a potential bug thanks to some suspiciously good performance.

Here are the benchmark results so far:

	Simple Statistics	science.js	jStat	mathjs
variance	99,565	92,064	305,801
median	54,497	5,199	17,215	1,432
mode	4,595	2,311	10,078	1,049
medianAbsoluteDeviation	17,373			522
min	384,394		528,290	41,598
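
As an aside on the incremental methods mentioned above: the idea behind combining means can be sketched in a few lines. This is only an illustration of the weighted-average arithmetic, not the library's code; combineMeansSketch and mean are hypothetical local helpers written for this example.

```javascript
// Hypothetical helper, not the library's implementation:
// merge two means, weighting each by the number of values behind it.
function combineMeansSketch(mean1, n1, mean2, n2) {
  return (mean1 * n1 + mean2 * n2) / (n1 + n2);
}

// Plain arithmetic mean, used only to check the result.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const seen = [1, 2, 3];           // data already summarized
const incoming = [4, 5, 6, 7, 8]; // new data arriving later

// Only the old mean and count are needed, not the old values themselves.
const combined = combineMeansSketch(
  mean(seen), seen.length,
  mean(incoming), incoming.length
);
const recomputed = mean(seen.concat(incoming));

console.log(combined, recomputed); // 4.5 4.5
```

The point is that the first batch never has to be revisited: its mean and count are enough to fold in the new batch.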

About NaN

The unfortunate truth is that JavaScript doesn't have solid norms for error handling. Even in the core language, some methods throw error objects when things go wrong, others return undefined, and others, like indexOf, return special values like -1.
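
To make those three conventions concrete, here is a small sketch using only built-ins. JSON.parse and Array.prototype.find are my own picks to stand for the "throws" and "returns undefined" cases; indexOf is the example named above.

```javascript
// 1. Throwing an error object: JSON.parse on malformed input.
let threw = false;
try {
  JSON.parse('{not json');
} catch (err) {
  threw = err instanceof SyntaxError;
}

// 2. Returning undefined: find() when nothing matches.
const found = [1, 2, 3].find((n) => n > 10);

// 3. Returning a special value: indexOf() when nothing matches.
const index = [1, 2, 3].indexOf(10);

console.log(threw, found, index); // true undefined -1
```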

This confusion was worsened by word that using thrown errors (exceptions) was a performance drag, as documented by the bluebird project.

Thankfully, the V8 project, the JavaScript engine that powers Node.js & Chrome, fixed that performance uncertainty and try/catch is now performant.

For Simple Statistics, I decided to try using NaN as the 'invalid' value. Since the library is performance-related, I wanted to avoid what I thought was a potential performance drag, and NaN conveniently is considered to be a number, by JavaScript and Flow's convention.

Previously, then, you might write your variance command-line utility like:

#!/usr/bin/env node
var variance = require('simple-statistics').variance;

var inputs = process.argv.slice(2).map(parseFloat);
var result = variance(inputs);

if (isNaN(result)) {
  console.log('Something went wrong');
  process.exit(1);
}

console.log(result);

Then you could try it out:

/tmp/test〉variance.js 1 2
0.25
/tmp/test〉variance.js
Something went wrong
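
Under 3.0.0's throw-on-invalid-input behavior, the same utility could be sketched with a try/catch in place of the isNaN check. To keep the snippet runnable on its own, variance below is a local stand-in that mimics throwing on an empty array; it is not the library's implementation, its error message is invented, and run is a hypothetical helper that returns what the script would print along with its exit code. In a real script you would require variance from simple-statistics as before.

```javascript
// Local stand-in for the library's variance (population variance).
// Assumption: throws on empty input; the message text is made up.
function variance(values) {
  if (values.length === 0) {
    throw new Error('variance requires at least one data point');
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
}

// Hypothetical helper: takes command-line style string arguments and
// returns the text to print plus the exit code the script would use.
function run(args) {
  try {
    const result = variance(args.map(parseFloat));
    return { output: String(result), code: 0 };
  } catch (err) {
    // The thrown Error carries a message saying what went wrong.
    return { output: 'Something went wrong: ' + err.message, code: 1 };
  }
}

console.log(run(['1', '2']).output); // 0.25
console.log(run([]).output);
```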

With simple-statistics 3, you can skip the isNaN check: simple-statistics itself will throw an error if there is one. I could go on about the pros & cons, but I'll just list what's top of mind:

What I like

Errors have messages: they tell you which part of the program encountered an error. That makes them much easier to debug than values like undefined or NaN, which can propagate through an equation, leaving you wondering where it went wrong.
NaN is awkward. It's literally 'Not a Number', but qualifies as a number value for Flow. You might also assume that you can test for NaN by comparing it to NaN, the same way you could test for undefined. Unfortunately, NaN === NaN is false.
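
Both points are easy to see in a few lines of plain JavaScript. The variable names here are only for illustration:

```javascript
const invalid = NaN;

// NaN counts as a number as far as typeof is concerned...
console.log(typeof invalid); // "number"

// ...but it is not equal to itself, so this comparison never succeeds.
console.log(invalid === NaN); // false

// Number.isNaN is the reliable check.
console.log(Number.isNaN(invalid)); // true

// NaN also propagates silently through arithmetic, hiding where it began.
const total = (invalid + 10) * 2;
console.log(Number.isNaN(total)); // true
```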

What I don't like

Errors aren't really functional. 'Exception handling' is about control flow, not values. So too often errors are invisible, and they're a blind spot of type systems like Flow. I've documented every Error that simple-statistics itself can throw and the case in which it'll throw it, but in contrast to the plethora of JavaScript tooling that ensures you handle nullable types, there's little to warn a newbie that they should really put a try/catch statement around a particular function call. Most functional languages, like Elm, have Maybe types, which elegantly place the concept of 'potential failure' within the view of the type system. JavaScript has something like this too: the Promise object, which is more or less a 'Maybe monad' that has nice, explicit representations of errors. I'd use Promise values in Simple Statistics, but they come with expectations of asynchrony, and I'd use a Maybe monad, but that just wouldn't be Simple.
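
For a sense of what the value-based alternative looks like, here is a rough sketch of a Maybe-flavored result in plain JavaScript. Everything in it is hypothetical: maybeMax, the { ok, value, reason } shape, and the message are invented for illustration and are not part of Simple Statistics.

```javascript
// Hypothetical result shape: { ok: true, value } or { ok: false, reason }.
// Failure is an ordinary return value, so the caller has to look at it.
function maybeMax(values) {
  if (values.length === 0) {
    return { ok: false, reason: 'max requires at least one data point' };
  }
  return { ok: true, value: Math.max(...values) };
}

const result = maybeMax([]);
if (result.ok) {
  console.log(result.value);
} else {
  console.log('No maximum:', result.reason);
}
```

Every call site now has to unwrap the result before using it, which is the cost in simplicity.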

Check it out: Simple Statistics 3.0.0.

April 6, 2017 Tom MacWright
@macwright.com on Bluesky, @tmcw@mastodon.social on Mastodon