JS Security With Untrusted Code

Lock

Combining security and flexibility is a difficult task. This is just one tiny corner of the problem that I've experienced in depth, in an effort to make TileMill's interactivity language more secure.

At the start of the day we had a problem: maps have an idea of 'formatters' similar to CouchDB views, which are JavaScript. And while the vast majority of formatters are generated by friendly tools like TileMill to take objects and return formatted strings, the technical definition of a formatter was extremely broad: it was any JavaScript.

This was a problem: any JavaScript means evil JavaScript, and, like any other company that provides embeds, like YouTube or Disqus, we don't want to be a vector.

Sanitizing JavaScript

The first thought was detecting malicious code, or even whitelisting only certain constructs. Unfortunately, JavaScript's flexibility makes it possible to write hacks in novel ways - with unusual characters, with odd references to this['win'+'dow'], and much, much more. There are tools to write javascript encoded into just a few characters. It's a losing battle trying to blacklist. You aren't going to get real protection without a full-fledged parser and generator.

And that's Google Caja: a fiendishly complex, Java-based JavaScript-cleaning server. Given no way to slot Java into our stack, this was a nogo - and the architecture is difficult to manage without a central service: our security model needed to work with arbitrary TileMill users not only using MapBox Hosting, but also alternative servers like TileStream and TileStache.

The Caja security model is also difficult to summarize: we need something that, at the end of the day, is predictable enough that you can write code and not notice that it's running in a secure environment.

That said, Caja is an incredible resource that came in handy at the end of this journey, and its wiki is a goldmine of attack vectors and design decisions.

Caja is also a little sketchy with deployment: it looks like it was mainly built to work with the OpenSocial spec, which looks to be entirely abandoned: Caja still lists MySpace as a big implementor.

`<iframes>`

Five years ago, Dean Edwards posted Sandboxing JavaScript using iframes and the state of the art hasn't moved very far from there. Still I insisted on sinking a few days into trying to run JavaScript in some kind of idyllic sandboxed environment. Sandie and SandboxJS seemed promising. At the end of the day, there weren't any iframe-based projects with any testing or compatibility research, and most treat security in a pretty simplistic var window = {} fashion.

After that LightingJS came out, which is great for trusted code, but 'safe' in this context is a matter of preventing conflicts, rather than containing security problems; a malicious third-party script can always window.parent out of their context and wreak havoc. Hence the trusted-code paradigm.

And Porthole promised to communicate with <iframe>s. But, with how much data? In IE, it uses location hashes, which are limited in length, and uses onresize events to trigger the frame to read the hash. Clever, but scary: I'm working on this stuff to deploy to thousands of sites, not entirely for kicks. It all sounds like a hack, that doesn't even start to properly guard against common hacks. How real is this security? Not much - at least SandBoxJS admits to this fact by quoting the epic Isaac Schlueter

Prisons are designed to keep dangerous criminals. Sandboxes are for children to not get lost or hurt while playing. There are several ways to break out of the sandbox... DON'T!

Giving up and Templating

After struggling against JavaScript sandboxing for two days, I decided that it was futile to build a fake security model on something that doesn't have a security model, and at the end of the day restrict the functionality of JavaScript to a template language.

}

Mustache is the clear pick for cross-platform templates - the Objective-C implementation, GRMoustache, let us have compatibility with the iPad app on day two. While the language has its faults, they tend to be syntax details rather than the awesome concept and intentions of the language.

Mustache is logic-free: meaning that users can't call arbitrary functions from the template language, much unlike underscore templates or jade (now pug). While this can be frustrating in other contexts, it shouldn't be here: the data input is simple and relatively flat.

Sanitizing HTML

So, with pure JavaScript a distant memory and templates in the future, we have a new enemy: HTML output. Read html5sec.org and you'll lose your mind learning about how the most innocent-looking spec additions can be security holes in your code.

Mustache is output-agnostic, so it doesn't really know that it's outputting HTML, or try to do anything special because it is: which means that it's far out of scope for Mustache itself to do HTML sanitization.

Most of the users of liquid, the Ruby language that inspired Mustache, use the Sanitize gem, which uses a whitelist to eliminate a wide swatch of XSS potentials. But that's in Ruby, so quite far away from anything we could use.

Via a lucky link from Stack Overflow I happened upon a corner of the Google Caja project that works in the browser and sanitizes HTML, called JsHtmlSanitizer. It's a heavyweight solution: it's a full-fledged SAX parser and whitelist. And, given that this is part of Caja, you need to check out the project with SVN and build it with ant to even get the file html-sanitizer-bundle.js (you can see it in Google Code's SVN repo, but it has dependencies).

And it adds a good 12K to the codebase - but, this is an area where you need to do things right, and should really, really avoid reinventing the wheel.

So the process is

// data & options -> mustache.js -> html_sanitize
html_sanitize(Mustache.to_html(x, view), urlX, idX);

html_sanitize is reasonably fast for its task, and it's incredibly well-tested and well-written.

Sanitizing CSS

This works pretty well, though making yourself aware of the full world of potential vulnerabilities is rough: even CSS can be dangerous, because of the silly, regrettable, stupid decision to add JavaScript support to CSS. This is something that we may be able to add back by restricting CSS tag content, but will forever be burned into my mind as yet another example of how the w3c has foolishly supported IE-additions to specs that only degrade it and mix concerns.

As Google Caja lays out CSS allows arbitrary code execution.

So, for the time being, we're doing some very simple validation of CSS, and relaxing html_sanitize restrictions for some of the most intense cuts: like allowing data-uri-based images via a URL whitelist. Luckily, the architecture of html_sanitize lets you set your own rules in a separate file, so you don't have to hack the library. In the vast majority of cases, the HTML that users type is the HTML that they get on the page.

We did also put maps in <iframe>s - though not for security purposes. It turned out to be the only way to keep CSS from clobbering embed content.

Post-Notes

I should have specified originally: this is focusing on client-side code that runs on A-level browsers: which unfortunately includes later versions of IE. So the solution needed to be basically vanilla JavaScript of some sort, which luckily the Caja HTML sanitizer supports.

If this had been on more modern browsers, CrabDude's example of Harmony Proxies is simple and awesome. It's good to see that JavaScript is gaining features that fit this use case.

If this was running in node.js, Robin's html-sanitizer is a simpler, more understandable implementation than Caja's for sanitization. Since you also have cutting-edge JavaScript features in node and lots of control over processes, there are sandboxes you can use there, like Gianni's sandbox.

January 10, 2012 Tom MacWright
@macwright.com on Bluesky, @tmcw@mastodon.social on Mastodon