roguelazer's website: serialization format performance

Most of the work done in actual programming jobs is taking structured data in some particular format from one system, slightly tweaking it, and sending it off to some other system. When exchanging data between different processes, it's almost always necessary to serialize it into a series of bytes which can be sent across a dumb byte-oriented transport (such as TCP). There are hundreds upon hundreds of different serialization formats out there, but I just wanted to talk about a few of the most common ones that folks use with the Python programming language.

All of these have the following properties:

Schema-less; you can encode arbitrary data without having to write a schema or definition
Support strings, integers, floating-point numbers, lists, and associative arrays

Most of them can be used either in a mode which returns a fully-built data structure or in "SAX mode" (streaming, event-driven parsing), which is described very well in my former co-worker Evan's blog post on handling Large Data Files. I'm only discussing the former in this post.

JSON	JSON comes from the ECMAScript (née JavaScript) programming language, and is just the syntax that you use to create object literals in that language. It's become the de-facto interchange format for most web services these days, and has optimized encoders/decoders for just about every language. It's also idiosyncratic and rigid; I've seen major outages at multi-billion-dollar companies because someone accidentally put a trailing comma in a JSON document.
YAML	YAML is very similar to JSON (in fact, YAML 1.2 is designed to be a superset of it). It adds helpful features like comments and cross-referencing, and is extremely common in the Ruby community.
MsgPack	MsgPack is a binary serialization of JSON designed to be more efficient. It's expected that humans will read and write data as JSON, and then it'll be compiled to msgpack for storage and transport.
Pickle	Pickle is Python's native serialization format. It can serialize nearly any type (native or user-defined) to bytes. It's also impossible to use securely; de-pickling untrusted content leads to remote code execution. Nonetheless, many projects (such as carbon, the data ingestion component for the popular carbon/whisper/graphite metrics stack) insist on using it.
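Two of the claims in the list above (JSON's intolerance of trailing commas, and pickle running code on load) can be checked with nothing but the standard library. This is a minimal sketch: `parses_as_json` and `Sneaky` are made-up names for illustration, and the `eval` payload is a deliberately harmless stand-in for what an attacker would actually send.

```python
import json
import pickle


def parses_as_json(text):
    """Return True if the stdlib parser accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


class Sneaky(object):
    """Hypothetical attacker-controlled class; __reduce__ picks what runs on load."""

    def __reduce__(self):
        # A real attacker would return something far nastier than arithmetic.
        # eval("6 * 7") is harmless but still proves that code ran.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Sneaky())
# Unpickling calls eval("6 * 7") instead of rebuilding a Sneaky instance.
result = pickle.loads(payload)

if __name__ == "__main__":
    print(parses_as_json('{"a": [1, 2]}'))   # accepted
    print(parses_as_json('{"a": [1, 2],}'))  # rejected: trailing comma
    print(result)                            # the value eval returned
```

The point of the second half is that the receiver never asked to run `eval`; the bytes in the pickle chose the callable.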

One of the biggest differences between these formats is performance. The test document is fairly small (48KB) and contains biographical details of the alphabetically-first members of the US Senate; it's one of the many files that I have to parse for How Many Times Has the House Voted to Repeal Obamacare. Below are the times to load it in Python (using common libraries with C optimizations for all formats¹):

Protocol	Library	Load time
JSON	`json` (in stdlib)	1.4ms
JSON	`ujson`	0.4ms
YAML	PyYAML (default loader)	303.1ms
YAML	PyYAML (C Loader)	27.9ms
MsgPack	msgpack-python	0.2ms
Pickle	`cPickle`	14.1ms

All of these benchmark results were generated using the code at https://gist.github.com/Roguelazer/c4582f266062bea12be8 on my 2013 15" MacBook Pro with its Intel Core i7-4850HQ.
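For readers who want to run a similar comparison without opening the gist, here is a rough sketch of how such a load-time measurement could be set up. It is not the gist's code: `build_document`, `collect_loaders`, and `time_loads` are hypothetical helpers, the input is a made-up stand-in rather than the Senate file, and it targets Python 3 (where `pickle` already uses the C accelerator that `cPickle` provided on Python 2). Third-party libraries are only timed if they happen to be installed.

```python
import json
import pickle
import timeit


def build_document(n_records=200):
    """Build a made-up stand-in document (not the post's real input)."""
    return {
        "members": [
            {"id": i, "name": "Senator %d" % i, "terms": [i, i + 6], "score": i / 7.0}
            for i in range(n_records)
        ]
    }


def collect_loaders(doc):
    """Return {label: (loader, serialized_payload)} for every library that imports."""
    loaders = {
        "json (stdlib)": (json.loads, json.dumps(doc)),
        "pickle (stdlib)": (pickle.loads, pickle.dumps(doc)),
    }
    try:
        import ujson
        loaders["ujson"] = (ujson.loads, ujson.dumps(doc))
    except ImportError:
        pass
    try:
        import msgpack
        loaders["msgpack"] = (msgpack.unpackb, msgpack.packb(doc))
    except ImportError:
        pass
    try:
        import yaml
        text = yaml.safe_dump(doc)
        loaders["PyYAML (pure Python)"] = (
            lambda s: yaml.load(s, Loader=yaml.SafeLoader), text)
        # CSafeLoader only exists when PyYAML was built against libyaml.
        if hasattr(yaml, "CSafeLoader"):
            loaders["PyYAML (C loader)"] = (
                lambda s: yaml.load(s, Loader=yaml.CSafeLoader), text)
    except ImportError:
        pass
    return loaders


def time_loads(loaders, repeat=5, number=20):
    """Return the best-of-`repeat` average seconds per load for each loader."""
    results = {}
    for label, (load, payload) in loaders.items():
        best = min(timeit.repeat(lambda: load(payload), repeat=repeat, number=number))
        results[label] = best / number
    return results


if __name__ == "__main__":
    timings = time_loads(collect_loaders(build_document()))
    for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print("%-24s %8.3f ms" % (label, seconds * 1000))
```

Absolute numbers will differ from the table above depending on hardware, library versions, and the shape of the document; the ordering between formats is the part worth comparing.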

So, yeah. I'm probably going to keep using YAML for application configuration, but maybe it's time for me to start adding an intermediate step where I compile it to something faster. Maybe this information will be of value to someone else on the Internet. shrug.
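One way that intermediate "compile" step could look is a cache file that sits next to the slow source and is rebuilt whenever the source changes. This is only a sketch under assumptions: `load_with_cache` is a hypothetical helper, the `.cache.json` suffix is arbitrary, and freshness is judged by file modification time alone.

```python
import json
import os


def load_with_cache(source_path, parse_slow, cache_suffix=".cache.json"):
    """Load source_path via parse_slow, reusing a JSON cache while it is fresh."""
    cache_path = source_path + cache_suffix
    source_mtime = os.path.getmtime(source_path)

    # Treat the cache as fresh if it exists and is at least as new as the source.
    # Caveat: an edit landing in the same timestamp tick as the cache write
    # would be missed by this check.
    if os.path.exists(cache_path) and os.path.getmtime(cache_path) >= source_mtime:
        with open(cache_path) as f:
            return json.load(f)

    with open(source_path) as f:
        data = parse_slow(f.read())

    # Write to a temporary file and rename, so readers never see a partial cache.
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data, f)
    os.replace(tmp_path, cache_path)
    return data
```

With PyYAML installed, usage would be something like `config = load_with_cache("app.yaml", yaml.safe_load)` (the filename is made up). JSON is used for the cache only because it ships with Python; anything that doesn't survive a JSON round trip (dates, non-string keys) would need a different cache format.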

Example Documents

JSON

{
    "key": "value",
    "list_key": [1, 2, 3],
    "arbitrarily": {
        "nested": {
            "documents": 3.0
        }
    }
}

YAML

key: value
list_key:
  - 1
  - 2
  - 3
arbitrarily: {"nested": {"documents": 3.0}}

MsgPack

\x83\xabarbitrarily\x81\xa6nested\x81\xa9documents\xcb@\x08\x00\x00
\x00\x00\x00\x00\xa8list_key\x93\x01\x02\x03\xa3key\xa5value

cPickle

(dp1\nS'arbitrarily'\np2\n(dp3\nS'nested'\np4\n(dp5\nS'documents'
\np6\nF3\nsssS'list_key'\np7\n(lp8\nI1\naI2\naI3\nasS'key'\np9\nS'value'\np10\ns.
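A short sketch of how encodings like the ones above can be produced from a single Python dict. `encode_all` is a hypothetical helper, and the output will not match the examples byte for byte: those look like Python 2 `cPickle` output, whereas this targets Python 3, and key order and library versions also affect the result. MsgPack and YAML are only included if `msgpack` and PyYAML are installed.

```python
import json
import pickle

# The same small document used in the examples above.
DOCUMENT = {
    "key": "value",
    "list_key": [1, 2, 3],
    "arbitrarily": {"nested": {"documents": 3.0}},
}


def encode_all(doc):
    """Return {format_name: serialized_form} for each library that imports."""
    encoded = {
        "json": json.dumps(doc),
        # Protocol 0 is the original human-readable pickle protocol.
        "pickle": pickle.dumps(doc, protocol=0),
    }
    try:
        import msgpack
        encoded["msgpack"] = msgpack.packb(doc)
    except ImportError:
        pass
    try:
        import yaml
        encoded["yaml"] = yaml.safe_dump(doc)
    except ImportError:
        pass
    return encoded


if __name__ == "__main__":
    for name, blob in encode_all(DOCUMENT).items():
        print(name, repr(blob))
```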

¹ Of note: it's super-hard to install PyYAML with C extensions on a platform where libyaml isn't installed in /usr/. The correct invocation (assuming libyaml is in /usr/local):

pip install --global-option=--with-libyaml --global-option=build_ext --global-option=-L/usr/local/lib --global-option=-I/usr/local/include PyYAML

It took me, like, half an hour to figure that out. Goddamn pip.
