Random Thoughts on Big Data
Stuff changes when you cross the 100GB barrier. Here are random musings on why it might make sense to
- Sort everything
- Don’t do any error handling
- Catch errors and emit them along with your data
- Make everything ASCII
- Abandon integer keys
- Use bash as your data-analysis IDE
Rules of Thumb
- How big does a dataset need to be to benefit from Hadoop? “The dataset generally should be (in memory) closer to the size of the available RAM of the cluster or more.” (Scott Carey in org.apache.hadoop.general)
Drop ACID, explore Big Data
The traditional ACID quartet for relational databases can be re-interpreted in a Big Data context:
- A — Associative
- C — Commutative
- I — Idempotent
- D — Distributed
- (*) — (and where possible, left in sort order)
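For instance, counting distinct users as a set union is associative, commutative, and idempotent: partial results can be combined in any grouping, in any order, and re-merged safely when a task is retried. A toy Ruby sketch (data invented for illustration):

    require 'set'

    # Partial results from three different mappers.
    partials = [ %w[alice bob], %w[bob carol], %w[alice carol dave] ].map(&:to_set)

    # Union is associative, commutative, and idempotent: any grouping, any
    # order, and accidental double-merging all give the same answer.
    distinct_users = partials.reduce(:|)
    puts distinct_users.size    # => 4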
Finally, where possible leave things in sort order by some appropriate index. Clearly I’m not talking about introducing extra unnecessary sorts on ephemeral data. For things that will be read (and experimented with) much more often than they’re written, though, it’s worth running a final sort. Now you can
- Efficiently index into a massive dataset with binary search
- Do a direct merge sort on two files with the same sort order
- Run a reducer directly across the data
- Assign a synthetic key by just serially numbering lines (either distribute a unique prefix to each mapper out-of-band, or have each mapper generate its own; see below)
Note: for files that will live on the DFS, you should usually not do a total sort; it’s enough that each output file is individually in sort order.
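For instance, the merge and binary-search wins above need nothing fancier than standard Unix tools (filenames here are made up):

    # Both inputs are already sorted on their first field, so a merge (-m)
    # combines them without re-sorting anything:
    sort -m -t$'\t' -k1,1 users-part-00000.tsv users-part-00001.tsv > users-merged.tsv

    # look(1) binary-searches a sorted file for lines beginning with a string:
    look 'infochimps' users-merged.tsv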
If it’s not broken, it’s wrong
Something that goes wrong one time in five million will still crop up a couple hundred times in a billion-record collection.
Error is not normally distributed
What’s more, errors introduced will not in general be normally distributed and their impact may not decrease with increasing data size.
Do your exception handling in-band
A large, heavily-used cluster will want to have Ganglia or Scribe or the like collecting and managing log data. Splunk is a compelling option I haven’t used myself, but it is broadly endorsed.
However, it’s worth considering another extremely efficient, simple and powerful distributed system for routing massive quantities of data in a structured way, namely wukong|hadoop itself.
Wukong gives you a BadRecord class — just rescue errors, pass in the full or partial contents of the offending input, and emit the BadRecord instance in-band. They’ll be serialized out along with the rest, and at your preference can be made to reduce to a single instance. Do analysis on them at your leisure; by default, any StructStreamer will silently discard inbound BadRecords — they won’t survive past the current generation.
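A minimal sketch of the pattern; the streamer superclass and the BadRecord constructor arguments are my best guess at the shape of the API, not verbatim Wukong:

    require 'wukong'

    class TweetParser < Wukong::Streamer::LineStreamer
      def process line
        # parse_tweet is a stand-in for whatever fragile parsing you do.
        emit parse_tweet(line)
      rescue StandardError => e
        # Ship the failure downstream as data, in-band with the good records,
        # instead of burying it in a task log on some worker node.
        emit Wukong::BadRecord.new(e.message, line)
      end
    end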
Keys
- Artificial key: assigned externally; the key is not a function of the object’s intrinsic values. A social security number is an artificial key. Artificial keys carry no information about the object itself.
- Natural key: minimal subset of fields with intrinsic semantic value that uniquely identifies the record. My name isn’t unique, but my fingerprint is both unique and intrinsic. Given the object (me) you can generate the key, and given the key there’s exactly one object (me) that matches.
Natural keys are right for big data
Synthetic keys suck. They demand locality or a central keymaster.
- Use the natural key
- Hash the natural key. This has some drawbacks: you can no longer recover the original key from its hash, and collisions are possible (though, with a long enough hash, vanishingly rare).
OK, fine. You need a synthetic key
- Do a total sort, and use nl to number the resulting lines
- Generate them from a central keymaster
- Use a single reducer to force locality. YUCK.
- Have each mapper generate a unique prefix; number each line as “prefix#{line_number}” or whatever (see the sketch below).
How do you get a unique prefix?
- Distribute a unique prefix to each mapper out-of-band. People using Streaming are out of luck.
- Use a UUID — that’s what they’re for. Drawback: ridiculously long
- Hash the machine name, PID and timestamp to something short. Check after the fact that no two mappers collided; the birthday party formula tells you how likely that is. (In practice, almost never.)
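A sketch of the last two ideas combined, as a Hadoop Streaming mapper in Ruby (the eight-hex-character prefix length is an arbitrary choice):

    #!/usr/bin/env ruby
    # Build a short prefix from hostname, PID and start time, then serially
    # number each input line. Verify afterwards that no two mappers collided.
    require 'digest/md5'
    require 'socket'

    prefix = Digest::MD5.hexdigest(
      [Socket.gethostname, Process.pid, Time.now.to_f].join('-')
    )[0, 8]

    ARGF.each_with_index do |line, idx|
      # The synthetic key goes in the first tab-separated field.
      puts "#{prefix}-#{idx}\t#{line.chomp}"
    end

With an eight-hex-character (32-bit) prefix, the birthday approximation n²/2N puts the chance of any collision among 10,000 simultaneous mappers at roughly 1%, and among a few hundred mappers at essentially zero; the after-the-fact uniqueness check is cheap insurance.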
You can consider your fields to be one of three types:
- Key
- natural: a unique username, a URL, the MD5 hash of a URL
- synthetic: an integer generated by some central keymaster
- Mutable:
- e.g. a user’s ‘bio’ section: it doesn’t identify the user, and it can change at any time.
- Immutable:
- A user’s created_at date is immutable: it doesn’t help identify the person but it will never change.
The meaning of a key depends on its semantics. Is a URL a key? It could be any of:
- A location: (compare: “The head of household residing at 742 Evergreen Terr, Springfield USA”)
- An entity handle (URI): (compare: “Homer J Simpson (aka Max Power)”)
- An observation of that entity: Many URLs are handles to a stream — http://twitter.com/infochimps names the resource “infochimps-labs’s twitter stream”, but loading that page offers only the last 20 entries in that stream. (compare: “The collection of all words spoken by the residents of 742 Evergreen Terr, Springfield USA”)
The command line is an IDE
cat /data/foo.tsv | ruby -ne 'puts $_.chomp.scan(/text="([^"]+)"/).join("\t")'
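A slightly bigger throwaway pipeline in the same spirit (the filename and field position are hypothetical): pull the domain out of a URL column and rank the most common ones.

    cat /data/weblogs/requests.tsv                  \
      | cut -f3                                     \
      | ruby -ne 'puts $_[%r{https?://([^/]+)}, 1]' \
      | sort | uniq -c | sort -rn | head -25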
Encode once, and carefully.
Encoding violates idempotence. Data brought in from elsewhere must be considered unparsable, ill-formatted and rife with illegal characters.
- Immediately fix a copy of the original data, applying as little encoding as possible.
- Follow this with a separate parse stage to emit perfectly well-formed, tab-separated / newline-delimited data.
- In this parse stage, encode the data to 7-bit ASCII, free of internal tabs, backslashes, carriage returns/line feeds or other control characters. You want your encoding scheme to be
- perfectly reversible
- widely implemented
- easily parseable
- recognizable: incoming data that is mostly inoffensive (a json record, or each line of a document such as this one) should be minimally altered from its original. This lets you do rough exploration with sort/cut/grep and friends.
- !! Involve NO QUOTING, only escaping. I can write a simple regexp to decode entities such as %10, \n or &#10;. This regexp will behave harmlessly with ill-formed data (eg %%10 or &&; or \ at end of line) and is robust against data being split or interpolated. Schemes such as “quoting: it’s bad”, %Q{quoting: “just say no”} or tagged markup require a recursive parser, and an extra or missing quote mark is almost impossible to backtrack from.
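As a sketch of what “escaping, not quoting” buys you (the entity choices here are for illustration, not any particular standard):

    # Only the characters that would break a tab/newline-delimited format get
    # rewritten; '&' is escaped too, so decoding stays unambiguous and reversible.
    ENCODE = { '&' => '&amp;', "\t" => '&#9;', "\n" => '&#10;', "\r" => '&#13;', '\\' => '&#92;' }
    DECODE = ENCODE.invert

    def encode_field(str)
      str.gsub(/[&\t\n\r\\]/) { |ch| ENCODE[ch] }
    end

    def decode_field(str)
      # One non-recursive regexp pass; ill-formed input (a stray '&', say)
      # passes through unharmed instead of derailing a parser.
      str.gsub(/&(?:amp|#9|#10|#13|#92);/) { |entity| DECODE[entity] }
    end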
In the absence of some lightweight, mostly-transparent, ASCII-compatible AND idempotent encoding scheme lurking in a back closet of some algorithms book — how to handle the initial lousy payload coming off the wire?
- For data that is mostly text in a western language, you’ll do well with XML encoding (with [\n\r\t\\] forced to encode as entities)
- URL encoding isn’t as recognizable, but is also safe. Use this for things like URIs and filenames, or if you want to be /really/ paranoid about escaping.
- For binary data, BinHex is efficient enough and every toolkit can handle it. There are more data-efficient ASCII-compatible encoding schemes, but it’s not worth the hassle for the 10% or whatever gain in size.
- If your payload itself is XML data, consider using \0 (nul) between records, with a fixed number of tab-separated metadata fields leading the XML data, which can then include tabs, newlines, or whatever the hell it wants. No changes are made to the data apart from a quick gsub to remove any (highly illegal) \0 in the XML data itself. A later parse round will convert it to structured hadoop-able data. Ex:
feed_request 20090809101112 200 OK <?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
<head>
<title>infochimps.org — Find Any Dataset in the World</title>
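A sketch of the emit side of that record format (field names follow the example above; nothing here is Wukong’s actual API):

    # One record: tab-separated metadata fields, then the untouched XML body,
    # then a NUL as the record separator. The only edit to the payload is
    # stripping any (illegal) NULs so the separator stays unambiguous.
    def emit_raw_record(resource, timestamp, status, message, xml_body)
      $stdout.print [resource, timestamp, status, message, xml_body.delete("\0")].join("\t")
      $stdout.print "\0"
    end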
Many of the command line utilities (cat, grep, etc.) will accept nul-delimited files.
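For example, with the GNU versions (-z / --null-data switches the record separator to NUL; the filename is made up, and the pattern assumes the tab-separated metadata fields shown above):

    # Count the "200 OK" records in a NUL-delimited dump:
    grep -zc $'\t200\tOK\t' /data/ripd/feed_requests.dump

    # Pull out just those records, still NUL-delimited, for a later pass:
    grep -z $'\t200\tOK\t' /data/ripd/feed_requests.dump > ok_requests.dump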
You may be tempted to use XML around your XML so you can XML while you XML. Ultimately, you’ll find this can only be done right by doing a full parse of the input — and at that point you should just translate directly to a reasonable tab/newline format. (Even if that format is tsv-compatible JSON).