
Random Thoughts on Big Data

Stuff changes when you cross the 100GB barrier. Here are some random musings on why it might make sense to do things differently at that scale.

Rules of Thumb

Drop ACID, explore Big Data

The traditional ACID quartet for relational databases (Atomic, Consistent, Isolated, Durable) can be re-interpreted in a Big Data context: make your operations Associative (safe to regroup), Commutative (safe to reorder), Idempotent (safe to apply more than once) and Distributed (safe to run on any machine, with no central coordinator), and you can get correct answers without coordinating every write.
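To make that concrete, here is a toy Ruby sketch (my illustration, nothing from the original): a merge operation built from max and set union, which you can apply in any order, in any grouping, more than once, on any machine, and still get the same answer:

  require 'set'

  # max and set-union are each associative, commutative and idempotent,
  # so a merge built from them is safe to regroup, reorder and retry.
  def merge a, b
    { seen_at: [a[:seen_at], b[:seen_at]].max,
      tags:    a[:tags] | b[:tags] }
  end

  x = { seen_at: 20090809, tags: Set['ruby'] }
  y = { seen_at: 20090810, tags: Set['hadoop'] }
  merge(merge(x, y), y) == merge(y, x)  # => true; duplicates and order don't matter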

Finally, where possible leave things in sort order by some appropriate index. Clearly I'm not talking about introducing extra unnecessary sorts on ephemeral data. For things that will be read (and experimented with) much more often than they're written, though, it's worth running a final sort. Now you can do a binary-search lookup or an efficient merge join instead of a full scan, as sketched below.

Note: for files that will live on the DFS, you should usually not do a total sort. Each reducer's output comes out sorted by key anyway; imposing a single total order across all the output files takes extra work (one giant reducer, or careful partition sampling) for little added benefit.
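As a sketch of the payoff (the file name and layout here are hypothetical), a lookup against a key-sorted TSV file is a binary search rather than a scan:

  # assumes users_by_id.tsv is sorted on its first (key) column
  lines = File.readlines('users_by_id.tsv')   # slurps the file; fine for a sketch
  row   = lines.bsearch { |l| l.split("\t", 2).first >= 'flip' }
  puts row if row && row.start_with?("flip\t")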

If it’s not broken, it’s wrong

Something that goes wrong one time in five million will crop up hundreds of times in a billion-record collection: a billion records at a one-in-five-million failure rate is 200 expected failures.

Error is not normally distributed

What's more, the errors you introduce will not, in general, be normally distributed, and their impact may not decrease with increasing data size.

Do your exception handling in-band

A large, heavily-used cluster will want Ganglia or Scribe or the like collecting and managing its log data. Splunk is a compelling option I haven't myself used, but it is broadly endorsed.

However, it's worth considering another extremely efficient, simple and powerful distributed system for routing massive quantities of data in a structured way: namely, wukong|hadoop itself.

Wukong gives you a BadRecord class: just rescue errors, pass in the full or partial contents of the offending input, and emit the BadRecord instance in-band. Bad records are serialized out along with the rest, and at your preference can be made to reduce down to a single instance each. Do analysis on them at your leisure; by default, any StructStreamer will silently discard inbound BadRecords, so they won't survive past the current generation.
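A minimal sketch of what that looks like in a mapper. The streamer base class matches Wukong's old streaming API, but the record parser and the BadRecord constructor arguments here are my assumptions, so check them against your Wukong version:

  require 'wukong'

  class ParsingMapper < Wukong::Streamer::LineStreamer
    def process line
      yield MyRecord.from_tsv(line)     # hypothetical record parser
    rescue StandardError => e
      # emit the failure in-band; it flows through the job like any record
      yield Wukong::BadRecord.new(e.class.to_s, line[0..1000])  # assumed ctor
    end
  end

  Wukong::Script.new(ParsingMapper, nil).run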

Keys


Natural keys are right for big data

Synthetic keys suck. They demand either locality (so one process can see every record it must number) or a central keymaster handing out IDs; a natural key carries its uniqueness along with the record.

OK, fine. You need a synthetic key

How do you get a unique prefix?
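One common recipe, shown here as an illustration rather than the author's method: build a prefix that is unique without any coordination (hostname plus PID), then append a timestamp and a per-process counter:

  require 'socket'

  class KeyMint
    def initialize
      # hostname + pid is unique across the cluster, no keymaster needed
      @prefix  = "#{Socket.gethostname}-#{Process.pid}"
      @counter = 0
    end

    def next_key
      @counter += 1
      '%s-%d-%06d' % [@prefix, Time.now.to_i, @counter]
    end
  end

  mint = KeyMint.new
  mint.next_key  # => e.g. "worker07-4242-1249812345-000001"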

You can consider your fields to be one of three types.

Whether something is a key depends on its semantics. Is a URL a key? As the identifier of a web page, yes; as a string someone happened to type into a tweet, it's just another field.
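If you do elect a URL as a key, you have to decide once and for all what counts as the same URL. A sketch of a normalizer; these particular choices (case-fold the host, drop the fragment) are mine, not canonical:

  require 'uri'

  def url_key raw
    u = URI.parse(raw.strip)
    u.host     = u.host.downcase if u.host
    u.fragment = nil            # '#section' doesn't name a different page
    u.path     = '/' if u.path.empty?
    u.to_s
  end

  url_key('http://Example.COM#top')   # => "http://example.com/"
  url_key('http://example.com/')      # => "http://example.com/"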

The command line is an IDE

  # pull each text="..." attribute out of every line, tab-separated
  cat /data/foo.tsv | ruby -ne 'puts $_.chomp.scan(/text="([^"]+)"/).join("\t")'

Encode once, and carefully.

Encoding violates idempotence. Data brought in from elsewhere must be considered unparsable, ill-formatted, and rife with illegal characters.
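To see the trap, run a naive escaper (this toy scheme is mine, purely for illustration) twice and watch the ampersands breed:

  # escape the TSV-breaking characters with XML-style entities
  def escape s
    s.gsub('&', '&amp;').gsub("\t", '&#9;').gsub("\n", '&#10;')
  end

  raw = "bad\tactor & co\n"
  escape(raw)          # => "bad&#9;actor &amp; co&#10;"
  escape(escape(raw))  # => "bad&amp;#9;actor &amp;amp; co&amp;#10;"  (mangled)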

In the absence of some lightweight, mostly-transparent, ASCII-compatible AND idempotent encoding scheme lurking in a back closet of some algorithms book, how do you handle the initial lousy payload coming off the wire? Consider a logged API response like this one, its raw body strewing newlines and quotes across what was meant to be a single tab-separated record:

feed_request	20090809101112  200     OK      <?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang='en' xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
  <head>
      <title>infochimps.org &mdash; Find Any Dataset in the World</title>

Many of the command-line utilities will cope with NUL-delimited records: cat doesn't care about delimiters at all, GNU grep and sort accept -z, and xargs accepts -0, so a record can carry embedded newlines without breaking the pipeline.

You may be tempted to use XML around your XML so you can XML while you XML. Ultimately, you’ll find this can only be done right by doing a full parse of the input — and at that point you should just translate directly to a reasonable tab/newline format. (Even if that format is tsv-compatible JSON).
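A sketch of that full-parse-then-flatten step; nokogiri and the choice of fields are my assumptions:

  require 'nokogiri'

  def payload_to_tsv raw_xml
    doc = Nokogiri::XML(raw_xml)
    doc.remove_namespaces!          # the sample payload above is namespaced XHTML
    title = doc.at('//title') ? doc.at('//title').text : ''
    lang  = doc.root ? doc.root['lang'].to_s : ''
    # scrub the framing characters once, then emit one clean line
    [title, lang].map { |f| f.gsub(/[\t\r\n]+/, ' ') }.join("\t")
  end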

