Thinking Big Data
There’s lots of data; Wukong and Hadoop can help
There are two disruptive trends:
- We’re instrumenting every realm of human activity
- Conversation
- Relationships
- We have linearly scaling multiprocessing
- Old frontier computing: expensive, N log N, SUUUUUUCKS
- It’s cheap, it’s scalable and it’s fun
== Map|Reduce ==
cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv
cat twitter_users.tsv | cuttab 3 | cutc 1-6 | sort | uniq -c > histogram.tsv
cat twitter_users.tsv |
  cuttab 3 |                  # extract the date column
  cutc 1-6 |                  # chop off all but the yearmonth
  sort     |                  # sort, to ensure locality
  uniq -c > histogram.tsv     # roll up lines with their count; save into the output file
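The mapper half of that pipeline, written as a Hadoop Streaming script in Ruby (a sketch; it assumes the same layout as above, a tab-separated date in column 3 starting "YYYYMM", and the helper name `yearmonth` is mine):

```ruby
#!/usr/bin/env ruby
# Histogram mapper: pull the yearmonth out of one tab-separated record.
# Assumes the date is in column 3 and starts "YYYYMM...", as above.
def yearmonth(line)
  date = line.chomp.split("\t")[2]
  date && date[0, 6]            # "20090712..." -> "200907"
end

# Usage as a streaming mapper -- Hadoop's shuffle/sort plays the role
# of `sort`, and a `uniq -c`-style reducer finishes the histogram:
#   $stdin.each_line { |line| ym = yearmonth(line); puts ym if ym }
```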
Word Count
mapper:
- output each word on its own line
$stdin.each_line{|line| puts line.split(/\W+/).reject(&:empty?) }
reducer:
- every word is guaranteed to land in the same place, next to its friends, so we can just output the repetition count for each distinct line.
uniq -c
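The same roll-up in Ruby, for when `uniq -c` isn't enough (a sketch; `count_runs` is my name, not Wukong's API). It only works because the sort has already grouped identical words:

```ruby
# Streaming `uniq -c`: count runs of identical adjacent lines.
# Correct only because the shuffle/sort lands identical words together.
def count_runs(lines)
  counts = []
  lines.each do |line|
    word = line.chomp
    if counts.last && counts.last[0] == word
      counts.last[1] += 1        # same word as previous: bump the run
    else
      counts << [word, 1]        # new word: start a fresh run
    end
  end
  counts
end

# Usage as a streaming reducer:
#   count_runs($stdin).each { |word, n| puts "#{n}\t#{word}" }
```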
Word Count by Person
- Partition Keys vs. Reduce Keys
- reduce by [word, , count] and [word, user_id, count]
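A sketch of the composite key (the field layout and helper name are assumptions, not Wukong's API): the mapper emits `word TAB user_id`; partitioning on the full key groups each (word, person) pair for counting, while partitioning on word alone keeps a word's per-person counts on the same reducer as its total.

```ruby
# Mapper sketch for word count by person. Assumes tab-separated input
# with user_id in column 1 and message text in column 2 (hypothetical
# layout). Emits "word<TAB>user_id" composite keys, one per line.
def wordcount_by_person_map(line)
  user_id, text = line.chomp.split("\t", 2)
  return [] unless user_id && text
  text.split(/\W+/).reject(&:empty?).map { |w| "#{w.downcase}\t#{user_id}" }
end
```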
== Global Structure ==
Enumerating neighborhoods
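One way to enumerate neighborhoods with map/reduce (a sketch, assuming a tab-separated edge list; the function names are mine): the mapper emits each edge in both directions, the sort groups a node with all its neighbors, and the reducer rolls them up into adjacency lists.

```ruby
# Mapper sketch: emit both directions of each tab-separated edge, so
# every node's neighbors land together after the sort.
def neighborhood_map(edge_line)
  a, b = edge_line.chomp.split("\t")
  ["#{a}\t#{b}", "#{b}\t#{a}"]
end

# Reducer sketch: roll sorted "node<TAB>neighbor" pairs into
# adjacency lists, one per node.
def neighborhood_reduce(sorted_pairs)
  adj = Hash.new { |h, k| h[k] = [] }
  sorted_pairs.each do |pair|
    node, nbr = pair.split("\t")
    adj[node] << nbr
  end
  adj
end
```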
== Mechanics, HDFS ==