
Thinking Big Data

There’s lots of data. Wukong and Hadoop can help.

There are two disruptive

== Map|Reduce ==

cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv

cat twitter_users.tsv | cuttab 3 | cutc 1-6 | sort | uniq -c > histogram.tsv

The same pipeline, one step per line:

cat twitter_users.tsv |
  cuttab 3 |                # extract the date column
  cutc 1-6 |                # chop off all but the yearmonth
  sort     |                # sort, to ensure locality
  uniq -c  > histogram.tsv  # roll up lines, along with their count, and save into output file
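The same roll-up can be sketched in plain Ruby to show why the sort matters: once equal keys are adjacent, the reducer only has to compare each key with the previous one. This is an illustration, not Wukong's API. It assumes `cuttab 3` selects the third tab-separated field and `cutc 1-6` keeps its first six characters; the method name `yearmonth_histogram` and the sample rows are made up for the example.

```ruby
# Hypothetical helper: count rows per year-month.
# Assumes column 3 (tab-separated) holds a timestamp such as 20090704123456.
def yearmonth_histogram(lines)
  # map: extract the key (first six characters of column 3), then sort
  keys = lines.map { |line| line.chomp.split("\t")[2].to_s[0, 6] }.sort

  # reduce: equal keys are now adjacent, so only the last run needs checking
  histogram = []
  keys.each do |key|
    if histogram.last && histogram.last[0] == key
      histogram.last[1] += 1   # same key as the previous line: extend the run
    else
      histogram << [key, 1]    # new key: start a new run
    end
  end
  histogram
end

# Made-up sample rows: id, screen name, created-at timestamp
sample = [
  "1\talice\t20090704123456",
  "2\tbob\t20090712090000",
  "3\tcarol\t20081225000000",
]

# Print count first, then key, in the spirit of `uniq -c`
yearmonth_histogram(sample).each { |yearmonth, count| puts "#{count}\t#{yearmonth}" }
```

Dropping the `.sort` would split a year-month into several runs whenever its rows are not already next to each other, which is the locality the shell pipeline relies on.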

Word Count

mapper:

    # output each word on its own line
    readlines.each{|line| puts line.split(/[^\w]+/) }

reducer:

    # every word is guaranteed to land in the same place and next to its
    # friends, so we can just output the repetition count for each
    # distinct line.
    uniq -c
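To try the whole word count without Hadoop, the map, sort and reduce steps can be chained in one Ruby script. This is a sketch under assumptions, not how Wukong wires things up: `map_words` and `reduce_counts` are hypothetical names, the input is an in-memory array rather than STDIN, and `scan(/\w+/)` stands in for the `split` above so that a line starting with punctuation does not emit an empty word.

```ruby
# Mapper: emit each word separately (one array element per word).
def map_words(lines)
  lines.flat_map { |line| line.scan(/\w+/) }
end

# Reducer: the input is sorted, so identical words sit next to each other.
# Count the length of each run, like `uniq -c`.
def reduce_counts(sorted_words)
  sorted_words
    .chunk_while { |a, b| a == b }
    .map { |run| [run.first, run.size] }
end

# Made-up input lines
lines  = ["the cat saw the dog", "the dog ran"]

# The sort between the two steps plays the role of `sort` in the pipeline
counts = reduce_counts(map_words(lines).sort)
counts.each { |word, count| puts "#{count}\t#{word}" }
```

On a cluster the sort also decides which reducer a word lands on; here it only guarantees adjacency.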

Word Count by Person

- reduce by [word, , count] and [word, user_id, count]
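One way to read the line above: the mapper emits `[word, user_id]` pairs, and after sorting the reducer writes one row per word and one row per (word, user). A minimal sketch of that reading follows. The input shape (`[user_id, text]` pairs), the method names, and the use of `nil` for the blank middle field of the first row type are all assumptions made for illustration.

```ruby
# Mapper: emit a [word, user_id] pair for every word a user wrote.
# Hypothetical input: an array of [user_id, text] pairs.
def map_word_user(records)
  records.flat_map do |user_id, text|
    text.scan(/\w+/).map { |word| [word, user_id] }
  end
end

# Reducer: pairs arrive sorted by word, then by user_id.
# For each word, emit a total row, then one row per user.
def reduce_word_user(sorted_pairs)
  rows = []
  sorted_pairs.chunk_while { |a, b| a[0] == b[0] }.each do |word_run|
    # nil marks the blank middle field (an assumption, see above)
    rows << [word_run.first[0], nil, word_run.size]
    word_run.chunk_while { |a, b| a[1] == b[1] }.each do |user_run|
      rows << [user_run.first[0], user_run.first[1], user_run.size]
    end
  end
  rows
end

# Made-up records
records = [[1, "big data"], [2, "big big ideas"]]
rows = reduce_word_user(map_word_user(records).sort)
rows.each { |row| puts row.map(&:to_s).join("\t") }
```

Sorting on the composite key keeps each user's occurrences of a word together inside that word's run, so both counts come out of a single pass.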

== Global Structure ==

Enumerating neighborhood

== Mechanics, HDFS ==

x M _
_ M y

