Thinking Big Data
There’s lots of data; Wukong and Hadoop can help
There are two disruptive trends:
- We’re instrumenting every realm of human activity
- Conversation
- Relationships
- We have linearly scaling multiprocessing
- Old frontier computing: expensive, N log N, SUUUUUUCKS
- It’s cheap, it’s scalable and it’s fun
== Map|Reduce ==
cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv
cat twitter_users.tsv | cuttab 3 | cutc 1-6 | sort | uniq -c > histogram.tsv
cat twitter_users.tsv |
  cuttab 3 |                  # extract the date column
  cutc 1-6 |                  # chop off all but the yearmonth
  sort     |                  # sort, to ensure locality
  uniq -c > histogram.tsv     # roll up lines with their count; save into the output file
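The mapper half of that pipeline, written as a Hadoop Streaming script in Ruby (a sketch; it assumes the same layout as above, a tab-separated date in column 3 starting "YYYYMM", and the helper name `yearmonth` is mine):

```ruby
#!/usr/bin/env ruby
# Histogram mapper: pull the yearmonth out of one tab-separated record.
# Assumes the date is in column 3 and starts "YYYYMM...", as above.
def yearmonth(line)
  date = line.chomp.split("\t")[2]
  date && date[0, 6]            # "20090712..." -> "200907"
end

# Usage as a streaming mapper -- Hadoop's shuffle/sort plays the role
# of `sort`, and a `uniq -c`-style reducer finishes the histogram:
#   $stdin.each_line { |line| ym = yearmonth(line); puts ym if ym }
```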
Word Count
mapper:
- output each word on its own line
$stdin.each_line{|line| puts line.split(/\W+/).reject(&:empty?) }
reducer:
- every word is guaranteed to land in the same place, next to its friends, so we can just output the repetition count for each distinct line.
uniq -c
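The same roll-up in Ruby, for when `uniq -c` isn't enough (a sketch; `count_runs` is my name, not Wukong's API). It only works because the sort has already grouped identical words:

```ruby
# Streaming `uniq -c`: count runs of identical adjacent lines.
# Correct only because the shuffle/sort lands identical words together.
def count_runs(lines)
  counts = []
  lines.each do |line|
    word = line.chomp
    if counts.last && counts.last[0] == word
      counts.last[1] += 1        # same word as previous: bump the run
    else
      counts << [word, 1]        # new word: start a fresh run
    end
  end
  counts
end

# Usage as a streaming reducer:
#   count_runs($stdin).each { |word, n| puts "#{n}\t#{word}" }
```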
Word Count by Person
- Partition Keys vs. Reduce Keys
- reduce by [word, , count] and [word, user_id, count]
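A sketch of the composite key (the field layout and helper name are assumptions, not Wukong's API): the mapper emits `word TAB user_id`; partitioning on the full key groups each (word, person) pair for counting, while partitioning on word alone keeps a word's per-person counts on the same reducer as its total.

```ruby
# Mapper sketch for word count by person. Assumes tab-separated input
# with user_id in column 1 and message text in column 2 (hypothetical
# layout). Emits "word<TAB>user_id" composite keys, one per line.
def wordcount_by_person_map(line)
  user_id, text = line.chomp.split("\t", 2)
  return [] unless user_id && text
  text.split(/\W+/).reject(&:empty?).map { |w| "#{w.downcase}\t#{user_id}" }
end
```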
== Global Structure ==
Enumerating neighborhoods
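One way to enumerate neighborhoods with map/reduce (a sketch, assuming a tab-separated edge list; the function names are mine): the mapper emits each edge in both directions, the sort groups a node with all its neighbors, and the reducer rolls them up into adjacency lists.

```ruby
# Mapper sketch: emit both directions of each tab-separated edge, so
# every node's neighbors land together after the sort.
def neighborhood_map(edge_line)
  a, b = edge_line.chomp.split("\t")
  ["#{a}\t#{b}", "#{b}\t#{a}"]
end

# Reducer sketch: roll sorted "node<TAB>neighbor" pairs into
# adjacency lists, one per node.
def neighborhood_reduce(sorted_pairs)
  adj = Hash.new { |h, k| h[k] = [] }
  sorted_pairs.each do |pair|
    node, nbr = pair.split("\t")
    adj[node] << nbr
  end
  adj
end
```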
== Mechanics, HDFS ==