Thinking Big Data
There’s lots of data; Wukong and Hadoop can help
There are two disruptive trends:
	- We’re instrumenting every realm of human activity
		- Conversation
		- Relationships
	- We have linearly scaling multiprocessing
		- The old frontier of computing: expensive, N log N scaling, SUUUUUUCKS
		- The new one: it’s cheap, it’s scalable, and it’s fun
== Map|Reduce ==
cat input.tsv | mapper.sh | sort | reducer.sh > output.tsv
cat twitter_users.tsv | cuttab 3 | cutc 1-6 | sort | uniq -c > histogram.tsv
cat twitter_users.tsv |       # stream in the user records
  cuttab 3  |                 # extract the date column
  cutc 1-6  |                 # chop off all but the yearmonth
  sort      |                 # sort, to ensure locality
  uniq -c > histogram.tsv     # roll up lines, along with their count, into the output file
(A line ending in a pipe continues onto the next line, so no backslashes are needed; cuttab and cutc are presumably shell aliases for cut -f and cut -c.)
Word Count
mapper:
	- output each word on its own line
 $stdin.each_line{|line| puts line.split(/[^\w]+/).reject(&:empty?) }
reducer:
	- every word is guaranteed to land in the same place, next to its friends, so we can just output the repetition count for each distinct line:
 uniq -c
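The same roll-up as a hand-written streaming reducer: a minimal Ruby sketch standing in for uniq -c (the file name wcreduce.rb is made up for illustration). It works only because the sort step has already brought identical words together.

#!/usr/bin/env ruby
# wcreduce.rb -- stand-in for `uniq -c`: count the run length of each
# distinct line in already-sorted input.
current, count = nil, 0
$stdin.each_line do |line|
  word = line.chomp
  if word == current
    count += 1
  else
    puts "#{count}\t#{current}" unless current.nil?   # emit the finished group
    current, count = word, 1
  end
end
puts "#{count}\t#{current}" unless current.nil?       # flush the final group

Wired up exactly like the pipelines above (corpus.txt and mapper.rb are illustrative names): cat corpus.txt | ruby mapper.rb | sort | ruby wcreduce.rb > counts.tsv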
Word Count by Person
	- Partition keys vs. reduce keys: the partition key picks which reducer a record lands on; the reduce key picks which adjacent records get grouped together
	- plain word count reduces on [word] and emits [word, count]; word count by person reduces on [word, user_id] and emits [word, user_id, count] (see the mapper sketch below)
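A sketch of that mapper in plain Ruby, assuming, purely for illustration, that each input line is "user_id <TAB> tweet text" (the field layout is an assumption, not part of the original):

#!/usr/bin/env ruby
# mapper for word count by person.  Assumed input layout: "user_id \t text".
# Emits one "word \t user_id" line per word, so the sort step groups
# records on the full [word, user_id] reduce key.
$stdin.each_line do |line|
  user_id, text = line.chomp.split("\t", 2)
  next if text.nil?
  text.split(/[^\w]+/).reject(&:empty?).each do |word|
    puts "#{word}\t#{user_id}"
  end
end

The reducer is unchanged: after sort, uniq -c rolls each distinct [word, user_id] line up into its count.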
== Global Structure ==
Enumerating neighborhoods
	- to find every two-hop path x → M → y, key each edge by its midpoint M: the records [x, M, _] and [_, M, y] then land on the same reducer, which can emit the neighbor pair (x, y); a sketch follows
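A hedged sketch of that midpoint join in plain Ruby; the "src \t dst" edge layout and the i/o direction tags are assumptions for illustration:

#!/usr/bin/env ruby
# neighborhood mapper: re-key each edge under its midpoint, tagged by
# direction, so x -> M and M -> y meet in the same reduce group.
$stdin.each_line do |line|
  src, dst = line.chomp.split("\t")
  next if src.nil? || dst.nil?
  puts "#{dst}\ti\t#{src}"    # dst, as midpoint, has an inbound edge from src
  puts "#{src}\to\t#{dst}"    # src, as midpoint, has an outbound edge to dst
end

The matching reducer crosses each midpoint's inbound set against its outbound set:

#!/usr/bin/env ruby
# neighborhood reducer: input is sorted on midpoint, so each group's
# edges are adjacent; emit every pair (x, y) with x -> mid -> y.
require 'set'
ins, outs, current = Set.new, Set.new, nil
flush = lambda do
  ins.each{|x| outs.each{|y| puts "#{x}\t#{y}" } }
  ins.clear
  outs.clear
end
$stdin.each_line do |line|
  mid, tag, node = line.chomp.split("\t")
  if mid != current
    flush.call                # emit all pairs for the finished midpoint
    current = mid
  end
  (tag == 'i' ? ins : outs) << node
end
flush.call                    # flush the final midpoint's group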
== Mechanics, HDFS ==