download sourceball as .zipdownload sourceball as .tardownload sourceball as static gem

Wukong More Info

Why is it called Wukong?

Hadoop, as you may know, is named after a stuffed elephant. Since Wukong was started by the infochimps team, we needed a simian analog. A Monkey King who journeyed to the land of the Elephant seems to fit the bill:

Sun Wukong (孙悟空), known in the West as the Monkey King, is the main character in the classical Chinese epic novel Journey to the West. In the novel, he accompanies the monk Xuanzang on the journey to retrieve Buddhist sutras from India.

Sun Wukong possesses incredible strength, being able to lift his 13,500 jīn (8,100 kg) Ruyi Jingu Bang with ease. He also has superb speed, traveling 108,000 li (54,000 kilometers) in one somersault. Sun knows 72 transformations, which allows him to transform into various animals and objects; he is, however, shown with slight problems transforming into other people, since he is unable to complete the transformation of his tail. He is a skilled fighter, capable of holding his own against the best generals of heaven. Each of his hairs possesses magical properties, and is capable of transforming into a clone of the Monkey King himself, or various weapons, animals, and other objects. He also knows various spells in order to command wind, part water, conjure protective circles against demons, freeze humans, demons, and gods alike. — Sun Wukong’s Wikipedia entry

The Jaime Hewlett / Damon Albarn short that the BBC made for their 2008 Olympics coverage gives the general idea.

Map/Reduce Algorithms

Example graph scripts:

K-Nearest Neighbors

More example hadoop algorithms:

Example example scripts (from http://www.cloudera.com/resources/learning-mapreduce):

1. Find the [number of] hits by 5 minute timeslot for a website given its access logs.
2. Find the pages with over 1 million hits in day for a website given its access logs.
3. Find the pages that link to each page in a collection of webpages.
4. Calculate the proportion of lines that match a given regular expression for a collection of documents.
5. Sort tabular data by a primary and secondary column.
6. Find the most popular pages for a website given its access logs.

Don’t Use Wukong, use this instead

There are several worthy Hadoop|Streaming Frameworks:

Most people use Wukong / one of the above (or straight Java Hadoop, poor souls) for heavy lifting, and several of the following hadoop tools for efficiency:

Hadoop Algorithms

Hadoop

Note on Patches/Pull Requests

What’s up with Wukong::AndPig?

Wukong::AndPig is a small library to more easily generate code for the Pig data analysis language. See its README for more.

It’s not really being worked on, and you should probably ignore it.

TODOs

Utility

BUGS:

Patterns to implement:

Make wutils: tsv-oriented implementations of the coreutils (eg uniq, sort, cut, nl, wc, split, ls, df and du) to instrinsically accept and emit tab-separated records.


Fork me on GitHub