wukong :: install

Get the code
Setup
Installing and Running Wukong with Hadoop
Installing and Running Wukong with Datamapper, ActiveRecord, the command-line and more

Get the code

We’re still actively developing wukong. The newest version is available via Git on GitHub:

$ git clone git://github.com/infochimps-labs/wukong

A gem is available from gemcutter:

$ sudo gem install wukong --source=http://gemcutter.org

(Don’t use the gems.github.com version; it’s way out of date.)

You can instead download this project in either zip or tar formats.

Get the Dependencies

- Hadoop
- Pig (optional)

Parts of wukong require these gems:
- addressable/uri
- htmlentities
- extlib
- YAML
- JSON
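
To see which of the libraries listed above are already available, a quick check along these lines may help. It is only a sketch: the helper name missing_libs is made up for illustration, and the lowercase names yaml and json are an assumption about how the listed YAML and JSON are loaded.

```ruby
# Library names taken from the list above. "yaml" and "json" are assumed to be
# the lowercase require names for the listed YAML and JSON.
WUKONG_LIBS = %w[addressable/uri htmlentities extlib yaml json]

# Hypothetical helper: returns the libraries from `libs` that fail to load.
def missing_libs(libs)
  libs.reject do |lib|
    begin
      require lib
      true                          # loaded (or was already loaded)
    rescue LoadError, StandardError
      false                         # not installed, or broke while loading
    end
  end
end

if __FILE__ == $0
  missing = missing_libs(WUKONG_LIBS)
  if missing.empty?
    puts "all listed libraries load"
  else
    puts "missing: #{missing.join(', ')}"
  end
end
```

Anything reported as missing can usually be installed with gem install, using the gem name rather than the require path.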

Setup

1. Allow Wukong to discover where his elephant friend lives by setting a $HADOOP_HOME environment variable:

   $ export HADOOP_HOME="/usr/local/share/hadoop"

2. Add wukong’s bin/ directory to your $PATH if you’d like to use the wutils.
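
The two setup steps above can be sanity-checked with a small script like the following. This is a sketch, not part of wukong: setup_problems is a hypothetical helper, and the wukong/bin path under the current directory is only an example location for your checkout.

```ruby
# Hypothetical helper: returns a list of problems with the environment
# (an empty list means both setup steps look done).
# env is Hash-like (pass ENV); wukong_bin is the path to wukong's bin/ directory.
def setup_problems(env, wukong_bin)
  problems = []

  hadoop_home = env["HADOOP_HOME"].to_s
  if hadoop_home.empty?
    problems << "HADOOP_HOME is not set"
  elsif !File.directory?(hadoop_home)
    problems << "HADOOP_HOME points to a missing directory: #{hadoop_home}"
  end

  path_dirs = env["PATH"].to_s.split(File::PATH_SEPARATOR)
  unless path_dirs.include?(wukong_bin)
    problems << "#{wukong_bin} is not on PATH (only needed for the wutils)"
  end

  problems
end

if __FILE__ == $0
  # Example checkout location (an assumption); adjust to wherever you cloned wukong.
  wukong_bin = File.join(Dir.pwd, "wukong", "bin")
  problems = setup_problems(ENV, wukong_bin)
  puts(problems.empty? ? "setup looks fine" : problems)
end
```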

(see also: Ruby Hadoop Quickstart)

Installing and Running Wukong with Hadoop

Wukong was primarily developed for Hadoop, and we think it’s the best way to use Hadoop (it’s certainly the most fun!).

Run Wukong on the Amazon AWS EC2 Cloud

Hadoop Infrastructure

Even if you have a bunch of machines with spare cycles, lots of RAM, and a shared filesystem… do yourself a favor and start out using the Cloudera AMIs on Amazon’s EC2 cloud. There is an overwhelming number of fiddly little parameters, and you’ll be glad to have some hands-on experience as a user before you get into server setup. If it’s still mid-to-late 2009 when you read this, ignore prudence and jump straight to using Hadoop 0.20. It will be a) more fun, b) much more robust (trust me, at “v0.20” you want to live on the bleeding edge), and c) you won’t have to suffer through migrating your HDFS two weeks after setup.

To set up Hadoop, your best bet is the Cloudera AMIs on Amazon’s EC2 compute cloud:

http://www.cloudera.com/hadoop-ec2
http://www.cloudera.com/hadoop-ec2-ebs-beta

EC2 means anyone with a $10 bill can rent a 10-machine cluster with 1TB of distributed storage for 8 hours.

Run Wukong using Amazon AWS Elastic MapReduce

AWS Elastic MapReduce saves the trouble of even setting up a cluster: click, bam, there it is.

Phil Ripperger has prepared a Ruby Hadoop Quickstart explaining how to get started with Wukong, Hadoop and the Amazon Elastic MapReduce cloud — it’s better than anything we could put here. Thanks Phil!

Set up a Hadoop cluster

If you have a local cluster, or just want to experiment with a single-machine install, check out the Cloudera packages for both Debian/Ubuntu-based and Redhat/RPM-based Linux systems.

More Hadoop Notes

I’ve braindumped some random notes on configuring and using Hadoop over here.

Wukong isn’t just Hadoop: Datamapper, ActiveRecord, command-line usage and more

Wukong is used by many in a non-Hadoop environment: anywhere you can stream data records, you can unleash its monkey power.

Please see the usage notes for more!
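
To make the "anywhere you can stream data records" idea concrete, here is a rough sketch of a line-at-a-time word mapper. It is plain Ruby and deliberately does not use any Wukong classes; map_words is a hypothetical name, and Wukong's own API (covered in the usage notes) may look quite different.

```ruby
require 'stringio'

# Hypothetical helper, not part of Wukong: reads one record per line from
# `input`, splits it into lowercase words, and writes a tab-separated
# "word<TAB>1" record to `output` for each word.
def map_words(input, output)
  input.each_line do |line|
    line.downcase.scan(/[a-z0-9']+/).each do |word|
      output.puts [word, 1].join("\t")
    end
  end
end

# Small demo on an in-memory stream; any IO-like object works the same way.
sample = StringIO.new("The monkey king\nthe elephant\n")
map_words(sample, $stdout)
```

Because the function only needs something that yields lines and something that accepts puts, passing $stdin and $stdout instead turns it into a filter you can use in a shell pipeline, which is the same shape Hadoop streaming expects from a mapper.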

Wukong image courtesy Curt Busse under an open license. It's a Chacma Baboon from the Okavango site. Make sure to read the story at the bottom of that page.

Thanks to github.com/mojombo for the swanky HTML layout.