infochimps-labs.github.com/wukong - infochimps-labs.github.com/wukong

Hadoop Config Tips

Modify Hadoop Job / Site configuration for Wukong/Streaming

Wukong jobs can have long startup times.

If you are processing many small files, increasing
mapred.tasktracker.map.tasks.maximum to be 1.5-2.5 times the number of cores can
give a nice improvement.

Hadoop currently is a slow scheduler for tasks (can schecule at most 1 task per second or two per node). Using the Fair Scheduler and enabling some options can turn this up to 1 map and 1 reduce to schedule per ‘tick’. There are some big changes in the schedulers due out in 0.21 that will significantly help here. For smaller low latency jobs this can make a big difference.

Hadoop also has some flaws in the shuffle phase that affect clusters of all sizes, but can hurt small clusters with many small map jobs. Look at the log file output of your reduce jobs (in the jobtracker UI) and see how long shuffle phase is taking. There is a big change due for 0.21 that makes this a LOT faster for some cases, and low latency smaller jobs will benefit a lot too. https://issues.apache.org/jira/browse/MAPREDUCE-318

Setup NFS within the cluster

If you’re lazy, I recommend setting up NFS — it makes dispatching simple config and script files much easier. (And if you’re not lazy, what the hell are you doing using Wukong?). Be careful though — used unwisely, a swarm of NFS requests will mount a devastatingly effective denial of service attack on your poor old master node.

Installing NFS to share files along the cluster gives the following conveniences:

You don’t have to bundle everything up with each run: any path in ~coder/ will refer back via NFS to the filesystem on master.
The user can now passwordless ssh among the nodes, since there’s only one shared home directory and since we included the user’s own public key in the authorized_keys2 file. This lets you easily rsync files among the nodes.

First, you need to take note of the internal name for your master, perhaps something like domU-xx-xx-xx-xx-xx-xx.compute-1.internal.

As root, on the master (change compute-1.internal to match your setup):

    apt-get install nfs-kernel-server 
    echo "/home *.compute-1.internal(rw)" >> /etc/exports ;
    /etc/init.d/nfs-kernel-server stop ;

(The *.compute-1.internal part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)

Next, set up a regular user account on the master only. In this case our user will be named ‘chimpy’:

  visudo # uncomment the last line, to allow group sudo to sudo
  groupadd admin 
  adduser chimpy
  usermod -a -G sudo,admin chimpy
  su chimpy                  # now you are the new user
  ssh-keygen -t rsa          # accept all the defaults
  cat ~/.ssh/id_rsa.pub      # can paste this public key into your github, etc
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2

Then on each slave (replacing domU-xx-… by the internal name for the master node):

    apt-get install nfs-common ;
    echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home  /mnt/home  nfs  rw  0 0"  >> /etc/fstab
    /etc/init.d/nfs-common restart
    mkdir /mnt/home
    mount /mnt/home
   ln -s /mnt/home/chimpy /home/chimpy

You should now be in business.

Performance tradeoffs should be small as long as you’re just sending code files and gems around. Don’t write out log entries or data to NFS partitions, or you’ll effectively perform a denial-of-service attack on the master node.

http://nfs.sourceforge.net/nfs-howto/ar01s03.html
The Setting up an NFS Server HOWTO was an immense help, and I recommend reading it carefully.

Tools for EC2 and S3 Management

http://s3sync.net/wiki
http://jets3t.s3.amazonaws.com/applications/applications.html#uploader
“ElasticFox”
“S3Fox (S3 Organizer)”:
“FoxyProxy”:

Random hadoop notes

do your jobs seem to speed up near the end? hadoop will take the largest
blocks first, for best speculative execution but the % complete is based on
the # of mappers, not the # of input bytes.

BZip2 files must all be run through the same mapper, so a job with asymmetrically distributed file sizes may find the first mappers finishing along with the last ones.

Random EC2 notes

How to Mount EBS volume at launch

The Cloudera AMIs and distribution include BZip2 support. This means that if you have input files with a .bz2 extension, they will be naturally un-bzipped and streamed. (Note that there is a non-trivial penalty for doing so: each bzip’ed file must go, in whole, to a single mapper; and the CPU load for un-bzipping is sizeable.)

To produce bzip2 files, specify the --compress_output= flag. If you have the BZip2 patches installed, you can give --compress_output=bz2; everyone should be able to use --compress_output=gz.

For excellent performance you can patch your install for Parallel LZO Splitting

If you’re using XFS, consider setting the nobarrier option
/dev/sdf /mnt/data2 xfs noatime,nodiratime,nobarrier 0 0

The first write to any disk location is about 5x slower than later writes. Explanation, and how to pre-soften a volume, here: http://docs.amazonwebservices.com/AWSEC2/latest/DeveloperGuide/index.html?instance-storage.html

infochimps.org | find, share or sell any dataset in the world
github/infochimps-labs | infochimps-labs's code on github.com

wukong | hadoop made easy
edamame | fast persistent job queue
monkeyshines | api scraper
wuclan | social media scraper

Wukong image courtesy Curt Busse under an open license. It's a Chacma Baboon from the Okavango site. Make sure to read the story at the bottom of that page.

Thanks to github.com/mojombo for the swanky HTML layout.