
Hadoop Config Tips

Modify Hadoop Job / Site configuration for Wukong/Streaming

Wukong jobs can have long startup times; the settings below help reduce job latency.

If you are processing many small files, increasing
mapred.tasktracker.map.tasks.maximum to be 1.5-2.5 times the number of cores can
give a nice improvement.
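
For example, on a 4-core tasktracker you might allow 8 map slots. A minimal sketch (the conf path and service name are assumptions that vary by install; this is a daemon-level setting, so apply it on every tasktracker node and restart):

    grep -c ^processor /proc/cpuinfo     # count cores; pick a slot count ~1.5-2.5x this
    sudo sed -i '/<\/configuration>/i \
    <property><name>mapred.tasktracker.map.tasks.maximum</name><value>8</value></property>' /etc/hadoop/conf/mapred-site.xml
    sudo /etc/init.d/hadoop-tasktracker restart   # assumed service name; adjust for your install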

Hadoop is currently slow at scheduling tasks (it can schedule at most one task per node every second or two). Using the Fair Scheduler and enabling some of its options turns this up to one map and one reduce scheduled per ‘tick’. There are some big changes in the schedulers due out in 0.21 that will significantly help here. For smaller, low-latency jobs this can make a big difference.
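
A sketch of enabling it (0.20-era property names; the conf path is an assumption, and the fairscheduler contrib jar must be on the jobtracker classpath). mapred.fairscheduler.assignmultiple is the option that lets it hand out a map and a reduce in the same heartbeat:

    sudo sed -i '/<\/configuration>/i \
    <property><name>mapred.jobtracker.taskScheduler</name><value>org.apache.hadoop.mapred.FairScheduler</value></property>\
    <property><name>mapred.fairscheduler.assignmultiple</name><value>true</value></property>' /etc/hadoop/conf/mapred-site.xml
    # then restart the jobtracker so the scheduler change takes effect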

Hadoop also has some flaws in the shuffle phase that affect clusters of all sizes, but they especially hurt small clusters running many small map tasks. Look at the log output of your reduce tasks (in the jobtracker UI) to see how long the shuffle phase takes. There is a big change due in 0.21 that makes this a LOT faster in some cases; low-latency, smaller jobs will benefit a lot too. https://issues.apache.org/jira/browse/MAPREDUCE-318

Set up NFS within the cluster

If you’re lazy, I recommend setting up NFS — it makes dispatching simple config and script files much easier. (And if you’re not lazy, what the hell are you doing using Wukong?). Be careful though — used unwisely, a swarm of NFS requests will mount a devastatingly effective denial of service attack on your poor old master node.

Installing NFS to share files across the cluster is straightforward.

First, you need to take note of the internal name for your master, perhaps something like domU-xx-xx-xx-xx-xx-xx.compute-1.internal.
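
A couple of quick ways to find it (the second uses EC2’s instance metadata service):

    hostname -f
    curl -s http://169.254.169.254/latest/meta-data/local-hostname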

As root, on the master (change compute-1.internal to match your setup):

    apt-get install nfs-kernel-server
    echo "/home *.compute-1.internal(rw)" >> /etc/exports
    /etc/init.d/nfs-kernel-server restart

(The *.compute-1.internal part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)
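
To sanity-check the export on the master:

    exportfs -ra              # re-read /etc/exports without bouncing the daemon
    showmount -e localhost    # should list /home *.compute-1.internal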

Next, set up a regular user account on the master only. In this case our user will be named ‘chimpy’:

    visudo                     # uncomment the last line, to allow group sudo to sudo
    groupadd admin
    adduser chimpy
    usermod -a -G sudo,admin chimpy
    su - chimpy                # now you are the new user
    ssh-keygen -t rsa          # accept all the defaults
    cat ~/.ssh/id_rsa.pub      # you can paste this public key into your github, etc.
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
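
Because /home will be NFS-mounted on every slave, that same authorized_keys file gives chimpy passwordless ssh across the whole cluster. Once the slave mounts below are in place, you can verify with something like (the slave name here is hypothetical):

    ssh domU-yy-yy-yy-yy-yy-yy.compute-1.internal hostname   # should print the slave's name, no password prompt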

Then on each slave (replacing domU-xx-… with the internal name of the master node):

    apt-get install nfs-common
    echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home  /mnt/home  nfs  rw  0 0" >> /etc/fstab
    /etc/init.d/nfs-common restart
    mkdir /mnt/home
    mount /mnt/home
    ln -s /mnt/home/chimpy /home/chimpy
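
A quick check from the slave that the share is live (nfs-ok is just a scratch filename for illustration):

    df -h /mnt/home                                     # should show the master's /home export
    touch /mnt/home/chimpy/nfs-ok && ls -l /home/chimpy/nfs-ok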

You should now be in business.

Performance tradeoffs should be small as long as you’re just sending code files and gems around. Don’t write out log entries or data to NFS partitions, or you’ll effectively perform a denial-of-service attack on the master node.

Tools for EC2 and S3 Management

Random Hadoop notes

Each BZip2 file must be processed by a single mapper (bzip2 input isn’t splittable), so a job with asymmetrically distributed file sizes may find its first mappers finishing long before the last one: the job takes as long as its largest file.
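
To spot that kind of skew before launching a job, compare the input file sizes (the path here is hypothetical):

    hadoop fs -du /data/input/*.bz2   # one file much larger than the rest means one straggler mapper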

Random EC2 notes

