Wukong jobs can have long startup times.
If you are processing many small files, increasing
mapred.tasktracker.map.tasks.maximum to be 1.5-2.5 times the number of cores can
give a nice improvement.
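As a rough sketch of that sizing rule (the 2x multiplier below is just one point in the 1.5-2.5x range, and the core count here is illustrative):

```shell
# Sketch: pick a map-slot count of roughly 2x the core count.
# 2x is illustrative; anywhere in the 1.5-2.5x range may suit your jobs better.
CORES=4                    # e.g. from: grep -c ^processor /proc/cpuinfo
MAP_SLOTS=$(( CORES * 2 ))
echo "set mapred.tasktracker.map.tasks.maximum to ${MAP_SLOTS}"
```

The resulting value goes in mapred-site.xml on each tasktracker node.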
Hadoop's scheduler is currently slow: it can schedule at most one task per second or two, per node. Using the Fair Scheduler and enabling some of its options raises this to one map and one reduce task scheduled per ‘tick’. There are some big changes in the schedulers due out in 0.21 that will significantly help here. For smaller, low-latency jobs this can make a big difference.
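If you want to try the Fair Scheduler, the relevant mapred-site.xml entries look roughly like this (a sketch for 0.20-era configs; mapred.fairscheduler.assignmultiple is the option that lets it hand out both a map and a reduce per heartbeat):

```xml
<!-- mapred-site.xml: swap in the Fair Scheduler -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<!-- allow assigning both a map and a reduce task in a single heartbeat -->
<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
```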
Hadoop also has some flaws in the shuffle phase that affect clusters of all sizes, but they can especially hurt small clusters running many small map tasks. Look at the log output of your reduce tasks (in the jobtracker UI) to see how long the shuffle phase is taking. There is a big change due for 0.21 that makes this a LOT faster in some cases, and low-latency smaller jobs will benefit a lot too. https://issues.apache.org/jira/browse/MAPREDUCE-318
If you’re lazy, I recommend setting up NFS — it makes dispatching simple config and script files much easier. (And if you’re not lazy, what the hell are you doing using Wukong?). Be careful though — used unwisely, a swarm of NFS requests will mount a devastatingly effective denial of service attack on your poor old master node.
Installing NFS to share files across the cluster gives the following conveniences:
First, take note of the internal name for your master, perhaps something like domU-xx-xx-xx-xx-xx-xx.compute-1.internal.
As root, on the master (change compute-1.internal to match your setup):
    apt-get install nfs-kernel-server
    echo "/home *.compute-1.internal(rw)" >> /etc/exports
    /etc/init.d/nfs-kernel-server restart
(The *.compute-1.internal part limits host access, but you should take a look at the security settings of both EC2 and the built-in portmapper as well.)
Next, set up a regular user account on the master only. In this case our user will be named ‘chimpy’:
    visudo                                            # uncomment the last line, to allow group sudo to sudo
    groupadd admin
    adduser chimpy
    usermod -a -G sudo,admin chimpy
    su chimpy                                         # now you are the new user
    ssh-keygen -t rsa                                 # accept all the defaults
    cat ~/.ssh/id_rsa.pub                             # can paste this public key into your github, etc
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys2
Then, on each slave (replacing domU-xx-… with the internal name of the master node):
    apt-get install nfs-common
    echo "domU-xx-xx-xx-xx-xx-xx.compute-1.internal:/home /mnt/home nfs rw 0 0" >> /etc/fstab
    /etc/init.d/nfs-common restart
    mkdir /mnt/home
    mount /mnt/home
    ln -s /mnt/home/chimpy /home/chimpy
You should now be in business.
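A quick, hypothetical sanity check from a slave (the /mnt/home path comes from the fstab entry above):

```shell
# Check whether /mnt/home is actually an active mount point;
# prints one line either way.
if mountpoint -q /mnt/home 2>/dev/null; then
  echo "/mnt/home is mounted"
else
  echo "/mnt/home is NOT mounted"
fi
```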
Performance tradeoffs should be small as long as you’re just sending code files and gems around. Don’t write out log entries or data to NFS partitions, or you’ll effectively perform a denial-of-service attack on the master node.
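One way to keep logs off the shared partition is to point HADOOP_LOG_DIR (set in conf/hadoop-env.sh) at a local disk rather than the NFS-mounted home. A sketch, with an illustrative path:

```shell
# In conf/hadoop-env.sh: write logs to a local disk,
# not the NFS-mounted /mnt/home (the path below is illustrative).
export HADOOP_LOG_DIR=/mnt/local/hadoop-logs
echo "logs will go to ${HADOOP_LOG_DIR}"
```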
the --compress_output= flag. If you have the BZip2 patches installed, you can give
--compress_output=bz2; everyone should be able to use