Setting up Hadoop on Windows Vista

Usually you’ll want to set up hadoop on a UNIX variant like Linux, Solaris or even Mac OS X. For one, hadoop DFS is so close to UNIX-type systems in its usage that running it on Windows feels a lot more “alien”. Besides that, large Windows-based clusters haven’t been tested yet, most probably for the aforementioned reason.

However, in a development environment, Windows is a lot more commonplace. So, after having used hadoop in a VM on my Vista box or on my MacBook, I took heart and explored the necessary steps to run it on Vista (or any other of those Windows flavours).

I. Cygwin

The first step is to install cygwin. This is an absolute requirement to overcome the “alien” part of the Windows shell handling. Installing cygwin is easy: its setup program lets you choose which ported packages you want to install and where. Make sure to choose the openssh package in the Net category.

When you’ve set up cygwin, you have to set up your ssh keys and install sshd as a service. Follow these steps:

0. Open cygwin:

Right-click the cygwin icon and choose “Run as administrator” from the context menu.

1. Set permissions:

With recent releases of cygwin, there are many permission problems.
Run these four commands as a workaround:

chmod +r  /etc/passwd
chmod u+w /etc/passwd
chmod +r  /etc/group
chmod u+w /etc/group


2. Run SSH configuration script:

ssh-host-config

The script will ask a lot of questions. Answer all of them with “yes”, except for “This script plans to use cyg_server, Do you want to use a different name?”, where the answer is “no”. The script will also prompt you for the value of the CYGWIN environment variable. Set it to ntsec tty.

ntsec is an option of the CYGWIN environment variable that instructs cygwin to use Windows’ security rules for controlling users’ access to files and other operating system facilities.

tty is another option of the CYGWIN variable; it makes cygwin work properly with editors. It stands for “teletype”. That’s stuff from way back in time ;-)

The script will also ask you for a password. This will allow you to connect to your windows box.

3. Install sshd as a service

Call net start sshd. This starts the sshd service that ssh-host-config registered. Since it is a Windows service, sshd can also be started automatically upon Windows startup.

4. Test.

$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 9f:48:5e:da:0f:11:3b:19:29:56:9f:0b:34:45:2b:4b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

You’ll be prompted for your user’s password.

Now we can start installing hadoop.

II. Hadoop

Hadoop supports two operating modes: single instance and clustered. For the sake of simplicity, we’ll outline setting up a single instance installation. Now it’s time to go in medias res:

0. Prerequisites

The first and major prerequisite for running hadoop is of course Java. You’ll need a current JDK, which you can obtain from Sun at http://java.sun.com. Hadoop recommends Java 6, although it can still be run with Java 5. The next requirement is key-based ssh access without a passphrase. For that you’ll need to generate a key pair and add the public key to the authorized keys in your .ssh directory:

$> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Now test it:

$> ssh localhost
Last login: Sat Dec  5 15:51:02 2009 from 127.0.0.1

If you get an error, then your sshd is most probably not running. See the section above.
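
If sshd is running but ssh localhost still asks for a password, the permissions on the .ssh directory are another thing worth checking: with its default StrictModes setting, sshd ignores authorized_keys when the file or its directory is writable by other users. Here is a minimal sketch of a fix. The function name fix_ssh_perms is my own invention and not part of OpenSSH or cygwin:

```shell
# Hypothetical helper (the name is illustrative): tighten the permissions
# of an ssh directory so that sshd accepts the keys stored in it.
fix_ssh_perms() {
    dir=${1:-"$HOME/.ssh"}
    [ -d "$dir" ] || return 1
    # Only the owner may read, write or enter the directory.
    chmod 700 "$dir"
    # Only the owner may read or write the list of authorized keys.
    if [ -f "$dir/authorized_keys" ]; then
        chmod 600 "$dir/authorized_keys"
    fi
}
```

Called without an argument, fix_ssh_perms works on ~/.ssh.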

1. Hadoop installation

Grab a distribution from http://hadoop.apache.org. At the time of writing it’s 0.20.1. Unpack it somewhere handy on your C: drive. I put it in C:\Servers, where I keep other stuff like JBoss, Tomcat and such.

After unpacking, we’ll do some symbolic linking to facilitate handling in cygwin:

Link Hadoop:

$> mkdir /u01

$> ln -s /cygdrive/c/Servers/hadoop/ /u01/hadoop

It doesn’t have to be /u01; you can put it anywhere you like in your own cygwin environment.

Now symlink Java:

$> ln -s /cygdrive/c/Program\ Files/Java/jdk1.6.0_11/ /usr/java
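
The JDK directory name changes with every update, so jdk1.6.0_11 above is just what I had installed. As a rough sketch, a small function can pick the JDK directory for you. pick_jdk is a made-up name, and it simply takes the last jdk* directory in lexical order, so treat the result as a hint and check it before linking:

```shell
# Hypothetical helper: print the last jdk* directory (in lexical order)
# found below the given base directory. Fails if there is none.
pick_jdk() {
    base=$1
    found=
    for d in "$base"/jdk*/; do
        [ -d "$d" ] && found=${d%/}
    done
    [ -n "$found" ] && printf '%s\n' "$found"
}

# Example (not executed here): link the result instead of a fixed version.
# jdk=$(pick_jdk "/cygdrive/c/Program Files/Java") && ln -s "$jdk" /usr/java
```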

2. Hadoop configuration

Hadoop now needs to know where Java is installed, so cd to /u01/hadoop/conf and edit hadoop-env.sh. Locate the JAVA_HOME line, uncomment it, and set the path to /usr/java.
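
If you prefer not to open an editor, the same change can be sketched with sed. This assumes the line in hadoop-env.sh looks like "# export JAVA_HOME=..." (or is already uncommented); set_java_home is a name I made up for this post:

```shell
# Hypothetical helper: uncomment (or replace) the JAVA_HOME line in a
# hadoop-env.sh style file. Assumes the line has the form
# "# export JAVA_HOME=..." or "export JAVA_HOME=...".
# The new path must not contain the "|" character used as sed delimiter.
set_java_home() {
    file=$1
    java_home=$2
    tmp="$file.tmp.$$"
    sed "s|^#* *export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$file" > "$tmp" &&
        cat "$tmp" > "$file" &&
        rm -f "$tmp"
}

# Example (not executed here):
# set_java_home /u01/hadoop/conf/hadoop-env.sh /usr/java
```

Writing the result back with cat instead of mv keeps the permissions of the original file.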

If you use the new configuration scheme which came with hadoop 0.20, you’ll want to configure at least HDFS and MapReduce. Here are the contents of hdfs-site.xml:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
<name>dfs.data.dir</name>
<value>/u01/hadoop/data/dfs/data</value>
<description>Determines where on the local filesystem a DFS data node
should store its blocks. If this is a comma-delimited list of
directories, then data will be stored in all named directories,
typically on different devices. Directories that do not exist are
ignored.
</description>
</property>

<property>
<name>dfs.name.dir</name>
<value>/u01/hadoop/data/dfs/name</value>
<description>Determines where on the local filesystem the DFS name
node should store the name table. If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy.
</description>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost</value>
</property>

</configuration>

mapred-site.xml looks like this:


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://localhost:54311</value>
</property>
<property>
<name>dfs.replication</name>
<value>8</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>

</configuration>

core-site.xml can stay empty.
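
“Empty” here means a file with an empty configuration element, not a zero-byte file, which would not be well-formed XML. As far as I remember, the distribution already ships core-site.xml in that state. In case yours got lost, this sketch recreates it; write_empty_site is a name I made up:

```shell
# Hypothetical helper: write a site file that contains no properties
# but is still well-formed XML.
write_empty_site() {
    cat > "$1" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
</configuration>
EOF
}

# Example (not executed here):
# write_empty_site /u01/hadoop/conf/core-site.xml
```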

Once you’ve set up Java and hadoop, you’re ready to initialize the DFS file system. In your cygwin shell, cd to the hadoop root directory and issue the following command:

$>bin/hadoop namenode -format

It will initialize the DFS. You’ll see some output indicating the physical location of the DFS on your system.

After that you’ll be able to start hadoop:

$>/u01/hadoop/bin/start-all.sh

does the trick and starts all hadoop components. You can check if everything is running correctly by calling jps:

$>/usr/java/bin/jps.exe

(it’s still a Windows system, right?) which should output the following:

5004 DataNode
5772 Jps
4120 JobTracker
4220 SecondaryNameNode
2044 NameNode
5972 TaskTracker

The numbers are the windows PIDs.
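
To avoid comparing that list by eye, the check can be sketched as a small filter that reads the jps output and names every daemon that is missing. The function name missing_daemons is my own; the daemon names are the ones from the listing above:

```shell
# Hypothetical helper: read jps output on stdin and print every expected
# daemon that does not show up in it. Returns non-zero if any is missing.
missing_daemons() {
    # jps.exe is a Windows program, so strip carriage returns first.
    out=$(tr -d '\r')
    status=0
    for name in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Match the name as the last word on a line, after the PID.
        if ! printf '%s\n' "$out" | grep -q " $name\$"; then
            echo "$name"
            status=1
        fi
    done
    return $status
}

# Example (not executed here):
# /usr/java/bin/jps.exe | missing_daemons
```

If the function prints nothing, all five daemons are up. Otherwise have a look at the log files of the daemons it names.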

III. Conclusion

Now you should have hadoop running on your Vista box. Please keep in mind that all command line handling has to go through the cygwin environment. If you plan to do some scripting or call your MapReduce programs, you always have to use that intermediary. If you’re looking for your Windows files in cygwin, you’ll find all lettered Windows drives under /cygdrive/.
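
Cygwin ships the cygpath tool for converting between the two path styles (cygpath -u for the UNIX form, cygpath -w for the Windows form), and that is what you should use in real scripts. Just to illustrate the mapping, here is a simplified sketch for absolute paths with a drive letter; win_to_cygdrive is a made-up name:

```shell
# Hypothetical helper, for illustration only: map an absolute Windows
# path such as C:\Servers\hadoop to its /cygdrive form.
# In real scripts use cygwin's own cygpath instead.
win_to_cygdrive() {
    # Everything before the first colon is the drive letter, lower-cased.
    drive=$(printf '%s' "${1%%:*}" | tr 'A-Z' 'a-z')
    # The rest is the path; turn backslashes into forward slashes.
    rest=$(printf '%s' "${1#*:}" | tr '\\' '/')
    printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

win_to_cygdrive 'C:\Servers\hadoop'
```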

This entry was posted in Distributed Computing.

9 Responses to Setting up Hadoop on Windows Vista

  1. Purl says:

    Thank you.. for the post .. However I just see JPS, Namenode and JobTracker running. I dont see Datanode and Tasktracker running. I dont see anything on the logs. What am I missing ??

  2. Purl says:

    I could manually start datanode successfully. I see an issue with starting task tracker. Could you help? I am installing hadoop 1.1.1. Thanks.

    2013-02-24 22:36:37,738 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: \tmp\hadoop-cyg_server\mapred\local\taskTracker to 0755
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:670)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:810)
    at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1557)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3893)

    • itellity says:

      I guess you need to check whether you have administrative permissions set for the user under which hadoop runs.

      • Purl says:

        I think that’s the issue. I changed the user to be admin and ran the task tracker manually. Now the issue I saw yesterday is gone. I checked the logs, and it seemed like task tracker started fine, but when I ran jps.exe it doesn’t show me the PID for task tracker. Plus I see this in the logs

        INFO org.apache.hadoop.mapred.JobTracker: HDFS initialized but not ‘healthy’ yet, waiting…

        and when i re-ran the command to start task tracker, it complains that its already running at some PID. just doesnt show when i ran jps.exe

        any thoughts ? thanks again

  3. itellity says:

    Stop all processes and reformat the namenode. after that it should work like a charm.

    • Purl says:

      I tried that .. doesnt start tasktracker.
      org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Failed to set permissions of path: \tmp\hadoop-cyg_server\mapred\local\taskTracker to 0755
      at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)

      could be an issue with version. online people have suggested to downgrade the version of hadoop. not sure if thats my only option. right now am working on hadoop 1.1.1

      • itellity says:

        That’s very probable. Last time i tried to set up hadoop on Windows was with version 0.20. You might want to try that or run 1.0 on a Linux box within VirtualBox.

  4. Purl says:

    Finally .. got it working with version 1.0.4 ..thanks for your help ..
    there is a patch for the permissions issue ..
    https://github.com/congainc/patch-hadoop_7682-1.0.x-win

  5. foo says:

    This is awesome, thank you. I don’t seem to have jps.exe though. Can’t find out much about it online either.
