Installing Hadoop CDH4 + MR1 on Mac OS X

As I had to do this once again for a customer not yet running YARN, here are my install notes:

1. Set up SSH

SSH comes preinstalled on Mac OS; you just have to make sure that your keys are set up properly:


ssh-keygen -t rsa -P ""

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

I’m assuming that you’re running this under your own user, since you want to use this as a dev environment. If not, create a hadoop user with remote login privileges.
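Before moving on, it’s worth verifying that passwordless SSH actually works, since the Hadoop start scripts depend on it:

# should log you in without a password prompt
ssh localhost
exit

If the connection is refused, enable Remote Login in System Preferences under Sharing first.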

2. Download distribution

Download the packages for Hadoop proper and MR1 from the CDH archive:

http://archive.cloudera.com/cdh4/cdh/4/

At the time of writing, the packages were:

  1. mr1-2.0.0-mr1-cdh4.2.2.tar.gz (for MR1)
  2. hadoop-2.0.0-cdh4.7.0.tar.gz (Hadoop + HDFS)
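If you prefer to grab them from the shell, something like this should work (assuming the tarballs still sit at the top level of that archive directory):

curl -O http://archive.cloudera.com/cdh4/cdh/4/mr1-2.0.0-mr1-cdh4.2.2.tar.gz
curl -O http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.7.0.tar.gz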

Unpack the tarballs and move them to /opt or wherever you want to have them. I put mine in /Servers.

Symlink the MR1 directory to mapred and hadoop-2.0.0-cdh4.7.0 to hadoop.
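Assuming /Servers and the file names above, the whole step looks roughly like this (the directory the MR1 tarball unpacks to may be named differently depending on the release, so adjust the first ln argument):

sudo mkdir -p /Servers
tar -xzf mr1-2.0.0-mr1-cdh4.2.2.tar.gz -C /Servers
tar -xzf hadoop-2.0.0-cdh4.7.0.tar.gz -C /Servers
cd /Servers
# adjust if your MR1 tarball unpacked to a different directory name
ln -s hadoop-2.0.0-mr1-cdh4.2.2 mapred
ln -s hadoop-2.0.0-cdh4.7.0 hadoop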

3. Configuration

Set JAVA_HOME. On Mac OS it lives in
/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home; change the update version to match your installed JDK release. Please note that CDH4 hasn’t been tested with JDK 8.

The best thing is to export JAVA_HOME in your ~/.profile or ~/.bashrc.
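For example, using Apple’s java_home helper, which resolves the path for you so the update version doesn’t have to be hard-coded:

# picks up whatever 1.7 JDK is currently installed
export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)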

Now edit /Servers/hadoop/etc/hadoop/core-site.xml:

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/Servers/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

</configuration>

Now hdfs-site.xml:

<configuration>

<property>
  <name>dfs.name.dir</name>
  <value>/Servers/hadoop/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
The default is ${hadoop.tmp.dir}/dfs/name.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/Servers/hadoop/dfs/data</value>
  <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
  The default is ${hadoop.tmp.dir}/dfs/data.
  </description>
</property>

</configuration>
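It doesn’t hurt to create the directories referenced above before the first start; the format step and the DataNode will create them anyway, but doing it by hand surfaces permission problems early:

mkdir -p /Servers/hadoop/tmp
mkdir -p /Servers/hadoop/dfs/name
mkdir -p /Servers/hadoop/dfs/data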

Now copy mapred-site.xml from the MR1 conf directory into the Hadoop config directory (/Servers/hadoop/etc/hadoop) and edit it.
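With the symlinks from step 2 in place, the copy is simply:

cp /Servers/mapred/conf/mapred-site.xml /Servers/hadoop/etc/hadoop/

Then set at least the following properties: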

<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

</configuration>

4. Format NameNode and run

Now run


cd /Servers/hadoop
./bin/hadoop namenode -format

Then start DFS:

./sbin/start-dfs.sh

Running jps should give you:

jps
45271 SecondaryNameNode
45342 Jps
45089 NameNode
45168 DataNode

Now start JobTracker:

/Servers/mapred/bin/start-mapred.sh

Running jps again should give you:

jps
45271 SecondaryNameNode
45401 JobTracker
45089 NameNode
45168 DataNode
45487 Jps
45468 TaskTracker

The web interfaces for the NameNode, JobTracker, TaskTracker and SecondaryNameNode are here:
NameNode – http://localhost:50070/
JobTracker – http://localhost:50030/
TaskTracker – http://localhost:50060/
Secondary NameNode – http://localhost:50090/
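As a quick smoke test, put something into HDFS and run one of the example jobs that ship with MR1. A sketch, assuming the examples jar sits at the top of the MR1 directory (its exact name depends on the release, hence the glob) and that HADOOP_CONF_DIR isn’t already set in your shell:

cd /Servers/hadoop
./bin/hadoop fs -mkdir /tmp
./bin/hadoop fs -ls /

# point the MR1 client at the config directory we edited above
export HADOOP_CONF_DIR=/Servers/hadoop/etc/hadoop
# pi estimator: needs no input data, just computes and prints an estimate
/Servers/mapred/bin/hadoop jar /Servers/mapred/hadoop-examples-*.jar pi 4 1000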
