Mixing Scala and Java in a Gradle project

This post is basically the twin of an earlier post, which describes the same process for maven.

I had the questionable pleasure of having to convert my existing Maven project to Gradle, which is almost as bad as Maven and a lot slower, but hell, which build tool is perfect anyway?

So without much further ado, here’s the basic structure:

apply plugin: 'scala'
apply plugin: 'eclipse'
sourceCompatibility = 1.7
version = '1.0'
configurations {
    provided
}
configurations.all {
    resolutionStrategy {
        force 'org.scala-lang:scala-library:2.10.4'
    }
}
jar {
    from(configurations.compile.collect { it.isDirectory() ? it : zipTree(it) }) {
        exclude "META-INF/*.SF"
        exclude "META-INF/*.DSA"
        exclude "META-INF/*.RSA"
        exclude "META-INF/license/*"
    }
    manifest {
        attributes('Implementation-Version': version,
                   'Built-By': System.getProperty('user.name'),
                   'Built-Date': new Date(),
                   'Built-JDK': System.getProperty('java.version'))
    }
}
repositories {
    mavenCentral()
    maven { url "http://conjars.org/repo/" }
    maven { url "http://repo.typesafe.com/typesafe/releases/" }
}
dependencies {
    compile group: 'commons-collections', name: 'commons-collections', version: '3.2'
    compile "org.apache.flume.flume-ng-sinks:flume-ng-elasticsearch-sink:1.5.0.1"
    testCompile group: 'junit', name: 'junit', version: '4.+'
    
}
test {
    systemProperties 'property': 'value'
}

 

I must confess that this is a great deal shorter than the Maven equivalent, mainly thanks to the simplicity of the Scala plugin, which takes care of compiling the Java classes as well.

The project structure as such follows the basic pattern of maven projects:

src/main/scala
src/main/java
src/main/resources

and pretty much the same for test.
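
With that in place, the usual Gradle invocations cover both languages; a quick usage sketch (the eclipse task comes from the eclipse plugin applied above):

gradle clean build    # compiles the Scala and Java sources, runs the tests and assembles the jar configured above
gradle eclipse        # regenerates the Eclipse project files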


Creating an ELB load balancer with private subnet instances in a VPC

I was facing massive issues with an ELB configuration that had the following setup:

  • All instances were part of an AWS VPC
  • Three subnets, one public, two private
  • Both private subnets contained the web containers (Tomcat) in two different availability zones

The issue was that whatever I did, the load balancer wasn’t routing requests to my instances. In the initial configuration, the ELB nodes were placed in the same (private) subnets as my Tomcat instances. Using curl against the instances themselves always worked, but requests against the ELB’s public address did not.

After some frustrating googling, I came up with the solution:

1. Your ELB instances cannot be launched in a private subnet, i.e. one that only reaches the internet through a NAT instance instead of an Internet Gateway.

2. Instead, you need to set up public subnets which “shadow” your private subnets in the same respective availability zones. In my case, I had two private subnets, 10.0.1.0 and 10.0.2.0; I created two public subnets, 10.0.10.0 and 10.0.20.0, to accommodate the ELB instances.

3. You need to fix up the routing for those new public subnets, i.e. associate them with the route table that goes through the Internet Gateway, so that outside traffic can reach the ELB and the ELB can still reach the instances in the private subnets. You can set this up in VPC -> Subnets. Here’s an example:

[Screenshot: subnet route table configuration]

In your subnet view, it should look like this:

 

 

[Screenshot: subnet list with route table associations]

Remember that the public shadow subnets (10.0.10.x and 10.0.20.x) are associated with the public route table (Internet Gateway), while the private subnets (10.0.1.x and 10.0.2.x) stay attached to the NAT instance’s route table.

4. You also need to adjust your security groups: the instances in your private subnets must explicitly allow traffic from the new public subnets (or from the ELB’s security group) on your application’s ports. A rough CLI sketch of steps 2-4 follows below.
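
For reference, this is roughly what steps 2-4 look like with the AWS CLI; all the IDs (vpc-xxxx, igw-xxxx, rtb-xxxx, subnet-xxxx, sg-xxxx), the availability zones and the port are placeholders for your own values:

# create the two public "shadow" subnets in the same AZs as the private ones
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.10.0/24 --availability-zone eu-west-1a
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.20.0/24 --availability-zone eu-west-1b

# a route table whose default route points at the Internet Gateway, associated with both public subnets
aws ec2 create-route-table --vpc-id vpc-xxxxxxxx
aws ec2 create-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx
aws ec2 associate-route-table --route-table-id rtb-xxxxxxxx --subnet-id subnet-publicA
aws ec2 associate-route-table --route-table-id rtb-xxxxxxxx --subnet-id subnet-publicB

# let the ELB's security group reach the Tomcat port (8080 as an example) in the private subnets
aws ec2 authorize-security-group-ingress --group-id sg-tomcat --protocol tcp --port 8080 --source-group sg-elb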

When you’ve done all that, you can create your ELB. If you already have an ELB that doesn’t work, delete it first: Amazon will not properly clean up ELB nodes in private subnets, and you’ll end up with more nodes than you asked for, some of them not working.

These are screenshots describing the relevant sections of the ELB creation process:

 

 

[Screenshot: ELB creation wizard, part 1]

[Screenshot: ELB creation wizard, part 2]


Mixing Scala and Java in a maven project

Most of us work in environments with a considerable amount of Java real estate. In order to integrate our Scala code into that setup, it’s sometimes necessary to mix Java and Scala in a single Maven project.

Here’s a working POM for such a project with respective source folders for both languages in

src/main/java

and

src/main/scala

	<build>
		<defaultGoal>package</defaultGoal>
		<resources>
			<resource>
				<directory>src/main/resources</directory>
				<filtering>true</filtering>
			</resource>
			<resource>
				<directory>src/test/resources</directory>
				<filtering>true</filtering>
			</resource>
		</resources>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-resources-plugin</artifactId>
				<configuration>
					<encoding>${project.build.sourceEncoding}</encoding>
				</configuration>
				<executions>
					<execution>
						<goals>
							<goal>copy-resources</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
			<plugin>
				<groupId>net.alchim31.maven</groupId>
				<artifactId>scala-maven-plugin</artifactId>
				<version>3.2.0</version>
				<configuration>
					<recompileMode>incremental</recompileMode>
					<args>
						<arg>-target:jvm-1.7</arg>
					</args>
					<javacArgs>
						<javacArg>-source</javacArg>
						<javacArg>1.7</javacArg>
						<javacArg>-target</javacArg>
						<javacArg>1.7</javacArg>
					</javacArgs>
				</configuration>
				<executions>
					<execution>
						<id>scala-compile</id>
						<phase>process-resources</phase>
						<goals>
							<goal>compile</goal>
						</goals>
					</execution>
					<execution>
						<id>scala-test-compile</id>
						<phase>process-test-resources</phase>
						<goals>
							<goal>testCompile</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<configuration>
					<source>1.7</source>
					<target>1.7</target>
				</configuration>
				<executions>
					<execution>
						<phase>compile</phase>
						<goals>
							<goal>compile</goal>
						</goals>
					</execution>
				</executions>
			</plugin>

		</plugins>
		<pluginManagement>
			<plugins>
				<!--This plugin's configuration is used to store Eclipse m2e settings 
					only. It has no influence on the Maven build itself. -->
				<plugin>
					<groupId>org.eclipse.m2e</groupId>
					<artifactId>lifecycle-mapping</artifactId>
					<version>1.0.0</version>
					<configuration>
						<lifecycleMappingMetadata>
							<pluginExecutions>
								<pluginExecution>
									<pluginExecutionFilter>
										<groupId>
											net.alchim31.maven
										</groupId>
										<artifactId>
											scala-maven-plugin
										</artifactId>
										<versionRange>
											[3.1.6,)
										</versionRange>
										<goals>
											<goal>compile</goal>
											<goal>testCompile</goal>
										</goals>
									</pluginExecutionFilter>
									<action>
										<ignore></ignore>
									</action>
								</pluginExecution>
							</pluginExecutions>
						</lifecycleMappingMetadata>
					</configuration>
				</plugin>
			</plugins>
		</pluginManagement>
	</build>

Please note that the maven-compiler-plugin is not strictly necessary, as the scala-maven-plugin compiles the Java code as well. The execution bindings make sure the Scala compiler runs first (in process-resources), so the Java code can reference Scala classes too. I also force JDK 7 onto the Scala compiler; this might not be necessary in future releases of the plugin.
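
A quick sanity check of the setup, nothing more than the standard Maven invocation:

# scala-compile is bound to process-resources, so it runs before maven-compiler's compile goal
mvn clean package

# classes from both languages should end up together under target/classes
ls target/classes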

The pluginManagement section at the end is only there for Eclipse’s m2e plugin; it suppresses pesky lifecycle-mapping warnings.


Installing Hadoop CDH4 + MR1 on Mac OS X

As I had to do this once again for a customer not yet running YARN, here are my install notes:

1. Set up SSH

SSH is preinstalled on Mac OS X; you just need to make sure that you have your keys set up properly:


ssh-keygen -t rsa -P ""

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

I’m assuming that you’re running this under your own user, as you want to use this as a dev environment. If not, create a user hadoop with remote login privileges.
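
To verify that passwordless SSH works (Remote Login needs to be enabled under System Preferences -> Sharing):

ssh localhost echo ok    # accept the host key on first connect; should print "ok" without asking for a password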

2. Download distribution

Download the packages for Hadoop proper and for MR1 from the CDH4 archive:

http://archive.cloudera.com/cdh4/cdh/4/

At the time of writing, the packages were:

  1. mr1-2.0.0-mr1-cdh4.2.2.tar.gz (for MR1)
  2. hadoop-2.0.0-cdh4.7.0.tar.gz (Hadoop + HDFS)

Unpack and move them to /opt or wherever you want to have them. I put them in /Servers.

Create symlinks so that mapred points to the MR1 directory and hadoop points to hadoop-2.0.0-cdh4.7.0.
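
For example, assuming everything was unpacked into /Servers (the MR1 directory name depends on the tarball you downloaded):

cd /Servers
ln -s hadoop-2.0.0-cdh4.7.0 hadoop
ln -s < directory the MR1 tarball unpacked to > mapred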

3. Configuration

Set JAVA_HOME. On Mac OS X it lives under /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home; change the version to match your installed JDK. Please note that CDH4 hasn’t been tested with JDK 8.

The best thing is to add JAVA_HOME to your ~/.profile or ~/.bashrc
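
For example (using the JDK path from above; /usr/libexec/java_home picks it up dynamically if you prefer):

# in ~/.profile or ~/.bashrc
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home
# or, alternatively:
# export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
export PATH=$JAVA_HOME/bin:$PATH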

Now edit /Servers/hadoop/etc/hadoop/core-site.xml:

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/Servers/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

</configuration>

Now hdfs-site.xml:

<configuration>

<property>
  <name>dfs.name.dir</name>
  <value>/Servers/hadoop/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
The default is ${hadoop.tmp.dir}/dfs/name.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/Servers/hadoop/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
  The default is ${hadoop.tmp.dir}/dfs/data.
  </description>
</property>

</configuration>

Now copy mapred-site.xml from the MR1 conf directory (/Servers/mapred/conf) to /Servers/hadoop/etc/hadoop and edit it:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by a task tracker.
  </description>
</property>

4. Format Namenode and run

Now run


cd /Servers/hadoop
./bin/hadoop namenode -format

Then start DFS:

./sbin/start-dfs.sh

Running jps should give you:

jps
45271 SecondaryNameNode
45342 Jps
45089 NameNode
45168 DataNode

Now start JobTracker:

/Servers/mapred/bin/start-mapred.sh

Running jps again should give you:

jps
45271 SecondaryNameNode
45401 JobTracker
45089 NameNode
45168 DataNode
45487 Jps
45468 TaskTracker

The web interfaces for the NameNode, JobTracker, TaskTracker and SecondaryNameNode are here:
NameNode – http://localhost:50070/
JobTracker – http://localhost:50030/
Task Tracker – http://localhost:50060
Secondary NameNode – http://localhost:50090
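
As a final smoke test, put a file into HDFS and list it back:

cd /Servers/hadoop
./bin/hadoop fs -mkdir /tmp
./bin/hadoop fs -put etc/hadoop/core-site.xml /tmp
./bin/hadoop fs -ls /tmp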


Running a Storm Cluster on Ubuntu

Step One:

Install Ubuntu Zookeeper Package.

sudo apt-get install zookeeper

ZooKeeper is located in /usr/share/zookeeper.

Revise /etc/zookeeper/conf/zoo.cfg for any config changes.
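
A quick way to check that ZooKeeper is actually up and listening on its default port 2181:

echo ruok | nc localhost 2181    # a healthy ZooKeeper answers with "imok"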

Step Two:

Download and unpack storm, for instance in /opt/ and create a storm symlink pointing to the distribution directory.

Edit /opt/storm/conf/storm.yaml and configure ZooKeeper:

storm.zookeeper.servers:
   - 10.101.20.218

Also create /var/lib/storm for local data, or point the storm.local.dir entry in storm.yaml somewhere else.

Optional

Create a user for storm and chown the distribution directory.
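
A minimal sketch of that optional step, assuming the unpacked distribution sits behind the /opt/storm symlink from above:

sudo useradd -r -s /bin/bash storm                                   # system user for the storm daemons
sudo mkdir -p /var/lib/storm                                         # the local data dir referenced from storm.yaml
sudo chown -R storm:storm /var/lib/storm "$(readlink -f /opt/storm)"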

Step Three:

Start nimbus:

sudo su storm -c './bin/storm nimbus &'

The output will be visible in logs/nimbus.log

After starting nimbus, start the UI:

sudo su storm -c './bin/storm ui &'

Step Four:

On the slave nodes, install the storm distribution by repeating the steps above.

Edit storm.yaml and add the entries for the supervisor:

nimbus.host: < master address >

supervisor.slots.ports: 
  - 6700
  - 6701
  - 6702

You can specify more and/or other ports; this reflects a three-worker configuration.

Now start the supervisor:

sudo su storm -c './bin/storm supervisor &'

Running stuff

To register a topology with the cluster, execute:

./bin/storm jar < path to your jar > < topology main class > < topology name >

To kill a topology, run:

./bin/storm kill < topology name >
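
For example, with a hypothetical topology jar and main class:

./bin/storm jar target/wordcount-topology-0.1.jar com.example.WordCountTopology wordcount-live
./bin/storm kill wordcount-live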


Reading Avro files from HDFS

If you want to read Avro files from HDFS and you’re using schema-generated classes instead of GenericRecords, you’ll have to use a SpecificDatumReader.

// path is an org.apache.hadoop.fs.Path and getConfiguration() returns your Hadoop Configuration;
// FsInput lives in the org.apache.avro.mapred package
SeekableInput input = new FsInput(path, getConfiguration());
DatumReader<SpecificSchemaClass> reader = new SpecificDatumReader<SpecificSchemaClass>(SpecificSchemaClass.class);
FileReader<SpecificSchemaClass> fileReader = DataFileReader.openReader(input, reader);
while (fileReader.hasNext()) {
    SpecificSchemaClass event = fileReader.next();
    // ... process event ...
}
fileReader.close();

So it’s basically as easy as reading the GenericRecords.

Don’t forget to add the dependencies if you’re using maven:

   <dependencies>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.7.5</version>
        </dependency>

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-tools</artifactId>
            <version>1.7.5</version>
        </dependency>

    </dependencies>
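
Incidentally, avro-tools also ships a small command line interface that is handy for sanity-checking the files you are about to read; the HDFS path and file names below are just examples:

hadoop fs -get /data/events/part-00000.avro /tmp/sample.avro
java -jar avro-tools-1.7.5.jar getschema /tmp/sample.avro         # print the embedded schema
java -jar avro-tools-1.7.5.jar tojson /tmp/sample.avro | head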

Fixing commons-logging and slf4j

For various reasons, commons-logging can be a nuisance, primarily because a) it is still widely used (e.g. by the Spring Framework) and b) its class-loading mechanism is so pervasive that you cannot easily use anything else.

To fix this, you will have to exclude commons-logging from your dependencies:

		
		<dependency>
			<groupId>org.springframework</groupId>
			<artifactId>spring-core</artifactId>
			<exclusions>
				<exclusion>
					<groupId>commons-logging</groupId>
					<artifactId>commons-logging</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		

and add SLF4J with bridging to replace it:

		
		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-api</artifactId>
		</dependency>
		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>jcl-over-slf4j</artifactId>
		</dependency>
		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
		</dependency>
		<dependency>
			<groupId>log4j</groupId>
			<artifactId>log4j</artifactId>
		</dependency>

Please be aware that you’ll need to provide versions if you don’t manage your dependencies in a parent POM, as we do in this example.

This will get rid of typical stack traces such as:

java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;Ljava/lang/Throwable;)V
        at org.apache.commons.logging.impl.SLF4JLocationAwareLog.info(SLF4JLocationAwareLog.java:120)
        at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:316)
        at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:303)
        at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:180)
        at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:216)
        at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:187)
        at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:251)
        at org.springframework.test.context.support.AbstractGenericContextLoader.loadBeanDefinitions(AbstractGenericContextLoader.java:253)

Execute this:

mvn dependency:tree | less

to find out where commons-logging sneaks in via your transitive dependencies.
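
The dependency plugin can also filter the tree for you, which saves some scrolling:

mvn dependency:tree -Dincludes=commons-logging:commons-logging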
