Tag Archives: hadoop

Scalding merge, concatenation and joins of pipes

I recently built a Scalding job that ran every day, collecting a set of ids with timestamps to determine the newest and oldest occurrence of a set, whilst merging that with a previously aggregated set. A very simple task, involving simple mapping … Continue reading

Posted in Scala, Uncategorized

Counters using Cascading Flow Listeners in Scalding

As of now, Scalding doesn’t provide full support for counters: you will find a few pull requests and the Stats class, nothing more. This will probably change in the future; until then, I found using Cascading FlowListeners for counters … Continue reading

Posted in Uncategorized

Hadoop: Useful links

In the following, I’ve listed some very useful resources for Hadoop:

http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
http://stackoverflow.com/questions/964332/java-large-files-disk-io-performance
http://www.cloudera.com/blog/2010/01/hadoop-world-building-data-intensive-apps-with-hadoop-and-ec2/
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
https://twiki.grid.iu.edu/bin/view/Storage/HadoopOperations#Cleaning_Up_a_CORRUPT_Filesystem

Posted in Distributed Computing, Enterprise Java