Nothing ventured, nothing gained

a blog by Marc Chung

Gisting, an early preview of MapReduce in Ruby


by Marc Chung

From the datamining, google, openrain, opensource, and ruby part of the brain.

Earlier this year I gave a talk on ruby2ruby at my local Phoenix Users group. I followed up with a longer and more technical talk at RubyConf 2008. Not wanting to show up with a lack of code, I demonstrated the power of ruby2ruby by writing a couple of programs.

One of the programs I wrote is called Gisting, an open source Ruby implementation of Google’s MapReduce framework, which simplifies writing distributed, data-intensive applications.

require "pp"

inputs = args  # the input file patterns to process
spec = Gisting::Spec.new
inputs.each do |file_input|
  input = spec.add_input
  input.file_pattern = file_input
  input.map do |map_input|
    # Sample input lines (tab-separated; the second field is the search query):
    # 2722  mailbox  2006-05-23 00:08:39
    # 217  -  2006-05-23 15:41:48
    # 1326  www.crazyradiodeals.com  2006-05-23 18:00:30
    # 2722  mailbox  2006-05-23 00:08:39
    # 2722  mailbox  2006-05-23 00:08:42
    # 2722  jc whitney  2006-05-23 00:25:47  1  http://www.jcwhitney.com
    words = map_input.strip.split("\t")
    Emit(words[1], "1")  # emit the query with a count of "1"
  end
end
output = spec.output
output.filebase = "/Volumes/gisting/datasets/output"
output.num_tasks = 2
output.reduce do |reduce_input|
  # Sum every "1" that was emitted for the same query
  count = 0
  reduce_input.each do |value|
    count += value.to_i
  end
  Emit(count)
end

result = MapReduce(spec)
pp result
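
If MapReduce is new to you, here is a rough plain-Ruby sketch of what the job above computes. It does not use Gisting and says nothing about how Gisting works internally. The method names (map_line, group_by_key, reduce_values, frequency_count) are made up for illustration, and everything runs in a single process on three sample lines.

```ruby
# Plain-Ruby sketch of the map -> group -> reduce flow; no Gisting involved.
# All method names below are hypothetical, chosen for illustration only.

# Map phase: turn one tab-separated log line into [key, value] pairs.
def map_line(line)
  fields = line.strip.split("\t")
  query = fields[1]
  query.nil? ? [] : [[query, "1"]]
end

# Shuffle phase: collect the emitted values under their key.
def group_by_key(pairs)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  pairs.each { |key, value| grouped[key] << value }
  grouped
end

# Reduce phase: sum the values collected for a single key.
def reduce_values(values)
  values.inject(0) { |count, value| count + value.to_i }
end

# Run all three phases over an array of log lines.
def frequency_count(lines)
  pairs = lines.flat_map { |line| map_line(line) }
  group_by_key(pairs).each_with_object({}) do |(key, values), result|
    result[key] = reduce_values(values)
  end
end

sample = [
  "2722\tmailbox\t2006-05-23 00:08:39",
  "217\t-\t2006-05-23 15:41:48",
  "2722\tmailbox\t2006-05-23 00:08:42",
]
counts = frequency_count(sample)
p counts
```

The point of a framework is that the grouping step and the distribution of work across machines are handled for you, so only the map and reduce blocks need to be written by hand.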

After the talk, I got so much positive feedback that I decided to build a releasable version of the software. The software is almost ready for a public release, but before that happens, I’d like to announce an early preview.

There isn’t much documentation available just yet, but I wanted to show just how easy it is to write MapReduce programs with Gisting. The Gisting::Spec snippet earlier in this post performs a frequency count of the queries in the AOL search logs.

Keep in mind that this is an early preview. I’m well aware that it needs a lot of TLC before I’ll be happy making a 1.0 release, including:

  • A test suite :(
  • A screencast of Gisting basics
  • A homepage/website with examples and documentation
  • Running Gisting in the cloud (Amazon EC2)

That said, I’m planning to release a gem in a few weeks. In the meantime, I hope you enjoy this early preview of Gisting.

Want to know more?

I'm Marc Chung, and you're reading Nothing ventured, Nothing gained, a blog about building beautiful software. I'm the founder of OpenRain Software, a web design and development company located in Arizona, where I make millions of users happy by building breathtaking software with brilliant people.
