Gisting, an early preview of MapReduce in Ruby
Earlier this year I gave a talk on ruby2ruby at my local Phoenix Users group. I followed up with a longer and more technical talk at RubyConf 2008. Not wanting to show up without code, I demonstrated the power of ruby2ruby by writing a couple of programs.
One of those programs is Gisting, an open source Ruby implementation of Google’s MapReduce framework, which simplifies writing distributed, data-intensive applications.
After the talk, I got so much positive feedback that I decided to build a releasable version of the software. The software is almost ready for a public release, but before that happens, I’d like to announce an early preview.
There isn’t much documentation available yet, but I wanted to show how easy it is to write MapReduce programs with Gisting. Here’s a snippet that performs a frequency count over the AOL search logs:
inputs = args
spec = Gisting::Spec.new
inputs.each do |file_input|
  input = spec.add_input
  input.file_pattern = file_input
  input.map do |map_input|
    # 2722 mailbox 2006-05-23 00:08:39
    # 217 - 2006-05-23 15:41:48
    # 1326 www.crazyradiodeals.com 2006-05-23 18:00:30
    # 2722 mailbox 2006-05-23 00:08:39
    # 2722 mailbox 2006-05-23 00:08:42
    # 2722 jc whitney 2006-05-23 00:25:47 1 http://www.jcwhitney.com
    words = map_input.strip.split("\t")
    Emit(words[1], "1")
  end
end
output = spec.output
output.filebase = "/Volumes/gisting/datasets/output"
output.num_tasks = 2
output.reduce do |reduce_input|
  count = 0
  reduce_input.each do |value|
    count += value.to_i
  end
  Emit(count)
end
result = MapReduce(spec)
pp result
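To make the data flow of the frequency-count job concrete, here is a rough sketch that replays its map, shuffle, and reduce steps in a single Ruby process. It does not use Gisting at all: the names `SAMPLE_LOG`, `map_phase`, `shuffle_phase`, `reduce_phase`, and `frequency_count` are made up for illustration, and the sample lines assume the log fields are tab-separated, as the `split("\t")` call in the job suggests.

```ruby
# Plain-Ruby, single-process sketch of the frequency count.
# This is NOT Gisting's API; every name below is hypothetical.

# Assumed layout: user id, query, timestamp, then optionally rank and URL,
# separated by tabs.
SAMPLE_LOG = [
  "2722\tmailbox\t2006-05-23 00:08:39",
  "217\t-\t2006-05-23 15:41:48",
  "1326\twww.crazyradiodeals.com\t2006-05-23 18:00:30",
  "2722\tmailbox\t2006-05-23 00:08:42",
  "2722\tjc whitney\t2006-05-23 00:25:47\t1\thttp://www.jcwhitney.com"
].freeze

# Map phase: emit a [query, "1"] pair for every log line.
def map_phase(lines)
  lines.map do |line|
    fields = line.strip.split("\t")
    [fields[1], "1"]
  end
end

# Shuffle phase: group the emitted values by key.
def shuffle_phase(pairs)
  groups = Hash.new { |hash, key| hash[key] = [] }
  pairs.each { |key, value| groups[key] << value }
  groups
end

# Reduce phase: sum the values collected for each key.
def reduce_phase(groups)
  groups.map { |key, values| [key, values.sum { |v| v.to_i }] }.to_h
end

def frequency_count(lines)
  reduce_phase(shuffle_phase(map_phase(lines)))
end

counts = frequency_count(SAMPLE_LOG)
counts.each { |query, count| puts "#{query}\t#{count}" }
```

On the five sample lines, "mailbox" ends up with a count of 2 and every other query with 1. A real MapReduce run spreads the map and reduce steps across many workers; this sketch only shows what each step computes.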
Keep in mind that this is an early preview. I’m well aware that it needs a lot of TLC before I’ll be happy making a 1.0 release, including:
- A test suite :(
- A screencast of Gisting basics
- A homepage/website with examples and documentation
- Running Gisting in the cloud (Amazon EC2)
That said, I’m planning to release a gem in a few weeks. In the meantime, I hope you enjoy this early preview of Gisting.