November 13, 2008

Gisting, an early preview of MapReduce in Ruby

Earlier this year I gave a talk on ruby2ruby at my local Phoenix Users group.  I followed up with a longer and more technical talk at RubyConf 2008.  Not wanting to show up with a lack of code, I demonstrated the power of ruby2ruby by writing a couple of programs.

One of the programs I wrote is called Gisting, an open source Ruby implementation of Google's MapReduce framework, which simplifies writing distributed, data-intensive applications.


After the talk, I got so much positive feedback that I decided to build a releasable version of the software.   The software is almost ready for a public release, but before that happens, I'd like to announce an early preview.

There isn't much documentation available just yet, but I wanted to show just how easy it is to write MapReduce programs with Gisting.  Here's a snippet that performs a Frequency count for the AOL search logs:
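
For readers new to MapReduce, a frequency count boils down to a map step that emits (term, 1) pairs and a reduce step that sums them per term. The following is a plain-Ruby sketch of that idea only. It does not use Gisting's API, and the method names and sample queries are made up for illustration.

```ruby
# Map phase: emit a [term, 1] pair for every search query in the log.
def map_phase(lines)
  lines.map { |line| [line.strip.downcase, 1] }
end

# Shuffle + reduce phase: group the pairs by term and sum the counts.
def reduce_phase(pairs)
  pairs.group_by { |term, _| term }
       .map { |term, group| [term, group.sum { |_, count| count }] }
       .to_h
end

queries = ["ruby", "mapreduce", "Ruby", "ruby2ruby"] # hypothetical sample data
counts  = reduce_phase(map_phase(queries))
# counts["ruby"] is 2; every other term appears once
```

In a real MapReduce framework the two phases would run on different machines; here they are ordinary method calls so the data flow is easy to follow.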


Keep in mind that this is an early preview, so I'm well aware that it needs a lot of TLC before I'll be happy making a 1.0 release, such as:
  • A test suite :(
  • A screencast of Gisting basics
  • A homepage/website with examples and documentation
  • Running Gisting in the clouds (Amazon EC2).
That said, I'm planning to release a gem in a few weeks.  In the meantime, I hope you enjoy this early preview of Gisting.

September 05, 2008

Chrome's Process Model Explained

Recently, Google released the Chrome web browser, which they describe as being the next step in web browsers for the current gamut of JavaScript intensive web applications. One new feature I'm particularly excited about is process affinity.  The online comic describes each tab as a separate running process.

Why is this important?

The short answer is robustness.  A web application running in your browser is a lot like an application running on your operating system, with one important distinction: modern operating systems[1] run applications in their own separate process space, while modern browsers[2] run web applications in the same process space.

By running applications in separate processes, the OS can terminate a malicious (or poorly written) application without affecting the rest of the OS.  The browser, on the other hand, can't do this.  Consequently, a single rogue application can suck up mountains of memory and eventually crash your entire browser session, along with every other web application you were using at the time.

Chrome differs by running each web application in a separate process space.  By doing this, Chrome--or a user--can terminate a single web application without affecting the other tabs you have open. 

Process affinity in Chrome

Chrome's process model is extremely sophisticated.  The web comic only mentions the default behavior, but you can configure Chrome to manage processes differently: one process per web site, or one process per group of connected tabs, or one process for everything.

Process-per-site-instance

By default, there are two main Chrome processes, the Browser and the Renderer. The single Browser process is responsible for transporting messages to and from the Renderer, which in turn is responsible for rendering webpages to the user.

Browser_to_renderer

  • 1 Browser process communicates with N Renderer processes.

Each Renderer process has two threads: one Render thread--which renders web pages, and one IPC thread--which transports data in a thread-safe, non-blocking manner between the Render thread and an IPC counterpart sitting in the Browser process.

  • The Renderer process manages 1 IPC thread and 1 Render thread

Completely separate visits to the same site are managed by different processes, so if you had two tabs open to mail.google.com, one of them could crash without affecting the other.  Chrome treats separate browsing contexts as separate processes.

Process_per_site_instance

If you're on mail.google.com, and you navigate to hotmail.com, the tab's underlying process may switch.  In this case, Chrome switches your browsing context because you navigated to another site.

If a web page pops up another webpage (via JavaScript), then the sites are considered connected, and managed by the same process.  Chrome uses a single Renderer process to handle a browsing context.

This is Chrome's default behavior and is called process-per-site-instance. It's intuitive in that your tab count is (more or less) your process count.

Process-per-site

Since multiple tabs can be assigned to a single Renderer process, wouldn't it be neat if the Renderer process could manage a group of sites?

That's what process-per-site does. Chrome defines a "site" similarly to the Same Origin Policy with subdomains added into the mix. 

For example, in process-per-site mode, mail.google.com, docs.google.com and reader.google.com are all managed by a single Renderer process.  If one of those web applications crashes, then the responsible Renderer process will crash, thus taking down the entire collection of tabs.

Process_per_site  

Unlike the previous process model, a tab does not imply a separate Renderer process.

Process-per-tab

The third and most intuitive process model is called process-per-tab.

In this model, tabs have their own process but unlike process-per-site-instance and process-per-site, none of the underlying process switching logic is applied.

Each tab has its own process for the life of the tab, so a tab will never change process even if a user consecutively visits hotmail.com, gmail.com, and ymail.com.

Process_per_tab

One process per tab, forever.

Single-process

Finally, the fourth and simplest process model is the single-process behavior. You can run Chrome in a mode that combines the Browser and Renderer process into a single process.  This makes Chrome behave a lot like the browsers we have today[2].

Choose your Process model

Anyway, if you made it this far down, the takeaway from all this is that the various process models define different ways of assigning tabs to processes.  Your user experience will therefore vary depending on your OS, your browsing behavior, and the websites you frequent.
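
One way to picture the four models is as four different rules for picking the process a tab lands in. The Ruby sketch below is a toy model and not Chrome's actual logic: the Tab struct, the helper names, and the "last two labels of the host" definition of a site are all assumptions made for illustration.

```ruby
# Hypothetical model of a tab: its id, the browsing context (group of
# connected tabs) it belongs to, and the host it is currently showing.
Tab = Struct.new(:id, :context, :host)

# Rough stand-in for Chrome's notion of a "site": the last two host labels.
def site_for(host)
  host.split(".").last(2).join(".")
end

# The key that decides which Renderer process a tab is assigned to.
def process_key(tab, model)
  case model
  when :process_per_site_instance then [tab.context, site_for(tab.host)]
  when :process_per_site          then site_for(tab.host)
  when :process_per_tab           then tab.id
  when :single_process            then :everything
  else raise ArgumentError, "unknown model: #{model}"
  end
end

# Number of distinct Renderer processes needed for a set of tabs.
def process_count(tabs, model)
  tabs.map { |tab| process_key(tab, model) }.uniq.size
end

tabs = [
  Tab.new(1, :a, "mail.google.com"),
  Tab.new(2, :b, "mail.google.com"), # a separate visit to the same site
  Tab.new(3, :b, "docs.google.com"), # popped up from tab 2, so connected
  Tab.new(4, :c, "hotmail.com")
]

process_count(tabs, :process_per_site_instance) # => 3
process_count(tabs, :process_per_site)          # => 2
process_count(tabs, :process_per_tab)           # => 4
process_count(tabs, :single_process)            # => 1
```

The same four tabs need anywhere from one to four processes depending on the rule, which is why the choice of model changes how much a single crash takes down.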

To use a specific process model, you can launch Chrome with one of the following arguments.

--process-per-site
--process-per-tab
--single-process

If you're interested in reading more about memory isolation and the challenges in building a browser like Chrome, check out Charlie Reis' paper on
Using Processes to Improve the Reliability of Browser-based Applications. Chrome's process model is derived from this paper.

Thanks to Ben Smith, and the developers in #chromium (irc.freenode.org) for reading drafts of this article.

Updated: Charlie dives into the reasons for a multi-process architecture browser.

[1] Vista, Linux, Unix, OS X, pretty much anything after Windows 2000. 
[2] I'm specifically referring to Firefox 3 and Safari 3, which run in a single process.  I'm not familiar with
Opera, Konqueror, or Explorer's process model, so there may already be browsers which do a great job at isolating processes or managing threads (Like Opera, I love Opera).

July 18, 2008

OpenRain 1.0

In less than a year, OpenRain went from this:

_mg_0928s
Ruby, ruby, ruby

_mg_0930s
I miss the guitar

_mg_0931s
And the Elliptical.. in our conference room

to this:

_mg_1544s_2
Seating area

_mg_1547s_3
Hacking area

_mg_1549s
Eating area

To celebrate, OpenRain is throwing our first open house tonight. In addition to friends and family, this is an open invitation to folks in Phoenix, AZ interested in the design, development, and business of web software.

In other words, you're invited.

Why a celebration?

Officially, it's because OpenRain recently moved into a new office in sunny Mesa, AZ, and to commemorate this upgrade, we're throwing a "1.0 Release Party."

Unofficially, it's to celebrate just how far OpenRain has come from two guys programming in the spare bedroom.  Personally, I'm delighted with just how much growth we've experienced in the last nine months and even though the best is yet to come, it's important that we take a moment to celebrate our recent successes.

Congratulations to the entire OpenRain 1.0 Team.

July 16, 2008

Take the red pill

Labor Day 2007 was the day of reckoning.  Inter-Tel and Mitel, after spending who-really-knows-how-many months courting the possibility of a merger, had finally decided to bite the bullet and "combine portfolios in order to increase range across the spectrum of market segments" at $25.60 a share.

It's always calm before the storm.

The great firing at Inter-Tel took place only a few months after the great fire at Inter-Tel, which took out a small portion of the roof and caused massive damage to the 2nd floor.  Fortunately, no one was hurt... by the fire.

At 10 am, the Monday after Labor Day, Marsha (not her real name) approached my desk and requested I follow her to the conference room. Even through Marsha's glasses, I could tell she was not wearing her poker face.

There were about a dozen of us, sitting around the table.  Was this a secret award ceremony for the top 12 associates at Inter-Tel?  Were we going to be taken out for dinner and told that we could work on a new industry strength, open source PBX written in Erlang?

Not quite.  "Due to the recent merger, your positions have been eliminated. Here's your awesome severance package." By 10:10, I was restructured out of a job and by 10:20, I had left the building.  It was fast.

I got into my car and drove.  I made three phone calls.

The first was to the dear friend who recommended me for the Inter-Tel interview.

I said jokingly, "Sorry, but you're not going to get your full referral bonus."

Inter-Tel required new associates to stay for a year before the referrer received the full bonus.

We chatted for a while and made plans for lunch later in the week at our usual Japanese hangout.

The second was to my dear girlfriend.  It does seem strange that I didn't first call the love of my life, but I knew the conversation would last a while.  We were planning to purchase a house together.

"Don't worry..."

"No, really don't worry.  Yes we can continue the house search..."

"Yes, we'll have enough money"

"I'll find a job.. don't worry, seriously."

This went on for a while until I realized that I was just around the corner from my destination.

"I got to go, hugs and kisses"

I dialed the third number.

"I'm coming over."

I pulled up outside Preston's house.

"Like, right now."

It's been less than a year, and OpenRain has done superbly well.

I never had any illusions about how easy it would be to start a company.  I knew it would be constantly challenging.  And, it is.

When I think about how I got here, I am truly surprised that it stemmed from a single event that Labor Day just under a year ago.

In The Matrix, Neo is given a choice between the blue pill, a life of simulated reality, or the red pill, an opportunity to awaken and be given a chance to change the real world.

I wasn't courageous enough to take the red pill.  Sure, I had thought about it, but in the end someone had to force it down my throat.  Taking the red pill turned out to be more fun, more challenging, and more rewarding.  Take the red pill; I highly recommend it.

May 09, 2008

New OpenRain code repository

A quick note for our Subversion fans: OpenRain's new source code repository is now located at code.openrain.com.

April 18, 2008

Mountain West Ruby 2008 Review

Now that the videos are out on Confreaks, here's my mini review.  Ruby readers beware: this is my first Ruby conference since starting full-time at OpenRain, so our levels of interest might vastly differ.

My favorite sessions on Ruby

Patrick Farley on Ruby Internals.  Patrick gave a great overview of how C-Ruby works.  He started out by explaining the relationships between several C structs, such as RBasic and RObject, which he then used to explain the magic behind Ruby's runtime object hierarchy.  A fascinating talk, if you're interested in the nitty gritty internals of Ruby.  It makes me appreciate the work being done by the Rubinius team on their Ruby virtual machine.

Tammer Saleh on Behavior driven development with Shoulda.  If you're starting out with Rails for the first time and you'd like to ramp up on best testing practices, then this talk is for you.  Don't get me wrong, you'll still need to spend time on learning the details and getting used to the develop-test (or test-develop) cycle, but listening to this talk will expedite the process by pointing you in the right direction.  If you've been working with Rails for a while and you're wondering what the story behind behavior driven development is, then this talk is also for you.  My thoughts: you don't need a new tool to do BDD; you can get away with Rails integration tests.  Think of this talk as a +8 to your Testing ability.

Philippe Hanrigou on What to do when Mongrel stops responding.  Answer: Debug with GDB and DTrace.  I recently came from C++ land, so seeing Philippe live debug a halted Ruby process with GDB was personally useful.  Even with the amazing work done by the Eclipse and NetBeans teams, GDB isn't going away anytime soon.  Philippe also put out a PDF titled Troubleshooting Ruby Processes, where he describes strace and lsof, in addition to GDB, as valuable debugging tools.

My favorite sessions not-entirely-about-Ruby

Giles Bowkett on Code Generation: The Safety Scissors Of Metaprogramming.   The premise of the talk was that, as in Lisp, writing code which generates data which generates code is a Good Thing. The demo he gave wasn't super glitzy, but the ideas behind it were incredibly thought provoking.   Giles is a great speaker; I know this since I was having a hard time devoting attention to my own note taking.  Realizing and figuring out how the code generation in Rails works is an "ah hah" moment and a mental model you need to construct if you're ever going to fully understand Rails.  His talk will take you there. Lastly, did you notice that almost every speaker included a slide with a photo of Darth Vader wearing a sombrero?  That's Giles' fault.

Senor_darth

Jan Lehnardt on Next Generation Data Storage with CouchDB.  CouchDB is a document-oriented storage system that doesn't support full relational models.  It can replicate in real time, making it highly available, and it is accessible over RESTful HTTP, making it scalable.  It's also written in Erlang, which is the poster child for highly scalable programming.

Why might you care?  Since the dawn of computing, databases have generally come in one kind of style: the relational kind.  In fact, it's a very common paradigm to use relational databases to manage all sorts of, you know, relational data.  Like a database of Customers, with shopping Carts, filled with items from a product Catalog, sold by Vendors, from different States at different Tax rates.  If you're working with data and money, say terabytes of data and billions of dollars, then sometime during development, two questions arise: How do I scale all this data?  How do I make this highly available for our millions of paying customers?  There are a bunch of approaches and patterns which have stood the test of time, but they typically involve adding more hardware, or caching large parts of the system in memory.  Both solutions also scale budgets and complexity.

There are two notable database storage systems which aren't relational: Google's BigTable and Amazon's SimpleDB, which is likely inspired by Dynamo.  Both store structured data; neither supports full relational models.  Let me rephrase that: two very large software shops dealing with far more data than I do have concluded that scaling data using traditional RDBMS techniques could be easier.  It's the potential for being easier with CouchDB that excites me. As of this writing, the version of CouchDB I'm running is 0.72 and some of the details in Jan's talk are still under active development.

I hope you're paying attention: database paradigms are shifting right now.

March 10, 2008

Not really XmlSimple

Use Ruby? Like to parse XML--who doesn't? Using Ruby's XmlSimple library? Don't do that... like, ever, ever.. EVER.

But if you must, take heed of the following advice.

Test #1: What do the Ruby defaults do?

Before we get started, let's define a few snippets of XML to be used in the following examples:

xml1:

<xml>
<head/>
<list>
   <item frames='1' id='2'>who</item>
</list>
</xml>

and xml2:

<xml>
<head/>
<list>
   <item frames='3' id='1'>asdf</item>
   <item frames='1' id='2'>who</item>
</list>
</xml>

Two very simple, well-formed, non-semantic chunks of XML that contain attributes and empty elements.

Now let's take a look at XmlSimple in straight up Ruby:

irb#1(main):010:0> pp XmlSimple.xml_in(xml1)
=> {"head"=>[{}], "list"=>[{"item"=>[{"id"=>"2", "content"=>"who",
"frames"=>"1"}]}]}

Ugh, every key returns an array of hashes so you'll end up doing things like hash["head"].first or hash["item"][2] to access values.  It looks nasty, but it actually makes sense since there's no way to know a priori whether "list" or "head" contain 1 or many items.

Let's try that with the XmlSimple option of "forcearray" => false.

irb#1(main):011:0> xo = XmlSimple.xml_in(xml1, "forcearray" => false)
=> {"head"=>{}, "list"=>{"item"=>{"id"=>"2", "content"=>"who",
"frames"=>"1"}}}

A little cleaner, but problematic as we'll see later.

By default <item>'s value ("who") is referenced by the key "content" and all attributes ("id", "frames") become key=>value pairs.

Test #2: Since XmlSimple uses "content" to reference element values, what happens if you have an attribute called "content" ?

xml1a:

<xml>
<head/>
<list>
   <item frames='1' id='2' content='nuts'>who</item>
</list>
</xml>

irb#1(main):019:0> pp XmlSimple.xml_in(xml1a, "forcearray" => false)
=> {"head"=>{}, "list"=>{"item"=>{"id"=>"2", "content"=>["nuts", "who"],
"frames"=>"1"}}}

By default, both values for the attribute and element named "content" are returned in a single array.  There's no way to distinguish between the two.

Test #3: What happens if you have more than one <item>, like in the case of xml2?

irb#1(main):038:0> pp XmlSimple.xml_in(xml2, "forcearray" => false)
=> {"head"=>{}, "list"=>{"item"=>[{"frames"=>"3", "id"=>"1",
"content"=>"asdf"}, {"id"=>"2", "content"=>"who", "frames"=>"1"}]}}

In this example, note that <item> returns an array of two hashes. Like I previously mentioned, there's no way for XmlSimple to know that an element will have 1 or many items. With the "forcearray" => false option, a key could return a Hash or an Array depending on the XML.  Not desirable, but you can probably coerce the correct behavior with the right XmlSimple configuration options.
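
If you're stuck with "forcearray" => false, one defensive option is to normalize the value yourself before using it. Here's a minimal sketch that works on the two parsed results shown above; the helper name items_in is hypothetical, and the hashes are copied from the irb output rather than produced by calling XmlSimple.

```ruby
# Parsed results copied from the xml1 and xml2 sessions above.
one = { "head" => {}, "list" => { "item" =>
  { "id" => "2", "content" => "who", "frames" => "1" } } }

many = { "head" => {}, "list" => { "item" => [
  { "frames" => "3", "id" => "1", "content" => "asdf" },
  { "id" => "2", "content" => "who", "frames" => "1" }
] } }

# Hypothetical helper: always hand back an Array of item hashes.
# Kernel#Array is avoided on purpose, because Array(hash) converts a
# Hash into an array of [key, value] pairs instead of wrapping it.
def items_in(parsed)
  item = parsed["list"]["item"]
  item.is_a?(Array) ? item : [item]
end

items_in(one).map  { |i| i["content"] } # => ["who"]
items_in(many).map { |i| i["content"] } # => ["asdf", "who"]
```

This keeps the calling code the same whether the document had one <item> or several, at the cost of one extra helper per element you care about.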

Now, let's take a look at XmlSimple embedded and mixed-in with the Hash class, as it is in Rails.

Test #4: What do the Rails defaults do?

console> pp Hash.from_xml(xml1)
>> {"xml"=>{"head"=>nil, "list"=>{"item"=>"who"}}}

By default, it looks like Hash.from_xml in Rails will eat your attributes.  Yikes!

Test #5: Similarly to before, what happens if you have more than one item, like in the case of xml2?

console> pp Hash.from_xml(xml2)
>> {"xml"=>{"head"=>nil, "list"=>{"item"=>["asdf", "who"]}}}

Same as before, the attributes are removed, and "item" references both element values with a single key.

In summary, Ruby's XmlSimple is bork^H^H^H^H surprising to use and in Rails, doubly so.  Actually this really shouldn't be surprising since most of these cautions are already mentioned on the XmlSimple homepage.  What to use instead of XmlSimple?  One powerful alternative is REXML, which comes bundled with Ruby by default.
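
As a rough sketch of what REXML looks like outside of Rails, here's xml2 parsed so that attributes and element text stay separate. The "text" key and the shape of the resulting hashes are my own choices for this example, not something REXML dictates.

```ruby
require "rexml/document"

xml2 = <<~XML
  <xml>
  <head/>
  <list>
     <item frames='3' id='1'>asdf</item>
     <item frames='1' id='2'>who</item>
  </list>
  </xml>
XML

doc = REXML::Document.new(xml2)

# Collect each <item> as a hash that keeps attributes and text apart.
items = REXML::XPath.match(doc, "//item").map do |item|
  {
    "id"     => item.attributes["id"],
    "frames" => item.attributes["frames"],
    "text"   => item.text
  }
end

items.first # => {"id"=>"1", "frames"=>"3", "text"=>"asdf"}
```

Because the element text is stored under a key you pick, an attribute that happens to be called "content" no longer collides with it.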

Pretend we had a new type of XML, HeroXML.  How do we tell Rails to use REXML to process it?

0. A snippet of HeroXML:

<HeroXML>
  <Hero>
    <Name>Clark Kent</Name>
    <Title>Superman</Title>
  </Hero>
</HeroXML>

1. Add a Mime type to the Rails environment or appropriate initializer config file:

Mime::Type.register_alias "application/hero+xml", :heroxml # Mime::HEROXML

2. Update @@param_parsers in your application controller:

@@param_parsers[Mime::HEROXML] = Proc.new do |rpd|
  node = REXML::Document.new(rpd)
  { node.root.name => node.root} # params[:HeroXML] = <REXML document>
end

This instructs Rails to examine all requests with the CONTENT_TYPE of "application/hero+xml" and process the request body with REXML.  It'll add a key to the params hash called "HeroXML" and make the root node of the HeroXML document available to your actions for further processing.

3. Be sure to set the correct CONTENT_TYPE in your tests.

4. Optional: If the HTTP_ACCEPT variable is correctly configured, you may respond to HeroXML enabled remote clients with something like:

respond_to do |format|
  format.heroxml do # defined by ":heroxml"
    ## solid code
  end
end

The Rails defaults work fine, but if you're processing XML with attributes, you'll need to use something other than XmlSimple.

February 21, 2008

EC2 at Phoenix Rails User Group

Recently, I spoke at the Phoenix Ruby Users group about Amazon's EC2 services.

The presentation started off with a basic introduction of EC2:

  1. Virtual computing environment,
  2. Running RedHat FC4,
  3. With a pay-as-you-grow, no long term contract payment plan

Since EC2 is a web service (SOAP), there's no shortage of freely available tools that help you manage your EC2 instances, including:

  1. The official EC2 command-line tools
  2. The Firefox browser extension, EC2UI
  3. The amazon-ec2 RubyGem
  4. If you're feeling particularly ambitious, the WSDL file is available too

After the introduction, I talked about some of the deployment architectures that we've been using, including a 1-box, 2-box, and an N-box approach.  Since it was a Ruby/Rails talk, I included some notes/gotchas on various configuration files and deployment scripts.

Finally, I demoed a 2-box instance running MPICH2 and MPI Ruby, which is a set of Ruby bindings for MPI.

Derek Neighbors posted a recap of the event.
Chris Matthieu recorded the presentation over at Rubyology
Here are the slides to follow along.

Thanks to IntegrumTech for hosting the event.

February 16, 2008

Life imitating art

A few years ago, I read about the city of Beloit's recreation of Georges Seurat's "A Sunday Afternoon on the Island of La Grande Jatte".

Medseurat

Beautiful.

Last year, my friends and I went to Chicago. We stopped by the Art Institute of Chicago and did a recreation of our own.

February 14, 2008

Collect with Ruby and Java

Recently I've been doing a lot of Ruby programming.  I've done a lot of Java and C++ in the past, so it's always interesting to compare styles and design techniques between languages.

Ruby has closures which, amongst other things, allow the language to operate on collections in a compact and concise manner. For instance, take for-loops.  Ruby has for-loops, but you rarely use them.

Take the following list:

list = ["matz", "eats", "sushi"]

Instead of looping over it with:

for i in list
  puts i
end

A common Ruby idiom is:

list.each {|i| puts i}

The power, in this case, comes from expressiveness balanced with brevity.

There are a whole bunch of other collection methods, such as select, find, and collect.
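
Here's a quick sketch of those three methods on the same list. The variable names and the conditions in the blocks are just picked for illustration.

```ruby
list = ["matz", "eats", "sushi"]

# select keeps every element for which the block is true.
four_letter = list.select { |word| word.length == 4 }      # => ["matz", "eats"]

# find returns the first element for which the block is true.
first_s = list.find { |word| word.start_with?("s") }       # => "sushi"

# collect builds a new array from the block's return values.
shouted = list.collect { |word| word.upcase }              # => ["MATZ", "EATS", "SUSHI"]
```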

For a bunch of reasons, my favorite method is collect.

Say you had a collection of User objects with the method "first_name". To get a list of first names, you could do something like this:

users = [...]
first_names = []
for u in users
  first_names << u.first_name
end

But as before, idiomatic Ruby looks like:

first_names = users.collect { |u| u.first_name }

Again, brevity with expressiveness.
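
The same call works unchanged when the block returns a different type. A small sketch, assuming a stand-in User built with Struct (the real User class isn't shown in this post, and the sample users are made up):

```ruby
# Hypothetical stand-in for the User class.
User = Struct.new(:first_name, :age)

users = [User.new("marc", 26), User.new("michelle", 25)]

first_names = users.collect { |u| u.first_name } # => ["marc", "michelle"]
ages        = users.collect { |u| u.age }        # => [26, 25]

# Symbol#to_proc is a common shorthand for single-method blocks.
also_names = users.collect(&:first_name)         # => ["marc", "michelle"]
```

Nothing about collect had to change to go from a list of Strings to a list of Integers; only the block did.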

Just for comparison, let's try and do this in Java6-land.  Assuming a User object with the method "getFirstName()", one easy approach might look like:

List<User> users = ... 
List<String> first_names = new ArrayList<String>();
for (User user : users) {
  first_names.add(user.getFirstName());
}

But what if we wanted to call User.getLastName(), or User.getAge(), which returns a totally different type?  Without closures, the only approach is to duplicate the same for-loops over and over again, each with a different method call and return type.

Is a closures-like approach possible in Java6?  Let's give it a shot.

First, since Java6 doesn't come with closures, you're going to have to model one.

public interface Closure<R, T> {
  public R call(T t);
}

A closure in this case is simply a function that accepts an object, type T, and returns an object, type R.  Seeing a concrete implementation will help clear things up.

A User object, which we'll skip. Just keep in mind that it has a method called "getFirstName()" which returns a String and "getAge()" which returns an Integer.

An actual closure implementation, which looks like:

Closure<String, User> nameColl = new Closure<String, User>() {
  public String call(User t) {
    return t.getFirstName();
  }
};

And finally, the Collect method:

public static <T, R> List<R> collect(List<T> list, Closure<R, T> clo) {
  List<R> res = new ArrayList<R>();
  for (final T t : list) {
    res.add(clo.call(t));
  }
  return res;
}

A test harness:

List<User> list = new ArrayList<User>();
list.add(new User("marc", 26));
list.add(new User("michelle", 25));
List<String> results = collect(list, nameColl); // => ["marc", "michelle"]

To get a list of ages, your closure implementation would look like this:

Closure<Integer, User> ageColl = new Closure<Integer, User>() {
  public Integer call(User t) {
    return t.getAge();
  }
};

A test harness:

List<Integer> results = collect(list, ageColl); // => [26, 25]

As you can see the Java solution is much longer.  In Java, more Typing means more typing.

An open challenge: Is a more concise approach possible in Java6?

Frustrated with all that typing? Don't worry, there is ongoing work to make closures part of the Java programming language.