My CSE572 review or What I learned from the AOL search logs
»From the datamining part of the brain.
Last week, I wrapped up my first grad-level class, CSE 572-Data Mining. It was hard. Not prove ”P = NP” hard, but “juggling new job, real life, gymming, and awesome girlfriend” hard. The class was fun, I got to learn a lot about the usual algorithms like decision trees and stumps, k-nearest neighbors, Naive Bayes classification, bagging, boosting, classifier comparisons, lots of clustering techniques–top-down, bottom-up, distance-, graph-, and density-based methods, and techniques for mining association rules, like apriori, which I’ll write about in a future post. I even got to argue with the CTO of an upcoming startup over his approach for social suggestions, which was fun, until he IPOs and I don’t :-).
But I have to say, the really fun work was the project and all the stuff I learned outside class which, in another nutshell, includes distributed computations, first-class functions, concurrency issues, and techniques for (trivially?) parallelizing really, really, space and time intensive computations. I got to work with Amazon’s EC2 (Thanks Mike), a platform providing scalable hardware and bandwidth as a service, the recently released AOL search logs, and techniques for mining association rules. For my project, I decided to perform an analysis on the AOL search logs. Specifically, deriving association rules from the logs similar to the market-basket problem.
Market-basket analysis.. old and busted, or new hotness?
Let me take a second to talk about the market-basket problem. In this problem, we are given a set of all items, let’s call it A, and a large collection of transactional data. Each transaction (or basket) is a subset of A. Now the task here is to find relationships between the items within these baskets based on various factors.
Too complicated? OK, let’s try that again.
The canonical example of the market-basket problem is the supermarket (where the problem gets its name from). Now, instead of “the set of all items”, think “everything Safeway sells” and instead of “large collection of transactional data” think “what people purchase per visit.” Now we can find what products people tend to buy together. Knowing this, Safeway can stock their inventory, plan their sales, and market accordingly. This of course, increases ROI, embiggens synergistic… powers, and makes our lives easier overall. That’s the theory anyway.
Snap back to reality
My project was to analyze the AOL search logs in a very similar fashion. Instead of “the set of all items,” think “all unique queries in the search logs” and instead of “large collection of transactional data” think “what people search for per search session.” Now I can find out what searches people tend to make together. And what did I learn? Damn. People search for the craziest things. And by craziest I mean sexually explicit. And by people I mean you. No, not you, the person sitting next to you.
As I previously mentioned, the project was easily the most interesting component of the course. The third and current iteration of the code is under development. It turns out that, for a variety of reasons I won’t get into right now, implementing a scalable and efficient algorithm to mine association rules is a non-trivial task. Stay tuned, I’ll try to write more about it in the future.