Saturday, September 12, 2009

What's the New York Times Doing with Hadoop?

Interesting yet very brief interview on what the New York Times is doing with Hadoop. It's always fascinating to me to read about the tools and approaches people use at a level of scalability most of us don't have to worry about. Also interesting to me is the MapReduce functionality in Hadoop, since it's the same idea used by CouchDB views, and I'm absolutely loving the bit of work I've been doing with CouchDB.

4 comments:

cfwhisperer said...

Matt, this is all very fascinating to me. I got interested at the O'Reilly Velocity Conference in San Jose. I have to discipline myself and dedicate time to completing the build-out of our LA lab, but I am intrigued to at least get rolling with CouchDB. We are also looking closely at aiCache, which is a web acceleration product. Thanks for the pointer.

Matthew Woodward said...

Thanks DrQz--great info.

DrQz said...

Welcome. BTW, the functional part is not all that difficult either. Just think of EVERYTHING (including a '+' operator) as a function or procedure with args as inputs and returns as outputs. The main difference from procedural languages, like C or Java, is that the output of one function can be the input to another function ... a LISP-ism.

Here's an example in Mathematica:

    In[1]:= Times[Plus[a, b], c]

produces:

    Out[1]:= (a + b) c

which is what a mathemagician would've written in the first place (the multiply being implicit in math). If 'a' and 'b' are given numerical values, you'd get a single number as the output. MMA can do either numbers or symbols.

From a programming standpoint, you see that 'a' and 'b' are args into the function Plus, and its output (together with the input 'c') is an input into the function Times.

This example is pedestrian, but it generalizes into some very cool and powerful constructs that can be written with relatively little code. For example:

1. A Quine (code that reproduces itself):

    Print[#1,FromCharacterCode[91], #1,FromCharacterCode[93]]&[Print[#1,FromCharacterCode[91], #1,FromCharacterCode[93]]&]

2. Sequence generator using recursion:

    Nest[Join[ # , ReplacePart[ # , Length[ # ] -> Last[ # ] + 1]] &, {0, 1}, 5]

See http://www.research.att.com/~njas/sequences/A007814 (bottom of page).
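
For anyone without Mathematica handy, here is a rough Python sketch of the same two ideas: operators treated as ordinary functions whose outputs feed other functions, and a Nest-style repeated application that builds the sequence. The helper names nest and step are made up for this illustration (they are not part of any library), and the numeric values 2, 3, 4 are arbitrary stand-ins for a, b, c.

```python
from operator import add, mul

# Times[Plus[a, b], c]: the output of add() is an input to mul().
# The values 2, 3, 4 are arbitrary stand-ins for a, b, c.
product = mul(add(2, 3), 4)


def nest(f, x, n):
    """Apply f to x repeatedly, n times (a stand-in for Mathematica's Nest)."""
    for _ in range(n):
        x = f(x)
    return x


def step(seq):
    """Append a copy of seq whose last element is incremented by 1.

    Mirrors Join[#, ReplacePart[#, Length[#] -> Last[#] + 1]] &.
    """
    return seq + seq[:-1] + [seq[-1] + 1]


# Nest[..., {0, 1}, 5]: five doublings of the two-element seed.
ruler = nest(step, [0, 1], 5)

if __name__ == "__main__":
    print(product)
    print(ruler[:16])
```

Each pass doubles the list, so five passes over the two-element seed give 64 terms. Python has no symbolic mode out of the box, so this version handles only the numeric case, not the (a + b) c form.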

DrQz said...

With respect to how Hadoop/MR might be getting applied at NYT, here's a video of an ACM talk (http://www.sfbayacm.org/?p=88) that I attended recently, about how Google.com is actually using MR in their AdSense group (i.e., where the action is): http://fora.tv/2009/08/12/Josh_Herbach_PLANET_MapReduce_and_Tree_Learning

The MR slide appears @ 00:37:37 approx.

This presentation also provides a quick overview of the whole schmeer related to Biz Analytics and DM. Hard to find, otherwise.