2012-08-25

Dynamic Programming versus Memoization

[Edit on 2012-08-27, 12:31EDT: added code and pictures below. 2012-08-27, 13:10EDT: also incorporated some comments.]

I wrote this on the Racket educators' mailing list, and Eli Barzilay
suggested I post it here as well.

The article is about the difference between memoization and dynamic programming (DP). Before you read on, you should stop and ask yourself: Do I think these two are the same concept?; if you think they are different, How do I think they differ?; and for that matter, Do I even think of them as related?

Did you think? Okay, then read on.

They most certainly are related, because they are both mechanisms for optimizing a computation by replacing repeated sub-computations with the storage and reuse of the result of those sub-computations. (That is, both trade off space for time.) In that description is already implicit an assumption: that the sub-computation will return the same result every time (or else you can't replace the computation with its value on subsequent invocations). You've almost certainly heard of DP from an algorithms class. You've probably heard of memoization if you're a member of this language's community, but many undergrads simply never see it because algorithms textbooks ignore it; and when they do mention it they demonstrate fundamental misunderstandings (as Algorithms by Dasgupta, Papadimitriou, and Vazirani does).

Therefore, let's set aside precedent. I'll tell you how to think about them.

Memoization is fundamentally a top-down computation and DP is fundamentally bottom-up. In memoization, we observe that a computational tree can actually be represented as a computational DAG (a directed acyclic graph: the single most underrated data structure in computer science); we then use a black-box to turn the tree into a DAG. But it allows the top-down description of the problem to remain unchanged. (As I left unstated originally but commenter23 below rightly intuited, the nodes are function calls, edges are call dependencies, and the arrows are directed from caller to callee. See the pictures later in this article.)

In DP, we make the same observation, but construct the DAG from the bottom-up. That means we have to rewrite the computation to express the delta from each computational tree/DAG node to its parents. We also need a means for addressing/naming those parents (which we did not need in the top-down case, since this was implicit in the recursive call stack). This leads to inventions like DP tables, but people often fail to understand why they exist: it's primarily as a naming mechanism (and while we're at it, why not make it efficient to find a named element, ergo arrays and matrices).

In both cases, there is the potential for space wastage. In memoization, it is very difficult to get rid of this waste (you could have custom, space-saving memoizers, as Václav Pech points out in his comment below, but then the programmer risks using the wrong one...which to me destroys the beauty of memoization in the first place). In contrast, in DP it's easier to save space because you can just look at the delta function to see how far “back” it reaches; beyond there lies garbage, and you can come up with a cleverer representation that stores just the relevant part (the “fringe”). Once you understand this, you realize that the classic textbook linear, iterative computation of the fibonacci is just an extreme example of DP, where the entire “table” has been reduced to two iteration variables. (Did your algorithms textbook tell you that?)

In my class, we work through some of the canonical DP algorithms as memoization problems instead, just so when students later encounter these as “DP problems” in algorithms classes, they (a) realize there is nothing canonical about this presentation, and (b) can be wise-asses about it.

There are many trade-offs between memoization and DP that should drive the choice of which one to use.

Memoization:
  • leaves computational description unchanged (black-box)
  • avoids unnecessary sub-computations (i.e., saves time, and some space with it)
  • hard to save space absent a strategy for what sub-computations to dispose of
  • must alway check whether a sub-computation has already been done before doing it (which incurs a small cost)
  • has a time complexity that depends on picking a smart computation name lookup strategy
In direct contrast, DP:
  • forces change in desription of the algorithm, which may introduce errors and certainly introduces some maintenance overhead
  • cannot avoid unnecessary sub-computations (and may waste the space associated with storing those results)
  • can more easily save space by disposing of unnecessary sub-computation results
  • has no need to check whether a computation has been done before doing it—the computation is rewritten to ensure this isn't necessary
  • has a space complexity that depends on picking a smart data storage strategy
[NB: Small edits to the above list thanks to an exchange with Prabhakar Ragde.]

I therefore tell my students: first write the computation and observe whether it fits the DAG pattern; if it does, use memoization. Only if the space proves to be a problem and a specialized memo strategy won't help—or, even less likely, the cost of “has it already been computed” is also a problem—should you think about converting to DP. And when you do, do so in a methodical way, retaining structural similarity to the original. Every subsequent programmer who has to maintain your code will thank you.

I'll end with a short quiz that I always pose to my class.

Memoization is an optimization of a top-down, depth-first computation for an answer. DP is an optimization of a bottom-up, breadth-first computation for an answer. We should naturally ask, what about
  • top-down, breadth-first
  • bottom-up, depth-first
Where do they fit into the space of techniques for avoiding recomputation by trading off space for time?
  1. Do we already have names for them? If so, what?, or
  2. Have we been missing one or two important tricks?, or
  3. Is there a reason we don't have names for these?

Where's the Code?

I've been criticized for not including code, which is a fair complaint. First, please see the comment number 4 below by simli. For another, let me contrast the two versions of computing Levenshtein distance. For the dynamic programming version, see Wikipedia, which provides pseudocode and memo tables as of this date (2012-08-27). Here's the Racket version:
(define levenshtein
  (lambda (s t)
    (cond
     [(and (empty? s) (empty? t)) 0]
     [(empty? s) (length t)]
     [(empty? t) (length s)]
     [else
      (if (equal? (first s) (first t))
   (levenshtein (rest s) (rest t))
   (min (add1 (levenshtein (rest s) t))
        (add1 (levenshtein s (rest t)))
        (add1 (levenshtein (rest s) (rest t)))))])))
The fact that this is not considered the more straightforward, reference implementation by the Wikipedia author is, I think, symptomatic of the lack of understanding that this post is about.

Now let's memoize it (assuming a two-argument memoize):
(define levenshtein
  (memoize
    (lambda (s t)
      (cond
       [(and (empty? s) (empty? t)) 0]
       [(empty? s) (length t)]
       [(empty? t) (length s)]
       [else
 (if (equal? (first s) (first t))
     (levenshtein (rest s) (rest t))
     (min (add1 (levenshtein (rest s) t))
   (add1 (levenshtein s (rest t)))
   (add1 (levenshtein (rest s) (rest t)))))]))))
All that changed is the insertion of the second line.

Bring on the Pitchers!

The easiest way to illustrate the tree-to-DAG conversion visually is via the Fibonacci computation. Here's a picture of the computational tree:


Now let's see it with memoization. The calls are still the same, but the dashed ovals are the ones that don't compute but whose values are instead looked up, and their emergent arrows show which computation's value was returned by the memoizer.


Important: The above example is misleading because it suggests that memoization linearizes the computation, which in general it does not. If you want to truly understand the process, I suggest hand-tracing the Levenshtein computation with memoization. And to truly understand the relationship to DP, compare that hand-traced Levenshtein computation with the DP version. (Hint: you can save some manual tracing effort by lightly instrumenting your memoizer to print inputs and outputs. Also, make the memo table a global variable so you can observe it grow.)

10 comments:

Paddy3118 said...

Nice teaser blog, but dry. Where is the code to explain such broad statements?

It sounds as if you have a point - Enough to make me want to see examples but there is nothing beneath to chew on.

Václav Pech said...

Thank you for such a nice generalization of the concept. I elaborated on a specific task in one of my earlier posts (http://www.jroller.com/vaclav/entry/memoize_groovy_functions_with_gpars), where by simply adding memoization on top of a recursive Fibonacci function I end-up with linear time complexity.
Since Groovy supports space-limited variants of memoize, getting down to constant space complexity (exactly two values) was easily achievable, too.

simil said...

@Paddy3118: The simplest example I can think of is the Fibonacci sequence. The implementations in Javascript can be as follows.
function FIB_MEMO(num) {
var cache = { 1: 1, 2: 1 };
function innerFib(x) {
if(cache[x]) { return cache[x]; }
cache[x] = (innerFib(x-1) + innerFib(x-2));
return cache[x];
}
return innerFib(num);
}
function FIB_DP(num) {
var a = 1, b = 1, i = 3, tmp;
while(i <= num) {
tmp = a;
a = b;
b = tmp + b;
i++;
}
return b;
}
It can be seen that the Memoization version "leaves computational description unchanged". And the DP version "forces change in desription of the algorithm". Also note that the Memoization version can take a lot of stack space if you try to call the function with a large number. The trade-offs mentioned at the end of the article can easily be seen in these implementations.

commenter23 said...

A small note from someone who was initially confused - it was hard to parse what you meant by converting a DAG into a tree as the article didn't mention what the nodes and edges signify. Presumably the nodes are function calls and edges indicate one call needing another. And the direction of the arrows point from the caller to the callee? It would be more clear if this was mentioned before the DAG to tree statement.
Nevertheless, a good article.

sbloch said...

Inserting the line "memoize" may work beautifully, but it doesn't really illuminate what's going on. Would there be any point adding a version that expands that into explicitly checking and updating a table?

shreevatsa said...

1) I completely agree that pedagogically it's much better to teach memoization first before dynamic programming. The latter has *two* stumbling blocks for students: one the very idea of decomposing of a problem in terms of similar sub-problems, and the other the idea of filling up a table bottom-up, and it's best to introduce them one-by-one. Then you can say "dynamic programming is doing the memoization bottom-up". As an aside, for students who know mathematical induction, it sometimes helps them to say "dynamic programming is somewhat like induction".

2) What are the fundamental misunderstandings in the Algorithms book? (I haven't seen it.)

3) One issue with memoization that you didn't mention is stack overflow. Because of its depth-first nature, solving a problem for N can result in a stack depth of nearly N (even worse for problems where answers are to be computed for multiple dimensions like (M,N)); this can be an issue when stack size is small.

Shriram Krishnamurthi said...

Stephen (sbloch), sorry, but no time to do that right now. You're right that that would help, but I was assuming the reader had some knowledge of memoization to begin with, or could look it up.

Shriram Krishnamurthi said...

shreevatsa, +1 to everything you said.

I can't locate the comment in Algorithms right now, but it was basically deprecating memoization by writing not particularly enlightened remarks about "recursion".

One slight counter to your comment #2: if depth of recursion really is a problem, one could systematically eliminate it using techniques like CPS. Thus the solution can still be expressed as base functionality + functional abstractions + program transformations. This would be easier to read and to maintain.

wolf said...

Shriram and sbloch,

About to talk memoization to a class today. Here's a Racket memoize that should work for any number of args on the memoized function:

(define (memoize f)
  (local ([define table (make-hash)])
    (lambda args
      ;; Look up the arguments.
      ;; If they're present, just give back the stored result.
      ;; If they're not present, calculate and store the result.
      ;; Note that the calculation will not be expensive as long
      ;; as f uses this memoized version for its recursive call,
      ;; which is the natural way to write it!
      (dict-ref! table args
                 (lambda ()
                   (apply f args))))))

Shriram Krishnamurthi said...

@wolf, nice, thanks. Keep in mind that different uses might want different kinds of equality comparisons (equal? vs eq?, say). May be good to remind your class of that. They could generalize your memoize to be parameterized over that (even in each position, if they want to go wild).