Placeholder Image

字幕表 動画を再生する

  • [MUSIC PLAYING]

  • SPEAKER: OK.

  • So what is dynamic programming?

  • Normally, of course, I like to ask you to contribute ideas

  • that we can discuss.

  • But in this particular case, the point that I want to make

  • is either you already knew the answer, or you're wrong.

  • Dynamic programming is designed to be buzzword compliant.

  • It's something that was invented by Richard Bellman,

  • and he came up with the name.

  • And it's actually-- and this is according to his own autobiography--

  • the problem was that Rand is a defense contractor,

  • so he was working at Rand, working on computer algorithms,

  • and Rand worked for the Air Force, and still works for--

  • it's part of the big industrial military complex--

  • which meant that their budget depended on ultimately the Secretary of Defense.

  • And if the Secretary of Defense didn't like what he was doing,

  • or Congress didn't like what he was doing, and they got wind of it,

  • that would be the end of it.

  • So apparently, according to Bellman, the Secretary of Defense at the time

  • had a pathological aversion to research.

  • The way he described it is that he would suffuse and become red in the face

  • if you mentioned the word research in his presence.

  • And he really didn't like it if you said something like mathematics.

  • And so Bellman wanted to be sure that whatever he was doing if this showed up

  • in a Congressional report or in a report that landed on the Department

  • of Defense's-- on the Secretary of Defense's-- desk,

  • that it wouldn't say anything objectionable.

  • So he came up with this term programming, which in the math world,

  • means optimization-- find the best answer to a problem.

  • And dynamic and the idea of dynamic, was this

  • is an adjective that, yes, it sort of expresses that things are changing,

  • but mostly it has no negative connotations.

  • Dynamic just sounds good.

  • So that's where the name comes from.

  • So dynamic programming is basically a name

  • that nobody could possibly object to, and therefore

  • decide to cancel the project.

  • There is a term that is much more descriptive of what's going on,

  • which is a lookup table.

  • So what dynamic programming really is, when it comes down to it,

  • is it's a way of looking at certain kinds of problems.

  • With linked data structures, linked lists,

  • tries, hash tables, all of these sort of fall into a category of data structure

  • that you use when you've got data and you want

  • to map the connections between them.

  • And so there's a whole category of different data structures

  • that involve using pointers to map the connections between things.

  • Dynamic programming is a category of algorithms

  • where you say, in my problem, when I want to solve it,

  • I end up asking the same question over and over again as part of solving it.

  • The same question with the same parameters, if you will.

  • And so rather than doing all the same work over and over again,

  • let me just remember the answer after I figure it out once,

  • and the next time I ask the same question,

  • I'll just go look up what the answer was.

  • That's the idea of dynamic programming.

  • And of a lookup table.

  • You've already used lookup tables.

  • For example, going way back to the beginning of the class,

  • Here's a for loop.

  • for(i=0, i less than strlen of some string variable that I've called str,

  • i++.

  • this.

  • Is a loop the loop through every character

  • of this string, str, one by one.

  • But the problem is every time you go through the loop,

  • you recalculate the length of this string.

  • And calculating the length of the string in C

  • takes some time because you have to walk down the whole string until you

  • reach that null character at the end.

  • So going all the way back again to the beginning of the class,

  • remember, we talked about, it would be better from a performance perspective

  • not to keep recalculating the length of the string.

  • Instead, why not say something like int length=strlen(str),

  • and save that value to a variable, and just look up that value every time we

  • go through the loop and are checking-- have we reached the last value of i?

  • There, length you can think of as a lookup table that stores one value.

  • There's already an example that you've seen.

  • We're going go to lookup tables that store

  • the answer to more than one thing.

  • And so I'm going to go through today a few different examples of problems

  • that you can solve using dynamic programming.

  • So these are problems where you can break them down in a way

  • that you keep asking the same question over and over again

  • and just look up the answer in a table or an array.

  • And the first such problem I want to talk about is rod cutting.

  • Now the idea of this is I have a rod of something valuable,

  • so maybe it's a Tootsie Roll, or plutonium,

  • there's a wide range of things that come as sort of like a long cylinder.

  • And you want to chop it up into shorter pieces to sell.

  • And what you know is that different length pieces, different length rods,

  • sell for different amounts.

  • And the question is, how should I cut this long thing up

  • into smaller pieces to maximize the money that I can sell it for?

  • So if I know, for example, that a rod that's one inch long sells for $1,

  • and a rod that's two inches long turns out to be a lot more valuable,

  • I can sell that for $5.

  • And when it's three inches long I can sell for $8.

  • Now there's a lot more demand for things that are three inches long

  • then things that are four inches long.

  • So if it's four inches long, I can sell it

  • for $9, which is a little more but not a lot more.

  • And then comes the question--

  • I've got a rod of length 10--

  • how should I chop it down into 1s and 2s and 3s and 4s to maximize the profit.

  • And there's different ways I could do this.

  • I could chop it into 10 little one inch rods, and sell each one for $1

  • and I've made $10.

  • I could chop it into five two inch rods, and it would get 5 times 5 is $25.

  • That's a lot more money.

  • So being the unabashed capitalist than I am, that's what I'm likely to do.

  • And there are some in-between options that might make me some money.

  • The question is, how to quickly figure out

  • what I should chop this rod into to maximize my profit

  • and be sure that that's the most that I can sell it

  • for There wasn't some other way to chop this up that would be better.

  • And one way to think about the problem is,

  • if I've got a rod that's 10 inches long and I can chop it once an inch,

  • there are 1, 2, 3, 4, 5, 6, 7, 8, 9 possible places I could cut the rod.

  • And any one of those places I could either decide to make a cut

  • or not make a cut.

  • And that will affect the end result.

  • So that's kind of 9 bits--

  • I either cut or I don't.

  • They're a 1 or a 0.

  • So I end up in terms of the possible different combinations

  • of how I could cut this with every possible 9-bit number

  • in binary, which has 2 to the 9th possible values, which is 512.

  • So I could try 512 possible different ways to cut this rod.

  • And I can figure out which one does the best.

  • And then I could try having a rod that was 20 inches long instead,

  • and discover there's 19 possible places to decide I'm going to cut it

  • or I'm not, and there's 2 to the 19th possibilities

  • now, which is about 500,000.

  • 2 to the 10th is about 1,000, 2 to the 20th is about a million, 2 to the 19th

  • is about a million divided by 2, it's about 500,000.

  • So in general, if we said this rod is n inches long,

  • so that there's n places or n minus 1 places where I could make a decision

  • to cut it, there are 2 to the n approximately possible different ways

  • of making combinations of cuts.

  • And that's a lot of things to try.

  • At the same time, you can sort of see that splitting it into rods

  • of 1, 1 5, 5, and 9 inches long--

  • this is not the only combination of cuts that would give me that.

  • There are other combinations that would also give me two 1s, two 5s, and--

  • sorry, a 4 inch long.

  • Two 1s, two 2s, and a 4.

  • I could do a 4 and then two 2s and then two 1s.

  • I'd get the same value out of it.

  • So a lot of the possible ways that we could cut--

  • they would give the same result in terms of the value.

  • And so maybe I want to reorganize the problem

  • where I don't have to try everything, because I realize a lot of times

  • I'm just recalculating the same thing.

  • And one way to do that--

  • and this is where this concept of dynamic programming or a lookup table

  • comes in--

  • is to say, let me decide first where to cut where the left most cut will be.

  • So instead of deciding, am I going to cut it 1 inch or not,

  • am I going to cut it 2 inches or not, am I going to cut it 3 inches or not,

  • I'm going to decide what's the first place that I'm going to cut?

  • That could be at 1 inches or 2 inches or 3 inches or 4 inches

  • or 5 inches or 6 or 7 or 8 or 9 inches, or I

  • could decide I'm not going to cut at all,

  • I'm just going to leave the rod whole.

  • So with his rod of length 10, there's 10 possible places to make the first cut.

  • And the thing that I'm going to decide is that that splits the rod into two--

  • the part on the left I'm not going to cut again.

  • I'm not going to cut this 2 part--

  • or the 3 part--

  • I'm making that decision.

  • This is the first piece that I'm going to sell.

  • So the total value that I can get will be

  • whatever I can sell the piece on the left for like $1 or $5 or $8 or $9,

  • plus whatever I can get for the right part if I chop it into smaller pieces

  • the best possible way.

  • So the part on the left I won't cut again.

  • I can just look up what is the value of a rod of this length in the table.

  • But I also need to figure out for the part on the right, what

  • is the best way to cut this up.

  • And even that, if you draw out all the possibilities,

  • you're going to end up with the same 2 to the n,

  • except you can start seeing some commonalities.

  • So for example, if you decided to cut a rod of length 2,

  • you've got eight leftover, and you want to find the best way to cut it.

  • Now, if you only chopped off one inch first, that was your first cut,

  • and then you did another cut where you chopped off another inch,

  • suddenly you're end up with what is the best way to cut a rod of length eight.

  • So in the possible ways of chopping this thing of length 9,

  • one possibility is, again, we take one inch off to sell,

  • and we have eight inches left.

  • If right from the start we've taken two inches off,

  • we would also have eight inches left.

  • And so we're asking the same question both times.

  • We've said, I've got some stuff I've already cut that I'm going to sell,

  • and now I've got an eight inch rod left.

  • How do I cut that?

  • There's no point figuring that out twice.

  • We could just keep a little array in which we store, what's

  • the best way to cut a rod of length 8?

  • What's the best way to cut around of length 9, of 7, 6, of 5,

  • of 4, and for each one of those, we just look in the array and say,

  • have we put a value in there yet?

  • If so, we just use it.

  • We don't recompute it.

  • And the result is, if you start looking at it,

  • we've got 10 choices for our first cut, and then on average we've

  • got something like up to nine choices for our second cut,

  • and up to eight choices for our third cut.

  • And so for each of the possible 10 choices the first time,

  • we've got nine choices to make for our second cut, we're up to 90.

  • But this would get us exponential.

  • The problem is a lot of these cuts end up being the same.

  • And so you end up saying, I only have to deal

  • with sort of 10 choices of what to make as the first cut for right of length

  • 10, and nine choices to make for the first cut of a rod of length 9,

  • and eight choices for the first cut to make on a lot of length 8,

  • because after that I'm getting down to something shorter.

  • I'll figure that out eventually.

  • And you end up with something like n squared.

  • So for rod of length 10, I'd have roughly 10 times 10 choices

  • that I actually have to do work for.

  • And so you get down to n squared as opposed to n cubed.

  • Another way to do this is to say, let me start with a rod of length 1.

  • I have no choices for how to cut it, so I know the answer

  • for how I'm going to sell that.

  • For a rod of length 2, I can either do the cost for rod of length 1,

  • and make a cut, and then have a rod of length 1 leftover.

  • I can look up what that costs.

  • And I'll store then which of those was the best value for rod of length 2.

  • I'll say, for rod of length 2, I can get $5, and the way to do that

  • is to not cut.

  • And now for a rod of length three, well, I could decide to cut it after 1 inch,

  • and I've got 2 inches leftover.

  • So we'll look and say, well, the best thing to do with a rod of 2 inches

  • is get $5 for it, so great.

  • For something with 3 inches, I can get $6 by making a cut after one inch.

  • If I make the cut after 2 inches, I get $5 for selling a rod of length 2,

  • and then I look up what do I do with a rod of length 1,

  • and the answer is you just sell it for $1, that gets me $6.

  • I look up, and what happens if I just sell a rod of length 3--

  • and that's $8.

  • So I'll record-- great, for rods of length 3, don't chop it up,

  • just sell it.

  • For a rod of length 4, now I can chop it after 1 inch.

  • I look up, ah, for a rod of length 1, I sell it

  • for $1, plus what do I do with a rod of length 3?

  • I just sell it.

  • I'm done.

  • I've found that that gets me $9.

  • What if I cut at 2 inches?

  • Well I've got a two inch bar I sell for $5,

  • and then I look up how do I recover the right half, it's 2 inches long.

  • I won't bother.

  • My table says just keep it. .

  • And after I've tested all of those, I know that cutting 4 into 2 and 2

  • is going to be the best thing to do.

  • So what's going to happen is, as you build this up,

  • eventually you get to where you're dealing with your rod of length 10.

  • For every cut that you make, you can just

  • look up the left half you know you're going

  • to say as a single thing, the right half you can look up in a table

  • and say, what's the maximum value I can get for this?

  • And the table will say, you can get this many dollars,

  • and here's where you make the first cut.

  • And then you've got a shorter piece left over,

  • and you can look up where to make the next cut.

  • Questions?

  • So I'm going to move on to some other examples.

  • And the next one I'm going go through fairly quickly.

  • And the one after that, I'm going to go into more detail

  • even than I did the rod cutting.

  • And we'll draw out that table, filling in some of the values.

  • If you remember to networking last week, we've

  • got all sorts of computers that are connected together-- for instance,

  • my computer, at least when I'm in my office,

  • is probably hooked up through a department-wide server,

  • or maybe just through Yale's University server,

  • and Natalie also has a computer that's connected,

  • and because our offices are close, they may actually have a connection

  • directly to each other.

  • Now, David and Doug up in the CS50 office at Harvard, or their computers

  • may also be directly connected to each other.

  • So they can send messages back and forth directly,

  • and they're also connected to Harvard's server.

  • But if I want to send a message to David, I will need--

  • and that might be an HTTP request, for example,

  • if he's running a web server on his computer--

  • I'm going to need to find a way to hop from computer to computer.

  • I can't talk directly to David's computer,

  • because there's no wire running between them.

  • So I need to find some computer where I say, I've got a message for David,

  • do you know how to deliver that?

  • And it will say, yeah, I can do that for you.

  • And then it will pass it on to some other computer,

  • saying this is for David, and it keeps getting

  • passed from server to server, until it gets to one that is actually

  • connected to David's computer.

  • And the problem of how to find who to pass that message to is called routing.

  • And it's one of the problems in networking.

  • And it's particularly difficult because networking,

  • you have to deal with other people.

  • When you're just writing a program of your own on your laptop,

  • if something goes wrong, it was probably your fault,

  • and you can probably fix it.

  • If you're talking about a network, it could

  • be that somebody turned off the lights somewhere else,

  • or unplugged their computer without warning.

  • That's completely out of your control.

  • But you still have to deal with it.

  • And so how do you get from my computer to David's computer

  • is something that you have to keep track of

  • and may keep changing, and to get there, we need to find--

  • I need to be able to find not just how I could get to David's computer,

  • but essentially who to pass that message to to get it there the fastest.

  • I may have more than one way of doing it.

  • And one way to do it is going pretty fast,

  • and the other way is going to go through China and Sierra

  • Leone and a congested undersea cable to Africa which has very poor internet

  • links, and it will take a long time for the message to get to David.

  • Unless, of course, somebody from MIT went and found

  • the cable that connects Harvard to Yale and snipped it just

  • for fun, in which case I might be stuck taking the longer route for a while

  • because something broke outside of my control.

  • And the question I want to talk about today

  • is, how do you figure out who to pass that message to?

  • And this is another place where dynamic programming turns out

  • to be really helpful.

  • So the idea is, it's not just that I need

  • to know how to get to David's computer, I

  • need to know how to get to everywhere on the internet.

  • And everywhere else on the internet also needs

  • to know how to get to everywhere on the internet.

  • Everybody is trying to solve this problem of how to find the quickest

  • way to get to any other computer.

  • And to start out, I'm just plugged in, I don't even know who I'm connected to,

  • let alone who else is out there.

  • So the process is, rather than me trying to go snooping and exploring

  • the entire internet--

  • and by the time I'm done, somebody will have unplugged their computer

  • and the process will need to change--

  • I want to try and make sure that we are all

  • doing this work together and sharing as much information as possible.

  • So what I will do is I will just broadcast a message to all

  • of my neighbors, saying haha, I'm here, I'm Benedict,

  • and now Natalie will know, and Yale server will know,

  • aha, Benedict's computer is connected directly to us.

  • So if we need to get a message to Benedict, we can just give it to him.

  • Now at the same time, everybody else is going to do that.

  • So I'll get a message from Natalie, saying haha,

  • here I am, and I'll get one from Yale saying haha, here I am.

  • And now, I know that I'm connected directly to Natalie and Yale.

  • This is what we call being one hop away.

  • So if I need to send a message, it needs to make one hop,

  • just like an airplane makes one hop from city

  • to city, to get to the next computer.

  • And as far as I know, that's all there is in the world.

  • But everybody's just done this, so now everybody knows all of the computers

  • that they're adjacent to.

  • The next thing is, we can all send that information around, and say, here

  • I am, and oh, by the way, here's everybody that I'm connected to.

  • So if you give me a message for any of those people,

  • I'm saying, anybody who passes a message to me that needs to go to Natalie--

  • I can get it to her in one hop.

  • So then you can figure out how many hops it takes you one more hop than that

  • if you pass the message to me.

  • So after we share that we can update our lists.

  • And what will happen is Yale server has learnt from me, hahaha,

  • I can get a message to Natalie for you, just give it to me and in one hop,

  • I'll get it to Natalie.

  • And Yale server says, yeah, but I can just give it to Natalie myself,

  • so why bother doing that?

  • So Yale server will keep in its list, I just found a way to get Natalie in two

  • hops, I don't care.

  • I can get there in one hop by just giving it straight to Natalie.

  • However, I will have learned that Yale can reach Qwest's server in one hop.

  • And so if I need to pass a message to Qwest, I can give the message to Yale

  • and ask it to give it to Qwest and that's

  • the fastest way I know to get there.

  • And so each step everybody sends a pretty short message

  • to all their neighbors.

  • It takes a fixed amount of time.

  • And we're accumulating this information and always keeping

  • track of, what was the shortest--

  • what is the fewest number of hops it takes us to get to a server,

  • and who do we give the message to to get there?

  • And as long as I give the message to Yale,

  • Yale can worry about how it gets to Qwest, it's just promised me it

  • takes one hop.

  • If I do this again, now it turns out I can get to Google in three hops.

  • I can get to Harvard in three hops.

  • But all I care about is, I can get there in three hops and the way I do it

  • is I give the message to Yale's server.

  • Yale's server doesn't even know directly how to talk to Harvard's server

  • or to Google's server.

  • It just knows to pass the message to Qwest's server.

  • Qwest, by the way, owns a lot of the big fibers that

  • connect big data centers and things.

  • So when I traced what happens if I try and connect

  • to Google's server from my office, it goes up through a departments--

  • CS department-- server, then a Yale server, then it goes to Qwest,

  • and then it goes to Google.

  • There are some other ones, like Internet2,

  • which connects a lot of universities, and America Online

  • used to own a lot of this, I don't know if they ever sold it off.

  • There's a few companies that own most of those cables.

  • And eventually, after four rounds, I find out that I can even reach David

  • and Doug in four hops by giving the message to Yale.

  • And Yale knows that it needs to just give

  • that message to Qwest, which gives that message to Harvard,

  • and then Harvard knows how to pass the message on.

  • And this is in fact how the major routers on the internet

  • share this information.

  • Now, your computer or my computer, what they typically know

  • is any message I need to send, I just send it to the server I'm connected to,

  • to the wireless access point.

  • And the wireless access point basically says, I send it to Yale's server.

  • Or maybe there's a very small-- it knows the following laptops are

  • connected directly to me and anything else I just send to Yale's server.

  • But when you get up to the level of all of Yale University, for example,

  • it may well start to participate in this process of sharing this information

  • to know, I could talk to Qwest, I could talk to Internet2,

  • I could talk to Comcast, there's a bunch of people I could talk to.

  • Which cable should I send this message out

  • on to make sure it gets where it's going?

  • And the dynamic programming aspect of this is, at each step

  • I'm reusing all the information that was computed before

  • to just add another layer on top.

  • How do I get to places in three hops?

  • I'm not recomputing the route to get from Yale to David,

  • let's say, when I figure out that I can get there in four hops.

  • I'm just using the information that I know that the Yale server can do it.

  • And so that reduces the amount of actual work that I have to do.

  • Any questions?

  • Now, the third category of algorithm that I want to talk about--

  • and I'm going to talk about this one in a little more detail,

  • and I'm going to talk about a few different applications of exactly

  • the same algorithm, so not just of dynamic programming

  • but of this specific algorithm, of different problems that it can solve.

  • And the one I'm going to start with is DNA sequence alignment.

  • You also do this for aligning protein sequences.

  • If you have a strand of DNA, it's made up

  • of a chain of what are called bases, which are chemicals.

  • They're molecules.

  • And there's four of them that are used in DNA.

  • There's adenine, and thymine and guanine and cytosine.

  • And we'll typically just abbreviate each one with a letter.

  • So if you were to unroll a strand of DNA,

  • and just write down the list of these bases of these molecules

  • that are connected in a long chain, you essentially

  • get a string that's made up of four letters.

  • So it's a string of some combination of the characters A, C, G, and T.

  • Now, it might be that I know by heart the entire genome of a mouse

  • and I know what every single gene does.

  • Or at least I know what some of them do.

  • Now, over time we diverged from mice and various mutations happened to our DNA,

  • because whenever it gets copied, we sometimes make mistakes,

  • and somehow we ended up instead of being a proto mouse we ended up

  • being a modern human.

  • And that same proto mouse whatever ended up being a modern mouse.

  • through different mutations.

  • So if I start sequencing my DNA, I might find

  • a strip of it that corresponds to a particular gene,

  • and I don't know what it does, but I'd like to find out.

  • One way to do that would be to try and match it and find

  • the best possible place for this to match which gene of mouse DNA

  • does this match the best?

  • Keeping in mind that there might be some gaps, a gene might have disappeared,

  • a gene might have reappeared, or gotten added, and sometimes one of these bases

  • might have gotten replaced by a different one.

  • So we might have a mismatch.

  • So there's different things that could happen

  • in the process-- it won't be a perfect identical gene in the mouse to what's

  • in the human, it will just be mostly the same.

  • And I could use that to say, what's the best match,

  • and then use that to predict what this gene codes for,

  • or what the function of the protein it codes for is.

  • You can do the same thing with proteins, they're made up of amino acids.

  • You can look at that chain and look for similar things between one protein

  • and a database of proteins where you know what they do

  • to see if you can predict the function.

  • But the problem is there's a lot of different ways of doing it.

  • So let me write these on the board we've got AAC, AGT, TACC.

  • Now, let me try and come up with the worst possible match that I can.

  • TAAGGTCA.

  • So there's different possible ways that I could line these strings up.

  • They're not the same length, so I'm going to need some gaps somewhere.

  • There's different places I could try and put them and see

  • how the characters line up.

  • And the rule is going to be--

  • and this is based, in fact, on how common

  • it is for a mutation to occur where one base

  • switches to another one, versus a base is actually deleted or added

  • in genetic mutations.

  • It turns out that saying, it's roughly twice as likely that a base just

  • changes to another base than to have a base completely disappear or appear out

  • of nowhere.

  • during mutation.

  • So we will say that when we line up the strings,

  • any time we have a mismatch between two characters--

  • so two bases that are not the same--

  • we're going to have that a penalty of 1.

  • And we're a cost of one.

  • That's what we pay to force the strings to match up that way.

  • And any time we've got a gap, where, say, the C from the first string or DNA

  • strand doesn't match anything in the second one,

  • so that corresponds to a base getting inserted or deleted,

  • we're going to assign that a penalty or a cost of 2.

  • So we'd rather have two mutations--

  • two things were base turned into something else--

  • that seems about as likely as having something just get added or deleted.

  • And then I can add up all those costs, and I

  • get the cost of that particular way of matching the strings.

  • And I'm looking for the way to do this with the lowest cost possible match.

  • This is also called edit distance, because it's also

  • measuring how many edits would have to make to a text,

  • how many changes would I have to make to the text to turn it from one thing

  • into another.

  • This doesn't just have to be these four letters.

  • This could be anything.

  • But the thing that I've written on the board that I want you to see

  • is the worst possible way I could match these would be to say,

  • what if I just deleted the entire gene from the mouse

  • and then did a bunch of insertions to produce the entire gene from the human?

  • So each one of these characters, we say, matches to a Gap.

  • That is one possibility.

  • It's not a very good one.

  • That would be 2 plus 2 plus 2 plus 2 plus 2 plus 2 blah blah blah blah

  • blah blah blah.

  • That would cost a lot.

  • That's a very unlikely way that we would have gotten from this to this,

  • and we're looking for the most likely way.

  • But what it does say is that there's not really any point inserting yet more

  • gaps in the middle, because those things line up perfectly,

  • they won't change the matching score.

  • That's kind of silly.

  • Once this thing is completely gone, I might as well

  • start adding the new one in.

  • So the longest possible sequence where we try and line the two of these up

  • is the length of 1 plus the length of the other which here is about 18.

  • And now if we look at each of these columns.

  • We sort of had a choice.

  • The same way with a rod, we had a choice-- do we cut or do we not cut?

  • Here we have a choice of do we do a deletion--

  • so we had a letter on the first string, and we're

  • going to match it to a gap on the second string, as if this letter got deleted.

  • Or we could do an insertion, in which case it's kind of like this situation

  • where we had nothing and we added a letter.

  • We could do that at the beginning of the string.

  • Or we could actually try and match two letters.

  • So at each point here, we've got three choices we can make--

  • an insertion, a deletion, or a match.

  • And if we could make that choice independently,

  • between each of the letters we'd have three to the 17th possible combinations

  • of things to try.

  • That's a lot.

  • How much is 3 to the 17th?

  • Nobody knows this off the top of your head?

  • OK, the technical term is a lot.

  • That's a number.

  • It's defined as 3 of the 17th.

  • So again, we don't want to keep--

  • we don't want to try everything.

  • And we really want to do this.

  • Computational biologists really want to figure out, look

  • through a database of genes or of proteins

  • and try and match something to figure out what it might do.

  • Which parts of the genome that we just sequenced

  • or the DNA strand that we just sequenced are worth zeroing in on

  • and doing more studies on, because they might code for something important?

  • It turns out that we like to do this also on occasion,

  • or at least we do do it, when we want to look at two people's submissions

  • and see how similar they are.

  • So fundamentally, cheat checkers do this.

  • They want to quickly compute how many changes would I

  • have to make to one student's program to turn it into another student's program?

  • And once we've computed all of that, we want to look and find the ones where--

  • for most people maybe it took 1,000 edits,

  • but here we've got a pair where it only took two.

  • And that's kind of suspicious.

  • So I'm going to sort of rephrase the problem.

  • And it's going to be similar to rod cutting, where I'm going to say,

  • let's pretend that I already know how to match the right portion of two

  • of the strings.

  • So if I already know the best way to match from GTTACC to GTCA,

  • then to figure out the best way to match the whole thing

  • is I just have to figure out how to do the first part.

  • And this is kind of like saying, let me decide

  • where I'm going to put the first letter of the first word,

  • where I want to put the first gap, let's say, or the first insertion,

  • and then figure out the rest of it.

  • But I'm phrasing it going from the end to the beginning.

  • And I'm phrasing it going to the end to the beginning

  • kind of like I've shown here just because when we actually

  • compute the table, that's going to end up putting

  • the answer in a more logical place.

  • So we can kind of say, at the last position in this match,

  • are we going to have an insertion, are we going to have a deletion,

  • or are we going to try and match two characters?

  • And that gives us three possibilities.

  • And each one has a cost.

  • And so whatever the best way to line up these two strings is,

  • the last thing that happens is going to be one of these three things--

  • those are the only three possible ways for the last part of the match

  • to happen--

  • has to be an insertion or a deletion or a match

  • because they're the only options.

  • So if we know the costs of those, then we

  • can say the cost of matching the entire string is the cost of doing whichever

  • choice we make here plus the best edit distance cost, or the best alignment,

  • of whatever's left.

  • Which is slightly different in each of these three cases.

  • Except that if, for instance, we do a deletion, and then for this choice

  • where we're got TAAGGTCA at the bottom, the next time around we

  • tried to do an insertion, and then the best of whatever's left,

  • we would have AACAGTTACC matching to TAAGGTC, go figure that out.

  • And then we would have insert A then delete C.

  • Sorry, we would we would end up with these two strings.

  • But instead of having to try to match C to A,

  • we would have modeled one insertion and one deletion.

  • We'd have the same thing left here.

  • So as we build these things up, we're going

  • to try avoiding finding the best match for the same pair of strings twice.

  • And, realistically what's going to happen is we can start with, well,

  • what if both strings were empty?

  • Then they match perfectly.

  • No problem.

  • What happens if, at the end, I would like

  • to have the string C or the character C, be deleted.

  • So I would like to do some sort of match here

  • where the very last thing that happens is C is deleted,

  • and the rest of that TAA blah blah blah blah is somewhere over here.

  • But I'll worry about where later.

  • So in that case, we have a cost of 2, and when we were done,

  • we would have used up this character C, because we deleted it.

  • And so what we're left with after that is what is in the corresponding cell

  • to the right in the table, which is the cost of matching nothing to nothing.

  • Similarly, if we said the last thing that we want to have in the table

  • is CC matching to nothing, the cost of that

  • is 2 for the match, the deletion of C, plus the cost of matching nothing,

  • C to nothing, which was this cell in the table.

  • So we have 2 plus the cost of 2 that's right here is 4.

  • So in general, what's going to happen is if I look at the cell

  • where the little hand is right now--

  • that just disappeared-- right here, this is the cost--

  • we'll ultimately record the cost of matching GTTACC GGTCA.

  • And it will say, in order to get the best match between these,

  • so the smallest edit distance, if you will,

  • here's what the total cost will be, and it will tell you

  • should you try and match those two G's to each other, or should you do it as--

  • should you have a deletion or an insertion?

  • That's what's going to be stored in this cell of the table.

  • And ultimately that means in entry 0, 0 of your array

  • at the beginning of the table, will tell you the cost

  • of matching AACAGTTACC to TAAGGTCA.

  • It will tell you what the total cost is, and it

  • will tell you whether you should start with the insertion

  • or deletion or a match.

  • And depending on which one, that's going to use up characters from one

  • or both of the strings, which will move you to a new cell in the table that

  • will tell you what to do next.

  • So once it's primed, I can say, well, at the end, if I just

  • look at the end of each string, I have to match the letter C to the letter A.

  • And what is the best way to do that?

  • Well, I've got three options.

  • One option is I could delete the C. Oh, there's an error here,

  • this should read i plus 1.

  • So I could say, let me go over to cost i plus 1.

  • So I've used up the C. But I'm going to stay in--

  • sorry, i plus 1 is this row, but I'm going to stay in column j--

  • so I'm going to come down to here, which says I've used up the A,

  • so this A right here got matched to a gap, which is like an insertion.

  • It says somewhere at the end of the table,

  • I'm going to end with that A from the second string.

  • And, oh, I still have the C floating around.

  • So let me switch back to some white chalk.

  • But leftover, we still have a C here.

  • Because we were trying to match A to C, and so far we

  • haven't dealt with that C. So we would have

  • the cost of matching A to a gap, which is

  • 2, plus whatever the cost of matching C to nothing is, which is 2.

  • Or we could say what if, instead of doing that, we did C

  • and we'll delete C as opposed to inserting A,

  • but it means we still have an A floating around here that we haven't used yet.

  • That was an alternative way of matching C and A. So then we have a cost of 2,

  • because we're doing a deletion.

  • Plus we have left over the cost from the cell

  • to the right in the table of matching the string A to nothing,

  • so that would be a cost of 4.

  • Or we could try and match C and A. And when you match C and A,

  • well the cost of this is zero if the letters are the same and one

  • if they're different.

  • So in this case it's one, because c and A are not the same letter.

  • And we're left with two empty strings, because we've used up the C and the A.

  • So we end up with 1 plus the cost of whatever's over one

  • and down one in the table, which is 0.

  • So that total cost is 1.

  • And of those three possibilities, we just take the smallest one.

  • We say, ah, the smallest one was if I went diagonal-- if I matched A to C.

  • So that's what I'll record in the table.

  • Total cost of matching A to C is one, and the best thing to do

  • is to actually try and match these two characters.

  • It would be a mismatch.

  • And then look up in the table what to do with whatever is left.

  • And now I could say, well, what about matching the string CC to A?

  • I can do the same question.

  • I could say, I could start by matching the letter C and the letter A,

  • and now I've got a C leftover.

  • I could delete the first C and now I've got an A and a C leftover.

  • That would be going this way.

  • Sorry, if I match both.

  • I might have a C leftover.

  • If I delete-- if I just first start with a C and do a deletion,

  • I end up with a C and A leftover.

  • If the first thing I do is insert the A, I've got a CC leftover.

  • So each one of those I've got a penalty of 2

  • to go right, 2 to go down, one to go diagonal, that I add to whatever's

  • in that table.

  • I discover that if I try and match A to C I pay a penalty of one,

  • and I've got a C leftover that I have to delete,

  • so there's a total cost of three.

  • And that was as good an option as I had, so

  • that's what I filled in to the table.

  • And then I can try and figure out for the string ACC to A,

  • and just work backwards through this table and up the rows

  • until it's all filled in.

  • And the nice thing about this is every time when I look at the table,

  • I need three pieces of information.

  • I need the piece of information in the column just to my right.

  • I need the piece of information in the column in the row

  • immediately below me--

  • the cell immediately below me--

  • and I need the piece of information that's one down and one to the right.

  • And because I'm working backwards to the table,

  • it's nice and easy because the data's already there.

  • I started filling in the bottom row in the last column, which

  • are sort of special cases, and now I just

  • have some loops that run backwards through this.

  • And if one string is n letters long, that's n is in number,

  • and the other string is m, which is m as in the mnemonic,

  • then the total is m times n.

  • And I see Doug smiling at mnemonic.

  • I have to attribute that particular joke to Professor Mitzenmacher from Harvard,

  • as reported to me by my roommate, because I wasn't actually

  • in the class where he made the joke.

  • But I liked it.

  • So you can fill up the table.

  • And now, instead of having 3 to the n, essentially, things that you

  • had to try, you only have n squared.

  • So 18 letters.

  • 18 squared is a little less than 20 squared.

  • And 20 squared is 400.

  • So we've got at the most about 400 but in this case

  • it's not even n plus m squared, it's really just n squared.

  • It's really like 10 squared.

  • It's like 100.

  • So we've got roughly 100 different steps to find the answer.

  • If we tried everything, we'd be doing the same work

  • over and over and over again, and we'd do more like 3

  • to the 20th, which again, is a really big number.

  • 2 to the 20th is a million.

  • And 2 to the 3rd is like--

  • 3 is like 2 to the 1.2 or something, let's say, so it's roughly a million

  • to the 1.2.

  • It's about 130 million, there you go.

  • I knew we had an idiot savant in the room somewhere.

  • And again, each one of these cells in the table--

  • it tells you essentially what to do.

  • Do you match this first pair of letters?

  • Do you do a deletion?

  • Or do you do an insertion?

  • And from that you can figure out how much of each of the strings is left.

  • And then you look up in the next cell of the table that corresponds

  • to that what the next thing to do is.

  • And so to figure out exactly how these things match up

  • as well as just what is the cost of matching them up

  • in the best possible way, you work your way through the table.

  • Rather than trying to store all that information in each cell

  • and store the same information over and over again,

  • you store only one piece of the answer in each cell.

  • Questions?

  • Yes.

  • AUDIENCE: In the runtime, it says there's omn, is the m a constant,

  • or is it another--

  • what is that [INAUDIBLE]?

  • SPEAKER: Right, so in this running time, where it says OMN,

  • and what does that mean?

  • What we're saying is we've got two different strings.

  • One of them is n characters long, the other one is m characters long.

  • So in terms of the function of the input,

  • we're just being slightly more precise and saying,

  • we've got two words that are each n letters long.

  • And then it's sort of n squared.

  • Saying maybe they're very different lengths,

  • and so we can be a little more precise as to how many steps they take.

  • But the key thing there is you see one thing times another,

  • so that's kind of like something squared.

  • As you think about, as these strings get longer and longer,

  • how much work is this?

  • So this is not too bad.

  • And it's really actually pretty feasible to do this for DNA sequences

  • that might be a few hundred base pairs long.

  • It's feasible to do this for a source file

  • from homework for cheat checking-- is maybe a few thousand characters.

  • No big deal, even though we've got 1,000 students,

  • and we've got all the past submissions, so we've got 10,000 submissions,

  • so we've got roughly a million pairs of files.

  • Something like that, and that's not a big deal for a computer.

  • And each one of those takes somewhere like a million steps,

  • because you've got 1,000 characters times 1,000 characters.

  • We can manage that.

  • Other questions?

  • Yes.

  • AUDIENCE: Do you have to gain brute force to fill out the table intially?

  • SPEAKER: So do you have to do brute force to fill up the table initially?

  • That depends on how you define brute force.

  • You do have to fill up all the cells of the table.

  • Because in order to find the answer at the top left,

  • even though you're ultimately only going to use the value from one

  • of these three, you need to look at all three

  • values-- the one to the right and the two below.

  • So in order to find the final answer, you

  • have to have filled up the whole table.

  • And there's m by n entries in the table, which is where you get to this m times

  • n or n squared-ish kind of thing.

  • But compared to what you might think of as brute force of try

  • every possible alignment of the two strings

  • which involves lots of subalignments, lots

  • of parts that align the same way, in lots of different places,

  • we're not redoing those over and over.

  • So we're not doing this really brutal brute force that

  • would say there's something like 3 to the 2000 steps

  • to compare a single pair of files, and then we have to do a million of those.

  • That would be a problem.

  • Other questions?

  • So I've sort of folded two applications into one

  • right there with edit distance.

  • One is this computational biology where you treat a DNA strand or a protein

  • sequences as a string.

  • And the other was some sort of fancy file comparison

  • you might use for search and replace.

  • You might use this for spelling checking to figure out

  • the best word in the dictionary.

  • Not just is the word there, but of the words

  • that are in the dictionary, which one is most likely what

  • I meant to type based on some model of how frequent different kinds of errors

  • are.

  • And you could compute the edit distance to every word in the dictionary

  • to propose a list of suggestions.

  • But there's another application that I want to talk about,

  • which is image stitching.

  • This is what happens when you take out your cell phone

  • and you build a panoramic image.

  • So you sweep the camera around and it's going to take a video

  • or it might just end up taking a series of pictures every second or two.

  • And then it stitches them together into one big image.

  • And there's a few parts to that process.

  • One part is each of the pictures that got taken, figuring out

  • how your camera moved between each one.

  • So it can sort of transform the images as if--

  • to where they overlap in just the right places.

  • But even then, they won't line up perfectly, because maybe somebody's

  • walking around in the scene, or it turns out

  • that if you're actually walking along with your camera like this,

  • as opposed to just rotating around, you get a parallax effect

  • where things that are closer to you appear to be moving faster

  • than things that are further away.

  • So right now I see part of the middle chair.

  • The right hand side of it to me is hidden behind the tripod.

  • But if I walk over here, that's come into view and the left side of the seat

  • is hidden behind the tripod.

  • So within that panorama, maybe there's no perfect way to light up the images,

  • because there's different information visible even in the overlapping areas.

  • And to resolve that problem, the key is once you figure out

  • how the two images overlap, you want to figure out and take

  • part of that overlapping area from one image and part of it

  • from the other image.

  • So there's a seam between the two.

  • And a seam is just a connected line of pixels.

  • And everything on one side comes from image A, everything on the other side

  • comes from image B. And the question is where should that seam go?

  • Now, that seam should go, ideally, somewhere where we won't notice it.

  • So let's switch from one image to the other at a place

  • where they are really similar, and somewhere

  • where I'm seeing the parallax from the tripod moving,

  • or somebody was walking along in the scene.

  • I try and make this seam where I connect the two images go around that.

  • So it's not picking up sort of a jump where the two images were actually

  • different.

  • Well, the simplest way to do that is to look

  • at a pair of this overlapping rectangle, and say, let's draw this seam

  • where pixels are really similar.

  • And I could do that by looking, for instance, at the difference

  • in the pixel color between the two.

  • In this case, it's nice and easy because the images are black and white right

  • here, so I just take the difference in intensity,

  • which is a number between 0 and 255, and I'll

  • get a different number that's between 255 and negative 255.

  • So I might subtract white from black.

  • I take the absolute value of that.

  • If that's 0, it means that pixel was the same color in both images.

  • If it's really big, it means the colors were really different.

  • And now, I could say the cost of a seam going

  • through a particular pixel is that difference in values between the image

  • pixels--

  • the difference in intensity.

  • And so for the total seam, I want to figure out from there,

  • should I go right?

  • Should I go down?

  • Or should I go down and to the right?

  • Which I can do by looking up in the table

  • the cost of a seam passing through the pixel to the right or the pixel down

  • and to the right or the pixel to the left.

  • Here's the table, it's just that now, a cell in the table,

  • instead of saying the cost of matching ACCC for example to TCA,

  • filling in a cell in the table.

  • What that-- I'm going to interpret that cell as is

  • the cost of having the image seam run from here to the lower right corner.

  • And so in that sense, if you're looking at image stitching,

  • you can see that matrix.

  • It's this rectangle of overlapping pixels between the two images.

  • And the algorithm to find the optimal seam is identical to edit distance.

  • Absolutely identical.

  • The only difference-- even though I said it was absolutely identical,

  • there is of course a difference, otherwise it would be no fun--

  • but the difference and the reason I say the algorithm is identical

  • is this cost function to decide what is the total cost of this cell based

  • on the cost of the thing to the right, down and to the right

  • and down below, is different.

  • That cost function for edit distance was some number

  • depending on whether they matched or mismatched

  • or it was an insertion or deletion plus the cost of the thing to the right

  • or down or diagonal.

  • So there was a different penalty that you

  • paid for moving right in this table, or down, or diagonal.

  • With this image stitch right now, the way that I've defined it,

  • the cost is the same.

  • The penalty that you pay for having a seam go through a particular pixel

  • is the same even if you then go on to the right or diagonal or down.

  • It's just the difference in intensity.

  • So you would take the minimum of--

  • you would take difference in intensity plus the minimum

  • of what's to the right, what's down below, or what's diagonal to the right.

  • Slightly different cost function, but really the same algorithm.

  • And there's ways to implement this where you can essentially

  • say to the algorithm, OK, go fill in this table.

  • Use this cost function, so that you could use exactly the same code.

  • You implement this once and you can do edit distance

  • and you can also do image stitching.

  • And this is one of the respects in which algorithms

  • and the abstraction of computer science becomes

  • really interesting is when you start to see two problems that seem really

  • unrelated and turn out to be the same.

  • And it's even more fun when you can write

  • one program that does two completely different things for you

  • and don't even have to re-implement it.

  • So any questions?

  • I'm going to end, then, with a demo of one more algorithm that

  • uses dynamic programming--

  • and you can find various demos of this on the web-- this is just one of them--

  • called seam carving.

  • This is very similar to image stitching, where

  • we found that seam going from the upper left corner to the lower right corner.

  • The idea here is that we want to resize the image on the right.

  • And you did this in p set 3.

  • You resized images-- or p set 4, excuse me.

  • You resized images.

  • And if you made them narrower, they got kind of squashed,

  • and if you made them wider, they got kind of stretched,

  • and everything looked wrong.

  • Wouldn't it be nice if we could resize the image

  • but have it still look normal?

  • So the idea of the seam carving algorithm

  • is essentially to resize the image.

  • Instead of scaling every pixel to be a little narrower than it was before,

  • let's just take a row of pixels and delete it.

  • And so we might take in this row right here, or this column of pixels,

  • and delete it, and maybe nobody would notice.

  • The problem is you can only do this so many times in an image

  • before somebody notices.

  • Because you get some sort of jump, where you've deleted useful information.

  • And one of the reasons for that is that, for example, in this column right here,

  • in the water you could get away with deleting a pixel.

  • In the sky you could get away with deleting a pixel.

  • But in the balloon, you better not delete a pixel,

  • because that would be really obvious.

  • The balloon will start to look really funny really fast.

  • So instead of finding a straight line of pixels to delete,

  • let's find a wiggly line.

  • We're still deleting one pixel from every row,

  • so every row will end up being one pixel smaller than it was before,

  • which means the image will still be rectangular, but we have a choice.

  • If we delete a particular pixel at the top of the image, on the next row down

  • we could either delete the pixel right below it or the one

  • to the left or the one to the right.

  • So this is also pretty much-- we've got these three choices,

  • just like we had before--

  • slightly different cost function.

  • And then you've got-- you could start by deleting any one of the pixels

  • at the top.

  • You pick the one that's going to have the lowest cost.

  • And that cost might be how different are you than your neighboring

  • pixels on the left and right.

  • So if you're pretty much the same color as the thing you're left

  • and the thing to your right, nobody will notice if you go away.

  • And with just a little bit of extra work,

  • you can then update that information, that table,

  • to reflect the fact that you've deleted this one pixel from every row.

  • You can patch up all of the seams that that

  • would have affected to figure out if I need to squashed

  • by another pixel, what do I throw out.

  • And you can start resizing the image.

  • And you can see, it's deleting a lot of sky.

  • It's deleting a lot of water.

  • It's being pretty careful about how it thins out these bushes.

  • And thinning them a lot less than it thins some other things,

  • trying to do them in sensible ways so.

  • The stuff that we're used to seeing is mostly staying normal and empty space

  • is getting squashed.

  • But that empty space is in different places

  • depending on where you are in the image.

  • And down here, for example, we don't have a lot of empty space.

  • It's hard to delete grass without you knowing it,

  • because you're going to get half a blade of grass and it's going to look funny.

  • And there's only so much water you can delete before that becomes obvious.

  • So somewhere we had to start deflating the reflection of the balloon.

  • So it looks funny, but it looks less funny than if everything had just

  • been squashed by an equal amount.

  • As long as you save the information about what you were deleting,

  • you can re-expand it.

  • So you see, even if we'd squashed a whole bunch,

  • this balloon is still whole, because it's

  • so easy to delete pixels in the sky and still have the sky look good

  • that there's no need to shrink the balloon.

  • The reflection of the balloon-- that got harder because we're balancing it

  • off with deleting stuff in the grass.

  • So it takes just a little bit of work beyond that image seam

  • and edit distance problem to do this seam carving.

  • And it takes just a little bit more work to be

  • able to resize both horizontally and vertically, which is pretty cool.

  • And this is something that was published at the ACM Siggraph Conference, which

  • is a giant computer graphics conference every August, about 10 years ago,

  • I want to say.

  • I can tell you the exact date by looking for the reference--

  • 2007.

  • So yes, 10 years ago.

  • And it's actually a pretty short paper.

  • And it's really easy to understand, because it's a pretty simple algorithm.

  • And it's a really cool idea.

  • And so it's one of these rare papers that you

  • can sit down and read and understand.

  • You can go to the talk and understand.

  • And you can go home afterwards.

  • And if you've got it a little bit of a background in computer science,

  • and particularly in computer graphics, you can understand this.

  • And you can just sit down and in an hour you can implement it.

  • And so you get these web demos that popped up.

  • Photoshop can do this.

  • And it started appearing all over the place really fast,

  • because it was a really good idea.

  • It worked surprisingly well.

  • And it does that through this dynamic programming.

  • Let's not recompute information about how to get from some particular point,

  • how to delete a scene from there down to the bottom.

  • It doesn't compute that for every possibility

  • It just sort of figures out, well from here, the best thing to do

  • would be to go down or the best thing to do would be to go down and left.

  • And that pixel worries about how to go on from there.

  • Any questions?

  • OK, well thank you all for coming.

  • Remember that tomorrow morning there will be another lecture

  • streamed from Harvard.

  • And if you're at Harvard, you can always have fun going.

  • We'll talk about an introduction to Python,

  • which is another programming language like C.

  • It's got slightly different syntax, but the same basic deal.

  • It tends to get used for certain kinds of programs

  • where its syntax makes it a little bit easier to write them.

  • Then you will be getting instead of a programming assignment,

  • you'll be getting a fun more sort of written exercise

  • to work on over the weekend.

  • Remember your exam.

  • So that will come out after lecture tomorrow.

  • And you'll have three days for that.

  • Then next week at Yale, there are no sections because of our fall break.

  • So there will be no sections Tuesday or Wednesday.

  • You've got some time off.

  • You've got some time to recover.

  • You've got some time for sleep.

  • And next Friday, which is during our fall break,

  • there will be another lecture streamed from Harvard

  • that will be more about Python.

  • And p set 6 will be coming out.

  • It's likely to tie a lot of the things we've been talking about this week

  • and next week together into a programming assignment.

  • And you will of course have 10 days to work on that.

  • So you can look at it when it comes out, but it doesn't have

  • to ruin the rest of your fall break.

  • I promise.

  • And with that, again, thank you for coming.

  • This was CS50.

[MUSIC PLAYING]

字幕と単語

ワンタップで英和辞典検索 単語をクリックすると、意味が表示されます

B1 中級

CS50 2017 - 再生7 - ダイナミックプログラミング (CS50 2017 - Lecture 7 - Dynamic Programming)

  • 81 7
    小克 に公開 2021 年 01 月 14 日
動画の中の単語