  • The following content is provided under a Creative

  • Commons license.

  • Your support will help MIT OpenCourseWare

  • continue to offer high quality educational resources for free.

  • To make a donation, or to view additional materials

  • from hundreds of MIT courses, visit MIT OpenCourseWare

  • at ocw.mit.edu.

  • GILBERT STRANG: So this is a pretty key lecture.

  • This lecture is about principal component analysis, PCA--

  • which is a major tool in understanding a matrix of data.

  • So what is PCA about?

  • Well first of all, let me remember what

  • was the whole point of last--

  • yesterday's lecture-- the singular value decomposition,

  • that any matrix A could be broken into r rank 1 pieces--

  • r being the rank of the matrix.

  • And each piece has a U times a V transpose.

  • And the good-- special thing is, the U's are orthonormal,

  • and also, the V's are orthonormal.

  • OK.

  • So that's the whole matrix.
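
In symbols (an editorial restatement of the decomposition just described, with sigma_i denoting the singular values, which enter the lecture shortly):

$$
A \;=\; \sigma_1 u_1 v_1^{\mathsf T} \;+\; \sigma_2 u_2 v_2^{\mathsf T} \;+\; \cdots \;+\; \sigma_r u_r v_r^{\mathsf T},
$$

with orthonormal $u_i$, orthonormal $v_i$, and $r$ the rank of $A$.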

  • But we have a big matrix, and we want

  • to get the important information out of it--

  • not all the information.

  • And people say, in machine learning,

  • if you've learned all the training data, you haven't

  • learned anything, really.

  • You've just copied it all in.

  • The whole point of neural nets and the process

  • of machine learning is to learn important facts about the data.

  • And now, here we're at the most basic stage of that.

  • And I claim that the important facts about the matrix

  • are in its largest k singular values--

  • the largest k pieces.

  • We can take-- k equal 1 would tell us

  • the largest single piece.

  • But maybe we have space and computing power

  • to handle a hundred pieces.

  • So I would take k equal 100.

  • The matrix might have rank in the thousands.

  • So I claim that Ak is the best.

  • Now here's the one theorem for today, that Ak--

  • using the first k pieces of the SVD--

  • is the best approximation to A of rank k.

  • So I'll write that down.

  • So that really says why the SVD is perfect.
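
Written out, the rank-k matrix referred to here is the partial sum of the first k pieces of the SVD:

$$
A_k \;=\; \sigma_1 u_1 v_1^{\mathsf T} + \sigma_2 u_2 v_2^{\mathsf T} + \cdots + \sigma_k u_k v_k^{\mathsf T}.
$$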

  • OK.

  • So that statement says, that if B--

  • another matrix-- has rank k, then the distance from A to B--

  • the error you're making in just using B--

  • that error is greater than or equal to the error

  • you make for the best guy.
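
The statement on the board, in symbols (where the norm is any of the three introduced below):

$$
\text{if } \operatorname{rank}(B) = k, \qquad \|A - B\| \;\ge\; \|A - A_k\|.
$$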

  • Now that's a pretty straightforward,

  • beautiful fact.

  • And it goes back to people who discovered

  • the SVD in the first place.

  • But then a couple of psychologists

  • gave a proof in a later paper--

  • and it's often called the Eckart-Young Theorem.

  • There is the theorem.

  • Isn't that straightforward?

  • And the hypothesis is straightforward.

  • That's pretty nice.

  • But of course, we have to think, why is it true?

  • Why is it true?

  • And to give meaning to the theorem,

  • we have to say what these double bars are.

  • Do you know the right name for this?

  • So that double bar around a matrix is called the--

  • the norm of the matrix, the norm.

  • So I have to say something about matrix norms.

  • How big is-- that's a measure of how big it is.

  • And what I have to say is, there are many different measures

  • of a matrix--

  • how large that matrix is.

  • Let me tell you, for today, three possible measures

  • of a matrix.

  • So different ways to measure--

  • I'll call the matrix just A, maybe.

  • But then I'm going to apply the measure to A minus B,

  • and to A minus Ak, and show that that is smaller.

  • OK.

  • So I want to tell you about the norm of A--

  • about some possible norms of A. And actually, the norms I'm

  • going to take today will be--

  • will have the special feature that they can be found--

  • computed by their singular values.

  • So let me mention the L2 norm.

  • That is the largest singular value.

  • So that's an important measure of the--

  • sort of the size of a matrix.

  • I'm talking here about a general m by n matrix

  • A. Sigma 1 is an important norm--

  • often called the L2 norm.

  • And that's where that index 2 goes.
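
In symbols, this first matrix norm is

$$
\|A\|_2 \;=\; \sigma_1,
$$

the largest singular value of the m by n matrix A.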

  • Oh.

  • I should really start with vectors--

  • norms of vectors-- and then build to the norms of matrices.

  • Let me do norms of vectors over on this side.

  • The L2 norm of a vector--

  • do we know what that is?

  • That's the regular length of the vector that we all expect--

  • the square root of v1 squared up to vn squared.

  • The hypotenuse-- the length of the hypotenuse

  • in n dimensional space.

  • That's the L2 norm, because of that 2.

  • The L1 norm of a vector is just add up those pieces

  • without squaring and square rooting them.

  • Just add them.

  • That's the L1 norm.

  • And you might say, why do we want two norms?

  • Or there are more norms.

  • Let me just tell you one more.

  • The infinity norm-- and there is a reason for the 1 and the 2

  • and the infinity--

  • is the largest of the v's.
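
Collecting the three vector norms just described (with absolute values, as in the standard definitions), for $v = (v_1, \ldots, v_n)$:

$$
\|v\|_2 = \sqrt{v_1^2 + \cdots + v_n^2}, \qquad
\|v\|_1 = |v_1| + \cdots + |v_n|, \qquad
\|v\|_\infty = \max_i |v_i|.
$$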

  • OK.

  • Have you met norms before?

  • I don't know.

  • These are vector norms, but maybe you have met.

  • Then we're going to have matrix norms, that maybe will be new.

  • So this is the norm that we usually think of.

  • But this one has become really, really important,

  • and let me tell you just why.

  • And then we'll-- later section of the notes and a later

  • lecture in this course will develop that--

  • develop this.

  • This is the L1 norm.

  • So this is L2, L1, and L infinity--

  • [INAUDIBLE]

  • So what's special about this one?

  • Well, it just turned out-- and it was only discovered

  • in our lifetimes--

  • that when you minimize some function using the L1 norm,

  • you minimize some, let's say, signal-to-noise,

  • or whatever you minimize--

  • some function.

  • If you use L1, the winning vector--

  • the minimizing vector-- turns out to be sparse.

  • And what does sparse mean?

  • Sparse means mostly zero components.

  • Somehow, when I minimize in L2--

  • which historically goes back to Gauss,

  • the greatest mathematician of all time.

  • When you minimize something in L2, you do the least squares.

  • And you find that the guy that gives you the minimum

  • has a lot of little numbers--

  • lot of little components.

  • Because when you square those little ones,

  • they don't hurt much.

  • But Gauss-- so Gauss didn't minimize in the L1 norm.

  • That has different names-- basis pursuit.

  • And it comes into signal processing and sensing.

  • Right.

  • And then it was discovered that if you minimize--

  • as we'll see in that norm--

  • you amazingly get-- the winning vector has--

  • is mostly zeros.

  • And the advantage of that is that you can understand

  • what its components are.

  • The one with many small components,

  • you have no interpretation for that answer.

  • But for an answer that just has a few non-zero components,

  • you really see what's happening.
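
The sparsity effect described here is easy to see numerically. Below is a minimal sketch, not from the lecture: it compares the minimum-L2 solution of an underdetermined system (the least-squares / pseudoinverse answer) with the minimum-L1 solution (basis pursuit, solved as a linear program with scipy). The matrix, the sizes, and the planted sparse signal are all made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30))        # 10 equations, 30 unknowns
x_true = np.zeros(30)
x_true[[3, 17, 25]] = [1.5, -2.0, 0.8]   # a genuinely sparse "signal"
b = A @ x_true

# Minimum-L2-norm solution: least squares via the pseudoinverse.
x_l2 = np.linalg.pinv(A) @ b             # typically all 30 components are nonzero

# Minimum-L1-norm solution (basis pursuit): split x = p - q with p, q >= 0
# and minimize sum(p) + sum(q) subject to A (p - q) = b.
n = A.shape[1]
res = linprog(c=np.ones(2 * n),
              A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=(0, None))
x_l1 = res.x[:n] - res.x[n:]

print("nonzeros in L2 solution:", int(np.sum(np.abs(x_l2) > 1e-6)))
print("nonzeros in L1 solution:", int(np.sum(np.abs(x_l1) > 1e-6)))
```

The L1 minimizer typically comes back with only a handful of nonzero components (often exactly the planted ones), while the L2 minimizer spreads the answer over many small components.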

  • And then this is an important one, too.

  • OK.

  • Now I'm going to turn just to-- so

  • what's the property of a norm?

  • Well, you can see that the norm of C times a vector is--

  • just multiplying by 6, or 11, or minus

  • pi, or whatever-- is the size of C times the norm of the vector. Norms

  • have that nice property.

  • They're homogeneous, or whatever word.

  • If you double the vector, you should double the norm--

  • double the length.

  • That makes sense.

  • And then the important property is that--

  • is the famous triangle inequality--

  • that if v and w are two sides of a triangle,

  • and you take the norm of v and add to the norm of w--

  • the two sides--

  • you get more than the straight norm along the hypotenuse.

  • Yeah.

  • So those are properties that we require,

  • and the fact that the norm is positive, which is--

  • I won't write down.

  • But it's important too.
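
Written out, the required properties of a norm are

$$
\|c\,v\| = |c|\,\|v\|, \qquad \|v + w\| \le \|v\| + \|w\|, \qquad \|v\| > 0 \ \text{for } v \ne 0.
$$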

  • OK.

  • So those are norms, and those will apply also

  • to matrix norms.

  • So if I double the matrix, I want to double its norm.

  • And of course, that works for that 2 norm.

  • And actually, probably-- so the triangle inequality for this

  • norm is saying that the largest singular value of A plus B--

  • two matrices-- is less than or equal to the largest

  • singular value of A plus the largest singular value of B.

  • And that's-- we won't take class time to check minor,

  • straightforward things like that.
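
That is, in terms of largest singular values:

$$
\sigma_1(A + B) \;\le\; \sigma_1(A) + \sigma_1(B).
$$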

  • So now I'm going to continue with the three norms

  • that I want to tell you about.

  • That's a very important one.

  • Then there is another norm that's named--

  • has an F. And it's named after Frobenius.

  • Sorry about that.

  • And what is that norm?

  • That norm looks at all the entries in the matrix--

  • just like it was a long vector--

  • and squares them all, and adds them up.

  • So in a way, it's like the 2 norm for a vector.

  • It's-- so the squared--

  • or shall I put square root?

  • Maybe I should.

  • It's the square root of all the little people in the matrix.

  • So a(1,n) squared, plus the next, a(2,1) squared, and so on.

  • You finally get to a(m,n) squared.

  • You just treat the matrix like a long vector.

  • And take this square root just like so.

  • That's the Frobenius norm.
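
In symbols, and using the fact mentioned earlier that today's norms can all be computed from the singular values:

$$
\|A\|_F \;=\; \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2} \;=\; \sqrt{\sigma_1^2 + \cdots + \sigma_r^2}.
$$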

  • And then finally, not so well known,

  • is something that's more like L1.

  • It's called the nuclear norm.

  • And not all the faculty would know about this nuclear norm.

  • So it is the sum of the sigmas-- of the singular values.

  • I guess there are r of them.

  • So that's where we would stop.
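
In symbols (the nuclear norm is often written with a subscript N or a star):

$$
\|A\|_N \;=\; \sigma_1 + \sigma_2 + \cdots + \sigma_r.
$$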

  • Oh, OK.

  • So those are three norms.
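
A small numerical check, not from the lecture, that all three matrix norms can be read off from the singular values; the 5 by 7 random matrix is made up purely for illustration, and numpy's built-in norms are used for comparison.

```python
import numpy as np

A = np.random.default_rng(1).standard_normal((5, 7))
sigma = np.linalg.svd(A, compute_uv=False)      # singular values, largest first

print(np.isclose(np.linalg.norm(A, 2), sigma[0]))                       # L2 norm = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(sigma**2))))  # Frobenius norm
print(np.isclose(np.linalg.norm(A, 'nuc'), np.sum(sigma)))              # nuclear norm
```

Each line should print True.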

  • Now why do I pick on those three norms?

  • And here's the point--

  • that for those three norms, this statement is true.

  • I could cook up other matrix norms

  • for which this wouldn't work.

  • But for these three highly important norms,

  • this Eckart-Young statement holds-- that the closest rank k

  • approximation is found from the first k pieces.

  • You see, that's a good thing, because this is

  • what we compute from the SVD.

  • So now we've solved an approximation problem.

  • We found the best B is Ak.

  • And the point is, it could use all the-- any of those norms.
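
Here is a quick empirical illustration, not a proof and not from the lecture: build A_k from the first k pieces of the SVD and check that a rival rank-k matrix B does no better in any of the three norms. The sizes, the seed, and the random rival B are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 15))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # first k pieces of the SVD

# One rival of rank k (any rank-k product of two factors):
B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 15))

for ord_ in (2, 'fro', 'nuc'):
    print(ord_, np.linalg.norm(A - B, ord_) >= np.linalg.norm(A - A_k, ord_))
    # expected: True in every norm
```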

  • So there would be a--

  • well, somebody finally came up with a proof that

  • does all three norms at once.

  • In the notes, I do that one separately from Frobenius.

  • And actually, I found--

  • in an MIT thesis--

  • I was just reading a course 6 PhD thesis--

  • and the author-- who is speaking tomorrow, or Friday in IDSS--

  • Dr. [? Cerebro ?] found a nice new proof of Frobenius.

  • And it's in the notes, as well as an older proof.

  • OK.

  • You know, as I talk here, I'm not too sure

  • whether it is essential for me to go through the proof,

  • either in the L2 norm--

  • which takes half a page in the notes--

  • or in the Frobenius norm, which takes more.

  • I'd rather you see the point.

  • The point is that, in these norms-- and now,

  • what is special about these norms