Name: Intro - The Math of Intelligence
Uploaded: 2021-01-14T07:04:45.000Z
Duration: 11 min 17 s
Description: Learn English expressions while listening to the pronunciation in VoiceTube videos! English you can learn:

And welcome to "The Math of Intelligence"

For the next 3 months, we're going to take a journey through the most important math concepts that underlie machine learning.

That means all the concepts you need from the great disciplines of calculus, linear algebra, probability theory, and statistics.

The prerequisites are knowing basic Python syntax and algebra.

Every single algorithm we code will be done without using any popular machine learning library,

because the point of this course is to help you build a solid mathematical intuition around building algorithms that can learn from data.

I mean, let's face it, you could just use a black box API for all this stuff, but if you have the intuition, you'll know exactly which algorithm to use for the job or even custom-make your own from scratch.

As humans, we are constantly receiving data through our five senses and somehow we've got to make sense of all this chaotic input so that we can survive.

Thanks to the evolutionary process we've developed brains capable of doing this.

We've got the most precious resource in the universe: intelligence,

the ability to learn and apply knowledge.

One way to measure our intelligence against the rest of the animal kingdom is using a ladder.

Ours is indeed the most generalized type of intelligence, capable of being applied to the widest variety of tasks.

But that doesn't mean that we are necessarily the best kind of intelligence.

In the 1960s, a primate researcher named Dr. Jane Goodall concluded

that chimpanzees had been living in the forest for hundreds of thousands of years without overpopulating or destroying their environment at all.

Orcas have the ability to sleep with one hemisphere of their brain at a time, which allows them to recuperate, while at the same time being aware of their surroundings.

In some ways animals are more intelligent than us.

Intelligence consists of many dimensions.

Think of it like a multi-dimensional space of possibility.

When building an AI, the human brain is a great road map. After all, neural networks have achieved state-of-the-art performance in countless tasks,

but it's not the only road map, there are many possible types of intelligence out there that we can and will create.

Some will seem familiar to us, and some very alien.

Thinking in a way we've never done before.

Even the best Go players in the world were stunned at the move.

It went against everything we've learned about the game from millennia of practice, but it turned out to be an objectively better strategy that led to its win.

The many different types of intelligence are like symphonies,

each comprising different instruments, and

these instruments vary, not just in their dynamics but in their pitch and tempo and color and melody.

The amount of data that we're generating is growing really fast.

In the time since you started watching this video enough data was generated for you to spend an entire lifetime analyzing.

Creating intelligence isn't just a nice-to-have, it's a necessity.

Put in the right hands, it will help us solve problems we never dreamed could be possible to solve.

At its core, machine learning is all about mathematical optimization.

Every single problem can be broken down into an optimization problem.

Once we have some data set that acts as our input, we'll build a model that uses that data to optimize for an objective - a goal that we want to reach.

And the way it does this is by minimizing some error value that we define.

One example problem could be, "what should I wear today?"

I could frame this as optimizing for stylishness, instead of say, comfort,

then define the error that I want to minimize as the number of negative ratings a group of people give me.

Or even, "what's the best design for my iOS app's homepage?"

Rather than hardcoding in some elements, I could find a data set of app designs and their ratings from users.

If I want to optimize for the design that would be the highest rated, I would learn the mapping between design styles and ratings.

This is the way that every single layer of the stack will be built in the future.

There are different techniques we can use to find patterns in this data.

And sometimes optimizing for an objective can happen not through the frame of pattern recognition but

through the exploration of many possibilities and seeing what works and what doesn't.

There are many ways that we can frame the learning process...

But the easiest way to learn is when we use labeled data.

Mathematically speaking we have some input.

There's a domain, X, where every point of X has features that we observe.

Then we have a label set Y. So the data consists of a set of labeled examples that we can denote this way.

The output, then, would be a prediction rule. So given a new X value, what’s its associated Y value?

We’ve gotta learn this mapping, which is an unknown distribution over X,

to be able to answer this. So we have to measure some error function that acts as a performance metric.
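The on-screen notation for this setup is not captured in the transcript. A common way to write it down, offered here only as an assumed sketch, is shown below. The symbols S (the labeled examples), N (their count), h (the prediction rule), and the per-example loss ℓ are illustrative names that do not come from the video.

```latex
% Labeled examples drawn from the domain X and the label set Y
S = \bigl((x_1, y_1), \dots, (x_N, y_N)\bigr), \qquad x_i \in X,\; y_i \in Y

% The prediction rule maps a new x value to a y value
h : X \to Y

% One possible error function: the average per-example loss over S
L_S(h) = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(h(x_i),\, y_i\bigr)
```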

So what we’d do is choose from a number of possible models to represent this function.

We’ll initially set some parameter values to represent the mapping, then we’d evaluate the initial result,

measure the error, update the parameters, and repeat this process, optimizing the model again and again until it fully learns the mapping.

So that brings us to the main topic of this video: first-order optimization. What is this?

Was it convex or concave functions that were easier to optimize? I think convex. I really hope my lab partner is epic at optimization.

I guess I should be thankful. Not many data scientists get a grant from CERN to detect the Higgs boson.

What was her name again? Eloise, I think. Yup, she did win an award at ICML. I wonder if she’s cute?

No, that doesn’t matter. I am not going to mix business and pleasure, not this time.

Suppose I’ve got a bunch of data points. These are just toy data points, like what Apple probably trained Siri on.

They’re all x-y value pairs where x represents the distance a person bikes,

and y represents the amount of calories they lost. We can just plot them on a graph like so.

We want to be able to predict the calories lost for a new person given their biking distance.

How should we do this? Well, we could try to draw a line that fits through all the data points, but it seems like our points are too spaced out for a straight line to pass through all of them.

So we can settle for drawing the line of best fit, a line that passes as close as possible to the data points.

Algebra tells us that the equation for a straight line is of the form y = mx + b,

where m represents the slope or steepness of the line and b represents its y-axis intercept point.

We want to find the optimal values for b and m such that

the line fits the points as closely as possible, so given any new x value, we can plug it into our equation and it’ll output the most likely y value.

Our error metric can be a measure of closeness, which we can define like this. So let’s start off with a random b and m value and plot this line.

For every single data point we have, let’s calculate its associated y value on the random line we just drew.

Then we’ll subtract the actual y value from it to measure the distance between the two.

We’ll want to square this error to make our next steps easier.

Once we sum all these values we get a single value that represents our error given that line we just drew.
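The error calculation just described can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's own code. The five data points are made-up toy values, and the function names `predict` and `sum_squared_error` are chosen here for clarity.

```python
# Made-up toy data: (distance biked, calories lost). Not from the video.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]


def predict(x, m, b):
    """Return the y value that the line y = m*x + b assigns to x."""
    return m * x + b


def sum_squared_error(points, m, b):
    """Add up the squared gap between the line and each actual y value."""
    total = 0.0
    for x, y in points:
        gap = predict(x, m, b) - y  # line's y value minus the actual y value
        total += gap ** 2           # square it so positive and negative gaps both count
    return total


# Error for one arbitrary line (m = 0, b = 0) and for a line that fits better.
print(sum_squared_error(points, m=0.0, b=0.0))
print(sum_squared_error(points, m=0.8, b=1.8))
```

A lower number means the line sits closer to the points overall, so the second line is the better fit for this toy data.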

Now if we did this process repeatedly, say 666 times, for a bunch of different randomly drawn lines,

we could create a 3D graph that shows the error value for every associated b and m value.
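To make the "many random lines" idea concrete, here is a rough sketch that draws 666 random lines and keeps the one with the lowest error. The toy data, the sampling range of -5 to 5 for both m and b, and the fixed random seed are all assumptions made for this illustration.

```python
import random

# Made-up toy data: (distance biked, calories lost). Not from the video.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]


def sum_squared_error(points, m, b):
    """Sum of squared gaps between the line y = m*x + b and the data."""
    return sum((m * x + b - y) ** 2 for x, y in points)


def random_search(points, trials=666, seed=0):
    """Try `trials` random lines and return (error, m, b) for the best one seen."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    best = None
    for _ in range(trials):
        m = rng.uniform(-5.0, 5.0)  # assumed search range for the slope
        b = rng.uniform(-5.0, 5.0)  # assumed search range for the intercept
        error = sum_squared_error(points, m, b)
        if best is None or error < best[0]:
            best = (error, m, b)
    return best


best_error, best_m, best_b = random_search(points)
print(best_error, best_m, best_b)
```

Each (m, b, error) triple visited by the loop is one point on the 3D error surface. The search only stumbles toward the valley by chance, which is the inefficiency the rest of the video addresses.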

Notice how there is a valley in this graph. At the bottom of this valley, the error is at its smallest.

And so the associated b and m values would be the line of best fit, where the distance between all our data points and our line would be the smallest!

But how do we find it? Well we’ll need to try out a bunch of different lines to create this 3D graph.

But rather than just randomly drawing lines over and over again with no signal, what if we could do it in a more efficient way,

such that each successive line we draw brings us closer and closer to the bottom of this valley?

We need a direction, a way to descend this valley. What if, for a given function, we could find the slope of it at a given point?

Then that slope would tell us which direction to move in to head towards the minimum of the graph.

And when we re-draw our line over and over again, we could do so using the slope as our compass, as our guide on how best to redraw as we “walk through the valley of the shadow of death”

towards the minimum, until our slope approaches 0. In calculus, we call this slope the derivative of a function.
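The "slope as compass" idea can be tried on a one-variable function before moving to the line-fitting problem. The sketch below assumes a simple valley-shaped function, estimates the slope numerically, and steps in the direction opposite to the slope until the slope is nearly 0. The function `f`, the step size, and the tolerance are illustrative choices, not values from the video.

```python
def f(x):
    """A simple valley-shaped function whose lowest point is at x = 3."""
    return (x - 3.0) ** 2


def slope(func, x, h=1e-6):
    """Estimate the slope of func at x with a central finite difference."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def descend(func, x, step_size=0.1, tolerance=1e-6, max_steps=10_000):
    """Step against the slope until the slope is close to 0."""
    for _ in range(max_steps):
        g = slope(func, x)
        if abs(g) < tolerance:   # nearly flat: we are at the bottom of the valley
            break
        x -= step_size * g       # positive slope means move left, negative means move right
    return x


x_min = descend(f, x=-4.0)
print(x_min)
```

The sign of the slope says which way is uphill, so subtracting it moves downhill. As the valley floor gets closer, the slope shrinks and the steps get smaller on their own.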

Since we are updating two values, b and m, we want to calculate the derivative with respect to each of them: the partial derivatives.

The partial derivative with respect to a variable means that we differentiate with respect to that variable while treating the others as constants.

So we’ll compute the partial derivative with respect to b. Then the partial derivative with respect to m.
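Written out, and assuming the error is the sum of squared gaps over N data points as described earlier (the video's exact on-screen formula is not in the transcript, and it may include a constant factor such as 1/N), the two partial derivatives would look roughly like this. E and N are names introduced here for illustration.

```latex
% Assumed error: squared gap between the line and each point, summed
E(m, b) = \sum_{i=1}^{N} \bigl(m x_i + b - y_i\bigr)^2

% Partial derivative with respect to b (treat m as a constant)
\frac{\partial E}{\partial b} = \sum_{i=1}^{N} 2\,\bigl(m x_i + b - y_i\bigr)

% Partial derivative with respect to m (treat b as a constant)
\frac{\partial E}{\partial m} = \sum_{i=1}^{N} 2\,\bigl(m x_i + b - y_i\bigr)\, x_i
```

In each case the exponent 2 comes down as a multiplier and the exponent drops to 1. The extra factor of x_i in the second line appears because m is multiplied by x_i inside the parentheses.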

To do this we use the power rule. We multiply the exponent by the coefficient and subtract 1 from the exponent.
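Putting the pieces together, a bare-bones version of the whole procedure might look like the sketch below. It assumes the summed squared error from earlier, uses made-up toy data, and introduces a `learning_rate` and a fixed number of `steps` that are not specified in this part of the video. Treat it as one possible illustration, not as the course's implementation.

```python
# Made-up toy data: (distance biked, calories lost). Not from the video.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 4.0), (5.0, 6.0)]


def gradients(points, m, b):
    """Partial derivatives of the summed squared error with respect to m and b."""
    grad_m = 0.0
    grad_b = 0.0
    for x, y in points:
        gap = (m * x + b) - y      # line's y value minus the actual y value
        grad_m += 2.0 * gap * x    # power rule, times x because m multiplies x
        grad_b += 2.0 * gap        # power rule only
    return grad_m, grad_b


def gradient_descent(points, m=0.0, b=0.0, learning_rate=0.01, steps=5000):
    """Repeatedly nudge m and b against their slopes to reduce the error."""
    for _ in range(steps):
        grad_m, grad_b = gradients(points, m, b)
        m -= learning_rate * grad_m  # assumed step size; too large a value would diverge
        b -= learning_rate * grad_b
    return m, b


m, b = gradient_descent(points)
print(m, b)
```

For this toy data the values settle at approximately m = 0.8 and b = 1.8, which is also what the closed-form least-squares solution gives. The starting values of 0 are arbitrary; a random start, as in the video, would head to the same place.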