Name: Data Analysis 5: Data Reduction - Computerphile
Uploaded: 2021-01-14T10:13:27.000Z
Duration: 17 min 50 s
Description: Learn English expressions while listening to the pronunciation in VoiceTube videos! English you can learn:

Let's imagine that you work for a major streaming media provider, right? So you have, I don't know, some 100 million users.

So you've got, I don't know, ten thousand videos on your site, or many more audio files, right?

So for each user you're gonna have collected information on what they've watched, when they've watched it, how long they've watched it for, whether they

went from this one to this one. Did that work? Was that good for them? And

so maybe you've got 30,000 data points per user.

We're now talking about trillions of data points, and your job is to try and predict what someone wants to watch or listen to next.
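
As a rough back-of-the-envelope check of the scale just described, here is a minimal sketch using only the figures quoted in the talk (some 100 million users, maybe 30,000 data points each). The variable names are illustrative assumptions, not taken from any real system.

```python
# Rough scale estimate using the figures quoted in the talk.
users = 100_000_000        # "some 100 million" users
points_per_user = 30_000   # "maybe 30,000 data points per user"

total_points = users * points_per_user
print(f"{total_points:,} data points")              # 3,000,000,000,000 data points
print(f"about {total_points / 1e12:.0f} trillion")  # about 3 trillion
```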

So we've cleaned the data, we've transformed our data, everything's on the same scale, we've joined data sets together.

The problem is, because we've joined data sets together, perhaps our data set has got quite large, right?

Or maybe we just work for a company that has a lot, a lot of data. Certainly the

general consensus these days is to collect as much data as you can. This isn't always a good idea.

What we want is the smallest, most compact and useful data set we can get; otherwise you're just going to be wasting

CPU hours or GPU hours training on this, wasting time.

We want to get to the knowledge as quickly as possible,

and if you can do that with a small amount of data, that's going to be great.

So we've got quite an interesting data set to look at today based on music

It's quite common these days, when you're building something like a streaming service, for example Spotify,

you might want to have a recommender system.

This is an idea where you've maybe clustered people who are similar in their tastes. You know

what kind of music they're listening to, and you know the

attributes of that music, and if you know that, you can say, well, this person likes high-tempo music,

so maybe he'd like this track as well. And this is how playlists are generated.

One of the problems is that you're gonna have to produce

descriptions of the audio, on things like tempo and how upbeat they are, in order to machine learn on this kind of system.

Right, and that's what this data set's about. So we've collected a data set here today that is

lots and lots of metadata on music tracks, right? Now, these are freely available

tracks and freely available data, and I'll put a link in the description if you want to have a look at it yourself.

I've cleaned it up a bit already, because obviously I've been through the process of cleaning and transforming my data.

So we're gonna load this now. This takes quite a long time to do

because there's quite a lot of attributes and quite a lot of instances.

It's loaded, right? How much is this data? Well, we've got 13,500

observations, that's instances, and we've got seven hundred and sixty-two attributes, right?

So that means, another way of putting this, in sort of machine learning parlance, is we've got thirteen thousand instances and

760 features. Now, these features are a combination of things. So let's have a quick look at the columns

we're looking at, so we can see what this data set's about. So, names of

760 features or attributes, and you can see there's a lot of slightly meaningless text here.

But if we look at the top you'll see some actual things that may be familiar to us

So we've got the track ID, album ID, the genre, right?

So genre is an interesting one, because maybe we can start to use

some of these audio descriptions to predict what genre the music is, or something like that.

Things like the track number and the track duration, and

then we get on to the actual audio description features. Now, these have been generated by two different libraries.

The first is called Librosa, which is a publicly available library for taking an mp3 and

calculating musical sort of attributes of it.

What we're trying to do here is represent our data in terms of attributes. An mp3 file is not an attribute.

It's a lot of data. So can we summarize it in some way? Can we calculate, by looking at the mp3,

what the tempo is, what the amplitude is, how loud the track is? These kinds of things. This is the kind of thing

we're measuring, and a lot of these are going to go into a lot of detail down at kind of a waveform level.
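
To make the idea of summarizing a waveform as a handful of attributes concrete, here is a minimal sketch in plain Python. It does not use Librosa or reproduce its features; the `summarize` function, its feature names, and the synthetic sine wave standing in for decoded audio are all assumptions made for illustration.

```python
import math

def summarize(samples, sample_rate):
    """Reduce a raw waveform to a few attributes (hypothetical feature set)."""
    n = len(samples)
    return {
        "duration_s": n / sample_rate,
        "rms": math.sqrt(sum(s * s for s in samples) / n),  # loudness proxy
        "peak": max(abs(s) for s in samples),               # peak amplitude
    }

# Stand-in for decoded audio: one second of a 440 Hz sine at amplitude 0.5.
sample_rate = 8000
samples = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate)
           for t in range(sample_rate)]

features = summarize(samples, sample_rate)
print(features)
```

Eight thousand raw samples become three numbers. Real audio features such as tempo are far more involved, but the reduction from waveform to attributes has the same shape.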

So we have the Librosa features first, and then if we scroll down,

after a while we get to some Echo Nest features. Echo Nest is a company that

produces very interesting features on music, and actually these are the features that power Spotify's recommender system and numerous others.

We've got things like acousticness: how acoustic does it sound? We've got instrumentalness.

I'm not convinced that's a word. Speechiness: to what extent is it speech or not speech?

And then things like tempo, how fast is it, and valence:

how happy does it sound, right? A track of zero would be quite sad,

I guess, and a track of one would be really happy and upbeat. And then of course

we've got a load of features I've labeled temporal here, and these are going to be based on the actual music data themselves.

What we're actually doing here is dimensionality reduction.

Well, one way of thinking about it is, as we started, we've been looking at things like attributes and we've been saying: what is the

mean or the standard deviation of some attribute on our data?

But actually when we start to talk about clustering and machine learning,

we're going to talk a little bit more about dimensions. Now, in many ways

the number of attributes is the number of dimensions.

It's just another term for the same thing, but certainly from a machine learning background

we refer to a lot of these things as dimensions. So you can imagine you've got some data here.

So you've got your instances down here and you've got your attributes across here.

So in this case, our music data, we've got each song. So this is song one,

this is song two, song three, and then all the attributes: the temporal attributes, the Echo Nest attributes, the tempo and things like this.

These are all dimensions in which this data can vary. So they can be different in the first dimension, which is the track ID,

but they can also, down here, be different in this dimension.

What that actually means is it has seven hundred different ways, or different attributes, in which it can vary, and you can imagine that, first

of all, this is going to get quite big quite quickly.
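
The instances-down, attributes-across picture can be sketched as a tiny table. The column names and values below are made up for illustration; only the 13,500 by 762 size comes from the talk.

```python
# Toy stand-in for the music table: rows are instances, columns are attributes.
# Column names and values are invented for illustration.
columns = ["track_id", "duration_s", "tempo", "valence"]
rows = [
    [1, 215.0, 120.0, 0.80],   # song one
    [2, 187.5,  92.0, 0.35],   # song two
    [3, 301.2, 140.0, 0.10],   # song three
]

n_instances = len(rows)
n_dimensions = len(columns)    # number of attributes == number of dimensions
print(n_instances, "instances x", n_dimensions, "dimensions")

# The data set in the talk has the same layout, just much larger.
talk_cells = 13_500 * 762
print(f"{talk_cells:,} values in a 13,500 x 762 table")
```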

I mean, seven hundred attributes seems like a lot to me,

right? And depending on what algorithm you're running, it can get quite slow when you're running

on this kind of size of data. And, you know, maybe this is a relatively small data set compared to what Spotify might deal with.

But another way to think about this data is actually as points in this space.

So we have some 700 different attributes that can vary, and when we take a

specific track, it sits somewhere in this space.

So if we were looking at it in just two dimensions,

you know, track one might be over here and track two over here and track three over here, and in three

dimensions track four might be back at the back here. You can imagine, the more dimensions

we add, the further spread out these things are going to get.

But we can still do all the same things we can do in three dimensions in 700 dimensions. It just takes a little bit longer.
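
The claim that points spread out as dimensions are added can be checked with a small experiment. This is a sketch under stated assumptions: the points are uniformly random in a unit cube rather than real tracks, and `mean_pairwise_distance` is a hypothetical helper written for this illustration.

```python
import math
import random

def mean_pairwise_distance(n_points, n_dims, rng):
    """Average Euclidean distance between random points in the unit cube."""
    pts = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(pts[i], pts[j])
            pairs += 1
    return total / pairs

rng = random.Random(0)  # fixed seed so the run is repeatable
results = {d: mean_pairwise_distance(50, d, rng) for d in (2, 3, 700)}
for d, dist in results.items():
    print(f"{d:>3} dimensions: mean distance {dist:.2f}")
```

The same distance calculation works in 2, 3, or 700 dimensions; the only differences are that each distance takes longer to compute and the points end up further apart.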

So one of the problems is that some things like machine learning don't like to have too many dimensions

So things like linear regression can get quite slow if you have tens of thousands of attributes or dimensions

So remember that perhaps the default response of anyone collecting data is: just collect it all and worry about it later.

This is the point when you have to worry about it. What we're trying to do is

remove any redundant variables. If you've got two

attributes of your music, like tempo and valence, that turn out to be exactly the same,

why are we using both and making our problem a little bit harder? Right, now in actual fact Echo Nest's features are pretty good.

They don't tend to correlate that strongly, but you might find, where we've collected some data on a big scale,

a lot of the variables are very, very similar all the time, and you can just remove some of them or combine some of them.
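
One simple way to spot redundant variables of this kind is to measure how strongly pairs of attributes correlate and keep only one of any near-duplicate pair. The sketch below is an illustration rather than the method used in the video: the data is synthetic, the column names are invented, and the 0.95 threshold is an arbitrary assumption.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_redundant(table, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = []
    for name in table:
        if all(abs(pearson(table[name], table[k])) < threshold for k in kept):
            kept.append(name)
    return kept

rng = random.Random(1)  # fixed seed so the run is repeatable
tempo = [rng.uniform(60, 180) for _ in range(200)]
table = {
    "tempo_bpm": tempo,
    "tempo_bps": [t / 60 for t in tempo],           # same information, rescaled
    "valence": [rng.random() for _ in range(200)],  # unrelated to tempo here
}
kept = drop_redundant(table)
print(kept)
```

Here the rescaled tempo column carries no new information, so it is dropped, while the independent valence column is kept. Combining correlated attributes instead of dropping them is the other option mentioned above.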
