Okay, so: artificial intelligence, machine learning, data mining, data analysis,
clustering, classification, data pre-processing,
big data.
It's hard to go anywhere now without hearing about AI and machine learning, and data particularly.
It's everywhere.
Research has suggested that every two years we generate more data than ever existed before,
so the amount of data is doubling every two years. Now, that is an absolutely, you know, astronomical amount of data.
But the thing is, of course, that
this data doesn't necessarily mean anything. You can create tables of data,
but unless you understand what's in them and what they mean, you haven't got any knowledge, right?
So there's a distinction between having data and having knowledge. It's all very well saying, yes, as a species
we're producing a huge amount of data,
but actually a lot of it doesn't get used. A lot of it sits there on a hard disk
waiting for someone to look at it, and that's kind of what we're talking about here. If we want to extract knowledge from data,
we're going to need some tools and processes to do this in a formal way, and that's what data science is, right? And
things like machine learning and AI have a place within it.
So perhaps you do this for your job, and then data analysis is going to be useful for you.
Maybe your company's generating data and you want to analyze this data.
But on the other hand, perhaps you're just a consumer and companies are using data on you. They're generating data on you,
and actually they're profiting from data on you. These are sometimes life-changing decisions that are being made on your data,
and so it's empowering to know how this process works. Here's a very simple example,
which you might even do yourself. Suppose you go online to book some flights for a holiday,
and then you decide that actually two flights via an intermediate airport are cheaper than a single flight, right?
You're doing data analysis.
You're taking lots of different data sources and working out the optimal route, and this of course happens automatically as well,
depending on the flight website that you're using. All right, so this kind of stuff, you're already doing it.
It's just a case of trying to formalize this process. So what do any of the things I listed at the beginning mean?
Well, one problem is that everyone's definitions differ slightly.
But also I think that a lot of these terms are used completely interchangeably. AI is the classic example.
So AI is everywhere, right?
You can't buy a product without it having AI added to it. A lot of the time when you see AI,
we're actually talking about machine learning.
So machine learning is the idea that we're
training a machine to perform a task without explicitly
programming it to do so. A good example of AI that isn't machine learning would be, let's say, a mouse in a maze, where all
you're doing is telling it to turn left or right at random, not learning anything.
It doesn't understand what the maze is, but it will eventually get to the end, right? That's a kind of rudimentary artificial intelligence
that doesn't involve learning anything.
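To make the random mouse concrete, here is a rough sketch in R. The maze layout, the function name `random_mouse` and the step cap are all made up for illustration; the only idea taken from the description above is that the mouse picks a direction at random and never learns anything.

```r
# A random "mouse in a maze": no rules about the maze, no learning.
# The layout is invented for this sketch; 1 = wall, 0 = open cell.
maze <- matrix(c(
  0, 0, 1, 0,
  1, 0, 1, 0,
  0, 0, 0, 0,
  0, 1, 1, 0
), nrow = 4, byrow = TRUE)

# Hypothetical helper: wander until the goal is reached or we give up.
random_mouse <- function(maze, start, goal, max_steps = 100000) {
  pos <- start
  moves <- list(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))  # up, down, left, right
  for (step in seq_len(max_steps)) {
    nxt <- pos + moves[[sample(length(moves), 1)]]     # pick a direction at random
    # Only move if the target cell is inside the grid and not a wall
    inside <- nxt[1] >= 1 && nxt[1] <= nrow(maze) &&
              nxt[2] >= 1 && nxt[2] <= ncol(maze)
    if (inside && maze[nxt[1], nxt[2]] == 0) pos <- nxt
    if (all(pos == goal)) return(step)                 # attempts used to get there
  }
  NA_integer_                                          # gave up
}

set.seed(1)
steps <- random_mouse(maze, start = c(1, 1), goal = c(4, 4))
print(steps)
```

The mouse usually needs far more attempts than the six moves of the shortest route, which is the point: it gets there eventually, but nothing it does on one run makes the next run any better.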
Machine learning is about not giving it
conditions, not saying if you're here turn left, if you're here turn right.
It's just giving it examples and hoping it will learn to perform those tasks itself, right?
So machine learning is a subset of AI, but they shouldn't be used interchangeably. If we're using machine learning, often
what we'll do is train it based on samples of data.
So we'll have some existing data set that we're trying to train on, and we're trying to use the machine learning to either tease out
information or make predictions on this data.
The problem is that not all data is made equal. Some of it's noisy and messy.
Maybe we don't know what it is, and don't know whether we can apply a certain technique to it,
right? And so we need to clean this data up. We need to take this data, understand what it is and extract some knowledge,
so that we can then apply these AI or machine learning techniques to it.
So this combination of things that can take data and prepare it in a way that we can then use it or understand it,
that's data science.
There are quite a few ways we could do this data analysis throughout this course.
We could use R, we could use Python, we could use MATLAB. They all have their pros and cons.
We're gonna use R because it's free and it's really good for statistical analysis.
It's got loads of great libraries.
If you're really familiar with Python, then maybe that's what you want to start with for this kind of stuff.
But, you know, we're going to be working with R.
We have our script area here, where we can write scripts and run scripts. You can save them and then come back to them later.
We have the console, where we're going to be putting in, you know, specific commands.
We have our environment, which is where all our
variables and our data are held, and we can look at them there. And then we have plots. Any plots, and you can do
quite a lot of different plots in R, it's very versatile, are going to appear down here.
Okay, so you've probably got everything you need to get started with data analysis. In my opinion,
the best way to get into R is just to kind of have a go.
So we're going to look at a few of the most obvious things that it does. It has
a little bit of a learning curve, only because its syntax is slightly unusual.
If you can program you'll be fine,
but even if not,
you should get there pretty quickly. Most of the time in R we'll be using either matrices, or vectors,
which are kind of a special case of matrices, or maybe data frames. Data frames are a really nice aspect of R, which you can kind
of think of like a table that you might have in Excel, except you've also got headings for your columns.
So let's have a look at some of these things, and just a few of the things we can do with them, before we perhaps
go into a little bit more detail in other videos.
So for example,
we might look at our variable x, which I've created, and x is a sequence going from 0 all the way up to a few
multiples of pi, which I used to create this plot.
That was only one line of code that produced that, and I've used that to create my plot by essentially saying y equals sin(x)
and then just simply plotting that. If you want to get a little bit more complicated, we can start looking at matrix data.
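Going back to the sine plot for a moment, a minimal sketch of it might look like this. The upper limit of `4 * pi` and the step of `0.1` are assumed values, since the video only says "a few multiples of pi".

```r
# x: a sequence from 0 up to a few multiples of pi.
# The upper limit (4 * pi) and the step (0.1) are assumptions.
x <- seq(0, 4 * pi, by = 0.1)

# sin() is vectorised, so this computes the sine of every element of x
y <- sin(x)

# type = "l" joins the points with a line instead of drawing dots
plot(x, y, type = "l")
```

Because `sin()` works on the whole vector at once, there is no loop to write, which is a large part of why this sort of thing is so short in R.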
So I created a CSV file with a Gaussian function in it, so essentially a two-dimensional array of
values that get bigger in the center. Very straightforward.
The CSV file is essentially a text file with
commas separating those values. It's very easy to read and write these out of Excel and other
packages, and so you'll often find data is passed around in this way, at least
moderately sized data, if it isn't, you know, too huge. I can load this in using my read.csv function.
So I can say namedata.
Now, the arrow operator is essentially the assignment operator in R. A single equals sign will often work as well,
but I tend to try and use this one. So to namedata
I'm going to assign read.csv, and the file is going to be norm.csv.
And I've got no header for this file, so I don't want it to use the top row for the labels.
So I'm going to say header equals
FALSE, and that's loaded in namedata and we can have a look. So I'm gonna click on namedata here, and if we click
on it you can see we've got the rows and the columns of our data in here.
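As a rough sketch of that loading step: the real norm.csv isn't available here, so the example first writes a small stand-in file with a Gaussian bump in it. The 5x5 grid size, the temporary file path and the variable name `normdata` are my own choices, not taken from the video.

```r
# Build a small 2-D Gaussian "bump" so the example has a file to read.
# The 5x5 grid is a stand-in for the real norm.csv.
g <- seq(-2, 2, length.out = 5)
bump <- outer(g, g, function(a, b) exp(-(a^2 + b^2) / 2))

# Write it out as plain comma-separated values with no header row
path <- file.path(tempdir(), "norm.csv")
write.table(bump, path, sep = ",", row.names = FALSE, col.names = FALSE)

# header = FALSE: treat the first row as data, not as column labels
normdata <- read.csv(path, header = FALSE)

dim(normdata)     # number of rows and columns
names(normdata)   # R fills in default column names V1, V2, ...
```

Leaving `header` at its default of `TRUE` would silently turn the first row of numbers into column names, so it is worth setting explicitly whenever a file has no heading row.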
We can look at individual elements in this array, so we can say data at position 3, 4,
right?
And that's going to be the third row down and the fourth value across. We can also leave one empty and just have an entire
row, or
conversely an entire column, like this. And so it's very easy to take ranges of values.
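Those indexing forms can be sketched on a small made-up matrix; the values `1:20` are only there so each result is easy to check by eye.

```r
# A small made-up matrix; R fills it column by column
m <- matrix(1:20, nrow = 4, ncol = 5)

m[3, 4]       # single element: third row, fourth column -> 15
m[3, ]        # leave the column empty: the whole third row
m[, 4]        # leave the row empty: the whole fourth column
m[1:2, 2:3]   # a range: rows 1 to 2, columns 2 to 3
```

Note that R indexes from 1, not 0, and the row always comes first.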
If you've got a huge table of data, selecting certain columns, looking at certain columns, plotting certain columns is all easy.
This is one of the reasons why R is very popular. Quite often when you're looking at data,
we'll actually be looking at something called a data frame. Now a data frame, I've got to load one up, is simply,
in essence, a table of values, but the columns don't all have to be the same type.
So in an array, normally they'll all be floats or they'll all be integers. In a data frame, there can be different things.
So you could have first and last name next to age, for example.
So I've just created a tiny little CSV file with some random people in it. So let's load this up.
So I'm going to say namedata,
assign read.csv,
names dot
CSV, and if I look at namedata, you can see that it's got three columns:
it's got first name, surname and age, and
five rows, so there's five people in this dataset. And then you can index just like I did before, but now we can also index
by the names of these columns. So I could say I want all of the first names, for example, so I can say namedata,
dollar,
first name, and I can see
all the different first names. So you can start to look at this data set in more and more detail. Obviously
this is an absolutely tiny data set, but you get the idea. You could also look at individual instances.
So we could say namedata and I want just the second row, for example: namedata, the second row.
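A minimal sketch of those data frame steps follows. The actual names.csv isn't shown in the video, so the file is recreated here: only Bill Jones, aged 18, in the second row comes from the video, while the other four people and the exact column names (`firstname`, `surname`, `age`) are invented.

```r
# Hypothetical contents for names.csv. Only "Bill Jones, 18" is from the
# video; the other rows and the column names are invented.
path <- file.path(tempdir(), "names.csv")
writeLines(c(
  "firstname,surname,age",
  "Anna,Smith,34",
  "Bill,Jones,18",
  "Cara,Lee,52",
  "Dev,Patel,27",
  "Eve,Brown,41"
), path)

# header = TRUE is the default, so the first row becomes the column names
namedata <- read.csv(path)
str(namedata)          # 5 observations of 3 variables, with mixed types

namedata$firstname     # one whole column, selected by name with $
namedata[2, ]          # one whole row: Bill Jones, 18
```

The `$` form only works because the file has a header row; with `header = FALSE` the columns would be called `V1`, `V2` and `V3` instead.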
There we go, Bill Jones, and he's 18 years old. As we move through these videos,
it's going to be very common for us to load in
datasets like this in this format and then start to process them based on these data frames. So, perhaps an example, right?
So let's imagine you're an online retailer and someone comes into your shop and buys some things, and maybe you're
trying to understand what it is that they do, so that you can, let's say, send them emails to try and get them to buy
more products, or show them recommended products and things like this.
So you want to try and build up a pattern of their behavior, right?
And all you've got is what they click on, what they add to their basket and what they buy, right?
So you've learned that they're looking at these kinds of items, and they look at these ones regularly,
and then sometimes they just buy something completely random, seemingly, and that goes in their basket and gets bought straight away.
Maybe it's a present, right? So maybe it's not tied to them as a person.
So you're taking all of this data, all of these purchases, all of these
products that they're looking at, and you're turning this into a kind of picture of this person, and you're clustering that person in with other
consumers that bought similar things and trying to predict what they want to buy next, right?
And that's when you send them an email saying you should look at this one, because this one's really good and you didn't buy it
last time but you'll definitely want to buy it this time. So we've got some data, and we want to extract some knowledge.
What's the first thing we do?
Well,
we have to start to look at it and try and tease out some kind of information,
right, or analyze this data. Data analysis is the idea of using statistical measures to try and work out what's going on.
This is kind of a cycle. We're going to analyze the data,
so we're going to do a data analysis, and perhaps sometimes just using statistics to analyze the data isn't enough.
You can't really learn everything about it.
Yes, you can learn, you know, mathematically how it works, but you might not understand what it all means.
So visualizing the data can be really helpful. So what we'll also do is we'll visualize the data.
Visualization, so that's going to be charting it, plotting it, trying to work out
trends and
links between different variables and things like this, and these kind of go back and forth.
Right, you could do both of these things numerous times and work out what we've got, right?
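As a small sketch of that back and forth between statistics and plots, here is one pass over some simulated shop data. The two columns (`visits` and `spend`) and every number in them are invented; the functions are standard base R.

```r
# Made-up shop data: visits per month and amount spent per customer
set.seed(42)
visits <- rpois(50, lambda = 8)
spend  <- 5 * visits + rnorm(50, sd = 6)   # assumed link, plus noise
shop   <- data.frame(visits, spend)

# Analysis: statistical measures
summary(shop)                    # min, quartiles, mean and max of each column
cor(shop$visits, shop$spend)     # strength of the linear link between the two

# Visualization: look for trends and links between variables
hist(shop$spend, main = "Spend per customer", xlab = "Spend")
plot(shop$visits, shop$spend, xlab = "Visits", ylab = "Spend")
```

The correlation gives a single number for the link, while the scatter plot shows whether that number hides anything, such as outliers or a curve, which is why the two steps tend to be repeated together.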
So you're gonna do something like this, and then what we're going to do is we're going to pre-process the data.
Often you'll find you're recording much more data than you actually need, right? This is certainly true of an online shop.
I'm going to be looking at a lot of products
that I don't end up buying and I was never really going to buy, you know, maybe a pipe dream, and they've got to sort
of weed out this information to work out what it is that they might actually be able to convince me to buy, right?
So you're going to pre-process the data and remove the nonsense and drill right down to the stuff that's really useful.
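A hedged sketch of what removing the nonsense might look like on a tiny browsing log. The data frame, its column names and the 10-second threshold are all invented for illustration; real pre-processing depends entirely on the data at hand.

```r
# Hypothetical browsing log; columns and values are invented
views <- data.frame(
  product        = c("saw", "drill", "yacht", "saw", NA, "hammer"),
  seconds_viewed = c(120, 45, 2, 120, 30, NA),
  bought         = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

clean <- views[complete.cases(views), ]         # drop rows with missing values
clean <- unique(clean)                          # drop exact duplicate rows
clean <- clean[clean$seconds_viewed >= 10, ]    # drop glances too brief to matter
clean
```

Each line throws information away, so it is worth checking how many rows survive each step; here six rows shrink to two.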
So this is pre-processing, and this is going to be a kind of cycle of analysis and visualization
and
pre-processing, and we can repeat these things, and then we can really drill down and whittle down our data into the most usable sort of
core of knowledge that we can,
and get the most out of it. Now, it may be that just analysing the data is enough, right?
You've now sort of obtained some knowledge.
You kind of understand what the trends are, and maybe that was all you wanted to do. That's sometimes the case.
Maybe actually what we want to do is take things a little bit further.
We're going to use machine learning or modeling to try and model this system and predict what's going to happen next.
So for example, in the case of an online shop,
we might want to start predicting what people are going to buy next, and if we can do that,
that's when we can send out these emails or flag things in their recommended items and get many more sales. As an example,
let's imagine that someone has spent a lot of time looking at DIY tools, right? I've, you know, recently moved house.
I spent a lot of time doing DIY and I'm always trying to buy new tools because it just seems like a good idea.
So, you know, maybe I buy a certain kind of saw, and then, you know, a few months later they're starting to recommend me
a slightly different kind of saw
that serves a slightly different purpose that suddenly I definitely need to be doing, and I think, oh yeah,
maybe I will buy that, and then in the end I have 10 saws and I don't know how to use any of the saws.
But, you know, the retailer's job is done.
So if we want to extract something from this data,
we're going to use machine learning or modeling to model this system and make predictions, right?
So for example, we could cluster the data together. We could link my purchase history with similar people. What are they buying?
Can I be tempted to buy those things as well, right?
Maybe I'm very different from someone else,
and so it's not a good idea to recommend me certain products because I'm unlikely to buy those things.
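One way to sketch "cluster the data together" is k-means, which is my choice of technique here; the video doesn't name a specific algorithm. The shoppers and their two features (`tools` and `books`) are simulated, not real purchase data.

```r
# Two invented groups of shoppers, described by two made-up features:
# how many tool purchases and how many book purchases each one made
set.seed(7)
diy    <- cbind(tools = rnorm(20, mean = 10), books = rnorm(20, mean = 1))
reader <- cbind(tools = rnorm(20, mean = 1),  books = rnorm(20, mean = 10))
shoppers <- rbind(diy, reader)

# k-means with 2 clusters; nstart = 10 tries several random starts
fit <- kmeans(shoppers, centers = 2, nstart = 10)

# Compare the clusters found with the groups we simulated
table(cluster = fit$cluster, simulated = rep(c("diy", "reader"), each = 20))
fit$centers   # the "typical" shopper in each cluster
```

A retailer could then recommend to one shopper what others in the same cluster bought. With real data the groups are rarely this cleanly separated, and the number of clusters has to be chosen rather than known in advance.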
Perhaps to use a different example: in the medical domain,
it's quite common to classify people into kinds of risk categories, right, so that we can maybe use preventative treatments.
So every time I go to a doctor they're going to collect data on me, on
what's currently wrong with me, and what was wrong with me before, and
combine that with, you know, standard data,
like how much exercise someone does and, you know, their family history and
what their stress levels are and things like this.
We can combine all these things to make a prediction as to what they are at risk of in the future.
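A rough sketch of that kind of risk prediction, using logistic regression as one possible classifier (the video doesn't say which method is used). Every patient, variable name and coefficient below is simulated purely for illustration and has no medical meaning.

```r
# Simulated patients; every number here is invented for illustration
set.seed(3)
n <- 200
exercise_hours <- runif(n, 0, 10)   # weekly exercise
stress         <- runif(n, 0, 10)   # self-reported stress score

# Assumed relationship: more stress and less exercise -> higher risk
p <- plogis(0.5 * stress - 0.5 * exercise_hours)
at_risk <- rbinom(n, size = 1, prob = p)
patients <- data.frame(exercise_hours, stress, at_risk)

# Fit a logistic regression on the existing records
model <- glm(at_risk ~ exercise_hours + stress,
             data = patients, family = binomial)

# Predict the probability of being at risk for a new, hypothetical patient
new_patient <- data.frame(exercise_hours = 1, stress = 8)
risk <- predict(model, newdata = new_patient, type = "response")
ifelse(risk > 0.5, "high risk", "low risk")
```

The 0.5 cutoff is arbitrary; in a setting like this the threshold would be chosen with the cost of missing an at-risk patient in mind.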
So, you know, heart disease or something else like this. It could save someone's life
if you spot that they're at risk of a certain thing and you can really advise that person to, you know,
increase their level of exercise or alter their diet. There are two other terms that we come across, you know, a lot, right?
So there's data mining and big data. Right, now,
I'm not really sure what data mining is, because I don't think anyone is. It's a bit of a buzzword.
Really, what data mining is is a combination of pre-processing your data and maybe using clustering to extract some knowledge from it, right?
So it's sort of a word that's come to be used in place of those things, right?
If someone says they're doing data mining, that's what they're doing. They're pre-processing and extracting some knowledge from their data.
It's a nice, it's a cool-sounding word. You're not actually mining anything, right?
You're just doing what everyone else does on data. Big data is the idea that maybe we've collected a lot of examples of
something,
you know, a huge number, or each of our examples is quite complicated and it has a lot of variables, right? In that case
the amount of data we've got is sort of unwieldy, right?
So I would argue perhaps that big data is data that you can't process on your laptop. You might be using cloud compute
infrastructure, or certainly parallel processing in some way, to pre-process and analyze this data.
Right, so exactly where the line is, how big is big, I don't know, but exactly where we draw the line, in some ways
it's not really important, right? The idea is just that
with the amount of data we as a species are now producing, more and more of our data is becoming big data.
But, you know, exactly where the cutoff is doesn't really matter.
What is data, right? I'm pretty sure that's data.
Right, is this data? This picture, or that data?
Is this data? What is data?