DENNIS HUO: Hi, I'm Dennis Huo.
And I just want to show you how simple, smooth, and effective running open source big data software can be on Google Cloud Platform.
Now, let's just begin by considering
what we might expect out of any modern programming language.
It seems it's not quite enough to be Turing complete.
But we probably at least expect some standard libraries for common things like, say, sorting a one-megabyte list in memory.
Well, nowadays, it seems like everybody
has a lot more data lying around.
So what if it could be just as easy to sort a terabyte of data
using hundreds or even thousands of machines?
More recently, programming languages
have had to incorporate much stronger support for things
like threads and synchronization as we all moved on
to multi-processor, multi-core systems.
Nowadays, it's pretty easy to distribute work
onto hundreds of threads.
What if it could be just as easy to distribute work
onto hundreds of machines?
Nowadays, it's also pretty common
to find some fairly sophisticated packages
for all sorts of things like cryptography, number
theory, and graphics in our standard libraries.
What if we could also have state-of-the-art distributed machine learning packages as part of our standard libraries?
Well, if you've been watching the code snippets over here
the last few slides, you may have
noticed that none of what I just said is hypothetical.
It's all already possible right now, today.
Which is fortunate because just as we spent the last decade
finally getting the hang of programming
for multi-threaded systems on a single machine, nowadays
it seems more and more common to find yourself developing
for a massively distributed system of hundreds of machines.
And with these changes in the landscape,
it's pretty easy to imagine that big data could be a perfect match for the cloud.
All the key strengths of the cloud, like on-demand provisioning,
seamless scaling, the separation of storage from computation,
and of course, just having such easy access
to such a wide variety of services and tools.
These all contribute to opening up
a whole new universe of possibilities
with countless ways to assemble these building
blocks into something new and unique.
And through this shared evolution
of big data and distributed computing,
big data has become a very core ingredient of innovation
for companies big and small, old and new alike.
Take, for example, how streaming media companies like Netflix,
Pandora, and Last.fm have contributed
to changing the way that consumers
expect content to be presented.
They've applied large scale distributed machine learning
techniques to power these personalized recommendation
engines.
Millions of users around the world
now expect tailor-made entertainment as the norm.
As another example, just take a look at Google's roots in web search.
Alongside a multitude of other search engines, news aggregators, social media, and e-commerce sites, we've all long
grappled with applying big data technologies just
to tackle the sheer magnitude of ever-growing web content.
Users now expect to find a needle
in a haystack hundreds of thousands of times,
every second of every day.
And amidst this rise of big data,
it's not that the data or the tools
have suddenly become magical.
All these innovations ultimately come from developers
just like you, always finding new ways
to put it all together so that big data can still
mean something completely different to each person
or entrepreneur.
On Google Cloud Platform, through a combination
of cloud services and a wealth of open source technologies
like Apache Hadoop and Apache Spark,
you'll have everything you need to focus
on making a real impact instead of having to worry about all
the grungy little details just to get started.
Now for an idea of what this all might look like,
let's follow a data set in the cloud on its journey
through a series of processing and analytic steps.
And we'll watch as it undergoes its transformation
from raw data into crucial insights.
Now what I have here is a set of CSV files from the Centers for Disease Control containing summary birth data for the United States between 1969 and 2008.
These files are just sitting here in Google Cloud Storage.
And, as you can see, it's pretty easy to peek here and there
just using gsutil cat.
Now suppose we want to answer some questions by analyzing
this data like, for example, whether there's
a correlation between cigarette usage and birth weight.
At more than 100 million rows, it
turns out we have at least 100 times too much data
to fit into any normal spreadsheet program.
So that means it's time to bring out the heavy machinery.
With our command-line tool, bdutil, and just these two simple commands, you can have your very own 100-VM cluster fully loaded up with Apache Hadoop, Spark, and Shark, ready to go just five minutes or so after kicking it off.
Once it's done, bdutil will print out a command for SSHing into your cluster.
Once we're logged into the cluster,
one easy way to get started here is just
to spin up a Shark shell.
We just type shark and this prompt will appear.
Now, what we'll be doing here is creating
what's known as an external table.
So we'll just provide this location parameter to Shark
and point it at our existing files in Google Cloud Storage.
Shark will go in and list all the files
that match that location.
And we can immediately start querying those files in place,
treating them just like a SQL database.
For example, once we have this table loaded,
we can select individual columns and sort on other columns
to take a peek at the data.
We can also use some handy built-in functions
to calculate some basic statistics
like averages and correlations.
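As a rough aside, the same kind of correlation could also be computed from Scala with MLlib's Statistics.corr rather than through Shark's SQL functions. This isn't what's shown in the demo, the gs:// path and column positions are made up for illustration, and Statistics.corr assumes a somewhat newer MLlib release than the one on this cluster:

```scala
import org.apache.spark.mllib.stat.Statistics

// Hypothetical path and column positions -- the real CDC file layout
// isn't shown in the video.
val rows = sc.textFile("gs://my-bucket/natality/*.csv")
  .map(_.split(','))
  .filter(cols => cols.length > 20 && cols(10).nonEmpty && cols(20).nonEmpty)

val cigarettes  = rows.map(_(10).toDouble)  // hypothetical cigarettes-per-day column
val birthWeight = rows.map(_(20).toDouble)  // hypothetical birth-weight column

// Pearson correlation computed across the whole distributed data set.
println(Statistics.corr(cigarettes, birthWeight, "pearson"))
```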
Now one thing I've noticed about this kind of manual analytics,
is that while it's a great way to answer some basic questions,
it's an even better way to discover new questions.
For instance, over here we've found a possible correlation
between cigarette usage and lower birth weights.
So could we, perhaps, apply what we
found to build a prediction engine for underweight births?
And what about other factors like the parent's age
or alcohol consumption?
Now, this kind of chain reaction of answers leading
to new questions is part of the power of big data analytics.
With the right tools at hand, then every step along the way
you tend to find new paths and new possibilities.
In our case, let's go ahead and try out
that idea of building a prediction
engine for underweight births.
Now, to do this we're going to need some distributed
machine learning tools.
Traditionally, building a prediction model on a 100-node cluster might have meant years of study and maybe even getting a Ph.D. Luckily for us,
through the combined efforts of the open source community
and, in this case, especially UC Berkeley's AMPLab,
Apache Spark already comes pre-equipped with a state-of-the-art distributed machine learning library called MLlib.
So our Spark cluster already has everything
we need to get started building our prediction engine.
The first thing we'll need to do here
is just to extract some of these columns
as numerical feature vectors to use as part of our model.
And there's a few ways to do this.
But since we already have a Shark prompt open,
we'll just go ahead and use a select statement
to create our new data set.
We'll start out with a rough rule of thumb here, just
defining underweight as being anything less than 5.5 pounds.
And we can just select a few of these other columns
that we want to use as part of our feature vectors.
We'll use this WHERE clause to limit and sanitize our data.
And we can also just chop off the data from the year 2008 for now and put it in a separate location, so that we can use it as our test data, separate from our training data.
Now, since we're doing all this as a single CREATE TABLE AS SELECT statement, all we have to do
is provide a location parameter and tell Shark
where to put these new files once it's created them.
We can also go ahead and run that second query on the data
from 2008, so that we'll have our test
data available in a separate location, ready to go later.
Sure enough, after running these queries,
we'll find a bunch of new files have appeared in Google Cloud
Storage, which we can look at just using gsutil or Hadoop FS,
for example.
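As a side note, if you'd rather skip the Shark SQL step entirely, roughly the same labeling and 2008 split could be sketched directly against the raw CSVs with Spark's Scala API. This isn't what's done in the video, and the gs:// paths and column positions below are made up for illustration:

```scala
// Rough Scala alternative to the Shark CREATE TABLE AS SELECT step above.
// Column positions and paths are hypothetical.
val raw = sc.textFile("gs://my-bucket/natality/*.csv").map(_.split(','))

def toRow(cols: Array[String]): Option[(Int, String)] =
  scala.util.Try {
    val year   = cols(0).toInt      // hypothetical year column
    val weight = cols(20).toDouble  // hypothetical birth weight in pounds
    val label  = if (weight < 5.5) 1.0 else 0.0  // the rough "underweight" rule of thumb
    // Hypothetical mother's age, alcohol use, and cigarette use columns.
    val features = Array(cols(5), cols(6), cols(10)).map(_.toDouble)
    (year, (label +: features).mkString(","))
  }.toOption  // rows that fail to parse are dropped, like the sanitizing WHERE clause

val rows = raw.flatMap(toRow).cache()
rows.filter(_._1 < 2008).map(_._2).saveAsTextFile("gs://my-bucket/natality-train")
rows.filter(_._1 == 2008).map(_._2).saveAsTextFile("gs://my-bucket/natality-test")
```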
Now, with all our data already prepared, all we have to do
is spin up a Spark prompt so that we have access to MLlib.
It just so happens we can pretty much copy and paste
the entire MLlib Getting Started example for support vector
machines, here.
And we'll just make a few minor modifications to point it at our training data here, and plug that into Spark.
Spark will go ahead and kick off the distributed job, iterating over the data set 600 times to train a brand-new SVM model.
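For reference, here's a minimal sketch of what that adapted MLlib example might look like, assuming a Spark 1.x-style MLlib API, training files laid out as label-first comma-separated rows, and a made-up gs:// path:

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Parse each comma-separated row: the first field is the 0/1 underweight label,
// and the rest are the numeric features we selected earlier.
val training = sc.textFile("gs://my-bucket/natality-train/*")
  .map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
  }
  .cache()

// 600 iterations of stochastic gradient descent, as in the narration.
val model = SVMWithSGD.train(training, 600)
```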
Now, that will take a couple minutes.
And once that's done, we'll have a fully trained model
ready to go to make predictions.
Here, for example, we can just plug in a few underweight and non-underweight examples from our real data set and check the model's predictions.
We can also go in and load that separate data from 2008
that we saved separately, and rerun our model against it
to make sure that our error fraction is still comparable.
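Continuing the sketch above, with the same assumptions and the same hypothetical paths, the 2008 hold-out check might look roughly like this:

```scala
// Reload the held-out 2008 rows with the same parsing as the training data.
val test = sc.textFile("gs://my-bucket/natality-test/*")
  .map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
  }

// Spot-check one prediction, then compute the overall error fraction.
println(model.predict(test.first().features))

val errorFraction = test
  .map(p => (model.predict(p.features), p.label))
  .filter { case (predicted, actual) => predicted != actual }
  .count().toDouble / test.count()
println(s"error fraction on 2008 data: $errorFraction")
```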
Now, taking a look at everything we've done here,
I'll bet we've raised more questions than we've answered,
and probably planted more new ideas
than we've actually implemented.
For instance, we could try to extend this model
to predict other measures of health.
Or maybe we can apply the same principles
but to something other than obstetrics.
We could also explore a whole wealth
of other possible machine learning tools
coming out of MLlib.
Indeed, as we dive deeper into any problem
we'll usually find that the possibilities are pretty much
limitless.
Now, sadly, since we don't quite have
limitless time in this video session,
we'll have to come to an end of this leg of the journey.
Now everything you've seen here is only a tiny peek
into the ever ongoing voyage of data
through ever-growing stacks of big data analytics
technologies.
Hopefully, with the help of Google Cloud Platform,
your discoveries can reach farther and spread faster
by always having the right tool for the right job,
no matter what you might come across.
Thanks for tuning in.
I'm Dennis Huo.
And if you want to find out more,
come visit us at developers.google.com/hadoop.