LEE FLEMING: Good evening.
I am really pleased to welcome you all to "Leaders in Big
Data" hosted by Google and the Fung Institute of Engineering
Leadership at UC Berkeley.
I'm Lee Fleming.
I'm director of the Institute, and this is Ikhlaq Sidhu,
chief scientist and co-founder.
The first and most important thing is to thank Google for
hosting the event.
So thank you very, very much.
There are a couple of people in particular: Irena Coffman--
thank you-- and also Arnav Anant, our entrepreneur in
residence at the Fung Institute.
So here's Arnav.
AUDIENCE: A lot of work.
LEE FLEMING: Huge amount of work.
The Fung Institute-- we were founded about two years ago.
And the intent is to do research and pedagogical
development in topics of engineering leadership.
We have our degree, the Master's of Engineering--
professional Master's of Engineering M. Eng. program--
mainly around the Institute.
We also have ties though across the campus, as you'll see.
This is our intent to have a series of talks on topics of
interest to engineering leaders.
As it turns out, this Wednesday we
have our next talk.
It's sponsored by [? Thai ?] and the Fung Institute.
And the topic is entrepreneurship--
being an entrepreneur within your firm.
And fittingly, we have representatives from Google,
and Cisco, and SAP.
Consult the Fung website or the [? Thai ?] website for
details on that.
So besides enjoying a good discussion tonight, we have an
ulterior motive, as you can probably tell.
We're trying to advertise all of our fantastic programs in
big data at Cal.
Now, whether you're interested in computation, or inference,
or application, or some combination of those things,
we've got the right program for you.
As I mentioned, the professional Masters of
Engineering, or M. Eng., across all the different
one year degree.
We have another one-year degree in the stats
department-- a professional degree.
There's a two-year degree in the Information School.
And finally, there's the Haas MBA.
Tonight we've got people from all these programs.
You can find their tables, ask them questions, and hopefully
we'll see you at Cal soon.
And we also have additional executive and other programs
associated with each of those departments
and schools as well.
Ikhlaq will now introduce our speakers.
IKHLAQ SIDHU: OK, thanks.
So let me see.
LEE FLEMING: Just slide this here.
IKHLAQ SIDHU: All right.
Welcome, I want to also thank a couple of people.
One is [? Claus Nickoli ?], who is not here at the moment,
but to you in the ether, he's just not at the meeting.
But he's our host here, and so thank you.
You guys can tell him that I thanked him.
And also, many of you I've seen here are basically
friends, and so thanks for coming.
It's good to see you again.
This is an event on big data.
And so I'm going to give you a little data on
who is speaking today--
who is here.
And the way I think of this is, what we've got is three
perspectives of big data from leading firms--
from people who represent leading firms in the area.
And so let's start with NetApp.
We've got Gustav Horn.
He is a senior consulting engineer with 25 years of experience.
And he's built some of the largest enterprise-class
Hadoop systems in the world-- on the planet.
And from Google, Theodore Vassilakis, and he's a
principal engineer at Google.
He's head of the team that works on data analytics.
And he's been responsible for numerous contributions to
Google in terms [? about ?] search, and the visualization
and representation of the results.
And from VMware, Charles Fan, who's senior VP of strategic
R&D. He co-founded Rainfinity and was CTO of the company
prior to its acquisition by EMC in 2005.
And our distinguished set of speakers is moderated by our
distinguished moderator, Hal Varian.
He is chief economist here at Google.
He's an emeritus professor at UC Berkeley and the founding
dean of the School of Information.
So with that, there's hardly anything more I
could possibly say.
Come on up Hal and take it away.
HAL VARIAN: Thank you.
I'm very impressed with the turnout tonight, seeing as
you're missing both the debate and the baseball game.
But at least it eliminates a difficult
choice for many people.
I will say that I'm going to follow the same rules as the
So no kicking, biting, scratching, or bean balls are
allowed during this performance.
We're going to talk about foreign policy, wasn't that
In any event, what I thought we'd do is, we'd have
each person talk for about five minutes, lay out their
theme, where they're coming from, what their perspective
is on big data.
And I will take some notes, and then ask some questions,
get a conversation going.
And I think we'll have a little time at the end for
some questions from the floor.
So, take it away.
THEO VASSILAKIS: Sure.
So, should I start, Hal?
HAL VARIAN: Yes.
THEO VASSILAKIS: All right.
Well, hey it's a real pleasure to be here.
Thank you guys also, and thank you guys for coming.
It's a huge, huge audience.
Just a couple of words.
As you heard, my name is Theo.
I lead some of our analytical systems.
So I'm responsible--
well, actually up until two weeks ago, I was responsible
for a stack that had parallel data warehousing components,
query engines, pieces like Dremel, and Tenzing systems
that let you query this data, and
visualization layers on top.
And that's one of the many, many systems at Google that I
think, outside, one would think of as
big-data type of systems.
And so I'll try to give you my perspective at least on the
Google view of big data.
And hopefully someone will cut me off when it's time.
I think I'll probably go for five minutes.
This could take a while.
THEO VASSILAKIS: All right, sounds good.
I think, as you guys know, Google's business is primarily
about taking data and organizing the world's
information, and making it universally
accessible and useful.
So a lot of what the company does is really about sucking
in data-- whether it be the web, whether it be the imagery
from Street View, or satellite imagery, or maps information,
or Android pings, or you name it.
And then transforming it into usable forms.
So really, Google is kind of a big data
machine in some sense.
And I think the term big data came into
currency relatively recently.
And we all said, yeah, OK, that speaks to what we do.
Because we don't really have a word for it.
We just kind of knew that the data was large.
But just to try to put maybe more structure on to that, I
think the Google view on a lot of "what is big data
processing" kind of splits up into probably what I would
call ingestion type of processes--
things like the crawlers, things like all those Street
View cars running through all the streets of the world.
And then goes into transaction processing systems, where
perhaps we capture data through interactions on a lot
of our web properties, or a lot of the web properties that
we partner with.
This means people clicking on search, or people interacting
with docs, or people interacting with maps.
All generate many, many clicks and many, many interactions
that then become transactional big data.
Of course, that also includes people using let's say Google
Analytics on their sites to measure traffic on their
properties, which then generates huge volumes of
pings into Google--
many tens of thousands of QPS of pings.
So that's kind of the second big component.
And then probably the third component is the processing
side of all of that.
The processing side includes things like MapReduce,
analysis, generating insights from that data--
maybe in the form of building machine learning models.
Maybe in the form of building, for example, Zeitgeist top
queries that can then be served out to the world to
say, hey here is what people are searching for.
Maybe in the form of n-grams of all the books that Google
scanned over many, many years of its ingestion processes.
But it's really baking all of that information and then
presenting it in some usable form, either through a system
such as our ad system that takes models and decides what
ads to show, or in a more direct
form such as the n-grams.
Just to say, OK, here are those three broad classes--
ingestion, transaction processing, and analytical
To dig a little bit deeper into each of those areas, I
would say the ingestion processes, especially the very
large scale ingestion processes, are
highly custom systems.
If you think about our web crawlers, if you think about
the Street View cars, if you think about maps stitching, or
satellite imagery stitching--
those are very, very custom processes that I think, at
least to this date, don't have a clear analog
in the general industry.
And maybe this is something that you guys might address or
might see differently than how I see the version.
They're still highly-specialized systems
that produce very large images.
And they're very high performance, very complex
systems that are run by dedicated engineering teams.
The transaction processing systems or the storage systems
are things like the Google File System.
These are things like Bigtable.
These are things like Megastore.
Those are the ones that we've actually published papers
about and that are now reasonably well
known in the industry--
have evolved a little bit past the purely custom stage, where
they're fairly general purpose.
And there was a time at Google where actually most people did
their own storage in some form or another, until these
GFS-like systems evolved to the point where they were good
enough that more than one team could use them.
And actually, that evolution had many steps in which, for
example, everybody ran their own GFS.
And so maybe the ads team had their own GFS cells, and the
search team maybe had their own GFS cells.
And in time, the systems matured to the point where
actually we could have a centrally-managed file system.
And I think recently you may have seen, we've now talked
about this global file system called Spanner which takes
that to yet another level of transactions and global
And then the third step, which is I think still in a
relatively immature stage compared to some of the
storage systems, is the analysis.
And I think a lot of people know about MapReduce and some
of the systems that have been built on top of that.
So for example, Flume is the way of chaining MapReduces in
a more programmer-friendly way so that you don't end up with
50 MapReduce stages that are individually managed.
But rather, you end up with one program that can then be
pushed down into many MapReduces that are
The process there is still very engineering focused and
essentially requires engineering teams to process
this large data.
And so I think what we're seeing in that area is the
same maturation that we saw in the storage and transaction
Where little by little, systems such as Dremel, such
such as many others inside of Google that we haven't talked
are aggregating a lot of that usage, and saying hey, we
really should do it in a much simpler manner.
And not really require people to have a full engineering
team to get the value out of all that big data.
Because at the end of the day, that's what
Google wants as a whole.
And that's what Google's customers want as a whole.
How do we get the value out of those big pieces of
I would just leave you with those three big pieces.
And also this idea that, this is evolving into a
higher-level service that people can use without
necessarily being very, very
low-level engineering oriented.
And that more and more value is being derived out of that,
and hopefully something that you're seeing in Google's
properties and Google's services.
I don't know how much I'm over, but I
can hand over here.
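The idea Theo describes, writing one program that is pushed down into several MapReduce stages rather than managing each stage by hand, can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not Google's Flume or MapReduce API; the helper names `map_reduce` and `pipeline` and the sample queries are assumptions made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce stage: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    # Sort by key so the output order is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


def pipeline(records, *stages):
    """Chain (mapper, reducer) stages; each stage consumes the previous output."""
    for mapper, reducer in stages:
        records = map_reduce(records, mapper, reducer)
    return records


# Stage 1: count how often each query appears.
def count_map(query):
    yield query, 1


def count_reduce(query, ones):
    return query, sum(ones)


# Stage 2: group queries by how often they appeared.
def by_count_map(pair):
    query, count = pair
    yield count, query


def by_count_reduce(count, queries):
    return count, sorted(queries)


# Hypothetical sample input.
queries = ["maps", "news", "maps", "weather", "news", "maps"]
result = pipeline(
    queries,
    (count_map, count_reduce),
    (by_count_map, by_count_reduce),
)
print(result)
```

The caller only describes the two stages; the `pipeline` helper decides how they run, which is the programmer-friendly property the chaining approach is after.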
GUSTAV HORN: I'm Gus Horn, and thanks again for everybody for
I know it's a big baseball night and you probably want us
to get done quick.
I come to it from a different approach in a sense and feel,
because Theo has--
Google has-- really been at the forefront of big data, big
data analytics, and in
particular Hadoop and MapReduce.
So I'm not going to go on the premise that everybody in this
room understands what MapReduce is, or what big data
is, and what data scientists are.
These are all buzz words that are really evolving.
I think what I found in my travels globally is that we're
really at the forefront right now of big data analytics.
I have a presentation that really characterizes it more
like a tsunami of data.
It's relentless, and it's coming at us.
It's coming at us from our Android
phones, from our iPhones.
It's coming at us from cameras that are everywhere, from our
TiVo boxes, from our PVR boxes, from everything we do
and touch in our world today.
We're generating data.
And the question is, do we let the data fall on the floor
and do nothing with it-- or are we going to pick that data
up and actually do intelligent things with it?
And we're finding more and more commercial applications.
Google I look at from a pragmatic perspective.
It's a commercial entity, but they are having a much more
philanthropic and broad approach to the world as well.
It was great back in 2003 that they defined GFS and gave us
MapReduce, which brought us back to the
mainframe days of old IBM.
But this is basically what it feels like to me, right?
Because it's batch-oriented processing at that time, when
we're talking MapReduce jobs.
But basically that was the genesis or the beginning of
what we call Hadoop as we know it--
the Facebooks, the Yahoos, the LinkedIn--
all of these companies that are embracing this technology.
But now we look at companies like Progressive Insurance,
where they're giving you these dongles to plug into your car.
They're generating data.
They're collecting data on your habits,
your driving habits.
The health care industry is looking at how often do you
see the doctor, what are your statistics?
I was at the Mayo Clinic recently, and they have a
human genome initiative where they are looking at all of
And they're actually doing a full genetic map all of their
And they're following these people for their entire life
And they want to keep their data 25 years, post mortem.
They want to build a repository where they can
understand exactly how does that one genetic mutation
affect your propensity to be carrying a disease.
Because they recognize that diseases
aren't just on or off.
There can't just be one mutation that
gives you that problem.
It's your environment, the mutation.
And that builds a susceptibility.
They're trying to really paint a huge picture, and that's a
big data problem.
So I see big data problems from health care.
I see big data problems in consumer-related industries,
whether they be the Walmarts, the Targets.
And not everybody is trying to be evil about this.
If you think about Target or Walmart, they would much
rather show you an advertisement that you care
about than to bore you to tears with something that
Just as Google doesn't want you to see a pop-up ad for
baby diapers if you're 60 years old and you're not going
to have a baby.
It doesn't do them any good, it doesn't do you any good.
There are a lot of positive things to take away from a lot
of this big data, and there's some negative things, too.
I'll focus on the positive in that I look at what companies
like the auto manufacturers in Europe are doing.
You look at BMW.
All of these cars are data-generating monsters.
And nowadays, you don't even know when you have to go for
an oil change, because they're predictively analyzing the
fluids in that car.
And they're determining when is it time for you to get that
It's not like, oh, I have to do it every 4,000 miles.
Your car tells you when you need to get it done because of
viscosity changes and because of analytical testing.
And they're collecting all of this data.
I think we're very lucky that we are at this forefront.
And I think that big data-- big data scientists--
are going to become more and more important.
And I think that, as Theo said, that it's going to get
to the point where, you don't have to become a
MapReduce job expert.
You really need to become a logical thinker and be able to
articulate the questions you're asking against a data
set, where you don't even care where the data came from.
You just know that all the data is in there.
And that's the key-- is to have a repository that's able
to hold all the data, and be able to allow for this kind of
processing to take place on that data, and produce results
in a timely fashion.
And what I've done is, I'm approaching it from more of a
corporate perspective, where people are looking at
enterprise-class systems, versus what we call white box
or dirt cheap.
And there are different kinds of cut-offs for companies.
And I think as you go through your process at UC Berkeley,
and you're learning about where you want to go, you'll
see that you have to pick and choose your battles when it
comes to big data.
And the battle you have to choose is, am I going to be
setting up my data centers and my infrastructure to support
commodity-based platforms, and this-- do I want to own all
the data internally?
Do I want to virtualize the data in the cloud?
At what point do I bring that data internally?
Do I want to use services from Google?
They're all inflection points that you are going to be
making decisions over the next five years to
decide how to do that.
And this is what I'm dealing with all the time.
I think, hopefully, we all learn a lot from this
CHARLES FAN: Thank you for coming.
My name is Charles.
And unlike the presidential debate, I agree with
what they just said.
Big data is like an elephant.
We were told we are allowed to touch this elephant from
different angles, from different perspectives.
But before that, I'll just try to repeat what Theo and Gus
First, I think Internet is pretty big in terms of its
impact to our lives.
And not only to our lives, but also to enterprise IT.
And I think what we have seen in the last 20 years has been
the repeated tidal waves caused by the Internet
and the leaders in the Internet
space, including Google.
The advances they are making, and how those are hitting the
And I think big data is the latest such tidal wave.
Essentially, the scale of data that the Internet
providers are dealing with, with consumers, the
enterprises are now facing as well.
And now the challenge is, how do we adopt and massage this
technology so it's consumable by the various people inside
the enterprise worlds.
And that's what's behind the big data world we see.
And I think, like what Gus said.
Enterprises are working in different sectors.
There are people doing retailing.
There are people doing manufacturing.
There are people in health care.
There are people doing financial trading.
In almost every field, they are generating
more and more data.
And almost every field has many questions they need to
ask based on those data.
And they need to make decisions based on those data.
And unlike the DWBI world, which has been around also for
20 years, the amount of data, the variety of data, and the
speed of data coming at you are going beyond what the existing
infrastructure can take.
And that's why to answer these different questions in
different verticals, everybody is seeing a need for new
infrastructure, a new database, a new storage to be
created to support the decision making based on all
What's different in those data, besides just the size or
the volume of it?
When people typically refer to big data, they call it the
"three Vs," which are volume, velocity, and variety of data.
Some of them call it the "four Ss." It's the source--
there are more data sources--
the size, the speed, and structure of data that are
And I have another name for it, which is probably less
elegant, but also I think it's pretty true.
When we look at the old data, the small data, or the classic
data, they're typically record-based data, especially
those generated by transactional applications.
They usually have people generate it.
And they go through the whole life cycle.
So we typically call them CRUD data that you need to create,
read, update, and delete.
I'm sure all of you Berkeley students know
the CRUD data word.
You manage on the storage front.
You also have database design for it.
But with the new data, more and more of
them are machine generated.
We just have more and more devices that's connected to
Not all of them have a warm body sitting behind them.
There're both servers, as well as sensors, RFID, mobile
devices, cameras, and so on.
And they're all generating-- Google cars-- they're all
generating tons and tons of data, without people sitting
But you still need to create them, but you don't update
them that much.
Those are usually write once and read many type of data.
So there's not much update.
And there's not much delete.
You need to retain data 25 years after people die.
And even after 20 years--
25 years-- people don't remember to delete them.
So there's not much delete, not much update.
There is a lot of replication.
So instead of CRUD, now it's like
create, replicate, append.
There's more and more append.
All the data is in append-only mode.
And there's a constant need to process it in real time,
during ingestion, or interactively.
So it's just CRAP data, is what big data is:
create, replicate, append, process.
And when we are talking about structured data versus
unstructured data, we say there are more and more data
that are unstructured than structured.
I think it's just because the database technology or the
underlying technology is not scalable enough to put them in
a schema or in some kind of structure.
That's why they are all CRAP.
But you still need to process them in a more efficient way.
And that causes a lot of your challenges.
I think essentially whoever designs the new data
management system for CRAP and makes them consumable by
enterprises, is going to be the winner of
this big data race.
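The contrast Charles draws between CRUD data and "create, replicate, append, process" data can be illustrated with a toy append-only store. This is only a sketch under stated assumptions: the `AppendOnlyLog` class, its method names, and the sensor readings are hypothetical and do not correspond to any product mentioned by the panel.

```python
class AppendOnlyLog:
    """Toy store offering create, replicate, append, and process.

    There is deliberately no update or delete method: records are
    written once and read many times.
    """

    def __init__(self):
        self._records = []   # primary copy
        self._replicas = []  # in-memory stand-ins for remote replicas

    def add_replica(self):
        """Replicate: create a copy that receives every later append."""
        replica = list(self._records)
        self._replicas.append(replica)
        return replica

    def append(self, record):
        """Append: add a record to the primary copy and to all replicas."""
        self._records.append(record)
        for replica in self._replicas:
            replica.append(record)

    def process(self, fn, initial):
        """Process: fold a function over every record in arrival order."""
        acc = initial
        for record in self._records:
            acc = fn(acc, record)
        return acc


# Hypothetical machine-generated readings: (device, temperature).
log = AppendOnlyLog()
replica = log.add_replica()
for reading in [("sensor-1", 20.5), ("sensor-2", 19.0), ("sensor-1", 21.5)]:
    log.append(reading)

total = log.process(lambda acc, r: acc + r[1], 0.0)
print(total, len(replica))
```

Because nothing is ever updated in place, replicas stay consistent simply by receiving the same appends, which is part of why this access pattern scales differently from transactional CRUD data.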
GUSTAV HORN: So Google invented the new crapper?
HAL VARIAN: Yes, OK, thank you for starting us out on such
I wanted to follow up on your little troika there
with the ingestion, transaction, and analytical.
I come at the end of that food chain.
So what we get is, the data's been pulled in, the data is
available to us, and we're working on
the analytical side.
I want to say a few words about that.
When we have these analytical systems at Google, one of the
things you can do is just monitor the system and make
sure everything's running the way we expect it to.
And these guys have done a fantastic job, because now you
can take almost anything that's gathering data at
Google and create a dashboard with about 20 minutes of work,
which is a fantastic thing for running the business.
The other thing you can do is, you can build the machine-learning
models that he alluded to and engage in this kind of
That's very in-vogue these days and it's a
great thing to do.
But the thing that a lot of people miss, I think, is you
can use that data to conduct experiments.
And that's really the secret sauce at Google.
Our leader of the search team, Amit Singhal, said that a
couple years ago, we did over 5,000 experiments with the
search algorithm-- made 400 changes.
On the ad side, we're running roughly 500
experiments at any one time.
Any time you're logged into Google-- or any time you're
accessing Google, I should say--
you're probably in a dozen or more experiments.
And having the capability to manage that data, not just
for the current incarnation of the system, but all the
variations you might contemplate, is really a
fantastic help in moving the whole system forward.
So that experimentation role is very, very
important at Google.
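One common way to let a single visitor take part in many experiments at once is to hash an identifier separately for each experiment. The sketch below illustrates that general technique only; it is not a description of how Google assigns traffic, and the experiment names, traffic fractions, and cookie IDs are invented for illustration.

```python
import hashlib


def in_experiment(cookie_id, experiment, fraction):
    """Return True if this cookie falls into the experiment's traffic slice.

    Hashing the experiment name together with the cookie makes the
    assignment stable per visitor and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{cookie_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < fraction


# Hypothetical experiments and the share of traffic each one receives.
experiments = {"ranking-tweak": 0.5, "snippet-length": 0.5, "ad-layout": 0.1}


def active_experiments(cookie_id):
    """List every experiment this cookie is currently enrolled in."""
    return [
        name
        for name, fraction in experiments.items()
        if in_experiment(cookie_id, name, fraction)
    ]


print(active_experiments("cookie-42"))
```

Since the assignment is a pure function of the identifier and the experiment name, results logged years apart can still be attributed to the right experiment arm, provided the names and fractions are recorded alongside the data.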
I wanted to raise a question of standards and
You mentioned Hadoop.
That's really become an industry standard.
Here at Google, we have our own internal stack.
It's a lot easier to enforce these standards for
interoperability internally, than industry wide.
But to make this system work--
of starting with ingestion and transactions,
and then the analysis--
outside of Google, or outside of other big data companies,
you've got to have this kind of standards to interconnect
the flow of data.
And Charles, why don't you say a few things about what's
going on in that area.
CHARLES FAN: I do think we are at the early
stage of this industry.
And right now there are no standards, per se, to my
knowledge, that have emerged.
Hadoop has been a very popular technology that's born out of
the open source community's effort to--
based on the Google papers-- to create the MapReduce and
the GFS, as well as the other things they
built on top of it.
And I think, in lieu of standards, my perspective is,
open source plays a huge role here.
In terms of overall data management, as I mentioned, we
are going from a world where everything is relational--
you basically have your relational data model, which
is the standard across all--
SQL being the standard query language--
to a more chaotic world, where there are many kinds of
data stores, many kinds of queries.
Even on Hadoop, there are various ways you can
query on top of it.
And open source really gives people the choice.
In this chaotic period, it is the choice.
It's basically the developers and users who are going to
decide which will become the standard.
And open source really provides this way to make it happen.
GUSTAV HORN: I just want to make one comment.
I think open source actually is the best way to make sure
that you don't get yourself pigeonholed into anything
And I think that with Hadoop and big data, as I look five
or ten years down the road, I think that standards aren't
going to provide structure.
It'll be more of an inhibitor than it's going to be of a
benefit in this area.
I think one of the key attributes--
and I think you can maybe talk more about that-- is the fact
that you want to be able to connect or stitch together a
bunch of disparate data sets.
You want to be able to look at things where you don't have to
be rigidly defined from the standard.
You want to be able to look at strange queries where weather
patterns, and people's buying habits, and the cars they
drive have some correlation.
And if you start imposing standards on top of something
that is that robust, I think it's going to probably stifle
So I think the key here is open source.
The key is to have published innovations so that people are
publishing their works.
And I think as we get better and better at natural-language
processing and being able to get away from having to be
hard-core programmers, to glean insight into any of this
data store, it's going to be more beneficial.
I think in the next decade you'll find that you'll
probably be doing less and less Java programming and more
and more just natural language logic, I would think.
HAL VARIAN: Theo, I hope you're going to say a word or
two about protocol buffers.
THEO VASSILAKIS: Protocol buffers, yes, of course.
I'll plug protocol buffers for sure.
HAL VARIAN: Which you made as an open standard, right?
THEO VASSILAKIS: Right.
It's actually an open source system.
But before that, I was actually going to say I really
agree with your point about experimentation.
And I actually remember a time at Google where, if you wanted
to run an experiment--
for example, on search-- there was one engineer who is one of
our distinguished engineers now, Diane.
And you had to go ask her for some cookies on which you
could run your experiment.
It was sort of like, she would allot you some cookies.
Those days are over, but they really do generate a lot of
this CRAP data, because all of those experiments accumulate
over the years.
And yet it's really important to have the historical view of
hey, we tried this.
Here's what happened then.
And I think actually this plugs directly into this
problem of standards, because the way that all of the
engineers years back recorded their results, was very, very
different than the ways that engineers today
record their results.
So maybe, at the time, some of them didn't
have protocol buffers.
Which is, if you like, a kind of XML-like format for
representing data that Google created, but a much more
efficient type of format.
And so I think the problem comes because we want to
integrate all of this variety of data.
And what I would say is, I agree with Gus that I don't
see a lot of appetite for very generic standards.
But I do see people having a need to bridge all of their
old data and the new data.
And I would basically make two analogs here-- is that I think
one of the things that really helped the development of data
warehousing was fairly standard SQL.
And it was never a standard standard.
Like, there existed a standard, but no one really
followed the standard very closely.
But if it was close enough, you could get
your systems to work.
And I think the other aspect is file formats.
If you can take a file format and feed it into different
systems, that will really help.
And so until now, CSV was the end-all, be-all file format
I think we'll see more of these as we need to trade data
that's more structured--
that has protocol buffers or XML.
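The limitation of CSV that Theo alludes to shows up as soon as a record contains nested or repeated fields. The sketch below uses JSON from the Python standard library as a stand-in for a structured format; it is not protocol buffers, which need a schema file and generated code. The record layout and the ad hoc `;` and `|` separators are assumptions made up for this example.

```python
import csv
import io
import json

# A hypothetical log record with a repeated, nested field.
record = {
    "query": "weather berkeley",
    "clicks": [
        {"position": 1, "url": "example.com/a"},
        {"position": 3, "url": "example.com/b"},
    ],
}

# CSV has no notion of nesting, so the repeated field has to be
# flattened by hand with separators that every consumer must know about.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["query", "clicks"])
writer.writerow(
    [
        record["query"],
        ";".join(f'{c["position"]}|{c["url"]}' for c in record["clicks"]),
    ]
)
rows = list(csv.reader(io.StringIO(buf.getvalue())))
csv_clicks = rows[1][1]  # one opaque string that must be re-parsed

# A structured format carries the nesting itself and round-trips as-is.
decoded = json.loads(json.dumps(record))

print(csv_clicks)
print(decoded["clicks"][1]["url"])
```

The CSV reader hands back a single string for the clicks column, while the structured round trip returns the original nested record, which is the kind of difference that matters when trading data between systems.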
CHARLES FAN: And if I could, let me add a plug for
VMware, as well.
As we mentioned, I think we are agreeing that we should
allow the chaos to continue for a little while.
However, there are certain parts I think we can help
people to make it easier.
Which is how do you stand things up.
Hadoop has a great system, but as Gus can probably tell you,
it's not so easy for enterprises to stand up a
Often the enterprise needs to stand up many of
those Hadoop clusters.
And some will need to stand up other type of data stores.
And that's where VMware is a leader in the virtualization
software and cloud infrastructure.
And we are building tools, which include an open
source project called Serengeti, which is helping
people to easily stand up their Hadoop clusters, as well
as other data stores--
really automate some of those headaches or tough work.
And so they can focus on the work that matters.
HAL VARIAN: Let me put in a good word about standards.
Because when you look at companies, how do they grow?
They grow through acquisition.
When they grow through acquisition, you end up with
data silos everywhere.
And data silos are the enemy of big data.
And the amazing thing about Google, because of the work
that Theo and his team do, is we have no
data silos at Google.
Now that's not 100% true, of course, but when we bring an
acquisition in, we spend a lot of time trying to get their
data infrastructure aligned with our own internal
And what it means is, you can basically pick an engineer off
of one project and move them on to another project,
completely at the other side of the company.
And they're productive in the first week because of having
that standardized infrastructure that we have.
And that is not something that most companies have the luxury
of dealing with.
The biggest problem that most companies face in data
management is trying to get this interoperation among the
different legacy systems.
You know, there's this old line, how did God create the
world in only six days?
And the answer is, he didn't have a legacy
system to worry about.
So everybody in the business faces, how's
that going to be solved?
That's my question.
How do you solve that?
GUSTAV HORN: I think you're right.
There are a lot of heterogeneous databases and a
lot of things that need to be stitched together.
And I think that big data--
again from the Hadoop perspective-- there are lots
of connectors out there-- from Flume, from Sqoop.
And I think that's key.
You'll find that a lot of these big database companies
are having to embrace open source.
They're having to embrace Hadoop, because if they don't
embrace it, they're going to become roadkill.
So they're looking for ways to monetize it, from consulting
services and things like that.
And also how then can they play in this market and become
leaders in this market, so they retain
their customer base.
Because the bottom line is, the Oracles of the world, the
SAPs, these people make money through selling licenses.
Hadoop is a license killer.
So that's going to directly impact their ability to be
profitable from a stock market perspective.
They need to find ways to innovate that allow them to
keep that trajectory.
And then the other thing I would say is, that a lot of
times the biggest problem I've found in industry, when I go
meeting with big customers or potential customers, is that
they don't know where to start.
They have a huge data problem, not just a big data problem.
They have data everywhere and silos in different corners of
And they don't have one person who is competent enough from a
technical perspective to know how to move forward.
They have individual islands or teams that are looking at
how they can move forward.
And the real strength in big data and big data analytics is
the heterogeneous nature of the data.
That's one of the key strengths
of this entire industry--
is the fact that you want to stitch together all of these
different data sources, and then be able to find those
correlations amongst them.
It doesn't do anybody any good to do a structured database in
Hadoop, and you're just doing the same old thing.
What's the benefit?
There is no benefit.
The benefit is when you're able to combine all of these
sources into one place and you find that
needle in the haystack.
Or you're able to better understand your customer.
Because fundamentally, all of these things
are customer driven.
I don't care whether it's Google.
I don't care whether it's VMware.
If the customer isn't happy, they're not
going to come back.
They're not going to like your website.
They're not going to like your product.
So the bottom line is, how can you find ways to modify what
you're doing to make it better for the customer.
And if you're able to find those needles because you can
stitch together all of these different sources--
including social media, including global search
engines and global communities--
and find out what people are doing, you'll find out those
subtle differences that really become the real game changer.
And that's really what big data is about.
CHARLES FAN: Yeah and I think another way I'll dissect the
big data, is that it can be looked at as four layers of
From the very top is the big data applications.
And to the second layer, which is big data analytics--
the various machine learning and other
algorithms you can apply.
The third layer is the big data management--
the query engines and so on, that you can query the data.
And the bottom layer is the data infrastructure--
the storage, and so on where you store the data.
I think to the question, the more bottom the layer, I think
it's closer to standardization.
I think there is, maybe to Theo's comment, there probably
can be a unified big data store, where all the bits, all
the CRAP, eventually end up somewhere.
There's a sink, a common sink for all the CRAP.
And they come into here.
I think right now we should still allow various different
ways for them to be queried.
Even in our Hadoop system, some people like to use Pig,
some people like to use Hive, some people like to just do HBase
directly on HDFS.
Some people like to, Dremel is another way you can
interact with it.
And I'm sure there are new innovations coming out of
Google, out of everywhere in the ecosystem.
And like in [INAUDIBLE].
When I talk about standardization chaos,
sometimes I'll go back to the history--
for me, it's Chinese history.
Where, for those of you who have read the Chinese book
called "The Romance of Three Kingdoms," where the first
line of the novel is, "After unification it's chaos.
After chaos, it's unification."
And it's describing how, out of all the warlords fighting
in chaos, inevitably somebody strong will emerge
and unify the land.
And that will be your emperor.
And also inevitably, whether it's after he gets old or,
whether he dies and his kids get weak, that it will fall
back into chaos.
And this is traditional dynasties that repeat about a
That's 4,000 years of Chinese history.
And I think that can apply to the history of anywhere else.
As well, it can apply to the data processing, the data
Where we are in this period, going from a more unified SQL
interface, more unified data management query engines, to a
more diversified world.
But I would predict in ten years, there will be leading
standards or ad hoc standards-- de facto
standards-- that are going to emerge, where the majority of
the big data problems are going to be solved in that way.
THEO VASSILAKIS: Yeah, I agree with that.
I don't know if it'll be in the form of a W3C standard or
something like that, but I think that's a little bit the
dynamic that Hal was referring to inside of Google.
That after n years of fighting with all of the different
varieties of things, people kind of said, well we
understand now that it's not the purpose of our team here
over in Maps to really build that entire stack.
Because now that we know what all that entire stack entails,
we realize that it's really far too big for us
to do on our own.
And so we're willing to concentrate further up the
stack in the parts that we really care about.
And that then led lots of groups at Google to look
around and say, OK well, what is a piece of technology that
exists, and is reasonably mature.
And a lot of people will use it, and it
gives us this advantage.