  • LEE FLEMING: Good evening.

  • I am really pleased to welcome you all to "Leaders in Big

  • Data" hosted by Google and the Fung Institute of Engineering

  • Leadership at UC Berkeley.

  • I'm Lee Fleming.

  • I'm director of the Institute and this is Ikhlaq Sidhu,

  • chief scientist and co-founder.

  • The first and most important thing is to thank Google for

  • hosting the event.

  • So thank you very, very much.

  • There's a couple people in particular, Irena Coffman and

  • Gail Hernandez--

  • thank you-- and also Arnav Anant, our entrepreneur in

  • residence at the Fung Institute.

  • So here's Arnav.

  • AUDIENCE: A lot of work.

  • LEE FLEMING: Huge amount of work.

  • The Fung Institute-- we were founded about two years ago.

  • And the intent is to do research and pedagogical

  • development in topics of engineering leadership.

  • We have our degree, the Master's of Engineering--

  • professional Master's of Engineering M. Eng. program--

  • mainly around the Institute.

  • We also have ties though across the campus, as you'll

  • see shortly.

  • This is our intent to have a series of talks on topics of

  • interest to engineering leaders.

  • As it turns out, this Wednesday we

  • have our next talk.

  • It's sponsored by [? Thai ?] and the Fung Institute.

  • And the topic is entrepreneurship--

  • being an entrepreneur within your firm.

  • And fittingly, we have representatives from Google,

  • and Cisco, and SAP.

  • That's Wednesday.

  • Consult the Fung website or the [? Thai ?] website for

  • details on that.

  • So besides enjoying a good discussion tonight, we have an

  • ulterior motive, as you can probably tell.

  • We're trying to advertise all of our fantastic programs in

  • big data at Cal.

  • Now, whether you're interested in computation, or inference,

  • or application, or some combination of those things,

  • we've got the right program for you.

  • As I mentioned, the professional Masters of

  • Engineering, or M. Eng., across all the different

  • engineering departments--

  • one-year degree.

  • We have another one-year degree in the stats

  • department-- a professional degree.

  • There's a two-year degree in the Information School.

  • And finally, there's the Haas MBA.

  • Tonight we've got people from all these programs.

  • You can find their tables, ask them questions, and hopefully

  • we'll see you at Cal soon.

  • And we also have additional executive and other programs

  • associated with each of those departments

  • and schools as well.

  • Ikhlaq will now introduce our speakers.

  • IKHLAQ SIDHU: OK, thanks.

  • So let me see.

  • LEE FLEMING: Just slide this here.

  • IKHLAQ SIDHU: All right.

  • Welcome, I want to also thank a couple of people.

  • One is [? Claus Nickoli ?], who is not here at the moment,

  • but to you in the ether, he's just not at the meeting.

  • But he's our host here, and so thank you.

  • You guys can tell him that I thanked him.

  • And also, many of you I've seen here are basically

  • friends, and so thanks for coming.

  • It's good to see you again.

  • This is an event on big data.

  • And so I'm going to give you a little data on

  • who is speaking today--

  • who is here.

  • And the way I think of this is, what we've got is three

  • perspectives of big data from leading firms--

  • from people who represent leading firms in the area.

  • And so let's start with NetApp.

  • We've got Gustav Horn.

  • He is a senior consulting engineer with 25 years of

  • experience.

  • And he's built some of the largest enterprise-class

  • Hadoop systems in the world-- on the planet.

  • And from Google, Theodore Vassilakis, and he's a

  • principal engineer at Google.

  • He's head of the team that works on data analytics.

  • And he's been responsible for numerous contributions to

  • Google in terms of search, and the visualization

  • and representation of the results.

  • And from VMware, Charles Fan, who's senior VP of strategic

  • R&D. He co-founded Rainfinity and was CTO of the company

  • prior to its acquisition by EMC in 2005.

  • And our distinguished set of speakers is moderated by our

  • distinguished moderator, Hal Varian.

  • He is chief economist here at Google.

  • He's an emeritus professor at UC Berkeley and the founding

  • dean of the School of Information.

  • So with that, there's hardly anything more I

  • could possibly say.

  • Come on up Hal and take it away.

  • HAL VARIAN: Thank you.

  • I'm very impressed with the turnout tonight, seeing as

  • you're missing both the debate and the baseball game.

  • But at least it eliminates a difficult

  • choice for many people.

  • I will say that I'm going to follow the same rules as the

  • presidential debates.

  • So no kicking, biting, scratching, or bean balls are

  • allowed during this performance.

  • We're going to talk about foreign policy, wasn't that

  • the agreement?

  • No.

  • All right.

  • In any event, what I thought we'd do is, we'd have

  • each person talk for about five minutes, lay out their

  • theme, where they're coming from, what their perspective

  • is on big data.

  • And I will take some notes, and then ask some questions,

  • get a conversation going.

  • And I think we'll have a little time at the end for

  • some questions from the floor.

  • So, take it away.

  • THEO VASSILAKIS: Sure.

  • So, should I start, Hal?

  • HAL VARIAN: Yes.

  • THEO VASSILAKIS: All right.

  • Well, hey it's a real pleasure to be here.

  • Thank you guys also, and thank you guys for coming.

  • It's a huge, huge audience.

  • Just a couple of words.

  • As you heard, my name is Theo.

  • I lead some of our analytical systems.

  • So I'm responsible--

  • well, actually up until two weeks ago, I was responsible

  • for a stack that had parallel data warehousing components,

  • query engines, pieces like Dremel, and Tenzing systems

  • that let you query this data, and

  • visualization layers on top.

  • And that's one of the many, many systems at Google that I

  • think, outside, one would think of as

  • big-data type of systems.

  • And so I'll try to give you my perspective at least on the

  • Google view of big data.

  • And hopefully someone will cut me off when it's time.

  • I think I'll probably go for five minutes.

  • This could take a while.

  • AUDIENCE: [INAUDIBLE]

  • THEO VASSILAKIS: All right, sounds good.

  • Thank you.

  • I think, as you guys know, Google's business is primarily

  • about taking data and organizing the world's

  • information, and making it universally

  • accessible and useful.

  • So a lot of what the company does is really about sucking

  • in data-- whether it be the web, whether it be the imagery

  • from Street View, or satellite imagery, or maps information,

  • or Android pings, or you name it.

  • And then transforming it into usable forms.

  • So really, Google is kind of a big data

  • machine in some sense.

  • And I think the term big data came into

  • currency relatively recently.

  • And we all said, yeah, OK, that speaks to what we do.

  • Because we don't really have a word for it.

  • We just kind of knew that the data was large.

  • But just to try to put maybe more structure on to that, I

  • think the Google view on a lot of "what is big data

  • processing" kind of splits up into probably what I would

  • call ingestion type of processes--

  • things like the crawlers, things like all those Street

  • View cars running through all the streets of the world.

  • And then goes into transaction processing systems, where

  • perhaps we capture data through interactions on a lot

  • of our web properties, or a lot of the web properties that

  • we partner with.

  • This means people clicking on search, or people interacting

  • with docs, or people interacting with maps.

  • All generate many, many clicks and many, many interactions

  • that then become transactional big data.

  • Of course, that also includes people using let's say Google

  • Analytics on their sites to measure traffic on their

  • properties, which then generates huge volumes of

  • pings into Google--

  • many tens of thousands of QPS of pings.

  • So that's kind of the second big component.

  • And then probably the third component is the processing

  • side of all of that.

  • The process side includes things like MapReduce,

  • analysis, generating insights from that data--

  • maybe in the form of building machine learning models.

  • Maybe in the form of building, for example, Zeitgeist top

  • queries that can then be served out to the world to

  • say, hey here is what people are searching for.

  • Maybe in the form of n-grams of all the books that Google

  • scanned over many, many years of its ingestion processes.

  • But it's really baking all of that information and then

  • presenting it in some usable form, either through a system

  • such as our ad system that takes models and decides what

  • ads to show, or in a more direct

  • form such as the n-grams.

  • Just to say, OK, here are those three broad classes--

  • ingestion, transaction processing, and analytical

  • processing.

  • To dig a little bit deeper into each of those areas, I

  • would say the ingestion processes, especially the very

  • large scale ingestion processes, are

  • highly custom systems.

  • If you think about our web crawlers, if you think about

  • the Street View cars, if you think about maps stitching, or

  • satellite imagery stitching--

  • those are very, very custom processes that I think, at

  • least to this date, don't have a clear analog

  • in the general industry.

  • And maybe this is something that you guys might address or

  • might see differently than how I see the version.

  • They're still highly specialized systems

  • that produce very large images.

  • And they're very high performance, very complex

  • systems that are run by dedicated engineering teams.

  • The transaction processing systems or the storage systems

  • are things like the Google File System.

  • These are things like Bigtable.

  • These are things like Megastore.

  • Those are the ones that we've actually published papers

  • about and that are now reasonably well

  • known in the industry--

  • have evolved a little bit past the purely custom stage, where

  • they're fairly general purpose.

  • And there was a time at Google where actually most people did

  • their own storage in some form or another, until these

  • GFS-like systems evolved to the point where they were good

  • enough that more than one team could use them.

  • And actually, that evolution had many steps in which, for

  • example, everybody ran their own GFS.

  • And so maybe the ads team had their own GFS cells, and the

  • search team maybe had their own GFS cells.

  • And in time, the systems matured to the point where

  • actually we could have a centrally managed file system.

  • And I think recently you may have seen, we've now talked

  • about this global file system called Spanner which takes

  • that to yet another level of transactions and global

  • availability.

  • And then the third step, which is I think still in a

  • relatively immature stage compared to some of the

  • storage systems, is the analysis.

  • And I think a lot of people know about MapReduce and some

  • of the systems that have been built on top of that.
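
    For readers unfamiliar with the model, here is a minimal single-process sketch of the MapReduce pattern the speaker refers to. It is an illustration only, not Google's implementation: the names `map_reduce`, `query_mapper`, and `count_reducer` and the sample queries are hypothetical, and a real system would distribute the map, shuffle, and reduce phases across many machines. The example counts how often each search query occurs, in the spirit of the top-queries example mentioned earlier in the talk.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map -> shuffle -> reduce round in a single process."""
    # Map + shuffle: apply the mapper to each record and group values by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def query_mapper(log_line):
    """Hypothetical mapper: emit (normalized query, 1) for each log line."""
    yield log_line.strip().lower(), 1


def count_reducer(key, values):
    """Hypothetical reducer: total the counts emitted for one query."""
    return sum(values)


if __name__ == "__main__":
    # Made-up sample input standing in for a query log.
    queries = ["big data", "Big Data ", "hadoop", "big data"]
    counts = map_reduce(queries, query_mapper, count_reducer)
    print(counts)  # {'big data': 3, 'hadoop': 1}
```

    Keeping the mapper and reducer as separate plain functions mirrors the division of labor in the model: the framework owns grouping and data movement, while the user supplies only the per-record and per-key logic.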

  • So for example, Flume is a way of chaining MapReduces in

  • a more programmer-friendly way so that you don't end up with