
ROBERT CROWE: I'm Robert Crowe. And we are here today to talk about production pipelines, ML pipelines. So we're not going to be talking about ML modeling too much, or different architectures. This is really all focused on when you have a model and you want to put it into production so that you can offer a product or a service, or some internal service within your company, and it's something that you need to maintain over the lifetime of that deployment.

So normally when we think about ML, we think about modeling code, because it's the heart of what we do. Modeling and the results that we get from the amazing models that we're producing these days, that's the reason we're all here, the results we can produce. It's what papers are written about, for the most part, overwhelmingly. The majority are written about architectures and results and different approaches to doing ML. It's great stuff. I love it. I'm sure you do too.

But when you move to putting something into production, you discover that there are a lot of other pieces that are very important to making that model that you spent a lot of time putting together available and robust over the lifetime of a product or a service that you're going to offer out to the world, so that people can really experience the benefits of the model that you've worked on. And those pieces are what TFX is all about.

In machine learning, we're familiar with a lot of the issues that we have to deal with, things like: where do I get labeled data? How do I generate the labels for the data that I have? I may have terabytes of data, but I need labels for it. Does my labeled data cover the feature space that I'm going to see when I actually run inference against it? Is my dimensionality minimized, or can I do more to try to simplify my feature vector to make my model more efficient? Have I really got the predictive information in the data that I'm choosing?

And then we need to think about fairness as well. Are we serving all of the customers that we're trying to serve fairly, no matter where they are, what religion they are, what language they speak, or what demographic they might be? Because you want to serve those people as well as you can; you don't want to unfairly disadvantage people. And we may have rare conditions too, especially in things like health care, where we're making a prediction that's going to be pretty important to someone's life, and it may be for a condition that occurs very rarely.

But a big one when you go into production is understanding the data lifecycle. Because once you've gone through that initial training and you've put something into production, that's just the start of the process. You're now going to try to maintain that over a lifetime, and the world changes. Your data changes. Conditions in your domain change.

Along with that, you're now doing production software deployment. So you have all of the normal things that you have to deal with in any software deployment, things like scalability. Will I need to scale up? Is my solution ready to do that? Can I extend it? Is it something that I can build on? Modularity, best practices, testability. How do I test an ML solution? And security and safety, because we know there are attacks on ML models that are getting pretty sophisticated these days.

Google created TFX for us to use. We created it because we needed it. It was not the first production ML framework that we developed. We've actually learned over many years, because we have ML all over Google, taking in billions of inference requests, really on a planet scale. And we needed something that would be maintainable and usable at a very large production scale, with large data sets and large loads, over a lifetime. So TFX has evolved from earlier attempts, and it is now what most of the products and services at Google use. And now we're also making it available to the world as an open-source product, available to you now to use for your production deployments.

It's also used by several of our partners and by other companies that have adopted TFX. You may have heard talks from some of them at the conference already. And there's a nice quote there from Twitter, where they did an evaluation. They were coming from a Torch-based environment, looked at the whole ecosystem of TensorFlow, and moved everything that they did to TensorFlow. One of the big contributors to that was the availability of TFX.

The vision is to provide a platform for everyone to use. Along with that, there are some best practices and approaches that we're trying to really make popular in the world, things like strongly typed artifacts, so that when your different components produce artifacts, they have a strong type. Pipeline configuration, workflow execution, being able to deploy on different platforms, different distributed pipeline platforms using different orchestrators, different underlying execution engines-- trying to make that as flexible as possible.

There are some horizontal layers that tie together the different components in TFX. And we'll talk about components here in a little bit. We have a demo as well that will show you some of the code and some of the components that we're talking about. Among the horizontal layers, an important one is metadata storage. Each of the components produces and consumes artifacts. You want to be able to store those, and you may want to do comparisons across months or years to see how things changed, because change becomes a central theme of what you're going to do in a production deployment.

This is a conceptual look at the different parts of TFX. On the top, we have tasks-- a conceptual look at tasks. So things like ingesting data, or training a model, or serving the model. Below that, we have libraries that are available, again, as open-source components that you can leverage. They're leveraged by the components within TFX to do much of what they do. And on the bottom row, in orange, a good color for Halloween, we have the TFX components. And we're going to get into some detail about how your data will flow through the TFX pipeline to go from ingesting data to a finished trained model on the other side.

So what is a component? A component has three parts. This is a particular component, but it could be any of them. Two of those parts, the driver and publisher, are largely boilerplate code that you could change. You probably won't. A driver consumes artifacts and begins the execution of your component. A publisher takes the output from the component and puts it back into metadata. The executor is really where the work is done in each of the components, and that's also a part that you can change. So you can take an existing component, override the executor in it, and produce a completely different component that does completely different processing.
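Here's a minimal sketch of what that executor-override pattern can look like in code; it's not from the talk, and the module paths and the custom_executor_spec argument vary between TFX releases, so treat the names as illustrative:

    from tfx.dsl.components.base import base_executor, executor_spec

    class MyExecutor(base_executor.BaseExecutor):
        """Same inputs and outputs as the original component, different processing."""

        def Do(self, input_dict, output_dict, exec_properties):
            # input_dict and output_dict map channel names to lists of artifacts;
            # read the inputs, do the custom work, and write the outputs here.
            ...

    # Components that accept a custom executor spec (for example, Trainer in
    # many releases) can then be constructed with the replacement executor:
    # trainer = Trainer(..., custom_executor_spec=executor_spec.ExecutorClassSpec(MyExecutor))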

Each of the components has a configuration, and for TFX, that configuration is written in Python. It's usually fairly simple. Some of the components are a little more complex, but most of them are just a couple of lines of code to configure.
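As a rough sketch of what that Python configuration looks like for a couple of the standard components (using the TFX 1.x style API; data_root is a placeholder for wherever your CSV files live):

    from tfx import v1 as tfx

    # Ingest CSV files from a directory and emit them as examples,
    # split into training and eval sets.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Compute summary statistics over the ingested examples.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs['examples'])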

The key essential aspect here, that I've alluded to, is that there is a metadata store. The component will pull data from that store as it becomes available. So there's a set of dependencies that determine which artifacts that component depends on. It'll do whatever it's going to do, and it's going to write the result back into metadata.

Over the lifetime of a model deployment, you start to build a metadata store that is a record of the entire lifetime of your model. And the way that your data has changed, the way your model has changed, the way your metrics have changed, it becomes a very powerful tool. Components communicate through the metadata store. So an initial component will produce an artifact and put it in the metadata store. The components that depend on that artifact will then read from the metadata store, do whatever they're going to do, put their result into it, and so on. And that's how we flow through the pipeline.

So the metadata store I keep talking about: what is it? What does it contain? There are really three kinds of things that it contains. First, the artifacts themselves. They could be trained models, they could be data sets, they could be metrics, they could be splits. There are a number of different types of objects in the metadata store. Those are grouped into execution records. So when you execute the pipeline, that becomes an execution run, and the artifacts that are associated with that run are grouped under that execution run. So again, when you're trying to analyze what's been happening with your pipeline, that becomes very important. And then there's the lineage of those artifacts-- which artifact was produced by which component, which consumed which inputs, and so on.
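The store itself is the ML Metadata (MLMD) library, and you can query it directly. A minimal sketch against a local SQLite-backed store follows; the metadata.db path is just an example:

    from ml_metadata.metadata_store import metadata_store
    from ml_metadata.proto import metadata_store_pb2

    config = metadata_store_pb2.ConnectionConfig()
    config.sqlite.filename_uri = 'metadata.db'  # example path to the store
    config.sqlite.connection_mode = (
        metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)

    store = metadata_store.MetadataStore(config)
    artifacts = store.get_artifacts()    # trained models, data sets, metrics, splits, ...
    executions = store.get_executions()  # one record per component execution
    events = store.get_events_by_execution_ids(
        [e.id for e in executions])      # lineage links between executions and artifacts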

So that gives us some functionality that becomes very powerful over the lifetime of a model. You can find out which data a model was trained on, for example. If you're comparing the results of two different model trainings that you've done, tracing it back to how the data changed can be really important, and we have some tools that allow you to do that. TensorBoard, for example, will allow you to compare the metrics from, say, a model that you trained six months ago and the model that you just trained now, to try to understand. I mean, you could see that it was different, but why-- why was it different?

And warm-starting becomes very powerful too, especially when you're dealing with large amounts of data that could take hours or days to process. Being able to pull that data from cache if the inputs haven't changed, rather than rerunning that component every time, becomes a very powerful tool as well.
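In TFX that reuse can be turned on with the enable_cache flag when you define the pipeline. A minimal sketch, using TFX 1.x names, where pipeline_root and metadata_path are placeholder paths and the components come from the earlier example:

    from tfx import v1 as tfx

    metadata_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(
        metadata_path)  # placeholder path to the metadata DB

    my_pipeline = tfx.dsl.Pipeline(
        pipeline_name='my_pipeline',
        pipeline_root=pipeline_root,  # placeholder output directory
        components=[example_gen, statistics_gen],
        metadata_connection_config=metadata_config,
        enable_cache=True)  # skip rerunning a component when its inputs haven't changed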

So there's a set of standard components that are shipped with TFX. But I want you to be aware from the start that you are not limited to those standard components. This is a good place to start, and it'll get you pretty far down the road. But you will probably have needs-- you may or may not-- where you need to extend the components that are available. And you can do that, in a couple of different ways.

This is sort of the canonical pipeline that we talk about. So on the left, we're ingesting our data. We flow through: we split our data, we calculate some statistics against it. And we'll talk about this in some detail. We then make sure that we don't have problems with our data, and try to understand what types our features are. We do some feature engineering, we train. This probably sounds familiar. If you've ever been through an ML development process, this is mirroring exactly what you always do. Then you're going to check your metrics across that.
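As a rough sketch of how that canonical flow can be wired up with the standard components (TFX 1.x names; data_root and the module_file paths are placeholders):

    from tfx import v1 as tfx

    # Ingest and split the data, then compute statistics and infer a schema.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])
    schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs['statistics'])

    # Check the data for problems against the schema.
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])

    # Feature engineering, then training.
    transform = tfx.components.Transform(
        examples=example_gen.outputs['examples'],
        schema=schema_gen.outputs['schema'],
        module_file=transform_module)  # placeholder: your preprocessing_fn lives here

    trainer = tfx.components.Trainer(
        examples=transform.outputs['transformed_examples'],
        transform_graph=transform.outputs['transform_graph'],
        schema=schema_gen.outputs['schema'],
        module_file=trainer_module,  # placeholder: your run_fn lives here
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100))

    # Checking metrics is the Evaluator component's job; these components then go
    # into a Pipeline object like the one sketched earlier and run under an
    # orchestrator such as tfx.orchestration.LocalDagRunner.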