
  • LUKASZ KAISER: Hi, my name is Lukasz Kaiser,

  • and I want to tell you in this final session

  • about Tensor2Tensor, which is a library we've

  • built on top of TensorFlow to organize the world's models

  • and data sets.

  • So I want to tell you about the motivation,

  • and how it came together, and what you can do with it.

  • But also if you have any questions

  • in the meantime anytime, just ask.

  • And if you've already used Tensor2Tensor,

  • you might have even more questions.

  • But the motivation behind this library

  • is-- so I am a researcher in machine learning.

  • I've also worked on putting models into production,

  • and research can be very annoying.

  • It can be very annoying to researchers,

  • and it's even more annoying to people

  • who put it into production, because the research works

  • like this.

  • You have an idea.

  • You want to try it out.

  • It's machine learning, and you think,

  • well, I will change something in the model, it will be great.

  • It will solve physics problems, or translation, or whatever.

  • So you have this idea, and you're like, it's so simple.

  • I just need to make one tweak, but then, OK, I

  • need to get the data.

  • Where was it?

  • So you search for it online, you find it,

  • and it's like, well, now I need to preprocess it.

  • You implement some data reading.

  • You download the model that someone else did.

  • And it doesn't give the result at all

  • that someone else wrote in the paper.

  • It's worse.

  • It works 10 times slower.

  • It doesn't train at all.

  • So then you start tweaking it.

  • Turns out, someone else had a script

  • that preprocessed the data in a certain way that

  • improved the model 10 times.

  • So you add that.

  • Then it turns out your input pipeline is not performing well,

  • because it doesn't feed data to the GPU or CPU fast enough.

  • So you tweak that.

  • Before you start with your research idea,

  • you've spent half a year on just reproducing

  • what's been done before.

  • So then great.

  • Then you do your idea.

  • It works.

  • You write the paper.

  • You submit it.

  • You put it in the repo on GitHub,

  • which has a README file that says,

  • well, I downloaded the data from there,

  • but that link was already dead two days

  • after the repo was made.

  • And then it says, I applied these tweaks.

  • And you describe all 17 tweaks,

  • but maybe you forgot one option that was crucial.

  • Well, and then there is the next paper and the next research,

  • and the next person comes and does the same.

  • So it's all great, except that the production team, at some point,

  • is like, well, we should put this into production.

  • It's a great result. And then they

  • need to track this whole path, redo all of it,

  • and try to get the same.

  • So it's a very difficult state of the world.

  • And it's even worse because there are different hardware

  • configurations.

  • So maybe something that trained well on a CPU

  • does not train on a GPU, or maybe you need an 8 GPU setup,

  • and so on and so forth.

  • So the idea behind Tensor2Tensor was,

  • let's make a library that has at least a bunch

  • of standard models for standard tasks that includes

  • the data and the preprocessing.

  • So you really can, on a command line, just say,

  • please get me this data set and this model, and train it,

  • and make it so that we can have regression tests and actually

  • know that it will train, and that it will not break with

  • TensorFlow 1.10.

  • And that it will train both on the GPU and on a TPU,

  • and on a CPU--

  • to have it in a more organized fashion.
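In practice, that command-line flow looks roughly like this. This is a sketch, not an exact recipe: the problem name and directories are just examples, and flag details vary by version, so check `t2t-datagen --help` and `t2t-trainer --help` for your install.

```shell
pip install tensor2tensor

# Download and preprocess a standard dataset for a standard task:
t2t-datagen \
  --problem=translate_ende_wmt32k \
  --data_dir=$HOME/t2t_data \
  --tmp_dir=/tmp/t2t_tmp

# Train a standard model on it with a known-good hyperparameter set:
t2t-trainer \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base \
  --data_dir=$HOME/t2t_data \
  --output_dir=$HOME/t2t_train
```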

  • And the thing that prompted Tensor2Tensor,

  • the thing why I started it, was machine translation.

  • So I worked with the Google Translate team

  • on launching neural networks for translation.

  • And this was two years ago, and this was amazing work.

  • Because before that, machine translation

  • was done in this way like--

  • it was called phrase-based machine translation.

  • So if you find some alignments of phrases,

  • then you translate the phrases, and then you

  • try to realign the sentences to make them work.

  • And the results in machine translation

  • are normally measured in terms of something

  • called the BLEU score.

  • I will not go into the details of what it is.

  • It's like the higher the better.

  • So for example, for English-German translation,

  • the BLEU score that human translators get is about 30.

  • And the best phrase-based-- so non-neural network,

  • non-deep-learning-- systems were about 20, 21.

  • And it's been, really, a decade of research at least,

  • maybe more.

  • So when I was doing a PhD, if you got one BLEU score up,

  • you would be a star.

  • It was a good PhD.

  • If you went from 21 to 22, it would be amazing.
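For the curious, here is a rough sketch of what BLEU measures: modified n-gram overlap between the candidate and a reference, combined with a brevity penalty. The real metric is corpus-level and smoothed; this toy sentence-level version (my illustration, not from the talk) is only meant to show the shape of the score.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    Real BLEU is corpus-level and uses smoothing."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)

    if min(precisions) == 0:                   # no smoothing: any zero kills it
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return 100 * bp * geo_mean

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat".split()))  # 100.0, perfect match
```

A candidate with a single wrong word scores somewhere strictly between 0 and 100, which is why even small BLEU gains were hard-won.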

  • So then the neural networks came.

  • And the early LSTMs in 2015, they were like 19.5, 20.

  • And we talked to the Translate team,

  • and they were like, you know, guys, it's fun.

  • It's interesting, because it's simpler in a way.

  • You just train the network on the data.

  • You don't have all the--

  • no language-specific stuff.

  • It's a simpler system.

  • But it gets worse results, and who knows

  • if it will ever get better.

  • But then the neural network research moved on,

  • and people started getting 21, 22.

  • So the Translate team, together with Brain, where I work,

  • made the big effort to try to make a really large LSTM

  • model, which is called the GNMT, the Google Neural Machine

  • Translation.

  • And indeed it was a huge improvement.

  • It got to 25 BLEU.

  • Later, when we added mixtures of experts, it even got to 26.

  • So they were amazed.

  • It launched in production, and well, it

  • was like a two-year effort to take the papers,

  • scale them up, launch it.

  • And to get these really good results,

  • you really needed a large network.

  • So as an example why this is important,

  • or why this was important for Google is--

  • so you have a sentence in German here,

  • which is like, "problems can never

  • be solved with the same way of thinking that caused them."

  • And this neural translator translates the sentence kind

  • of the way it should--

  • I doubt there is a much better translation--

  • while the phrase-based translators, you can see,

  • "no problem can be solved from the same consciousness

  • that they have arisen."

  • It kind of shows how the phrase-based method works.

  • Every word or phrase is translated correctly,

  • but the whole thing does not exactly add up.

  • You can see it's a very machiney way,

  • and it's not so clear what it is supposed to say.

  • So the big advantage of neural networks

  • is they train on whole sentences.

  • They can even train on paragraphs.

  • They can be very fluent.

  • Since they take into account the whole context at once,

  • it's a really big improvement.

  • And if you ask people to score translations,

  • this really starts coming close--

  • or at least 80% of the distance to what human translators do,

  • at least on newspaper language-- not poetry.

  • [CHUCKLING]

  • We're nowhere near that.

  • So it was great.

  • We got the high BLEU scores.

  • We reduced the distance to human translators.

  • It turned out the one system can handle

  • different languages, and sometimes even

  • multilingual translations.

  • But there were problems.

  • So one problem is the training time.

  • It took about a week on a setup of 64 to 128 GPUs.

  • And all the code for that was done specifically

  • for this hardware setup.

  • So it was distributed training where

  • everything in the machine learning pipeline

  • was tuned for the hardware.

  • Well, because we knew we would train in this data

  • center on this hardware.

  • So why not?

  • Well, the problem is that batch sizes and learning rates

  • come together.

  • You cannot tune them separately.
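As an illustration of that coupling (my example, not from the talk): a common heuristic is the linear scaling rule -- when you multiply the batch size by some factor, multiply the learning rate by the same factor.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule (a heuristic, not a law): a k-times
    larger batch averages gradients over k-times more examples,
    so scale the learning rate by k to keep the effective step
    per example roughly constant."""
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.1, 256, 1024))  # 0.4
```

This is exactly the kind of dependency that silently breaks a model when someone reruns it on different hardware with a different batch size.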

  • And then you add tricks.

  • Then you tweak some things in the model

  • that are really good for this specific setup,

  • for this specific learning rate or batch size.

  • This distributed setup was training asynchronously.

  • So there were delayed gradients.

  • That acts as a regularizer, so you decrease dropout.

  • You start tuning parts of the model

  • specifically for a hardware setup.

  • And then you write the paper.

  • We did write a paper.

  • It was cited.

  • But nobody ever outside of Google

  • managed to reproduce this, get the same result

  • with the same network, because we can give you

  • our hyperparameters, but you're running on a different hardware

  • setup.

  • You will not get the same result.

  • And then, in addition to the machine learning setup,

  • there is the whole tokenization pipeline, the data

  • preparation pipeline.

  • And even though these results are on public data,

  • the whole pre-processing is also partially Google-internal.

  • It doesn't matter much.

  • But it really did not allow other people

  • to build on top of this work.

  • So it launched, it was a success for us,

  • but in the research sense, we felt that it

  • came short a little bit.

  • Because for one, I mean, you'd need a huge hardware setup

  • to train it.

  • And on the other hand, even if you had the hardware setup,

  • or if you got it on cloud and wanted to invest in it,

  • there would still be no way for you to just do it.

  • And that was the prompt, why I thought,

  • OK, we need to make a library for the next time

  • we build a model.

  • So the LSTMs were like the first wave of sequence models

  • with the first great results.

  • But I thought, OK, the next time we come to build a model,

  • we need to have a library that will ensure it works at Google

  • and outside, that will make sure when you train on one GPU,

  • you get a worse result, but we know what it is.

  • We can tell you, yes, you're on the same setup.

  • Just scale up.

  • And it should work on cloud so you can just,

  • if you want better result, get some money,

  • pay for larger hardware.

  • But it should be tested, done, and reproducible outside.

  • And so the Tensor2Tensor library started

  • with the model called the Transformer,

  • which is the next generation of sequence models.

  • It's based on self-attentional layers.
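The core of those self-attentional layers is scaled dot-product attention: every position computes a softmax-weighted average over every position in the sentence, which is how the whole context gets used at once. A minimal NumPy sketch (single head, random projections for illustration -- not the actual Transformer code):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x
    of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                  # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ v                               # context-mixed values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (5, 8): one context-aware vector per position
```

Unlike an LSTM, nothing here is sequential, which is also why this model parallelizes so well across GPU and TPU hardware.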

  • And we designed this model.

  • It got even better results.

  • It got 28.4 BLEU.

  • Now we are on par, in BLEU, with human translators.

  • So this metric is not good anymore.

  • It just means that we need better metrics.

  • But this thing, it can train in one day on an 8 GPU machine.

  • So you can just get it.

  • Get an 8 GPU machine.