LUKASZ KAISER: Hi, my name is Lukasz Kaiser, and I want to tell you in this final session about Tensor2Tensor, which is a library we've built on top of TensorFlow to organize the world's models and data sets. So I want to tell you about the motivation, how it came together, and what you can do with it. But if you have any questions in the meantime, anytime, just ask. And especially if you've already used Tensor2Tensor, you might have even more questions.

The motivation behind this library is-- well, I am a researcher in machine learning. I've also worked on production [INAUDIBLE] models, and research can be very annoying. It can be very annoying for researchers, and it's even more annoying for the people who put it into production, because research works like this. You have an idea. You want to try it out. It's machine learning, so you think, well, I will change something in the model, and it will be great. It will solve physics problems, or translation, or whatever.

So you have this idea, and you think, it's so simple, I just need to make one tweak. But then, OK, I need to get the data. Where was it? So you search online, you find it, and then, well, I need to preprocess it. You implement some data reading. You download the model that someone else built, and it doesn't give the result that they reported in the paper at all. It's worse. It runs 10 times slower. It doesn't train at all. So then you start tweaking it. It turns out someone else had a script that preprocessed the data in a certain way that improved the model 10 times, so you add that. Then it turns out your input pipeline is not performing, because it doesn't keep the GPU or CPU fed with data, so you tweak that too. Before you even start on your research idea, you've spent half a year just reproducing what's been done before.

Then, great, you do your idea. It works. You write the paper. You submit it. You put the code in a repo on GitHub, with a README file that says, well, I downloaded the data from there-- but that link was already gone two days after the repo was made-- and then I applied these tweaks. And you describe all these 17 tweaks, but maybe you forgot the one option that was crucial. Well, and then there is the next paper, the next piece of research, and the next person comes along and does the same. So it's all great, except that at some point the production team says, well, we should put this into production, it's a great result. And then they need to retrace this whole path, redo all of it, and try to get the same result. So it's a very difficult state of the world. And it's even worse because there are different hardware configurations. Maybe something that trained well on a CPU does not train on a GPU, or maybe you need an 8-GPU setup, and so on and so forth.

So the idea behind Tensor2Tensor was: let's make a library that has at least a bunch of standard models for standard tasks, and that includes the data and the preprocessing, so that you really can, on the command line, just say, please get me this data set and this model, and train it-- a sketch of what that looks like follows below. And let's make it so that we can have regression tests and actually know that it will train, that it will not break with TensorFlow 1.10, and that it will train on a GPU, on a TPU, and on a CPU-- to have it all in a more organized fashion.

And the thing that prompted Tensor2Tensor, the reason I started it, was machine translation. I worked with the Google Translate team on launching neural networks for translation.
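To make that command-line promise concrete, here is roughly what the workflow looks like with the released T2T tools, using the English-German translation problem this talk is about. Treat it as a sketch: the problem, model, and hyperparameter-set names follow the public Tensor2Tensor README, but exact flag spellings changed a bit across versions.

```sh
pip install tensor2tensor

# Download and preprocess the data set for a registered problem.
t2t-datagen \
  --problem=translate_ende_wmt32k \
  --data_dir=$HOME/t2t_data \
  --tmp_dir=/tmp/t2t_tmp

# Train a registered model on that data with a named hyperparameter set.
t2t-trainer \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --data_dir=$HOME/t2t_data \
  --output_dir=$HOME/t2t_train
```

The point is that the data set, the preprocessing, the model, and the hyperparameters are all named, versioned things inside the library, so the same two commands double as the regression test.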
And this was two years ago, and it was amazing work. Because before that, machine translation was done with what's called phrase-based machine translation: you find alignments of phrases, you translate the phrases, and then you try to reorder them so the sentences work. And results in machine translation are normally measured in terms of something called the BLEU score. I will not go into the details of how it's computed-- roughly, the higher the better, and there is a toy example of computing it at the end of this passage. So for example, for English-German translation, the BLEU score that human translators get is about 30. And the best phrase-based-- so non-neural-network, non-deep-learning-- systems were at about 20, 21. And that was after at least a decade of research, maybe more. When I was doing my PhD, if you got the BLEU score up by one point, you were a star. That was a good PhD. If you went from 21 to 22, it was amazing.

Then the neural networks came. The early LSTMs in 2015 were at something like 19.5, 20. And we talked to the Translate team, and they said, you know, guys, it's fun, it's interesting, because it's simpler in a way. You just train the network on the data. You don't have all the language-specific stuff. It's a simpler system. But it gets worse results, and who knows if it will ever get better. But then neural network research moved on, and people started getting 21, 22. So the Translate team, together with Brain, where I work, made a big effort to build a really large LSTM model, which is called GNMT, the Google Neural Machine Translation model. And indeed it was a huge improvement. It got to 25 BLEU. Later, when we added mixtures of experts, it even got to 26. So they were amazed. It launched in production, and it was something like a two-year effort to take the papers, scale them up, and launch it. And to get these really good results, you really needed a large network.

As an example of why this was important for Google: you have a sentence in German here, which says, "problems can never be solved with the same way of thinking that caused them." The neural translator translates the sentence kind of the way it should-- I doubt there is a much better translation-- while the phrase-based translator, you can see, gives "no problem can be solved from the same consciousness that they have arisen." It kind of shows how the phrase-based method works. Every word or phrase is translated correctly, but the whole thing does not exactly add up. You can see it's a very machiney way of translating, and it's not so clear what it is supposed to say.

So the big advantage of neural networks is that they train on whole sentences. They can even train on paragraphs. They can be very fluent. Since they take the whole context into account at once, it's a really big improvement. And if you ask people to score translations, this really starts coming close-- it covers at least 80% of the distance to what human translators do, at least on newspaper language. Not poetry. [CHUCKLING] We're nowhere near that.

So it was great. We got the high BLEU scores. We reduced the distance to human translators. It turned out that one system can handle different languages, and sometimes even multilingual translation. But there were problems. One problem is the training time. It took about a week on a setup of 64 to 128 GPUs, and all the code for that was written specifically for this hardware setup. So it was distributed training where everything in the machine learning pipeline was tuned for the hardware.
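Here is the toy BLEU example promised above. It is not from the talk: it uses the sacrebleu Python package, one standard implementation of the metric, and reports scores on the same 0-to-100 scale used throughout this talk.

```python
# pip install sacrebleu
import sacrebleu

# One system output per sentence, and one stream of reference translations.
hypotheses = ["problems can never be solved with the same way of thinking that caused them"]
references = [["problems can never be solved with the same way of thinking that caused them"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 100.0 for an exact match; strong MT systems land around 20-30
```

BLEU only measures n-gram overlap with the references, which is why a score near the human-translator level (about 30 here) means the metric is saturating, not that translation is solved.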
Why was everything tuned like that? Well, because we knew we would train in this data center, on this hardware. So why not? Well, the problem is that batch sizes and learning rates come together; you cannot tune them separately (one common heuristic relating them is sketched at the end of this section). And then you add tricks. You tweak some things in the model that are really good for this specific setup, for this specific learning rate or batch size. This distributed setup was training asynchronously, so there were delayed gradients, which act as a regularizer, so you decrease dropout. You start designing parts of the model specifically for one hardware setup.

And then you write the paper. We did write a paper. It was cited. But nobody outside of Google ever managed to reproduce it-- to get the same result with the same network-- because we can give you our hyperparameters, but you're running on a different hardware setup, and you will not get the same result. And then, in addition to the machine learning setup, there is the whole tokenization pipeline, the data preparation pipeline. And even though these results are on public data, the whole preprocessing is also partially Google-internal. None of that matters much by itself, but it really did not allow other people to build on top of this work. So it launched, and it was a success for us, but in the research sense, we felt that it fell a little short. For one, you'd need a huge hardware setup to train it. And on the other hand, even if you had the hardware setup, or if you got it on cloud and wanted to invest in it, there would still be no way for you to just do it.

And that was the prompt. That's why I thought, OK, we need to make a library for the next time we build a model. The LSTMs were the first wave of sequence models with the first great results. But I thought, the next time we come to build a model, we need to have a library that will ensure it works at Google and outside; that will make sure that when you train on one GPU, you get a worse result, but we know what it is-- we can tell you, yes, you're on the same setup, just scale up. And it should work on cloud, so if you want a better result, you can pay for larger hardware. But it should be tested, done, and reproducible outside.

So the Tensor2Tensor library started with the model called the Transformer, which is the next generation of sequence models. It's based on self-attention layers (a minimal sketch of such a layer follows below). We designed this model, and it got even better results. It got 28.4 BLEU. Now we are on par, in BLEU, with human translators. So this metric is not good anymore; it just means that we need better metrics. But this thing can train in one day on an 8-GPU machine. So you can just get it. Get an 8-GPU machine.
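To make "self-attention layers" a little more concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer. This is the published mechanism, softmax(QK^T / sqrt(d)) V, not Tensor2Tensor's actual implementation: it leaves out the learned projections, multiple heads, and masking.

```python
import numpy as np

def self_attention(x):
    """Minimal scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, and V are all the input itself here; a real Transformer layer
    first applies learned linear projections and uses several heads.
    """
    q, k, v = x, x, x
    d = x.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # pairwise similarity of positions
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v                              # each output mixes all positions

x = np.random.randn(5, 16)      # a "sentence": 5 tokens, 16 features each
print(self_attention(x).shape)  # (5, 16)
```

Every position attends to every other position in a single step, which is what lets the model use whole-sentence context at once and parallelize well on a machine like that.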
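And going back to the earlier point that batch sizes and learning rates come together: one common heuristic from the large-batch training literature (not from this talk) is the linear scaling rule with warmup. The sketch below only illustrates why the two cannot be tuned independently; the constants are made up.

```python
def scaled_learning_rate(base_lr, base_batch, batch, step, warmup_steps=1000):
    """Linear scaling heuristic: grow the batch k-fold, grow the learning
    rate k-fold, and ramp up linearly over the first warmup_steps."""
    lr = base_lr * (batch / base_batch)        # scale learning rate with batch size
    return lr * min(1.0, step / warmup_steps)  # linear warmup for stability

# Tuned at batch 32 with learning rate 0.1; moving to batch 256
# implies a learning rate of 0.8 once warmup is over.
print(scaled_learning_rate(0.1, 32, 256, step=1000))  # 0.8
```

Change one without the other, or change the hardware that determines the feasible batch size, and the published hyperparameters no longer reproduce the published result.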