A mildly fun thing to do when you're bored is start the beginning of a text message, and then use only the suggested words to finish it. "In five years I will see you in the morning and then you can get it."

The technology behind these text predictions is called a "language model": a computer program that uses statistics to guess the next word in a sentence. And in the past year, newer language models have gotten really, weirdly good at generating text that mimics human writing. "In five years, I will never return to this place. He felt his eye sting and his throat tighten."

The program completely made this up. It's not taken from anywhere else, and it's not using a template made by humans. For the first time in history, computers can write stories. The only problem is that it's easier for machines to write fiction than to write facts.

Language models are useful for a lot of reasons. They help "recognize speech" properly when sounds are ambiguous in speech-to-text applications. And they can make translations more fluent when a word in one language maps to multiple words in another. But if you asked language models to simply generate passages of text, the results never made much sense.

SHANE: And so the kinds of things that made sense to do were like generating single words or very short phrases.

For years, Janelle Shane has been experimenting with language generation for her blog AI Weirdness. Her algorithms have generated paint colors ("Bull Cream"), Halloween costumes ("Sexy Michael Cera"), and pick-up lines ("You look like a thing and I love you."). But this is what she got in 2017 when she asked for longer passages, like the first lines of a novel:

SHANE: The year of the island is discovered the Missouri of the galaxy like a teenage lying and always discovered the year of her own class-writing bed

...It makes no sense. Compare that to this opening line from a newer language model called GPT-2.
SHANE: It was a rainy, drizzling day in the summer of 1869. And the people of New York, who had become accustomed to the warm, kissable air of the city, were having another bad one.

JOSS: It's like it's getting better at bullsh*tting us.

SHANE: Yes, yes, it is very good at generating scannable, readable bullsh*t.

Going from word salad to pretty passable prose took a new approach in the field of natural language processing. Typically, language tasks have required carefully structured data: you need thousands of correct examples to train the program. For translation, you need samples of the same document in multiple languages. For spam filters, you need emails that humans have labeled as spam. For summarization, you need full documents plus their human-written summaries. Those data sources are limited and can take a lot of work to collect.

But if the task is simply to guess the next word in a sentence, the problem comes with its own solution: the training data can be any human-written text, no labeling required. This is called "self-supervised learning." That's what makes it easy and inexpensive to gather data, which means you can use a LOT of it. Like all of Wikipedia, or 11,000 books, or 8 million websites.

With that amount of data, plus serious computing resources and a few tweaks to the architecture and size of the algorithms, these new language models build vast mathematical maps of how every word correlates with every other word, all without being explicitly told any of the rules of grammar or syntax. That gives them fluency with whatever language they're trained on, but it doesn't mean they know what's true or false. Getting language models to generate true stories, like summarizing documents or answering questions accurately, takes extra training. The simplest thing to do without that extra work is just generate passages of text that are superficially coherent but false.
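The self-supervised idea above can be sketched with a toy example. This is only an illustration of the training signal, not how GPT-2 actually works (GPT-2 is a large neural network, not a bigram counter), and the function names and tiny corpus here are made up for the demo: every word in raw, unlabeled text serves as a free "correct answer" for the words that came before it.

```python
# Toy next-word predictor: learn which word tends to follow which,
# purely from raw text -- no human labels needed.
from collections import Counter, defaultdict

def train_bigram(corpus: str):
    """Count, for each word, how often each other word follows it."""
    words = corpus.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word: str) -> str:
    """Guess the statistically most likely next word."""
    followers = counts.get(word.lower())
    if not followers:
        return "<unknown>"
    return followers.most_common(1)[0][0]

# Any human-written text works as training data; the "labels"
# (next words) come for free from the text itself.
corpus = (
    "in five years i will see you in the morning "
    "in five years i will see you again"
)
model = train_bigram(corpus)
print(predict_next(model, "five"))  # -> years
print(predict_next(model, "will"))  # -> see
```

Scaling this idea up — replacing word counts with a neural network, and the toy corpus with millions of web pages — is what lets models like GPT-2 produce fluent text without ever being told the rules of grammar.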
GEITGEY: So give me any headline that you want a fake news story for.

JOSS: Scientists discover Flying Horse.

Adam Geitgey is a software developer who created a fake news website populated entirely with generated text. He used a language model called Grover, which was trained on news articles from 5,000 publications.

"More than 1,000 years ago, archaeologists unearthed a mysterious flying animal in France and hailed it the 'Winged Horse of Afzel' or 'Horse of Wisdom.'"

GEITGEY: This is amazing, right? Like this is crazy.

JOSS: So crazy.

GEITGEY: "The animal, which is the size of a horse, was not easy." If we just Google that, like, there's nothing.

JOSS: It doesn't exist anywhere.

GEITGEY: And I don't want to say this is perfect. But just from a longer-term point of view of what people were really excited about three years ago versus what people can do now, this is just like a huge, huge leap.

If you read closely, you can see that the model is describing a creature that is somehow both "mouse-like" and "the size of a horse." That's because it doesn't actually know what it's talking about. It's simply mimicking the writing style of a news reporter.

These models can be trained to write in the voice of any source, like a Twitter feed: "I'd like to be very clear about one thing. shrek is not based on any actual biblical characters. not even close." Or whole subreddits: "I found a potato on my floor." "A lot of people use the word 'potato' as an insult to imply they are not really a potato, they just 'looked like' one." "I don't mean insult, I mean as in as in the definition of the word potato." "Fair enough. The potato has been used in various ways for a long time."

But we may be entering a time when AI-generated text isn't so funny anymore.
"Islam has taken the place of Communism as the chief enemy of the West."

Researchers have shown that these models can be used to flood government websites with fake public comments about policy proposals, post tons of fake business reviews, argue with people online, and generate extremist and racist posts that can make fringe opinions seem more popular than they really are.

GEITGEY: It's all about taking something you could do and then just increasing the scale of it, making it more scalable and cheaper.

The good news is that some of the developers who built these language models also built ways to detect much of the text generated through their models. But it's not clear who has the responsibility to fact-check the internet. And as bots become even better mimics, with faces like ours, voices like ours, and now our language, those of us made of flesh and blood may find ourselves increasingly burdened with not only detecting what's fake, but also proving that we're real.