  • Let me show you something.

  • To be precise,

  • I'm going to show you nothing.

  • This was the world 540 million years ago.

  • Pure, endless darkness.

  • It wasn't dark due to a lack of light.

  • It was dark because of a lack of sight.

  • Although sunshine did filter 1,000 meters

  • beneath the surface of the ocean,

  • and light permeated from hydrothermal vents to the seafloor,

  • brimming with life,

  • there was not a single eye to be found in these ancient waters.

  • No retinas, no corneas, no lenses.

  • So all this light, all this life went unseen.

  • There was a time that the very idea of seeing didn't exist.

  • It [had] simply never been done before.

  • Until it was.

  • So for reasons we're only beginning to understand,

  • trilobites, the first organisms that could sense light, emerged.

  • They're the first inhabitants of this reality that we take for granted.

  • First to discover that there is something other than oneself.

  • A world of many selves.

  • The ability to see is thought to have ushered in the Cambrian explosion,

  • a period in which a huge variety of animal species

  • entered the fossil record.

  • What began as a passive experience,

  • the simple act of letting light in,

  • soon became far more active.

  • The nervous system began to evolve.

  • Sight turning to insight.

  • Seeing became understanding.

  • Understanding led to actions.

  • And all these gave rise to intelligence.

  • Today, we're no longer satisfied with just nature's gift of visual intelligence.

  • Curiosity urges us to create machines to see just as intelligently as we can,

  • if not better.

  • Nine years ago, on this stage,

  • I delivered an early progress report on computer vision,

  • a subfield of artificial intelligence.

  • Three powerful forces converged for the first time.

  • A family of algorithms called neural networks.

  • Fast, specialized hardware called graphics processing units,

  • or GPUs.

  • And big data.

  • Like the 15 million images that my lab spent years curating called ImageNet.

  • Together, they ushered in the age of modern AI.

  • We've come a long way.

  • Back then, just putting labels on images was a big breakthrough.

  • But the speed and accuracy of these algorithms just improved rapidly.

  • The annual ImageNet challenge, led by my lab,

  • gauged the performance of this progress.

  • And on this plot, you're seeing the annual improvement

  • and milestone models.

  • We went a step further

  • and created algorithms that can segment objects

  • or predict the dynamic relationships among them

  • in these works done by my students and collaborators.

  • And there's more.

  • Recall last time I showed you the first computer-vision algorithm

  • that can describe a photo in human natural language.

  • That was work done with my brilliant former student, Andrej Karpathy.

  • At that time, I pushed my luck and said,

  • "Andrej, can we make computers do the reverse?"

  • And Andrej said, "Ha ha, that's impossible."

  • Well, as you can see from this post,

  • recently the impossible has become possible.

  • That's thanks to a family of diffusion models

  • that powers today's generative AI algorithms,

  • which can take human-prompted sentences

  • and turn them into photos and videos

  • of something that's entirely new.

  • Many of you have seen the recent impressive results of Sora by OpenAI.

  • But even without the enormous number of GPUs,

  • my student and our collaborators

  • have developed a generative video model called Walt

  • months before Sora.

  • And you're seeing some of these results.

  • There is room for improvement.

  • I mean, look at that cat's eye

  • and the way it goes under the wave without ever getting wet.

  • What a cat-astrophe.

  • (Laughter)

  • And if past is prologue,

  • we will learn from these mistakes and create a future we imagine.

  • And in this future,

  • we want AI to do everything it can for us,

  • or to help us.

  • For years I have been saying

  • that taking a picture is not the same as seeing and understanding.

  • Today, I would like to add to that.

  • Simply seeing is not enough.

  • Seeing is for doing and learning.

  • When we act upon this world in 3D space and time,

  • we learn, and we learn to see and do better.

  • Nature has created this virtuous cycle of seeing and doing

  • powered by "spatial intelligence."

  • To illustrate to you what your spatial intelligence is doing constantly,

  • look at this picture.

  • Raise your hand if you feel like you want to do something.

  • (Laughter)

  • In the last split second,

  • your brain looked at the geometry of this glass,

  • its place in 3D space,

  • its relationship with the table, the cat

  • and everything else.

  • And you can predict what's going to happen next.

  • The urge to act is innate to all beings with spatial intelligence,

  • which links perception with action.

  • And if we want to advance AI beyond its current capabilities,

  • we want more than AI that can see and talk.

  • We want AI that can do.

  • Indeed, we're making exciting progress.

  • The recent milestones in spatial intelligence

  • are teaching computers to see, learn, do

  • and learn to see and do better.

  • This is not easy.

  • It took nature millions of years to evolve spatial intelligence,

  • which depends on the eye taking in light,

  • projecting 2D images onto the retina

  • and the brain translating these data into 3D information.

  • Only recently, a group of researchers from Google

  • were able to develop an algorithm to take a bunch of photos

  • and translate that into 3D space,

  • like the examples we're showing here.

  • My student and our collaborators have taken a step further

  • and created an algorithm that takes one input image

  • and turns that into a 3D shape.

  • Here are more examples.

  • Recall, we talked about computer programs that can take a human sentence

  • and turn it into videos.

  • A group of researchers at the University of Michigan

  • have figured out a way to translate that sentence

  • into a 3D room layout, as shown here.

  • And my colleagues at Stanford and their students

  • have developed an algorithm that takes one image

  • and generates infinitely plausible spaces

  • for viewers to explore.

  • These are prototypes of the first budding signs of a future possibility.

  • One in which the human race can take our entire world

  • and translate it into digital forms

  • and model the richness and nuances.

  • What nature did to us implicitly in our individual minds,

  • spatial intelligence technology can hope to do

  • for our collective consciousness.

  • As the progress of spatial intelligence accelerates,

  • a new era in this virtuous cycle is taking place in front of our eyes.

  • This back and forth is catalyzing robotic learning,

  • a key component for any embodied intelligence system

  • that needs to understand and interact with the 3D world.

  • A decade ago,

  • ImageNet from my lab

  • enabled a database of millions of high-quality photos

  • to help train computers to see.

  • Today, we're doing the same with behaviors and actions

  • to train computers and robots how to act in the 3D world.

  • But instead of collecting static images,

  • we develop simulation environments powered by 3D spatial models

  • so that the computers can have infinite varieties of possibilities

  • to learn to act.

  • And you're just seeing a small number of examples

  • to teach our robots

  • in a project led by my lab called Behavior.

  • We're also making exciting progress in robotic language intelligence.

  • Using large language model-based input,

  • my students and our collaborators are among the first teams

  • that can show a robotic arm performing a variety of tasks

  • based on verbal instructions,

  • like opening this drawer or unplugging a charged phone.

  • Or making sandwiches, using bread, lettuce, tomatoes

  • and even putting out a napkin for the user.

  • Typically I would like a little more for my sandwich,

  • but this is a good start.

  • (Laughter)

  • In that primordial ocean, in our ancient times,

  • the ability to see and perceive one's environment

  • kicked off the Cambrian explosion of interactions with other life forms.

  • Today, that light is reaching the digital minds.

  • Spatial intelligence is allowing machines

  • to interact not only with one another,

  • but with humans, and with 3D worlds,

  • real or virtual.

  • And as that future is taking shape,

  • it will have a profound impact on many lives.

  • Let's take health care as an example.

  • For the past decade,

  • my lab has been taking some of the first steps

  • in applying AI to tackle challenges that impact patient outcomes

  • and medical staff burnout.

  • Together with our collaborators from Stanford School of Medicine

  • and partnering hospitals,

  • we're piloting smart sensors

  • that can detect clinicians going into patient rooms

  • without properly washing their hands.

  • Or keep track of surgical instruments.

  • Or alert care teams when a patient is at physical risk,

  • such as falling.

  • We consider these techniques a form of ambient intelligence,

  • like extra pairs of eyes that do make a difference.

  • But I would like more interactive help for our patients, clinicians

  • and caretakers, who desperately also need an extra pair of hands.

  • Imagine an autonomous robot transporting medical supplies

  • while caretakers focus on our patients

  • or augmented reality, guiding surgeons to do safer, faster

  • and less invasive operations.

  • Or imagine patients with severe paralysis controlling robots with their thoughts.

  • That's right, brainwaves, to perform everyday tasks

  • that you and I take for granted.

  • You're seeing a glimpse of that future in this pilot study from my lab recently.

  • In this video, the robotic arm is cooking a Japanese sukiyaki meal

  • controlled only by the brain's electrical signals,

  • non-invasively collected through an EEG cap.

  • (Applause)

  • Thank you.

  • The emergence of vision half a billion years ago

  • turned a world of darkness upside down.

  • It set off the most profound evolutionary process:

  • the development of intelligence in the animal world.

  • AI's breathtaking progress in the last decade is just as astounding.

  • But I believe the full potential of this digital Cambrian explosion

  • won't be fully realized until we power our computers and robots

  • with spatial intelligence,

  • just like what nature did to all of us.

  • It's an exciting time to teach our digital companion

  • to learn to reason

  • and to interact with this beautiful 3D space we call home,

  • and also create many more new worlds that we can all explore.

  • To realize this future won't be easy.

  • It requires all of us to take thoughtful steps

  • and develop technologies that always put humans in the center.

  • But if we do this right,

  • the computers and robots powered by spatial intelligence

  • will not only be useful tools

  • but also trusted partners

  • to enhance and augment our productivity and humanity

  • while respecting our individual dignity

  • and lifting our collective prosperity.

  • What excites me the most in the future

  • is a future in which AI grows more perceptive,

  • insightful and spatially aware,

  • and joins us on our quest

  • to always pursue a better way to make a better world.

  • Thank you.

  • (Applause)

