[MUSIC PLAYING]
THORSTEN KURTH: Hello, and thank you, everybody, for attending
the afternoon sessions.
My name is Thorsten Kurth.
And I'm an application performance specialist
at NERSC.
And my day-to-day work is helping
scientists optimize their codes for contemporary supercomputer
systems.
Today I'm going to talk about a project I care about
because it combines three different things I'm
excited about.
The first is big computers, so exascale computing.
The second is deep learning.
And the third is climate change, because it
will affect every one of us sooner or later.
So this is a team effort.
And I want to thank, at this point, everybody
in this collaborative effort between NERSC, Nvidia, UC
Berkeley, and Oak Ridge for making this a success.
So thank you at this point.
So I want to talk about extreme weather phenomena.
So why are they important?
They're important because they can
incur a lot of damage, loss of life,
and these kinds of things.
For example, in 2017, the damage to the US economy
was about $200 billion from the combined extreme weather
events.
So these can be hurricanes, or tropical cyclones,
and, for example, atmospheric rivers
because they can cause heavy flooding and major disruption.
So we want to understand these events better.
But what does a typical climate data analysis look like? For example,
you have these simulations, which
look into the future up to 100 years.
You run different models, and this is what you get.
So on your left, you see the output of the simulations.
And they basically contain 14 million observables
for a three-hour interval.
And then you have like 100 years worth of that.
And what people usually do, when you look at the IPCC report,
for example, or in popular magazines,
is boil it down to a couple of numbers.
For example, temperature rise, sea level rise,
these kinds of things.
However, if the temperature increases by one degree
or two, that matters.
But it might not matter to you if you
live in the middle of the Sahara, right?
It might matter to you, though, if you
are in other regions of the globe, and the same goes for sea level
rise.
So the thing is now, what you want
to do is you want to have a geospatial analysis of climate
change.
So how does climate change impact your life
where you live?
So we want to answer things like,
will there be more hurricanes, for example?
And if yes, will they be more intense?
Will they make more landfalls?
If they stay over the sea, it's usually not
as bad as when they hit the coastline.
And for atmospheric rivers, for example,
50% of all rain in California is due to atmospheric rivers.
So it's an important question to ask
if we will get more water, like more rain, due to this.
And think, for example, about forest fires,
like the Camp Fire last year; in the Bay Area
we had a hard time breathing for two weeks.
It's really a question whether you get more or fewer of these.
And this is really dependent on these atmospheric rivers,
for example.
So the insurance industry, water planners, for example,
a lot of different people need to know
what they need to prepare for.
So how can we do this?
So we have these high-fidelity climate simulations.
And what we can start with, for example,
is picking out these events,
for example, hurricanes and atmospheric rivers.
Let's start with these.
And image segmentation techniques
can offer pixel-level resolution.
So they can do a per-pixel classification
to pick these events out and then correlate them
geospatially with the underlying region, for example.
And deep learning, as you know, is very successful here
because, for example, the whole autonomous driving industry
is doing that day in, day out.
And there's a lot of research going on in this direction.
So the data set we have is 20 terabytes.
We have about 400 terabytes in storage,
but for this work we use 20 terabytes of it.
And what I call an image here is really more like a tensor.
It's a three-dimensional tensor of size 1152 by 768 by 16.
And the channels are not RGB.
They represent observables like wind speed,
temperature, and pressure at different altitudes,
these kinds of things.
So they're general observables.
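As a mental model, here is a minimal sketch of what one such sample looks like; the axis ordering and the random placeholder data are assumptions for illustration, not the actual file layout used in this work.

```python
import numpy as np

# One "image" is a 3D tensor of 1152 x 768 grid points with 16 channels,
# where each channel is a physical observable (wind speed, temperature,
# pressure at different altitudes, ...), not an RGB color.
# The axis ordering and the random placeholder data are assumptions.
sample = np.random.rand(1152, 768, 16).astype(np.float32)

print(sample.shape)  # (1152, 768, 16)
```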
We have three classes:
background, where nothing interesting is going on,
then the tropical cyclones, or hurricanes,
and the atmospheric rivers.
Fortunately, these events are still rare in the future.
So 95% of the pixels are background,
which is good for us.
But it's harder to train a model on that
because of this high imbalance.
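One common way to deal with that kind of imbalance, shown here as a sketch rather than the exact scheme used in this work, is a per-class weighted cross-entropy: the abundant background pixels get a small weight, the rare event pixels get larger ones. The weight values are assumed for illustration.

```python
import tensorflow as tf

# Assumed class weights: background (~95% of pixels) is down-weighted,
# the rare tropical-cyclone and atmospheric-river classes are not.
CLASS_WEIGHTS = tf.constant([0.05, 1.0, 1.0])

def weighted_ce_loss(labels, logits):
    """labels: [batch, H, W] integer class IDs; logits: [batch, H, W, 3]."""
    per_pixel = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    weights = tf.gather(CLASS_WEIGHTS, labels)  # weight of each pixel's class
    return tf.reduce_sum(per_pixel * weights) / tf.reduce_sum(weights)
```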
And another thing which makes this different from classical,
let's say, street-scene segmentation
is, first, that there's a lot of stuff going on in the background.
It's not static or slow moving.
And also the objects themselves change rapidly
in size and shape, right?
So even when you look at this image, this satellite image
of the hurricane, even as an expert, you don't actually know
where you would say this hurricane starts
or ends, right?
So the labels are pretty fuzzy.
So talking about that, how did we get those?
Of course, the best would be using human annotated labels.
But for that data, we didn't have that at the time.
We are currently working on that, though.
So for this effort, we used some algorithmic labeling,
which is an old-school approach in the sense
that it's basically based on feature engineering
together with some thresholding to get the binary masks.
One can say, OK, why don't you do the predictions
with these algorithms, then?
Because these algorithms have a lot of shortcomings.
They are region dependent.
And for different thresholds you get vastly different labels.
However, they're still good enough
to train a network with.
And it can pick up better features,
as I will show you later.
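To make the idea of algorithmic labeling concrete, here is a deliberately toy sketch; the input fields and thresholds are hypothetical placeholders, and the real heuristics combine several engineered features.

```python
import numpy as np

def threshold_labels(integrated_vapor, vorticity, ar_threshold, tc_threshold):
    """Toy sketch of algorithmic labeling by thresholding engineered fields.

    The input fields and thresholds here are hypothetical placeholders;
    the real heuristics are more involved, region dependent, and very
    sensitive to the chosen thresholds, which is the weakness noted above.
    """
    labels = np.zeros(integrated_vapor.shape, dtype=np.int32)  # 0 = background
    labels[vorticity > tc_threshold] = 1                       # 1 = tropical cyclone
    labels[integrated_vapor > ar_threshold] = 2                # 2 = atmospheric river
    return labels
```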
So for the image segmentation architecture,
we picked a DeepLab version 3+ variant.
It was developed by Google.
And basically, as all these segmentation networks,
it has an encoder, which extracts the features,
a decoder part, which then makes the predictions,
and skip connections in order
to feed the features at different levels
from the encoder stage into the decoder
to improve the prediction quality.
So the original DeepLab had a bilinear interpolation
as a decoder.
And we replaced this with a fully deconvolutional decoder.
I think the original choice was made for training reasons,
because it's easier to train the bilinear interpolator
since it doesn't have a lot of weights.
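For illustration, a learned deconvolutional decoder can look roughly like the following Keras sketch; the filter counts and strides are illustrative, not the exact configuration of the modified DeepLabv3+ used in this work.

```python
import tensorflow as tf

def deconv_decoder(encoder_features, num_classes=3):
    """Sketch of a learned decoder: transposed convolutions instead of
    fixed bilinear upsampling. Filter counts and strides are illustrative,
    not the exact configuration of the modified DeepLabv3+."""
    x = tf.keras.layers.Conv2DTranspose(256, 4, strides=2, padding="same",
                                        activation="relu")(encoder_features)
    x = tf.keras.layers.Conv2DTranspose(128, 4, strides=2, padding="same",
                                        activation="relu")(x)
    # A final 1x1 convolution produces the per-pixel class logits.
    return tf.keras.layers.Conv2D(num_classes, 1, padding="same")(x)
```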
So our model has 44.7 million parameters.
And the training cost for a single step
on a single sample,
so forward and backward, is 14.4 teraflop,
which is 14.4 times 10 to the 12 floating point operations.
And on a modern GPU, like this Nvidia V100,
you can only fit a batch of two samples in half precision,
or one in single precision, on the GPU.
So what you need to do is you need to train it in parallel.
And we took a purely data-parallel approach here.
So we used Horovod for this.
Horovod is basically a framework
which hooks into the TensorFlow graph in a synchronous fashion
and reduces tensors across all the workers
as they are ready to be reduced.
It does this using MPI.
So under the hood it calls MPI functions.
MPI is the Message Passing Interface.
It's a very common framework for exchanging messages
between different processes in a distributed-memory
system such as an HPC system.
The good thing is that since a lot of people in HPC use it,
it's usually very highly optimized
for these supercomputers.
You're still, of course, responsible for sharding
your data set, distributing the data,
and all these kinds of things.
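A minimal TF2-style Horovod sketch of that pattern looks roughly like this; the original work used the TF1 graph API, so treat this as an illustration of the idea, with the model, loss, and learning rate as placeholders.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one worker (MPI rank) per GPU

# Pin each worker to its own GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model and learning rate; scale the LR with the worker count.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(3, 1, padding="same"),
])
opt = tf.optimizers.SGD(0.01 * hvd.size())

@tf.function
def train_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
    # The DistributedGradientTape allreduces the gradients across workers.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:  # make sure every worker starts from identical weights
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss
```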
So we ran on the Summit supercomputer system.
This is the number one supercomputer in the world;
there's this top 500 list, which is updated twice a year.
It's the system at Oak Ridge National Laboratory.
It consists of 4,600 nodes.
Each node has two IBM POWER9 CPUs and six Nvidia V100
GPUs with Tensor Cores.
They are connected using this high-speed NVLink interconnect,
which is very nice,
so we can do allreduces within the node
very efficiently.
And it also features 800 gigabytes
of nonvolatile memory per node, which is quite cool, because you
can stage part of your data set into that
and read it at almost DRAM speed.
So it's almost as fast as reading it from main memory,
but it's much bigger.
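As a sketch of that staging step, assuming hypothetical paths for the parallel file system and the node-local NVMe mount, it can be as simple as copying each worker's shard of files before training starts.

```python
import os
import shutil

# Hypothetical paths: the parallel file system location of the data and
# the node-local NVMe mount point. Copying each worker's shard onto the
# node-local storage before training lets the input pipeline read at
# near-DRAM speed instead of hammering the shared file system.
PARALLEL_FS = "/gpfs/alpine/proj/climate_data"  # assumed
NODE_LOCAL = "/mnt/bb/scratch"                  # assumed NVMe mount

def stage_shard(filenames, rank, num_ranks):
    os.makedirs(NODE_LOCAL, exist_ok=True)
    staged = []
    for name in filenames[rank::num_ranks]:     # simple round-robin sharding
        dst = os.path.join(NODE_LOCAL, os.path.basename(name))
        shutil.copy(os.path.join(PARALLEL_FS, name), dst)
        staged.append(dst)
    return staged
```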
So the network is pretty fast and low latency.
And what I want to point out here,
though, is that we talk a lot about exascale computing,
so a capability of 10 to the 18 floating point operations
per second in double precision.
This is the next generation of systems
we want to develop and deploy.
But really look at it:
if you can stick with half precision,
so if you basically have an application which
can utilize half precision for most
of the computations, you have an exascale system available
right now.
It's there.
It's in Oak Ridge.
You can just go and use it.
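For context, opting a model into half precision in today's TensorFlow is essentially a one-liner plus loss scaling. This is the generic TF 2.x Keras API, shown as a hedged sketch; it is not what the original TF 1.x implementation described in this talk used.

```python
import tensorflow as tf

# Compute in float16 on the Tensor Cores, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(1152, 768, 16)),
    # Keep the final logits in float32 for numerical stability.
    tf.keras.layers.Conv2D(3, 1, padding="same", dtype="float32"),
])

# Loss scaling keeps small FP16 gradients from underflowing.
opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD(0.01))
```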
So there are some performance optimizations necessary,
of course.
So when you think about deep learning,
you have to optimize the whole pipeline, right?
Starting from the data:
where do you read it from?
Where do you stage it?
Then, how do you feed it efficiently to the accelerators,
right?
The accelerators are so fast that you
need to feed them efficiently so they don't
stall waiting for data.
For the computational part, you want
to minimize data reorganization, for example.
And the reductions also need to be very efficient, right?
Because you want to reduce the gradients at a very, very
high frequency.
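A minimal sketch of such an input pipeline with tf.data is shown below; the TFRecord layout and the parser are assumptions for illustration, and the actual data format used in this work may differ.

```python
import tensorflow as tf

def parse_example(serialized):
    # Placeholder parser: decode one (1152, 768, 16) sample and its
    # (1152, 768) integer label mask from a stored record.
    features = tf.io.parse_single_example(serialized, {
        "sample": tf.io.FixedLenFeature([], tf.string),
        "labels": tf.io.FixedLenFeature([], tf.string),
    })
    sample = tf.reshape(tf.io.decode_raw(features["sample"], tf.float32),
                        (1152, 768, 16))
    labels = tf.reshape(tf.io.decode_raw(features["labels"], tf.int32),
                        (1152, 768))
    return sample, labels

def make_input_pipeline(filenames, batch_size):
    ds = tf.data.TFRecordDataset(filenames,
                                 num_parallel_reads=tf.data.AUTOTUNE)
    ds = ds.shuffle(1024)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)  # overlap input prep with compute
```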
One thing we also used was some overlapping,
or gradient pipelining, or call it an asynchronous
approach, where you do not compute the fresh gradients,
reduce them, and then integrate them.
Instead, on the GPU,
you compute fresh gradients.
And then, on the CPU, you read the gradients
from the last step from a buffer,
reduce those asynchronously to the computation
of the new gradients,
and integrate them into the model.
So by that, you can overlap these two steps very nicely.
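Here is a toy, self-contained sketch of that lagged scheme using mpi4py's non-blocking allreduce; it is a conceptual outline, not the production Horovod/TensorFlow implementation, and the gradient computation and update are stand-ins.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def compute_gradients(step):
    # Stand-in for the forward/backward pass that runs on the GPU.
    return np.random.rand(1000)

def apply_gradients(avg_grads):
    # Stand-in for integrating the averaged gradients into the model.
    pass

prev_send, prev_recv, prev_req = None, None, None

for step in range(100):
    new_grads = compute_gradients(step)        # overlaps with the allreduce below

    if prev_req is not None:
        prev_req.Wait()                        # finish reducing last step's gradients
        apply_gradients(prev_recv / comm.Get_size())

    prev_send = new_grads                      # keep the send buffer alive
    prev_recv = np.empty_like(new_grads)
    prev_req = comm.Iallreduce(prev_send, prev_recv, op=MPI.SUM)

if prev_req is not None:                       # drain the final outstanding reduction
    prev_req.Wait()
    apply_gradients(prev_recv / comm.Get_size())
```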
So this is a plot of the performance we got.
You see the throughput metric of images per second,
or call it samples per second, versus the number of GPUs;
if you divide that by 6, you get the number of nodes.
And the other y-axis is basically
a translation of this image throughput metric
into a more HPC metric of petaflops,
so 10 to the 15 floating point operations per second.
So what you see is the FP32,
the single precision points, in blue.
I don't want to talk about these.
What you can see is that the FP16, so the half precision,
performs much, much better, right?
The Tensor Cores can, in theory,
deliver 125 teraflops per card.
And that is why you see this vast performance difference.
The dashed line represents the ideal case.
In the ideal case, where you don't
have any loss due to communication,
you would basically be on this line.
So we are a bit below that with the solid red line, but not
far off.
I think it's 70-something percent, 79%
scaling efficiency.
And what you also see is that for the lagged version,
where you can basically overlap
the computation with the communication very nicely,
it's very crucial to do this here,
because the GPUs are so fast that they would otherwise
really need to wait for the allreduce.
And after we saw this, we thought, OK, we
can go to a couple more nodes.
But we might still not hit the exaflop mark,
which is 1,000 petaflops.
So we restructured the decoder a little bit,
not in terms of its predictive power,
but we removed some additional data transpositions.
And we ran it on a couple more nodes
and actually got there.
So the performance number we got at that scale
was 1.13 exaflops in FP16,
so half precision, on 27,360 GPUs.
And that is, so far, the biggest deep learning calculation
I'm aware of.
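To put those numbers in context, here is a quick back-of-the-envelope calculation using only the figures quoted in this talk.

```python
# All inputs below are figures quoted in this talk.
sustained_flops = 1.13e18        # 1.13 exaflops sustained in FP16
num_gpus        = 27360
flop_per_sample = 14.4e12        # forward + backward for one sample

per_gpu = sustained_flops / num_gpus          # ~4.1e13, i.e. ~41 TFLOP/s per GPU
tensor_core_fraction = per_gpu / 125e12       # ~0.33 of the 125 TFLOP/s peak per card
samples_per_second = sustained_flops / flop_per_sample  # ~78,000 samples/s

print(per_gpu, tensor_core_fraction, samples_per_second)
```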
So this is the training loss.
This is at a slightly smaller scale;
we don't have this full history for the big scale.
However, the case I want to make here
is that the lagged version,
although it's partially asynchronous,
is asynchronous in a predictable way.
The network at the beginning is a bit unstable,
so basically the training loss grows
and oscillates heavily.
But then, when you just wait long enough,
it will outperform the unlagged version.
That, of course, is not true for
every arbitrary deep learning network.
But for us, it's definitely true.
And I think it's definitely worth
a try if you have a problem like that.
So talking about the results, I have a video for this.
So on the left-hand side, you see the weather
patterns predicted by the model.
On the right-hand side, you see the ground truth.
So I have three things to say.
So first, there's some qualitative agreement and also
quantitative agreement, which is satisfactory.
What you also see is that there are more
predicted events than there actually are in the labels.
And that is mainly because the aggressive thresholding
sometimes forgets to label things.
So when you show some of these samples,
where we overpredict atmospheric rivers, for example,
to experts, they say, yes,
actually, the model picked up an atmospheric river which was not
present in the ground truth.
And then you can also see that in the ground truth,
the video is flickering.
This is because
there's a frame before and after where it, for example,
picked up an atmospheric river, but a frame
in between where it did not.
But of course, it should be continuous.
It should not be like this.
So the model actually predicts something
which is much more continuous and much more smooth,
even though it did not take
the temporal dependence into account.
So that is quite interesting.
So my conclusions are--
so TensorFlow is one of the first applications
which reached exascale performance, although only
in FP16.
But still it's remarkable.
And I think this is a community achievement.
And HPC systems are suitable for these workloads.
Of course, there are some insufficiencies,
for example, the file system.
So we needed this large node-local storage in order
to feed the data efficiently.
If you try to read from a distributed file system,
it's very bad, because HPC file systems are optimized
for writing large chunks of data, not for doing random reads, OK?
So if you want to design an HPC system in the future which
is very suitable for deep learning,
you need to take this into account.
So this is also very important.
And also, we want to talk to storage people
to help us develop better distributed storage which
can cope with these workflows better.
This work was awarded the ACM Gordon Bell
Prize at the last Supercomputing conference.
This prize is usually awarded for an interesting and challenging
science problem for which you need massive amounts of compute
to solve it,
and where you can show that you actually
used this massive amount of compute
efficiently to solve it.
So this is the paper link.
Thank you very much for your attention.
[MUSIC PLAYING]