WAHID BHIMJI: OK, so I'm Wahid.
I'm not actually part of Google.
I'm at Lawrence Berkeley National Lab.
And I'm sort of going to tell you
how we're using TensorFlow to perform
deep learning for the fundamental sciences
and also using high-performance computing.
OK, so the fundamental sciences--
particle physics, cosmology, and lots
of other things I'll explain in a minute--
make heavy use of high-performance computing
at the center I work at, which is
part of the Department of Energy--
traditionally for simulation and data analysis.
But progress in deep learning and tools like TensorFlow
has really enabled the use of higher-dimensional data,
and opened up the possibility
of new discoveries, faster computation, and actually
whole new approaches.
And I'm going to talk about that here,
illustrating with a few examples of stuff
we're running at NERSC.
So what is NERSC?
It's the high-performance computing center
for the Department of Energy Office of Science,
which means we support the whole breadth of science
that the DoE does, which actually includes
not just cosmology or what you might think of as energy
research, like batteries and so forth, but also
materials and climate and genomics and things like this.
So we have a huge range of users and a vast range
of projects across a whole variety of science.
And we have big machines.
Cori, our latest machine, was number five and, in fact,
the highest in petaflops in the US when it was installed
a couple of years ago.
But newer machines have pushed it down the rankings,
and now it's number 10 in the Top500.
OK, so we see the use of AI now across the whole science
domain.
So you probably can't see this slide very well,
but this is a take on an industry map of machine
learning, sort of splitting it into supervised learning
and unsupervised learning, and classification and regression,
and so forth.
And there's kind of examples here
that we see across the sciences.
But I'll be mostly talking about particle physics and cosmology
because actually that's my background.
So I'm more comfortable with those examples.
So what we're trying to do in these fields
is really uncover the secrets of the universe.
So this is a sort of evolution from the Big
Bang to the present day.
And there's planets and galaxies and stuff.
And they've all been influenced by, for example,
dark matter and dark energy over this evolution.
So obviously our understanding of this
has come a long way in recent years.
But there's still plenty of mysteries and things
that we don't know about, like the very things I just
mentioned, like dark matter.
What is the nature of dark matter?
And what is the relationship between particle physics,
which explains the very small-- and even
everything around us-- extremely well, and yet breaks down
at cosmological scales?
So in order to answer those kind of questions,
we have huge complex instruments,
such as the planned LSST telescope on the left,
which is going to look at the very big
at unprecedented resolution.
And on the right, the ATLAS detector
at the Large Hadron Collider,
on the Swiss-French border-- it's
a detector the size of a building.
There's little people there.
You can sort of see them.
And it has hundreds of millions of channels of electronics
to record collisions that occur in the middle every 25
nanoseconds, so a huge stream of data.
So both of these experiments have vast streams of data.
Really, the ATLAS experiment has processed exabytes
of data over its lifetime.
And this has to be filtered through a process of data
analysis and so forth.
And so if you get like high-resolution images or very
detailed detector outputs, the first stage
is to kind of simplify these, maybe
build catalogs of objects in the sky,
such as stars and galaxies, or, in the case of particle physics,
to combine these into the particles that might have been
produced-- these lines, or tracks, and the deposits that
occurred in the detector.
So this obviously also involves a large amount of computing.
So computing fits in here.
But computing also fits in because the way
that these analyses are done is to compare with simulated data,
in this case cosmology simulations that
are big HPC simulations done for different types of universes
that might have existed depending
on different cosmology parameters.
And in the particle physics case,
you do extremely detailed simulations
because of the precision you require
in terms of how the detector would
have reacted to, for example, a particle coming in here.
So here you've got all this kind of showering
of what would have happened inside the detector.
And then from each of these, you might
produce summary statistics, compare one to the other,
and extract the secrets of the universe,
I guess, such as the nature of dark matter,
or new particles at the LHC.
OK, so you might have guessed there's
many areas where deep learning can help with this.
So one is classification-- for example,
to find those physics objects that I showed being identified
in collisions at the LHC.
Or, indeed, just to directly find
from the raw data what was interesting and what
was not interesting.
Another way you might use it is regression to, for example,
find what kind of energies were deposited inside the detector
or what were the physics parameters that
were responsible for stuff that you saw in the images
at the telescopes.
Another is sort of clustering feature detection
in a more unsupervised way, where
you might want to look for anomalies in the data,
either because these are signs of new physics
or because they're actually problems with the instruments.
And then another last way, perhaps,
is to generate data to replace the full physics simulations
that I just described, which are, as I mentioned,
extremely computationally expensive.
So I'll give a few examples across these domains, OK?
So the first is classification.
So there you're trying to answer the question,
is this new physics?
For example, supersymmetry, which
is also a dark matter candidate.
I can give you the answer later
as to whether that's actually new physics or not.
So here the idea-- and there are several papers now
exploiting this kind of idea-- is to take the detector,
like the ATLAS detector, which is a cylindrical detector
with many cylindrical layers,
and unroll it into an image, where
you have phi along here, which is
the angle around this direction, and eta this way, which
is a physicist's way of describing
the forwardness of the deposit.
And then just simply put it in an image,
and then you get the ability to use
all of the kind of developments in image recognition,
such as convolutional neural networks.
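As an illustration of that unrolling step-- a minimal sketch, not the actual analysis code, with hypothetical deposit arrays and an illustrative eta range:

```python
# Hedged sketch: bin hypothetical energy deposits into an eta-phi image.
import numpy as np

def to_image(eta, phi, energy, bins=64):
    """Histogram energy deposits into a bins x bins eta-phi image."""
    img, _, _ = np.histogram2d(
        eta, phi, bins=bins,
        range=[[-2.5, 2.5], [-np.pi, np.pi]],  # illustrative eta coverage
        weights=energy)
    return img

# Example: 1,000 random deposits -> one 64x64 single-channel image.
rng = np.random.default_rng(0)
image = to_image(rng.uniform(-2.5, 2.5, 1000),
                 rng.uniform(-np.pi, np.pi, 1000),
                 rng.exponential(1.0, 1000))
```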
So here's what is now a relatively simple sort
of convolutional network, several convolution and pooling
layers, and then fully connected layers.
And we've exploited this at different scales,
either taking 64 by 64 images to represent
this, or 224 by 224, sort of closer
to the resolution of the detector.
And then you can also have multiple channels
which correspond to different layers
of this cylindrical detector.
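A hedged sketch of that kind of network in TensorFlow/Keras-- several convolution and pooling layers followed by fully connected layers, on 64 by 64 images with three detector-layer channels. The filter counts here are illustrative, not the published architecture:

```python
import tensorflow as tf

# Simple convolutional classifier: signal vs. known physics background.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same',
                           input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # two-class probability
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=[tf.keras.metrics.AUC()])
```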
And we saw that it kind of works.
So here you have a ROC curve of the two-class probability--
true positive rate against false positive rate.
So the first thing to notice is that you
need a very, very high rejection of the physics
that you already know about because that vastly dominates.
So actually, the data's even been prefiltered before this.
So you need a low false positive rate.
And generally, of course, in these ROC curves,
higher this way is better.
So this point represents the physics selections
that were traditionally used in this analysis.
These curves here incorporate those higher-level
physics variables, but in shallow machine
learning approaches.
And then the blue line shows that you
can gain quite a lot from using
convolutional neural networks.
It's not just the technique of deep learning
that gives you this,
but also being able to use all the data that's available.
And then, similarly, there's another boost
from using three channels, which correspond
to the other detector layers.
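For completeness, here's a minimal, self-contained sketch of producing such a ROC curve with scikit-learn; the labels and scores below are random stand-ins for real network outputs:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and network scores for a held-out test set.
y_test = np.random.randint(0, 2, 10000)
scores = np.clip(y_test * 0.3 + np.random.rand(10000), 0, 1)

fpr, tpr, _ = roc_curve(y_test, scores)  # false/true positive rates
print("AUC:", auc(fpr, tpr))
# The interesting working points sit at very low false positive rate,
# since known physics vastly dominates the event stream.
```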
OK, so I'd just like to caveat a few of these results.
So even though there's obvious gains to be had here,
these analyses are still not currently used
in the Large Hadron Collider analyses.
And part of the reason for that is
that this is trained on simulation,
and it might pick up on very small aspects of the simulation
that differ from the real data.
So there's work carrying on
from this to look at how to incorporate
real data into the training.
But also, I think methodological developments
in how to interrogate these models,
and really discover what they're learning
would be kind of useful for the field.
OK, so taking a regression problem now,
the kind of question you might ask
is, what possible universe would look like this?
So this is an image from the Sloan Digital Sky Survey.
So here the Earth's in the middle,
and increasing redshift takes you out this way.
So this is like the older universe over here.
And you can kind of see it in the histogram
of galaxy density-- you can see
that structure sort of appears as the universe evolves.
And that kind of evolution of structure
tells you something about the cosmology parameters
that were involved.
So can you actually regress those parameters
from looking at these kind of distributions?
So the idea here was to take a method that
was developed by these people at CMU, which is to use, again,
a convolutional neural network.
But here, running on a 3D--
this is actually simulated data, so 3D distribution
of dark matter in the universe.
And these can be large datasets.
So part of the work we did here was to scale this up
and run on extremely large data across the kind of machines
we have-- Cori, on 8,000 CPU nodes--
which is some of the largest scale
that TensorFlow has been run at in a data-parallel fashion.
And then we were able to predict the cosmology parameters that
went into this simulation in a matter of minutes because
of the scale of computation used.
So, again, it's a fairly standard network,
but this time in 3D.
And actually, there was quite a bit
of work to get this to work well on CPU--
sorry, I should have mentioned, actually,
that the machine that we have is primarily composed of Intel
Knights Landing CPUs.
So that's another area that we help develop.
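A hedged sketch of a 3D convolutional regression network in this spirit-- Conv3D layers over a voxelized dark-matter distribution, regressing cosmology parameters. The cube size, layer widths, and choice of three outputs are illustrative, not the exact CosmoFlow configuration:

```python
import tensorflow as tf

# 3D CNN regressor over a 64^3 voxel grid of dark-matter density.
model = tf.keras.Sequential([
    tf.keras.layers.Conv3D(16, 3, activation='relu',
                           input_shape=(64, 64, 64, 1)),
    tf.keras.layers.MaxPooling3D(),
    tf.keras.layers.Conv3D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling3D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(3),  # e.g. three cosmology parameters (illustrative)
])
model.compile(optimizer='adam', loss='mse')  # regression, not classification
```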
OK, so again, it kind of works.
And so here they like to plot the true value
of the parameter that's being regressed
against the predicted value.
So you hope that it would match up along the line.
And, indeed, the points which come
from a run of this regression do actually
lie across this line in all the cases.
Now, the actual crosses come from the larger-scale run.
So the points come from a 2,000 node run of the network,
and the crosses come from an 8,000 node run.
So the sort of caveat here is that the 8,000 node run doesn't
actually do as well as the 2,000 node run, which
points to another thing we're working on--
the convergence of these neural networks
when running at large distributed scale.
So I'll come back to that a bit later as well.
OK, so then the last example I have here is around generation.
So, as I mentioned, simulations went
into that previous analysis, and they
go into all of the kinds of analyses
that are done for these purposes.
But in order to get some of these simulations,
it actually takes two weeks of computational time
on a large supercomputer of the scale of Cori,
which is kind of a huge amount of resource
to be giving over to these.
And for different versions of the cosmology
that you want to generate, you need
to do more of these simulations.
And part of the output that you might get from these
is a 2D mass map like this, which corresponds
to the actual galaxy distribution
that you might observe from data.
So this is what you would want to compare with data.
So the question is, is it possible to augment your data
set and generate different cosmologies
in a kind of fast simulation that doesn't require running
this full infrastructure?
And the way that we tried to do that
was to use generative adversarial networks, which
were mentioned briefly in the last talk, where you have
two networks that work in tandem
and are optimized together, but against each other,
with one trying to tell the difference between real
and generated examples.
So the advantage here is that we have some real examples
from the full simulation to compare with.
And the discriminator can take those and say
whether the generator does a good job of producing fake maps.
And we used a pretty standard DCGAN architecture.
And there were a few modifications in this paper
to make that work.
But at the time, last year, there weren't many people
trying to apply this.
One of the other applications being done at the time
was actually to a particle physics
problem, by other people at Berkeley Lab.
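A hedged sketch of a DCGAN-style generator and discriminator for single-channel maps, in the spirit of what's described here; the sizes and filter counts are illustrative, not the published CosmoGAN architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_generator(latent_dim=64):
    # Upsample a latent vector to a 64x64 single-channel "mass map".
    return tf.keras.Sequential([
        layers.Dense(8 * 8 * 256, input_shape=(latent_dim,)),
        layers.Reshape((8, 8, 256)),
        layers.Conv2DTranspose(128, 5, strides=2, padding='same',
                               activation='relu'),
        layers.Conv2DTranspose(64, 5, strides=2, padding='same',
                               activation='relu'),
        layers.Conv2DTranspose(1, 5, strides=2, padding='same',
                               activation='tanh'),  # generated map
    ])

def make_discriminator():
    # Convolutional classifier: real simulation output vs. generated map.
    return tf.keras.Sequential([
        layers.Conv2D(64, 5, strides=2, padding='same',
                      input_shape=(64, 64, 1)),
        layers.LeakyReLU(0.2),
        layers.Conv2D(128, 5, strides=2, padding='same'),
        layers.LeakyReLU(0.2),
        layers.Flatten(),
        layers.Dense(1),  # real-vs-fake logit
    ])
```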
And, again, it works.
And so the top plot is a validation set
of images that weren't used in the training.
And the bottom is the generated images.
I mean, the first time cosmologists
saw this, they were pretty surprised
that they couldn't tell the difference between them,
because they weren't really expecting
us to do so well that they wouldn't
be able to tell by eye.
But you certainly can't tell by eye.
But one of the advantages of working with this in science,
as opposed to celebrity faces, is
that we do actually have good metrics for determining
whether we've done well enough.
So the top right is the power spectrum,
which is something often used by cosmologists,
but it's just the Fourier transform
of a two-point correlation.
So it sort of represents the Gaussian fluctuations
in the plot.
And here you can see the black is the validation,
and the pink is the GAN.
And so it not only agrees on the mean, this kind of middle line,
but it also captures the distribution of this variable.
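The power-spectrum check itself is simple to sketch: FFT the map, square the amplitudes, and average in annuli of constant wavenumber. A minimal version, assuming a square 2D NumPy map:

```python
import numpy as np

def power_spectrum(field):
    """Radially averaged power spectrum of a square 2D map."""
    n = field.shape[0]
    fk = np.fft.fftshift(np.fft.fft2(field))
    power = np.abs(fk) ** 2
    ky, kx = np.indices(field.shape) - n // 2
    k = np.hypot(kx, ky).astype(int).ravel()  # integer radial wavenumber bins
    return np.bincount(k, weights=power.ravel()) / np.bincount(k)

# Example: compare spectra of a validation map and a generated map.
ps = power_spectrum(np.random.randn(64, 64))  # stand-in for a real map
```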
And it's not just two-point correlations.
But the plot on the right shows something
called the Minkowski functional, which
is akin to a three-point correlation.
So even non-Gaussian structures in these maps are reproduced.
And the important point, I guess,
is that you could just sample from these distributions
and reproduce those statistics well.
But all this was trained to do was reproduce the images.
And it got these kinds of structures-- the physics that's
important-- right.
OK, so this is very promising.
But obviously, the holy grail of this
is really to be able to do this for different values
of the initial cosmology, sort of parameterized generation,
and that's what we're working towards now.
OK, so I mentioned throughout that we
have these big computers.
So another part of what we're trying to do
is use extreme computing scales in a data-parallel way
to train these things faster--
not just to train different hyperparameters
on different nodes of our computer, which, of course,
people also do, but to train one model quicker.
And we've done that for all of these examples
here so that the one on the left is the LHC
CNN, which I described first.
So I should say also that we have a large Cray
machine that has an optimized high-performance network.
But that network has been particularly optimized
for HPC simulations and MPI-based applications.
So really it's been a huge gain for this kind of work
that these plug-ins now exist that allow
you to do MPI-distributed training, such as Horovod
from Uber.
And also Cray have developed a machine-learning plug-in here.
So we used that for this problem.
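A hedged sketch of what that MPI-distributed training looks like with Horovod's standard Keras API-- this follows Horovod's documented usage, not the exact scripts run at NERSC, and the model and dataset here are toy stand-ins:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one worker per MPI rank, launched with mpirun/srun

# Toy dataset of 64x64 3-channel images; each worker reads its own shard.
dataset = (tf.data.Dataset.from_tensor_slices(
               (tf.random.normal([1024, 64, 64, 3]),
                tf.random.uniform([1024], maxval=2, dtype=tf.int32)))
           .shard(hvd.size(), hvd.rank())
           .batch(64))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu',
                           input_shape=(64, 64, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid')])

# Scale the learning rate with worker count; the wrapped optimizer
# averages gradients across workers with MPI allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss='binary_crossentropy')

model.fit(dataset, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```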
And you can basically see that all the lines kind of
follow the ideal scaling up to thousands of nodes.
And so you can really process more data faster.
So there's quite a bit of engineering work
that goes into this, of course.
And this is shown in the papers here.
For the LHC CNN, there's still a gap relative to the ideal.
And that's partly because it's not
such a computationally intensive model,
so I/O and so forth becomes more important.
For the CosmoGAN on the right, that's the GAN example.
And there it is a bit more computationally intensive.
So it does follow the scaling curve here.
And CosmoFlow I just put in the middle here.
Here we show that I/O was important.
But we were able to exploit something we have,
this burst buffer, which is a layer of SSDs that
sits on the high-speed network.
And using that, we were much better able to scale up
to the full machine-- 8,000 nodes--
than with the shared disk file system.
OK, so another area that I just wanted to briefly mention
at the end is that we have these fancy supercomputers,
but, as I mentioned, this takes a lot of engineering,
and it was done with projects that we work with
particularly closely.
Something that we really want to do
is allow people to use this supercomputer scale
of deep learning via Jupyter notebooks,
which is really how scientists prefer to interact
with our machines these days.
So we do provide JupyterHub at NERSC.
But generally what people are running
sits on a dedicated machine outside the main compute
infrastructure.
But what we did here was to enable them to run stuff
actually on the supercomputing machine, using
either ipyparallel or Dask, which are tools for distributed
computing, but interfacing with them via the notebook,
where a lot of the heavy computation
actually occurs through the MPI backend in Horovod and so on.
So we were able to show that you can scale to large scales
here without adding any extra overhead from being
able to interact.
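A minimal sketch of that interactive pattern, assuming an ipyparallel cluster whose engines were launched with MPI on the compute nodes; the function below is a hypothetical placeholder for the real training call:

```python
import ipyparallel as ipp

rc = ipp.Client()  # connect to the running engines from the notebook
view = rc[:]       # a view over all engines

def report_rank():
    # Runs on every engine; Horovod/MPI does the heavy lifting there.
    import horovod.tensorflow as hvd
    hvd.init()
    return hvd.rank(), hvd.size()

print(view.apply_sync(report_rank))  # e.g. [(0, N), (1, N), ...]
```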
And then, of course, you can add nice Jupyter things,
like widgets and buttons, so that you
can run different hyperparameter trials and sort of click
on them and display them in the pane there.
OK, so I just have my conclusion now.
Basically, deep learning, particularly
in combination with high-performance computing
and productive software like TensorFlow,
can really accelerate science.
And we've seen that in various examples.
I only mentioned a few here.
But it requires developments, not only in methods.
So we have other projects where we're
working with machine-learning researchers
to develop new methods.
Also, there's different ways of applying these, of course.
And we're well-placed to do that.
And also well-placed to do some of the computing work.
But it can really benefit from collaboration, I think,
between scientists and the industry, which is sort
of better represented here.
And we've had a good relationship with Google there.
But I think we can also do that with others.
So basically my last slide is a call for help.
If you have any questions, but also
if you have ideas or collaborations,
or want to work with us on these problems,
it would be good to hear about that.
Thanks.
[APPLAUSE]
AUDIENCE: [INAUDIBLE]
WAHID BHIMJI: Yeah.
Well, what was I going to say about supersymmetry?
AUDIENCE: [INAUDIBLE]
[LAUGHS]
WAHID BHIMJI: Well, yeah, I don't think it's real.
But no, I guess I was going to say whether or not--
so this example isn't supersymmetry.
I can tell you that.
But yeah, part of the point of this network, I guess,
is it vastly improves our sensitivity to supersymmetry.
So hopefully it could help answer the question
if it's real or not.
But yeah, certainly there's no current evidence
at the Large Hadron Collider, so that's why we keep looking.
And also I would say that some of these approaches
also might help you look in different ways, for example,
not being so sensitive to the model
that theorists have come up with.
So this approach really trains on simulated samples
of a particular model that you're looking out for.
But some of the ways that we're trying to extend this
is to be more of a sort of anomaly detection,
looking for things that you might not have expected.
AUDIENCE: [INAUDIBLE]
WAHID BHIMJI: Yeah, so the thing with these big collaborations
that work at the LHC is they're very sensitive about you
working on the real data.
So, for example, someone who's coming from outside
can't just apply this model on the real data.
You have to work within the collaboration.
So the reason why they don't use things
like this in the collaboration at the moment
is partly just because of the kind of rigor
that goes into cross-validating and checking
that all these models are correct.
And that just takes a bit of time for it to percolate.
There's a little bit of skepticism
amongst certain people about new ideas, I guess.
And so I think the fact that this has now
been demonstrated in several studies
helps mitigate that a bit.
But then the other reason, I think,
is a practical one, which
is that this was exploiting kind of the full raw data.
And there are practical reasons
why you wouldn't want to do that,
because these are large, hundreds-of-petabytes datasets.
And so filtering, in terms of these high-level physics
variables, is not only done because they
think they know the physics, but also
because of practicality of storing the data.
So those are a bunch of reasons.
But then I gave another sort of technical reason, which
is that you might be more sensitive to this,
to mismodeling in the training data.
And they really care about systematic uncertainties
and so forth.
And those are difficult to model when
you don't know what it is you're trying to model for, I guess,
because you don't know what the network necessarily picked up
on.
And so there's various ways to mitigate
some of these technical challenges,
including sort of mixing of data with simulated data,
and also using different samples that don't necessarily
have all the same modeling effects, and things like that.
So there's a part that's cultural,
but there's also a part that's technical.
And so I think we can try and address the technical issues.
AUDIENCE: [INAUDIBLE]
WAHID BHIMJI: Not directly this model.
But yeah, there is certainly a huge range of work.
So we mostly work with the Department of Energy science.
And there's projects across all of here.
So, I mean, we can only work in depth,
I guess, with a few projects,
and we have a few such projects.
But really part of what our group does at NERSC
is also to make tools like TensorFlow,
working at scale on the computer,
available to the whole science community.
And so we're having more and more training events,
and different communities are picking this up more
by themselves.
So there certainly will be groups
working with other parts of physics, I think.
AUDIENCE: [INAUDIBLE]
WAHID BHIMJI: Why deep learning, right?
I mean, it's not really the only approach.
So like this actual diagram doesn't just
refer to deep learning.
So some of these projects around the edge
here are not using deep learning.
My talk was just about a few deep learning examples.
So it's not necessarily the be all and end all.
I think one of the advantages is that there
is a huge amount of development work in methods, in things
like convolutional neural networks, that we can exploit.
Relatively off the shelf, they're
already available in tools like TensorFlow and stuff.
So I mean, I think that's one advantage.
The other advantage is there is quite a large amount of data.
There are nonlinear features and stuff
that can be captured by these more detailed models
and so forth.
So I don't know.
It's not necessarily the only approach.
AUDIENCE: [INAUDIBLE]
WAHID BHIMJI: Yeah, so, I mean, there certainly
are cases where the current models are well tuned,
and they are right.
But there's many cases where you can get an advantage.
I mean, you might think that the physics selections here
are well tuned over many years--
they know what the variables are to look for.
But there is a performance gain.
And it can come just from the fact that you're
exploiting more information in the event--
information you were otherwise throwing away, because you thought
you knew everything that was important about those events,
but actually there was stuff that
leaked outside, if you like, the data that you were looking at.
So there are gains to be had, I think, yeah.