The following content is provided under a Creative
Commons license.
Your support will help MIT OpenCourseWare
continue to offer high-quality educational resources for free.
To make a donation or to view additional materials
from hundreds of MIT courses, visit MIT OpenCourseWare
at ocw.mit.edu.
JULIAN SHUN: Today, we're going to talk
about multicore programming.
And as I was just informed by Charles, it's 2018.
I had 2017 on the slide.
So first, congratulations to all of you.
You turned in the first project's beta.
Here's a plot showing the tiers that different groups reached
for the beta.
And this is in sorted order.
And we set the beta cutoff to be tier 45.
The final cutoff is tier 48.
So the final cutoff we did set a little bit aggressively,
but keep in mind that you don't necessarily
have to get to the final cutoff in order
to get an A on this project.
So we're going to talk about multicore processing today.
That's going to be the topic of the next project
after you finish the first project.
So in a multicore processor, we have a whole bunch
of cores that are all placed on the same chip,
and they have access to shared memory.
They usually also have some sort of private cache, and then
a shared last level cache, so L3, in this case.
And then they all have access to the same memory controller,
which goes out to main memory.
And then they also have access to I/O.
But for a very long time, chips only had a single core on them.
So why do we have multicore processors nowadays?
Why did semiconductor vendors start
producing chips that had multiple processor
cores on them?
So the answer is because of two things.
So first, there's Moore's Law, which
says that we get more transistors every year.
So the number of transistors that you can fit on a chip
doubles approximately every two years.
And secondly, there's the end of scaling of clock frequency.
So for a very long time, we could just
keep increasing the frequency of the single core on the chip.
But at around 2004 to 2005, that was no longer the case.
We couldn't scale the clock frequency anymore.
So here's a plot showing both the number of transistors
you could fit on the chip over time,
as well as the clock frequency of the processors over time.
And notice that the y-axis is in log scale here.
And the blue line is basically Moore's Law,
which says that the number of transistors
you can fit on a chip doubles approximately every two years.
And that's been growing pretty steadily.
So this plot goes up to 2010, but in fact, it's
been growing even up until the present.
And it will continue to grow for a couple
more years before Moore's Law ends.
However, if you look at the clock frequency line,
you see that it was growing quite
steadily until about the early 2000s, and then at that point,
it flattened out.
So at that point, we couldn't increase the clock frequencies
anymore, and the clock speed was bounded
at about four gigahertz.
So nowadays, if you go buy a processor,
it's usually still bounded by around 4 gigahertz.
It's usually a little bit less than 4 gigahertz,
because it doesn't really make sense to push it all the way.
But you might find some processors
that are around 4 gigahertz nowadays.
So what happened at around 2004 to 2005?
Does anyone know?
So Moore's Law basically says that we
can fit more transistors on a chip
because the transistors become smaller.
And when the transistors become smaller,
you can reduce the voltage that's
needed to operate the transistors.
And as a result, you can increase the clock frequency
while maintaining the same power density.
And that's what manufacturers did until about 2004 to 2005.
They just kept increasing the clock frequency
to take advantage of Moore's law.
But it turns out that once transistors become
small enough, and the voltage used
to operate them becomes small enough,
there's something called leakage current.
So there's current that leaks, and we're
unable to keep reducing the voltage while still having
reliable switching.
And if you can't reduce the voltage anymore,
then you can't increase the clock frequency
if you want to keep the same power density.
So here's a plot from Intel back in 2004
when they first started producing multicore processors.
And this is plotting the power density versus time.
And again, the y-axis is in log scale here.
So the green data points are actual data points,
and the orange ones are projected.
And they projected what the power density
would be if we kept increasing the clock
frequency at a trend of about 25% to 30% per year,
which is what happened up until around 2004.
And because we couldn't reduce the voltage anymore,
the power density would go up.
And you can see that eventually, it
reaches the power density of a nuclear reactor, which
is pretty hot.
And then it reaches the power density of a rocket nozzle,
and eventually you get to the power
density of the sun's surface.
So if you have a chip that has a power density
equal to the sun's surface--
well, you don't actually really have a chip anymore.
So basically if you get into this orange region,
you basically have a fire, and you can't really
do anything interesting, in terms of performance
engineering, at that point.
So to solve this problem, semiconductor vendors
didn't increase the clock frequency anymore,
but we still had Moore's Law giving us
more and more transistors every year.
So what they decided to do with these extra transistors
was to put them into multiple cores,
and then put multiple cores on the same chip.
So we can see that, starting at around 2004,
the number of cores per chip becomes more than one.
And each generation of Moore's Law
will potentially double the number of cores
that you can fit on a chip, because it's doubling
the number of transistors.
And we've seen this trend up until about today.
And again, it's going to continue for a couple
more years before Moore's Law ends.
So that's why we have chips with multiple cores today.
So today, we're going to look at multicore processing.
So I first want to introduce the abstract multicore
architecture.
So this is a very simplified version,
but I can fit it on this slide, and it's a good example
for illustration.
So here, we have a whole bunch of processors.
They each have a cache, so that's
indicated with the dollar sign.
And usually they have a private cache as well as
a shared cache, so a shared last level cache, like the L3 cache.
And then they're all connected to the network.
And then, through the network, they
can connect to the main memory.
They can all access the same shared memory.
And then usually there's a separate network for the I/O
as well, even though I've drawn them as a single network here,
so they can access the I/O interface.
And potentially, the network will also
connect to other multiprocessors on the same system.
And this abstract multicore architecture
is known as a chip multiprocessor, or CMP.
So that's the architecture that we'll be looking at today.
So here's an outline of today's lecture.
So first, I'm going to go over some hardware challenges
with shared memory multicore machines.
So we're going to look at the cache coherence protocol.
And then after looking at hardware,
we're going to look at some software solutions
to write parallel programs on these multicore machines
to take advantage of the extra cores.
And we're going to look at several concurrency
platforms listed here.
We're going to look at Pthreads.
This is basically a low-level API
for running your code in parallel.
And if you program on Microsoft products,
the Win API threads is pretty similar.
Then there's Intel Threading Building Blocks,
which is a library solution to concurrency.
And then there are two linguistic solutions
that we'll be looking at--
OpenMP and Cilk Plus.
And Cilk Plus is actually the concurrency platform
that we'll be using for most of this class.
So let's look at how caches work.
So let's say that we have a value in memory
at some location, and that value is--
let's say that value is x equals 3.
If one processor says, we want to load
x, what happens is that processor reads
this value from a main memory, brings it into its own cache,
and then it also reads the value, loads it
into one of its registers.
And it keeps this value in cache so
that if it wants to access this value again in the near future,
it doesn't have to go all the way out to main memory.
It can just look at the value in its cache.
Now, what happens if another processor wants to load x?
Well, it just does the same thing.
It reads the value from main memory,
brings it into its cache, and then also loads it
into one of the registers.
And then same thing with another processor.
It turns out that you don't actually
always have to go out to main memory to get the value.
If the value resides in one of the other processor's caches,
you can also get the value through the other processor's
cache.
And sometimes that's cheaper than going all the way out
to main memory.
So the second processor now loads x again.
And it's in cache, so it doesn't have
to go to main memory or anybody else's cache.
So what happens now if we want to store
x, if we want to set the value of x to something else?
So let's say this processor wants to set x equal to 5.
So it's going to write x equals 5
and store that result in its own cache.
So that's all well and good.
Now what happens when the first processor wants to load x?
Well, it seems that the value of x is in its own cache,
so it's just going to read the value of x there,
and it gets a value of 3.
So what's the problem there?
Yes?
AUDIENCE: The value is stale.
JULIAN SHUN: Yeah.
So the problem is that the value of x in the first processor's
cache is stale, because another processor updated it.
So now this value of x in the first processor's cache
is invalid.
So that's the problem.
And one of the main challenges of multicore hardware
is to try to solve this problem of cache coherence--
making sure that the values in different processors' caches
are consistent across updates.
So one basic protocol for solving this problem
is known as the MSI protocol.
And in this protocol, each cache line is labeled with a state.
So there are three possible states--
M, S, and I. And this is done on the granularity of cache lines.
Because it turns out that storing this information
is relatively expensive, so you don't
want to store it for every memory location.
So they do it on a per cache line basis.
Does anyone know what the size of a cache line
is, on the machines that we're using?
Yeah?
AUDIENCE: 64 bytes.
JULIAN SHUN: Yeah, so it's 64 bytes.
And that's typically what you see today on most Intel and AMD
machines.
There's some architectures that have different cache lines,
like 128 bytes.
But for our class, the machines that we're using
will have 64 byte cache lines.
It's important to remember that so that when you're doing
back-of-the-envelope calculations,
you can get accurate estimates.
So the three states in the MSI protocol are M, S, and I.
So M stands for modified.
And when a cache block is in the modified state,
that means no other caches can contain this block
in the M or the S states.
The S state means that the block is shared,
so other caches can also have this block in shared state.
And then finally, I means the cache block is invalid.
So that's essentially the same as the cache block
not being in the cache.
And to solve the problem of cache coherency, when
one cache modifies a location, it
has to inform all the other caches
that their values are now stale, because this cache modified
the value.
So it's going to invalidate all of the other copies
of that cache line in other caches
by changing their state from S to I.
So let's see how this works.
So let's say that the second processor wants to store y
equals 5.
So previously, the value of y was 17, and it was in shared state.
The cache line containing y equals 17 was in shared state.
So now, when I do y equals 5, I'm
going to set the second processor's cache--
that cache line-- to modified state.
And then I'm going to invalidate the cache
line in all of the other caches that contain that cache line.
So now the first cache and the fourth cache
each have a state of I for y equals 17,
because that value is stale.
Are there any questions?
Yes?
AUDIENCE: If we already have to tell the other things to switch
to invalid, why not just tell them the value of y?
JULIAN SHUN: Yeah, so there are actually
some protocols that do that.
So this is just the most basic protocol.
So this protocol doesn't do it.
But there are some that are used in practice
that actually do do that.
So it's a good point.
But I just want to present the most basic protocol for now.
Sorry.
And then, when you load a value, you
can first check whether your cache line is in M or S state.
And if it is an M or S state, then you
can just read that value directly.
But if it's in the I state, or if it's not there,
then you have to fetch that block
from either another processor's cache
or fetch it from main memory.
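To make the state transitions a little more concrete, here is a toy software model of the MSI states for a single cache line shared by four caches. This is purely illustrative and simplified-- the names, the fixed cache count, and the downgrade-on-read behavior are assumptions for the sketch, and the real protocol is implemented entirely in hardware.

#include <stdio.h>

// Toy model of MSI states for one cache line across P caches.
typedef enum { M, S, I } mstate_t;

enum { P = 4 };
mstate_t state[P] = { I, I, I, I };   // initially the line is in no cache

// Processor p writes the line: invalidate every other copy,
// then hold the line in the Modified state.
void store(int p) {
  for (int q = 0; q < P; ++q)
    if (q != p) state[q] = I;
  state[p] = M;
}

// Processor p reads the line: if its copy is invalid, fetch the block
// (from another cache or from memory), downgrade any Modified copy to
// Shared, and keep the line in the Shared state.
void load(int p) {
  if (state[p] == I) {
    for (int q = 0; q < P; ++q)
      if (state[q] == M) state[q] = S;
    state[p] = S;
  }
}

int main(void) {
  load(0);    // processor 0 reads:        S I I I
  load(1);    // processor 1 reads:        S S I I
  store(1);   // processor 1 writes:       I M I I
  load(0);    // processor 0 reads again:  S S I I
  for (int p = 0; p < P; ++p) printf("%d ", state[p]);
  printf("\n");
  return 0;
}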
So it turns out that there are many other protocols out there.
There's something known as MESI, pronounced "messy."
There's also MOESI and many other different protocols.
And some of them are proprietary.
And they all do different things.
And it turns out that all of these protocols
are quite complicated, and it's very hard
to get these protocols right.
And in fact, one of the earliest successes
of formal verification was in proving some of these cache
coherence protocols correct.
Yes, question?
AUDIENCE: What happens if two processors try to modify
one value at the same time?
JULIAN SHUN: Yeah, so if two processors
try to modify the value, one of them has to happen first.
So the hardware is going to take care of that.
So the first one that actually modifies
it will invalidate all the other copies,
and then the second one that modifies the value
will again invalidate all of the other copies.
And when you do that--
when a lot of processors try to modify the same value,
you get something known as an invalidation storm.
So you have a bunch of invalidation messages
going throughout the hardware.
And that can lead to a big performance bottleneck.
Because each processor, when it modifies its value,
it has to inform all the other processors.
And if all the processors are modifying the same value,
you get this sort of quadratic behavior.
The hardware is still going to guarantee
that one of the processors is going to end up
writing its value there.
But you should be aware of this performance issue
when you're writing parallel code.
Yes?
AUDIENCE: So all of this protocol stuff
happens in hardware?
JULIAN SHUN: Yes, so this is all implemented in hardware.
So if you take a computer architecture class,
you'll learn much more about these protocols and all
of their variants.
So for our purposes, we don't actually
need to understand all the details of the hardware.
We just need to understand what it's doing at a high level
so we can understand when we have a performance bottleneck
and why we have a performance bottleneck.
So that's why I'm just introducing the most
basic protocol here.
Any other questions?
So I talked a little bit about the shared memory hardware.
Let's now look at some concurrency platforms.
So these are the four platforms that we'll be looking at today.
So first, what is a concurrency platform?
Well, writing parallel programs is very difficult.
It's very hard to get these programs to be correct.
And if you want to optimize their performance,
it becomes even harder.
So it's very painful and error-prone.
And a concurrency platform abstracts processor
cores and handles synchronization
and communication protocols.
And it also performs load balancing for you.
So it makes your lives much easier.
And so today we're going to talk about some
of these different concurrency platforms.
So to illustrate these concurrency platforms,
I'm going to do the Fibonacci numbers example.
So does anybody not know what Fibonacci is?
So good.
Everybody knows what Fibonacci is.
So it's a sequence where each number is the sum
of the previous two numbers.
And the recurrence is shown in this brown box here.
The sequence is named after Leonardo di Pisa, who was also
known as Fibonacci, which is a contraction of filius Bonacci,
son of Bonaccio.
So that's where the name Fibonacci came from.
And in Fibonacci's 1202 book, Liber Abaci,
he introduced the sequence--
the Fibonacci sequence-- to Western mathematics,
although it had been previously known
to Indian mathematicians for several centuries.
But this is what we call the sequence nowadays--
Fibonacci numbers.
So here's a Fibonacci program.
Has anyone seen this algorithm before?
A couple of people.
Probably more, but people didn't raise their hands.
So it's a recursive program.
So it basically implements the recurrence
from the previous slide.
So if n is less than 2, we just return n.
Otherwise, we compute fib of n minus 1,
store that value in x, fib of n minus 2, store that value in y,
and then return the sum of x and y.
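As a minimal sketch in C of the recursive function just described (the exact code on the slide may differ slightly):

#include <stdint.h>

int64_t fib(int64_t n) {
  if (n < 2) return n;     // base case
  int64_t x = fib(n - 1);  // first recursive call
  int64_t y = fib(n - 2);  // second recursive call
  return x + y;
}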
So I do want to make a disclaimer
to the algorithms police that this is actually
a very bad algorithm.
So this algorithm takes exponential time,
and there are actually much better ways
to compute the nth Fibonacci number.
There's a linear time algorithm, which
just computes the Fibonacci numbers from bottom up.
This algorithm here is actually redoing a lot of the work,
because it's computing Fibonacci numbers multiple times.
Whereas if you do a linear scan from the smallest numbers up,
you only have to compute each one once.
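For comparison, here is a sketch of that bottom-up linear-time version, which computes each Fibonacci number exactly once (assuming n is small enough that the result fits in 64 bits):

#include <stdint.h>

int64_t fib_linear(int64_t n) {
  if (n < 2) return n;
  int64_t prev = 0, curr = 1;         // fib(0) and fib(1)
  for (int64_t i = 2; i <= n; ++i) {  // build up from the smallest numbers
    int64_t next = prev + curr;
    prev = curr;
    curr = next;
  }
  return curr;
}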
And there's actually an even better algorithm
that takes logarithmic time, and it's
based on squaring matrices.
So has anyone seen that algorithm before?
So a couple of people.
So if you're interested in learning more
about this algorithm, I encourage
you to look at your favorite textbook, Introduction
to Algorithms by Cormen, Leiserson, Rivest, and Stein.
So even though this--
[LAUGHTER]
Yes.
So even though this is a pretty bad algorithm,
it's still a good educational example,
because I can fit it on one slide
and illustrate all the concepts of parallelism
that we want to cover today.
So here's the execution tree for fib of 4.
So we see that fib of 4 is going to call fib of 3 and fib of 2.
Fib of 3 is going to call fib of 2, fib of 1, and so on.
And you can see the repeated computations here.
So fib of 2 is being computed twice, and so on.
And if you have a much larger tree--
say you ran this on fib of 40-- then
you'll have many more overlapping computations.
It turns out that the two recursive calls can actually
be parallelized, because they're completely independent
calculations.
So the key idea for parallelization
is to simultaneously execute the two recursive sub-calls to fib.
And in fact, you can do this recursively.
So the two sub-calls to fib of 3 can also
be executed in parallel, and the two sub-calls of fib of 2
can also be executed in parallel, and so on.
So you have all of these calls that
can be executed in parallel.
So that's the key idea for extracting parallelism
from this algorithm.
So let's now look at how we can use
Pthreads to implement this simple Fibonacci algorithm.
So Pthreads is a standard API for threading,
and it's supported on all Unix-based machines.
And if you're programming using Microsoft products,
then the equivalent is Win API threads.
And Pthreads is actually standard in ANSI and IEEE,
so there's this number here that specifies the standard.
But nowadays, we just call it Pthreads.
And it's basically a do-it-yourself concurrency
platform.
So it's like the assembly language
of parallel programming.
It's built as a library of functions
with special non-C semantics.
Because if you're just writing code in C,
you can't really say which parts of the code
should be executed in parallel.
So Pthreads provides you a library
of functions that allow you to specify concurrency
in your program.
And each thread implements an abstraction of a processor,
and these threads are then multiplexed
onto the actual machine resources.
So the number of threads that you create
doesn't necessarily have to match the number of processors
you have on your machine.
So if you have more threads than the number of processors
you have, then they'll just be multiplexing.
So you can actually run a Pthreads program
on a single core even though you have multiple threads
in the program.
They would just be time-sharing.
All the threads communicate through shared memory,
so they all have access to the same view of the memory.
And the library functions that Pthreads provides mask
the protocols involved in interthread coordination,
so you don't have to do it yourself.
Because it turns out that this is quite difficult to
do correctly by hand.
So now I want to look at the key Pthread functions.
So the first Pthread function is pthread_create.
And this takes four arguments.
So the first argument is this pthread_t type.
This is basically going to store an identifier
for the new thread that pthread_create
will create so that we can use that thread
in our computations.
pthread_attr_t-- this sets some thread attributes,
and for our purposes, we can just set it to null and use
the default attributes.
The third argument is this function
that's going to be executed after we create the thread.
So we're going to need to define this function that we
want the thread to execute.
And then finally, we have this void *arg argument,
which stores the arguments that are going to be passed
to the function that we're going to be executing.
And then pthread_create also returns an error status--
an integer specifying whether the thread creation
was successful or not.
And then there's another function called pthread_join.
pthread_join basically says that we
want to block at this part of our code
until this specified thread finishes.
So it takes as an argument a pthread_t--
the thread identifier that was created
when we called pthread_create.
It also has a second argument, status,
which is going to store the status
of the terminating thread.
And then pthread_join also returns an error status.
So essentially what this does is it
says to wait until this thread finishes before we continue on
in our program.
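As a minimal sketch of how these two functions fit together-- the hello function and the argument value here are made up just for illustration:

#include <stdio.h>
#include <pthread.h>

// Function run by the new thread; it must take and return void *.
void *hello(void *arg) {
  printf("hello from the created thread, arg = %ld\n", (long)arg);
  return NULL;
}

int main(void) {
  pthread_t thread;  // identifier filled in by pthread_create
  // Create the thread with default attributes (NULL), running hello(42).
  int status = pthread_create(&thread, NULL, hello, (void *)42L);
  if (status != 0) return 1;            // nonzero means creation failed
  // ... the main thread can do other work here, in parallel ...
  status = pthread_join(thread, NULL);  // block until the thread finishes
  return status;
}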
So any questions so far?
So here's what the implementation of Fibonacci
looks like using Pthreads.
So on the left, we see the original program that we had,
the fib function there.
That's just the sequential code.
And then we have all this other stuff
to enable it to run in parallel.
So first, we have this struct on the left, thread_args.
This struct here is used to store the arguments that
are passed to the function that the thread is going to execute.
And then we have this thread_func.
What that does is it reads the input
argument from this thread_args struct,
and then it sets that to i, and then it calls fib of i.
And that gives you the output, and then we store the result
into the output of the struct.
And then that also just returns null.
And then over on the right hand side,
we have the main function that will actually call the fib
function on the left.
So we initialize a whole bunch of variables
that we need to execute these threads.
And then we first check if n is less than 30.
If n is less than 30, it turns out
that it's actually not worth creating threads
to execute this program in parallel, because
of the overhead of thread creation.
So if n is less than 30, we'll just execute the program
sequentially.
And this idea is known as coarsening.
So you saw a similar example a couple of lectures
ago when we did coarsening for sorting.
But this is in the context of parallel programming.
So here, because there are some overheads
to running a function in parallel,
if the input size is small enough,
sometimes you want to just execute it sequentially.
And then we're going to--
so let me just walk through this code,
since I have an animation.
So the next thing it's going to do
is it's going to marshal the input argument to the thread
so it's going to store the input argument n minus 1
in this args struct.
And then we're going to call pthread_create
with a thread variable.
For the thread attributes, we're just going to use null.
And then we're going to pass the thread_func
that we defined on the left.
And then we're going to pass the args structure.
And inside this args structure, the input is set to n minus 1,
which we did on the previous line.
And then pthread_create is going to give a return value.
So if the Pthread creation was successful,
then the status is going to be null, and we can continue.
And when we continue, we're going
to execute, now, fib of n minus 2 and store the result of that
into our result variable.
And this is done at the same time that fib of n minus 1
is executing.
Because we created this Pthread, and we
told it to call this thread_func function
that we defined on the left.
So both fib of n minus 1 and fib of n minus 2
are executing in parallel now.
And then we have this pthread_join,
which says we're going to wait until the thread
that we've created finishes before we move on, because we
need to know the result of both of the sub-calls
before we can finish this function.
And once that's done--
well, we first check the status to see if it was successful.
And if so, then we add the output of the args
struct to the result. So args.output will store
the output of fib of n minus 1.
So that's the Pthreads code.
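For reference, a sketch along the lines of the slide code might look like this-- the types, names, and the default value of n are assumptions, not the exact slide code:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <pthread.h>

// Sequential Fibonacci, as on the left of the slide.
int64_t fib(int64_t n) {
  if (n < 2) return n;
  return fib(n - 1) + fib(n - 2);
}

// Struct for marshaling the argument and result of the spawned call.
typedef struct {
  int64_t input;
  int64_t output;
} thread_args;

// Function executed by the created thread: unpack the argument,
// compute fib, and store the result back into the struct.
void *thread_func(void *ptr) {
  thread_args *args = (thread_args *)ptr;
  args->output = fib(args->input);
  return NULL;
}

int main(int argc, char *argv[]) {
  int64_t n = (argc > 1) ? atoll(argv[1]) : 40;
  int64_t result;
  if (n < 30) {                 // coarsening: not worth a thread for small n
    result = fib(n);
  } else {
    pthread_t thread;
    thread_args args;
    args.input = n - 1;         // marshal the input argument
    int status = pthread_create(&thread, NULL, thread_func, &args);
    if (status != 0) return 1;
    result = fib(n - 2);        // runs in parallel with the created thread
    status = pthread_join(thread, NULL);  // wait for fib(n-1) to finish
    if (status != 0) return 1;
    result += args.output;      // combine the two results
  }
  printf("fib(%" PRId64 ") = %" PRId64 "\n", n, result);
  return 0;
}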
Any questions on how this works?
Yeah?
AUDIENCE: I have a question about the thread function.
So it looks like you passed a void pointer,
but then you cast it to something else every time
you use that--
JULIAN SHUN: Yeah, so this is because
the pthread_create function takes
as input a void star pointer.
Because it's actually a generic function,
so it doesn't know what the data type is.
It has to work for all data types,
and that's why we need to cast it to a void star
when we pass it to pthread_create.
And then inside the thread_func, we actually
do know what type of pointer it is, so we cast it back.
So does this code seem very parallel?
So how many parallel calls am I doing here?
Yeah?
AUDIENCE: Just one.
JULIAN SHUN: Yeah, so I'm only creating one thread.
So I'm executing two things in parallel.
So if I ran this code on four processors,
what's the maximum speed-up I could get?
AUDIENCE: [INAUDIBLE].
JULIAN SHUN: So the maximum speed-up I can get
is just two, because I'm only running two things in parallel.
So this doesn't recursively create threads.
It only creates one thread at the top level.
And if you wanted to make it so that this code actually
recursively created threads, it would actually
become much more complicated.
And that's one of the disadvantages of implementing
this code in Pthreads.
So we'll look at other solutions that will
make this task much easier.
So some of the issues with Pthreads
are shown on this slide here.
So there's a high overhead to creating a thread.
So creating a thread typically takes over 10
to the 4th cycles.
And this leads to very coarse-grained concurrency,
because your tasks have to do a lot of work
in order to amortize the costs of creating that thread.
There's something called thread pools, which can help.
And the idea here is to create a whole bunch of threads
at the same time to amortize the cost of thread creation.
And then when you need a thread, you just
take one from the thread pool.
So the thread pool contains threads that
are just waiting to do work.
There's also a scalability issue with this code
that I showed on the previous slide.
The Fibonacci code gets, at most,
1.5x speed-up for two cores.
Why is it 1.5 here?
Does anyone know?
Yeah?
AUDIENCE: You have the asymmetry in the size of the two calls.
JULIAN SHUN: Yeah, so it turns out
that the two calls that I'm executing in parallel--
they're not doing the same amount of work.
So one is computing fib of n minus 1,
one is computing fib of n minus 2.
And does anyone know what the ratio between these two values
is?
Yeah, so it's the golden ratio.
It's about 1.6.
It turns out that if you can get a speed-up of 1.6,
then that's great.
But there are some overheads, so this code
will get about a 1.5x speed-up.
And if you want to run this to take advantage of more cores,
then you need to rewrite this code,
and it becomes more complicated.
Third, there's the issue of modularity.
So if you look at this code here,
you see that the Fibonacci logic is not nicely encapsulated
within one function.
We have that logic in the fib function on the left,
but then we also have some of the fib logic on the right
in our main function.
And this makes this code not modular.
And if we want to build programs on top of this,
it makes it very hard to maintain,
if we want to just change the logic of the Fibonacci
function a little bit, because now we
have to change it in multiple places
instead of just having everything in one place.
So it's not a good idea to write code that's not modular,
so please don't do that in your projects.
And then finally, the code becomes
complicated because you have to actually move
these arguments around.
That's known as argument marshaling.
And then you have to engage in error-prone protocols
in order to do load balancing.
So if you recall here, we have to actually
place the argument n minus 1 into args.input
and we have to extract the value out of args.output.
So that makes the code very messy.
So why do I say shades of 1958 here?
Does anyone know what happened in 1958?
Who was around in 1958?
Just Charles?
So there was a first something in 1958.
What was it?
So turns out in 1958, we had the first compiler.
And this was the Fortran compiler.
And before we had Fortran compiler,
programmers were writing things in assembly.
And when you write things in assembly,
you have to do argument marshaling,
because you have to place things into the appropriate registers
before calling a function, and also move things around when
you return from a function.
And the nice thing about the first compiler
is that it actually did all of this argument marshaling
for you.
So now you can just pass arguments to a function,
and the compiler will generate code
that will do the argument marshaling for us.
So having to do this in Pthreads
is similar to having to write code in assembly,
because you have to actually manually marshal
these arguments.
So hopefully, there are better ways to do this.
And indeed, we'll look at some other solutions that will make
it easier on the programmer.
Any questions before I continue?
So we looked at Pthreads.
Next, let's look at Threading Building Blocks.
So Threading Building Blocks is a library solution.
It was developed by Intel.
And it's implemented as a C++ library that runs on top
of native threads.
So the underlying implementation uses threads,
but the programmer doesn't deal with threads.
Instead, the programmer specifies tasks,
and these tasks are automatically load-balanced
across the threads using a work-stealing algorithm
inspired by research at MIT--
Charles Leiserson's research.
And the focus of Intel TBB is on performance.
And as we'll see, the code written using TBB
is simpler than what you would have
to write if you used Pthreads.
So let's look at how we can implement Fibonacci using TBB.
So in TBB, we have to create these tasks.
So in the Fibonacci code, we create this fib task class.
And inside the task, we have to define this execute function.
So the execute function is the function
that performs a computation when we start the task.
And this is where we define the Fibonacci logic.
This task also takes as input these argument parameters, n
and sum.
So n is the input here and sum is the output.
And in TBB, we can easily create a recursive program
that extracts more parallelism.
And here, what we're doing is we're recursively creating
two child tasks, a and b.
That's the syntax for creating the tasks.
And here, we can just pass the arguments to FibTask
instead of marshaling the arguments ourselves.
And then what we have here is a set_ref_count.
And this basically is the number of tasks
that we have to wait for plus one, so plus one for ourselves.
And in this case, we created two child tasks,
and we have ourselves, so that's 2 plus 1.
And then after that, we start task b using the spawn(b) call.
And then we do spawn_and_wait_for_all
with a as the argument.
This says we're going to start task a,
and then also wait for both a and b
to finish before we proceed.
So this spawn_and_wait_for_all call
is going to look at the ref count that we set above
and wait for that many tasks to finish before it continues.
And after both a and b have completed,
then we can just sum up the results
and store that into the sum variable.
And here, these tasks are created recursively.
So unlike the Pthreads implementation
that was only creating one thread at the top level,
here, we're actually recursively creating more and more tasks.
So we can actually get more parallelism
from this code and scale to more processors.
We also need this main function just to start up the program.
So what we do here is we create a root task,
which just computes fib of n, and then we call
spawn_root_and_wait(a).
So a is the task for the root.
And then it will just run the root task.
So that's what Fibonacci looks like in TBB.
So this is much simpler than the Pthreads implementation.
And it also gets better performance,
because we can extract more parallelism
from the computation.
Any questions?
So TBB also has many other features in addition to tasks.
So TBB provides many C++ templates to express common
patterns, and you can use these templates on different data
types.
So they have a parallel_for, which
is used to express loop parallelism.
So you can loop over a bunch of iterations in parallel.
They also have a parallel_reduce for data aggregation.
For example, if you want to sum together
a whole bunch of values, you can use a parallel_reduce
to do that in parallel.
They also have pipeline and filter.
That's used for software pipelining.
TBB provides many concurrent container classes,
which allow multiple threads to safely access and update
the items in a container concurrently.
So for example, they have hash tables, trees, priority queues,
and so on.
And you can just use these out of the box,
and they'll work in parallel.
You can do concurrent updates and reads
to these data structures.
TBB also has a variety of mutual exclusion library functions,
such as locks and atomic operations.
So there are a lot of features of TBB,
which is why it's one of the more popular concurrency
platforms.
And because of all of these features,
you don't have to implement many of these things by yourself,
and still get pretty good performance.
So TBB was a library solution to the concurrency problem.
Now we're going to look at two linguistic solutions--
OpenMP and Cilk.
So let's start with OpenMP.
So OpenMP is a specification by an industry consortium.
And there are several compilers available that support OpenMP,
both open source and proprietary.
So nowadays, GCC, ICC, and Clang all
support OpenMP, as well as Visual Studio.
And OpenMP is-- it provides linguistic extensions to C
and C++, as well as Fortran, in the form of compiler pragmas.
So you use these compiler pragmas
in your code to specify which pieces of code
should run in parallel.
And OpenMP also runs on top of native threads,
but the programmer isn't exposed to these threads.
OpenMP supports loop parallelism,
so you can do parallel for loops.
They have task parallelism as well as pipeline parallelism.
So let's look at how we can implement Fibonacci in OpenMP.
So this is the entire code.
So I want you to compare this to the Pthreads implementation
that we saw 10 minutes ago.
So this code is much cleaner than the Pthreads
implementation, and it also performs better.
So let's see how this code works.
So we have these compiler pragmas,
or compiler directives.
And the compiler pragma for creating a parallel task
is omp task.
So we're going to create an OpenMP task for fib
of n minus 1 as well as fib of n minus 2.
There's also this shared clause, which
specifies that the two variables in its arguments
are shared across different threads.
So you also have to specify whether variables
are private or shared.
And then the pragma omp taskwait just
says we're going to wait for the preceding tasks
to complete before we continue.
So here, it's going to wait for fib of n minus 1
and fib of n minus 2 to finish before we
return the result, which is what we want.
And then after that, we just return x plus y.
So that's the entire code.
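Here is a sketch of what that OpenMP code looks like in C; the driver main function and the value 40 are assumptions added so the sketch is self-contained, and it would be compiled with OpenMP enabled (for example, -fopenmp):

#include <inttypes.h>
#include <stdio.h>

int64_t fib(int64_t n) {
  if (n < 2) return n;
  int64_t x, y;
  #pragma omp task shared(x, n)  // compute fib(n-1) as a parallel task
  x = fib(n - 1);
  #pragma omp task shared(y, n)  // compute fib(n-2) as a parallel task
  y = fib(n - 2);
  #pragma omp taskwait           // wait for both tasks to finish
  return x + y;
}

int main(void) {
  int64_t result;
  #pragma omp parallel           // create the team of threads
  #pragma omp single             // have one thread start the computation
  result = fib(40);
  printf("%" PRId64 "\n", result);
  return 0;
}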
And OpenMP also provides many other pragma directives,
in addition to task.
So we can use a parallel for to do loop parallelism.
There's reduction.
There's also directives for scheduling and data sharing.
So you can specify how you want a particular loop
to be scheduled.
OpenMP has many different scheduling policies.
They have static scheduling, dynamic scheduling, and so on.
And then these scheduling directives also
have different grain sizes.
The data sharing directives specify whether variables
are private or shared.
OpenMP also supplies a variety of synchronization constructs,
such as barriers, atomic updates, mutual exclusion,
or mutex locks.
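To make a couple of these directives concrete-- the parallel for, the reduction, and the schedule clause-- here is a small sketch; the array a and its length n are hypothetical:

#include <stdint.h>

// Sum the elements of a[0..n-1] in parallel.
int64_t sum_array(const int64_t *a, int64_t n) {
  int64_t sum = 0;
  #pragma omp parallel for schedule(static) reduction(+: sum)
  for (int64_t i = 0; i < n; ++i) {
    sum += a[i];  // each thread accumulates its own copy; copies are combined at the end
  }
  return sum;
}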
So OpenMP also has many features,
and it's also one of the more popular solutions
to writing parallel programs.
As you saw in the previous example,
the code is much simpler than if you
were to write something using Pthreads or even TBB.
This is a much simpler solution.
Any questions?
Yeah?
AUDIENCE: So with every compiler directive,
does it spawn a new [INAUDIBLE] on a different processor?
JULIAN SHUN: So this code here is actually
independent of the number of processors.
So there is actually a scheduling algorithm
that will determine how the tasks get
mapped to different processors.
So if you spawn a new task, it doesn't necessarily put it
on a different processor.
And you can have more tasks than the number
of processors available.
So there's a scheduling algorithm
that will take care of how these tasks get
mapped to different processors, and that's
hidden from the programmer.
Although you can use these scheduling
pragmas to give hints to the compiler
how it should schedule it.
Yeah?
AUDIENCE: What is the operating system [INAUDIBLE]
scheduling [INAUDIBLE]?
JULIAN SHUN: Underneath, this is implemented
using Pthreads, which has to make operating system calls to,
basically, directly talk to the processor cores
and do multiplexing and so forth.
So the operating system is involved at a very low level.
So the last concurrency platform that we'll be looking at today
is Cilk.
We're going to look at Cilk Plus, actually.
And the Cilk part of Cilk Plus is a small set of linguistic
extensions to C and C++ that support fork-join parallelism.
So for example, the Fibonacci example
uses fork-join parallelism, so you
can use Cilk to implement that.
And the Plus part of Cilk Plus supports vector parallelism,
which you had experience working with in your homeworks.
So Cilk Plus was initially developed by Cilk Arts,
which was an MIT spin-off.
And Cilk Arts was acquired by Intel in July 2009.
And the Cilk Plus implementation was
based on the award-winning Cilk multi-threaded language that
was developed two decades ago here at MIT by Charles
Leiserson's research group.
And it features a provably efficient
work-stealing scheduler.
So this scheduler is provably efficient.
You can actually prove theoretical bounds on it.
And this allows you to implement theoretically efficient
algorithms, which we'll talk more about in another lecture--
algorithm design.
But it provides a provably efficient
work-stealing scheduler.
And Charles Leiserson has a very famous paper
that has a proof that this scheduler is optimal.
So if you're interested in reading about this,
you can talk to us offline.
Cilk Plus also provides a hyperobject library
for parallelizing code with global variables.
And you'll have a chance to play around with hyperobjects
in homework 4.
The Cilk Plus ecosystem also includes
useful programming tools, such as the Cilk Screen Race
Detector.
So this allows you to detect determinacy races
in your program to help you isolate bugs and performance
bottlenecks.
It also has a scalability analyzer called Cilk View.
And Cilk View will basically analyze the amount of work
that your program is doing, as well as
the maximum amount of parallelism
that your code could possibly extract from the hardware.
So that's Intel Cilk Plus.
But it turns out that we're not actually
going to be using Intel Cilk Plus in this class.
We're going to be using a better compiler.
And this compiler is based on Tapir/LLVM.
And it supports the Cilk subset of Cilk Plus.
And Tapir/LLVM was actually recently developed at MIT
by T. B. Schardl, who gave a lecture last week, William
Moses, who's a grad student working with Charles,
as well as Charles Leiserson.
So talking a lot about Charles's work today.
And Tapir/LLVM generally produces
better code, relative to its base compiler,
than all other implementations of Cilk out there.
So it's the best Cilk compiler that's available today.
And they actually wrote a very nice paper
on this last year, Charles Leiserson and his group.
And that paper received the Best Paper Award
at the annual Symposium on Principles and Practices
of Parallel Programming, or PPoPP.
So you should look at that paper as well.
So right now, Tapir/LLVM uses the Intel Cilk Plus runtime
system, but I believe Charles's group has plans to implement
a better runtime system.
And Tapir/LLVM also supports more general features
than existing Cilk compilers.
So in addition to spawning functions,
you can also spawn code blocks that are not
separate functions, and this makes
writing programs more flexible.
You don't have to actually create a separate function
if you want to execute a code block in parallel.
Any questions?
So this is the Cilk code for Fibonacci.
So it's also pretty simple.
It looks very similar to the sequential program,
except we have these cilk_spawn and cilk_sync
statements in the code.
So what do these statements do?
So cilk_spawn says that the named child function, which
is the function that is right after this cilk_spawn
statement, may execute in parallel with the parent
caller.
The parent caller is the function
that is calling cilk_spawn.
So this says that fib of n minus 1
can execute in parallel with the function that called it.
And then this function is then going to call fib of n minus 2.
And fib of n minus 2 and fib of n minus 1
now can be executing in parallel.
And then cilk_sync says that control cannot pass this point
until all of the spawn children have returned.
So this is going to wait for fib of n minus 1
to return before we go to the return statement
where we add up x and y.
So one important thing to note is
that the Cilk keywords grant permission
for parallel execution, but they don't actually force or command
parallel execution.
So even though I said cilk_spawn here,
the runtime system doesn't necessarily
have to run fib of n minus 1 in parallel with fib of n minus 2.
I'm just saying that I could run these two things in parallel,
and it's up to the runtime system
to decide whether or not to run these things in parallel,
based on its scheduling policy.
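Here is a sketch of the Cilk Fibonacci code in C-- essentially what is on the slide, modulo minor details such as the integer type:

#include <stdint.h>
#include <cilk/cilk.h>

int64_t fib(int64_t n) {
  if (n < 2) return n;
  int64_t x = cilk_spawn fib(n - 1);  // may run in parallel with the caller
  int64_t y = fib(n - 2);             // executed by the parent in the meantime
  cilk_sync;                          // wait for the spawned child to return
  return x + y;
}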
So let's look at another example of Cilk.
So let's look at loop parallelism.
So here we want to do a matrix transpose,
and we want to do this in-place.
So the idea here is we want to basically swap
the elements below the diagonal with their mirror
images above the diagonal.
And here's some code to do this.
So we have a cilk_for.
So this is basically a parallel for loop.
It goes from i equals 1 to n minus 1.
And then the inner for loop goes from j equals 0 up
to i minus 1.
And then we just swap a of i j with a of j i,
using these three statements inside the body of the
for loop.
So to execute a for loop in parallel,
you just have to add cilk underscore to the for keyword.
And that's as simple as it gets.
So this code is actually going to run in parallel
and get pretty good speed-up for this particular problem.
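A sketch of that transpose in C, assuming a square n-by-n matrix stored in row-major order:

#include <stdint.h>
#include <cilk/cilk.h>

void transpose(double *A, int64_t n) {
  cilk_for (int64_t i = 1; i < n; ++i) {  // parallel outer loop
    for (int64_t j = 0; j < i; ++j) {     // sequential inner loop
      double temp = A[i*n + j];           // swap A[i][j] with A[j][i]
      A[i*n + j] = A[j*n + i];
      A[j*n + i] = temp;
    }
  }
}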
And internally, Cilk for loops are
transformed into nested cilk_spawn and cilk_sync calls.
So the compiler is going to get rid of the cilk_for
and change it into cilk_spawn and cilk_sync.
So it's going to recursively divide the iteration
space into half, and then it's going to spawn off one half
and then execute the other half in parallel with that,
and then recursively do that until the iteration
range becomes small enough, at which point
it doesn't make sense to execute it in parallel anymore,
so we just execute that range sequentially.
So that's loop parallelism in Cilk.
Any questions?
Yes?
AUDIENCE: How does it know [INAUDIBLE] something weird,
can it still do that?
JULIAN SHUN: Yeah, so the compiler
can actually figure out what the iteration space is.
So you don't necessarily have to be incrementing by 1.
You can do something else.
You just have to guarantee that all of the iterations
are independent.
So if you have a determinacy race
across the different iterations of your cilk_for loop,
then your result might not necessarily be correct.
So you have to make sure that the iterations are, indeed,
independent.
Yes?
AUDIENCE: Can you nest cilk_fors?
JULIAN SHUN: Yes, so you can nest cilk_fors.
But it turns out that, for this example,
usually, you already have enough parallelism
in the outer loop for large enough values of n,
so it doesn't make sense to put a cilk_for loop inside,
because using a cilk_for loop adds some additional overheads.
But you can actually do nested cilk_for loops.
And in some cases, it does make sense,
especially if there's not enough parallelism
in the outermost for loop.
So good question.
Yes?
AUDIENCE: What does the assembly code
look like for the parallel code?
JULIAN SHUN: So it has a bunch of calls to the Cilk runtime
system.
I don't know all the details, because I haven't
looked at this recently.
But I think you can actually generate
the assembly code using a flag in the Clang compiler.
So that's a good exercise.
AUDIENCE: Yeah, you probably want
to look at the LLVM IR, rather than the assembly,
to begin with, to understand what's going on.
It has three instructions that are not
in the standard LLVM, which were added to support parallelism.
Those things, when it's lowered into assembly,
each of those instructions becomes
a bunch of assembly language instructions.
So you don't want to mess with looking at it in the assembler
until you see what it looks like in the LLVM first.
JULIAN SHUN: So good question.
Any other questions about this code here?
OK, so let's look at another example.
So let's say we had this for loop
where, on each iteration i, we're
just incrementing a variable sum by i.
So this is essentially going to compute
the summation of everything from i equals 0 up to n minus 1,
and then print out the result.
So one straightforward way to try to parallelize this code
is to just change the for to cilk_for.
So does this code work?
Who thinks that this code doesn't work?
Or doesn't compute the correct result?
So about half of you.
And who thinks this code does work?
So a couple people.
And I guess the rest of the people don't care.
So it turns out that it's not actually necessarily going
to give you the right answer.
Because the cilk_for loop says you
can execute these iterations in parallel,
but they're all updating the same shared variable sum here.
So you have what's called a determinacy race, where
multiple processors can be writing to the same memory
location.
We'll talk much more about determinacy races
in the next lecture.
But for this example, it's not necessarily
going to work if you run it on more than one processor.
And Cilk actually has a nice way to deal with this.
So in Cilk, we have something known as a reducer.
This is one example of a hyperobject,
which I mentioned earlier.
And with a reducer, what you have to do
is, instead of declaring the sum variable to just
have an unsigned long data type,
use this macro called CILK_C_REDUCER_OPADD, which
specifies we want to create a reducer with the addition
function.
Then we have the variable name sum,
the data type unsigned long, and then the initial value 0.
And then we have a macro to register this reducer,
so a CILK_C_REGISTER_REDUCER.
And then now, inside this cilk_for loop,
we can increment REDUCER_VIEW of sum,
which is another macro, by i.
And you can actually execute this in parallel,
and it will give you the same answer
that you would get if you ran this sequentially.
So the reducer will take care of this determinacy race for you.
And at the end, when you print out this result,
you'll see that the sum is equal to the sum that you expect.
And then after you finish using the reducer,
you use this other macro called CILK_C_UNREGISTER_REDUCER(sum)
that tells the system that you're done using this reducer.
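Putting those pieces together, a sketch of the reducer version might look like this, using the macros just described; the wrapper function and the way n is passed in are assumptions for the sketch:

#include <stdio.h>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

void sum_example(unsigned long n) {
  // Declare an addition reducer named sum of type unsigned long, initialized to 0.
  CILK_C_REDUCER_OPADD(sum, ulong, 0);
  CILK_C_REGISTER_REDUCER(sum);
  cilk_for (unsigned long i = 0; i < n; ++i) {
    REDUCER_VIEW(sum) += i;  // each worker updates its own view; views are combined
  }
  printf("The sum is %lu\n", REDUCER_VIEW(sum));
  CILK_C_UNREGISTER_REDUCER(sum);
}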
So this is one way to deal with this problem
when you want to do a reduction.
And it turns out that there are many other interesting
reduction operators that you might want to use.
And in general, you can create reducers for monoids.
And monoids are algebraic structures
that have an associative binary operation as well
as an identity element.
So the addition operator is a monoid,
because it's associative, it's binary,
and the identity element is 0.
Cilk also has several other predefined reducers,
including multiplication, min, max, and, or, xor, et cetera.
So these are all monoids.
And you can also define your own reducer.
So in fact, in the next homework,
you'll have the opportunity to play around with reducers
and write a reducer for lists.
So that's reducers.
Another nice thing about Cilk is that there's always
a valid serial interpretation of the program.
So the serial elision of a Cilk program
is always a legal interpretation.
And for the Cilk source code on the left,
the serial elision is basically the code
you get if you get rid of the cilk_spawn
and cilk_sync statements.
And this looks just like the sequential code.
And remember that the Cilk keywords grant permission
for parallel execution, but they don't necessarily
command parallel execution.
So if you ran this Cilk code using a single core,
it wouldn't actually create these parallel tasks,
and you would get the same answer
as the sequential program.
And this-- the serial elision-- is also
a correct interpretation.
So unlike other solutions, such as TBB and Pthreads,
it's actually difficult, in those environments,
to get a program that does what the sequential program does.
Because they're actually doing a lot of additional work
to set up these parallel calls and create these argument
structures and other scheduling constructs.
Whereas in Cilk, it's very easy just
to get this serial elision.
You just define cilk_spawn and cilk_sync to be null.
You also define cilk_for to be for.
And then this gives you a valid sequential program.
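Those definitions might look something like this-- a minimal sketch of how the Cilk keywords can be defined away to obtain the serial elision:

// Obtain the serial elision by defining the Cilk keywords away.
#define cilk_spawn
#define cilk_sync
#define cilk_for for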
So when you're debugging code, you
might first want to check if the serial elision of your Cilk
program is correct, and you can easily
do that by using these macros.
Or actually, there's actually a compiler flag
that will do that for you and give you the equivalent C
program.
So this is a nice way to debug, because you
don't have to start with the parallel program.
You can first check if this serial program is correct
before you go on to debug the parallel program.
Questions?
Yes?
AUDIENCE: So does cilk_for--
does each iteration of the cilk_for become its own task
that the scheduler decides if it wants to execute in parallel,
or if it executes in parallel, do all of the iterations
execute in parallel?
JULIAN SHUN: So it turns out that by default,
it groups a bunch of iterations together into a single task,
because it doesn't make sense to break it down
into such small chunks, due to the overheads of parallelism.
But there's actually a setting you
can do to change the grain size of the for loop.
So you could actually make it so that each iteration
is its own task.
And then the scheduler will
decide how to map these different tasks
onto different processors, or even
if it wants to execute any of these tasks in parallel.
So good question.
So the idea in Cilk is to allow the programmer
to express logical parallelism in an application.
So the programmer just has to identify
which pieces of the code could be executed in parallel,
but doesn't necessarily have to determine which pieces of code
should be executed in parallel.
And then Cilk has a runtime scheduler
that will automatically map the executing
program onto the available processor cores at runtime.
And it does this dynamically using
a work-stealing scheduling algorithm.
And the work-stealing scheduler is
used to balance the tasks evenly across
the different processors.
And we'll talk more about the work-stealing scheduler
in a future lecture.
But I want to emphasize that unlike the other concurrency
platforms that we looked at today,
Cilk's work-stealing scheduling algorithm is theoretically
efficient, whereas the OpenMP and TBB schedulers are not
theoretically efficient.
So this is a nice property, because it will guarantee you
that the algorithms you write on top of Cilk
will also be theoretically efficient.
So here's a high-level illustration
of the Cilk ecosystem.
It's a very simplified view, but I did this
to fit it on a single slide.
So what you do is you take the Cilk source code,
you pass it to your favorite Cilk compiler--
the Tapir/LLVM compiler-- and this
gives you a binary that you can run on multiple processors.
And then you pass a program input to the binary,
you run it on however many processors you have,
and then this allows you to benchmark the parallel
performance of your program.
You can also do serial testing.
And to do this, you just obtain a serial elision of the Cilk
program, and you pass it to an ordinary C or C++ compiler.
It generates a binary that can only run on a single processor,
and you run your suite of serial regression tests
on this single threaded binary.
And this will allow you to benchmark the performance
of your serial code and also debug any issues
that might have arisen when you were running
this program sequentially.
Another way to do this is you can actually just
compile the original Cilk code but run it
on a single processor.
So there's a command line argument
that tells the runtime system how many processors you
want to use.
And if you set that parameter to 1,
then it will only use a single processor.
And this allows you to benchmark the single threaded performance
of your code as well.
And the parallel program executing on a single core
should behave exactly the same way
as the execution of this serial elision.
So that's one of the advantages of using Cilk.
And because you can easily do serial testing using the Cilk
platform, this allows you to separate out
the serial correctness from the parallel correctness.
As I said earlier, you can first debug the serial correctness,
as well as any performance issues, before moving on
to the parallel version.
And another point I want to make is
that because Cilk actually uses the serial program
inside its task, it's actually good to optimize
the serial program even when you're
writing a parallel program, because optimizing
the serial program for performance
will also translate to better parallel performance.
Another nice feature of Cilk is that it
has this tool called Cilksan, which
stands for Cilk Sanitizer.
And Cilksan will detect any determinacy races
that you have in your code, which will significantly
help you with debugging the correctness
as well as the performance of your code.
So if you compile the Cilk code using the Cilksan flag,
it will generate an instrumented binary that, when you run it,
will find and localize all the determinacy races
in your program.
So it will tell you where the determinacy races occur,
so that you can go inspect that part of your code
and fix it if necessary.
So this is a very useful tool for debugging
your parallel programs.
Cilk also has another nice tool called Cilkscale.
Cilkscale is a performance analyzer.
It will analyze how much parallelism
is available in your program as well as the total amount
of work that it's doing.
So again, you pass a flag to the compiler that
will turn on Cilkscale, and it will generate
a binary that is instrumented.
And then when you run this code, it
will give you a scalability report.
So you'll find these tools very useful when
you're doing the next project.
And we'll talk a little bit more about these two tools
in the next lecture.
And as I said, Cilkscale will analyze how well your program
will scale to larger machines.
So it will basically tell you the maximum number
of processors that your code could possibly take advantage
of.
Any questions?
Yes?
AUDIENCE: What do you mean when you say runtime?
JULIAN SHUN: So I mean the scheduler-- the Cilk runtime
scheduler that's scheduling the different tasks when
you're running the program.
AUDIENCE: So that's included in the binary.
JULIAN SHUN: So it's linked from the binary.
It's not stored in the same place.
It's linked.
Other questions?
So let me summarize what we looked at today.
So first, we saw that most processors today
have multiple cores.
And probably all of your laptops have more than one core on it.
Who has a laptop that only has one core?
AUDIENCE: [INAUDIBLE].
JULIAN SHUN: When did you buy it?
Probably a long time ago.
So nowadays, obtaining high performance on your machines
requires you to write parallel programs.
But parallel programming can be very hard,
especially if you have to program directly
on the processor cores and interact with the operating
system yourself.
So Cilk is very nice, because it abstracts the processor cores
from the programmer, it handles synchronization
and communication protocols, and it also performs
provably good load-balancing.
And in the next project, you'll have a chance
to play around with Cilk.
You'll be implementing your own parallel screensaver,
so that's a very fun project to do.
And possibly, in one of the future lectures,
we'll post some of the nicest screensavers
that students developed for everyone to see.
OK, so that's all.