Why Computers Can't Count Sometimes

  • This is a brilliant tweet.

  • But I don't want you to pay attention to the tweet.

  • It's good, sure, but I want you to watch the numbers that are underneath it.

  • That's a screen recording,

  • and the numbers are going up and down, all over the place.

  • They should steadily rise, but they don't.

  • There aren't that many people tapping 'like' by mistake and then taking it back.

  • So why can't Twitter just count?

  • You'll see examples like this all over the place.

  • On YouTube, subscriber and view counts sometimes rise and drop seemingly at random,

  • or they change depending on which device you're checking on.

  • Computers should be good at counting, right?

  • They're basically just overgrown calculators.

  • This video that you're watching,

  • whether it's on a tiny little phone screen or on a massive desktop display,

  • it is all just the result of huge amounts of math that turns

  • a compressed stream of binary numbers into amounts of electricity

  • that get sent to either a grid of coloured pixels or a speaker,

  • all in perfect time.

  • Just counting should be easy.

  • But sometimes it seems to fall apart.

  • And that's usually when there's a big, complicated system

  • with lots of inputs and outputs,

  • when something has to be done at scale.

  • Scaling makes things difficult. And to explain why,

  • we have to talk about race conditions, caching, and eventual consistency.

  • All the code that I've talked about in The Basics so far has been single-threaded,

  • because, well, we're talking about the basics.

  • Single-threaded means that it looks like a set of instructions

  • that the computer steps through one after the other.

  • It starts at the top, it works its way through, ignoring everything else,

  • and at the end it has Done A Thing.

  • Which is fine, as long as that's the only thread,

  • the only thing that the computer's doing,

  • and it's the only computer doing it.

  • Fine for old machines like this,

  • but for complicated, modern systems, that's never going to be the case.

  • Most web sites are, at their heart, just a fancy front end to a database.

  • YouTube is a database of videos and comments.

  • Twitter is a database of small messages.

  • Your phone company's billing site is a database of customers and bank accounts.

  • But the trouble is that a single computer holding a single database can only deal with

  • so much input at once.

  • Receiving a request, understanding it, making the change, and sending the response back:

  • all of those take time,

  • so there are only so many requests that can fit in each second.

  • And if you try and handle multiple requests at once,

  • there are subtle problems that can show up.

  • Let's say that YouTube wants to count one view of a video.

  • It just has the job of adding one to the view count.

  • Which seems really simple, but it's actually three separate smaller jobs.

  • You have to read the view count,

  • you have to add one to it,

  • and then you have to write that view count back into the database.
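
As a rough illustration of those three smaller jobs, here's a minimal sketch in Python. The in-memory `views` dictionary and the `count_one_view` function are invented stand-ins for the real database and update code, not anything YouTube actually runs.

```python
# A minimal sketch of the read / add one / write-back sequence.
# The `views` dict stands in for the real database (illustration only).
views = {"video_123": 41}

def count_one_view(video_id: str) -> None:
    current = views[video_id]      # 1. read the view count
    new_total = current + 1        # 2. add one to it
    views[video_id] = new_total    # 3. write it back into the "database"

count_one_view("video_123")
print(views["video_123"])          # 42
```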

  • If two requests come along very close to each other,

  • and they're assigned to separate threads,

  • it is entirely possible that the second thread

  • could read the view count

  • while the first thread is still doing its calculation.

  • And yeah, that's a really simple calculation, it's just adding one,

  • but it still takes a few ticks of a processor.

  • So both of those write processes would put the same number back into the database,

  • and we've missed a view.

  • On popular videos, there'll be collisions like that all the time.

  • Worst case, you've got ten or a hundred of those requests all coming in at once,

  • and one gets stuck for a while for some reason.

  • It'll still add just one to the original number that it read,

  • and then, much later,

  • it'll finally write its result back into the database.

  • And we've lost any number of views.

  • In early databases, having updates that collided like that could corrupt the entire system,

  • but these days things will generally at least keep working,

  • even if they're not quite accurate.

  • And given that YouTube has to work out not just views,

  • but ad revenue and money,

  • it has got to be accurate.

  • Anyway, that's a basic race condition:

  • when the code's trying to do two or more things at once,

  • and the result changes depending on the order they occur in,

  • an order that you cannot control.
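
To make that race condition concrete, here's a hedged sketch: a hundred threads each run the same read-add-write steps from the earlier snippet, with a tiny artificial delay standing in for those few ticks of processing. The final total usually comes out far below one hundred, because later writes overwrite earlier ones.

```python
import threading
import time

views = {"video_123": 0}

def count_one_view(video_id: str) -> None:
    current = views[video_id]        # read the current count
    time.sleep(0.001)                # pretend the calculation takes a few ticks
    views[video_id] = current + 1    # write back -- may overwrite another thread's update

threads = [threading.Thread(target=count_one_view, args=("video_123",))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 100 views were "counted", but the stored total is usually far lower.
print(views["video_123"])
```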

  • One solution is to put all the requests in a queue,

  • and refuse to answer any requests until the previous one is completed.

  • That's how that single-threaded, single-computer programming works.

  • It's how these old machines work.

  • Until the code finishes its task and says "okay, I'm ready for more now",

  • it just doesn't accept anything else.

  • Fine for simple stuff, does not scale up.

  • A million-strong queue to watch a YouTube video doesn't sound like a great user experience.

  • But that still happens sometimes, for things like buying tickets to a show,

  • where it'd be an extremely bad idea to accidentally sell the same seat to two people.

  • Those databases have to be 100% consistent, so for big shows,

  • ticket sites will sometimes start a queue,

  • and limit the number of people accessing the booking site at once.

  • If you absolutely must count everything accurately, in real time, that's the best approach.
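
Here's a minimal sketch of that one-at-a-time approach, using a lock so that each update has to wait for the previous one to finish before it can read and write. The count stays exact, but every request now queues behind the others; the lock and the names here are illustrative, not how any real ticketing or video site is built.

```python
import threading

views = {"video_123": 0}
views_lock = threading.Lock()        # only one thread may update at a time

def count_one_view(video_id: str) -> None:
    with views_lock:                 # everyone else waits in line here
        current = views[video_id]    # read
        views[video_id] = current + 1  # add one and write back, uninterrupted

threads = [threading.Thread(target=count_one_view, args=("video_123",))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(views["video_123"])            # always exactly 100
```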

  • But for sites dealing with Big Data, like YouTube and Twitter,

  • there is a different solution called eventual consistency.

  • They have lots of servers all over the world,

  • and rather than reporting every view or every retweet right away,

  • each individual server will keep its own count,

  • bundle up all the view counts and statistics that it's dealing with,

  • and it will just update the central system when there's time to do so.

  • Updates don't have to be hours apart,

  • they can be minutes or even just seconds,

  • but having a few bundled updates that can be queued and dealt with individually

  • is a lot easier on the central system

  • than having millions of requests all being shouted at once.
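
A hedged sketch of that bundling idea: each server keeps its own local tally and only occasionally pushes one combined update to the central store. `LocalCounter`, `flush_to_central`, and the `Counter`-backed "central database" are all invented for illustration; real systems use far more robust machinery.

```python
import time
from collections import Counter

central_views = Counter()                     # stands in for the central database

class LocalCounter:
    """One per server: tallies views locally, pushes them to the centre in batches."""

    def __init__(self, flush_interval: float = 5.0):
        self.pending = Counter()
        self.flush_interval = flush_interval
        self.last_flush = time.monotonic()

    def record_view(self, video_id: str) -> None:
        self.pending[video_id] += 1           # cheap local update, no central round-trip
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush_to_central()

    def flush_to_central(self) -> None:
        central_views.update(self.pending)    # one bundled update instead of thousands
        self.pending.clear()
        self.last_flush = time.monotonic()

server = LocalCounter(flush_interval=5.0)
for _ in range(1000):
    server.record_view("video_123")
server.flush_to_central()                     # totals become consistent... eventually
print(central_views["video_123"])             # 1000
```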

  • Actually, for something on YouTube's scale,

  • that central database won't just be one computer:

  • it'll be several, and they'll all be keeping each other up to date,

  • but that is a mess we really don't want to get into right now.

  • Eventual consistency isn't right for everything.

  • On YouTube, if you're updating something like the privacy settings of a video,

  • it's important that it's updated immediately everywhere.

  • But compared to views, likes and comments, that's a really rare thing to happen,

  • so it's OK to stop everything, put everything else on hold,

  • spend some time sorting out that important change, and come back later.

  • But views and comments, they can wait for a little while.

  • Just tell the servers around the world to write them down somewhere, keep a log,

  • then every few seconds, or minutes, or maybe even hours for some places,

  • those systems can run through their logs,

  • do the calculations and update the central system once everyone has time.

  • All that explains why view counts and subscriber counts sometimes lag on YouTube,

  • why it can take a while to get the numbers sorted out in the end,

  • but it doesn't explain the up-and-down numbers you saw at the start in that tweet.

  • That's down to another thing: caching.

  • It's not just writing into the database that's bundled up. Reading is too.

  • If you have thousands of people requesting the same thing,

  • it really doesn't make sense to have them all hit the central system

  • and have it do the calculations every single time.

  • So if Twitter are getting 10,000 requests a second for information on that one tweet,

  • which is actually a pretty reasonable amount for them,

  • it'd be ridiculous for the central database to look up all the details and do the numbers every time.

  • So the requests are actually going to a cache,

  • one of thousands, or maybe tens of thousands of caches

  • sitting between the end users and the central system.

  • Each cache looks up the details in the central system once,

  • and then it keeps the details in its memory.

  • For Twitter, each cache might only keep them for a few seconds,

  • so it feels live but isn't actually.

  • But it means only a tiny fraction of that huge amount of traffic

  • actually has to bother the central database:

  • the rest comes straight out of memory on a system that is built

  • just for serving those requests,

  • which is orders of magnitude faster.

  • And if there's a sudden spike in traffic,

  • Twitter can just spin up some more cache servers,

  • put them into the pool that's answering everyone's requests,

  • and it all just keeps working without any worry for the database.

  • But each of those caches will pull that information at a slightly different time,

  • all out of sync with each other.

  • When your request comes in, it's routed to any of those available caches,

  • and crucially it is not going to be the same one every time.

  • They've all got slightly different answers,

  • and each time you're asking a different one.
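
As a rough sketch of why those answers differ: each cache remembers the count it last fetched and only goes back to the central system once its copy is a few seconds old, so two caches that refreshed at different moments will happily report different numbers. The `CountCache` class below is invented for illustration and is nothing like Twitter's real caching layer.

```python
import time

central_counts = {"tweet_42": 10_000}        # stands in for the central database

class CountCache:
    """Serves counts from memory, refreshing from the central store after a short TTL."""

    def __init__(self, ttl_seconds: float = 5.0):
        self.ttl = ttl_seconds
        self.entries = {}                    # key -> (cached value, time it was fetched)

    def get(self, key: str) -> int:
        now = time.monotonic()
        entry = self.entries.get(key)
        if entry is None or now - entry[1] > self.ttl:
            entry = (central_counts[key], now)   # the only time we bother the central system
            self.entries[key] = entry
        return entry[0]                          # possibly a few seconds out of date

cache_a = CountCache()
cache_b = CountCache()

print(cache_a.get("tweet_42"))      # 10000 -- cache A fetches and remembers this
central_counts["tweet_42"] += 500   # more likes arrive at the central system
print(cache_b.get("tweet_42"))      # 10500 -- cache B fetches later, sees the newer total
print(cache_a.get("tweet_42"))      # still 10000 until cache A's copy expires
```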

  • Eventual consistency means that everything will be sorted out at some point.

  • We won't lose any data, but it might take a while before it's all in place.

  • Sooner or later the flood of retweets will stop, or your view count will settle down,

  • and once the dust has settled everything can finally get counted up.

  • But until then: give YouTube and Twitter a little leeway.

  • Counting things accurately is really difficult.

  • Thank you very much to the Centre for Computing History here in Cambridge,

  • who've let me film with all this wonderful old equipment,

  • and to all my proofreading team who made sure my script's right.
