  • I'm glad you came to hear about the new parser coming to Node.

  • A bit about who I am: I'm Fedor Indutny.

  • Here are my Twitter and GitHub handles, which are the same.

  • And I write code at PayPal.

  • You might know me by this dark avatar I use on GitHub and Twitter and practically everywhere

  • else.

  • For today's talk, the slides are already up online.

  • If you would like, you can scan the QR code and open your browser and follow along as

  • I'm going to present the topic.

  • Which is llhttp.

  • The new HTTP protocol parser for Node.js.

  • This has deep roots in the history of Node.js.

  • And it would be hard, if not impossible, not to mention that history while describing it.

  • Of course, Node.js is used a lot for front-end tooling.

  • But originally, historically, it started as a backend platform.

  • It was exclusively about building asynchronous HTTP servers.

  • And the creator of Node, HTTP parser is what started as node.

  • There's presentations on the importance and role and structure.

  • It powers the Node.js event loop, and it has been key ever since Node was set up.

  • And you might remember if there's a dependency, however, it is not.

  • HTTP parser is the oldest of all the dependencies that Node has ever had.

  • It predates all the other dependencies in Node.

  • Initially it was inspired by Mongrel, a Ruby web server with its own parser created by

  • Zed Shaw.

  • And later, parts of the Nginx code were introduced into HTTP parser, the

  • original one.

  • And, of course, the code itself.

  • The parser has been with us since 2009.

  • It's been more than ten years.

  • And in fact, Node itself was introduced at this very conference ten years ago by

  • Ryan.

  • So, it's kind of a jubilee for both of those projects.

  • And I wrote another HTTP parser to replace the original parser.

  • So, why would anyone want to get rid of such a library?

  • Of course, it's a fantastic library.

  • It has been with us for ten years.

  • It should have worked fine.

  • And it has many users outside of the Node.js community as well.

  • For example, Android proxy by Google.

  • They use it for parsing HTTP requests and it's quite a popular project as well.

  • What makes this parser great and why did it stay?

  • First, good enough performance.

  • It takes quite a bit of time to invoke a C function from JavaScript, and the parser is written in C. It takes

  • way less time than that to actually parse the request.

  • I'm going to elaborate on it a bit later in this presentation.

  • But for NodeJS purposes, it's very good performance.

  • Couldn't be better.

  • It also supports a lot of clients and servers that violate the HTTP specification.

  • There are way too many of them out on the Internet.

  • You would be surprised how many.

  • And, of course, it was very important for early adoption of NodeJS.

  • Because in 2009 there were even more such clients and servers out there.

  • Another point is that the original parser has a lot of tests.

  • So, over ten years Ryan, Ben and other maintainers of the project, including myself, have written

  • quite a comprehensive test suite that covers almost every aspect of the HTTP specification.

  • So, the parser is well tested.

  • So, those were the good points of the original parser.

  • Now we come to other points.

  • Unfortunately, the code of the HTTP parser library became quite rigid.

  • It became impossible to move things around, to make significant changes to it.

  • And as a consequence of this, it became impossible to maintain this library efficiently.

  • Furthermore, as one of the maintainers of the project, I had to constantly relearn

  • the parts of the codebase that I had been familiar with before.

  • And I did it on every pull request, and still I wasn't sure if the code was going to run

  • the way I expected it to run.

  • So, it could introduce some unexpected behavior.

  • Or maybe a security vulnerability as well.

  • Which is obviously bad.

  • It doesn't help either that most Node.js users and developers are familiar with JavaScript

  • and are more comfortable with it than they are with C. So, there were not too many people

  • interested in working on this HTTP parser.

  • With all this in mind, several years ago I set out on a quest to make the library better

  • and maybe a bit faster in the process.

  • First attempts were quite conservative.

  • So, I tried to stay with existing code as much as possible.

  • And some of them were successful.

  • Like replacing direct accesses to the parser state with macros: using them consistently not only

  • improved the code, but also made it faster.

  • Which was nice.

  • Other attempts were a complete disaster.

  • I tried to move those states into separate functions and just call them from the loop,

  • so each state would return the next state that was supposed to be executed, and then

  • the loop would run the function for it.

  • This completely ruined the performance, and I never contributed it or sent it upstream.
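
As a rough illustration of the function-per-state design described above, here is a minimal TypeScript sketch. It is not the code from that experiment (which was written in C); the names `StateFn`, `inMethod`, `inUrl`, `finished`, and `parseRequestLine` are hypothetical. It only shows the shape of the idea: every state is a function that returns the next state, so the loop makes one indirect call per input character.

```typescript
// Hypothetical sketch: each state consumes one character and returns
// the state function that should run next.
type StateFn = (ch: string, out: string[]) => StateFn;

const inMethod: StateFn = (ch, out) => {
  if (ch === " ") return inUrl; // a space ends the method
  out[0] += ch;
  return inMethod;
};

const inUrl: StateFn = (ch, out) => {
  if (ch === " ") return finished; // a space ends the URL
  out[1] += ch;
  return inUrl;
};

// Terminal state: ignores the rest of the line.
const finished: StateFn = () => finished;

function parseRequestLine(input: string): string[] {
  const out = ["", ""]; // [method, url]
  let state: StateFn = inMethod;
  for (let i = 0; i < input.length; i++) {
    // One indirect function call per character.
    state = state(input.charAt(i), out);
  }
  return out;
}
```

Calling `parseRequestLine("GET /index.html HTTP/1.1")` yields the method and the URL as two strings.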

  • Whether those attempts were a success or a failure, the conclusion was clear.

  • It was hard to make an improvement while staying with the existing codebase.

  • Once there was no requirement to reuse as much code as possible, it was no longer required to

  • write it in the same programming language as before.

  • Why not develop it in JavaScript or maybe TypeScript instead?

  • And, of course, JavaScript performance is quite great.

  • But you wouldn't be surprised that it's slower than C, and it takes a lot of effort to reach

  • performance comparable to C libraries when you write programs in JavaScript.

  • Takes a lot of effort.

  • But it's possible.

  • So, I wanted to get this performance optimization out of consideration.

  • So, I decided to just define the parser in TypeScript and still compile it down to

  • a C library.

  • So, the end result would be a C library that could be used in Node and other projects.

  • Which was great, because this way existing users of HTTP parser would be able to transition

  • their code to the new parser, and hopefully the performance would not

  • degrade much in the process, because in the end it's the same programming language.

  • It would have good chances of being the same speed.

  • So, llhttp is the next major version of HTTP parser.

  • It has the same core principles and similar API, which is almost identical.

  • And the way they work is that they scan one character at a time.

  • And during the process they change their internal state, and they can emit header fields or

  • maybe header values, and later on the body.

  • I'm not sure if we're going to wait for this.

  • It's quite slow.

  • Okay.

  • Yeah.

  • I think you probably understand what it means now.

  • At least to some extent.

  • So, by virtue of this one scan, the parser can work without buffering anything

  • at all.

  • So, it doesn't allocate memory itself.

  • And that's great, especially for request bodies, because it can just emit slices of the original

  • buffers that came from the network instead of allocating and copying data.

  • So, the core principle of the HTTP parser is: no copying.

  • That's important.
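
To make the difference between a slice and a copy concrete, here is a small TypeScript sketch. It is only an analogy for what the C library does with pointers into the input buffer; the helper names `bodyView` and `bodyCopy` are hypothetical. `Uint8Array.prototype.subarray` returns a view over the same memory, while `Uint8Array.prototype.slice` allocates and copies.

```typescript
// Hypothetical helper: report a body range as a view, not a copy.
function bodyView(chunk: Uint8Array, start: number, end: number): Uint8Array {
  return chunk.subarray(start, end); // shares memory with `chunk`
}

// For comparison: the copying variant that the parser avoids.
function bodyCopy(chunk: Uint8Array, start: number, end: number): Uint8Array {
  return chunk.slice(start, end); // allocates a new buffer and copies
}

const networkChunk = new Uint8Array([0x68, 0x65, 0x6c, 0x6c, 0x6f]); // "hello"
const sharedView = bodyView(networkChunk, 1, 4);
const copiedBytes = bodyCopy(networkChunk, 1, 4);

// Mutating the original buffer is visible through the view only,
// which shows that the view did not allocate its own storage.
networkChunk[1] = 0x45;
```

The trade-off is that a view is only valid while the original buffer is, so a consumer that needs the data later has to copy it itself.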

  • And as soon as any amount of data arrives, whether it's a whole request or a single byte of a request,

  • HTTP parser is ready to process it.

  • And the output will possibly be partial, because the input may be only half of a request.

  • Like the header names that we have just seen in this animation.

  • In the original version of the parser, this scan was quite naturally implemented as a for loop

  • over the input.

  • Just going over the input byte by byte and doing some things.

  • What it did was described by a huge switch statement over all possible parsing states.

  • Whether it's a header name, a header value, or the value of the Content-Length header,

  • it was all described by the switch statement, and the cases represented different states of the

  • state machine.

  • All of this lived in a single function of 1,000 lines of code, which is quite a terrible

  • idea.
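
For readers who have not seen this style, here is a heavily reduced TypeScript sketch of a loop-plus-switch parser. The original is C and covers far more states; the names `ParseState`, `ParsedHeader`, and `parseHeaderLine` are hypothetical, and only one header line is handled.

```typescript
// Hypothetical, reduced sketch of a for loop with one switch over all states.
type ParseState = "header-name" | "header-value" | "done";

interface ParsedHeader {
  name: string;
  value: string;
}

function parseHeaderLine(input: string): ParsedHeader {
  const header: ParsedHeader = { name: "", value: "" };
  let state: ParseState = "header-name";
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    switch (state) {
      case "header-name":
        if (ch === ":") state = "header-value"; // colon ends the name
        else header.name += ch;
        break;
      case "header-value":
        if (ch === "\r") state = "done"; // CR ends the value
        else if (ch !== " " || header.value !== "") header.value += ch; // skip leading spaces
        break;
      case "done":
        break; // ignore the rest of the line
    }
  }
  return header;
}
```

With three states this is readable; with every state of HTTP in one function, each case interacts with its neighbours and the whole thing becomes hard to change.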

  • So, an obvious improvement would be to break this switch into pieces and make it such that

  • each piece has a precise action.

  • It's sort of the Unix philosophy, I guess.

  • So, each piece would be about doing exactly one small thing at a time.

  • goto statements would be used to jump between states.

  • There would not be as much need for this for loop.

  • With all this in mind, how did I approach this process?

  • I developed a domain-specific language, a DSL, and created a compiler around it, llparse,

  • so, again, double L. And this DSL is used to describe the parsing states in terms

  • of the actions that they perform.

  • So, each state would have several actions assigned to it, and it would perform them

  • and move on to the next state.

  • Because of this, llparse is quite a general compiler; it can be used for other protocols

  • as well.

  • It works better for textual protocols, but I think it's useful.

  • The original parser suffered from a surplus of handwritten code.

  • I selected a few actions that were repeated most in the original library and built the DSL around

  • them.

  • The idea here is that I wanted the description of the new parser to be concise.

  • So, I wanted to write code with as few lines and as few symbols as possible.

  • I wanted to move the most common operations inside of the compiler so that the rest of the code would not

  • have to do the work all over again, as it did in the original parser.

  • Here are a few that the compiler supports.

  • One is match.

  • It takes a sequence of bytes, or a single character, and tries to match it against the input.

  • For example, it could be keep-alive, which is a value of the header named Connection.

  • It's quite a common header and very important.

  • So, it could match this sequence.

  • And when it does, it moves to the next state by taking a reference to that state.
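
The idea behind match can be sketched in plain TypeScript. This is not the llparse API or its generated code; `SequenceMatcher` and `MatchResult` are hypothetical names. The sketch assumes input may arrive in several chunks, so the matcher remembers how many characters of the sequence have matched so far.

```typescript
// Hypothetical sketch of matching a fixed sequence across chunk boundaries.
type MatchResult = "matched" | "mismatch" | "need-more";

class SequenceMatcher {
  private offset = 0; // how many characters of the sequence matched so far
  private readonly sequence: string;

  constructor(sequence: string) {
    this.sequence = sequence;
  }

  feed(chunk: string): MatchResult {
    for (let i = 0; i < chunk.length; i++) {
      if (chunk.charAt(i) !== this.sequence.charAt(this.offset)) {
        return "mismatch"; // a real parser would take its fallback edge here
      }
      this.offset += 1;
      if (this.offset === this.sequence.length) {
        return "matched"; // a real parser would jump to the next state here
      }
    }
    return "need-more"; // the chunk ended in the middle of the sequence
  }
}
```

Feeding `"keep"` and then `"-alive"` to a matcher for `"keep-alive"` reports `"need-more"` first and `"matched"` second, without buffering the first chunk.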

  • At other times the parser needs to check the next character in the incoming data without

  • actually consuming it.

  • And peek can be used for this.

  • It takes a single character, matches it, and moves on to the next state without advancing the input stream.
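
The difference between looking at a character and consuming it can be shown with a tiny cursor. This is an illustrative sketch only; `InputCursor` and its methods are hypothetical names and not part of llparse.

```typescript
// Hypothetical cursor over the input, for illustration only.
class InputCursor {
  private pos = 0;
  private readonly input: string;

  constructor(input: string) {
    this.input = input;
  }

  // Look at the current character without advancing.
  peek(): string {
    return this.input.charAt(this.pos);
  }

  // Return the current character and advance past it.
  next(): string {
    const ch = this.input.charAt(this.pos);
    this.pos += 1;
    return ch;
  }
}
```

Calling `peek()` twice in a row returns the same character, so the state that runs next still sees it in the input.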

  • And speaking of headers, headers like Content-Length have integer values, which are, frankly, described

  • by strings; it's a textual protocol.

  • The new parser has to be able to parse these integer strings.

  • And the way I decided to do it was to implement a select method in the DSL, which takes a map as

  • an input.

  • And this map has sequences or characters as keys.

  • And integers as values.

  • So, it tries to map the sequence to an integer and just passes the value along to the next state.

  • The next state could be storing the integer inside some property of the parser structure, or maybe invoking

  • a user callback with it.
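
Here is a minimal TypeScript sketch of that idea applied to Content-Length: a map from digit characters to integers, with the mapped value passed to a step that accumulates it. It is not the llparse API; `DIGIT_VALUES` and `parseContentLength` are hypothetical names.

```typescript
// Hypothetical map from characters to integers, as a select-style table.
const DIGIT_VALUES = new Map<string, number>();
for (let d = 0; d <= 9; d++) {
  DIGIT_VALUES.set(String(d), d);
}

// Maps each character to an integer and accumulates the result,
// the way a following state could store it in a parser property.
function parseContentLength(text: string): number | null {
  if (text.length === 0) return null;
  let length = 0;
  for (let i = 0; i < text.length; i++) {
    const digit = DIGIT_VALUES.get(text.charAt(i));
    if (digit === undefined) return null; // no key matched: not a digit
    length = length * 10 + digit;
  }
  return length;
}
```

A production parser would also guard against overflow on very long digit strings, which this sketch leaves out.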

  • Speaking of callbacks, there is one special type of callback.

  • It is very important in the life span of both the original and the new HTTP parser.

  • During their execution they need to report ranges of data.

  • For example, header names or header values are reported this way.

  • And given that we have a stream of data that comes to the parser, we have to be able to

  • mark a certain place inside of this stream as the beginning of the range.

  • And then at some other point later on we want to mark the ending of the range.

  • So, between those two points, as you can see, we have our beginning and ending.

  • Between them the callback is going to be invoked for every byte.

  • And it's really, really useful for header names, header values and bodies, and other

  • things that span ranges of input.
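
A range callback of this kind can be sketched as follows. This is an illustration under assumptions, not the llparse implementation: `RangeTracker` and `RangeCallback` are hypothetical names, and the sketch assumes the callback receives a start and end offset into the current chunk, so that every byte of the range is covered even when the range crosses a chunk boundary.

```typescript
// Hypothetical range tracker; the callback receives offsets into the chunk.
type RangeCallback = (chunk: string, start: number, end: number) => void;

class RangeTracker {
  private start = -1; // -1 means no range is open
  private readonly callback: RangeCallback;

  constructor(callback: RangeCallback) {
    this.callback = callback;
  }

  // Mark the beginning of the range at `offset` in the current chunk.
  begin(offset: number): void {
    this.start = offset;
  }

  // Mark the ending of the range and report [start, offset).
  end(chunk: string, offset: number): void {
    if (this.start === -1) return;
    this.callback(chunk, this.start, offset);
    this.start = -1;
  }

  // The chunk ran out while the range is still open: report what we
  // have and continue from offset 0 of the next chunk.
  flush(chunk: string): void {
    if (this.start === -1) return;
    this.callback(chunk, this.start, chunk.length);
    this.start = 0;
  }
}
```

If a header name arrives as `"Conte"` followed by `"nt-Length: 42"`, the callback fires twice, once per chunk, and the consumer can join the two pieces.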

  • Of course, there are a couple of important actions that I have not mentioned in previous

  • slides that are actually mandatory to have in a state.

  • They are called otherwise and skipTo.

  • And those specify which state of the parser should be reached next if nothing else matches

  • inside of the current state.

  • So, in this example, if the input is A, the parser would move to the A state, and

  • for B, it would move to the B state, and for C or D or E or whatever comes later,

  • it would move to some other state.

  • skipTo is quite similar.

  • It does the same thing but consumes the character from the input stream.

  • otherwise does not change the input stream at all.

  • It just moves on.
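
The A/B example above can be sketched like this. It is not the llparse API: `Transition`, `otherwiseTo`, `skipToState`, and `dispatch` are hypothetical names, and states are represented by plain strings. The only difference between the two fallbacks is how many input characters they consume.

```typescript
// Hypothetical transition: which state comes next and how much input was consumed.
interface Transition {
  next: string;
  consumed: number;
}

// Fallback that leaves the input untouched.
function otherwiseTo(next: string): Transition {
  return { next, consumed: 0 };
}

// Fallback that consumes the current character.
function skipToState(next: string): Transition {
  return { next, consumed: 1 };
}

// "A" goes to state a, "B" goes to state b, anything else takes the fallback.
function dispatch(ch: string, fallback: Transition): Transition {
  if (ch === "A") return { next: "a", consumed: 1 };
  if (ch === "B") return { next: "b", consumed: 1 };
  return fallback;
}
```

With `otherwiseTo("other")`, the character `"C"` is still in the input when the `other` state runs; with `skipToState("other")`, it has already been consumed.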

  • So, that was a bit of a description of the DSL, maybe too concise to be useful.

  • And with this DSL in mind, llhttp becomes a TypeScript program.

  • This program uses the DSL to describe the actions and the input, as said before.

  • Because it's a TypeScript program, or a JavaScript program, really, I could split it into several

  • submodules and use them efficiently.

  • And each submodule could have its own subparser.

  • This is what I use in llhttp.

  • I have a separate parser inside of it that can be run separately and can be

  • used separately as a library as well, if anyone wants it.

  • llparse compiles this TypeScript program down to C. And that's the main action of it.

  • Note that because it uses a stable DSL, oh, sorry, just uses a DSL, the parser doesn't need

  • to do any parsing.

  • It's done automatically by the JS engine.

  • So, V8 does it for us.

  • V8 does this itself internally.

  • llparse builds a graph of states, which I will try to show you.

  • It looks kind of terrible.

  • But I probably can zoom in.

  • Yeah.

  • So, yeah, here is how it looks in practice.

  • I can probably actually show you something more useful.

  • So, here on the right you see ACL, BIND, CHECKOUT.

  • Method names supported by the parser.

  • And when the name is matched in the input,

  • it will store the integer encoding of the method inside the internal properties of

  • the parser.

  • It works this way more or less.

  • I guess what that means is that this kind of graph looks awesome.

  • So, that's one of the reasons to have it.

  • And another reason to have it is that llparse can do static analysis on this graph.

  • Before, in the original parser, there was no way to reason about the states automatically.