Presentation material for PyFes 2012.11.
This introduces the architecture of meinheld, with a focus on system calls.
To explain the architecture I wrote sample implementations in pure Python; they work fine, and since they skip proper HTTP parsing they manage over 10,000 req/sec.
If the flow of the event-driven code is hard to follow, it helps to run it under the tracer, e.g. python -mtrace -t --ignore-module socket webserver1.py
Today I will talk about chasing req/sec under the condition that the server returns a trivial response, like an nginx Lua module that just returns "hello". Servers that do more, such as delivering static files, have other things to think about.
See RFC 2616 for details.
HTTP parsing in Python is not a system call, but it does become a bottleneck, so this article does not parse HTTP properly.
The HTTP request looks like this.
GET / HTTP/1.1
Host: localhost

POST /post HTTP/1.1
Host: localhost
Content-Type: application/x-www-form-urlencoded
Content-Length: 7

foo=bar
The first line is the request-line, of the form method URI HTTP-version. The URI can be an absolute URI that includes the host, or an absolute path that does not; the absolute path form is the more common one.
The request-headers run from the second line down to the blank line. Each line has the form field-name: field-value, and field-name is case-insensitive.
Line breaks from the request-line through the request-headers and the following blank line are CR LF, the line ending you often see on Windows.
When the method is POST and the like, a message-body follows the blank line. The kind of data in the message-body is given by the Content-Type header, and its size by the Content-Length header. Content-Length can be omitted, but we'll leave that for later.
The server may be hosting VirtualHosts, so when the request-line does not carry an absolute URI, add a Host header that names the host.
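The servers in this article skip parsing, but to make the format above concrete, here is a minimal sketch of how it maps to code (my addition; real parsing per RFC 2616 is considerably hairier):

def parse_request(data):
    # data: raw request bytes, e.g.
    # b"POST /post HTTP/1.1\r\nHost: localhost\r\nContent-Length: 7\r\n\r\nfoo=bar"
    head, _, body = data.partition(b"\r\n\r\n")  # headers end at the blank line
    lines = head.split(b"\r\n")
    method, uri, version = lines[0].split()      # request-line: method URI HTTP-version
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()  # field-name is case-insensitive
    return method, uri, version, headers, body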
The HTTP response looks like this.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 5

Hello
It's almost the same as an HTTP request, except that the first line is a status-line.
The status-line has the form http-version status-code reason-phrase, e.g. "200 OK" or "404 Not Found".
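To make the Content-Length bookkeeping concrete, a minimal sketch (my addition) that builds a response whose Content-Length always matches the body:

def make_response(body, content_type=b"text/plain"):
    # status-line and headers, each terminated by CR LF,
    # then a blank line, then the message-body
    return (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: " + content_type + b"\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n"
            b"\r\n" + body)

# make_response(b"Hello") reproduces the response above byte for byte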
A web server is a TCP server that receives HTTP requests and returns HTTP responses. Its work breaks down into these steps:

1. bind & listen
2. accept a new connection
3. recv the request
4. process the request and make the response
5. send the response
6. close the connection

You may use read, readv, or recvmsg instead of recv in step 3, and write, writev, or sendmsg instead of send in step 5.
This time we are digging into req/sec, so we ignore step 4. Strictly speaking we would also have to support Keep-Alive, but we ignore that as well for now.
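As a small aside (my addition, not part of the original samples): sendmsg() is the scatter-gather counterpart of send(), exposed as socket.sendmsg() since Python 3.3, and it lets a header and body that live in separate buffers go out in a single system call:

def send_response(con, header, body):
    # one system call for both buffers (like writev); note that,
    # just like send(), sendmsg() may transmit only part of the data
    con.sendmsg([header, body])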
Written in Python, the steps up to this point look like this (the request is not parsed):
webserver1.py
import socket


def server():
    # 1: bind & listen
    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('', 8000))
    server.listen(100)

    while 1:
        # 2: accept new connection
        con, _ = server.accept()
        # 3: read request
        con.recv(32*1024)
        # 4: process request and make response
        # 5: send response
        con.sendall(b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 5\r
\r
hello""")
        # 6: close connection
        con.close()


if __name__ == '__main__':
    server()
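If you run this and hit it with curl http://localhost:8000/, you should get hello back; a load generator such as ab gives you the req/sec numbers this article is chasing.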
The sample code above can talk to only one client at a time. There are a few basic ways to handle many connections in parallel, and countless ways to combine them. The basic ones are the following.

Splitting the work between a thread/process that accepts and threads/processes that do the recv/send is called the worker model. Its advantage is that the number of worker threads/processes is easy to adjust dynamically, but the connection has to be handed from the accepting thread/process to a worker, and the context switches add load.

Forking first and letting every process run the whole accept()-to-close() sequence itself is called the prefork model. The per-connection path stays simple, so you can hope for maximum performance. However, with few threads/processes it cannot serve more concurrent connections than that, and with many of them the context switches become the burden.

Doing each operation when it becomes possible, accept when you can accept, recv when you can recv, send when you can send, is called the event-driven model. It needs no context switches, but each wait is its own system call, and that has overhead.
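Of these three, the worker model is the only one that gets no full sample below, so here is a minimal sketch of it (my addition, Python 3): one accepting thread hands connections to worker threads over a queue.

import queue
import socket
import threading

def worker(q):
    # workers take accepted connections from the queue and serve them
    while 1:
        con = q.get()
        con.recv(32 * 1024)
        con.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
        con.close()

def server():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', 8000))
    sock.listen(100)
    q = queue.Queue()
    for _ in range(4):  # the worker count is easy to adjust at runtime
        threading.Thread(target=worker, args=(q,), daemon=True).start()
    while 1:
        con, _ = sock.accept()
        q.put(con)  # the hand-off that costs a context switch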
For a server that really only returns hello, it's best to create as many processes as there are cores with a simple prefork model.
webserver2.py
import multiprocessing
import socket
import time


def worker(sock):
    while 1:
        con, _ = sock.accept()
        con.recv(32 * 1024)
        con.sendall(b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 5\r
\r
hello""")
        con.close()


def server():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', 8000))
    sock.listen(100)

    ncpu = multiprocessing.cpu_count()
    procs = []
    for i in range(ncpu):
        proc = multiprocessing.Process(target=worker, args=(sock,))
        proc.start()
        procs.append(proc)

    try:
        # the parent just sleeps; the workers do all the accepting
        while 1:
            time.sleep(0.5)
    except KeyboardInterrupt:
        # clean up the workers on Ctrl-C
        for proc in procs:
            proc.terminate()
            proc.join(1)


if __name__ == '__main__':
    server()
If I remember correctly, gunicorn's sync worker has this architecture.
The prefork server above is the fastest as long as all it does is return hello, but its practical problem is that when receiving a request or sending a response takes a long time, the process cannot move on to the next request and the CPU ends up sitting idle.
So when using gunicorn's sync worker, the recommendation is to put nginx in front and let it serve static files and buffer requests and responses.
However, a two-tier setup halves the speed. So instead, each process adopts an event-driven model such as epoll so that it can handle time-consuming receives and sends itself.
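The samples in this article use select() so they run anywhere; on Linux the same one-shot wait_read/wait_write bookkeeping would normally sit on top of epoll. A minimal sketch of just that waiting layer (my addition):

import select

ep = select.epoll()
callbacks = {}  # one pending callback per fd, which is all these servers need

def wait_read(con, callback):
    callbacks[con.fileno()] = callback
    ep.register(con.fileno(), select.EPOLLIN)

def wait_write(con, callback):
    callbacks[con.fileno()] = callback
    ep.register(con.fileno(), select.EPOLLOUT)

def evloop():
    while 1:
        for fd, _event in ep.poll():
            ep.unregister(fd)  # one-shot, like popping from the dicts below
            callbacks.pop(fd)()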
If you make everything event-driven as in the following code, the event-wait system call gets attached to every accept, recv, and send, and the added overhead slows hello down.
webserver4.py
import socket
import select

read_waits = {}
write_waits = {}


def wait_read(con, callback):
    read_waits[con.fileno()] = callback


def wait_write(con, callback):
    write_waits[con.fileno()] = callback


def evloop():
    while 1:
        rs, ws, xs = select.select(read_waits.keys(), write_waits.keys(), [])
        # callbacks are one-shot: popped before being called
        for rfd in rs:
            read_waits.pop(rfd)()
        for wfd in ws:
            write_waits.pop(wfd)()


class Server(object):
    def __init__(self, con):
        self.con = con

    def start(self):
        wait_read(self.con, self.on_acceptable)

    def on_acceptable(self):
        con, _ = self.con.accept()
        con.setblocking(0)
        Client(con)
        wait_read(self.con, self.on_acceptable)


class Client(object):
    def __init__(self, con):
        self.con = con
        wait_read(con, self.on_readable)

    def on_readable(self):
        data = self.con.recv(32 * 1024)
        self.buf = b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 6\r
\r
hello
"""
        wait_write(self.con, self.on_writable)

    def on_writable(self):
        wrote = self.con.send(self.buf)
        self.buf = self.buf[wrote:]
        if self.buf:
            wait_write(self.con, self.on_writable)
        else:
            self.con.close()


def serve():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', 8000))
    sock.listen(128)
    server = Server(sock)
    server.start()
    evloop()


if __name__ == '__main__':
    serve()
To reduce this overhead, let's look for places where the wait_read and wait_write calls can be eliminated.
First, the OS establishes TCP connections up to the backlog (the number passed to listen()) on its own, without waiting for accept() (it answers the client's SYN with SYN/ACK). So by the time the application calls accept(), the TCP connection may already be established and the client's request may already have arrived. Therefore, after accept(), call recv() immediately, without wait_read().
Once recv() is done we send the response, and this too can be attempted immediately, because the socket's send buffer should be empty at first. Let's drop that wait_write() as well.
webserver5.py
import socket
import select

read_waits = {}
write_waits = {}


def wait_read(con, callback):
    read_waits[con.fileno()] = callback


def wait_write(con, callback):
    write_waits[con.fileno()] = callback


def evloop():
    while 1:
        rs, ws, xs = select.select(read_waits.keys(), write_waits.keys(), [])
        for rfd in rs:
            read_waits.pop(rfd)()
        for wfd in ws:
            write_waits.pop(wfd)()


class Server(object):
    def __init__(self, con):
        self.con = con

    def start(self):
        wait_read(self.con, self.on_acceptable)

    def on_acceptable(self):
        try:
            # accept until it would block; each new client is served
            # immediately, without going back to the event loop first
            while 1:
                con, _ = self.con.accept()
                con.setblocking(0)
                Client(con)
        except IOError:
            wait_read(self.con, self.on_acceptable)


class Client(object):
    def __init__(self, con):
        self.con = con
        self.on_readable()

    def on_readable(self):
        try:
            data = self.con.recv(32 * 1024)
        except IOError:
            # the request has not arrived yet; wait until it does
            wait_read(self.con, self.on_readable)
            return
        if not data:
            # an empty read means the peer closed the connection
            self.con.close()
            return
        self.buf = b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 6\r
\r
hello
"""
        self.on_writable()

    def on_writable(self):
        wrote = self.con.send(self.buf)
        self.buf = self.buf[wrote:]
        if self.buf:
            wait_write(self.con, self.on_writable)
        else:
            self.con.close()


def serve():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setblocking(0)
    sock.bind(('', 8000))
    sock.listen(128)
    server = Server(sock)
    server.start()
    evloop()


if __name__ == '__main__':
    serve()
This method also combines very well with prefork.
A process does not accept the next connection until it has done everything it currently can, so the accept naturally goes to another process that really has nothing to do.
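For illustration, a minimal sketch of that combination (my addition): webserver5.py's Server and evloop are reused unchanged, and every process runs the same event loop on the shared listening socket (Unix fork semantics assumed).

import multiprocessing
import socket

# assumes Server and evloop from webserver5.py are available here

def run(sock):
    Server(sock).start()
    evloop()

def serve_prefork():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setblocking(0)
    sock.bind(('', 8000))
    sock.listen(128)
    # one process per core: fork the children, then join in ourselves
    for _ in range(multiprocessing.cpu_count() - 1):
        multiprocessing.Process(target=run, args=(sock,)).start()
    run(sock)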
It can also mitigate the thundering herd problem. The thundering herd problem is what happens when multiple processes block in accept() in prefork style and a single client connects: every process wakes up, only one of them wins the accept(), and the processes whose accept() fails have been woken for nothing. With, say, 100 processes on a one-core machine, this is unbearable.
As far as accept() itself goes, the thundering herd problem has been solved: modern Linux wakes only one process when a connection comes in. However, if you select() and then accept(), the problem reappears.
By limiting the number of processes to the number of CPU cores, and doing the select()-before-accept() only when a process truly has nothing else to do, the "woken by select() but lose the accept()" situation occurs only when the CPU is idle anyway, so it does no harm.
Compared with plain prefork, the path from accept() to close() now needs only one extra system call: the setblocking(0). Incidentally, recent Linux has a system call named accept4, which lets you make the socket non-blocking at the same time as accepting it.
Everything up to this point is about being the fastest and strongest in user space. Implement the web server in kernel space instead, and you no longer need to issue system calls at all.
https://github.com/KLab/recaro