Presentation material for PyFes 2012.11.
This introduces the architecture of meinheld, with a focus on system calls.
To explain the architecture I wrote sample implementations in pure Python; they work fine, and since they skip proper HTTP parsing they manage over 10,000 req/sec.
If the flow of the event-driven code is hard to follow, it helps to run it under the tracer, e.g. python -mtrace -t --ignore-module socket webserver1.py
Today I will talk about chasing req/sec under the condition that the server returns a trivial response, like an nginx Lua module that just returns "hello". Servers that do more, such as delivering static files, have other things to think about.
See RFC 2616 for details.
HTTP parsing in Python is not a system call, but it does become a bottleneck, so this article does not parse HTTP properly.
The HTTP request looks like this.
GET / HTTP/1.1
Host: localhost

POST /post HTTP/1.1
Host: localhost
Content-Type: application/x-www-form-urlencoded
Content-Length: 7

foo=bar
The first line is the request-line, of the form method URI HTTP-version. The URI can be an absolute URI that includes the host, or an absolute path that does not; the absolute path form is the more common one.
The request-headers run from the second line down to the blank line. Each line has the form field-name: field-value, and field-name is case-insensitive.
Line breaks from the request-line through the request-headers and the following blank line are CR LF, the line ending you often see on Windows.
When the method is POST and the like, a message-body follows the blank line. The kind of data in the message-body is given by the Content-Type header, and its size by the Content-Length header. Content-Length can be omitted, but we'll leave that for later.
The server may be hosting VirtualHosts, so when the request-line does not carry an absolute URI, add a Host header that names the host.
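The servers in this article skip parsing, but to make the format above concrete, here is a minimal sketch of how it maps to code (my addition; real parsing per RFC 2616 is considerably hairier):

def parse_request(data):
    # data: raw request bytes, e.g.
    # b"POST /post HTTP/1.1\r\nHost: localhost\r\nContent-Length: 7\r\n\r\nfoo=bar"
    head, _, body = data.partition(b"\r\n\r\n")  # headers end at the blank line
    lines = head.split(b"\r\n")
    method, uri, version = lines[0].split()      # request-line: method URI HTTP-version
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()  # field-name is case-insensitive
    return method, uri, version, headers, body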
The HTTP response looks like this.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 5

Hello
It's almost the same as an HTTP request, except that the first line is a status-line.
The status-line has the form http-version status-code reason-phrase, e.g. "200 OK" or "404 Not Found".
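To make the Content-Length bookkeeping concrete, a minimal sketch (my addition) that builds a response whose Content-Length always matches the body:

def make_response(body, content_type=b"text/plain"):
    # status-line and headers, each terminated by CR LF,
    # then a blank line, then the message-body
    return (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: " + content_type + b"\r\n"
            b"Content-Length: " + str(len(body)).encode() + b"\r\n"
            b"\r\n" + body)

# make_response(b"Hello") reproduces the response above byte for byte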
A web server is a TCP server that receives HTTP requests and returns HTTP responses. Its work breaks down into these steps:

1. bind & listen
2. accept a new connection
3. recv the request
4. process the request and make the response
5. send the response
6. close the connection

You may use read, readv, or recvmsg instead of recv in step 3, and write, writev, or sendmsg instead of send in step 5.
This time we are digging into req/sec, so we ignore step 4. Strictly speaking we would also have to support Keep-Alive, but we ignore that as well for now.
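As a small aside (my addition, not part of the original samples): sendmsg() is the scatter-gather counterpart of send(), exposed as socket.sendmsg() since Python 3.3, and it lets a header and body that live in separate buffers go out in a single system call:

def send_response(con, header, body):
    # one system call for both buffers (like writev); note that,
    # just like send(), sendmsg() may transmit only part of the data
    con.sendmsg([header, body])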
Written in Python, the steps up to this point look like this (the request is not parsed):
webserver1.py
import socket


def server():
    # 1: bind & listen
    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('', 8000))
    server.listen(100)

    while 1:
        # 2: accept new connection
        con, _ = server.accept()
        # 3: read request
        con.recv(32*1024)
        # 4: process request and make response
        # 5: send response
        con.sendall(b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 5\r
\r
hello""")
        # 6: close connection
        con.close()


if __name__ == '__main__':
    server()
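If you run this and hit it with curl http://localhost:8000/, you should get hello back; a load generator such as ab gives you the req/sec numbers this article is chasing.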
The sample code above can talk to only one client at a time. There are a few basic ways to handle many connections in parallel, and countless ways to combine them. The basic ones are the following.

Splitting the work between a thread/process that accepts and threads/processes that do the recv/send is called the worker model. Its advantage is that the number of worker threads/processes is easy to adjust dynamically, but the connection has to be handed from the accepting thread/process to a worker, and the context switches add load.

Forking first and letting every process run the whole accept()-to-close() sequence itself is called the prefork model. The per-connection path stays simple, so you can hope for maximum performance. However, with few threads/processes it cannot serve more concurrent connections than that, and with many of them the context switches become the burden.

Doing each operation when it becomes possible, accept when you can accept, recv when you can recv, send when you can send, is called the event-driven model. It needs no context switches, but each wait is its own system call, and that has overhead.
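Of these three, the worker model is the only one that gets no full sample below, so here is a minimal sketch of it (my addition, Python 3): one accepting thread hands connections to worker threads over a queue.

import queue
import socket
import threading

def worker(q):
    # workers take accepted connections from the queue and serve them
    while 1:
        con = q.get()
        con.recv(32 * 1024)
        con.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
        con.close()

def server():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', 8000))
    sock.listen(100)
    q = queue.Queue()
    for _ in range(4):  # the worker count is easy to adjust at runtime
        threading.Thread(target=worker, args=(q,), daemon=True).start()
    while 1:
        con, _ = sock.accept()
        q.put(con)  # the hand-off that costs a context switch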
For a server that really only returns hello, it's best to create as many processes as there are cores with a simple prefork model.
webserver2.py
import multiprocessing
import socket
import time


def worker(sock):
    while 1:
        con, _ = sock.accept()
        con.recv(32 * 1024)
        con.sendall(b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 5\r
\r
hello""")
        con.close()


def server():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', 8000))
    sock.listen(100)

    ncpu = multiprocessing.cpu_count()
    procs = []
    for i in range(ncpu):
        proc = multiprocessing.Process(target=worker, args=(sock,))
        proc.start()
        procs.append(proc)

    try:
        # the parent just sleeps; the workers do all the accepting
        while 1:
            time.sleep(0.5)
    except KeyboardInterrupt:
        # clean up the workers on Ctrl-C
        for proc in procs:
            proc.terminate()
            proc.join(1)


if __name__ == '__main__':
    server()
If I remember correctly, gunicorn's sync worker has this architecture.
The prefork server above is the fastest as long as all it does is return hello, but its practical problem is that when receiving a request or sending a response takes a long time, the process cannot move on to the next request and the CPU ends up sitting idle.
So when using gunicorn's sync worker, the recommendation is to put nginx in front and let it serve static files and buffer requests and responses.
However, a two-tier setup halves the speed. So instead, each process adopts an event-driven model such as epoll so that it can handle time-consuming receives and sends itself.
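The samples in this article use select() so they run anywhere; on Linux the same one-shot wait_read/wait_write bookkeeping would normally sit on top of epoll. A minimal sketch of just that waiting layer (my addition):

import select

ep = select.epoll()
callbacks = {}  # one pending callback per fd, which is all these servers need

def wait_read(con, callback):
    callbacks[con.fileno()] = callback
    ep.register(con.fileno(), select.EPOLLIN)

def wait_write(con, callback):
    callbacks[con.fileno()] = callback
    ep.register(con.fileno(), select.EPOLLOUT)

def evloop():
    while 1:
        for fd, _event in ep.poll():
            ep.unregister(fd)  # one-shot, like popping from the dicts below
            callbacks.pop(fd)()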
If you make everything event-driven as in the following code, the event-wait system call gets attached to every accept, recv, and send, and the added overhead slows hello down.
webserver4.py
import socket
import select

read_waits = {}
write_waits = {}


def wait_read(con, callback):
    read_waits[con.fileno()] = callback


def wait_write(con, callback):
    write_waits[con.fileno()] = callback


def evloop():
    while 1:
        rs, ws, xs = select.select(read_waits.keys(), write_waits.keys(), [])
        # callbacks are one-shot: popped before being called
        for rfd in rs:
            read_waits.pop(rfd)()
        for wfd in ws:
            write_waits.pop(wfd)()


class Server(object):
    def __init__(self, con):
        self.con = con

    def start(self):
        wait_read(self.con, self.on_acceptable)

    def on_acceptable(self):
        con, _ = self.con.accept()
        con.setblocking(0)
        Client(con)
        wait_read(self.con, self.on_acceptable)


class Client(object):
    def __init__(self, con):
        self.con = con
        wait_read(con, self.on_readable)

    def on_readable(self):
        data = self.con.recv(32 * 1024)
        self.buf = b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 6\r
\r
hello
"""
        wait_write(self.con, self.on_writable)

    def on_writable(self):
        wrote = self.con.send(self.buf)
        self.buf = self.buf[wrote:]
        if self.buf:
            wait_write(self.con, self.on_writable)
        else:
            self.con.close()


def serve():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', 8000))
    sock.listen(128)
    server = Server(sock)
    server.start()
    evloop()


if __name__ == '__main__':
    serve()
To reduce this overhead, let's look for places where the wait_read and wait_write calls can be eliminated.
First, the OS establishes TCP connections up to the backlog (the number passed to listen()) on its own, without waiting for accept() (it answers the client's SYN with SYN/ACK). So by the time the application calls accept(), the TCP connection may already be established and the client's request may already have arrived. Therefore, after accept(), call recv() immediately, without wait_read().
Once recv() is done we send the response, and this too can be attempted immediately, because the socket's send buffer should be empty at first. Let's drop that wait_write() as well.
webserver5.py
import socket
import select

read_waits = {}
write_waits = {}


def wait_read(con, callback):
    read_waits[con.fileno()] = callback


def wait_write(con, callback):
    write_waits[con.fileno()] = callback


def evloop():
    while 1:
        rs, ws, xs = select.select(read_waits.keys(), write_waits.keys(), [])
        for rfd in rs:
            read_waits.pop(rfd)()
        for wfd in ws:
            write_waits.pop(wfd)()


class Server(object):
    def __init__(self, con):
        self.con = con

    def start(self):
        wait_read(self.con, self.on_acceptable)

    def on_acceptable(self):
        try:
            # accept until it would block; each new client is served
            # immediately, without going back to the event loop first
            while 1:
                con, _ = self.con.accept()
                con.setblocking(0)
                Client(con)
        except IOError:
            wait_read(self.con, self.on_acceptable)


class Client(object):
    def __init__(self, con):
        self.con = con
        self.on_readable()

    def on_readable(self):
        try:
            data = self.con.recv(32 * 1024)
        except IOError:
            # the request has not arrived yet; wait until it does
            wait_read(self.con, self.on_readable)
            return
        if not data:
            # an empty read means the peer closed the connection
            self.con.close()
            return
        self.buf = b"""HTTP/1.1 200 OK\r
Content-Type: text/plain\r
Content-Length: 6\r
\r
hello
"""
        self.on_writable()

    def on_writable(self):
        wrote = self.con.send(self.buf)
        self.buf = self.buf[wrote:]
        if self.buf:
            wait_write(self.con, self.on_writable)
        else:
            self.con.close()


def serve():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setblocking(0)
    sock.bind(('', 8000))
    sock.listen(128)
    server = Server(sock)
    server.start()
    evloop()


if __name__ == '__main__':
    serve()
This method also combines very well with prefork.
A process does not accept the next connection until it has done everything it currently can, so the accept naturally goes to another process that really has nothing to do.
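For illustration, a minimal sketch of that combination (my addition): webserver5.py's Server and evloop are reused unchanged, and every process runs the same event loop on the shared listening socket (Unix fork semantics assumed).

import multiprocessing
import socket

# assumes Server and evloop from webserver5.py are available here

def run(sock):
    Server(sock).start()
    evloop()

def serve_prefork():
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.setblocking(0)
    sock.bind(('', 8000))
    sock.listen(128)
    # one process per core: fork the children, then join in ourselves
    for _ in range(multiprocessing.cpu_count() - 1):
        multiprocessing.Process(target=run, args=(sock,)).start()
    run(sock)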
It can also mitigate the thundering herd problem. The thundering herd problem is what happens when multiple processes block in accept() in prefork style and a single client connects: every process wakes up, only one of them wins the accept(), and the processes whose accept() fails have been woken for nothing. With, say, 100 processes on a one-core machine, this is unbearable.
As far as accept() itself goes, the thundering herd problem has been solved: modern Linux wakes only one process when a connection comes in. However, if you select() and then accept(), the problem reappears.
By limiting the number of processes to the number of CPU cores, and doing the select()-before-accept() only when a process truly has nothing else to do, the "woken by select() but lose the accept()" situation occurs only when the CPU is idle anyway, so it does no harm.
Compared with plain prefork, the path from accept() to close() now needs only one extra system call: the setblocking(0). Incidentally, recent Linux has a system call named accept4, which lets you make the socket non-blocking at the same time as accepting it.
Everything up to this point is about being the fastest and strongest in user space. Implement the web server in kernel space instead, and you no longer need to issue system calls at all.
https://github.com/KLab/recaro