Motivation

-- I signed up for a 1 Gbps fiber line at home, but when I downloaded a large file (such as a Linux ISO) over HTTP, the speed fell far short of that.

Why?

-- There is a limit to the throughput a single TCP connection can achieve.
-- How to check the ~~TCP receive window size~~ on Linux (to be exact, the kernel receive buffer size):

$ cat /proc/sys/net/ipv4/tcp_rmem
4096    87380   6291456

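From left to right: [min, default, max], in bytes.

So the maximum throughput of a single connection is bounded by the window size divided by the round-trip time (RTT):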

T_{max} = win / RTT

Therefore, for a single TCP connection in the default state to a server with RTT = 30 ms, ~~the throughput is only about 87380 [byte] * 8 / 30 [ms] ≒ 23.3 [Mbps]~~. In fact, **if there is no congestion**, the window size keeps growing, so the transfer gets faster than that.
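
As a rough check of the formula, here is a minimal Python sketch that evaluates T_max = win / RTT for the default buffer size, and also the window a 1 Gbps link would need at the same RTT. The function names are made up for illustration, and the model ignores window growth and congestion control entirely.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-connection throughput in bits per second (win / RTT)."""
    return window_bytes * 8 / rtt_seconds


def window_for_bandwidth(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Window size in bytes needed to keep a link of the given bandwidth full."""
    return bandwidth_bps * rtt_seconds / 8


default_window = 87380  # "default" column of tcp_rmem, in bytes
rtt = 0.030             # 30 ms round-trip time

print(f"{max_throughput_bps(default_window, rtt) / 1e6:.1f} Mbps with the default window")
print(f"{window_for_bandwidth(1e9, rtt) / 1e6:.2f} MB window needed for 1 Gbps")
```

Under these assumptions, filling a 1 Gbps link at 30 ms takes a window of a few megabytes, which is within the max value of tcp_rmem shown above.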

Of course, TCP has a window scale option to support high-bandwidth networks. With it, the window size can be extended up to about 1 GByte (whether windows that large are actually used is unknown).
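
The figure of roughly 1 GByte comes from the 16-bit window field in the TCP header being shifted left by a scale factor of at most 14 bits. A small sketch of that arithmetic follows; the constant and function names are hypothetical.

```python
MAX_WINDOW_FIELD = 0xFFFF  # the TCP header's window field is 16 bits wide
MAX_WINDOW_SHIFT = 14      # largest shift count the window scale option allows


def scaled_window(window_field: int, shift: int) -> int:
    """Effective window in bytes: the 16-bit field shifted left by the scale factor."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_WINDOW_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


max_window = scaled_window(MAX_WINDOW_FIELD, MAX_WINDOW_SHIFT)
print(max_window)  # 1073725440 bytes, just under 1 GiB
```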

In theory, bundling N connections gives N times the throughput of a single connection.

Existing tools

wget and curl are well known as tools that can be used from the command line.
There are also tools such as aria2 that can use multiple connections.
See the Qiita article "Use explosive downloader aria2, which is several times faster than curl and wget".
When establishing multiple connections, take care not to overload the server on the other end.

Main topic: building the HTTP client

HTTP has Range Requests: RFC 7233 - HTTP/1.1: Range Requests
For now, the implementation is in Python, which is popular these days.

About Range Requests

-- For example, suppose you request a 1000-byte file.
-- If you send a GET request with 'Range: bytes=0-499' in the header,
-- the server adds 'Content-Range: bytes 0-499/1000' to the response header and returns only the first 500 bytes of the file in the body.
-- The status code is '206 Partial Content'.

However, in some cases the server does not accept Range headers.
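
The header exchange described above can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical and are not taken from the actual implementation, and only the single-range "bytes" form is handled.

```python
import re


def build_range_header(start: int, end: int) -> dict:
    """Request header for the inclusive byte range start..end."""
    if start < 0 or end < start:
        raise ValueError("invalid byte range")
    return {"Range": f"bytes={start}-{end}"}


_CONTENT_RANGE = re.compile(r"bytes\s+(\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value: str):
    """Parse e.g. 'bytes 0-499/1000' into (start, end, total); total is None for '*'."""
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    start, end = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return start, end, total


def is_partial_response(status_code: int) -> bool:
    """A server that honored the Range header answers with 206 Partial Content."""
    return status_code == 206


print(build_range_header(0, 499))
print(parse_content_range("bytes 0-499/1000"))
```

A server that ignores the Range header typically answers 200 with the whole file, so checking the status code before trusting the body is one way to detect that case.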

Use this feature to request different parts of a file over multiple TCP connections at the same time.
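
One way to decide which part each connection should request is to cut the file into fixed-size byte ranges. The sketch below assumes the total size is already known; `split_ranges` is a hypothetical helper, not a function from the actual tool.

```python
def split_ranges(total_size: int, part_size: int):
    """Split total_size bytes into inclusive (start, end) ranges of at most part_size bytes."""
    if total_size <= 0 or part_size <= 0:
        raise ValueError("sizes must be positive")
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1  # Range offsets are inclusive
        ranges.append((start, end))
        start = end + 1
    return ranges


for start, end in split_ranges(1000, 300):
    print(f"Range: bytes={start}-{end}")
```

The last range is simply shorter when the file size is not a multiple of the part size.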

Multiplexing

Python has a module called selectors (in the standard library!) that wraps the select system call at a higher level.
18.4. selectors - High-level I/O multiplexing - Python 3.6.1 documentation
It monitors multiple sockets and multiplexes them. It is used like this:

# Imagine connections to two TCP echo servers, A and B
import selectors
import socket

# Omitted: address_A and address_B are (host, port) tuples for the two servers
sel = selectors.DefaultSelector()
sock_A = socket.create_connection(address_A)
sock_B = socket.create_connection(address_B)

# Ask to be notified when either socket becomes readable
sel.register(sock_A, selectors.EVENT_READ)
sel.register(sock_B, selectors.EVENT_READ)

sock_A.sendall('Hello'.encode())  # send something to A
sock_B.sendall('Hello'.encode())  # send something to B

while True:
    events = sel.select()  # blocks until at least one socket is readable
    for key, mask in events:
        message = key.fileobj.recv(512)
        print(message.decode())

Key points

-- The pieces of the file come back out of order and cannot all be kept in memory, so write them to the file sequentially, starting from wherever the order is contiguous.
-- It is not wise to keep using a poorly performing TCP connection, so evaluate each connection, discard the poor performers, replace them with new connections, and resend the request.
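
The first point, writing pieces out as soon as they are contiguous, could look roughly like the sketch below. The `OrderedWriter` class is a hypothetical illustration, not the code in the actual tool: it holds out-of-order pieces keyed by their start offset and flushes them once the piece at the next expected offset has arrived.

```python
import io


class OrderedWriter:
    """Buffers out-of-order pieces and writes them to `out` in file order."""

    def __init__(self, out):
        self._out = out
        self._next_offset = 0  # offset of the next byte the file is waiting for
        self._pending = {}     # start offset -> bytes not yet written

    def add(self, offset: int, data: bytes) -> None:
        self._pending[offset] = data
        # Flush every piece that is now contiguous with what is already written.
        while self._next_offset in self._pending:
            chunk = self._pending.pop(self._next_offset)
            self._out.write(chunk)
            self._next_offset += len(chunk)

    @property
    def pending_bytes(self) -> int:
        """Bytes currently held in memory, waiting for an earlier piece."""
        return sum(len(chunk) for chunk in self._pending.values())


out = io.BytesIO()
writer = OrderedWriter(out)
writer.add(5, b"world")   # arrives early, so it is held in memory
writer.add(0, b"hello")   # fills the gap, so both pieces are flushed
print(out.getvalue())
```

Memory use is then bounded by how far ahead the fastest connection gets relative to the slowest one, which is another reason to replace slow connections.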

Rough flow

1. Send an HTTP HEAD request to check the file size (an existing HTTP library is used here).
2. Determine the total number of parts and the part size, and establish the connections.
3. Send the initial requests.
4. Monitor the sockets with the selectors module mentioned above, read from each socket as it becomes readable, and append the data to that socket's primary buffer.
5. When the primary buffer holds enough data to be processed as an HTTP response, split it into a header and a body.
6. Identify from the header which part of the file the response corresponds to, and move the body from the primary buffer to the secondary buffer.
7. Write to the file from the front of the secondary buffer, then delete the written data from the secondary buffer.
8. Update the evaluation value of each connection, discard connections judged to be performing poorly, and resend the request on a newly established connection.
9. Repeat steps 4-8 until the entire file is complete.
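
The step that splits the primary buffer into a header and a body might be sketched as below. This is a simplified illustration under stated assumptions: the function name is hypothetical, the response must carry a Content-Length header, and chunked transfer encoding is not handled.

```python
def try_parse_response(buffer: bytes):
    """Return (status, headers, body, rest) once a full response is buffered, else None."""
    head, separator, tail = buffer.partition(b"\r\n\r\n")
    if not separator:
        return None  # the header block has not fully arrived yet

    lines = head.decode("iso-8859-1").split("\r\n")
    status = int(lines[0].split(" ", 2)[1])  # e.g. "HTTP/1.1 206 Partial Content"

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    length = int(headers.get("content-length", 0))
    if len(tail) < length:
        return None  # the body is still incomplete; keep reading from the socket

    # Anything after the body belongs to the next response on this connection.
    return status, headers, tail[:length], tail[length:]


raw = (b"HTTP/1.1 206 Partial Content\r\n"
       b"Content-Range: bytes 0-4/10\r\n"
       b"Content-Length: 5\r\n\r\n"
       b"hello")
print(try_parse_response(raw[:20]))  # None: not enough data yet
print(try_parse_response(raw))
```

The Content-Range value in the returned headers tells which offset the body belongs to, which is what the move to the secondary buffer needs.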

Implementation

https://github.com/johejo/rangedl
There are still some bugs.

How to use

Environment: Python 3.6.1

$ pip install git+http://github.com/johejo/rangedl.git
$ rangedl [URL] -n [NUM_OF_CONNECTION] -s [SPLIT_SIZE_MB]

-- By default, tqdm displays a progress bar. With the -p option, the progress bar is not displayed.
-- For safety reasons, the number of connections cannot exceed 10.
-- If the split size specified by the option is smaller than 'file size / number of connections', the split size is forcibly set to 'file size / number of connections'.

Results

-- Depending on the state of the line, I was able to download at about 200 Mbps.
-- With split_size set to 1 MB, memory usage is about 30-80 MB. Perhaps the high CPU usage is unavoidable...

An HTTP split downloader made with Python