TL;DR
Call `runtime.LockOSThread()` when performing Linux namespace-related operations, because namespaces are strongly tied to the OS thread.[^1]

Managing (and paying for) a VM for each of our tenants (200+) would be expensive, so I decided to build a mechanism that provides an HTTP(S) reverse proxy to tenants whose address spaces conflict.
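To see why the lock matters, here is a minimal sketch of the underlying pattern, written directly against `golang.org/x/sys/unix` rather than the CNI helper used below. `inNetNS` is a hypothetical helper name, and error handling on the restore path is simplified:

```go
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// inNetNS (hypothetical helper) runs fn with the calling goroutine switched
// into the netns at nspath, then switches back to the original netns.
func inNetNS(nspath string, fn func() error) error {
	// setns(2) changes the netns of the calling OS thread only, so the
	// goroutine must stay pinned to its thread while we are switched over.
	// (If restoring the old netns failed, a production version should keep
	// the thread locked so the Go runtime discards it instead of reusing it.)
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the netns this thread started in.
	orig, err := unix.Open("/proc/thread-self/ns/net", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return err
	}
	defer unix.Close(orig)

	target, err := unix.Open(nspath, unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return err
	}
	defer unix.Close(target)

	if err := unix.Setns(target, unix.CLONE_NEWNET); err != nil {
		return err
	}
	defer unix.Setns(orig, unix.CLONE_NEWNET) // switch back before unlocking

	return fn()
}

func main() {
	// Trivial self-test: "switch" into our own netns and run a function there.
	if err := inNetNS("/proc/self/ns/net", func() error {
		fmt.Println("hello from inside the netns")
		return nil
	}); err != nil {
		panic(err)
	}
}
```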
Proof of Concept
Try running the code below.
```go
package main

import (
	"log"
	"net"
	"net/http"
	"os"
	"runtime"

	"github.com/containernetworking/plugins/pkg/ns"
)

func main() {
	nspath := os.Args[1]
	addr := os.Args[2]

	var err error
	var l net.Listener
	// Create the listener while running inside the container's netns.
	// The socket stays bound to that netns even after we switch back.
	if nserr := ns.WithNetNSPath(nspath, func(_ ns.NetNS) error {
		l, err = net.Listen("tcp", addr)
		return nil
	}); nserr != nil {
		log.Fatal(nserr)
	}
	runtime.UnlockOSThread()
	if err != nil {
		log.Fatal(err)
	}
	if err := http.Serve(l, nil); err != nil {
		log.Fatal(err)
	}
}
```
To run this code, prepare a container with an isolated network as shown below.
```sh
# build the binary
go build -o nsproxy nsproxy.go

# set up the environment
docker run -d --net none --name pause k8s.gcr.io/pause:3.1
ns=$(docker inspect --format '{{ .NetworkSettings.SandboxKey }}' pause)

# run the program
sudo ./nsproxy "$ns" 127.0.0.1:8080 &
```
While this binary is serving HTTP, none of its threads live in the container's network namespace (hereinafter netns). This works because a socket belongs to the netns in which it was created: the listener opened inside the container keeps accepting connections there even after every thread has returned to the host netns.
```console
# ls -l /proc/1/ns/net   # initial netns of the host
lrwxrwxrwx 1 root root 0 Dec 24 21:42 /proc/1/ns/net -> 'net:[4026531984]'
# ls -l /proc/$(pgrep nsproxy)/task/*/ns/net   # every nsproxy thread is in the host netns
lrwxrwxrwx 1 root root 0 Dec 24 21:42 /proc/4377/task/4377/ns/net -> 'net:[4026531984]'
lrwxrwxrwx 1 root root 0 Dec 24 21:47 /proc/4377/task/4378/ns/net -> 'net:[4026531984]'
lrwxrwxrwx 1 root root 0 Dec 24 21:47 /proc/4377/task/4379/ns/net -> 'net:[4026531984]'
lrwxrwxrwx 1 root root 0 Dec 24 21:47 /proc/4377/task/4380/ns/net -> 'net:[4026531984]'
lrwxrwxrwx 1 root root 0 Dec 24 21:47 /proc/4377/task/4381/ns/net -> 'net:[4026531984]'
lrwxrwxrwx 1 root root 0 Dec 24 21:47 /proc/4377/task/4382/ns/net -> 'net:[4026531984]'
lrwxrwxrwx 1 root root 0 Dec 24 21:47 /proc/4377/task/4393/ns/net -> 'net:[4026531984]'
# ls -l /proc/$(docker inspect --format '{{.State.Pid}}' pause)/task/*/ns/net   # netns of the container
lrwxrwxrwx 1 root root 0 Dec 24 21:50 /proc/3867/task/3867/ns/net -> 'net:[4026532117]'
```
However, if you enter the container's netns with nsenter, you can see that the HTTP server is listening on `127.0.0.1:8080`.
```console
# nsenter --net=$(docker inspect --format '{{ .NetworkSettings.SandboxKey }}' pause) bash
# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
# ss -ltn
State    Recv-Q   Send-Q   Local Address:Port   Peer Address:Port
LISTEN   0        128          127.0.0.1:8080        0.0.0.0:*
# curl http://127.0.0.1:8080 -v
* Expire in 0 ms for 6 (transfer 0x5627619e7f50)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5627619e7f50)
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/7.64.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 24 Dec 2019 12:58:10 GMT
< Content-Length: 19
<
404 page not found
* Connection #0 to host 127.0.0.1 left intact
```
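The 404 is expected: the PoC passes `nil` to `http.Serve`, so requests fall through to the empty `http.DefaultServeMux`. If you want the curl above to return content, a hypothetical tweak is to register a handler in `main` before the `http.Serve` call (this also needs `fmt` imported):

```go
// Hypothetical addition to the PoC, placed before http.Serve(l, nil):
// any request now gets a 200 with a short body instead of a 404.
http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintf(w, "hello from %s via the container netns\n", r.Host)
})
```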
Let's see how far this approach scales. Extend the program so that it listens in many netns at once.
```go
package main

import (
	"log"
	"net"
	"net/http"
	"os"
	"runtime"
	"sync"

	"github.com/containernetworking/plugins/pkg/ns"
)

func main() {
	addr := os.Args[1]

	// Create one listener per netns, switching into each in turn.
	var ls []net.Listener
	for _, nspath := range os.Args[2:] {
		if err := ns.WithNetNSPath(nspath, func(_ ns.NetNS) error {
			l, err := net.Listen("tcp", addr)
			if err != nil {
				log.Fatal(err)
			}
			ls = append(ls, l)
			return nil
		}); err != nil {
			log.Fatal(err)
		}
	}
	runtime.UnlockOSThread()

	// Serve all listeners concurrently from the host netns.
	var wg sync.WaitGroup
	for _, l := range ls {
		wg.Add(1)
		go func(l net.Listener) {
			defer wg.Done()
			if err := http.Serve(l, nil); err != nil {
				log.Print(err)
			}
		}(l)
	}
	wg.Wait()
}
```
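To get from this PoC to the reverse proxy mentioned in the TL;DR, each listener would serve an `httputil.ReverseProxy` instead of the default mux. A minimal sketch of the per-listener goroutine, assuming a hypothetical per-tenant upstream `backendURL` and the extra imports `net/http/httputil` and `net/url`:

```go
// Hypothetical variant of the serving goroutine above: forward every
// request on this tenant's listener to the tenant's upstream.
backend, err := url.Parse(backendURL) // backendURL is a hypothetical per-tenant value
if err != nil {
	log.Fatal(err)
}
proxy := httputil.NewSingleHostReverseProxy(backend)
go func(l net.Listener) {
	defer wg.Done()
	if err := http.Serve(l, proxy); err != nil {
		log.Print(err)
	}
}(l)
```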
Prepare 100 containers as shown below.

```sh
# create 100 containers
seq 1000 1099 | xargs -I '{}' docker run -d --net none --name 'pause{}' k8s.gcr.io/pause:3.1

# listen in all 100 containers
sudo ./nsproxy 127.0.0.1:8080 $(docker inspect --format '{{.NetworkSettings.SandboxKey}}' pause{1000..1099}) &
```
This is the state immediately after the process starts:
```console
$ sudo cat /proc/$(pgrep nsproxy)/status
Name:	nsproxy
Umask:	0022
State:	S (sleeping)
Tgid:	17082
Ngid:	0
Pid:	17082
PPid:	17068
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	128
Groups:	0
NStgid:	17082
NSpid:	17082
NSpgid:	17068
NSsid:	3567
VmPeak:	  618548 kB
VmSize:	  561720 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	   10980 kB
VmRSS:	   10980 kB
RssAnon:	    6608 kB
RssFile:	    4372 kB
RssShmem:	       0 kB
VmData:	  161968 kB
VmStk:	     140 kB
VmExe:	    2444 kB
VmLib:	    1500 kB
VmPTE:	     140 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
CoreDumping:	0
Threads:	7
SigQ:	0/15453
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	ffffffffffc1feff
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Speculation_Store_Bypass:	thread vulnerable
Cpus_allowed:	ffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:	0-239
Mems_allowed:	00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	6
nonvoluntary_ctxt_switches:	0
```
Immediately after startup, the RSS is only about 10980 kB (`VmRSS`), which is quite lightweight for a process listening in 100 netns across 7 threads.
Network namespaces are nothing to be afraid of, so give them a try. The CNI `ns` library itself is lightweight, so it is also worth reading its implementation.