
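Origin

This project started as an attempt to rewrite, in Go, the judging system of an OJ (online judge) I had worked on. The sandbox part is now finished, so here is a write-up.

Requirements

A judging system usually has to compile and run submitted code. Algorithmic code normally needs neither special privileges nor system access, so the sandbox must keep malicious code from damaging the judging system while it runs.

A sandbox implementation has to be:

  • Secure: programs inside the sandbox must not access the system beyond what the computation needs. This covers network access and unauthorized file system access.
  • Limited: programs inside the sandbox can use only a bounded amount of CPU time and memory.
  • Fast: low extra overhead at run time.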

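Implementation choices

A sandbox based on seccomp, ptrace and setrlimit

It uses the seccomp BPF filter provided by Linux to let safe syscalls (such as write) run directly. For syscalls that access the file system (such as open), it uses ptrace and process_vm_readv, also provided by Linux, to read the syscall arguments of the sandboxed program and recover the path being accessed. The path is then checked against a whitelist to decide whether the access is malicious. CPU time and memory limits are set with setrlimit.

Pros: simple to implement, and it needs no privileges.

Cons: every file-access syscall costs a context switch, which adds about 20% overhead.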
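The whitelist check on a recovered path could look roughly like the sketch below. The whitelist entries and the isAllowed helper are made up for illustration and are not taken from the project. The sketch normalizes the path lexically only, so symbolic links would need separate handling.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// allowedPrefixes is a hypothetical whitelist of readable locations.
var allowedPrefixes = []string{"/usr/lib", "/lib", "/etc/ld.so.cache", "/w"}

// isAllowed reports whether p, after lexical cleaning, equals a whitelisted
// entry or lies underneath one.
func isAllowed(p string, whitelist []string) bool {
	if !path.IsAbs(p) {
		// A relative path would need the tracee's working directory; reject it here.
		return false
	}
	clean := path.Clean(p) // collapses "..", "." and duplicate slashes
	for _, prefix := range whitelist {
		if clean == prefix || strings.HasPrefix(clean, prefix+"/") {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"/usr/lib/libc.so.6", "/w/../etc/passwd", "/library"} {
		fmt.Println(p, isAllowed(p, allowedPrefixes))
	}
}
```

Matching on prefix plus "/" rather than on the bare prefix keeps a path such as /library from slipping through the /lib entry.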

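A sandbox based on unshare, clone and cgroup

Thanks to advances in container technology, the Linux clone and unshare system calls can run a program in an environment isolated from the host. The clone syscall creates new mount, IPC, net, pid, user and uts namespaces. Before the program runs, bind mount and pivot_root isolate the file system of the running environment. CPU time is limited by polling the cgroup's cpuacct.usage, and memory by memory.limit_in_bytes.

Pros: no context-switch overhead, and the cgroup statistics are more accurate.

Cons: cgroup requires root privileges, and unshare / clone add 20 ms of extra overhead.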

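Improved container pool

To cut the cost of creating containers, a "client" running inside the container is created and controlled by a "controller" running on the host, an idea similar to a Linux daemon. Programs launched by the "client" then do not need the container's file system to be created again. Inter-process communication goes over a unix socket created by socketpair, with messages encoded by gob. To avoid the repeated copying that would come from sending file contents over the socket, file descriptors are sent directly, using the ability of unix socket OOB data to carry them.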

[Figure: Exec Command UML diagram]

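This gives RPC-style control over the program's life cycle. The open, delete and reset commands are also provided for file operations.

Pros: less extra overhead from creating containers.

Cons: requires root privileges or a privileged docker container.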
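The gob-encoded exchange between controller and client can be illustrated with a small round trip. The Command type and its fields are hypothetical, and a bytes.Buffer stands in for the unix socket so that the sketch runs anywhere. Passing file descriptors as OOB data is not shown.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Command is a hypothetical request sent from the controller to the client.
type Command struct {
	Name string   // e.g. "execve", "open", "delete", "reset"
	Argv []string // program and arguments for execve
	Env  []string // environment variables for execve
}

// roundTrip encodes cmd with gob and decodes it again, the way the two ends
// of the socket would.
func roundTrip(cmd Command) (Command, error) {
	var conn bytes.Buffer // stands in for the unix socket
	if err := gob.NewEncoder(&conn).Encode(cmd); err != nil {
		return Command{}, err
	}
	var got Command
	err := gob.NewDecoder(&conn).Decode(&got)
	return got, err
}

func main() {
	got, err := roundTrip(Command{Name: "execve", Argv: []string{"/bin/echo", "hi"}})
	fmt.Println(got.Name, got.Argv, err)
}
```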

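Sandbox implementation

REST API

A container-based sandbox has to run as root or in privileged docker, but the judging logic itself does not need root. Following the principle of least privilege, the sandbox is split out and served as a REST API.

The web framework is GIN. It provides /file for CRUD on files and /run for running one or more programs (with standard input and output connected by pipes).

ExecutorServer implementation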

go get github.com/criyle/go-judge/cmd/executorserver && sudo ~/go/bin/executorserver

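Cross-platform

With the REST API designed, it turned out not to depend much on the platform. Thanks to Go's duck-typed interfaces, I put together a simple sandbox on Windows with a low mandatory level token + JobObject as a cross-platform proof of concept.

Benchmark

go test -bench shows about +1 ms of extra overhead (2.06 ms vs. 0.99 ms). Testing the REST API with Postman shows about +5 ms of extra latency.

Demo

I threw together a small test site with Vue.js and WebSocket. The front end is hosted on Heroku and the back end runs on a rather underpowered Raspberry Pi: goj.ac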

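Finally

The original goal was to write some code to learn Go. Refactoring the sandbox several times along the way taught me a good deal about engineering design, and writing a library feels quite different from using one.

Parts of the sandbox can also be used on their own.

  • forkExec: Linux fork / exec core library, used to create containers and load Seccomp BPF
  • unixSocket: sending and receiving file descriptors over a Linux unix socket
  • memFd: creating in-memory files with Linux memfd
  • container: implementation of the Linux container pool
  • envExec: definition and implementation of running one or more programs in an environment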

It has been a long time since the last post, and the sandbox technology has improved a lot. By combining the unix socket and container pooling from vijos/jd4 with the cgroup checking and unshare container from syzoj/judge-v3, the judge can safely run arbitrary code in an isolated environment.

Design of judge-v3 and jd4

In the design of judge-v3, the daemon receives tasks from the website and passes them to the runner for execution. The runner starts each task through simple-sandbox. Running a program with simple-sandbox requires creating and destroying containers repeatedly. Each container is built from a dedicated read-only rootfs with bind-mounted output directories. It is created through unshare and chroot, and privilege is dropped by changing the process user.

In the design of jd4, containers are created in advance through fork and unshare. The file system is shared through read-only or read-write output bind mounts of the host file system. The child process inside the container is connected through a unix socket and controlled through an RPC interface. The pid of a new process is passed by creating a new unix socket as a file inside the bind mount, and the process passes its credentials through OOB data. The lifetime of the new process is managed by the parent process.

Design of pre-forked GO-sandbox

Taking ideas from both implementations, go-sandbox uses a dumb, pre-forked and isolated runner. The file system of the sandbox is built from read-only bind mounts of the host file system plus 2 small tmpfs. Commands are passed through a unix socket pair together with files and the child pid. To ensure security, privilege is dropped by changing the process user or by setting capabilities.

The RPC interface between the host and the container daemon consists of:

  • ping: alive check
  • conf: set the running configuration (user / group)
  • open: open / create multiple files inside the container
  • delete: unlink / rmdir a file inside the container
  • reset: remove all files under /w and /tmp
  • execve: execute a single process inside the container and wait for it
Potential attacks

  • forever sleep: polling CPU usage and assuming at least 40% utilization
  • creating arbitrary files: read-only mounts & tmpfs limited to 8 MB
  • large executable: tmpfs limited to 8 MB
  • C / C++ #include of /etc/passwd: not bind mounted
  • network download: unshared net
  • overwrite /proc/1/exe: unprivileged
  • open /proc/1/fd/3: unprivileged
  • fork bomb: cgroup pids.max
  • large memory: cgroup memory
Design of GO-judge

Following the design of judge-v3, go-judge has 2 layers. The client interface connects to the website to receive tasks and test data. The first layer parses the task and data from the client interface and passes them to the message queue interface for the runner. The runner receives run tasks from the queue, runs them through go-sandbox and passes back the results.
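The two layers can be pictured with Go channels standing in for the message queue. The Task and Result types, the function names and the fake run function are all hypothetical. In the real system the queue is an interface and the run step goes through go-sandbox.

```go
package main

import "fmt"

// Task and Result are hypothetical message types for the queue.
type Task struct {
	ID     int
	Source string
}

type Result struct {
	TaskID int
	Status string
}

// runner receives tasks from the queue, runs each one and sends the result back.
func runner(queue <-chan Task, results chan<- Result, run func(Task) string) {
	for t := range queue {
		results <- Result{TaskID: t.ID, Status: run(t)}
	}
	close(results)
}

// judge plays the first layer: it feeds parsed tasks into the queue and
// collects the results from the runner.
func judge(tasks []Task, run func(Task) string) []Result {
	queue := make(chan Task)
	results := make(chan Result)
	go runner(queue, results, run)
	go func() {
		for _, t := range tasks {
			queue <- t
		}
		close(queue)
	}()
	var out []Result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fake := func(Task) string { return "Accepted" } // stand-in for the sandbox run
	fmt.Println(judge([]Task{{ID: 1}, {ID: 2}}, fake))
}
```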

Conclusion

It has been a long time since I wrote any documentation… I am too lazy…

A reimplementation of the UOJ run program in Go: go-judger. I started after finding libseccomp, which uses the seccomp filter introduced in Linux 3.5 (2012). Since I had contributed only a little to that project (UOJ), I decided to try to contribute more.

Original implementation

The original run program restricted resources (CPU, memory, output) and file access with ptrace. It involved the following steps:

Setup steps after fork in the child:

  1. Set resource limits by setrlimit
  2. Set environment variables
  3. Set input / output files
  4. execv

Tracing after fork in the parent:

Set up ptrace options when trapped at execv

  1. wait4 at syscall entry
  2. Check resource usage, wait status, signals and the syscall blacklist to decide whether to terminate or soft ban
  3. ptrace syscall to enter the syscall
  4. wait4 at syscall exit
  5. Set syscall return value
  6. ptrace syscall to exit the syscall

In this scenario, the traced process has to stop at every syscall, at both entry and exit. For harmless syscalls (e.g. brk, read), this introduces some overhead.

Reimplementation

With the seccomp BPF filter, loaded through libseccomp, such harmless syscalls are handled by the kernel, which avoids excessive context switches. Also, for a single traced syscall, seccomp is triggered only once.

Thus, the new implementation becomes:

Setup steps after fork in the child:

  1. Set resource limits by setrlimit
  2. Set input / output files
  3. Load seccomp filter
  4. Stop itself by SIGSTOP
  5. execve with environment variables

Tracing after fork in the parent:

Set up ptrace options when trapped by SIGSTOP

  1. wait4 at seccomp event
  2. Check resource usage, wait status and signals, and call the syscall event handler. The handler determines whether to terminate or soft ban
  3. ptrace continue to enter the syscall

Notice that the SIGSTOP before execve is required: if execve is traced but the ptrace options have not been set up yet, ENOSYS will be returned to execve. Safe syscalls are allowed by the filter, so they trigger no ptrace event.
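To make the "allowed by the filter" part concrete, here is a sketch of a classic BPF program that allows a list of syscalls and reports everything else to the tracer. It is built by hand instead of through libseccomp, assumes x86-64, and leaves out installing the filter in the kernel. The type and function names are illustrative.

```go
package main

import "fmt"

// sockFilter mirrors struct sock_filter from <linux/filter.h>.
type sockFilter struct {
	Code uint16
	Jt   uint8
	Jf   uint8
	K    uint32
}

const (
	opLoadWordAbs = 0x20 // BPF_LD | BPF_W | BPF_ABS
	opJumpEqual   = 0x15 // BPF_JMP | BPF_JEQ | BPF_K
	opReturn      = 0x06 // BPF_RET | BPF_K

	offsetNr   = 0 // offsetof(struct seccomp_data, nr)
	offsetArch = 4 // offsetof(struct seccomp_data, arch)

	auditArchX8664 = 0xc000003e // AUDIT_ARCH_X86_64
	retKill        = 0x00000000 // SECCOMP_RET_KILL
	retTrace       = 0x7ff00000 // SECCOMP_RET_TRACE
	retAllow       = 0x7fff0000 // SECCOMP_RET_ALLOW
)

// buildFilter returns a program that kills on a foreign architecture,
// allows the listed syscalls and traces every other syscall.
func buildFilter(allowed []uint32) []sockFilter {
	f := []sockFilter{
		{opLoadWordAbs, 0, 0, offsetArch},
		{opJumpEqual, 1, 0, auditArchX8664}, // on a match, skip the kill
		{opReturn, 0, 0, retKill},
		{opLoadWordAbs, 0, 0, offsetNr},
	}
	for _, nr := range allowed {
		f = append(f,
			sockFilter{opJumpEqual, 0, 1, nr}, // on a mismatch, skip the allow
			sockFilter{opReturn, 0, 0, retAllow},
		)
	}
	return append(f, sockFilter{opReturn, 0, 0, retTrace})
}

func main() {
	// x86-64 syscall numbers: read = 0, write = 1, brk = 12.
	filter := buildFilter([]uint32{0, 1, 12})
	fmt.Println(len(filter))
}
```

Because the program is plain data, it can be prepared before fork and handed to the kernel in the child without calling back into the Go runtime.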

Also, by setting the syscall number to -1 and writing the return value into the register, the soft ban mechanism becomes much more efficient.

With all that implemented, process_vm_readv is used instead of ptrace peekdata to speed up copying syscall arguments.

Conclusion

In conclusion, by restricting CPU time, memory, output, syscalls and file access, the run program is able to block potential attacks.

Since Go does not provide an official fork because of runtime duplication issues, it took some time to figure out how to use the raw syscall interface. Because no Go function can be called in the child after fork, I built the seccomp filter into a buffer beforehand so that it can be loaded after fork. Also, process_vm_readv is not provided, so I wrote a wrapper for it.

GitHub: DiscordBilibiliBot

After about one month of development, this Discord bot finally behaves as expected, but it still needs some polish.

Reason for developing such a bot

There are lots of Discord bots that play YouTube videos as audio in a Discord audio channel, but there were no decent ones for Bilibili. Since I am a fan of Vocaloid China, which is mainly posted on Bilibili, I started writing this bot.

Hello World

Moved the blog over to Hexo.

inline code

C++ Highlight

#include <iostream>

int main() {
    int a, b;
    // Read two integers and print their sum.
    std::cin >> a >> b;
    std::cout << a + b << std::endl;
}

Python Highlight

a, b = map(int, input().split(' '))
print(a + b)