Fault Injection: Library — Zhiheng Lin's Second Brain

Fault Injection

3rd December 2021 at 11:10am

Fault injection (错误注入 in Chinese) means deliberately introducing faults into a system in order to test the robustness of the system as a whole.

Different practitioners classify fault injection in different ways. chaostoolkit divides it roughly into:

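Infrastructure fault injection: faults at the infrastructure level, for example the availability of a cloud environment (such as AWS, Azure, or Google Cloud) or of container scheduling and resource orchestration (such as Kubernetes)
Application fault injection: faults at the application level, for example calls to a third-party service failing or their latency increasing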
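
To make the application-level category concrete, here is a minimal sketch of an in-process wrapper that makes a call fail at a given rate and adds a fixed delay. It is not taken from chaostoolkit or any other tool; `inject_faults`, `InjectedFault`, and `fetch_profile` are hypothetical names used only for illustration.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised instead of a real third-party failure (hypothetical name)."""


def inject_faults(failure_rate=0.0, extra_latency_s=0.0, rng=random.random):
    """Wrap a call so it gets extra latency and fails with a given probability."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if extra_latency_s > 0:
                time.sleep(extra_latency_s)  # simulate a slow dependency
            if rng() < failure_rate:
                # simulate the dependency being unavailable
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.3, extra_latency_s=0.05)
def fetch_profile(user_id):
    # Stand-in for a real call to a third-party service.
    return {"id": user_id, "name": "example"}
```

Calling `fetch_profile(1)` repeatedly should raise `InjectedFault` on roughly 30% of calls and take at least 50 ms each time, which is enough to exercise the caller's retry and timeout handling.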

For now I mainly care about the application level, and I have used Toxiproxy in practice.

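Gremlin has published a Chaos Engineering tools comparison that describes its own view of the various tools. I tried Gremlin, but I wanted it to blackhole requests to localhost, and no configuration I tried took effect.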

Toxiproxy

Several tools can inject faults into network requests. Toxiproxy is the one I have used.

Toxiproxy consists of two parts:

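Server: a proxy that exposes an HTTP API and reads a configuration file, through which you define the upstreams and downstreams to proxy and the disruption rules to apply
CLI / client library: talks to the server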
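
As a rough sketch of what a client does when it talks to the server, the snippet below builds the HTTP requests for creating a proxy and attaching a toxic. It assumes the API listens on `localhost:8474` and accepts `POST /proxies` and `POST /proxies/{name}/toxics` with JSON bodies, which matches the Toxiproxy README as far as I know; check this against the version you run. The helper names are my own.

```python
import json
import urllib.request

# Assumption: toxiproxy-server exposes its API on the default address.
TOXIPROXY_API = "http://localhost:8474"


def _post(url, body):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name, listen, upstream):
    """Request that registers a proxy: listen address -> upstream address."""
    body = {"name": name, "listen": listen, "upstream": upstream, "enabled": True}
    return _post(f"{TOXIPROXY_API}/proxies", body)


def add_toxic_request(proxy, toxic_type, attributes, stream="downstream"):
    """Request that attaches a toxic (disruption rule) to an existing proxy."""
    body = {"type": toxic_type, "stream": stream, "attributes": attributes}
    return _post(f"{TOXIPROXY_API}/proxies/{proxy}/toxics", body)


def send(request):
    """Send a prepared request; needs a running toxiproxy-server."""
    with urllib.request.urlopen(request) as resp:
        return resp.status, resp.read().decode("utf-8")
```

With the server running, `send(create_proxy_request("example-proxy", "localhost:8001", "localhost:8000"))` followed by `send(add_toxic_request("example-proxy", "latency", {"latency": 200, "jitter": 50}))` should have the same effect as the equivalent CLI commands. The proxy name and ports here are illustrative.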

Toxiproxy has two typical usage scenarios:

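Use a Toxiproxy client library together with test code. For example, the code below targets mysql_master, an upstream defined in Toxiproxy, and adds one second of latency to the data flowing back from it (the downstream direction), so every request to it takes at least one second: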
```
Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
  Shop.first # this takes at least 1s
end
```
Drive the server with the Toxiproxy CLI, without using a library

The following demonstrates one scenario: increasing the latency of a batch of requests.

Start a file server with Python's built-in HTTP server:

onlyice@onlyice-pc1 ~/workspace/stories-json
> $ ls
chapter-1.json  chapter-2.json  chapter-3.json  chapter-4.json  chapter-5.json  story.json

> $ python -m http.server 8000
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

Start the Toxiproxy server, then use the CLI to add a random latency of 200±50 ms:

# Run the server in the first terminal
> $ toxiproxy-server

# In another terminal, create a proxy that listens (`-l`, listen) on port 8001 and uses port 8000 as its upstream (`-u`, upstream)
> $ toxiproxy-cli create story-server -l localhost:8001 -u localhost:8000
Created new proxy story-server
# Add a new toxic with a latency of 200±50 ms
> $ toxiproxy-cli toxic add story-server -t latency -a latency=200 -a jitter=50
Added downstream latency toxic 'latency_downstream' on proxy 'story-server'
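
To illustrate what `latency=200` with `jitter=50` means, the sketch below draws delays uniformly from 200±50 ms. The uniform distribution is my assumption about how the jitter is applied. This is a toy model, not Toxiproxy's own code, and `sample_delay_ms` is a hypothetical helper.

```python
import random


def sample_delay_ms(latency=200, jitter=50, rng=random):
    """Draw one delay in ms, uniformly within latency ± jitter (assumed model)."""
    if jitter <= 0:
        return latency
    return latency + rng.randint(-jitter, jitter)


samples = [sample_delay_ms() for _ in range(10_000)]
# Smallest, largest, and mean simulated delay in milliseconds.
print(min(samples), max(samples), sum(samples) / len(samples))
```

Under this model every delay falls between 150 ms and 250 ms, with a mean close to 200 ms.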

Use curl to compare the response time through the Toxiproxy proxy with that of a direct connection. The request that goes through Toxiproxy shows the added latency:

> $ curl -L --output /dev/null --silent --show-error --write-out 'lookup:        %{time_namelookup}\nconnect:       %{time_connect}\nappconnect:    %{time_appconnect}\npretransfer:   %{time_pretransfer}\nredirect:      %{time_redirect}\nstarttransfer: %{time_starttransfer}\ntotal:         %{time_total}\n' http://localhost:8000/story.json                                            
lookup:        0.001063
connect:       0.001496
appconnect:    0.000000
pretransfer:   0.001575
redirect:      0.000000
starttransfer: 0.003968
total:         0.004098
> $ curl -L --output /dev/null --silent --show-error --write-out 'lookup:        %{time_namelookup}\nconnect:       %{time_connect}\nappconnect:    %{time_appconnect}\npretransfer:   %{time_pretransfer}\nredirect:      %{time_redirect}\nstarttransfer: %{time_starttransfer}\ntotal:         %{time_total}\n' http://localhost:8001/story.json
lookup:        0.000810
connect:       0.001156
appconnect:    0.000000
pretransfer:   0.001217
redirect:      0.000000
starttransfer: 0.211397
total:         0.232331
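
The added delay can be read off by subtracting the direct timings from the proxied ones. The short calculation below uses only the numbers from the two curl runs above; the variable names are my own.

```python
# Timings in seconds, copied from the two curl runs above.
direct = {"connect": 0.001496, "starttransfer": 0.003968, "total": 0.004098}
proxied = {"connect": 0.001156, "starttransfer": 0.211397, "total": 0.232331}

# Extra time introduced by going through the proxy, in milliseconds.
added_ms = {key: round((proxied[key] - direct[key]) * 1000, 1) for key in direct}
print(added_ms)
```

The time to first byte grows by about 207 ms, which fits the configured 200±50 ms, while the connect time is essentially unchanged. That is consistent with the toxic delaying only the data sent back on the downstream link, not the TCP handshake with the local proxy.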