Side effects of nodePort

Symptoms
Root cause
The obvious question
Solution

Symptoms

One day I ran into a problem: when accessing a certain web service, some requests failed with Connection Refused.

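For various reasons this web service runs with hostNetwork, so it shares the host's network namespace. After logging in to the Node, I could see that the listening socket was still there, yet a direct curl against it also returned Connection Refused.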

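Why? Normally Connection Refused means that nothing is listening on the port: the kernel rejects the connection attempt itself (with a RST for TCP, or icmp-port-unreachable for UDP). Here, though, the port was clearly open.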
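As a baseline for the "normal" meaning of Connection Refused, here is a minimal Python sketch that connects to a loopback port with and without a listener. The helper names `connect_result` and `find_unused_port` are made up for illustration; only the standard `socket` module is used.

```python
import socket


def connect_result(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a TCP connect and describe the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # ECONNREFUSED: the peer (or a firewall rule on the path) rejected the attempt.
        return "refused"
    except OSError as exc:
        # Timeouts, unreachable networks, and so on.
        return f"error: {exc}"


def find_unused_port() -> int:
    """Ask the kernel for a free port, then release it so nothing listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

On a healthy host, a port with a listener yields "connected" and a port without one yields "refused". In the incident described here the listener existed and the result was still "refused", which is what makes the symptom surprising.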

Root cause

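It turns out that a closed port is not the only cause. If a packet hits certain iptables rules, the kernel answers with an icmp-port-unreachable error, which a TCP client also sees as Connection Refused. That is what happened here.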

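Listing all Services in the cluster showed that there happened to be a NodePort Service whose allocated port was the same as the port the web service listens on.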
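This check can be automated. The sketch below assumes the Service list has been exported as JSON (for example with `kubectl get svc --all-namespaces -o json`) and that the set of locally listening ports was collected separately. The function names and sample data are hypothetical, and the protocol (TCP or UDP) is ignored for brevity.

```python
def node_ports(service_list: dict) -> dict:
    """Map each allocated nodePort to the "namespace/name" of its Service."""
    owners = {}
    for svc in service_list.get("items", []):
        meta = svc.get("metadata", {})
        for port in svc.get("spec", {}).get("ports") or []:
            node_port = port.get("nodePort")
            if node_port:  # absent (or 0) for plain ClusterIP Services
                owners[node_port] = f'{meta.get("namespace")}/{meta.get("name")}'
    return owners


def conflicts(service_list: dict, listening_ports: set) -> dict:
    """Return the local listening ports that are also allocated as nodePorts."""
    owners = node_ports(service_list)
    return {p: owners[p] for p in sorted(listening_ports) if p in owners}


# Hypothetical sample shaped like the output of `kubectl get svc -A -o json`.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "default", "name": "nginx"},
            "spec": {"type": "NodePort", "ports": [{"port": 80, "nodePort": 31325}]},
        },
        {
            "metadata": {"namespace": "default", "name": "internal"},
            "spec": {"type": "ClusterIP", "ports": [{"port": 8080}]},
        },
    ]
}
```

With this sample, `conflicts(SAMPLE, {22, 31325})` reports port 31325 as owned by `default/nginx`, while port 22 is left alone.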

The iptables rules looked like this (example):

# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
KUBE-EXTERNAL-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes externally-visible service portals */
KUBE-FIREWALL  all  --  anywhere             anywhere

# iptables -L KUBE-EXTERNAL-SERVICES -t filter
Chain KUBE-EXTERNAL-SERVICES (1 references)
target     prot opt source    destination
REJECT     tcp  --  anywhere  anywhere    ADDRTYPE match dst-type LOCAL tcp dpt:30584 reject-with icmp-port-unreachable

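[Figure: iptables]

The figure above (source) shows the stages at which the kernel applies iptables. Before a packet is actually delivered to the Local Process, it must first pass through the PREROUTING and INPUT stages.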

Judging from the rules above, a NodePort Service adds a rule at the iptables INPUT stage: any packet whose destination port is 30584 is answered immediately with an icmp-port-unreachable error.

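Note that the rules above show the case where the Service has no matching endpoints. Once the Pod is running and the Service has endpoints, the iptables rules forward the packets to the Pod instead, and the Local Process still never receives them. Because client requests then do not fail outright but are handled by the Pod, this case is even more confusing.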
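The REJECT rule shown earlier can be picked out of an iptables listing mechanically. The following is a rough sketch, assuming the text layout printed by `iptables -L` as in the example. The regular expression and the name `rejected_ports` are illustrative only. It matches numeric ports, so listings should be taken with `-n`, and it only covers the no-endpoint case, not the forwarding rules that exist once the Service has endpoints.

```python
import re

# Matches lines such as:
# REJECT  tcp  --  anywhere  anywhere  ... tcp dpt:30584 reject-with icmp-port-unreachable
RULE = re.compile(
    r"^REJECT\s+(?P<proto>\w+)\s.*\bdpt:(?P<port>\d+)\b.*reject-with\s+(?P<how>\S+)"
)


def rejected_ports(listing: str) -> list:
    """Return (protocol, port, reject type) for every REJECT rule with a numeric dpt."""
    result = []
    for line in listing.splitlines():
        match = RULE.match(line.strip())
        if match:
            result.append((match["proto"], int(match["port"]), match["how"]))
    return result


# The chain listing from the example above.
LISTING = """\
Chain KUBE-EXTERNAL-SERVICES (1 references)
target     prot opt source    destination
REJECT     tcp  --  anywhere  anywhere    ADDRTYPE match dst-type LOCAL tcp dpt:30584 reject-with icmp-port-unreachable
"""
```

Any port returned by `rejected_ports` that a hostNetwork process is also listening on is a candidate for the failure described here.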

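The obvious question

An attentive reader will ask: isn't Kubernetes digging a pit for Local Processes with this design? When the kernel picks a local port at random, it could easily land on a port used by a Kubernetes Service.

In fact, Kubernetes has done what it can.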

When a NodePort Service is created, Kubernetes (kube-proxy does the actual work) not only installs the iptables rules but also opens a listening socket on the nodePort. As a result:

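When a process explicitly binds that port, the bind fails with "address already in use"
When the kernel picks a local port at random, it will not pick that port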

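So under normal circumstances a Local Process does not fall into the pit.

However, if the Local Process starts first and kube-proxy starts afterwards, the situation described above occurs.

In that case kube-proxy still installs the iptables rules and tries to bind the port, but the bind fails because the Local Process already holds it.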
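The port-holding trick can be reproduced with plain sockets. The sketch below is not kube-proxy's code. It only illustrates the first point: while one socket listens on a port, a second explicit bind to the same port fails. `hold_port` and `try_bind` are hypothetical helper names.

```python
import errno
import socket


def hold_port(port: int = 0) -> socket.socket:
    """Open a listening TCP socket and keep it open; the caller owns the socket."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", port))  # port 0 lets the kernel choose a free port
    s.listen(1)
    return s


def try_bind(port: int):
    """Return None if an explicit bind succeeds, otherwise the errno name."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("", port))
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Whichever process binds first wins. That is why the protection only works when the holder gets there before the Local Process does.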

E0729 01:48:43.034098       1 proxier.go:1072] can't open "nodePort for default/nginx:" (:31325/tcp), skipping this nodePort: listen tcp :31325: bind: address already in use
E0729 01:49:13.064492       1 proxier.go:1072] can't open "nodePort for default/nginx:" (:31325/tcp), skipping this nodePort: listen tcp :31325: bind: address already in use
E0729 01:49:43.094846       1 proxier.go:1072] can't open "nodePort for default/nginx:" (:31325/tcp), skipping this nodePort: listen tcp :31325: bind: address already in use
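These errors are the clearest hint that a nodePort collides with something already listening on the Node. Below is a small sketch for pulling the affected ports out of kube-proxy logs. The pattern is derived only from the three lines above, so treat it as an assumption: the message format may differ between kube-proxy versions. `skipped_node_ports` is a made-up name.

```python
import re

# Pattern modelled on the log lines quoted above.
SKIPPED = re.compile(
    r"""can't open "nodePort for (?P<service>[^"]*)" \(:(?P<port>\d+)/(?P<proto>\w+)\), skipping this nodePort"""
)


def skipped_node_ports(log_text: str) -> list:
    """Return the unique (service, port, protocol) triples kube-proxy failed to open."""
    found = set()
    for match in SKIPPED.finditer(log_text):
        # The service field ends with ":" when the Service port is unnamed.
        service = match["service"].rstrip(":")
        found.add((service, int(match["port"]), match["proto"]))
    return sorted(found)


# Two of the log lines from above.
LOG = '''\
E0729 01:48:43.034098       1 proxier.go:1072] can't open "nodePort for default/nginx:" (:31325/tcp), skipping this nodePort: listen tcp :31325: bind: address already in use
E0729 01:49:13.064492       1 proxier.go:1072] can't open "nodePort for default/nginx:" (:31325/tcp), skipping this nodePort: listen tcp :31325: bind: address already in use
'''
```

Repeated retries collapse into one entry per Service and port, which gives a short list of ports to compare against the Node's listening sockets.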

But because the iptables rules are already in place, the Local Process can only sit on its port in tears and watch its packets get hijacked.

Solution

The fix is to configure a ResourceQuota on each user namespace, with services.nodeports and services.loadbalancers set to 0. If a user then tries to create a NodePort or LoadBalancer Service, the ResourceQuota rejects it.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: hellobaby
  namespace: hellobaby
spec:
  hard:
...
    services: "16"
    services.nodeports: "0"
    services.loadbalancers: "0"