直接现象:使用Go 1.13编译kube-controller-manager,运行一段时间后,发现controller不工作,查看日志,发现打印“http2: no cached connection was available”。

I0328 09:48:59.925056       1 round_trippers.go:383] GET https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager
I0328 09:48:59.925085       1 round_trippers.go:390] Request Headers:
I0328 09:48:59.925094       1 round_trippers.go:393]     User-Agent: kube-controller-manager/v1.11.1 (linux/amd64) kubernetes/b1b2997/leader-election
I0328 09:48:59.925102       1 round_trippers.go:393]     Accept: application/vnd.kubernetes.protobuf, */*
I0328 09:48:59.925148       1 round_trippers.go:408] Response Status:  in 0 milliseconds
E0328 09:48:59.925199       1 leaderelection.go:234] error retrieving resource lock kube-system/kube-controller-manager: Get https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: http2: no cached connection was available
I0328 09:48:59.925216       1 leaderelection.go:190] failed to acquire lease kube-system/kube-controller-manager

使用1.13会有上面的现象。使用1.12版本没有问题。k8s版本是1.11。

从go代码查询“no cached connection was available”,可以看到是http2noCachedConnError错误。

// noCachedConnError is the concrete type of ErrNoCachedConn, which
// needs to be detected by net/http regardless of whether it's its
// bundled version (in h2_bundle.go with a rewritten type name) or
// from a user's x/net/http2. As such, as it has a unique method name
// (IsHTTP2NoCachedConnError) that net/http sniffs for via func
// isNoCachedConnError.
type http2noCachedConnError struct{}

func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}

func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}

var http2ErrNoCachedConn error = http2noCachedConnError{}

由于问题必现,所以不妨使用delve跟踪。

复现方法:配置kube-controller-manage连接1个API Server,当kube-controller-manag正常运行时,将API Server关闭,然后重新启动API Server,之后去查看kube-controller-manager的日志,会打印”http2: no cached connection was available”。

从代码上分析,http2ErrNoCachedConn 只在 src/net/http/h2_bundle.go 中会返回,那么判断可能是在函数getClientConn出错了。来看下Go源码的实现。

func (p *http2clientConnPool) getClientConn(req *Request, addr string, dialOnMiss bool) (*http2ClientConn, error) {
	if http2isConnectionCloseRequest(req) && dialOnMiss {
		// It gets its own connection.
		http2traceGetConn(req, addr)
		const singleUse = true
		cc, err := p.t.dialClientConn(addr, singleUse)
		if err != nil {
			return nil, err
		}
		return cc, nil
	}
	p.mu.Lock()
	for _, cc := range p.conns[addr] {
		if st := cc.idleState(); st.canTakeNewRequest {
			if p.shouldTraceGetConn(st) {
				http2traceGetConn(req, addr)
			}
			p.mu.Unlock()
			return cc, nil
		}
	}
	if !dialOnMiss {
		p.mu.Unlock()
		return nil, http2ErrNoCachedConn
	}

dlv attach pid后,尝试对getClientConn打断点,发现有2个getClientConn,一个在上面看到的在Go源码中,而另一个是在k8s源码的vendor目录下。那么,实际运行中,会调用到哪个呢?没关系,都打上断点。

(dlv) b getClientConn
Command failed: Location "getClientConn" ambiguous: net/http.(*http2clientConnPool).getClientConn, k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn…
(dlv) b net/http.(*http2clientConnPool).getClientConn
Breakpoint 3 (enabled) set at 0x7b6b93 for net/http.(*http2clientConnPool).getClientConn() .usr/local/go/src/net/http/h2_bundle.go:749
(dlv) b k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
Breakpoint 4 (enabled) set at 0x841a03 for k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn() .export/hubt/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55

继续执行,会进入断点,可以看到,调用的是k8s源码vendor目录下的getClientConn

堆栈1:

 0  0x0000000000851ff3 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55
 1  0x000000000085355e in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:255
 2  0x000000000086d985 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:345
 3  0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:313
 4  0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/configure_transport.go:75
 5  0x00000000008161aa in net/http.(*Transport).roundTrip
    at .usr/local/go/src/net/http/transport.go:486
 6  0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
    at .usr/local/go/src/net/http/roundtrip.go:17
 7  0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
 8  0x00000000007bfafa in net/http.send
    at .usr/local/go/src/net/http/client.go:250
 9  0x00000000007bf51a in net/http.(*Client).send

来看看它的实现。

var ErrNoCachedConn = errors.New("http2: no cached connection was available")

func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
	if isConnectionCloseRequest(req) && dialOnMiss {
		// It gets its own connection.
		const singleUse = true
		cc, err := p.t.dialClientConn(addr, singleUse)
		if err != nil {
			return nil, err
		}
		return cc, nil
	}
	p.mu.Lock()
	for _, cc := range p.conns[addr] {
		if cc.CanTakeNewRequest() {
			p.mu.Unlock()
			return cc, nil
		}
	}
	if !dialOnMiss {
		p.mu.Unlock()
		return nil, ErrNoCachedConn
	}
	call := p.getStartDialLocked(addr)
	p.mu.Unlock()
	<-call.done
	return call.res, call.err
}

可以看到,这里会返回 ErrNoCachedConn(从堆栈上看,rt为noDialH2RoundTripper,所以dialOnMiss值为false),而ErrNoCachedConn的值,的确是kube-controller-manager打印的字符串。

从前面的堆栈看,错误码会一路返回到 configure_transport.goRoundTrip。来看下这块代码。

func (rt noDialH2RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	res, err := rt.t.RoundTrip(req)
	if err == ErrNoCachedConn {
		return nil, http.ErrSkipAltProtocol
	}
	return res, err
}

从这里可以看到,如果返回的错误码是 noDialH2RoundTripper,会把它转为http.ErrSkipAltProtocol再返回。错误不是这里打印的。

继续往回看堆栈,在Go源码的roundTrip中,针对http.ErrSkipAltProtocol的确有特殊处理,针对这种错误不会直接返回报错。

继续单步跟踪下去,可以看到,在pconn的roundTrip中返回了ErrNoCachedConn,但在http2isNoCachedConnError判断时,认为不是同一个错误,没有执行t.removeIdleConn(pconn),并进入了 pconn.shouldRetryRequest,并最终返回了ErrNoCachedConn

// roundTrip implements a RoundTripper over HTTP.
func (t *Transport) roundTrip(req *Request) (*Response, error) {
...
	if t.useRegisteredProtocol(req) {
		altProto, _ := t.altProto.Load().(map[string]RoundTripper)
		if altRT := altProto[scheme]; altRT != nil {
			if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
				return resp, err
			}
		}
	}
...
	for {
...
      pconn, err := t.getConn(treq, cm)
		if err != nil {
			t.setReqCanceler(req, nil)
			req.closeBody()
			return nil, err
		}

		var resp *Response
		if pconn.alt != nil {
			// HTTP/2 path.
			t.setReqCanceler(req, nil) // not cancelable with CancelRequest
			resp, err = pconn.alt.RoundTrip(req)
		} else {
			resp, err = pconn.roundTrip(treq)
		}
		if err == nil {
			return resp, nil
		}
		if http2isNoCachedConnError(err) {
			t.removeIdleConn(pconn)
		} else if !pconn.shouldRetryRequest(req, err) {
			// Issue 16465: return underlying net.Conn.Read error from peek,
			// as we've historically done.
			if e, ok := err.(transportReadFromServerError); ok {
				err = e.err
			}
			return nil, err
		}

给shouldRetryRequest也打上断点(同样的,它也出现了2处)。

b net/http.(*persistConn).shouldRetryRequest
b k8s.io/kubernetes/vendor/golang.org/x/net/http2.shouldRetryRequest

在复现时查看到堆栈如下,打印下入参 err ,确实是ErrNoCachedConn

 0  0x0000000000816413 in net/http.(*persistConn).shouldRetryRequest
    at .usr/local/go/src/net/http/transport.go:569
 1  0x0000000000815ce9 in net/http.(*Transport).roundTrip
    at .usr/local/go/src/net/http/transport.go:543
 2  0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
    at .usr/local/go/src/net/http/roundtrip.go:17
 3  0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
 4  0x00000000007bfafa in net/http.send
    at .usr/local/go/src/net/http/client.go:250
 5  0x00000000007bf51a in net/http.(*Client).send
(dlv) p err
error(*errors.errorString) *{
	s: "http2: no cached connection was available",}

我们来看下shouldRetryRequest的实现。

func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
	if http2isNoCachedConnError(err) {
		// Issue 16582: if the user started a bunch of
		// requests at once, they can all pick the same conn
		// and violate the server's max concurrent streams.
		// Instead, match the HTTP/1 behavior for now and dial
		// again to get a new TCP connection, rather than failing
		// this request.
		return true
	}
...
	return false // conservatively

可以看到,如果 no cached conn,是会返回true,即表示应该会retry。但单独跟踪下去发现并没有retry,这样从前面的代码分析,会返回错误码到 http.send,也就是打印出来的日志。

所以,可以看到,问题是出在 http2isNoCachedConnError。前面有贴它的实现,再贴一下。

func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}

写个小代码检查下,ErrNoCachedConn 是否满足 http2isNoCachedConnError(https://play.golang.org/p/fB8ilA6-VPZ)。

package main

import (
	"errors"
	"fmt"
)

type http2noCachedConnError struct{}

func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}

func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}
func main(){
	var ErrNoCachedConn = errors.New("http2: no cached connection was available")
	res := http2isNoCachedConnError(ErrNoCachedConn)
	fmt.Printf("result 1 %v\n", res)

	var err = http2noCachedConnError{}
	res = http2isNoCachedConnError(err)
	fmt.Printf("result 2 %v\n", res)

	if ErrNoCachedConn.Error() == err.Error() {
		fmt.Printf("same result!")
	}
}

可以看到,虽然他们的值是相同的,但 ErrNoCachedConn 并不符合 http2isNoCachedConnError 的要求,这样也就导致了上面一直报错,没有发起重连。

当然这个问题解决很简单,不要用Go 1.13版本编译,用Go 1.12版本就不会问题。 1.12版本的重连逻辑与1.13不一样,如何不同,有机会再分析。

当然,也可以修改k8s源码里vendor的实现,统一两边错误码的定义,这样可以使用后续版本的Go,不过最好不要改vendor。

事实上,Go在后续版本的实现里,也是这样做的,具体可以看这两个patch:

不过,这个patch考虑不是很周到,要是这样改,向前兼容会更好一些:

func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	if !ok {
		if err.Error() == http2ErrNoCachedConn.Error() {
			return true
		}
	}
	return ok
}

哈哈哈。