一个Go升级引起的血案
by 伊布
直接现象:使用Go 1.13编译kube-controller-manager
,运行一段时间后,发现controller不工作,查看日志,发现打印“http2: no cached connection was available”。
I0328 09:48:59.925056 1 round_trippers.go:383] GET https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager
I0328 09:48:59.925085 1 round_trippers.go:390] Request Headers:
I0328 09:48:59.925094 1 round_trippers.go:393] User-Agent: kube-controller-manager/v1.11.1 (linux/amd64) kubernetes/b1b2997/leader-election
I0328 09:48:59.925102 1 round_trippers.go:393] Accept: application/vnd.kubernetes.protobuf, */*
I0328 09:48:59.925148 1 round_trippers.go:408] Response Status: in 0 milliseconds
E0328 09:48:59.925199 1 leaderelection.go:234] error retrieving resource lock kube-system/kube-controller-manager: Get https://10.220.14.10:8443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager: http2: no cached connection was available
I0328 09:48:59.925216 1 leaderelection.go:190] failed to acquire lease kube-system/kube-controller-manager
使用1.13会有上面的现象。使用1.12版本没有问题。k8s版本是1.11。
从go代码查询“no cached connection was available”,可以看到是http2noCachedConnError错误。
// noCachedConnError is the concrete type of ErrNoCachedConn, which
// needs to be detected by net/http regardless of whether it's its
// bundled version (in h2_bundle.go with a rewritten type name) or
// from a user's x/net/http2. As such, as it has a unique method name
// (IsHTTP2NoCachedConnError) that net/http sniffs for via func
// isNoCachedConnError.
type http2noCachedConnError struct{}
func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}
func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }
// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
return ok
}
var http2ErrNoCachedConn error = http2noCachedConnError{}
由于问题必现,所以不妨使用delve跟踪。
复现方法:配置kube-controller-manage连接1个API Server,当kube-controller-manag正常运行时,将API Server关闭,然后重新启动API Server,之后去查看kube-controller-manager的日志,会打印”http2: no cached connection was available”。
从代码上分析,http2ErrNoCachedConn 只在 src/net/http/h2_bundle.go
中会返回,那么判断可能是在函数getClientConn
出错了。来看下Go源码的实现。
func (p *http2clientConnPool) getClientConn(req *Request, addr string, dialOnMiss bool) (*http2ClientConn, error) {
if http2isConnectionCloseRequest(req) && dialOnMiss {
// It gets its own connection.
http2traceGetConn(req, addr)
const singleUse = true
cc, err := p.t.dialClientConn(addr, singleUse)
if err != nil {
return nil, err
}
return cc, nil
}
p.mu.Lock()
for _, cc := range p.conns[addr] {
if st := cc.idleState(); st.canTakeNewRequest {
if p.shouldTraceGetConn(st) {
http2traceGetConn(req, addr)
}
p.mu.Unlock()
return cc, nil
}
}
if !dialOnMiss {
p.mu.Unlock()
return nil, http2ErrNoCachedConn
}
dlv attach pid后,尝试对getClientConn
打断点,发现有2个getClientConn
,一个在上面看到的在Go源码中,而另一个是在k8s源码的vendor目录下。那么,实际运行中,会调用到哪个呢?没关系,都打上断点。
(dlv) b getClientConn
Command failed: Location "getClientConn" ambiguous: net/http.(*http2clientConnPool).getClientConn, k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn…
(dlv) b net/http.(*http2clientConnPool).getClientConn
Breakpoint 3 (enabled) set at 0x7b6b93 for net/http.(*http2clientConnPool).getClientConn() .usr/local/go/src/net/http/h2_bundle.go:749
(dlv) b k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
Breakpoint 4 (enabled) set at 0x841a03 for k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn() .export/hubt/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55
继续执行,会进入断点,可以看到,调用的是k8s源码vendor目录下的getClientConn
。
堆栈1:
0 0x0000000000851ff3 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:55
1 0x000000000085355e in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/client_conn_pool.go:255
2 0x000000000086d985 in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:345
3 0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/transport.go:313
4 0x0000000000853cee in k8s.io/kubernetes/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip
at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/golang.org/x/net/http2/configure_transport.go:75
5 0x00000000008161aa in net/http.(*Transport).roundTrip
at .usr/local/go/src/net/http/transport.go:486
6 0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
at .usr/local/go/src/net/http/roundtrip.go:17
7 0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
8 0x00000000007bfafa in net/http.send
at .usr/local/go/src/net/http/client.go:250
9 0x00000000007bf51a in net/http.(*Client).send
来看看它的实现。
var ErrNoCachedConn = errors.New("http2: no cached connection was available")
func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
if isConnectionCloseRequest(req) && dialOnMiss {
// It gets its own connection.
const singleUse = true
cc, err := p.t.dialClientConn(addr, singleUse)
if err != nil {
return nil, err
}
return cc, nil
}
p.mu.Lock()
for _, cc := range p.conns[addr] {
if cc.CanTakeNewRequest() {
p.mu.Unlock()
return cc, nil
}
}
if !dialOnMiss {
p.mu.Unlock()
return nil, ErrNoCachedConn
}
call := p.getStartDialLocked(addr)
p.mu.Unlock()
<-call.done
return call.res, call.err
}
可以看到,这里会返回 ErrNoCachedConn
(从堆栈上看,rt为noDialH2RoundTripper,所以dialOnMiss
值为false),而ErrNoCachedConn
的值,的确是kube-controller-manager打印的字符串。
从前面的堆栈看,错误码会一路返回到 configure_transport.go
的 RoundTrip
。来看下这块代码。
func (rt noDialH2RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
res, err := rt.t.RoundTrip(req)
if err == ErrNoCachedConn {
return nil, http.ErrSkipAltProtocol
}
return res, err
}
从这里可以看到,如果返回的错误码是 noDialH2RoundTripper
,会把它转为http.ErrSkipAltProtocol
再返回。错误不是这里打印的。
继续往回看堆栈,在Go源码的roundTrip
中,针对http.ErrSkipAltProtocol
的确有特殊处理,针对这种错误不会直接返回报错。
继续单步跟踪下去,可以看到,在pconn的roundTrip
中返回了ErrNoCachedConn
,但在http2isNoCachedConnError
判断时,认为不是同一个错误,没有执行t.removeIdleConn(pconn)
,并进入了 pconn.shouldRetryRequest
,并最终返回了ErrNoCachedConn
。
// roundTrip implements a RoundTripper over HTTP.
func (t *Transport) roundTrip(req *Request) (*Response, error) {
...
if t.useRegisteredProtocol(req) {
altProto, _ := t.altProto.Load().(map[string]RoundTripper)
if altRT := altProto[scheme]; altRT != nil {
if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
return resp, err
}
}
}
...
for {
...
pconn, err := t.getConn(treq, cm)
if err != nil {
t.setReqCanceler(req, nil)
req.closeBody()
return nil, err
}
var resp *Response
if pconn.alt != nil {
// HTTP/2 path.
t.setReqCanceler(req, nil) // not cancelable with CancelRequest
resp, err = pconn.alt.RoundTrip(req)
} else {
resp, err = pconn.roundTrip(treq)
}
if err == nil {
return resp, nil
}
if http2isNoCachedConnError(err) {
t.removeIdleConn(pconn)
} else if !pconn.shouldRetryRequest(req, err) {
// Issue 16465: return underlying net.Conn.Read error from peek,
// as we've historically done.
if e, ok := err.(transportReadFromServerError); ok {
err = e.err
}
return nil, err
}
给shouldRetryRequest也打上断点(同样的,它也出现了2处)。
b net/http.(*persistConn).shouldRetryRequest
b k8s.io/kubernetes/vendor/golang.org/x/net/http2.shouldRetryRequest
在复现时查看到堆栈如下,打印下入参 err ,确实是ErrNoCachedConn
。
0 0x0000000000816413 in net/http.(*persistConn).shouldRetryRequest
at .usr/local/go/src/net/http/transport.go:569
1 0x0000000000815ce9 in net/http.(*Transport).roundTrip
at .usr/local/go/src/net/http/transport.go:543
2 0x00000000007fc4d5 in net/http.(*Transport).RoundTrip
at .usr/local/go/src/net/http/roundtrip.go:17
3 0x0000000000ea9100 in k8s.io/kubernetes/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
at .Users/hubaotao/Code/GoGo/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/transport/round_trippers.go:162
4 0x00000000007bfafa in net/http.send
at .usr/local/go/src/net/http/client.go:250
5 0x00000000007bf51a in net/http.(*Client).send
(dlv) p err
error(*errors.errorString) *{
s: "http2: no cached connection was available",}
我们来看下shouldRetryRequest
的实现。
func (pc *persistConn) shouldRetryRequest(req *Request, err error) bool {
if http2isNoCachedConnError(err) {
// Issue 16582: if the user started a bunch of
// requests at once, they can all pick the same conn
// and violate the server's max concurrent streams.
// Instead, match the HTTP/1 behavior for now and dial
// again to get a new TCP connection, rather than failing
// this request.
return true
}
...
return false // conservatively
可以看到,如果 no cached conn,是会返回true,即表示应该会retry。但单独跟踪下去发现并没有retry,这样从前面的代码分析,会返回错误码到 http.send
,也就是打印出来的日志。
所以,可以看到,问题是出在 http2isNoCachedConnError
。前面有贴它的实现,再贴一下。
func http2isNoCachedConnError(err error) bool {
_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
return ok
}
写个小代码检查下,ErrNoCachedConn
是否满足 http2isNoCachedConnError
(https://play.golang.org/p/fB8ilA6-VPZ)。
package main
import (
"errors"
"fmt"
)
type http2noCachedConnError struct{}
func (http2noCachedConnError) IsHTTP2NoCachedConnError() {}
func (http2noCachedConnError) Error() string { return "http2: no cached connection was available" }
// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
return ok
}
func main(){
var ErrNoCachedConn = errors.New("http2: no cached connection was available")
res := http2isNoCachedConnError(ErrNoCachedConn)
fmt.Printf("result 1 %v\n", res)
var err = http2noCachedConnError{}
res = http2isNoCachedConnError(err)
fmt.Printf("result 2 %v\n", res)
if ErrNoCachedConn.Error() == err.Error() {
fmt.Printf("same result!")
}
}
可以看到,虽然他们的值是相同的,但 ErrNoCachedConn
并不符合 http2isNoCachedConnError
的要求,这样也就导致了上面一直报错,没有发起重连。
当然这个问题解决很简单,不要用Go 1.13版本编译,用Go 1.12版本就不会问题。 1.12版本的重连逻辑与1.13不一样,如何不同,有机会再分析。
当然,也可以修改k8s源码里vendor的实现,统一两边错误码的定义,这样可以使用后续版本的Go,不过最好不要改vendor。
事实上,Go在后续版本的实现里,也是这样做的,具体可以看这两个patch:
不过,这个patch考虑不是很周到,要是这样改,向前兼容会更好一些:
func http2isNoCachedConnError(err error) bool {
_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
if !ok {
if err.Error() == http2ErrNoCachedConn.Error() {
return true
}
}
return ok
}
哈哈哈。
Subscribe via RSS