Wednesday, 27 December 2017

Bank application paralysed for 1.5 hours by a Kubernetes bug: incident review and analysis

It is worth noting that we had two major incidents last week, and many customers were affected (sorry again). The first incident lasted nearly a week and affected only our prepaid products, that is, Monzo Alpha and Beta. The second lasted about 1.5 hours on Friday morning and affected not only the prepaid products but also the current account. This article covers the latter.

The blog post we published last year (https://monzo.com/blog/2016/09/19/building-a-modern-bank-backend/) describes our overall backend architecture in more detail, but to follow this article it is more important to understand the role played by a few components in our technology stack.

Kubernetes is the system that deploys and manages all of our infrastructure. Monzo's backend consists of hundreds of microservices packaged as Docker containers. Kubernetes schedules these containers and makes sure they keep running on our AWS nodes.

etcd is a distributed database that stores information about which services are deployed, where they are running, and what state they are in, and provides that information to Kubernetes. Kubernetes needs a stable connection to etcd to work properly. If etcd stops running, our services keep running, but nothing can be deployed, upgraded, or scaled up or down.

linkerd is a piece of software we use to manage communication between backend services. In a system where thousands of network calls happen every second, linkerd does the routing and load balancing for those calls. To know where to route a request, it relies on receiving service updates from Kubernetes.

Timeline

Two weeks ago: The platform team made changes to our etcd cluster, upgrading it to a new version and expanding it. Previously the cluster consisted of only three nodes (one per availability zone); we grew it to nine nodes (three per zone). Because etcd relies on a distributed quorum, this configuration means we can tolerate losing an entire zone plus a single node in another zone. The upgrade was planned and required no scheduled downtime. We confirmed the cluster was healthy afterwards, but importantly, this change is what later triggered a bug in another system.
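To make the fault-tolerance claim above concrete, here is a small illustrative calculation (not code from our systems): etcd needs a strict majority of members to agree, so the number of node failures a cluster can survive is its size minus its quorum.

```python
def quorum(n: int) -> int:
    """Smallest strict majority of an n-member etcd cluster."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """How many members can be lost while a quorum survives."""
    return n - quorum(n)

for size in (3, 9):
    print(f"{size} nodes: quorum={quorum(size)}, "
          f"can lose {tolerable_failures(size)} nodes")

# 3 nodes: quorum=2, can lose 1 node
# 9 nodes: quorum=5, can lose 4 nodes
#   -> a whole zone (3 nodes) plus one node in another zone
```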
One day ago: One of our teams, working on a new feature for the current account, deployed a new service into the production environment and noticed a problem with it. As a precaution, they scaled the service down to zero running replicas; however, the Kubernetes Service object (its name and address entry) still existed.

14:10: An engineer deployed a change to the service that handles current account payments. This is not unusual: to minimise the risk of any individual change, we ship changes in small, granular increments, frequently, using a repeatable and well-defined process. However, when the deployment completed, all requests to the service began to fail. At this point current account payments started failing. Prepaid cards were not affected, because they do not use the failing service.

14:12: We rolled back the deployment. Rolling back is a standard part of our release process, and when interfaces change we keep them forward compatible so that a rollback is safe. In this case, however, the errors persisted even after the rollback, and payments still could not complete.

14:16: We declared an internal incident. Team members gathered to assess the impact and start debugging.

14:18: Engineers identified that linkerd appeared to be in an unhealthy state and tried to use an internal tool to identify the affected instances and restart them. As mentioned earlier, linkerd is the system we use to manage communication between backend services: to know where to send a particular request, it takes a logical name from the request, such as service.foo, and converts it into an IP address and port. In this case, linkerd had not received updates from Kubernetes about the new pods running on the network, so it was trying to route requests to IP addresses that no longer corresponded to running processes.

14:26: We decided the best course of action was to restart all of the linkerd instances on the backend, hundreds of them, on the assumption that they all had the same problem. While this was happening, many engineers worked to minimise the impact on customers making payments or receiving bank transfers by activating internal processes that are designed to act as a backup. This meant most customers could still use their cards successfully despite the ongoing disruption.

14:37: The replacement linkerd instances failed to start, because the kubelets running on each of our nodes could not retrieve the appropriate configuration from the Kubernetes apiservers. At this point we suspected a separate problem with Kubernetes or etcd and restarted the three apiserver processes. Once that was done, the replacement linkerd instances were able to start successfully.

15:13: All the linkerd pods had been restarted, but services handling thousands of requests per second were now receiving no traffic at all. Customers were completely unable to refresh their feed or balance in the Monzo app, and our internal COps ("Customer Operations") tooling stopped working. The problem had now escalated into a full platform outage, with no services able to serve requests. As you can imagine, almost every automated alert we have was firing.

15:27: We noticed that linkerd was logging a NullPointerException (http://t.cn/Rl086mW) when trying to parse the service discovery response from the Kubernetes apiserver. This turned out to be an incompatibility between the versions of Kubernetes and linkerd we were running, specifically in the code that resolves services. Because a newer version of linkerd containing the fix had already been tested in our staging environment for a couple of weeks, engineers on the platform team began deploying that new version in an attempt to roll forward.

15:31: After inspecting the code change, the engineers realised they could avoid the parsing error by removing Kubernetes Services that have no endpoints (that is, the service mentioned earlier that had been scaled down to zero replicas as a precaution). They deleted the offending Service, linkerd successfully loaded the service discovery information, and the platform returned to normal: traffic began to flow between services again and payments started to work. The incident was over.

At this point, although we had brought the system back online, we did not yet understand the root cause of the problem.
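As an illustration of the kind of object that triggered the parsing error, the hypothetical sketch below uses the official Kubernetes Python client (not a tool from this incident) to list Services in a namespace whose Endpoints contain no addresses, i.e. Services that still exist but have been scaled down to zero running replicas.

```python
# Hypothetical diagnostic sketch: find Services whose Endpoints have no
# ready addresses (e.g. a Deployment scaled to zero replicas while its
# Service object is still present).
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()          # or config.load_incluster_config()
v1 = client.CoreV1Api()
namespace = "default"              # assumption: adjust to your namespace

for svc in v1.list_namespaced_service(namespace).items:
    try:
        ep = v1.read_namespaced_endpoints(svc.metadata.name, namespace)
    except ApiException:
        continue  # e.g. services without a matching Endpoints object
    subsets = ep.subsets or []
    addresses = [a for s in subsets for a in (s.addresses or [])]
    if not addresses:
        print(f"Service {svc.metadata.name} has no ready endpoints")
```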
Because of how frequently we deploy, and because the platform responds automatically to node and application failures, the backend network is highly dynamic, so it is essential that we can trust our deployment and request routing subsystems. We later found a bug in the Kubernetes and etcd clients (https://github.com/kubernetes/kubernetes/issues/47131) that caused requests to time out after the cluster reconfiguration we had performed the previous week. Because of these timeouts, linkerd stopped receiving updates from Kubernetes about where services were running on the network when new code was deployed. Although well-intentioned, the etcd reconfiguration was therefore what put the platform into the fragile state that made this outage possible.
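One common defensive pattern against exactly this failure mode, where a watch silently stops delivering updates, is to bound how long any single watch is trusted and to periodically relist the full state. The sketch below is an illustrative example using the Kubernetes Python client; it is not the mechanism linkerd or Monzo actually use, and the names and timeouts are assumptions.

```python
# Illustrative pattern: never trust a single long-lived watch forever.
# Periodically relist Endpoints and re-open the watch so that missed or
# timed-out updates are eventually corrected.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()
namespace = "default"   # assumption: adjust to your namespace

def resync(table: dict) -> None:
    """Rebuild the full name -> addresses table from a fresh list."""
    table.clear()
    for ep in v1.list_namespaced_endpoints(namespace).items:
        addrs = [a.ip for s in (ep.subsets or []) for a in (s.addresses or [])]
        table[ep.metadata.name] = addrs

routing_table: dict = {}
while True:
    resync(routing_table)              # full relist to correct any drift
    w = watch.Watch()
    # timeout_seconds bounds how long we rely on this watch before
    # falling back to another full relist.
    for event in w.stream(v1.list_namespaced_endpoints, namespace,
                          timeout_seconds=300):
        ep = event["object"]
        if event["type"] == "DELETED":
            routing_table.pop(ep.metadata.name, None)
            continue
        addrs = [a.ip for s in (ep.subsets or []) for a in (s.addresses or [])]
        routing_table[ep.metadata.name] = addrs
```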
