Using the Retry pattern to make your cloud application more resilient
This post was authored by Jason Haley, Microsoft Azure MVP.
Recently, I was at Boston Code Camp catching up with some old friends and looking to learn about containers or anything that could help me in my current project of migrating a microservices application to run in containers. I was speaking with one friend who had just presented a session on Polly, and he made a comment that got my attention. He said that one of the attendees at his session was under the impression that using the cloud would make his application inherently resilient and he would not need any of the features that Polly provides.
In case you are not familiar with Polly, you can use this library to easily add common patterns like Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback to your code to make your system more resilient. Scott Hanselman recently wrote a blog post: Adding Resilience and Transient Fault handling to your .NET Core HttpClient with Polly, discussing how he was using Polly and HttpClient with ASP.NET Core 2.1.
What that attendee may have been referring to is that most Azure services and client SDKs have features to perform retries for you (which can play a large part in making your application resilient), but in some cases you need to specifically set up the retry logic. Also keep in mind that your third-party client SDKs may need retry logic turned on in diverse ways. Your application will not automatically become resilient just by putting it in the cloud.
What is resiliency?
Resiliency is the capability to handle partial failures while continuing to execute and not crash. In modern application architectures — whether it be microservices running in containers on-premises or applications running in the cloud — failures are going to occur. For example, applications that communicate over networks (like services talking to a database or an API) are subject to transient failures. These temporary faults cause lesser amounts of downtime due to timeouts, overloaded resources, networking hiccups, and other problems that come and go and are hard to reproduce. These failures are usually self-correcting.
You can’t avoid failures, but you can respond in ways that will keep your system up or at least minimize downtime. For example, when one microservice fails, its effects can cause the system to fail.
With modern application design moving away from the monolithic application toward microservices, resiliency becomes even more important due to the increased number of components that need to communicate with each other.
How can you make your system more resilient?
Lately I’ve been reading another great resource, the .NET Microservices: Architecture for Containerized .NET Applications e-book, which also has a reference microservice application on GitHub called eShopOnContainers. Written in .NET Core 2.1, eShopOnContainers is a microservices architecture that uses Docker containers. I’ll be referring to the reference application for sample code. Keep in mind, there are some new technologies that promise to help make service-to-service communication more resilient without adding code, like a service mesh, which I won’t be discussing here. I want to look at what type patterns we can use in code. Let’s look at examples of a couple of resiliency patterns: Retry and Circuit Breaker. The Azure Architecture Center covers several resiliency patterns that you can use in your application.
Retry
Retries can be an effective way to handle transient failures that occur with cross-component communication in a system. As I mentioned, most Azure services and client SDKs have features for performing retries. One example when working with a database is to use Entity Framework Core and EnableRetryOnFailure to configure a retry strategy. In the eShopOnContainers code, you can see an example of this by looking at the Startup.cs file in the Catalog.API project. In the AddCustomDbContext method of the CustomExtensionsMethods utility class towards the bottom where it configures CatalogContext, you’ll see the usage of EnableRetryOnFailure.
The code above shows Entity Framework Core is to retry database calls up to 10 times before failing and to add some time delay between retries — but not delay more than 30 seconds. This is using the default execution strategy (there are others). If you want to know more about configuring Entity Framework Core to automatically retry failed database calls, you can find the details at Connection Resiliency.
Circuit Breaker
Developers often use the Circuit Breaker and Retry patterns together to give retrying a break. Retry tries an operation again, but when it doesn’t succeed, you don’t always want to just try it one more time or you may risk prolonging the problem (especially if the failure is due to a service being under a heavy load). The Circuit Breaker pattern effectively shuts down all retries on an operation after a set number of retries have failed. This allows the system to recover from failed retries after hitting a known limit and gives it a chance to react in another way, like falling back to a cached value or returning a message to the user to try again later.
Earlier this year when I read the .NET Microservices: Architecture for Containerized .NET Applications e-book, I found the especially useful HttpClient wrapping class named ResilientHttpClient in the chapter on resiliency. This utility class was the reason that I cloned the GitHub repo and started learning the eShopContainers code. Following the codebase update to use .NET Core 2.1, refactoring removed that utility class in favor of using new features that do the same thing. But if you are curious or are not using .NET Core 2.1 yet, you can find the code in ResilientHttpClient.cs on GitHub. The HttpInvoker method is the heart of this utility.
This method uses Polly to make a call using an HttpClient with an exponential back-off Retry policy and a Circuit Breaker policy that will cause retries to stop for a minute after hitting a specified number of failed retries. The last line in the method is the one that makes the call by executing the passing in action. This may not make a lot of sense just from the small code snippet, but Section 8 of the e-book, Implementing Resilient Applications, goes into detail on how the ResilientHttpClient utility class works. The polices are in the CreateResilientHttpClient method of the ResilientHttpClientFactory class:
You should be able to understand what the policies are, but again I refer you to the e-book for a detailed explanation. Also, if you are using .NET Core 2.1, the e-book’s update has useful information on how to configure and use the new HTTPClientFactory to do the same thing.
Summary
Running your application in containers or in the cloud does not automatically make your application resilient. It’s up to you to configure the features that will enable the retry logic you provide. When you need retry logic added to your system, you should use a library such as Polly to speed up your implementation. Or, if you are exploring how to add resiliency without code, you should investigate service mesh products like Istio and Linkerd.
If you use HttpClient in your applications to call APIs, you should download the .NET Microservices: Architecture for Containerized .NET Applications e-book and clone the GitHub repo. The e-book discusses the reference architecture in depth to help you understand microservices architecture.
There’s also this recording of an Ignite 2017 breakout session about the e-book and eShopOnContainers project: Implement microservices patterns with .NET Core and Docker containers.
Source: Azure Blog Feed