New Relic’s container orchestration team, known internally as Container Fabric (CF), builds and maintains the company’s platform for deploying and maintaining containerized services. Our developers build releases as Docker images, and we run those as containers on our platform. The environment we manage isn’t massive by modern enterprise standards, but it’s large and growing rapidly. We currently manage some 1,000 machines, most of which run on physical hardware spread across three infrastructure providers in multiple data centers.

For teams running services on the CF platform, we believe the reliability of those services exists on a spectrum: On the left side of the spectrum, health and readiness checks help us ensure that services are running and ready when they’re deployed. In the middle, monitoring and alerting solutions help us keep our platform observable, and our operators happy and rested. On the right end, happy customers and well-rested operators are the result of development and operations discipline that produces resilient services.

But what makes for effective health checks and monitoring for container-based services? How can we make sure our services really are running, ready, and observable every time? To answer that question, the CF team has evolved a set of best practices for maintaining the integrity of services running on our container platform.

Health checks are critical to make sure an orchestration platform is correctly deploying and managing services at scale, so our developers define a health check for any service they run in a container. To write quality health checks, we consider five areas:

A health check’s response code should represent the true state of the service’s health. A simple “can I respond at all” check is rarely sufficient for a service of any complexity. For example, if a service fails its health check in CF, it’ll be killed and pulled from the load balancer and then restarted. Typically, if a service fails a health check because its dependencies have failed, it’ll start “flapping”: repeatedly restarting until its dependencies are running. From an orchestration perspective, this is generally OK, so long as the service records (in a log line or traced error, for example) why it restarted, so operators or developers can discover and resolve the dependency failure. Still, we advise our developers to include some error handling, as a “flapping” instance can delay functionality for any other service instances that require it to be running.

A service’s health check should respond with a status as quickly as possible. A health check endpoint will be hit many millions of times over the lifetime of a service, so it needs to be fast. For this reason, we advise developers not to send CSS or too much structured data in a check’s response.
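The health check guidance in this post (a status code that reflects the true state of the service, a logged reason for any failure, some error handling, and a small, fast response) can be illustrated with a minimal sketch. This is not CF's actual implementation. The `/health` path, the port, the probe functions `database_reachable` and `queue_reachable`, and the choice of HTTP 503 for an unhealthy service are all assumptions made for illustration, and the example uses only Python's standard library.

```python
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

logger = logging.getLogger("healthcheck")


# Hypothetical dependency probes. A real service would ping its own
# database, queue, or downstream API here, ideally with a short timeout.
def database_reachable() -> bool:
    return True


def queue_reachable() -> bool:
    return True


DEPENDENCY_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": database_reachable,
    "queue": queue_reachable,
}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, bytes]:
    """Return (HTTP status, tiny body) reflecting whether dependencies are up."""
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            # Error handling: a probe that raises counts as a failure,
            # and the traceback is logged so the cause can be found later.
            logger.exception("health probe %r raised an exception", name)
            return 503, b"unhealthy\n"
        if not healthy:
            # Record why the check failed before the platform restarts us.
            logger.error("health check failed: dependency %r unavailable", name)
            return 503, b"unhealthy\n"
    return 200, b"ok\n"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/health":  # assumed endpoint path
            self.send_error(404)
            return
        status, body = evaluate_health(DEPENDENCY_CHECKS)
        # Keep the response small: plain text, no CSS, no large payload.
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; the port number is an assumption."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` would expose the endpoint, and `evaluate_health` can be exercised on its own with stub probes. Keeping the decision logic separate from the HTTP handler is a design choice of this sketch: it makes the "true state" rule easy to test and keeps the handler itself short and cheap.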