Managing dependencies between apps when using AWS ECS Service Connect

Reproduce the scenario with my CDK example: https://github.com/foyst/ecs-service-connect-dependencies

I’ve been working on a project recently that made use of AWS Elastic Container Service (ECS) Service Connect. Since I last used ECS, AWS has introduced service discovery capabilities that aim to solve the challenges of orchestrating services in distributed, fluid environments.

ECS Service Connect differs from ECS Service Discovery: it still uses AWS Cloud Map under the hood as a service catalogue, but relies on API-based discovery rather than DNS. This means it can detect application outages and update service entries more quickly.

The evolution of integration options available in ECS over time

It’s kinda like an “AWS App Mesh-light”, in that you don’t need the extensive configuration and resources involved in setting up a service mesh, and it sits nicely within the ECS ecosystem as well. It feels like an extension of App Mesh, as it utilises the same Envoy sidecar proxy under the hood to intercept and dynamically route traffic to application instances.

A challenge I recently encountered with this, though, was that applications running in ECS would sporadically fail to resolve other applications running within the same ECS cluster and Cloud Map namespace. Redeploying the affected services would resolve the issue, but at deploy time there was no guarantee that connectivity would work.

Turns out, when using Service Connect it matters what order you deploy services into ECS: you need to ensure that your applications start up in dependency order. When deploying services with CloudFormation or CDK there’s a good chance they’ll all get provisioned simultaneously, so that order isn’t guaranteed.

Service Connect and Envoy work by updating the /etc/hosts file within your application’s Docker container, listing the alias used by a service alongside an IP address allocated to the Envoy sidecar.
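If you want to check this from inside a running container, here is a minimal sketch of a helper that looks up an alias in hosts-file text. The function name lookupHostsAlias is my own for illustration and isn't part of any AWS tooling. It only assumes the standard hosts format of an address followed by one or more hostnames.

```typescript
// Returns the addresses that a hosts file maps to the given alias.
// `contents` is the raw text of a hosts file, `alias` the hostname to look for.
// (Hypothetical helper for illustration; not part of ECS or Service Connect.)
function lookupHostsAlias(contents: string, alias: string): string[] {
  const addresses: string[] = [];
  for (const rawLine of contents.split("\n")) {
    // Drop trailing comments and surrounding whitespace.
    const line = rawLine.split("#")[0].trim();
    if (line === "") continue;
    // Standard hosts format: <address> <hostname> [<hostname>...]
    const [address, ...names] = line.split(/\s+/);
    if (names.indexOf(alias) !== -1) addresses.push(address);
  }
  return addresses;
}
```

Passing it the contents of /etc/hosts from the dependent container, along with the alias of the service it needs, returns an empty array when the alias was never written to the file.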

For this example, I’m going to use a trusty old application I’ve recently been reacquainted with… good old WordPress. You can see here that when MySQL is already available in ECS as a Service, Service Connect automatically adds entries for it to the hosts file when WordPress is provisioned:

This makes WordPress happy.

However, if both MySQL and WordPress are deployed at the same time, it’s possible MySQL isn’t registered in time to be included in the WordPress hosts file. This makes WordPress sad:

Now, there are a number of reasons why you may or may not encounter this race condition:

  • If you’re using Fargate, it’s likely you’re defining security groups for each service so it can talk to the others. Defining these and referencing them from other services will likely establish the services’ deployment order implicitly.
  • If you’re using EC2 (like I was), then you might not be going as far as defining security groups if all applications run on the same instance (like mine did). Here you’re more likely to encounter the race condition.

To ensure that services are deployed in the correct dependency order, you can use the “DependsOn” CloudFormation attribute. In CDK it looks like this:

wordpressService.node.addDependency(mySqlService);
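To make the effect more concrete, here is a rough sketch in plain TypeScript of what that dependency amounts to at the template level. It uses a simplified model of a CloudFormation template rather than the CDK library itself. The addDependsOn helper and the logical IDs MySqlService and WordpressService are made up for illustration; CDK generates its own IDs and may attach the dependency to more resources than shown here.

```typescript
// Minimal shape of a CloudFormation template, just enough for this sketch.
interface Resource {
  Type: string;
  Properties?: Record<string, unknown>;
  DependsOn?: string[];
}
interface Template {
  Resources: Record<string, Resource>;
}

// Returns a copy of the template in which `dependent` waits for `dependency`.
// (Hypothetical helper; CDK does the equivalent for you during synthesis.)
function addDependsOn(template: Template, dependent: string, dependency: string): Template {
  const target = template.Resources[dependent];
  if (!target || !template.Resources[dependency]) {
    throw new Error(`Unknown logical ID: ${dependent} or ${dependency}`);
  }
  const existing = target.DependsOn ?? [];
  // Avoid listing the same dependency twice.
  const dependsOn = existing.indexOf(dependency) === -1 ? [...existing, dependency] : existing;
  return {
    ...template,
    Resources: { ...template.Resources, [dependent]: { ...target, DependsOn: dependsOn } },
  };
}

// Hypothetical logical IDs, not the ones CDK would generate.
const template: Template = {
  Resources: {
    MySqlService: { Type: "AWS::ECS::Service" },
    WordpressService: { Type: "AWS::ECS::Service" },
  },
};
const ordered = addDependsOn(template, "WordpressService", "MySqlService");
```

With the DependsOn entry in place, CloudFormation creates the MySQL service before it starts on the WordPress service, rather than provisioning both in parallel.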

I’ve added a reproducible scenario in this GitHub repo. You can comment and uncomment the above CDK code to see the issue appear and disappear.

https://github.com/foyst/ecs-service-connect-dependencies

Hopefully this is of help to someone, as it was a fun issue to diagnose!