Dynamic Stage Routing / Multi-Cluster Setup with Fargate
I have a Fargate cluster with a service that runs two containers:
- a container running nginx for terminating mTLS (it accepts a defined list of CAs) and forwarding calls to the app container with the DN of the client certificate
- a Spring app running on Tomcat which does fine-grained authorization checks (per route & HTTP method) based on the incoming DN via a filter
The endpoints from nginx are exposed to the internet via a NAT gateway.
Infrastructure is managed via Terraform, and rolling out a new version is done by replacing the task definition so that it points to the new images in ECR. ECS then starts the new containers and switches the DNS over to them within 5 to 10 minutes.
Problems with this setup:
- I can't do canary or blue/green deployments
- If the new app version has issues (the app is not able to start, we have huge error spikes, ...), the rollback takes a lot of time.
- I can't run integration tests against a new version of my service without deploying it, which risks breaking everything.
What I'm aiming for is a concept with multiple clusters and routing based on a specific header, so that I can spin up a new cluster with my new app version and no traffic is routed to it until I either a) send a specific header or b) completely switch to the new version, for example via a specific SSM parameter.
This is basically what you can easily do on CloudFront with Lambda@Edge for static frontend deployments (using multiple origin buckets and switching the origin with Lambda based on the incoming request).
As I have the requirement for mTLS and those fine-grained authorisations, I can use neither a standard ALB nor API Gateway.
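To make the intended rule concrete, here is a minimal sketch of the "header or global switch" decision in Python. It is not tied to any AWS service. The header name `x-target-stage` and the stage labels `blue` and `green` are hypothetical, and `active_stage` stands in for whatever the global switch (for example the SSM parameter) currently says.

```python
from typing import Mapping

# Hypothetical names: the header and the stage labels are illustrative only.
STAGE_HEADER = "x-target-stage"
STAGES = ("blue", "green")


def pick_stage(headers: Mapping[str, str], active_stage: str) -> str:
    """Return the stage that should serve a request.

    active_stage is the globally active version (assumed to be read from
    something like an SSM parameter); a valid stage header overrides it.
    """
    if active_stage not in STAGES:
        raise ValueError(f"unknown active stage: {active_stage!r}")
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value.strip().lower() for name, value in headers.items()}
    requested = normalised.get(STAGE_HEADER)
    # Only honour the header if it names a known stage; otherwise fall back.
    if requested in STAGES:
        return requested
    return active_stage
```

With `active_stage="blue"`, a request without the header stays on blue, while a request carrying `X-Target-Stage: green` reaches the new version. Where this decision would run (nginx, a proxy in front of both clusters, ...) is left open here.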
Are there any other smart solutions for my requirements?
Solution 1:[1]
To finally solve this, we went on to replicate the task definitions (xxx-blue and xxx-green) and ELBs, and created two different A records. The deployment process:
- finding out which task definition is inactive by checking the weights of both CNAMEs (one will have 0% weight)
- replacing the inactive definition with one that points to the new images in ECR
- waiting for the apps to become healthy
- switching the traffic via the CNAME records to the ELB of the replaced task definition
- running integration tests and verifying that there are no log anomalies
- (manually triggered) setting the desired task count of the other task definition to zero to scale the old version down. Until then, if there is unexpected behaviour, the A records can be used to switch the traffic back to the ELB of the old task.
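The first and fourth steps of this process could be sketched roughly as follows. This is only an illustration under assumptions, not our actual deployment script: the record name, the ELB host names and the `blue`/`green` identifiers are made up, and the record type `CNAME` simply follows the wording above. The returned dictionary has the shape that boto3's Route 53 client expects as `ChangeBatch` in `change_resource_record_sets`.

```python
from typing import Dict


def find_inactive(weights: Dict[str, int]) -> str:
    """Return the colour whose weighted record currently has weight 0."""
    inactive = [colour for colour, weight in weights.items() if weight == 0]
    if len(inactive) != 1:
        raise ValueError(f"expected exactly one record with weight 0, got {weights}")
    return inactive[0]


def build_switch_batch(record_name: str, targets: Dict[str, str],
                       new_active: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that moves all traffic to new_active.

    targets maps each colour to the DNS name of its ELB (hypothetical values).
    """
    if new_active not in targets:
        raise ValueError(f"unknown colour: {new_active!r}")
    changes = []
    for colour, elb_dns_name in targets.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": colour,  # distinguishes the weighted records
                "Weight": 100 if colour == new_active else 0,
                "TTL": ttl,
                "ResourceRecords": [{"Value": elb_dns_name}],
            },
        })
    return {"Comment": f"switch traffic to {new_active}", "Changes": changes}
```

Switching back after a bad release is the same call with the old colour as `new_active`, which is why the old tasks are only scaled down manually afterwards.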
What we didn't achieve with this: having client-based routing to different tasks.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | tpschmidt |