Optimizing our Service Stack at DigitalOcean

You can skip to the tl;dr at the bottom if you’d like!

Services

I recently moved to the Spaces team at DigitalOcean, where I help maintain and improve the existing service stack. If you don’t know what Spaces is, it is basically DigitalOcean’s equivalent of AWS S3: an object storage service where users can upload and download files to a bucket through an S3-compatible API.
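To make "S3-compatible" concrete, here is a minimal, hedged sketch of what that looks like from a user’s point of view: the stock AWS SDK for Go pointed at a Spaces endpoint instead of AWS. The endpoint region, bucket name, and credentials below are placeholders, not real values.

```go
package main

import (
	"log"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Point the standard AWS SDK at a Spaces endpoint instead of AWS.
	// Endpoint, region, and credentials here are placeholders.
	sess, err := session.NewSession(&aws.Config{
		Endpoint:    aws.String("https://nyc3.digitaloceanspaces.com"),
		Region:      aws.String("us-east-1"), // the SDK requires a region; the endpoint determines where data lives
		Credentials: credentials.NewStaticCredentials("ACCESS_KEY", "SECRET_KEY", ""),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Upload an object exactly as you would to S3.
	_, err = s3.New(sess).PutObject(&s3.PutObjectInput{
		Bucket: aws.String("my-space"),
		Key:    aws.String("hello.txt"),
		Body:   strings.NewReader("hello from Spaces"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```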

You can imagine that we get thousands of requests a second in our data centers across the globe. A request is routed like so:

Internet -> Envoy -> Rate Limiter -> Internal Services

One thing we want to improve is the customer experience: faster uploads and fewer timeouts. We had been seeing timeouts when Envoy tried to reach our internal services through the rate limiter. Here’s a graph of the latencies we were seeing on the rate limiter before our optimizations:

Grafana Latency Dashboard
Flame Graph of the Rate Limiter Service

Going down the rabbit hole

Now we can see that the p99.9 latencies are definitely not good, especially if we plan to offer an SLA for Spaces! So I began to dig into this rabbit hole. I started by adding traces using LightStep to the various Go services that handle rate limiting. We use a gRPC call from Envoy -> rate limiting service -> Redis cluster to decide whether a request should be rate limited.
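To give a flavor of the instrumentation (this is a hedged sketch rather than our production code; the operation and tag names are made up), LightStep plugs in as a standard OpenTracing tracer, and each rate-limit check gets wrapped in a span so the time spent downstream shows up in the trace:

```go
package main

import (
	"context"

	lightstep "github.com/lightstep/lightstep-tracer-go"
	"github.com/opentracing/opentracing-go"
)

// initTracer registers LightStep as the global OpenTracing tracer.
// The access token is a placeholder.
func initTracer() {
	tracer := lightstep.NewTracer(lightstep.Options{
		AccessToken: "YOUR_LIGHTSTEP_TOKEN",
	})
	opentracing.SetGlobalTracer(tracer)
}

// shouldRateLimit wraps the downstream check in a span so we can see how long
// the rate-limit decision (and its calls toward Redis) actually takes.
func shouldRateLimit(ctx context.Context, key string) (bool, error) {
	span, ctx := opentracing.StartSpanFromContext(ctx, "ratelimit.check")
	defer span.Finish()
	span.SetTag("ratelimit.key", key)

	limited, err := checkBackend(ctx, key)
	if err != nil {
		span.SetTag("error", true)
	}
	return limited, err
}

// checkBackend is a stand-in for the real downstream gRPC/Redis call.
func checkBackend(ctx context.Context, key string) (bool, error) { return false, nil }

func main() {
	initTracer()
	_, _ = shouldRateLimit(context.Background(), "ratelimit:demo")
}
```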

Furthermore, our rate limiting service actually made a second gRPC call to another Go service, which in turn talks to Redis. The traces revealed that the latency of this second gRPC hop was unpredictable.

Things we tried:
– increase GOMAXPROCS (see the sketch after this list)
– re-architect our rate limiter design
– increase the Redis cluster size in hot regions
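On the first item: GOMAXPROCS controls how many OS threads can execute Go code at once. A minimal sketch of checking and overriding it (the value 8 is purely illustrative; in Kubernetes it is often derived from the pod’s CPU limit, for example via uber-go/automaxprocs):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Passing 0 reads the current setting without changing it. By default it
	// is the number of CPUs the process can see, which inside a container is
	// the node's CPU count rather than the pod's CPU limit.
	fmt.Println("GOMAXPROCS before:", runtime.GOMAXPROCS(0))

	// Override it explicitly (8 is just an illustrative value).
	runtime.GOMAXPROCS(8)
	fmt.Println("GOMAXPROCS after:", runtime.GOMAXPROCS(0))
}
```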

I definitely came across several red herrings while going down this rabbit hole. First, I used flame graphs to see where time was being spent. The wider a bar is, the more CPU samples that function accounted for; the colors are mostly there for readability rather than meaning. We can ignore everything to the right of the runtime bars, as those functions weren’t in the direct hot path.
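If you want to reproduce this kind of analysis, one common way to get flame graphs for a Go service is from pprof CPU profiles. A minimal sketch of exposing the built-in pprof endpoints (the port is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a side port; a CPU profile can then be pulled with
	//   go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
	// and viewed as a flame graph in the pprof web UI.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```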

The biggest thing that stood out to me was the internal gRPC balancer we used to randomly select a backend Go service that fronted Redis. It definitely didn’t make sense to have a separate service that was only ever used by its parent. The red herring, though, turned out to be the mux parser that checks whether an incoming request is a valid S3 one.

Ripping out that Go Redis service and calling Redis directly from the parent really helped! You can somewhat see it in the p99.9 values in the graphs below: it basically brought the latency down from ~50ms to under 10ms.
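The replacement amounts to something like the following. This is a sketch under assumptions, not our actual code: the go-redis cluster client, the fixed-window counter, and the key and limit values are all illustrative. The point is simply that the parent service talks to the Redis cluster itself instead of hopping through another gRPC service.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// allow is a simple fixed-window check: INCR the counter for this key and set
// a TTL in the same pipeline, then compare against the limit. (Refreshing the
// TTL on every call is a simplification of a strict fixed window.)
func allow(ctx context.Context, rdb *redis.ClusterClient, key string, limit int64, window time.Duration) (bool, error) {
	pipe := rdb.TxPipeline()
	count := pipe.Incr(ctx, key)
	pipe.Expire(ctx, key, window)
	if _, err := pipe.Exec(ctx); err != nil {
		return false, err
	}
	return count.Val() <= limit, nil
}

func main() {
	// Placeholder cluster addresses.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-0:6379", "redis-1:6379", "redis-2:6379"},
	})
	ok, err := allow(context.Background(), "ratelimit:spaces:203.0.113.7", 100, time.Second)
	fmt.Println(ok, err)
}
```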

Another thing I noticed was that our Redis latencies weren’t very consistent either. Sometimes the load was too high on individual pods, so I built a headless service in Kubernetes that lets the caller round-robin across them. I don’t think this had much impact on its own, but increasing the replicas did: we bumped the replica count to 5, which reduced the variance in latencies to Redis. Now most calls to Redis take only ~7ms.
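For context, a headless Service (one with clusterIP: None) makes Kubernetes DNS return one record per ready pod instead of a single virtual IP, so the client can spread load itself. A hedged sketch of that client side (the service name is made up):

```go
package main

import (
	"fmt"
	"net"
	"sync/atomic"
)

// roundRobin re-resolves a headless Service name, which yields one A record
// per ready pod, and hands out addresses in rotation.
type roundRobin struct {
	host string
	port string
	next uint64
}

func (rr *roundRobin) pick() (string, error) {
	addrs, err := net.LookupHost(rr.host)
	if err != nil {
		return "", err
	}
	if len(addrs) == 0 {
		return "", fmt.Errorf("no addresses for %s", rr.host)
	}
	i := atomic.AddUint64(&rr.next, 1)
	return net.JoinHostPort(addrs[i%uint64(len(addrs))], rr.port), nil
}

func main() {
	// Placeholder name for a headless Redis Service.
	rr := &roundRobin{host: "redis-headless.spaces.svc.cluster.local", port: "6379"}
	addr, err := rr.pick()
	fmt.Println(addr, err)
}
```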

Expanding on Redis a bit more: we had a local cache that kept a copy of the remaining tokens from Redis, and we only refreshed it when the cached count dropped below zero. Ripping this out meant every query went directly to Redis. I’m still not sold on whether this actually reduced latency, but it definitely raises the idea of doing more network-level analysis to understand why the latencies are so unpredictable.
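As I understand the pattern that was removed, it looked roughly like this (a sketch with made-up names, not the real implementation): keep a local count of remaining tokens, serve decisions from it, and only go back to Redis once the local count is exhausted. It trades Redis round trips for some staleness, which is part of why it is hard to reason about.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// tokenCache keeps a local copy of the remaining tokens and only refreshes
// from Redis once the local count drops below zero. Names are made up.
type tokenCache struct {
	mu        sync.Mutex
	remaining map[string]int64
	refill    func(ctx context.Context, key string) (int64, error) // fetches the count from Redis
}

func (c *tokenCache) allow(ctx context.Context, key string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	c.remaining[key]--
	if c.remaining[key] >= 0 {
		return true, nil // served from the local copy, no Redis round trip
	}

	// Local copy exhausted: refresh from Redis and decide again.
	n, err := c.refill(ctx, key)
	if err != nil {
		return false, err
	}
	c.remaining[key] = n - 1
	return c.remaining[key] >= 0, nil
}

func main() {
	c := &tokenCache{
		remaining: map[string]int64{},
		refill: func(ctx context.Context, key string) (int64, error) {
			return 100, nil // stub: the real version would query Redis
		},
	}
	ok, _ := c.allow(context.Background(), "ratelimit:demo")
	fmt.Println(ok)
}
```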


tl;dr Lots of Tracing + Re-architecting!

Finally, after all those changes, we have this! It’s not perfect yet, but it’s a good starting point, and I definitely learned a lot of tips and tricks along the way. Some things we’re examining for future improvements: a better way to do local caching, and analyzing the latencies between Envoy and the rate limiting service.

Want to see more content? Please consider subscribing! It goes a long way 🙂
