Watching your AWS cluster fluctuate nodes is about as bad as it gets. There's so little insight into why, you're stuck making a lot of uneducated guesses. You turn off your bulk indexing and hope that helps. You try and kill client connections and hope that helps. Neither did. It turns out it's taking it self down trying to recover. In a lot of information around ES they talk about keeping your node count and shard count as low as possible. What they don't really get too far into is why though.

We caused this issue by having too many smaller nodes. While ES lets you make a cluster with a c4 (it's the default!) it's actually a really bad idea. They also let you use EBS which is a worse idea. ElasticSearch is heavily disc dependent, any slow disc I/O can spell doom. At a minimum it needs to be attached storage, if not attached SSD or event better attached NVMe storage.

As nodes started to run hot the cluster would mark them unhealthy too quickly (something you can configure via cluster settings, but you can't on AWS). Because of this hot nodes were never allowed to recover... They were just terminated. This caused a new node to be added to the cluster and the replication to start recovering data... Which leads to more hot nodes because now a good portion of the capacity is being used for recovery... Which causes another node to go down... And another to come up and start replication. It took nearly 24 hours of total access cut off for this to just stop. If we had been able to modify a handful of settings we could have stopped this.

How to Recover from Recovery Death Spiral

If you find yourself in a similar situation where nodes are being killed because too much recovery is occurring here is how you can get your cluster out of it:


1) Disable shard re-balancing.

PUT /_cluster/settings { "transient": { "cluster.routing.rebalance.enable": "none" } }

Why: Shard re-balancing occurs when the master believes that a node is underutilized compared to some other node. During recovery operations the setting cluster.routing.allocation.allow_rebalance should be at the default of indices_all_active meaning that all primary and replica shards are green. However it can kick in immediately after all shards turn green and not give the cluster a proper amount of time to cool down, especially if you've disabled indexing and querying on the cluster during recovery. Once external load has been placed back on the cluster turn this setting back on.

To restore: PUT /_cluster/settings { "transient": { "cluster.routing.rebalance.enable": "all" } }

2) Set the rate limit shard recovery to 1.

PUT /_cluster/settings { "transient": { "cluster.routing.allocation.node_concurrent_recoveries": 1 } }

Why: This setting is poorly named, this indicates how many inbound and outbound recoveries a node can handle at once. If you have 30 nodes and this is 1 the cluster may attempt to recovery 30 shards at the same time. If it is at the default of 2 it can attempt 60.

3) Disable marking a node as out of service for a longer time.


PUT _all/_settings { "settings": { "index.unassigned.node_left.delayed_timeout": "15m" } }

Why: You want this setting high enough so that if the ElasticSearch service on a node becomes unresponsive due to load the Master will not immediately remove it and try to recover. The service on the node, if it crashes, should come back up on its own.

We've since had this same death spiral attempt to happen but because of being able to take the above actions we've been able to mitigate it with no down time.

About Author

Siva Katir

Siva Katir

Principal Software Engineer working on PlayFab at Microsoft in Redmond, Washington.