ElasticSearch is one of the most widely used full-text search engines available to developers these days. Its ease of setup makes it almost a no-brainer. But that ease can lull you into a terrible false sense of security.
This series of posts will go into detail on all of the hard lessons we learned at PlayFab as we built a cluster that went from 0 to 3 billion events.
A Quick Topology
PlayFab runs two ElasticSearch clusters. The first is a fairly standard rolling-log-style cluster that is used for PlayStream events. Events cover anything from "player logged in" to "developer changed security key". On average we currently generate 2,000 events per second, not an insane number by any means. The second cluster contains what we refer to as a player's profile: a snapshot of a player's state that is effectively a "rolled up" version of their events. So as events for a player come through, we write the original event into one cluster, then use the event to update the player's profile and write the updated profile to the other cluster. ElasticSearch does not like the second workload: overwriting a complex document with nested documents is incredibly heavy on the CPU. Even though the profile cluster contains less data, and we have performance optimizations that reduce the writes to about 500 per second, it takes more nodes to handle it than the event cluster, which handles 4 times as much data per second.
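To make the two write paths concrete, here is a minimal sketch of the flow described above. It is not our production code: the two clusters are replaced by in-memory Python structures, and the names `event_index`, `profile_index`, `apply_event`, and `handle_event`, along with the profile fields, are all made up for illustration.

```python
# Hypothetical in-memory stand-ins for the two clusters. A real system
# would talk to ElasticSearch here instead.
event_index = []      # append-only, like a rolling log index
profile_index = {}    # one document per player, overwritten in place


def apply_event(profile, event):
    """Fold one event into a rolled-up profile snapshot (illustrative fields only)."""
    updated = dict(profile)
    updated["event_count"] = profile.get("event_count", 0) + 1
    updated["last_event"] = event["name"]
    return updated


def handle_event(event):
    # 1) Write the original event to the event cluster (a cheap append).
    event_index.append(event)
    # 2) Roll the event into the player's profile and overwrite the whole
    #    document in the profile cluster (the expensive path).
    player_id = event["player_id"]
    profile = profile_index.get(player_id, {"player_id": player_id})
    profile_index[player_id] = apply_event(profile, event)


handle_event({"player_id": "p1", "name": "player_logged_in"})
handle_event({"player_id": "p1", "name": "player_logged_out"})
```

The event side only ever appends, while the profile side rewrites a full document on every event. That full rewrite is the step that turns into heavy CPU work once the document has nested documents inside it.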
Why Not to Use AWS ElasticSearch
As AWS customers who don't want to manage infrastructure too directly, we tend to look at either a) AWS options or b) hosted options we can run in AWS. While the prices for AWS ES were good, we ran into some really comical issues.
When we first started, only ES 1.5 was available on AWS, even though ES 2.x was already shipping. Worse than that, it was a custom version of 1.5 that made unknown changes to the underlying platform, so not only are you running one of the oldest versions, you're running an undocumented old version. The biggest issues revolved around not being allowed to tweak cluster settings. In a lot of cases AWS makes perfectly fine choices. But if you run into a "100% CPU Death Spiral" (coming up later), the fastest way out of that is to change cluster settings. Instead, on AWS, you have to turn off traffic and pray it all comes back (true story, also later).
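For readers who haven't changed cluster settings before, here is a rough sketch of what such a change looks like on a self-managed cluster. It only builds the request for ElasticSearch's cluster update settings endpoint (`PUT /_cluster/settings`) and does not send anything. The helper name `build_cluster_settings_request` is hypothetical, and the setting shown, `cluster.routing.allocation.enable`, is just one example of a dynamic setting, not necessarily the one we reached for.

```python
import json


def build_cluster_settings_request(settings, persistent=False):
    """Return (method, path, body) for a cluster settings update.

    Transient settings are dropped on a full cluster restart;
    persistent settings survive one.
    """
    scope = "persistent" if persistent else "transient"
    body = json.dumps({scope: settings}, sort_keys=True)
    return "PUT", "/_cluster/settings", body


# Example only: temporarily stop shard allocation while the cluster recovers.
method, path, body = build_cluster_settings_request(
    {"cluster.routing.allocation.enable": "none"}
)
```

On your own cluster you would send that body with whatever HTTP client you like. On the AWS-managed service at the time, this kind of tweak was exactly what we could not do.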
Eventually AWS launched 2.3. We attempted to migrate to it, but it turned out there were performance issues specific to 2.3 that forced us back to 1.5 (this is specific to how we overwrite documents for the profiles).
AWS now has ES 5.1 and 5.3 available. Unfortunately we weren't aware these were coming, so we were forced to make the decision to move away from AWS ES without testing them.
We also wanted to move off of the i2 instances and onto i3, something their clusters didn't allow for whatever reason. In the same vein, you can get better performance with your own cluster by taking advantage of the Application ELB and Target Groups, and by using Security Groups to control which nodes belong to which cluster within a VPC. And finally, how in the world does AWS ship a product that has no ability to auto scale out of the box?