Elasticsearch & SmartOS
The Postmark team has been busy with our continuing effort to double down on Elasticsearch. More recently, we have focused on increasing the stability and availability of our infrastructure, with automation and the adoption of the SmartOS platform at the forefront of those goals. If you’re considering Elasticsearch on SmartOS, I encourage you to read on!
The start of something beautiful #
Late last year we migrated some of our virtualized environments to SmartOS. The benefits of the platform were immediate. With DTrace, processes running in production can be inspected with negligible impact on performance. Process isolation provided by Zones ensures that a misbehaving process cannot impact other processes. ZFS maintains our data integrity and allows us to expand storage without concern. These features and experiences compelled us to evaluate how well Elasticsearch and SmartOS would work together.
The features provided by SmartOS heavily influenced our decision to pursue it as the platform for Elasticsearch. The ZFS filesystem provides unsurpassed data integrity features, native send and receive support for datasets, and a performance-focused cache. Comparing ZFS to alternative filesystems seemed almost unfair; these features simply don’t exist in the alternatives.
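To give a flavour of what that buys you in practice, replicating a dataset snapshot to another machine takes nothing more than the built-in tooling. The pool, dataset, and host names below are illustrative placeholders rather than our production layout:

# take a point-in-time snapshot of the dataset
zfs snapshot zones/elasticsearch@nightly
# stream the snapshot to a pool on another host over ssh
zfs send zones/elasticsearch@nightly | ssh backup-host zfs receive tank/elasticsearch-backup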
Initially we were concerned about diverging from the standard practice of deploying onto Linux. The advantages offered by SmartOS were attractive, but whether Elasticsearch would function and perform well on SmartOS was a question we needed answered. We reached out to the SmartOS community and found that many high-profile projects have run, and continue to run, Elasticsearch on SmartOS. With that validation from the community, we continued to dig in and see for ourselves.

Our first step was to prepare an environment for evaluation. With SmartOS already responsible for our virtualization, we simply provisioned a new SmartOS instance. Although there are images with Elasticsearch preinstalled, we opted to start with a minimal image, allowing us to install the most recent release without the potential for conflicts. With a running instance we installed Elasticsearch, then added test data for benchmarking. We compared I/O to other platforms and observed improvements with SmartOS, which came as no surprise given our history with the ZFS filesystem. CPU and memory utilization showed a negligible difference between platforms when compared to bare metal. However, when compared to other platforms offering operating system virtualization, the maturity and completeness of SmartOS shone through, convincing us to continue our journey with the platform.
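For reference, provisioning a zone like the one we evaluated boils down to importing a base image and handing vmadm a JSON manifest. The image UUID, alias, sizing, and network settings below are placeholders, not our actual configuration:

# import a minimal base image (UUID is a placeholder for whichever image you choose)
imgadm import 00000000-0000-0000-0000-000000000000
# describe the zone in a manifest; memory and quota values are illustrative
cat > es-zone.json <<'EOF'
{
  "alias": "es-node-1",
  "brand": "joyent",
  "image_uuid": "00000000-0000-0000-0000-000000000000",
  "max_physical_memory": 8192,
  "quota": 100,
  "nics": [{ "nic_tag": "admin", "ip": "dhcp" }]
}
EOF
# create and boot the zone
vmadm create -f es-zone.json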
Making the invisible visible #
The capability to truly understand what a system is doing is crucial to resolving issues and identifying bottlenecks. As such, profiling application behavior is a non-negotiable requirement of any platform we consider. Traditionally, tools like strace, iostat, tcpdump, and lsof were used to gain insight into system and application behavior; however, these tools only scratch the surface and require constant context switching. They also impact system performance, resulting in a reluctance to use them in production environments. DTrace, provided by SmartOS, offers significantly more visibility than traditional tools, with negligible impact on performance. As an example, finding all files opened by our Elasticsearch process is a matter of executing the following one-liner:
dtrace -n 'syscall::open*:entry /execname == "java"/ { @[execname, copyinstr(arg0)] = count() } tick-1sec { printa(@); trunc(@) }'
Gaining this type of insight into production processes is incredibly useful and powerful. If you’re keen to learn more about DTrace, I strongly recommend reading the DTrace Guide.
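In the same spirit, a quick way to see which system calls the Elasticsearch JVM is making most often is a one-liner along these lines (illustrative, not a prescribed recipe):

dtrace -n 'syscall:::entry /execname == "java"/ { @[probefunc] = count() }'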
The lonely zone #
We wanted the ability to provide a truly isolated environment for processes. SmartOS satisfies this requirement with zones, which isolate processes, networking, and namespaces. Resources are allocated to a zone, and the processes within it are restricted from exceeding those limits. These limits ensure that should a process misbehave, it will not impact others, and the underlying server will remain available. If you’ve ever been unable to connect to a server due to load, this solves your problem.
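As a rough sketch of how those limits are applied, caps on a running zone can be adjusted from the global zone with vmadm. The alias and values below are hypothetical:

# look up the zone created for Elasticsearch (alias is a placeholder)
ES_ZONE_UUID=$(vmadm lookup alias=es-node-1)
# cap it to two CPUs' worth of time and 8 GB of RAM (values are illustrative)
vmadm update "$ES_ZONE_UUID" cpu_cap=200 max_physical_memory=8192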
Chef on the run #
We need to be able to quickly, reliably, and repeatedly deploy Elasticsearch. Being proponents of automation, we knew this was achievable. We automate with Chef and looked to the community for a suitable cookbook. Unfortunately, at the time, no cookbook was able to scratch our itch, so we decided to roll up our sleeves and write our own. The result is a cookbook that deploys easily to both SmartOS and RHEL platforms, supports plugin installation, and provides two new resources for you to include in your own cookbooks. You can find the cookbook on GitHub today. If you’d like to improve on it, we’re looking forward to your pull requests!
Wrapping it up with monitoring #
There is no shortage of options available for monitoring Elasticsearch. We’ve evaluated our fair share, including Head, Kopf, Paramedic, and Marvel. All provide a similar dashboard overview, charts, and node and index statistics. However, it’s Marvel that won us over with its modern interface, simple installation, scope, and wide array of tunables. Marvel complements monitoring solutions like NewRelic, providing insight into issues brought to your attention by NewRelic and its ilk. I encourage you to try any and all monitoring solutions, as your mileage may vary and every team has different needs.
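For what it’s worth, installing Marvel at the time was a single command on each node, run from the Elasticsearch home directory with the bundled plugin script (followed by a restart of the node):

bin/plugin -i elasticsearch/marvel/latest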
As we’ve continued to grow with Elasticsearch and SmartOS, we’ve seen improved stability with no downtime to date. One of the biggest benefits so far has been storage expansion: just this week we added another 5TB of capacity without rebooting a single server. You’ve gotta love ZFS.
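For the curious, growing a pool looks roughly like the following; the device names are placeholders for the disks actually added, and zones is the default pool name on SmartOS:

# attach another mirrored pair to the existing pool, no reboot required
zpool add zones mirror c1t4d0 c1t5d0
# confirm the new capacity
zpool list zones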