ELK in Production with Security

Update: This post was written when Kibana 3 was the new kid on the block. Kibana/the Elastic Stack has changed a lot since (even the name!), so the content of this post should be considered somewhat aged!

ElasticSearch, Logstash & Kibana (ELK): a lot has been said about this stack. The traction it's gaining is testament to the widespread need for a tool of this nature. I'm not going to spend a long time introducing the stack, as there are plenty of great resources out there already. I would thoroughly recommend that you read the following if you are new to this stack, and also as a primer for this article: how-setup-realtime-alalytics

What I am going to do is talk about some of the information which I found harder to track down and which I couldn't seem to find gathered in a single place for easy consumption. I'm also going to talk about our current ELK setup and some of the considerations involved in deploying an ELK stack in production.

Production Architecture


The diagram above gives an overview of our current setup. The setup comprises two main components: the deployed production instance and the centralised logging server.

Within the deployed production instance, there are three components: the API (1) (in our case; it could equally be a web site) logs messages to a local network RabbitMQ cluster (2), which acts as our outgoing queue, and a forwarding daemon that we've named RabbitFWD (3) listens to the eventLog queue and forwards the messages to our centralised logging server. RabbitFWD is a simple node.js script written in house which simply listens to one queue and forwards to another (I will be making this available on GitHub shortly). RabbitFWD plays the part of the shipper. We decided to go with this option over Logstash due to Logstash's high resource requirements. Another project, logstash-forwarder, would also have suited our purposes but was considered less simple than our custom solution.
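
Until RabbitFWD is up on GitHub, here is a minimal sketch of the idea using the amqplib package. The hosts, credentials and queue name are placeholders rather than our real configuration, and this is an illustration only, not the actual RabbitFWD source:

    // forwarder-sketch.ts - illustrative queue-to-queue forwarder (not the actual RabbitFWD source).
    // Assumes the amqplib package; hosts, credentials and queue names are placeholders.
    import * as amqp from "amqplib";

    const QUEUE = "eventlog";

    async function main(): Promise<void> {
      // Connect to the local production broker and to the central logging broker.
      const source = await amqp.connect("amqp://user:pass@localhost");
      const target = await amqp.connect("amqp://user:pass@logging.example.com");

      const srcCh = await source.createChannel();
      const dstCh = await target.createChannel();

      // Both ends use a durable queue, matching the Logstash rabbitmq input shown later.
      await srcCh.assertQueue(QUEUE, { durable: true });
      await dstCh.assertQueue(QUEUE, { durable: true });

      await srcCh.consume(QUEUE, (msg) => {
        if (msg === null) return;
        // Forward the raw message body, and only ack once it has been handed to the central broker.
        dstCh.sendToQueue(QUEUE, msg.content, { persistent: true });
        srcCh.ack(msg);
      });
    }

    main().catch((err) => {
      console.error("forwarder failed:", err);
      process.exit(1);
    });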

The centralised logging server is a pretty standard ELK setup with the addition of a RabbitMQ server, which acts as our broker and receives logs from our various production instances. For a great guide to installing ELK on Windows, see: Installing Logstash on Windows. Logstash then listens to this incoming queue and outputs to ElasticSearch. This central instance of Logstash is known as the indexer. The configuration for Logstash (logstash.conf) is very straightforward and an example file might look something like this:

input {
    rabbitmq {
        host => "localhost"
        queue => "eventlog"
        auto_delete => false
        passive => true
        durable => true
        threads => 3
        prefetch_count => 50
    }
}

output {
    elasticsearch {
        host => "localhost"
        protocol => "http"
    }
}

Why RabbitMQ? What about Redis?

The simple answer: we already utilised RabbitMQ in our production environment, the entire team was comfortable with it, and it was tried, tested and well respected.

The official documentation for ELK recommends using Redis as the broker due to the complexity of RabbitMQ. I believe this complexity is very real for teams that are coming to RabbitMQ new and who only require an installation for Logstash. But for teams that already use RabbitMQ for other purposes and are familiar with it, this complexity is reduced or eliminated completely: it is not a day-to-day overhead but more of an initial learning curve.

In addition to using RabbitMQ as our broker, we also use it in the production environment to provide an asynchronous mechanism for collecting our logs. Employing a queue to which our applications can log means the overhead of recording a log is significantly reduced, freeing up our applications to do the job at hand: serving users! The decoupling provided by a queue also means that if our logging goes down, it does not bring our production instances down with it. In short, all the stock reasons for choosing a message queue over synchronous communication. I will be doing a follow-up post to elaborate on the log collection mechanism from our various production applications.
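
To make this concrete, here is a rough sketch of what the publishing side might look like from a node.js application, again using amqplib. The queue name and message fields are illustrative assumptions, not our actual logging schema:

    // log-publish-sketch.ts - illustrative fire-and-forget log publishing (field names are assumptions).
    import * as amqp from "amqplib";

    let channel: amqp.Channel | undefined;

    // Open a single connection and channel up front, at application start-up.
    export async function initLogging(): Promise<void> {
      const conn = await amqp.connect("amqp://user:pass@localhost");
      channel = await conn.createChannel();
      await channel.assertQueue("eventlog", { durable: true });
    }

    // Publishing is fire-and-forget: the application never blocks on the logging pipeline.
    export function logEvent(level: string, message: string): void {
      const body = Buffer.from(JSON.stringify({
        level,
        message,
        timestamp: new Date().toISOString(),
      }));
      channel?.sendToQueue("eventlog", body, { persistent: true });
    }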

Does this mean I don't recommend Redis? Certainly not, but for our purposes, having a single, well understood and already employed technology for both collection and brokerage made a lot of sense.

Windows or Linux?

ELK (and everything else I've mentioned) works perfectly well on both Linux and Windows. Linux has a slight advantage for the centralised server as it can be spun up more quickly using Docker. Once set up, there is no real advantage to either OS. We currently have a Windows logging server due to the majority of our infrastructure teams being more comfortable with Windows than Linux, but really, either would work just fine.
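
For example, assuming the official images are available on your Docker host, a throwaway single-node ElasticSearch can be brought up with something along these lines (image names and tags may vary with versions):

    docker run -d --name elasticsearch -p 9200:9200 elasticsearch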

Retention

We produce a decent amount of log data: not extreme, but not insignificant. Currently running at about 5GB a day and hosted in Azure, this means we need to keep an eye on our storage requirements in order to minimise costs. Enter curator (https://github.com/elasticsearch/curator). With curator, it's easy to set up a retention policy which limits the number of older indexes (think tables, one for each day, each with the date in the name) that ElasticSearch retains. We currently keep the last 30 days' worth of logs. The following steps will get you set up with the same retention policy.

Ubuntu:

  1. sudo apt-get install python-pip
  2. pip install elasticsearch-curator
  3. sudo crontab -e
  4. Secure setup:

    /usr/local/bin/curator --host "yourhost.com" --url_prefix elasticsearch --port 443 --ssl --auth "username:password" delete --older-than 30

    Non secure setup:

    /usr/local/bin/curator --host "yourhost.com" delete --older-than 30

Windows:

  1. Install Python 3.4 (https://www.python.org/)
  2. Open a cmd prompt
  3. pip install elasticsearch-curator
  4. Create a batch file containing the curator command from step 4 above and save it.
  5. Use a scheduled task to run the batch file.

The above simply installs curator and creates a crontab/scheduled task job for 1am every morning (on Ubuntu, prefix the curator command with the cron schedule 0 1 * * * in your crontab entry; on Windows, set the scheduled task to run daily at 01:00). The url_prefix, ssl and auth settings will make more sense once you've read the security section below. That's it: all logs older than 30 days will now be deleted at 1am each morning.

Security

We currently host our central logging server in the cloud, which means that securing this environment is of great importance to us.

The built-in security features for both ElasticSearch and Kibana are virtually non-existent in a public facing scenario. There is a commercial product (Shield) which aims to address this gap, but without it we're required to resort to mechanisms external to these tools in order to secure our environment. We will be relying on the authentication mechanisms of our web server, in conjunction with SSL, to get a secured environment which requires authentication. This example uses IIS, but the same principles apply for nginx or Apache.

  1. Enable Basic Authentication for your Kibana site. See: IIS Basic Authentication

  2. Enable SSL for your Kibana site - add an https binding using your SSL certificate and remove the http binding.
    See: http://www.techrepublic.com/blog/how-do-i/how-do-i-request-and-install-ssl-certificates-in-iis-70/

  3. Add a reverse proxy to ElasticSearch - this will allow Kibana (which communicates with ES from the browser) to access ElasticSearch without exposing ElasticSearch to the whole wide world. You will need to install the URL Rewrite module for IIS and Application Request Routing (ARR): download the Web Platform Installer (http://www.microsoft.com/web/downloads/platform.aspx) and install URL Rewrite 2.0 and Application Request Routing. This step will give ElasticSearch a new endpoint (yourhost.com/database) which requires authentication. Open up the web.config of your Kibana site and add the following inside the <system.webServer> element:

    <httpProtocol>
        <customHeaders>
            <add name="Access-Control-Allow-Origin" value="*" />
            <add name="Access-Control-Allow-Methods" value="GET,PUT,POST,DELETE,OPTIONS" />
        </customHeaders>
    </httpProtocol>
    <rewrite>
        <rules>
            <rule name="ReverseProxyInboundRule1" stopProcessing="true">
                <match url="database/(.*)" />
                <action type="Rewrite" url="http://localhost:9200/{R:1}" />
            </rule>
        </rules>
    </rewrite>
  4. Point Kibana to the new ElasticSearch endpoint - update your Kibana config.js, changing the elasticsearch setting to the following:
    elasticsearch: {server: "https://<yourhost>/database", withCredentials: true},
  5. Create a secure user for RabbitMQ - by default, RabbitMQ versions below 3.3 come with a guest user with a well-known password that can be used to connect to your RabbitMQ server remotely. As we'll be exposing RabbitMQ on the internet, we want to remove this user and create a new one which our forwarder can use (the web console for RabbitMQ can do this, or see the example commands after this list). Remember to update your shipper/forwarder to use your newly created user.

  6. Enable RabbitMQ SSL - this is quite a complex task; a good primer can be found at: RabbitMQ SSL. I will be doing a follow-up post that covers this topic in more detail.

  7. Ensure that ports 9200 and 80 are not exposed as Azure endpoints, and that Windows Firewall is set to block these ports.
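
As a footnote to step 5: if you'd rather replace the guest user from the command line than the web console, something along these lines should work with rabbitmqctl (the username, password and vhost here are placeholders):

    rabbitmqctl delete_user guest
    rabbitmqctl add_user forwarder "a-strong-password"
    rabbitmqctl set_permissions -p / forwarder ".*" ".*" ".*"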

That wraps it up for now. Over the next few days I'll be doing a number of follow-up posts which cover some of the topics touched on in this post.

Sam Shiles
