The road to EKS monitoring

prometheus
cloudwatch
kubernetes
eks
aws
Author

Erik Lundevall-Zara

Published

January 29, 2025

Recently, I added a Jira integration to a monitoring solution we had set up for AWS Elastic Kubernetes Service (EKS). This marked a milestone in a journey that had been going on for a while, around the operations of a few EKS clusters that we were helping a customer with. It was a somewhat rocky road to get to this point, and I intend to briefly share some of the experiences here.

The starting point

This customer has a few clusters set up in EKS, including both production and test/staging environments. While the logging setup was pretty ok from an application perspective, monitoring the cluster infrastructure itself had room for improvement. If there are issues with a node, or some pods end up in a bad state, we want to know about it - before the business notices it.

This was our starting point, and we wanted to get appropriate metrics and alarms to deal with such situations. The logging was fine for now, so the focus was on the metrics part.

The places along the road

The place we have ended up, so far, consists of an Amazon Managed Service for Prometheus (AMP) workspace. It receives metrics data from multiple EKS clusters and sends alerts to an SNS topic. The alerts are picked up by an integration with Jira.

The data comes from a Prometheus setup in each cluster, which writes to our AMP workspace.

This was not the first choice on the journey.

The different places we “visited” along the way were:

  • CloudWatch ContainerInsights (EKS addon + Terraform)
  • AWS OpenTelemetry (OTEL) collector ContainerInsights (helm + Terraform)
  • Kube-Prometheus-Stack (helm + Terraform)
  • AMP + Prometheus (helm + Terraform) - we are here now

Basically all AWS infrastructure at the customer is set up with Terraform, and thus Terraform is involved all over the place, including for running Helm charts.

CloudWatch ContainerInsights

The EKS addon for CloudWatch ContainerInsights was really simple to set up. That simplicity was also the primary reason we went with it first.

The AWS EKS documentation was quite confusing about what to set up for monitoring an EKS cluster. There were sections that described setting this up without the add-on and with the add-on, and lots of the pages referred to some other page to read more. If you followed those references, you eventually ended up back at your starting point…

Not caring too much about the documentation details, we tried setting up this EKS add-on, and it worked just fine. Really great!
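For reference, installing the add-on from Terraform is roughly a one-resource affair. A minimal sketch, where the cluster resource name is a placeholder and the IAM permissions the CloudWatch agent needs are left out:

```hcl
# Minimal sketch: install the CloudWatch Observability (ContainerInsights)
# EKS add-on. Assumes the cluster is managed elsewhere in the Terraform code,
# and omits the IAM permissions the agent needs to publish metrics and logs.
resource "aws_eks_addon" "cloudwatch_observability" {
  cluster_name = aws_eks_cluster.this.name # placeholder resource name
  addon_name   = "amazon-cloudwatch-observability"
}
```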

But there is a but…

We and the customer soon noticed a spike in the AWS costs, which we tracked down to ContainerInsights. The cost of monitoring the clusters was significantly higher than the cost of running the clusters themselves. Could we bring down the costs somewhat?

The documentation for configuring the Amazon CloudWatch ContainerInsights EKS addon was pretty much non-existent. Only a JSON schema indicated what kind of configuration fields could be used. With no good documentation or description of how to tweak settings to reduce costs, this was probably not a path we wanted to go down.

This led us to the next step in the journey…

AWS OTEL Collector ContainerInsights

AWS has documented two ways to collect metrics for ContainerInsights. The first one was the one we had already tried, which is based on the CloudWatch agent.

The second one is based on the AWS OpenTelemetry distribution. This could send metrics data to ContainerInsights, and it could also send data to Prometheus - both options were available.

We found blog posts from AWS on approaches to configure the OTEL collector to filter out certain metrics in order to reduce cost. This might be it!

Unfortunately, the AWS documentation around this was perhaps even more confusing, somewhat contradictory, or out of date. Perhaps if you live and breathe OTEL every day, you can make sense of all the different pieces of information. That is not me, though.

Eventually, I did manage to sort out the setup via Helm, and also apply a configuration that reduced the cost significantly. We deployed it in one cluster, got some metrics, and reduced costs.
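The cost reduction essentially came down to an OpenTelemetry filter processor that drops metric names we did not need. A sketch of that idea, expressed as a Terraform local that can be merged into the collector's Helm values; the metric name patterns are examples only, and exactly which Helm value key the snippet plugs into depends on the chart version, so treat the surrounding plumbing as an assumption:

```hcl
# Sketch: an OTEL collector "filter" processor that excludes metrics by name
# before they are exported, to reduce the amount of data (and cost) sent to
# ContainerInsights. The patterns below are illustrative only.
locals {
  otel_filter_processor = yamlencode({
    processors = {
      "filter/reduce-cost" = {
        metrics = {
          exclude = {
            match_type   = "regexp"
            metric_names = ["container_fs_.*", "pod_network_.*"]
          }
        }
      }
    }
  })
}
```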

But we were also missing some key information, such as the state of a node. Metrics have a numerical value; they do not have a state with multiple discrete values.

So we could not see whether a node was, for example, in a ready state, a failed state of some type, or even in no known state - the data was simply not there. We tried multiple ways to convert the state information to metric values, similar to what the CloudWatch agent based ContainerInsights addon had done. But that failed.

We asked AWS Support, but they did not have any way to accomplish that, short of writing custom code to collect the data.

We then asked AWS Support: “what do people actually use then, this is surely not the first time someone has run into this issue?”

The answer was pretty much “ah well,… many use Prometheus”

Which leads us to our next step on the road.

Kube-prometheus-stack

At this point, we were a bit wary of AWS-specific solutions with potentially crappy documentation, so we looked more at general options in the Kubernetes community.

We landed on kube-prometheus-stack, which is a packaged solution that installs multiple components, including Prometheus and Grafana, in a Kubernetes cluster and sets up the monitoring of that cluster. It was pretty straightforward to set up via Helm, and we got it running in a test cluster.
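Driving the chart from Terraform looked roughly like this (a sketch; chart version pinning and the values we tweaked are left out):

```hcl
# Sketch: install kube-prometheus-stack (Prometheus, Alertmanager, Grafana,
# exporters, and default rules/dashboards) into the cluster via Helm.
resource "helm_release" "kube_prometheus_stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true
}
```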

A bit of extra setup was needed to make the user interface endpoints for Prometheus, Alertmanager, and Grafana reachable from outside the cluster. Alert information and dashboards all looked ok.

However, this was not ideal. Preferably, the monitoring of a Kubernetes cluster should not depend on the cluster itself. We also want the ability to monitor multiple clusters, e.g. send alerts from multiple clusters to the same alerting solution and have Grafana dashboards that allow us to view multiple clusters in the same dashboard.

Thus, while setting up the kube-prometheus-stack worked fine, it was not sufficient. This led to the next step on our journey.

Amazon Managed Service for Prometheus (AMP)

A few years ago, AWS introduced Amazon Managed Service for Prometheus (AMP). They also introduced Amazon Managed Service for Grafana, presumably to complement solutions set up with AMP.

AMP is a managed service where you do not care about any servers or running components; you just have workspaces where you store your metric data. The workspaces also handle Prometheus rule execution and include a stripped-down version of the Alertmanager component of Prometheus. AMP as such is also stripped down, since you do not get any of the regular user interface components of Prometheus.

What this means is that you push metric data to the AMP workspace you have set up. Thus you need components that can write to a Prometheus server, and you configure those to point to your AMP workspace. These components collect data; the commonly used term is that they scrape the metric data.

As part of AMP there is also a component called “Managed scrapers”. These are not installed in your clusters, but are configured to access your EKS clusters and scrape metric data from them.

Setting up AMP

Setting up the workspace itself was quite easy, and worked fine via Terraform. Getting the data into the workspace was not as straightforward, mainly because the documentation was not clear, and the description of the managed scrapers left a lot to be desired.
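The workspace itself is a single Terraform resource; a sketch (the alias is just our own label):

```hcl
# Sketch: the AMP workspace that receives metric data from all clusters.
resource "aws_prometheus_workspace" "monitoring" {
  alias = "eks-monitoring"
}
```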

Overall, I think a lot of the documentation was written from the perspective of someone who already has a functional Prometheus setup in place and just wants to switch as-is to AMP. It is not written for someone who wants to set this up from scratch and is not well versed in all the details of Prometheus. This may have been a reasonable assumption for the initial release of AMP. Now, a few years after the launch, I would have expected documentation for a somewhat wider audience.

Anyway, there were AWS blog posts that pointed in the right direction for setting up ingestion, and I managed to get this working with a Prometheus installation in each cluster, configured with a remote write to send data to the AMP workspace.
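Per cluster, the shape of that setup was roughly the following. A sketch only: the IAM/IRSA wiring that allows Prometheus to write to the workspace is omitted, and the actual values contain more than this:

```hcl
# Sketch: a per-cluster Prometheus installation (prometheus-community chart)
# that remote writes to the AMP workspace using SigV4-signed requests.
# Assumes the Prometheus server's service account has an IAM role permitting
# aps:RemoteWrite (omitted here).
resource "helm_release" "prometheus" {
  name             = "prometheus"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "prometheus"
  namespace        = "monitoring"
  create_namespace = true

  values = [yamlencode({
    server = {
      remoteWrite = [{
        url   = "${aws_prometheus_workspace.monitoring.prometheus_endpoint}api/v1/remote_write"
        sigv4 = { region = "eu-north-1" }
      }]
    }
  })]
}
```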

Did the setup work? There is no interface in AMP to view the data, and it only writes logs if there are errors - although it is not clear from the docs what kind of error logs to expect. There were no logs at all.

CloudWatch turned out to have some general usage metrics for AMP, and there I could see an indication of data going into the workspace. So it seemed to work, but I could not see what the actual data was.

I managed to create a small script to send queries directly to the AMP workspace endpoint, and from there I could see some of the metrics data. I noticed there were no labels on the metric data identifying which EKS cluster it came from, so I added an extra entry to the remote_write config for each Prometheus installation.
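The extra entry was essentially a write relabel rule that stamps every sample with a static cluster label before it is sent. A sketch of the idea (the label name and value are our own choices), which would replace the remoteWrite list in the earlier Prometheus sketch:

```hcl
# Sketch: remote write entry that adds a static "cluster" label to every
# sample, so data from different clusters can be told apart in AMP.
locals {
  remote_write = [{
    url   = "${aws_prometheus_workspace.monitoring.prometheus_endpoint}api/v1/remote_write"
    sigv4 = { region = "eu-north-1" }
    write_relabel_configs = [{
      target_label = "cluster"
      replacement  = "prod-cluster-1" # unique value per Prometheus installation
    }]
  }]
}
```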

You might say: why not hook up Grafana and look at the metrics? That is exactly what we did. We could not use AWS managed Grafana though, because all our infrastructure is in the eu-north-1 region. AMP is available there, but the managed Grafana is not. So instead we used a Grafana installation in one of the clusters to interface with AMP.

Rules, alerts and Alertmanager

Setting up different alert rules and recording rules (rules that calculate new metrics) was not too difficult. The documentation was a little bit fuzzy on exactly what to expect. The same was true for the Alertmanager configuration.
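Rules live in the workspace as rule group namespaces, which Terraform can manage directly. A sketch with one hypothetical alert rule on node readiness (it assumes kube-state-metrics is being scraped, which the Prometheus chart typically does by default); our actual rule set is larger:

```hcl
# Sketch: an AMP rule group namespace containing a single example alert rule.
resource "aws_prometheus_rule_group_namespace" "node_rules" {
  name         = "node-rules"
  workspace_id = aws_prometheus_workspace.monitoring.id
  data         = <<-EOT
    groups:
      - name: node-health
        rules:
          - alert: NodeNotReady
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "Node {{ $labels.node }} in cluster {{ $labels.cluster }} is not ready"
  EOT
}
```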

The original Alertmanager component can send alerts to many different destinations. In AMP, that is stripped down to supporting SNS only, nothing else.

This meant that we had to set up Alertmanager to send alerts to an SNS topic, formatted in a processable fashion (we set up a template that generates JSON). We then set up subscribers on that SNS topic and re-implemented the integrations with our alert targets.
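In Terraform, the Alertmanager configuration is a single definition attached to the workspace, following the template_files plus alertmanager_config structure from the AMP documentation. A simplified sketch with a hypothetical JSON message template; the real routing tree and template carry more detail:

```hcl
# Sketch: route all alerts to one SNS receiver and render the message as JSON
# via a template, so downstream subscribers can parse it.
resource "aws_sns_topic" "alerts" {
  name = "eks-monitoring-alerts"
}

resource "aws_prometheus_alert_manager_definition" "alerts" {
  workspace_id = aws_prometheus_workspace.monitoring.id
  definition   = <<-EOT
    template_files:
      default_template: |
        {{ define "sns.json.message" }}{"status":"{{ .Status }}","alertname":"{{ .CommonLabels.alertname }}","cluster":"{{ .CommonLabels.cluster }}","summary":"{{ .CommonAnnotations.summary }}"}{{ end }}
    alertmanager_config: |
      templates:
        - 'default_template'
      route:
        receiver: sns
      receivers:
        - name: sns
          sns_configs:
            - topic_arn: ${aws_sns_topic.alerts.arn}
              sigv4:
                region: eu-north-1
              message: '{{ template "sns.json.message" . }}'
  EOT
}
```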

We tested it out first with an email subscriber on the SNS topic, then proceeded to set up a Jira integration. Next up will be to re-implement the integration with PagerDuty.

Multi-cluster rules and dashboards, oh my…

What alert rules and dashboards should we set up with our AMP-based solution?

The kube-prometheus-stack had a lot of these already in place, so our first stab was to move them into AMP and into the Grafana running in one of the clusters.

This turned out to be a lot harder than we had expected. The rules and dashboards are geared towards monitoring a single cluster, not data from multiple clusters in the same metrics storage. There were also a few glitches due to EKS doing things slightly differently under the hood compared to a generic Kubernetes cluster. Another complicating factor was that we wanted to change the scrape interval from 1 minute to 2 minutes to reduce the cost of AMP.

All these factors contributed to a fair amount of struggle to make the rules and dashboards work.

After some time, with alert rules generally working but dashboards not so much, we decided that kube-prometheus-stack might not be the best starting point for what we wanted to do. Looking for other options, we found the AWS EKS Observability Accelerator.

This is an unofficial solution package for monitoring EKS and other things, using, among other things, AMP. It is a bit overengineered for our use case though, and setting up the whole thing would have been overkill.

But extracting the Prometheus rules and the Grafana dashboards provided a much better starting point for us. We still had to tweak the rules and the dashboards a bit, but it was a much smoother experience in comparison.

Wrapping up

This was a learning journey, and while there were quite a few bumps on the road, I ended up being wiser. I was new to Prometheus, and I have learned a fair amount about it during this journey.

Unfortunately, I think AWS has a few gaps in their documentation that make these things harder than they should be. Better guidance and better documentation would have made the journey smoother and shorter.

We are not done with our setup, but it is at a point where we are productive and it is useful for what we need.

Next up, we are working on the PagerDuty integration, and we might take a look at the AMP managed scrapers at a later point in time. Right now though, we are more concerned with making sure that what we have works well than with investigating a not-so-well-documented feature.
