Hello all,

I hope your first week of 2022 has been good! This week’s topic was inspired by a comment that an old friend and former colleague made about selecting tools for managing and monitoring AWS infrastructure.

How working backward is the best way forward

Have you ever chosen a specific solution to a set of problems because it seemed to be the right thing to use, only to discover later that it did not work well at all, was too expensive, or was never used by anyone? I have been there multiple times. It is all too easy to let the engineer brain jump to solutions before actually understanding the problem: we assume we know what the problem is without doing the research.

An example: At a former employer, we used SumoLogic for a lot of log analytics and monitoring of AWS workloads. It worked very well, and it was quite useful for the operations group. Then some people from that group joined a startup and had to select the tooling for operating its AWS workloads. There, it did not work out at all. The same tool was barely used, and its value was questionable.

A key reason was that the tool was picked before anyone properly understood what was needed, and how that differed from the way it was used at the old company:

- Different and smaller organization, different responsibilities and roles
- Workloads were not the same, use cases were different, although similar at a glance
- Priorities were different

In short, the differences meant that applying the same solution pattern did not provide enough value to the organization. Not that SumoLogic is a poor tool - on the contrary, I think it is a very capable tool and can provide significant value.

Amazon and Amazon Web Services, which pride themselves on being customer-obsessed, have an approach to product development which is about working backward from the customer - they start from a press release:

Working backward at Amazon

You certainly do not need to write a press release like Amazon does to let people know which monitoring solution you have picked, but the general principle still applies.

A bit of research, working backward from what was actually needed, would have led to a better choice of monitoring solution:

Identify stakeholders
- Which roles need information about the state of the solution(s), and what are their responsibilities?
- What information do they require, in what form, and when?
Identify key performance indicators
- What are the business values that we want to ensure are upheld for our customers?
- How is that measured?
- What is the definition of a healthy state for those values?
- At what point is that healthy state at risk?
- What actions should be performed when the healthy state is at risk?
Map KPIs and solution resources
- Identify resources/components that can provide data
- Identify types of insights that can be gained (behaviour, faults, performance)
- Identify the type of information to collect (logs, metrics)
- Identify source to collect from (e.g. log files, system performance data)
- Figure out thresholds and/or patterns in the resource data to use for alarms
Reports, alerts, actions
- What to report to whom
- How to deliver report data
- Which formats to use
- Determine severity
- Actions (automated, manual)

The above is an iterative process, both because it takes several passes to get into enough detail to answer the questions and because requirements change over time.
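To make the checklist a bit more concrete, here is a minimal Python sketch of how one KPI could be written down as data before any tool is picked. Everything in it is an assumption made up for illustration: the `KpiRule` fields, the `checkout_p95_latency_ms` metric name, the threshold values, the stakeholder and the action are hypothetical and not taken from any real setup or monitoring product.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class KpiRule:
    """One KPI mapped to a data source, thresholds, a stakeholder and an action."""
    kpi: str               # business value we want to uphold
    metric: str            # hypothetical metric that provides the data
    warn_above: float      # healthy state is at risk above this value
    critical_above: float  # healthy state is lost above this value
    stakeholder: str       # role that needs to be informed
    action: str            # what to do when the healthy state is at risk


def evaluate(rule: KpiRule, value: float) -> Severity:
    """Classify a measured value against the thresholds of the rule."""
    if value > rule.critical_above:
        return Severity.CRITICAL
    if value > rule.warn_above:
        return Severity.WARNING
    return Severity.OK


def report(rule: KpiRule, value: float) -> str:
    """Turn a measurement into a short message for the stakeholder."""
    severity = evaluate(rule, value)
    if severity is Severity.OK:
        return f"{rule.kpi}: healthy"
    return (f"{rule.kpi}: {severity.name}, notify {rule.stakeholder}, "
            f"action: {rule.action}")


# Hypothetical example rule; all values are invented for illustration.
checkout_rule = KpiRule(
    kpi="Customers can complete checkout quickly",
    metric="checkout_p95_latency_ms",
    warn_above=800.0,
    critical_above=2000.0,
    stakeholder="on-call engineer",
    action="scale out the checkout service",
)

print(report(checkout_rule, 500.0))
print(report(checkout_rule, 1200.0))
```

The point of writing it down like this is not the code itself. If you cannot fill in the fields of a rule, you probably do not yet know enough to choose a monitoring tool.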

The general pattern here applies to many areas, not just monitoring solutions. This also applies to automation. When should you automate a process or activity?

The answer is “it depends”, or rather it is not the right question to ask first. If you ask a question like that, you may get answers such as “When you have repeated it 3 times” or “Always”. But there is no context in such answers.

A better question to ask is: why should this process or activity be automated? On the surface, answers that may come up include:

- To save time on repeated tasks
- To be (more) consistent
- To avoid human error
- To distill expert knowledge into something that others may use
- To have some documentation of the steps of the process or activity

But these are still vague. They may all be valid to some extent, but they should still trace back to a defined business value or aim.

Start with your customer and work backward. It requires discipline though, and practice.

Even Amazon Web Services, which presumably practices this every single day as part of its corporate culture, can still deliver customer experiences that are quite crappy. I think there are other aspects at play as well, though, and it might not always be clear who the customer is from AWS's point of view.

What do you think about working backward from the customer?

Until next time,

/Erik