In a world of highly abstracted, typically virtualised, often ephemeral and always dynamic cloud computing resources, continuous observability is essential. However, the cloud was not created with observability of internal systems in mind; it was initially sold as a route to IT agility through resource flexibility and cost manageability.
Now that cloud is here and adoption is growing, we need to stand back and assess our observability capability. In addition, as cloud-native implementations now span public, private, hybrid and multi-cloud (multiple vendor) instances, we can start to think about poly cloud, where different parts of an application's workloads and data services are separated out across various Cloud Service Providers (CSPs).
With roots in control theory, observability in the modern cloud era manifests itself in many forms, so what key drivers are shaping the way we stick our head in the clouds to get a better view?
APM is everywhere
Many ask what the difference is between cloud observability and APM (Application Performance Monitoring). We used to ‘just’ have virtual machines, which meant that blocks or instances of compute could be exposed to observability with comparative ease.
We now live in the world of nested virtualisation, Software-Defined Infrastructure (SDI) and cloud services. Our application workloads are often surrounded by layers of software (also “applications”): operating systems, proxies, orchestration software, container engines, virtual machines, external services and more.
As APM has become almost synonymous with observability, we now see it extend to every tier and structure throughout the IT stack. We need APM for applications, obviously, but we also need infrastructure APM (iAPM, if you will) and it needs to be capable of being directed at any of the stars in the virtualised galaxy we now exist in.
We may be at the point where there is no longer any need to differentiate between APM and non-application monitoring. The industry already has tools that allow software of all kinds running in the cloud to be monitored and observed in much the same way.
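That convergence can be made concrete with a single monitoring primitive applied at both tiers. The sketch below is plain Python with invented names (`observe`, `LATENCIES`, the span labels) rather than any real APM agent's API; it wraps an application function and an infrastructure-level call with the same latency-recording decorator:

```python
import time
from collections import defaultdict

# Hypothetical in-process metric store; a real APM agent would export these.
LATENCIES = defaultdict(list)

def observe(name):
    """Record wall-clock latency for any callable, whether it wraps
    application logic or an infrastructure operation."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                LATENCIES[name].append(time.perf_counter() - start)
        return inner
    return wrap

@observe("checkout")    # application-level span
def checkout(order):
    return sum(order.values())

@observe("disk_read")   # infrastructure-level span, same primitive
def disk_read(path):
    time.sleep(0.01)    # stand-in for a slow I/O operation
    return b""
```

The same decorator serves both layers, which is the point: once everything is “an application”, one observability mechanism can cover the stack.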
A federated centralised orchestrated view
In a world of multiple cloud instances from many different CSPs, we need an orchestrated, federated level of observability: a centralised view with the ability to filter and aggregate across multiple clouds and multiple clusters, if we want to stay in control.
Federating observability data to a centralised place is a common technique and process these days. It is one of the most effective ways to look for cloud overloads, bad provisioning and ‘zombie’ cloud wastage, where instances are left running idle. When we bring all of these signals together, we can drive more efficient cloud resources to service our Content Delivery Networks (for example) and work at a smarter level all round.
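As a sketch of that federation step, the snippet below merges per-cloud instance snapshots into one tagged view and flags idle ‘zombie’ candidates. The provider names, metric shapes and the 5% CPU threshold are all illustrative assumptions, not any vendor's API:

```python
def federate(*snapshots):
    """Merge per-cloud metric snapshots into one list, tagging each
    instance with the provider it came from."""
    merged = []
    for provider, instances in snapshots:
        for inst in instances:
            merged.append({**inst, "provider": provider})
    return merged

def zombies(instances, cpu_threshold=0.05):
    """Instances whose average CPU sits below the threshold are
    candidates for reclamation."""
    return [i["id"] for i in instances if i["avg_cpu"] < cpu_threshold]

# Hypothetical snapshots already pulled from two CSPs' metric APIs.
aws = ("aws", [{"id": "i-1", "avg_cpu": 0.62}, {"id": "i-2", "avg_cpu": 0.01}])
gcp = ("gcp", [{"id": "vm-9", "avg_cpu": 0.03}])

fleet = federate(aws, gcp)   # one centralised view across both clouds
idle = zombies(fleet)        # candidates for shutdown
```

The centralised list is what makes the filter possible; neither provider's console alone would show both idle instances side by side.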
Connected correlation inside the firehose
The amount of data we are consuming and producing right now enables us to get many more signals to track our observability requirements. If we think about the fact that the Internet of Things (IoT) is exponentially increasing our data points, we are drinking from a firehose in terms of data flow… and that can make observability far more difficult.
To address this challenge, we need to think about connected correlation. When we seek to analyse system metrics, logs and traces, we need to be able to jump between those procedures and tasks quickly to work dynamically at different parts of the IT stack coalface. Because there is so much out there to observe, connected correlation helps provide vital links between the data sources that are actually mission-critical to the IT function’s operation.
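A minimal sketch of connected correlation, assuming each signal carries a shared trace ID that lets an operator pivot from an error log straight to its slow span (the signal shapes here are simplified illustrations, not a real telemetry schema):

```python
# Simplified signal stores; in practice these live in separate backends.
logs = [
    {"trace_id": "t1", "level": "ERROR", "msg": "timeout calling db"},
    {"trace_id": "t2", "level": "INFO",  "msg": "ok"},
]
traces = [
    {"trace_id": "t1", "span": "db.query", "duration_ms": 5100},
    {"trace_id": "t2", "span": "db.query", "duration_ms": 12},
]

def correlate(trace_id):
    """Gather every signal that shares a trace ID into a single view."""
    return {
        "logs":   [l for l in logs if l["trace_id"] == trace_id],
        "traces": [t for t in traces if t["trace_id"] == trace_id],
    }

view = correlate("t1")   # jump from the error log to the 5.1s span
```

The shared ID is the “vital link”: without it, connecting a log line in one firehose to a span in another is guesswork.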
Our observability goals see us continually looking for optimisations that will increase performance efficiency. This means we will need to look for, track and analyse different observability signals. One of the best ways to do this is profiling. This technique tells us which part of the application is using how much compute resource (CPU time, memory, disk or network I/O), rather than leaving us to guess from the total resource usage of the process.
Continuous profiling enables us to look at applications and see past performance characteristics during interesting cases; it is especially useful when an application is about to run out of memory and perhaps crash the whole node. If we can capture application profiles every 60 seconds (or even more frequently), then we can see where a function in the application source code might need optimisation or augmentation. We can do this retrospectively even for compiled (as opposed to interpreted) applications, because the binary embeds debug symbols that let us map back to a specific function call.
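As an in-process illustration of the idea, the sketch below uses Python's standard `cProfile` and `pstats` modules to attribute CPU time to individual functions in one snapshot; a continuous profiler would repeat this on a fixed interval and ship each profile to a backend (the function names are invented for the example):

```python
import cProfile
import io
import pstats

def hot():                       # deliberately busy function
    return sum(i * i for i in range(200_000))

def cold():                      # near-zero cost, for contrast
    return 1 + 1

# One profiling snapshot over a window of work.
profiler = cProfile.Profile()
profiler.enable()
hot()
cold()
profiler.disable()

# Render per-function timings, sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()          # names resolved via the symbol table
```

The report answers “which function is costing us?” directly, which is exactly the question total process CPU usage cannot.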
A hive of activity with eBPF
Lastly then to eBPF, or extended Berkeley Packet Filter to use its full name. This is a mechanism that allows us to execute additional code in the Linux kernel. When we can look at specific functions inside the kernel using this ‘special spy agency’ technique, then we can gain new controls over observability. As an additional benefit, we can also note that eBPF does not require app-level instrumentation to start capturing metrics.
Even though it was initially designed for security, eBPF can now be used more proactively to expose application metrics. We used to think of a service mesh as the way to put proxies around an application, but for observability purposes the mesh can be replaced with eBPF, which has much lower overhead and more capabilities.
A ‘canary deployment’ might still require a service mesh, though: there remain non-observability use cases for service meshes, such as canary deployments (where traffic is tightly controlled) and authorisation (via mutual TLS). eBPF makes no attempt to adjust traffic at that level; its current use cases are security and observability only.
If we can consider some (ideally all) of these factors and functionalities in our quest to achieve observability in modern IT stacks, then we just might be able to pop our head above the clouds and see what’s coming next.