CFP: Making Sense of Metrics: Crafting and Leveraging Prometheus Metrics for Infrastructure Intelligence

February 29, 2024 9:28 am Published by Manuel Dewald

Audience

This talk is targeted at System Administrators and Site Reliability Engineers interested in learning about how to best make sense of the Prometheus metrics their system exposes. If you know PromQL, but the queries behind your dashboards are still a mystery to you, you are not alone. This talk will show how to get information out of your metrics to maximize the insights and make data-based decisions.

Outline

Creating new metrics and collecting them with Prometheus is easier today than it was ever before. Site Reliability Engineers and System Administrators have all the data at hand they need to make the right, data-based decisions. But how?
Making sense of all that information is still a challenge. Crafting the right PromQL query to answer your question and manifesting it in a Grafana dashboard is a complex and time-consuming task. Not speaking of understanding that query when you need to change it a few weeks later.

In this session, you will see different approaches to make sense of the prometheus metrics exposed by a software deployment: Starting from the default Prometheus UI, via PromLens, an improved, open source query-building UI, all the way to an experiment on transforming Prometheus metrics into a data warehouse for improved data exploration and visualization. Data analysts have used Business intelligence software for decades. What can we learn from these systems to discover knowledge in the ocean of metrics to make better decisions for our infrastructure?

Key Takeaways

During this talk, attendees should have learned how to (1) best explore and query the available metrics in their environment, (2) which tools are available today and (3) how infrastructure intelligence can leverage data warehouse concepts for improved knowledge discovery and decision making.

CFP: Things I love about SRE that I loathe about DevOps

January 28, 2020 8:32 pm Published by Manuel Dewald

DevOps & SRE – what is it?

Let’s define what those terms mean.

DevOps means, the same team building the software is responsible for running it. This can be easiest imagined for software that is operated by the vendor themselves, i.e. cloud services. The idea is, no one knows the software better than the developers, so no one could better operate and fix in-flight issues than them. On the other hand, developers have high interest in building software that is easiest to operate, if they operate it themselves. Issues found during operations can be addressed immediatly.

SRE is a concept where one team is responsible for running one or more services. Again, imagine a company building cloud services. They may all have similar requirements in operations, so why not create a dedicated team to run them all. This team contains software developers to automate common tasks that emerge when running the services to minimize manual efforts. SRE teams monitor the software to spot issues as soon as they appear and fix them in the best case before a customer notices them.

Which DevOps problems may SRE solve?

In DevOps, having a team that runs software and a separate team that builds it, is a well-known anti-pattern.

“If you have a DevOps team, you’re doing it wrong”

"If you have a DevOps team, you're doing it wrong" @kmugrage #devopsdaysbuf #devopsdays #bflo pic.twitter.com/obwNaF1Qig
— Jennifer Petoff (@jennski) September 26, 2019

This quote can be found in many tweets, blog posts, and conference talks. But when it comes to SRE, this is a common pattern. One team is building software, another team is running it, and building software to improve running it.

In DevOps, you usually find a specific support role being rotated through the team. That means everybody is in that role for example for a week out of 10, given a team of 10 members.

Working in this support role usually has very different requirements than the usual development work.

The context can get completely lost during the transition into the support role (which often comes as surprise on Monday mornings). And once context is built up and the developer feels comfortable in that role, his shift ends and the support context is again lost.

This results in people hating to get into the support role, which also makes it less attractive to build things that improve the supportability of the software.

In SRE teams, people are on call as well every few weeks, often even 50% of their time. Why don’t they dislike it? The difference is the focus of the team. When not on call, developers in the SRE team work on improving the support. They are less involved in actual product code, but rather build software to ease the operations. For example, to automate updates and other maintenance tasks.

DevOps: supportability vs. new features
SRE: supportability = new features
DevOps: Dev team learning ops all together
SRE: Ops learning from devs and devs learning from Ops

Agile development in SRE teams

As SRE teams contain a fair portion of software development work, and get filled up by software developers, it is a natural move to also adapt agile software development practices. It depends heavily on the percentage of development work vs operations, which may be influenced by the team size, to find the right model to track and manage the project work. For example, in a small team where a high percentage of people is on call during the day, it might not make too much sense to plan sprints of 2 weeks if only a few backlog items are expected to get done in that timeframe.

Key Takeaways of this talk

By the end of this talk about the differences between SRE and DevOps working styles, attendees should have awareness (1) what the most significant differences between DevOps and SRE are, (2) A successful team running DevOps or SRE needs experienced Ops as well as Dev people, (3) If team members greatly dislike getting into the operating role, the team should work hard on improving the support experience.

Tag Archive: SRE