Enhancing your Rancher monitoring experience with Grafana Loki
Monitoring and gathering metrics on your systems is an essential part of the lifecycle of any machine, application, or cluster. It provides us with the insight needed to be more proactive and less reactive to what our systems are doing. Gathering and viewing numerical data, like CPU load or memory usage, is something that is done with ease and looks great in a graph, but what about the massive amounts of log data out there? Historically, this was not something that you could easily put into a graph and it’s not always easily correlated with other events.
One of the reasons I’ve always been such a fan of Linux is the copious number of logs it generates. Over the last decade I’ve held roles from Linux Sys Admin to Applications Developer to Kubernetes Cluster Admin to everything in between. Whenever things went wrong when I was an admin, I always turned to logs. When writing an application, I ensured that I offered as much data to the administrator who would inevitably have to troubleshoot if the application failed for some reason. Unfortunately, there was not a “single pane of glass” for both logs and metrics.
Without a way to visualize and correlate log data with the metrics your machine, application, or cluster provides, troubleshooting slows down and staff-hours are wasted. Administrators need to be able to fix something when it is broken and do all they can to ensure that the failure doesn’t happen again. An alert from a monitoring stack lets them know when something goes wrong, but without enough information, they may not have what they need to stop it from happening again.
Preferably, you would be able to see your logs in the same “dashboard” where your metric data is available, like Grafana. There are a few SaaS solutions, like Datadog, that address some of these issues, but they require you to ship your logs and metrics over the internet to their servers to be processed. Unfortunately, for many people in the government space this is not possible. Keeping log data safe and on premises is usually a requirement, and SaaS solutions just do not provide that. Solutions like Elasticsearch do something similar by parsing the logs and indexing them, but this eventually requires a massive amount of processing power and storage, as the index can become larger than the original logs. Other paid, non-open-source solutions like Splunk can grow steeply in cost as the number of data sources grows.
An ideal on-prem solution would allow you to keep copious amounts of logs and allow you to search for events and process that data into an easily digestible format. You would be able to create metrics out of this non-metric data without overcomplicating things by creating machine learning algorithms. This solution would allow you to troubleshoot and debug different applications, alert you when certain events happen, be a part of a larger monitoring solution, and even assist in creating actionable business intelligence.
This is where my recent discovery of Grafana’s Loki comes to the rescue. Loki is an open-source project from Grafana that lives up to its tagline of being “like Prometheus, but for logs.” Loki was inspired by Prometheus, is very cost efficient, and is easy to operate. Unlike other solutions, Loki doesn’t index the contents of your logs; instead, each log stream is labeled. Since Loki only indexes metadata, it becomes much simpler to operate and cheaper to run. Loki can use the same labels you are using with Prometheus, which allows you to easily switch between metrics and logs. Loki is also a great fit for all your Rancher managed Kubernetes clusters. Metadata, such as pod labels, is automatically scraped and indexed when Loki is deployed.
Install and Simple Demo
I have a cluster running RKE Government (RKE2) with Rancher (v2.5.7) installed via Helm. I installed the Monitoring stack of Prometheus and Grafana from the Rancher UI. Rancher running on RKE2 makes installing this stack extremely easy: click Cluster Explorer for the cluster where you want to install monitoring, click Apps & Marketplace, and then click the Monitoring chart. You can click through and choose the options you would like to configure for Prometheus and Grafana. After configuring, scroll down and click Install. This simple and powerful open-source monitoring stack is then fully installed for you.
Install Loki via Helm
You will need to ensure that Helm is already installed. The install process is simple, and we will only be using minimal options for this demo setup.
- Create a namespace for Loki to run in.
kubectl create namespace loki
- Install Loki and Promtail without persistence
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki --namespace=loki grafana/loki-stack -f https://gist.githubusercontent.com/dgvigil/9ed8f4f6de09d0da7f33cda88e10ae88/raw/204b1c090c78732c1f5d205b0fbf737a50b5a27f/loki-rancher-rke-helm-values.yml
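Before moving on, it can be worth confirming that the Loki and Promtail pods actually came up. The following is a minimal sketch, assuming kubectl points at the same cluster and the release landed in the loki namespace created above. The helper name wait_for_loki and the 120-second timeout are illustrative choices, not part of the chart.

```shell
#!/bin/sh
# Namespace created earlier in this walkthrough.
LOKI_NAMESPACE="loki"

# Hypothetical helper: blocks until every pod in the namespace reports Ready,
# or fails once the (arbitrary) 120-second timeout expires.
wait_for_loki() {
  kubectl wait --namespace "$LOKI_NAMESPACE" \
    --for=condition=Ready pods --all --timeout=120s
}
```

Calling wait_for_loki and then running kubectl get pods --namespace loki should list the Loki pod and one Promtail pod per node, if the install behaved as expected.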
Add Loki as a data source
To add Loki as a Grafana data source, you will return to the Cluster Explorer and use the top left drop-down to go to Monitoring. This is where you can click on the link for Grafana. This will open up Grafana. As an administrator, you will then need to move your cursor to the cog icon on the side menu which will show configuration options. You will click on Data Sources. You will see the sources that are automatically added. From here you will click Add data source.
Loki will be listed as an option on this page and you will click Select next to Loki. The only configuration setting that you will need to update for our purposes is the URL in the HTTP section, which should be set to http://loki.loki.svc.cluster.local:3100 for this setup.
Finally, scroll down to the bottom and click Save & Test.
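If Save & Test reports a problem, one way to check Loki independently of Grafana is to reach its HTTP API through a temporary port-forward. This is a sketch under a few assumptions: the Service is named loki in the loki namespace (inferred from the URL above), curl is installed locally, and the helper names and the two-second pause are hypothetical. The /ready path is Loki's readiness endpoint.

```shell
#!/bin/sh
# Service name, namespace, and port inferred from
# http://loki.loki.svc.cluster.local:3100
LOKI_SERVICE="loki"
LOKI_NAMESPACE="loki"
LOKI_PORT=3100

# Hypothetical helper: local URL exposed by the port-forward below.
loki_local_url() {
  printf 'http://localhost:%s' "$LOKI_PORT"
}

# Hypothetical helper: forwards the Loki port, probes /ready, then cleans up.
check_loki_ready() {
  kubectl port-forward --namespace "$LOKI_NAMESPACE" \
    "service/$LOKI_SERVICE" "$LOKI_PORT:$LOKI_PORT" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2  # arbitrary pause to let the tunnel come up
  curl --silent --fail "$(loki_local_url)/ready"
  status=$?
  kill "$pf_pid" 2>/dev/null
  return "$status"
}
```

A zero exit status from check_loki_ready suggests that Loki itself is healthy, which narrows the problem down to the data source settings in Grafana.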
Explore the logs
We will now move the cursor back to the side menu to the Explore icon and click there. At the top of the Explore page is a drop-down that you will change from Prometheus to Loki. Some examples from your cluster will be presented to you after that change. One of the log labels I find useful is {job="kube-system/rke2-ingress-nginx"}, which will give you the log data for the Nginx Ingress controller.
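The same label selector can be narrowed with a LogQL line filter, and it can also be sent to Loki's HTTP API outside of Grafana. The sketch below assumes Loki is reachable at http://localhost:3100 (for example through a kubectl port-forward) and that curl is available. The helper names and the search term "error" are illustrative only.

```shell
#!/bin/sh
# Assumed local address for Loki; override LOKI_URL if yours differs.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Stream selector taken from the article.
INGRESS_SELECTOR='{job="kube-system/rke2-ingress-nginx"}'

# Hypothetical helper: appends a "line contains" filter (|=) to a selector.
logql_filter() {
  printf '%s |= "%s"' "$1" "$2"
}

# Hypothetical helper: runs a range query; without start/end parameters,
# Loki's query_range endpoint defaults to the last hour.
query_loki() {
  curl --silent --get "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}
```

For example, query_loki "$(logql_filter "$INGRESS_SELECTOR" error)" would return up to 20 recent ingress log lines containing "error" as JSON. The filtered expression itself can also be pasted straight into the Explore query box.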
A simple RKE2 Dashboard
Next, we can add a dashboard to Grafana that uses both Loki to pull data from the logs and Prometheus for metrics. This is a simple process. To import a dashboard, click the + (plus sign) on the left side menu, and then click Import. You can “Import via grafana.com” by adding the ID of 14243 and clicking Load. The following screen will ask you to point the dashboard at the Prometheus and Loki data sources, and then you can click Import. This will give you a nice, simple dashboard to get you started.
Next Steps
At this point you can begin to create your own custom dashboards in Grafana that pull from your logs and create usable graphs and other metric data, like the example below from Grafana. You can also use Alertmanager to create alerts based on the data and generated metrics from Loki, like creating an alert when the number of ‘502’ Apache responses over the past 5 minutes goes over a specific limit.
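The 502 alert idea can be expressed as a LogQL metric query, which turns matching log lines into a number that can be graphed or compared against a limit. This is a rough sketch: the {app="apache"} selector is a hypothetical label for your Apache pods, and matching on " 502 " with surrounding spaces assumes an access log format where the status code appears as its own field. Loki is again assumed to be reachable at http://localhost:3100.

```shell
#!/bin/sh
# Assumed local address for Loki; override LOKI_URL if yours differs.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Hypothetical helper: builds a metric query that counts lines containing
# " 502 " for a given stream selector ($1) over a time window ($2).
count_502_query() {
  printf 'sum(count_over_time(%s |= " 502 " [%s]))' "$1" "$2"
}

# Hypothetical helper: evaluates a query once via Loki's instant query endpoint.
run_instant_query() {
  curl --silent --get "$LOKI_URL/loki/api/v1/query" \
    --data-urlencode "query=$1"
}
```

Running run_instant_query "$(count_502_query '{app="apache"}' 5m)" returns the current count as JSON. Appending a comparison such as > 10 to the same expression is one way to turn it into an alert condition, with the limit chosen to suit your traffic.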
Conclusion
Ensuring that your systems are performing and behaving up to your expectations is not a passive task. Proactive monitoring from a single pane of glass offers the ability to feel confident that your Ops team has a firm grasp on what your cluster is doing without sending your logs to some unknown server overseas. Loki addresses the need to visualize and correlate all of the information that your machines, clusters, and apps are providing. Your workloads running in your Rancher managed clusters can now become more visible, which makes operators, app owners, and managers all very happy.
If you’d like to learn more about Loki, a great blog to start with would be this one:
An (only slightly technical) introduction to Loki, the Prometheus-inspired open source logging system
Late breaking update: Loki has recently been accepted into the USAF IronBank image registry. When coupled with the only IronBank Kubernetes distribution, RKE2, Loki provides scalable log analytics for any government mission.
“This publication was prepared or accomplished by the author in a personal capacity. All opinions expressed by the author of this publication are solely their current opinions and do not reflect the opinions of Rancher Federal, Inc., respective parent companies, or affiliates with which the author is affiliated. The author's opinions are based upon information they consider reliable, but neither Rancher Federal, Inc., nor its affiliates, nor the companies with which the author is affiliated, warrant its completeness or accuracy, and it should not be relied upon as such.”