Using the Grafana Cloud Agent with Amazon Managed Prometheus, across multiple AWS accounts

Observability is all the rage these days, and the process of collecting metrics is getting easier. Now, the big(ger) players are getting in on the action, with Amazon releasing a Managed Prometheus offering and Grafana now providing a simplified “all-in-one” monitoring agent.

This is a quick guide showing how you can couple these two together on individual hosts, incorporating cross-account access control.

The Grafana Cloud Agent

Grafana Labs have taken (some of) the best bits of the Prometheus monitoring stack and created a unified deployment that wraps the individual moving parts up into a single binary. There’s nothing particularly exciting about this, until you consider that it also includes the central Prometheus server - meaning the Grafana Agent allows us to push metrics from any individual host without a central server needing to discover it, and it maintains a temporary local metrics store should the remote endpoint be unreachable.

By running a slimmed-down Prometheus instance on each node, we get to handle transient central node failures (for example, if your central Prometheus instance is offline and unable to scrape) as well as allowing ephemeral/temporary instances to push their metrics automatically to a central source for as long as they’re online. Management overhead is reduced, single points of failure eliminated.

Amazon Managed Service for Prometheus

(NB: At the time of writing, this service is still in open preview)

Anyone who has spent any time trying to manage a Thanos or Cortex cluster will no doubt appreciate that AWS are now offering this pain as a service, but without the pain. Sure, it has a bit of a mouthful of a name, but that’s the hallmark of any good AWS service, right?

Developed in collaboration with Grafana Labs, behind the scenes this is a managed Cortex service. AWS takes care of “everything” and gives you two simple things:

  • A remote write endpoint, compatible with the Prometheus Remote Write spec
  • A query endpoint, compatible with the Prometheus query specification

Everything else - scaling, storage, upgrades, monitoring - is all handled “automagically” by AWS. Obviously, this comes at a price, and working out what you’re going to pay can be complex (as there is a price for storage, as well as a price-per-query), but as always it is up to you, dear readers, to decide if the cost is “worth it” or not.

Installing the Grafana Agent

If I had one complaint about the Grafana Agent, it would be the configuration file. Not just the complex syntax when you start enabling multiple collectors, but the fact it seems to keep changing with each release.

The agent development seems very focused around monitoring Kubernetes clusters, rather than installation on individual nodes, so trying to find out what settings are needed and where to actually set them can be a challenge - especially when you stumble across an article and find the config file is now “useless” as the file format has changed… again :(

Anyway, enough moaning - let’s jump straight in. Once you have the agent installed (I used the RPM), you’ll want to edit your /etc/grafana-agent.yaml configuration file to look something like this:

server:
  http_listen_address: ''
  http_listen_port: 9090

prometheus:
  global:
    scrape_interval: 15s
  wal_directory: '/var/lib/grafana-agent'

integrations:
  agent:
    enabled: true
  node_exporter:
    enabled: true
    include_exporter_metrics: false
    disable_collectors:
      - "mdadm"
  prometheus_remote_write:
    - url: https://aps-workspaces.eu-west-1.amazonaws.com/workspaces/[your workspace ID]/api/v1/remote_write
      sigv4:
        enabled: true
        region: eu-west-1

The more eagle-eyed of you will notice that this is:

  • Mostly the default configuration as supplied with the agent
  • Another random blog post providing a snippet of configuration that might turn out to be invalid
    • (I am using Grafana Agent release 0.13)

All you should need to change in the above file is the Workspace ID in the URL. If you’re running this on an EC2 instance with the arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess policy applied, you should now see metrics flowing into your Managed Prometheus instance.

But what about this cross-account stuff?

Ah yes, that’s where it can get a bit tricky. There seems to be no specific access control on the AMP endpoints - as long as you’re writing to the endpoint “from the same account” (that is, you have a valid set of IAM credentials to generate the ‘Sigv4’ authentication token), your metrics will be accepted.

Unlike other AWS services where you can apply an IAM-style policy to the service, this isn’t (yet?) the case with the Amazon Managed Prometheus service.
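To make the ‘Sigv4’ part concrete, here is a rough standard-library sketch of the request signing that AMP expects. The workspace ID, region, and credentials below are all made-up placeholders - in practice the Grafana Agent (via the AWS SDK) does all of this for you, this just shows why valid IAM credentials are the effective access control:

```python
# Minimal SigV4 signing sketch for an AMP query request (illustrative only;
# the URL, keys and workspace ID are placeholder assumptions).
import datetime
import hashlib
import hmac
import urllib.parse

def sigv4_headers(method, url, region, access_key, secret_key, payload=b""):
    """Build Authorization and X-Amz-Date headers for an AMP request."""
    parts = urllib.parse.urlsplit(url)
    now = datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    service = "aps"  # service code for Amazon Managed Prometheus

    # 1. Canonical request: method, path, query, headers, payload hash
    canonical_headers = f"host:{parts.netloc}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    payload_hash = hashlib.sha256(payload).hexdigest()
    canonical_request = "\n".join([
        method, parts.path or "/", parts.query,
        canonical_headers, signed_headers, payload_hash,
    ])

    # 2. String to sign, scoped to date/region/service
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])

    # 3. Derive the signing key and sign
    def hmac_sha256(key, msg):
        return hmac.new(key, msg.encode(), hashlib.sha256).digest()

    key = hmac_sha256(("AWS4" + secret_key).encode(), date_stamp)
    for part in (region, service, "aws4_request"):
        key = hmac_sha256(key, part)
    signature = hmac.new(key, string_to_sign.encode(), hashlib.sha256).hexdigest()

    return {
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }

headers = sigv4_headers(
    "GET",
    "https://aps-workspaces.eu-west-1.amazonaws.com/workspaces/ws-example/api/v1/query?query=up",
    "eu-west-1", "AKIAEXAMPLE", "example-secret-key",
)
print(headers["Authorization"])
```

The signature is scoped to a date, region, and service, which is also why the region settings later in this post matter.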

Luckily, we can assume a role across the account boundaries to allow us to obtain suitable credentials for writing to the AMP endpoint. We need to do a few things to make this happen.

AWS structure

For the purposes of the examples, I’m assuming you have two AWS account IDs:

  • 123456700000: “Primary”: The account where you have created your AMP Workspace
  • 123456711111: “Secondary”: The account where you’re going to host your EC2 instances running the Grafana Agent

Primary account setup

We will need to create an IAM role which can be “assumed” by the secondary account, in effect granting credentials “from” the primary account to instances running in the secondary account. I am assuming if you’re reading this you know how to create IAM roles, so go ahead and create one with the AmazonPrometheusRemoteWriteAccess policy attached, and a trust relationship with the “secondary” account.

For the purposes of the example here, my IAM role will be named: AMP-RemoteWrite.

By default, when creating a cross-account role via the console, any role in the secondary account will be permitted to assume the new role in the primary (provided it has a corresponding permissions policy allowing it to do so), but you can modify this if you prefer a finer-grained trust policy.
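For reference, the trust policy on the AMP-RemoteWrite role would look something like this (using the example account IDs from above - the :root principal is what permits any role in the secondary account to assume it):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456711111:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

To tighten this up, replace the :root principal with the ARN of the specific instance-profile role in the secondary account.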

Secondary account setup

Go ahead and create an IAM Instance Profile for the EC2 instances running in the secondary account, and ensure they have at least the following permissions policy assigned:

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "AllowAMPRemoteWrite",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::123456700000:role/AMP-RemoteWrite"

This permissions statement will allow the secondary account’s IAM role to assume the AMP-RemoteWrite role in the primary account.

Grafana Agent setup

In the secondary account, you’ll need to make a couple of changes to the environment setup for the Grafana agent:

  • Provide an AWS SDK configuration file to configure the AssumeRole operation, and
  • Tell the grafana-agent (or, more specifically, the AWS Go SDK) to read this configuration file.

In the examples below, I am assuming that you have installed the grafana-agent RPM package. If you’re installing via another method, you’ll need to update the paths/file names accordingly.

The AWS SDK configuration is quite straightforward: create ~grafana-agent/.aws/config with the following content:

[profile remotewrite]
role_arn = arn:aws:iam::123456700000:role/AMP-RemoteWrite
credential_source = Ec2InstanceMetadata

In this file, remotewrite is the profile name we’ll define in our agent environment, and you will need to replace role_arn with the relevant ARN for your role in the primary account.
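One quirk worth calling out: in the shared config file (unlike the credentials file), the SDKs expect non-default section names to carry a “profile ” prefix. A small illustrative sketch, parsing an inline copy of the file with Python’s configparser:

```python
# Illustrative only: shows the structure the SDK expects in .aws/config,
# parsed from an inline string rather than the real file.
import configparser

config_text = """
[profile remotewrite]
role_arn = arn:aws:iam::123456700000:role/AMP-RemoteWrite
credential_source = Ec2InstanceMetadata
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# Note the section name includes the "profile " prefix
profile = parser["profile remotewrite"]
print(profile["role_arn"])
print(profile["credential_source"])
```

If the prefix is missing, the SDK will fail to find the profile even though the file otherwise looks correct.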

Next, we make a couple of changes to the environment variables provided to the Grafana Agent binary at startup. We can do this with a systemd override file, so go ahead and create a new file (and corresponding directory) on your EC2 instance called /etc/systemd/system/grafana-agent.service.d/aws.conf, with the following content:

[Service]
Environment=AWS_SDK_LOAD_CONFIG=true
Environment=AWS_STS_REGIONAL_ENDPOINTS=regional
Environment=AWS_REGION=eu-west-1
Environment=AWS_PROFILE=remotewrite

Replace remotewrite with the profile name defined in the ~grafana-agent/.aws/config file, if you’ve changed it.

In simplistic terms, this file is saying:

  • AWS_SDK_LOAD_CONFIG: Please load the configuration file.
    • By default, the Go SDK does not read this file. (See the SDK docs for more)
  • AWS_STS_REGIONAL_ENDPOINTS: Please use per-region endpoints for the STS service.
    • This is generally only required if you’re using VPC Endpoints or have restricted Internet access from your VPC. In short, rather than the sts:AssumeRole operation trying to talk to the global sts.amazonaws.com endpoint, it will use the region-specific endpoint - in our case, sts.eu-west-1.amazonaws.com.
  • AWS_REGION: Defines the AWS region the host resides in.
    • NB: you may need to override this if your AMP instance is in a different region than your EC2 instance, as the region code is integrated into the Sigv4 authentication process. For simplicity I would ensure this is set to the EC2 instance region, and ensure the grafana-agent.yaml region is set to the region your AMP workspace is in
  • AWS_PROFILE: This is the name of the profile that will be used from the .aws/config file

That’s it!

That should be it - if you’ve followed the steps above (and run systemctl daemon-reload, then restarted the grafana-agent service so the new environment is picked up), you should have metrics from your EC2 instance flowing into your AWS Managed Prometheus instance!

Postscript: A note about endpoints

Most of what I have described above will work with the standard public endpoints, or if you’re using a VPC Endpoint. The only thing to be mindful of is that the AMP VPC endpoints have a slightly different SSL configuration to other AWS services, in that their certificates do not include a CN matching the VPC endpoint address, only the “public” DNS names.

Generally, this is only an issue if you’re targeting the VPC endpoint directly, for example from a private network connected to AWS. I’ve not tackled this problem with the grafana-agent, but if you’re using the AWS Sigv4 Proxy and hitting SSL issues, Pull Request #35 might help.