thanos
Highly available Prometheus setup with long-term storage capabilities. A CNCF Incubating project.
Thanos extends Prometheus with unlimited long-term metric storage in cloud object storage, a global query view across multiple clusters, and high availability, without replacing your existing Prometheus setup.
Thanos is a set of tools that extend Prometheus, a popular open-source monitoring system used to collect and query metrics from servers and services. Prometheus works well for a single machine or cluster, but it stores data locally, which means data can be lost if the machine goes down, storage fills up over time, and there is no built-in way to query metrics from multiple Prometheus instances at once. Thanos was built to solve all three problems.
The main things Thanos adds are a global query view, long-term storage limited only by your object store, and high availability. The global query view lets you send one query and have it reach all of your Prometheus servers across multiple clusters or regions, with results merged automatically. For storage, Thanos ships metric data from Prometheus into any supported object storage service (like Amazon S3, Google Cloud Storage, or Azure Blob Storage) where data can be kept indefinitely at low cost. Older data can also be downsampled to 5-minute and 1-hour resolutions to make queries over long time ranges faster.
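To make the idea of downsampling concrete, here is a minimal Python sketch that collapses raw samples into fixed 5-minute windows and keeps a few aggregates per window. It is a simplified illustration, not Thanos's implementation: the `Aggregate` type, the `downsample` function, and the sample data are all hypothetical names introduced for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Summary of all raw samples that fell into one window (hypothetical type)."""
    count: int
    total: float
    minimum: float
    maximum: float


def downsample(samples, window_ms=300_000):
    """Group (timestamp_ms, value) samples into fixed windows of window_ms.

    Returns a dict mapping each window's start timestamp to its Aggregate,
    ordered by window start.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of the window that contains it.
        buckets[ts - ts % window_ms].append(value)
    return {
        start: Aggregate(len(vals), sum(vals), min(vals), max(vals))
        for start, vals in sorted(buckets.items())
    }


# Three samples in the first 5-minute window, one in the second.
raw = [(0, 1.0), (60_000, 3.0), (120_000, 2.0), (300_000, 5.0)]
result = downsample(raw)
```

A query over months of history then reads one aggregate per window instead of every raw sample, which is where the speedup comes from.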
For high availability, teams sometimes run two identical Prometheus instances pointed at the same targets. Thanos can merge their data on the fly and remove duplicates, so if one instance fails, queries still return complete results.
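The effect of merging two replicas can be sketched in a few lines of Python. This is a deliberately simplified model that assumes both replicas record samples at identical timestamps; real replicas scrape at slightly different times, so Thanos does not rely on exact timestamp matches. The function name `merge_replicas` and the sample data are hypothetical.

```python
def merge_replicas(a, b):
    """Merge two replica series of (timestamp_ms, value) pairs.

    Keeps exactly one sample per timestamp, preferring replica `a`
    when both replicas have a sample for the same timestamp.
    """
    merged = dict(b)
    merged.update(dict(a))  # samples from `a` overwrite those from `b`
    return sorted(merged.items())


# Replica A missed the scrape at 30s; replica B went down before 45s.
replica_a = [(0, 1.0), (15_000, 2.0), (45_000, 4.0)]
replica_b = [(0, 1.0), (15_000, 2.0), (30_000, 3.0)]
merged = merge_replicas(replica_a, replica_b)
```

Each replica alone has a gap, but the merged series covers every scrape interval with no duplicated samples.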
Thanos integrates with existing Prometheus setups by adding small sidecar components next to your current Prometheus servers. No major restructuring of your monitoring setup is required. The project supports cross-cluster federation and fault-tolerant query routing, and it exposes a gRPC API (the Store API) that other tools can build on.
Thanos is written in Go and is an incubating project at the Cloud Native Computing Foundation.
Where it fits
- Add long-term metric retention to an existing Prometheus setup by shipping data to S3 or Google Cloud Storage instead of running out of local disk.
- Query metrics from multiple Prometheus instances across regions or clusters with a single PromQL query.
- Run two identical Prometheus instances for redundancy and use Thanos to deduplicate their data so queries return complete results even when one goes down.
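- Downsample old Prometheus data to speed up queries that span months of history. Downsampling adds lower-resolution blocks alongside the raw data rather than replacing it, so storage savings come only from pairing it with a shorter retention period for raw data.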
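As a sketch of the single-query use case above: the Thanos query component serves a Prometheus-compatible HTTP API, so a client can build an instant-query URL the same way it would for Prometheus, with the `dedup` parameter asking for replica deduplication. The host name, port, and `build_query_url` helper below are assumptions made for illustration; substitute the address of your own deployment.

```python
from urllib.parse import urlencode


def build_query_url(base_url, promql, dedup=True):
    """Build an instant-query URL for a Prometheus-compatible HTTP API.

    `dedup` controls whether the querier is asked to deduplicate
    samples from replicated Prometheus instances.
    """
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Hypothetical querier address; one PromQL expression covers every cluster.
url = build_query_url("http://thanos-query.example:10902", "sum(up) by (cluster)")
```

Sending a GET request to the resulting URL returns the usual Prometheus JSON response, with results drawn from every store the querier is connected to.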