Files
firezone/terraform
Jamil 5bac3f5ec2 fix(infra): Don't send more/faster metrics than Google accepts (#8028)
We are getting quite a few of these warnings on prod:

```
{400, "{\n  \"error\": {\n    \"code\": 400,\n    \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n    \"status\": \"INVALID_ARGUMENT\",\n    \"details\": [\n      {\n        \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n        \"totalPointCount\": 40,\n        \"successPointCount\": 31,\n        \"errors\": [\n          {\n            \"status\": {\n              \"code\": 9\n            },\n            \"pointCount\": 9\n          }\n        ]\n      }\n    ]\n  }\n}\n"}
```

Since the point count is _much_ less than our flush buffer size of 1000,
we can only surmise the limit we're hitting is the flush interval.

The telemetry metrics reporter is run on each node, so we run the risk
of violating Google's API limit regardless of what a single node's
`@flush_interval` is set to.

To solve this, we use a new table `telemetry_reporter_logs` that stores
the last time a particular `flush` occurred for a reporter module. This
tracks global state as to when the last flush occurred, and if too
recent, the timer-based flush is call is `no-op`ed until the next one.

**Note**: The buffer-based `flush` is left unchanged, this will always
be called when `buffer_size > max_buffer_size`.
2025-02-10 18:21:40 +00:00
..