Manifest configuration

Saturation points

A saturation point is an abstract definition of a saturation metric that can apply to many services.

Example:

{
    "saturationPoints": {
      "pg_int4_id": {
        "appliesTo": [
          "patroni",
          "patroni-ci"
        ],
        "capacityPlanning": {
          "changepoints_count": 25,
          "forecast_days": 365,
          "historical_days": 365,
          "strategy": "quantile95_1h"
        },
        "description": "This measures used int4 columns capacity in all postgres tables. It is critically important that we do not reach\nsaturation on primary key columns as GitLab will stop working at this point.\n\nThe saturation point tracks all integer columns, so also foreign keys that might not match their source.\n\nIID columns are deliberately ignored because they are scoped to a project/namespace.\n",
        "horizontallyScalable": false,
        "raw_query": "max by (column_name) (\n  pg_int4_saturation_current_largest_value{%(selector)s,column_name!~\".+(.|-|_)iid\"} / pg_int4_saturation_column_max_value{%(selector)s,column_name!~\".+(.|-|_)iid\"}\n)\n",
        "severity": "s1",
        "slos": {
          "alertTriggerDuration": "5m",
          "hard": 0.90000000000000002,
          "soft": 0.5
        },
        "title": "Postgres int4 ID capacity"
      }
    }
}

This saturation point applies to multiple services listed in appliesTo.

The soft SLO in .slos.soft defines the threshold for capacity planning: a capacity warning is created when a saturation metric is forecast to breach the soft SLO (or has already done so). A hard SLO can also be included; it is plotted but otherwise has no meaning for Tamland.
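The soft SLO rule can be sketched as follows. This is a minimal illustration and not Tamland's implementation: the function name `first_soft_slo_breach`, the forecast representation (a list of `(day, value)` pairs), and the choice of `>=` as the breach condition are assumptions made for the example.

```python
from datetime import date, timedelta


def first_soft_slo_breach(forecast, soft):
    """Return the first day whose forecast value reaches the soft SLO, or None.

    `forecast` is a list of (day, value) pairs ordered by day (hypothetical shape).
    """
    for day, value in forecast:
        if value >= soft:  # assumption: reaching the threshold counts as a breach
            return day
    return None


# Hypothetical forecast values for five consecutive days.
start = date(2024, 1, 1)
values = [0.30, 0.42, 0.48, 0.55, 0.61]
forecast = [(start + timedelta(days=i), v) for i, v in enumerate(values)]

breach = first_soft_slo_breach(forecast, soft=0.5)
if breach is not None:
    print(f"capacity warning: soft SLO breached on {breach.isoformat()}")
```

With a soft SLO of 0.5, the fourth day is the first to reach the threshold, so a warning would be raised for that date.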

For capacity planning and forecasting, parameters can be changed in capacityPlanning:

changepoints_count: Maximum number of changepoints to include in the model.

Default: 25

Example:

"capacityPlanning": {
   "changepoints_count": 25
},

forecast_days: Forecast this many days into the future.

Default: 90

Example:

"capacityPlanning": {
   "forecast_days": 90
},

historical_days: Use this many days of history to build the forecasting model.

Default: 365

Example:

"capacityPlanning": {
   "historical_days": 365
},

strategy: Use a specific query template from .defaults.prometheus.queryTemplates to retrieve the saturation metric. Specify exclude to exclude the saturation point from forecasting.

Other strategies could be:

- quantile95_1h
- quantile99_1h
- quantile95_1w
- quantile99_1w

Example:

"capacityPlanning": {
   "strategy": "exclude"
},

saturation_dimensions: A list of PromQL selectors. Each item is an object containing a selector string and an optional label string.

Example:

"capacityPlanning": {
   "saturation_dimensions": [
      { "selector": "shard=\"private\"" },
      { "selector": "shard=\"shared-gitlab-org\"" },
      { "selector": "shard=\"saas-linux-small-amd64\"" },
      { "selector": "shard=\"saas-linux-medium-amd64\"" },
      { "selector": "shard=\"saas-linux-medium-arm64\"" },
      { "selector": "shard=\"saas-linux-medium-amd64-gpu-standard\"" },
      { "selector": "shard=\"saas-linux-large-amd64\"" },
      { "selector": "shard=\"saas-linux-large-arm64\"" },
      { "selector": "shard=\"saas-linux-xlarge-amd64\"" },
      { "selector": "shard=\"saas-linux-2xlarge-amd64\"" },
      { "selector": "shard=\"saas-macos-medium-m1\"" },
      { "selector": "shard=\"saas-macos-large-m2pro\"" },
      { "selector": "shard=\"windows-shared\"" },
      {
         "selector": "shard!~\"private|saas-linux-2xlarge-amd64|saas-linux-large-amd64|saas-linux-large-arm64|saas-linux-medium-amd64|saas-linux-medium-amd64-gpu-standard|saas-linux-medium-arm64|saas-linux-small-amd64|saas-linux-xlarge-amd64|saas-macos-large-m2pro|saas-macos-medium-m1|shared-gitlab-org|windows-shared\"",
         "label": "shard=*"
      }
   ]
},

saturation_dimension_dynamic_lookup_query: A PromQL query template used to dynamically look up labels and use them as saturation dimensions.

Example:

"capacityPlanning": {
   "saturation_dimension_dynamic_lookup_query": "sum by(shard) (\n  last_over_time(gitlab_component_saturation:ratio{component=\"sidekiq_thread_contention\", %(selector)s}[1w])\n)\n"
},

Note: when both static and dynamic saturation dimensions are available, their values are merged and deduplicated.
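The merge of static and dynamic dimensions could look roughly like this. It is a sketch under assumptions: the helper name `merge_dimensions` is hypothetical, dimensions are compared by their selector string, and the first occurrence wins (so a static entry with a label is kept over a dynamic duplicate).

```python
def merge_dimensions(static, dynamic):
    """Merge two lists of dimension objects, dropping duplicate selectors."""
    merged = {}
    for dim in [*static, *dynamic]:
        # setdefault keeps the first object seen for each selector
        merged.setdefault(dim["selector"], dim)
    return list(merged.values())


static = [
    {"selector": 'shard="private"'},
    {"selector": 'shard="windows-shared"'},
]
# Hypothetical result of a dynamic lookup, already converted to selectors.
dynamic = [
    {"selector": 'shard="windows-shared"'},
    {"selector": 'shard="saas-linux-small-amd64"'},
]

merged = merge_dimensions(static, dynamic)
print([d["selector"] for d in merged])
```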

saturation_dimension_dynamic_lookup_limit: A limit to the number of dynamically looked up labels to use as saturation dimensions, where 0 means unlimited.

Default: 10

Example:

"capacityPlanning": {
   "saturation_dimension_dynamic_lookup_limit": 0
},
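The meaning of the limit, including the special value 0, can be illustrated with a small sketch. The helper name `apply_lookup_limit` is hypothetical, and keeping the first N labels in the order returned is an assumption; the source does not say which labels are kept when the limit applies.

```python
def apply_lookup_limit(dynamic_labels, limit=10):
    """Keep at most `limit` dynamically looked-up labels; 0 means unlimited."""
    if limit == 0:
        return list(dynamic_labels)
    return list(dynamic_labels[:limit])


labels = [f"shard-{i}" for i in range(15)]  # hypothetical lookup result

print(len(apply_lookup_limit(labels)))           # default limit of 10
print(len(apply_lookup_limit(labels, limit=0)))  # unlimited
```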

experiment_saturation_ratio_raw_query: Enable the component to forecast using the raw PromQL query.

Default: false

Example:

"capacityPlanning": {
   "experiment_saturation_ratio_raw_query": true
},
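To see how the documented defaults combine with the values set on a saturation point, consider the following sketch. The `DEFAULTS` table only restates the defaults listed above, and `effective_capacity_planning` is a hypothetical helper; how Tamland resolves defaults internally is not described here.

```python
# Defaults as documented above; parameters without a documented default are omitted.
DEFAULTS = {
    "changepoints_count": 25,
    "forecast_days": 90,
    "historical_days": 365,
    "saturation_dimension_dynamic_lookup_limit": 10,
    "experiment_saturation_ratio_raw_query": False,
}


def effective_capacity_planning(saturation_point):
    """Overlay a saturation point's capacityPlanning values on the defaults."""
    return {**DEFAULTS, **saturation_point.get("capacityPlanning", {})}


pg_int4_id = {
    "capacityPlanning": {
        "changepoints_count": 25,
        "forecast_days": 365,
        "historical_days": 365,
        "strategy": "quantile95_1h",
    }
}

params = effective_capacity_planning(pg_int4_id)
print(params["forecast_days"], params["saturation_dimension_dynamic_lookup_limit"])
```

For the pg_int4_id example, forecast_days is overridden to 365 while the lookup limit falls back to its default of 10.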

Services

The following capacity planning parameters can be set or overridden in a service’s parameters:

saturation_dimensions_keep_aggregate: Boolean flag indicating whether components should also be forecasted as an aggregate of all their dimensions.

Default: true

Example:

"capacityPlanning": {
   "saturation_dimensions_keep_aggregate": false
},

saturation_dimensions: A list of saturation dimensions that will be concatenated to the saturation_dimensions defined for each saturation point targeted by the service. Duplicate dimensions are removed at runtime.

components: Per-component parameters for tuning the forecast, described below.
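The interaction between service-level saturation_dimensions and saturation_dimensions_keep_aggregate can be sketched as below. The function name `forecast_targets` is hypothetical, and using `None` to stand for the aggregate of all dimensions is an assumption made only for this example.

```python
def forecast_targets(point_dimensions, service_planning):
    """Return the selectors to forecast; None stands for the aggregate series."""
    selectors = []
    service_dims = service_planning.get("saturation_dimensions", [])
    for dim in [*point_dimensions, *service_dims]:
        if dim["selector"] not in selectors:  # drop duplicated dimensions
            selectors.append(dim["selector"])
    # The documented default for saturation_dimensions_keep_aggregate is true.
    if service_planning.get("saturation_dimensions_keep_aggregate", True):
        return [None, *selectors]
    return selectors


point_dims = [{"selector": 'shard="private"'}]
service_planning = {
    "saturation_dimensions": [
        {"selector": 'shard="private"'},
        {"selector": 'shard="windows-shared"'},
    ],
    "saturation_dimensions_keep_aggregate": False,
}

print(forecast_targets(point_dims, service_planning))
```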

Tuning the forecast

There are two ways to tune a forecast, which can be combined: (1) specifying outliers and (2) adding custom trend changepoints.

Outliers

Read more about how Prophet handles outliers in the Prophet documentation.

Outliers are specified in the manifest file within a component’s parameters. For example:

"gitaly": {
    "capacityPlanning": {
      "components": [
          {
            "name": "node_schedstat_waiting",
            "parameters": {
                "ignore_outliers": [
                  {
                      "end": "2022-06-15",
                      "start": "2022-05-23"
                  },
                  {
                      "end": "2022-07-01",
                      "start": "2022-06-25"
                  },
                  {
                      "end": "2023-05-10",
                      "start": "2023-03-31"
                  }
                ]
            }
          }
      ]
    },
    "label": "Gitaly",
    "name": "gitaly",
    "owner": "reliability_practices"
},
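Prophet's documentation recommends handling outliers by removing them from the history, since the model tolerates missing data. Assuming ignore_outliers is applied in that spirit, the effect of the date ranges could be sketched as follows. The helper `mask_outliers`, the `(day, value)` series shape, and treating both start and end as inclusive are assumptions for illustration.

```python
from datetime import date


def mask_outliers(series, ignore_outliers):
    """Replace values that fall inside any ignored range with None."""
    ranges = [
        (date.fromisoformat(r["start"]), date.fromisoformat(r["end"]))
        for r in ignore_outliers
    ]
    return [
        (day, None if any(start <= day <= end for start, end in ranges) else value)
        for day, value in series
    ]


ignore_outliers = [{"start": "2022-05-23", "end": "2022-06-15"}]
# Hypothetical saturation history around the first ignored range.
series = [
    (date(2022, 5, 22), 0.40),
    (date(2022, 5, 23), 0.95),
    (date(2022, 6, 15), 0.97),
    (date(2022, 6, 16), 0.50),
]

masked = mask_outliers(series, ignore_outliers)
print([value for _, value in masked])
```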

Changepoints

Read more about how Prophet detects trend changepoints in the Prophet documentation.

Changepoints are specified in the manifest file within a component’s parameters. For example:

"sidekiq": {
   "capacityPlanning": {
      "components": [
         {
            "name": "rails_db_connection_pool",
            "parameters": {
               "changepoints": [
                  "2023-04-03"
               ]
            }
         }
      ]
   }
}
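Reading the custom changepoints for one component out of a service definition might look like this. It is a sketch only: `component_changepoints` is a hypothetical helper, and how the resulting dates are handed to the forecasting model is not covered here.

```python
from datetime import date


def component_changepoints(service, component_name):
    """Return the sorted changepoint dates configured for a component, if any."""
    components = service.get("capacityPlanning", {}).get("components", [])
    for component in components:
        if component["name"] == component_name:
            raw = component.get("parameters", {}).get("changepoints", [])
            return sorted(date.fromisoformat(d) for d in raw)
    return []


sidekiq = {
    "capacityPlanning": {
        "components": [
            {
                "name": "rails_db_connection_pool",
                "parameters": {"changepoints": ["2023-04-03"]},
            }
        ]
    }
}

print(component_changepoints(sidekiq, "rails_db_connection_pool"))
```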

Query templates

PromQL query templates can be defined in .defaults.prometheus.queryTemplates. They are used to retrieve saturation data for a specific component. In the example below, we specify different query templates to offer a selection of strategies for smoothing the saturation data. A saturation point selects one of them through .capacityPlanning.strategy (see above).

{
   "defaults": {
      "prometheus": {
         "baseURL": "https://prometheus.local",
         "defaultSelectors": {
            "env": "gprd"
         },
         "queryTemplates": {
            "quantile95_1h": "max(quantile_over_time(0.95, gitlab_component_saturation:ratio{%s}[1h]))",
            "quantile95_1w": "max(gitlab_component_saturation:ratio_quantile95_1w{%s})",
            "quantile99_1h": "max(gitlab_component_saturation:ratio_quantile99_1h{%s})",
            "quantile99_1w": "max(gitlab_component_saturation:ratio_quantile99_1w{%s})"
         },
         "serviceLabel": "type"
      }
   }
}

Tamland uses a query template to retrieve saturation data for a particular component. The service label for this component (see .serviceLabel) and the default selectors (see .defaultSelectors) are injected into the PromQL query.
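The injection step can be approximated with Python's printf-style formatting, which matches the `%s` placeholder in the templates. This is a sketch under assumptions: `render_query` is a hypothetical helper, the selectors are joined with commas, and the `component` label name is assumed from the dynamic lookup query example rather than confirmed as what Tamland injects.

```python
def render_query(defaults, strategy, service, component):
    """Fill a query template with default selectors plus service and component labels."""
    prom = defaults["prometheus"]
    labels = {
        **prom["defaultSelectors"],
        prom["serviceLabel"]: service,
        "component": component,  # assumption: label name used for the component
    }
    selector = ", ".join(f'{name}="{value}"' for name, value in labels.items())
    return prom["queryTemplates"][strategy] % selector


defaults = {
    "prometheus": {
        "defaultSelectors": {"env": "gprd"},
        "queryTemplates": {
            "quantile95_1w": "max(gitlab_component_saturation:ratio_quantile95_1w{%s})",
        },
        "serviceLabel": "type",
    }
}

query = render_query(defaults, "quantile95_1w", "patroni", "pg_int4_id")
print(query)
```

Keeping the selector construction separate from the template means a new strategy only requires a new entry in queryTemplates.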