Manifest configuration
Saturation points
A saturation point is an abstract definition of a saturation metric that can apply to many services.
Example:
{
"saturationPoints": {
"pg_int4_id": {
"appliesTo": [
"patroni",
"patroni-ci"
],
"capacityPlanning": {
"changepoints_count": 25,
"forecast_days": 365,
"historical_days": 365,
"strategy": "quantile95_1h"
},
"description": "This measures used int4 columns capacity in all postgres tables. It is critically important that we do not reach\nsaturation on primary key columns as GitLab will stop working at this point.\n\nThe saturation point tracks all integer columns, so also foreign keys that might not match their source.\n\nIID columns are deliberately ignored because they are scoped to a project/namespace.\n",
"horizontallyScalable": false,
"raw_query": "max by (column_name) (\n pg_int4_saturation_current_largest_value{%(selector)s,column_name!~\".+(.|-|_)iid\"} / pg_int4_saturation_column_max_value{%(selector)s,column_name!~\".+(.|-|_)iid\"}\n)\n",
"severity": "s1",
"slos": {
"alertTriggerDuration": "5m",
"hard": 0.9,
"soft": 0.5
},
"title": "Postgres int4 ID capacity"
}
}
}
This saturation point applies to the multiple services listed in appliesTo.
The soft SLO in .slos.soft defines the threshold for capacity planning: a capacity warning is created when a saturation metric is forecast to breach the soft SLO, or has already breached it.
An optional hard SLO in .slos.hard can be included; it is plotted but otherwise has no meaning for Tamland.
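The soft SLO rule above can be sketched in a few lines of Python. This is a simplified illustration, not Tamland's implementation: the function names are hypothetical, the forecast is reduced to one value per day, and treating "at or above the threshold" as a breach is an assumption.

```python
from typing import Optional, Sequence


def first_soft_breach(forecast: Sequence[float], soft_slo: float) -> Optional[int]:
    """Return the index of the first forecast value at or above the soft SLO.

    The index is the number of days from now (hypothetical convention).
    Returns None when the forecast never reaches the threshold.
    """
    for day, value in enumerate(forecast):
        if value >= soft_slo:  # assumption: reaching the threshold counts as a breach
            return day
    return None


def needs_capacity_warning(current: float, forecast: Sequence[float], soft_slo: float) -> bool:
    """Warn when the metric has already breached, or is forecast to breach, the soft SLO."""
    return current >= soft_slo or first_soft_breach(forecast, soft_slo) is not None


if __name__ == "__main__":
    # With "soft": 0.5 from the example manifest above.
    print(needs_capacity_warning(0.2, [0.3, 0.45, 0.5, 0.6], soft_slo=0.5))
```

The hard SLO does not appear in the sketch because, as stated above, it is only plotted.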
Parameters for capacity planning and forecasting can be changed in capacityPlanning:
changepoints_count
: Maximum number of changepoints to include in the model.
Default: 25
Example:
"capacityPlanning": {
"changepoints_count": 25
},
forecast_days
: Forecast this many days into the future.
Default: 90
Example:
"capacityPlanning": {
"forecast_days": 90
},
historical_days
: Use this many days of history to build the forecasting model.
Default: 365
Example:
"capacityPlanning": {
"historical_days": 365
},
strategy
: Use a specific query template from .defaults.prometheus.queryTemplates to retrieve the saturation metric, for example quantile95_1h, quantile99_1h, quantile95_1w, or quantile99_1w. Specify exclude to exclude the saturation point from forecasting.
Example:
"capacityPlanning": {
"strategy": "exclude"
},
saturation_dimensions
: A list of PromQL selectors. Each item is an object containing a selector string and an optional label string.
Example:
"capacityPlanning": {
"saturation_dimensions": [
{ "selector": "shard=\"private\"" },
{ "selector": "shard=\"shared-gitlab-org\"" },
{ "selector": "shard=\"saas-linux-small-amd64\"" },
{ "selector": "shard=\"saas-linux-medium-amd64\"" },
{ "selector": "shard=\"saas-linux-medium-arm64\"" },
{ "selector": "shard=\"saas-linux-medium-amd64-gpu-standard\"" },
{ "selector": "shard=\"saas-linux-large-amd64\"" },
{ "selector": "shard=\"saas-linux-large-arm64\"" },
{ "selector": "shard=\"saas-linux-xlarge-amd64\"" },
{ "selector": "shard=\"saas-linux-2xlarge-amd64\"" },
{ "selector": "shard=\"saas-macos-medium-m1\"" },
{ "selector": "shard=\"saas-macos-large-m2pro\"" },
{ "selector": "shard=\"windows-shared\"" },
{
"selector": "shard!~\"private|saas-linux-2xlarge-amd64|saas-linux-large-amd64|saas-linux-large-arm64|saas-linux-medium-amd64|saas-linux-medium-amd64-gpu-standard|saas-linux-medium-arm64|saas-linux-small-amd64|saas-linux-xlarge-amd64|saas-macos-large-m2pro|saas-macos-medium-m1|shared-gitlab-org|windows-shared\"",
"label": "shard=*"
}
]
},
saturation_dimension_dynamic_lookup_query
: A PromQL query template used to look up label values dynamically and use them as saturation dimensions.
Example:
"capacityPlanning": {
"saturation_dimension_dynamic_lookup_query": "sum by(shard) (\n last_over_time(gitlab_component_saturation:ratio{component=\"sidekiq_thread_contention\", %(selector)s}[1w])\n)\n"
},
Note: when both static and dynamic saturation dimensions are defined, their values are merged and deduplicated.
saturation_dimension_dynamic_lookup_limit
: The maximum number of dynamically looked-up labels to use as saturation dimensions; 0 means unlimited.
Default: 10
Example:
"capacityPlanning": {
"saturation_dimension_dynamic_lookup_limit": 0
},
experiment_saturation_ratio_raw_query
: Enable forecasting the component from the raw PromQL query.
Default: false
Example:
"capacityPlanning": {
"experiment_saturation_ratio_raw_query": true
},
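The way static and dynamic saturation dimensions combine can be illustrated with a short Python sketch. It rests on several assumptions that the text above does not confirm: the limit is applied to the dynamic lookup results before merging, duplicates are detected by comparing selector strings, and static entries come first. The function name and the `label_name` parameter are hypothetical.

```python
from typing import Dict, List, Sequence

Dimension = Dict[str, str]  # {"selector": ..., "label": ...} with "label" optional


def merge_dimensions(
    static: Sequence[Dimension],
    dynamic_values: Sequence[str],
    limit: int = 10,
    label_name: str = "shard",
) -> List[Dimension]:
    """Merge static dimensions with dynamically looked-up label values."""
    # A limit of 0 means unlimited; otherwise keep only the first `limit` values.
    if limit > 0:
        dynamic_values = dynamic_values[:limit]

    # Turn each looked-up label value into a selector object.
    dynamic = [{"selector": f'{label_name}="{value}"'} for value in dynamic_values]

    merged: List[Dimension] = []
    seen = set()
    for dimension in list(static) + dynamic:
        # Assumption: two dimensions are duplicates when their selectors match.
        if dimension["selector"] not in seen:
            seen.add(dimension["selector"])
            merged.append(dimension)
    return merged


if __name__ == "__main__":
    static = [{"selector": 'shard="private"'}]
    print(merge_dimensions(static, ["private", "catchall", "urgent"], limit=0))
```

In this sketch a static entry wins over a dynamic one with the same selector, so an explicit label set in the manifest is preserved.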
Services
The following capacity planning parameters can be set or overridden in a service's parameters:
saturation_dimensions_keep_aggregate
: boolean flag to indicate whether components should also be forecasted as an aggregate of all their dimensions.
Default: true
Example:
"capacityPlanning": {
"saturation_dimensions_keep_aggregate": false
},
saturation_dimensions
: A list of saturation dimensions that is concatenated with the saturation_dimensions defined for each saturation point targeted by the service. Duplicate dimensions are removed at runtime.
components
: Per-component parameters, described under Tuning the forecast below.
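As a rough illustration of how the service-level parameters interact, the Python sketch below builds the list of series to forecast for one component. It is an assumption-laden simplification: dimensions are reduced to plain selector strings, the aggregate is represented by `None`, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence


def forecast_targets(
    point_dimensions: Sequence[str],
    service_dimensions: Sequence[str],
    keep_aggregate: bool = True,
) -> List[Optional[str]]:
    """Return the selectors to forecast; None stands for the aggregate series."""
    targets: List[Optional[str]] = []
    if keep_aggregate:
        # saturation_dimensions_keep_aggregate defaults to true.
        targets.append(None)

    # Service dimensions are concatenated to the saturation point's dimensions,
    # then duplicates are dropped while preserving order.
    for selector in list(point_dimensions) + list(service_dimensions):
        if selector not in targets:
            targets.append(selector)
    return targets


if __name__ == "__main__":
    print(forecast_targets(['shard="a"'], ['shard="a"', 'shard="b"'], keep_aggregate=False))
```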
Tuning the forecast
There are two ways to tune a forecast: specifying outliers and adding custom trend changepoints. The two can be combined.
Outliers
Read more about Prophet’s outlier detection here.
Outliers are specified in the manifest file within the component's parameters. For example:
"gitaly": {
"capacityPlanning": {
"components": [
{
"name": "node_schedstat_waiting",
"parameters": {
"ignore_outliers": [
{
"end": "2022-06-15",
"start": "2022-05-23"
},
{
"end": "2022-07-01",
"start": "2022-06-25"
},
{
"end": "2023-05-10",
"start": "2023-03-31"
}
]
}
}
]
},
"label": "Gitaly",
"name": "gitaly",
"owner": "reliability_practices"
},
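One common way to handle outlier ranges with Prophet is to blank out the affected observations, since Prophet tolerates missing values in the history. The sketch below shows that idea on a plain list of daily values. Whether Tamland applies ignore_outliers exactly this way is an assumption, as is treating both start and end as inclusive; the function name is hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[date, Optional[float]]


def mask_outliers(series: Sequence[Point], ignore_outliers: Sequence[Dict[str, str]]) -> List[Point]:
    """Replace values inside any ignored range with None (a missing observation)."""
    ranges = [
        (date.fromisoformat(item["start"]), date.fromisoformat(item["end"]))
        for item in ignore_outliers
    ]
    masked: List[Point] = []
    for day, value in series:
        # Assumption: both range boundaries are inclusive.
        ignored = any(start <= day <= end for start, end in ranges)
        masked.append((day, None if ignored else value))
    return masked


if __name__ == "__main__":
    history = [(date(2022, 5, 22), 0.4), (date(2022, 5, 23), 0.9), (date(2022, 6, 16), 0.5)]
    print(mask_outliers(history, [{"start": "2022-05-23", "end": "2022-06-15"}]))
```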
Changepoints
Read more about Prophet’s changepoint detection here.
Changepoints are specified in the manifest file within the component's parameters. For example:
"sidekiq": {
"capacityPlanning": {
"components": [
{
"name": "rails_db_connection_pool",
"parameters": {
"changepoints": [
"2023-04-03"
]
}
}
]
}
}
Query templates
PromQL query templates can be defined in .defaults.prometheus.queryTemplates.
They are used to retrieve saturation data for a specific component.
The example below defines several query templates, offering a selection of strategies for smoothing the saturation data.
A saturation point selects one of them through .capacityPlanning.strategy, as described above.
{
"defaults": {
"prometheus": {
"baseURL": "https://prometheus.local",
"defaultSelectors": {
"env": "gprd"
},
"queryTemplates": {
"quantile95_1h": "max(quantile_over_time(0.95, gitlab_component_saturation:ratio{%s}[1h]))",
"quantile95_1w": "max(gitlab_component_saturation:ratio_quantile95_1w{%s})",
"quantile99_1h": "max(gitlab_component_saturation:ratio_quantile99_1h{%s})",
"quantile99_1w": "max(gitlab_component_saturation:ratio_quantile99_1w{%s})"
},
"serviceLabel": "type"
}
}
}
Tamland uses a query template to retrieve saturation data for a particular component.
The component's service label (see .serviceLabel) and the default selectors (see .defaultSelectors) are injected into the PromQL query.
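The injection step can be sketched in Python using `%`-style string formatting, which matches the `%s` placeholder in the templates. This is an illustration under assumptions rather than Tamland's code: the function names are hypothetical, and adding a `component` label to the selector is an assumption not stated in the text.

```python
from typing import Dict, Optional

MANIFEST = {
    "defaults": {
        "prometheus": {
            "defaultSelectors": {"env": "gprd"},
            "queryTemplates": {
                "quantile95_1w": "max(gitlab_component_saturation:ratio_quantile95_1w{%s})",
            },
            "serviceLabel": "type",
        }
    }
}


def build_selector(labels: Dict[str, str]) -> str:
    """Render labels as a PromQL selector body, sorted by label name."""
    return ", ".join(f'{name}="{value}"' for name, value in sorted(labels.items()))


def render_query(manifest: dict, strategy: str, service: str, component: str) -> Optional[str]:
    """Fill the query template chosen by `strategy`; return None for "exclude"."""
    if strategy == "exclude":
        return None  # the saturation point is not forecast
    prometheus = manifest["defaults"]["prometheus"]
    labels = dict(prometheus["defaultSelectors"])
    labels[prometheus["serviceLabel"]] = service
    labels["component"] = component  # assumption: the component is selected by label
    return prometheus["queryTemplates"][strategy] % build_selector(labels)


if __name__ == "__main__":
    print(render_query(MANIFEST, "quantile95_1w", "sidekiq", "rails_db_connection_pool"))
```

With the manifest fragment above, the call renders a selector containing `env="gprd"` from the default selectors and `type="sidekiq"` from the service label.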