Commit Graph

155 Commits

Stephen Taylor
9f3b9f4f56 [ceph-client] Add pool rename support for Ceph pools
A new value "rename" has been added to the Ceph pool spec to allow
pools to be renamed in a brownfield deployment. For greenfield
deployments the pool will be created and renamed in a single
deployment step, and for a brownfield deployment in which the pool
has already been renamed previously, no changes will be made to
pool names.

Change-Id: I3fba88d2f94e1c7102af91f18343346a72872fde
2021-05-11 14:56:06 -06:00
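
A hedged illustration of the rename support described above. The "rename" key
comes from the commit message; its exact location under conf.pool.spec and the
underlying CLI call shown are assumptions.

    # Hypothetical override; only the "rename" key is taken from the commit message.
    helm upgrade --install ceph-client ./ceph-client --namespace ceph \
      --set 'conf.pool.spec[0].name=rbd' \
      --set 'conf.pool.spec[0].rename=rbd-new'

    # Presumably the chart ends up issuing something equivalent to:
    ceph osd pool rename rbd rbd-new
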
Parsons, Cliff (cp769u)
d4f253ef9f Make Ceph pool init job consistent with helm test
The current pool init job only allows PGs in the "peering" or
"activating" (or active) states, but it should also allow the other
states that can occur while the PG autoscaler is running ("unknown",
"creating", and "recover"). The helm test already allows these
states, so the pool init job is changed to allow them as well, for
consistency.

Change-Id: Ib2c19a459c6a30988e3348f8d073413ed687f98b
2021-05-11 15:38:18 +00:00
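
A rough sketch of the kind of state check described above (not the chart's
actual script); the JSON field names from `ceph pg ls -f json` are assumed.

    # Tolerate the transient states named in the commit while the autoscaler works.
    for state in $(ceph pg ls -f json | jq -r '.pg_stats[].state'); do
      case "${state}" in
        *active*|*peering*|*activating*|*creating*|*unknown*|*recover*)
          ;;  # healthy or acceptable transient state, keep waiting
        *)
          echo "unexpected PG state: ${state}" >&2
          exit 1
          ;;
      esac
    done
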
Parsons, Cliff (cp769u)
7bb5ff5502 Make ceph-client helm test more PG specific
This patchset makes the current ceph-client helm test more specific
about checking each of the PGs that are transitioning through inactive
states during the test. If any single PG spends more than 30 seconds in
any of these inactive states (peering, activating, creating, unknown,
etc), then the test will fail.

Also, once the three-minute PG checking period has expired, the
helm test will no longer fail, as it is very possible that the
autoscaler could still be adjusting the PGs for several minutes
after a deployment is done.

Change-Id: I7f3209b7b3399feb7bec7598e6e88d7680f825c4
2021-04-16 22:25:53 +00:00
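
A minimal sketch of the per-PG timing idea above, assuming bash 4 and the
`ceph pg ls -f json` field names used below; it is not the chart's actual
test code.

    # Remember when each PG was first seen inactive; fail if any single PG stays
    # inactive for more than 30 seconds.
    declare -A first_seen
    while true; do
      now=$(date +%s)
      inactive=$(ceph pg ls -f json | jq -r '.pg_stats[] | select(.state | contains("active") | not) | .pgid')
      [ -z "${inactive}" ] && break
      for pg in ${inactive}; do
        : "${first_seen[$pg]:=$now}"
        if (( now - first_seen[$pg] > 30 )); then
          echo "PG ${pg} spent more than 30 seconds in an inactive state" >&2
          exit 1
        fi
      done
      sleep 3
    done
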
Parsons, Cliff (cp769u)
f20eff164f Allow Ceph RBD pool job to leave failed pods
This patchset adds the capability to configure the Ceph RBD pool
job to leave failed pods behind for debugging purposes, if desired.
The default is not to leave them behind, which matches the current
behavior.

Change-Id: Ife63b73f89996d59b75ec617129818068b060d1c
2021-03-29 19:38:55 +00:00
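
A hypothetical override showing how such a toggle might be wired in; the
commit message does not give the values key, so the name used below is purely
an assumption.

    # Key name is an assumption, not the chart's documented value.
    helm upgrade --install ceph-client ./ceph-client --namespace ceph \
      --set 'jobs.rbd_pool.leave_failed_pods=true'
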
Parsons, Cliff (cp769u)
167b9eb1a8 Fix ceph-client helm test
This patch resolves a helm test problem where the test failed
if it found a PG state of "activating". It could also potentially
find a number of other states, like premerge or unknown, that
would likewise fail the test. Note that if these transient PG
states are found for more than 3 minutes, the helm test fails.

Change-Id: I071bcfedf7e4079e085c2f72d2fbab3adc0b027c
2021-03-22 22:06:27 +00:00
Stephen Taylor
69a7916b92 [ceph-client] Disable autoscaling before pools are created
When autoscaling is disabled after pools are created, there is an
opportunity for some autoscaling to take place before it is
disabled. This change checks whether autoscaling needs to be
disabled before creating pools, then checks whether it needs to be
enabled after creating pools. This ensures that autoscaling won't
happen while the autoscaler is disabled and won't start prematurely
as pools are being created when it is enabled.

Change-Id: I8803b799b51735ecd3a4878d62be45ec50bbbe19
2021-03-12 15:03:51 +00:00
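
A sketch of the ordering described above, assuming a hypothetical
AUTOSCALER_ENABLED flag; the chart's real script and value names may differ.

    # Disable autoscaling *before* any pool exists so nothing scales in between.
    if [ "${AUTOSCALER_ENABLED:-false}" = "false" ]; then
      ceph config set global osd_pool_default_pg_autoscale_mode off
    fi

    # ... pools are created here ...

    # Only after the pools exist, switch autoscaling on if it is wanted.
    if [ "${AUTOSCALER_ENABLED:-false}" = "true" ]; then
      for pool in $(ceph osd pool ls); do
        ceph osd pool set "${pool}" pg_autoscale_mode on
      done
    fi
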
bw6938
bb3ce70a10 [ceph-client] enhance logic to enable and disable the autoscaler
The autoscaler was introduced in the Nautilus release. This
change only sets the pg_num value for a pool if the autoscaler
is disabled or the Ceph release is earlier than Nautilus.

When pools are created with the autoscaler enabled, a pg_num_min
value specifies the minimum value of pg_num that the autoscaler
will target. That default was recently changed from 8 to 32
which severely limits the number of pools in a small cluster per
https://github.com/rook/rook/issues/5091. This change overrides
the default pg_num_min value of 32 with a value of 8 (matching
the default pg_num value of 8) using the optional --pg-num-min
<value> argument at pool creation, and by setting the pg_num_min
value on existing pools.

Change-Id: Ie08fb367ec8b1803fcc6e8cd22dc8da43c90e5c4
2021-03-09 22:11:47 +00:00
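
The two operations described above, roughly as they appear on the CLI; the
pool name and PG counts below are only examples.

    # New pool: override the default pg_num_min of 32 at creation time.
    ceph osd pool create example-pool 8 8 --pg-num-min 8

    # Existing pool: lower pg_num_min after the fact.
    ceph osd pool set example-pool pg_num_min 8
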
Stephen Taylor
cf7d665e79 [ceph-client] Separate pool quotas from pg_num calculations
Currently pool quotas and pg_num calculations are both based on
percent_total_data values. This can be problematic when the amount
of data allowed in a pool doesn't necessarily match the percentage
of the cluster's data expected to be stored in the pool. It is
also more intuitive to define absolute quotas for pools.

This change adds an optional pool_quota value that defines an
explicit value in bytes to be used as a pool quota. If pool_quota
is omitted for a given pool, that pool's quota is set to 0 (no
quota).

A check_pool_quota_target() Helm test has also been added to
verify that the sum of all pool quotas does not exceed the target
quota defined for the cluster if present.

Change-Id: I959fb9e95d8f1e03c36e44aba57c552a315867d0
2021-02-26 16:49:10 +00:00
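
A hedged example of the new pool_quota value and the CLI call it presumably
maps to; the key name comes from the commit, while the surrounding structure
and numbers are illustrative.

    # Hypothetical override: pool_quota is in bytes; 0 (or omission) means no quota.
    helm upgrade --install ceph-client ./ceph-client --namespace ceph \
      --set 'conf.pool.spec[0].name=cinder.volumes' \
      --set 'conf.pool.spec[0].pool_quota=107374182400'   # 100 GiB in bytes

    # Presumably applied with:
    ceph osd pool set-quota cinder.volumes max_bytes 107374182400
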
Brian Wickersham
714cfdad84 Revert "[ceph-client] enhance logic to enable the autoscaler for Octopus"
This reverts commit 910ed906d0.

Reason for revert: May be causing upstream multinode gates to fail. 

Change-Id: I1ea7349f5821b549d7c9ea88ef0089821eff3ddf
2021-02-25 17:04:37 +00:00
bw6938
910ed906d0 [ceph-client] enhance logic to enable the autoscaler for Octopus
Change-Id: I90d4d279a96cd298eba03e9c0b05a8f2a752e746
2021-02-19 21:03:45 +00:00
Stephen Taylor
1dcaffdf70 [ceph-client] Don't wait for premerge PGs in the rbd pool job
The wait_for_pgs() function in the rbd pool job waits for all PGs
to become active before proceeding, but in the event of an upgrade
that decreases pg_num values on one or more pools it sees PGs in
the clean+premerge+peered state as peering and waits for "peering"
to complete. Since these PGs are in the process of merging into
active PGs, waiting for the merge to complete is unnecessary. This
change will reduce the wait time in this job significantly in
these cases.

Change-Id: I9a2985855a25cdb98ef6fe011ba473587ea7a4c9
2021-02-05 09:55:21 -07:00
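
A sketch of the filtering idea above: PGs that are only merging should not
block the wait. The JSON field names are assumed.

    # List PGs that are genuinely not active yet, ignoring clean+premerge+peered
    # PGs that are simply merging into active PGs.
    blocking=$(ceph pg ls -f json | jq -r '
      .pg_stats[]
      | select(.state | contains("active") | not)
      | select(.state | contains("premerge") | not)
      | .pgid')
    [ -z "${blocking}" ] && echo "no PGs blocking; safe to proceed"
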
Chinasubbareddy Mallavarapu
da289c78cb [CEPH] Uplift from Nautilus to Octopus release
This is to uplift ceph charts from 14.X release to 15.X

Change-Id: I4f7913967185dd52d4301c218450cfad9d0e2b2b
2021-02-03 22:34:53 +00:00
Stephen Taylor
6cf614d7a8 [ceph-client] Fix Helm test check_pgs() check for inactive PGs
The 'ceph pg dump_stuck' command that looks for PGs that are stuck
inactive doesn't include the 'inactive' keyword, so it also finds
PGs that are active that it believes are stuck. This change adds
the 'inactive' keyword to the command so only inactive PGs are
considered.

Change-Id: Id276deb3e5cb8c7e30f5a55140b8dbba52a33900
2021-01-25 17:54:26 +00:00
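
The before and after of the command described above.

    # Previous form: may also report PGs that are active but considered "stuck".
    ceph pg dump_stuck -f json-pretty

    # Fixed form: only PGs that are stuck in an inactive state are reported.
    ceph pg dump_stuck inactive -f json-pretty
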
Parsons, Cliff (cp769u)
970c23acf4 Improvements for ceph-client helm tests
This commit introduces the following helm test improvements for the
ceph-client chart:

1) Reworks the pg_validation function so that it allows some time for
peering PGs to finish peering, but fails if any other critical errors
are seen. The actual PG validation was split out into a function called
check_pgs(), and the pg_validation function manages the looping aspects.

2) The check_cluster_status function now calls pg_validation if the
cluster status is not OK. This is very similar to what happened
before, except that now the logic is not repeated.

Change-Id: I65906380817441bd2ff9ff9cfbf9586b6fdd2ba7
2021-01-18 16:12:33 +00:00
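
A structural sketch of the split described above (not the chart's actual
code): check_pgs() does one validation pass, pg_validation() owns the retry
loop. Field names and timings are assumptions.

    check_pgs() {
      local state
      for state in $(ceph pg ls -f json | jq -r '.pg_stats[].state'); do
        case "${state}" in
          *active*)  ;;               # healthy
          *peering*) echo "retry" ;;  # transient; caller should wait and re-check
          *)         return 1 ;;      # anything else is treated as a critical error
        esac
      done
    }

    pg_validation() {
      local attempt result
      for attempt in $(seq 1 18); do      # roughly a three-minute window
        result=$(check_pgs) || return 1   # critical error: fail immediately
        [ -z "${result}" ] && return 0    # nothing flagged: all PGs are active
        sleep 10
      done
      return 1                            # still peering after the window
    }
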
Frank Ritchie
abf8d1bc6e Run as ceph user and disallow privilege escalation
This PS is to address security best practices concerning running
containers as a non-privileged user and disallowing privilege
escalation. Ceph-client is used for the mgr and mds pods.

Change-Id: Idbd87408c17907eaae9c6398fbc942f203b51515
2021-01-04 12:58:09 -05:00
Chinasubbareddy Mallavarapu
c3f921c916 [ceph-client] fix the logic to disable the autoscaler on pools
This fixes the logic that disables the autoscaler on pools, as it
was not considering newly created pools.

Change-Id: I76fe106918d865b6443453b13e3a4bd6fc35206a
2020-10-16 21:17:07 +00:00
Andrii Ostapenko
1532958c80 Change helm-toolkit dependency version to ">= 0.1.0"
Since we introduced a chart version check in gates, requirements are
not satisfied with a strict check of 0.1.0

Change-Id: I15950b735b4f8566bc0018fe4f4ea9ba729235fc
Signed-off-by: Andrii Ostapenko <andrii.ostapenko@att.com>
2020-09-24 12:19:28 -05:00
Brian Wickersham
11ab577099 [ceph-client] Fix issue with checking if autoscaler should be enabled
This corrects an issue in the create_pool function with checking
if the pg autoscaler should be enabled.

Change-Id: Id9be162fd59cc452477f5cc5c5698de7ae5bb141
2020-09-18 13:19:55 +00:00
Zuul
2bfce96304 Merge "Run chart-testing on all charts" 2020-09-17 14:38:19 +00:00
Mohammed Naser
c7a45f166f Run chart-testing on all charts
Added chart linting in Zuul CI to enhance the stability of the charts.
Fixed some lint errors in the current charts.

Change-Id: I9df4024c7ccf8b3510e665fc07ba0f38871fcbdb
2020-09-11 18:02:38 +03:00
Kabanov, Dmitrii
78137fd4ce [ceph-client] Update queries in wait_for_pgs function
The PS updates the queries in the wait_for_pgs function (pool init
script) to handle cases where PGs have "activating" and "peered"
statuses.

Change-Id: Ie93797fcb72462f61bca3a007f6649ab46ef4f97
2020-09-10 21:54:36 +00:00
Chinasubbareddy Mallavarapu
8adc6216bc [CEPH] Disable ceph pg autoscaler on pools by reading from values
This disables the unintentionally enabled PG autoscaler on pools by
reading the setting from values.

Change-Id: Ib919ae7786ec1d4cbe7a309d28fd6571aa6195de
2020-08-21 16:55:33 -05:00
Chinasubbareddy Mallavarapu
4214e85a77 [CEPH] Add missing ceph cluster name for helm tests
This exports the Ceph cluster name as an environment variable, since
it is referenced by scripts. It also fixes the query that gets
inactive PGs.

Change-Id: I1db5cfbd594c0cc6d54f748f22af5856d9594922
2020-08-14 16:09:19 -05:00
Kabanov, Dmitrii
4557f6fbe8 [ceph] Update queries to filter pgs correctly
The PS updates the queries in the wait_for_pgs function in the
ceph-client and ceph-osd charts, allowing the status of PGs to be
checked more accurately. The output of the "ceph pg ls" command may
contain many PG statuses, such as "active+clean",
"active+undersized+degraded", "active+recovering", "peering", etc.
But along with these there may also be statuses such as
"stale+active+clean". To avoid misinterpreting the status of the
PGs, the filter was changed from "startswith(active+)" to
"contains(active)".
The PS also adds a delay after pod restarts to the post-apply job,
which reduces the number of unnecessary queries to Kubernetes.

Change-Id: I0eff2ce036ad543bf2554bd586c2a2d3e91c052b
2020-08-13 22:45:01 -07:00
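
The filter change described above, shown on a sample state string.

    # "stale+active+clean" does not start with "active+" but does contain "active",
    # so the old filter misclassified it while the new one counts it as active.
    echo '"stale+active+clean"' | jq 'startswith("active+")'   # false (old filter)
    echo '"stale+active+clean"' | jq 'contains("active")'      # true  (new filter)
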
Zuul
c19ee4ab94 Merge "[ceph-client] Fix crush weight comparison in reweight_osds()" 2020-08-13 20:40:46 +00:00
Taylor, Stephen (st053q)
f66f9fe560 [ceph-client] Fix crush weight comparison in reweight_osds()
The recently-added crush weight comparison in reweight_osds() that
checks weights for zero isn't working correctly because the
expected weight is being calculated to two decimal places and then
compared against "0" as a string. This updates the comparison
string to "0.00" to match the calculation.

Change-Id: I29387a597a21180bb7fba974b4daeadf6ffc182d
2020-08-13 12:00:32 -06:00
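
Roughly what the comparison above amounts to in shell, assuming the expected
weight is rendered with two decimal places; the variable names and the
bytes-to-TiB conversion are assumptions.

    # printf "%.2f" renders a zero-sized OSD as "0.00", so that is the string to
    # compare against before reweighting.
    expected_weight=$(awk "BEGIN { printf \"%.2f\", ${OSD_SIZE_BYTES:-0} / (1024 * 1024 * 1024 * 1024) }")
    if [ "${expected_weight}" != "0.00" ]; then
      ceph osd crush reweight "osd.${OSD_ID:?}" "${expected_weight}"
    fi
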
Chinasubbareddy Mallavarapu
64b423cee0 [ceph] Check for osds deployed with zero crush weight
This adds a check in the helm tests for OSDs deployed with zero
crush weight.

Change-Id: Ie8d9c65b33bf7a026a342d1d7e81ec37cb981db3
2020-08-13 14:39:38 +00:00
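
One way such a helm-test check could be expressed (a sketch; the
`ceph osd df` JSON field names are assumed).

    zero_weighted=$(ceph osd df -f json | jq '[.nodes[] | select(.crush_weight == 0)] | length')
    if [ "${zero_weighted}" -gt 0 ]; then
      echo "found ${zero_weighted} OSD(s) with zero crush weight" >&2
      exit 1
    fi
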
Taylor, Stephen (st053q)
f1e9a6ba83 [ceph-client] Refrain from reweighting OSDs to 0
If circumstances are such that the reweight function believes
OSD disks have zero size, refrain from reweighting OSDs to 0.
This can happen if OSDs are deployed with the noup flag set.

Also move the setting and unsetting of flags above this
calculation as an additional precautionary measure.

Change-Id: Ibc23494e0e75cfdd7654f5c0d3b6048b146280f7
2020-08-11 09:48:53 -06:00
Zuul
9ed951aa32 Merge "[Ceph-client] Add check of target osd value" 2020-08-03 21:31:09 +00:00
Zuul
c0b86523a7 Merge "[ceph-client] update logic of inactive pgs check" 2020-08-03 20:12:06 +00:00
Frank Ritchie
5909bcbdef Use hostPID for ceph-mgr deployment
This change is to address a memory leak in the ceph-mgr deployment.
The leak has also been noted in:

https://review.opendev.org/#/c/711085

Without this change memory usage for the active ceph-mgr pod will
steadily increase by roughly 100MiB per hour until all available
memory has been exhausted. Reset messages will also be seen in the
active and standby ceph-mgr pod logs.

Sample messages:

---

0 client.0 ms_handle_reset on v2:10.0.0.226:6808/1
0 client.0 ms_handle_reset on v2:10.0.0.226:6808/1
0 client.0 ms_handle_reset on v2:10.0.0.226:6808/1

---

The root cause of the resets and associated memory leak appears to
be due to multiple ceph pods sharing the same IP address (due to
hostNetwork being true) and PID (due to hostPID being false).
In the messages above the "1" at the end of the line is the PID.
Ceph appears to use the Version:IP:Port/PID (v2:10.0.0.226:6808/1)
tuple as a unique identifier. When hostPID is false conflicts arise.

Setting hostPID to true stops the reset messages and memory leak.

Change-Id: I9821637e75e8f89b59cf39842a6eb7e66518fa2c
2020-08-03 17:35:51 +00:00
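
A quick way to confirm the setting described above on a running deployment;
the namespace and deployment names below are assumptions.

    kubectl -n ceph get deployment ceph-mgr \
      -o jsonpath='{.spec.template.spec.hostPID}{"\n"}'
    # should print: true
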
Kabanov, Dmitrii
f6d6ae051d [ceph-client] update logic of inactive pgs check
The PS updates the wait_for_inactive_pgs function:
- Renamed the function to wait_for_pgs
- Added a query for getting the status of PGs
- All PGs should be in an "active+" state at least three times in a row

Change-Id: Iecc79ebbdfaa74886bca989b23f7741a1c3dca16
2020-08-03 08:42:58 -07:00
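
A sketch of the "three consecutive all-active checks" rule described above;
field names and the polling interval are assumptions.

    consecutive=0
    while [ "${consecutive}" -lt 3 ]; do
      not_active=$(ceph pg ls -f json | jq '[.pg_stats[] | select(.state | contains("active") | not)] | length')
      if [ "${not_active}" -eq 0 ]; then
        consecutive=$((consecutive + 1))   # one more clean pass in a row
      else
        consecutive=0                      # reset on any non-active PG
      fi
      sleep 5
    done
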
Kabanov, Dmitrii
47ce52a5cf [Ceph-client] Add check of target osd value
The PS adds a check of the target OSD value. The expected number of
OSDs should always be greater than or equal to the number of existing
OSDs. If there are more OSDs than expected, the value is not correct.

Change-Id: I117a189a18dbb740585b343db9ac9b596a34b929
2020-08-03 15:38:14 +00:00
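
The invariant above expressed as a check; where the target value comes from
(values.yaml) is an assumption.

    actual_osds=$(ceph osd ls | wc -l)
    if [ "${actual_osds}" -gt "${TARGET_OSDS:?}" ]; then
      echo "cluster has ${actual_osds} OSDs but the target is ${TARGET_OSDS}; the target value looks wrong" >&2
      exit 1
    fi
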
Stephen Taylor
84f1557566 [ceph-client] Fix a helm test issue and disable PG autoscaler
Currently the Ceph helm tests pass when the deployed Ceph cluster
is unhealthy. This change expands the cluster status testing
logic to pass when all PGs are active and fail if any PG is
inactive.

The PG autoscaler is currently causing the deployment to deploy
unhealthy Ceph clusters. This change also disables it. It should
be re-enabled once those issues are resolved.

Change-Id: Iea1ff5006fc00e4570cf67c6af5ef6746a538058
2020-07-31 14:46:10 +00:00
Kabanov, Dmitrii
b736a74e39 [ceph] Add noup flag check to helm tests
The PS adds a noup flag check to the ceph-client and ceph-osd helm
tests. It allows the tests to pass successfully even if the noup flag
is set.

Change-Id: Ida43d83902d26bef3434c47e71959bb2086ad82a
2020-07-22 15:30:51 +00:00
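
One way to detect the flag, as a sketch; the exact handling in the tests is
not described in the commit message.

    # The osd dump output carries a comma-separated "flags" string.
    if ceph osd dump -f json | jq -r '.flags' | grep -q noup; then
      echo "noup flag is set; tolerating OSDs that are down but not yet up"
    fi
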
Kabanov, Dmitrii
ffb4f86796 [ceph-client] Add OSD check before pool creation
The PS adds a check of the OSD count. It ensures that the expected
number of OSDs is present at the moment a pool is created. The
expected number of OSDs is calculated from the target number of OSDs
and the required percentage of OSDs.

Change-Id: Iadf36dbeca61c47d9a9db60cf5335e4e1cb7b74b
2020-07-21 17:54:16 +00:00
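
The calculation described above, roughly; the variable names are assumptions.

    # Require at least (target OSDs * required percent) OSDs before creating pools.
    required_osds=$(awk "BEGIN { printf \"%d\", ${TARGET_OSDS:?} * ${REQUIRED_PERCENT_OF_OSDS:?} / 100 }")
    current_osds=$(ceph osd ls | wc -l)
    if [ "${current_osds}" -lt "${required_osds}" ]; then
      echo "only ${current_osds} of ${required_osds} required OSDs are present; not creating pools yet" >&2
      exit 1
    fi
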
Stephen Taylor
aaf52acc27 [ceph-client] Add back a new version of reweight_osds()
https://review.opendev.org/733193 removed the reweight_osds()
function from the ceph-client and weighted OSDs as they are added
in the ceph-osd chart instead. Since then some situations have
come up where OSDs were already deployed with incorrect weights
and this function is needed in order to weight them properly later
on. This new version calculates an expected weight for each OSD,
compares it to the OSD's actual weight, and makes an adjustment if
necessary.

Change-Id: I58bc16fc03b9234a08847d29aa14067bec05f1f1
2020-07-20 19:42:52 +00:00
Kabanov, Dmitrii
eecf56b8a9 [Ceph-client, ceph-osd] Update helm test
The PS updates the helm test and replaces the "expected_osds"
variable with the number of OSDs available in the cluster
(ceph-client). The PS also updates the logic for calculating the
minimum number of OSDs.

Change-Id: Ic8402d668d672f454f062bed369cac516ed1573e
2020-07-09 15:53:49 +00:00
Andrii Ostapenko
824f168efc Undo octal-values restriction together with corresponding code
Unrestrict the octal-values rule, since the readability benefit of
file modes outweighs possible issues with YAML 1.2 adoption in future
k8s versions. These issues will be addressed if and when they occur.

Also ensure osh-infra is a required project for the lint job, which
matters when running the job against another project.

Change-Id: Ic5e327cf40c4b09c90738baff56419a6cef132da
Signed-off-by: Andrii Ostapenko <andrii.ostapenko@att.com>
2020-07-07 15:42:53 +00:00
Zuul
0a35fd827e Merge "Enable key-duplicates and octal-values yamllint checks" 2020-06-18 04:49:03 +00:00
Zuul
6217a5eda3 Merge "[ceph-osd, ceph-client] Weight OSDs as they are added" 2020-06-18 02:22:53 +00:00
Stephen Taylor
59b825ae48 [ceph-osd, ceph-client] Weight OSDs as they are added
Currently OSDs are added by the ceph-osd chart with zero weight
and they get reweighted to proper weights in the ceph-client chart
after all OSDs have been deployed. This causes a problem when a
deployment is partially completed and additional OSDs are added
later. In this case the ceph-client chart has already run and the
new OSDs don't ever get weighted correctly. This change weights
OSDs properly as they are deployed instead. As noted in the
script, the noin flag may be set during the deployment to prevent
rebalancing as OSDs are added if necessary.

Added the ability to set and unset Ceph cluster flags in the
ceph-client chart.

Change-Id: Ic9a3d8d5625af49b093976a855dd66e5705d2c29
2020-06-17 21:49:39 +00:00
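
An illustration of the deploy-time weighting and the optional noin flag
mentioned above (a sketch; variable names, the TiB conversion, and the CRUSH
location arguments are assumptions).

    # Optionally prevent rebalancing while OSDs are still being added.
    ceph osd set noin

    # Give the OSD its real weight as it is registered rather than weighting later.
    weight=$(awk "BEGIN { printf \"%.2f\", ${OSD_SIZE_BYTES:?} / (1024 * 1024 * 1024 * 1024) }")
    ceph osd crush create-or-move "osd.${OSD_ID:?}" "${weight}" root=default host="$(hostname -s)"

    # Once all OSDs are deployed, allow them to take data again.
    ceph osd unset noin
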
Andrii Ostapenko
83e27e600c Enable key-duplicates and octal-values yamllint checks
With corresponding code changes.

Change-Id: I11cde8971b3effbb6eb2b69a7d31ecf12140434e
2020-06-17 13:14:30 -05:00
Andrii Ostapenko
dfb32ccf60 Enable yamllint rules for templates
- braces
- brackets
- colons
- commas
- comments
- comments-indentation
- document-start
- hyphens
- indentation

With corresponding code changes.

Also idempotency fix for lint script.

Change-Id: Ibe5281cbb4ad7970e92f3d1f921abb1efc89dc3b
2020-06-17 13:13:53 -05:00
Andrii Ostapenko
8f24a74bc7 Introduces templates linting
This commit rewrites the lint job to make template linting available.
Currently yamllint is run in warning mode against all templates
rendered with default values. Detected duplicates and issues will be
addressed in subsequent commits.

Also, all y*ml files are added for linting and corresponding code
changes are made. For non-templates, warning rules are disabled to
improve readability. Chart and requirements yamls are also modified
in the name of consistency.

Change-Id: Ife6727c5721a00c65902340d95b7edb0a9c77365
2020-06-11 23:29:42 -05:00
Zuul
08ca4eb8d9 Merge "ceph: Add metadata labels to CronJob" 2020-06-02 19:37:39 +00:00
Andrii Ostapenko
731a6b4cfa Enable yamllint checks
- document-end
- document-start
- empty-lines
- hyphens
- indentation
- key-duplicates
- new-line-at-end-of-file
- new-lines
- octal-values

with corresponding code adjustment.

Change-Id: I92d6aa20df82aa0fe198f8ccd535cfcaf613f43a
2020-05-29 19:49:05 +00:00
Kabanov, Dmitrii
46930fcd06 [Ceph] Upgrade Ceph from 14.2.8 to 14.2.9 version
The PS upgrades Ceph to version 14.2.9.

Change-Id: I72a2e39a7b4294ac8fd42b1dbc78579c2c0ae791
2020-05-28 15:46:47 +00:00
Tin Lam
d95259936f Revert "[ceph-osd, ceph-client] Weight OSDs as they are added"
This patch breaks the cinder helm test in the OSH cinder jobs,
blocking the gate. Proposing to revert it to unblock the jobs.

This reverts commit f59cb11932.

Change-Id: I73012ec6f4c3d751131f1c26eea9266f7abc1809
2020-05-25 21:09:15 +00:00
Steve Taylor
f59cb11932 [ceph-osd, ceph-client] Weight OSDs as they are added
Currently OSDs are added by the ceph-osd chart with zero weight
and they get reweighted to proper weights in the ceph-client chart
after all OSDs have been deployed. This causes a problem when a
deployment is partially completed and additional OSDs are added
later. In this case the ceph-client chart has already run and the
new OSDs don't ever get weighted correctly. This change weights
OSDs properly as they are deployed instead. As noted in the
script, the noin flag may be set during the deployment to prevent
rebalancing as OSDs are added if necessary.

Added the ability to set and unset Ceph cluster flags in the
ceph-client chart.

Change-Id: Iac50352c857d874f3956776c733d09e0034a0285
2020-05-22 09:21:44 -06:00