Merge pull request #128346 from dims/update-to-latest-advisor-for-1.32

Update to latest cadvisor - `v0.51.0`
Author: Kubernetes Prow Robot
Date: 2024-10-30 23:45:26 +00:00
Committed by: GitHub
106 changed files with 13 additions and 20577 deletions


@@ -1,205 +0,0 @@
= vendor/github.com/checkpoint-restore/go-criu/v5 licensed under: =
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
= vendor/github.com/checkpoint-restore/go-criu/v5/LICENSE e3fc50a88d0a364313df4b21ef20c29e


@@ -1,195 +0,0 @@
= vendor/github.com/containerd/console licensed under: =
Apache License
Version 2.0, January 2004
https://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Copyright The containerd Authors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
= vendor/github.com/containerd/console/LICENSE 1269f40c0d099c21a871163984590d89


@@ -1,26 +0,0 @@
= vendor/github.com/seccomp/libseccomp-golang licensed under: =
Copyright (c) 2015 Matthew Heon <mheon@redhat.com>
Copyright (c) 2015 Paul Moore <pmoore@redhat.com>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
= vendor/github.com/seccomp/libseccomp-golang/LICENSE 343b433e752e8b44a543cdf61f14b628


@@ -1,28 +0,0 @@
= vendor/github.com/syndtr/gocapability licensed under: =
Copyright 2013 Suryandaru Triandana <syndtr@gmail.com>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
= vendor/github.com/syndtr/gocapability/LICENSE a7304f5073e7be4ba7bffabbf9f2bbca

go.mod (5 changed lines)

@@ -34,7 +34,7 @@ require (
github.com/gogo/protobuf v1.3.2
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da
github.com/golang/protobuf v1.5.4
github.com/google/cadvisor v0.50.0
github.com/google/cadvisor v0.51.0
github.com/google/cel-go v0.21.0
github.com/google/gnostic-models v0.6.8
github.com/google/go-cmp v0.6.0
@@ -134,9 +134,7 @@ require (
github.com/cenkalti/backoff/v4 v4.3.0 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/chai2010/gettext-go v1.0.2 // indirect
github.com/checkpoint-restore/go-criu/v5 v5.3.0 // indirect
github.com/cilium/ebpf v0.11.0 // indirect
github.com/containerd/console v1.0.4 // indirect
github.com/containerd/containerd/api v1.7.19 // indirect
github.com/containerd/errdefs v0.1.0 // indirect
github.com/containerd/log v0.1.0 // indirect
@@ -196,7 +194,6 @@ require (
github.com/soheilhy/cmux v0.1.5 // indirect
github.com/stoewer/go-strcase v1.3.0 // indirect
github.com/stretchr/objx v0.5.2 // indirect
github.com/syndtr/gocapability v0.0.0-20200815063812-42c35b437635 // indirect
github.com/tmc/grpc-websocket-proxy v0.0.0-20220101234140-673ab2c3ae75 // indirect
github.com/x448/float16 v0.8.4 // indirect
github.com/xiang90/probing v0.0.0-20221125231312-a49e3df8f510 // indirect

go.sum (15 changed lines)

@@ -169,7 +169,6 @@ github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UF
github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
github.com/chai2010/gettext-go v1.0.2 h1:1Lwwip6Q2QGsAdl/ZKPCwTe9fe0CjlUbqj5bFNSjIRk=
github.com/chai2010/gettext-go v1.0.2/go.mod h1:y+wnP2cHYaVj19NZhYKAwEMH2CI1gNHeQQ+5AjwawxA=
github.com/checkpoint-restore/go-criu/v5 v5.3.0 h1:wpFFOoomK3389ue2lAb0Boag6XPht5QYpipxmSNL4d8=
github.com/checkpoint-restore/go-criu/v5 v5.3.0/go.mod h1:E/eQpaFtUKGOOSEBZgmKAcn+zUUwWxqcaKZlF54wK8E=
github.com/chzyer/readline v1.5.1/go.mod h1:Eh+b79XXUwfKfcPLepksvw2tcLE/Ct21YObkaSkeBlk=
github.com/cilium/ebpf v0.11.0 h1:V8gS/bTCCjX9uUnkUFUpPsksM8n1lXBAvHcpiFk1X2Y=
@@ -183,8 +182,7 @@ github.com/container-storage-interface/spec v1.9.0 h1:zKtX4STsq31Knz3gciCYCi1SXt
github.com/container-storage-interface/spec v1.9.0/go.mod h1:ZfDu+3ZRyeVqxZM0Ds19MVLkN2d1XJ5MAfi1L3VjlT0=
github.com/containerd/cgroups v1.1.0 h1:v8rEWFl6EoqHB+swVNjVoCJE8o3jX7e8nqBGPLaDFBM=
github.com/containerd/cgroups v1.1.0/go.mod h1:6ppBcbh/NOOUU+dMKrykgaBnK9lCIBxHqJDGwsa1mIw=
github.com/containerd/console v1.0.4 h1:F2g4+oChYvBTsASRTz8NP6iIAi97J3TtSAsLbIFn4ro=
github.com/containerd/console v1.0.4/go.mod h1:YynlIjWYF8myEu6sdkwKIvGQq+cOckRm6So2avqoYAk=
github.com/containerd/console v1.0.3/go.mod h1:7LqA/THxQ86k76b8c/EMSiaJ3h1eZkMkXar0TQ1gf3U=
github.com/containerd/containerd/api v1.7.19 h1:VWbJL+8Ap4Ju2mx9c9qS1uFSB1OVYr5JJrW2yT5vFoA=
github.com/containerd/containerd/api v1.7.19/go.mod h1:fwGavl3LNwAV5ilJ0sbrABL44AQxmNjDRcwheXDb6Ig=
github.com/containerd/errdefs v0.1.0 h1:m0wCRBiu1WJT/Fr+iOoQHMQS/eP5myQ8lCv4Dz5ZURM=
@@ -299,7 +297,6 @@ github.com/golang/protobuf v1.4.0-rc.4.0.20200313231945-b860323f09d0/go.mod h1:W
github.com/golang/protobuf v1.4.0/go.mod h1:jodUvKwWbYaEsadDk5Fwe5c77LiNKVO9IDvqG2KuDX0=
github.com/golang/protobuf v1.4.1/go.mod h1:U8fpvMrcmy5pZrNK1lt4xCsGvpyWQ/VVv6QDs8UjoX8=
github.com/golang/protobuf v1.4.3/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI=
github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk=
github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=
github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=
github.com/golangplus/bytes v0.0.0-20160111154220-45c989fe5450/go.mod h1:Bk6SMAONeMXrxql8uvOKuAZSu8aM5RUGv+1C6IJaEho=
@@ -309,8 +306,8 @@ github.com/golangplus/testing v1.0.0 h1:+ZeeiKZENNOMkTTELoSySazi+XaEhVO0mb+eanrS
github.com/golangplus/testing v1.0.0/go.mod h1:ZDreixUV3YzhoVraIDyOzHrr76p6NUh6k/pPg/Q3gYA=
github.com/google/btree v1.0.1 h1:gK4Kx5IaGY9CD5sPJ36FHiBJ6ZXl0kilRiiCj+jdYp4=
github.com/google/btree v1.0.1/go.mod h1:xXMiIv4Fb/0kKde4SpL7qlzvu5cMJDRkFDxJfI9uaxA=
github.com/google/cadvisor v0.50.0 h1:7w/hKIbJKBWqQsRTy+Hpj2vj+fnxrLXcEXFy+LW0Bsg=
github.com/google/cadvisor v0.50.0/go.mod h1:VxCDwZalpFyENvmfabFqaIGsqNKLtDzE62a19rfVTB8=
github.com/google/cadvisor v0.51.0 h1:BspqSPdZoLKrnvuZNOvM/KiJ/A+RdixwagN20n+2H8k=
github.com/google/cadvisor v0.51.0/go.mod h1:czGE/c/P/i0QFpVNKTFrIEzord9Y10YfpwuaSWXELc0=
github.com/google/cel-go v0.21.0 h1:cl6uW/gxN+Hy50tNYvI691+sXxioCnstFzLp2WO4GCI=
github.com/google/cel-go v0.21.0/go.mod h1:rHUlWCcBKgyEk+eV03RPdZUekPp6YcJwV0FxuUksYxc=
github.com/google/gnostic-models v0.6.8 h1:yo/ABAfM5IMRsS1VnXjTBvUb61tFIHozhlYvRgGre9I=
@@ -321,7 +318,6 @@ github.com/google/go-cmp v0.3.1/go.mod h1:8QqcDgzrUqlUb/G2PQTWiueGozuR1884gddMyw
github.com/google/go-cmp v0.4.0/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.0/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.3/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
github.com/google/go-cmp v0.5.9/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=
github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=
@@ -492,7 +488,6 @@ github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg=
github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/syndtr/gocapability v0.0.0-20200815063812-42c35b437635 h1:kdXcSzyDtseVEc4yCz2qF8ZrQvIDBJLl4S1c3GCXmoI=
github.com/syndtr/gocapability v0.0.0-20200815063812-42c35b437635/go.mod h1:hkRG7XYTFWNJGYcbNJQlaLq0fg1yr4J4t/NcTQtrfww=
github.com/tmc/grpc-websocket-proxy v0.0.0-20220101234140-673ab2c3ae75 h1:6fotK7otjonDflCTK0BCfls4SPy3NcCVb5dqqmbRknE=
github.com/tmc/grpc-websocket-proxy v0.0.0-20220101234140-673ab2c3ae75/go.mod h1:KO6IkyS8Y3j8OdNO85qEYBsRPuteD+YciPomcXdrMnk=
@@ -614,12 +609,10 @@ golang.org/x/sys v0.0.0-20191026070338-33540a1f6037/go.mod h1:h1NjWce9XRLGQEsW7w
golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210124154548-22da62e12c0c/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210510120138-977fb7262007/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210616094352-59db8d763f22/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.1.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.2.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.10.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.26.0 h1:KHjCJyddX0LoSTb3J+vWpupP9p0oznkqVk/IfjymZbo=
@@ -681,8 +674,6 @@ google.golang.org/protobuf v1.22.0/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2
google.golang.org/protobuf v1.23.0/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU=
google.golang.org/protobuf v1.23.1-0.20200526195155-81db48ad09cc/go.mod h1:EGpADcykh3NcUnDUJcl1+ZksZNG86OlYog2l/sGQquU=
google.golang.org/protobuf v1.25.0/go.mod h1:9JNX74DMeImyA3h4bdi1ymwjUzf21/xIlbajtzgsN7c=
google.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=
google.golang.org/protobuf v1.27.1/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc=
google.golang.org/protobuf v1.34.2 h1:6xV6lTsCfpGD21XK49h7MhtcApnLqkfYgPcdHftf6hg=
google.golang.org/protobuf v1.34.2/go.mod h1:qYOHts0dSfpeUzUFpOMr/WGzszTmLH+DiWniOlNbLDw=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=


@@ -134,10 +134,12 @@ github.com/aws/aws-sdk-go-v2/service/ssooidc v1.26.2 h1:ORnrOK0C4WmYV/uYt3koHEWB
github.com/aws/aws-sdk-go-v2/service/sts v1.30.1 h1:+woJ607dllHJQtsnJLi52ycuqHMwlW+Wqm2Ppsfp4nQ=
github.com/aws/smithy-go v1.20.3 h1:ryHwveWzPV5BIof6fyDvor6V3iUL7nTfiTKXHiW05nE=
github.com/census-instrumentation/opencensus-proto v0.4.1 h1:iKLQ0xPNFxR/2hzXZMrBo8f1j86j5WHzznCCQxV/b8g=
github.com/checkpoint-restore/go-criu/v5 v5.3.0 h1:wpFFOoomK3389ue2lAb0Boag6XPht5QYpipxmSNL4d8=
github.com/chzyer/readline v1.5.1 h1:upd/6fQk4src78LMRzh5vItIt361/o4uq553V8B5sGI=
github.com/client9/misspell v0.3.4 h1:ta993UF76GwbvJcIo3Y68y/M3WxlpEHPWIGDkJYwzJI=
github.com/cncf/udpa/go v0.0.0-20191209042840-269d4d468f6f h1:WBZRG4aNOuI15bLRrCgN8fCq8E5Xuty6jGbmSNEvSsU=
github.com/cncf/xds/go v0.0.0-20240423153145-555b57ec207b h1:ga8SEFjZ60pxLcmhnThWgvH2wg8376yUJmPhEH4H3kw=
github.com/containerd/console v1.0.3 h1:lIr7SlA5PxZyMV30bDW0MGbiOPXwc63yRuCP0ARubLw=
github.com/envoyproxy/go-control-plane v0.12.0 h1:4X+VP1GHd1Mhj6IB5mMeGbLCleqxjletLK6K0rbxyZI=
github.com/envoyproxy/protoc-gen-validate v1.0.4 h1:gVPz/FMfvh57HdSJQyvBtF00j8JU4zdyUgIUNhlgg0A=
github.com/flynn/go-shlex v0.0.0-20150515145356-3f9db97f8568 h1:BHsljHzVlRcyQhjrss6TZTdY2VfCqZPbv5k3iBFa2ZQ=
@@ -165,6 +167,7 @@ github.com/opentracing/opentracing-go v1.1.0 h1:pWlfV3Bxv7k65HYwkikxat0+s3pV4bsq
github.com/pkg/diff v0.0.0-20210226163009-20ebb0f2a09e h1:aoZm08cpOy4WuID//EZDgcC4zIxODThtZNPirFr42+A=
github.com/rogpeppe/fastuuid v1.2.0 h1:Ppwyp6VYCF1nvBTXL3trRso7mXMlRrw9ooo375wvi2s=
github.com/shurcooL/sanitized_anchor_name v1.0.0 h1:PdmoCO6wvbs+7yrJyMORt4/BmY5IYyJwS/kOiWx8mHo=
github.com/syndtr/gocapability v0.0.0-20200815063812-42c35b437635 h1:kdXcSzyDtseVEc4yCz2qF8ZrQvIDBJLl4S1c3GCXmoI=
github.com/urfave/cli v1.22.1 h1:+mkCCcOFKPnCmVYVcURKps1Xe+3zP90gSYGNfRkjoIY=
github.com/xhit/go-str2duration/v2 v2.1.0 h1:lxklc02Drh6ynqX+DdPyp5pCKLUQpRT8bp8Ydu2Bstc=
github.com/yuin/goldmark v1.4.13 h1:fVcFKWvrslecOb/tg+Cc05dkeYx540o0FuFt3nUVDoE=


@@ -1,6 +0,0 @@
test/test
test/test.coverage
test/piggie/piggie
test/phaul/phaul
test/phaul/phaul.coverage
image


@@ -1,12 +0,0 @@
run:
  skip_dirs:
    - rpc
    - stats

linters:
  disable-all: false
  presets:
    - bugs
    - performance
    - unused
    - format


@@ -1,201 +0,0 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


@@ -1,107 +0,0 @@
SHELL = /bin/bash
GO ?= go
CC ?= gcc
COVERAGE_PATH ?= $(shell pwd)/.coverage
CRIU_FEATURE_MEM_TRACK = $(shell if criu check --feature mem_dirty_track > /dev/null; then echo 1; else echo 0; fi)
CRIU_FEATURE_LAZY_PAGES = $(shell if criu check --feature uffd-noncoop > /dev/null; then echo 1; else echo 0; fi)
CRIU_FEATURE_PIDFD_STORE = $(shell if criu check --feature pidfd_store > /dev/null; then echo 1; else echo 0; fi)
export CRIU_FEATURE_MEM_TRACK CRIU_FEATURE_LAZY_PAGES CRIU_FEATURE_PIDFD_STORE

all: build test phaul-test

lint:
	golangci-lint run ./...

build:
	$(GO) build -v ./...

TEST_PAYLOAD := test/piggie/piggie
TEST_BINARIES := test/test $(TEST_PAYLOAD) test/phaul/phaul
COVERAGE_BINARIES := test/test.coverage test/phaul/phaul.coverage

test-bin: $(TEST_BINARIES)

test/piggie/piggie: test/piggie/piggie.c
	$(CC) $^ -o $@

test/test: test/main.go
	$(GO) build -v -o $@ $^

test: $(TEST_BINARIES)
	mkdir -p image
	PID=$$(test/piggie/piggie) && { \
	test/test dump $$PID image && \
	test/test restore image; \
	pkill -9 piggie; \
	}
	rm -rf image

test/phaul/phaul: test/phaul/main.go
	$(GO) build -v -o $@ $^

phaul-test: $(TEST_BINARIES)
	rm -rf image
	PID=$$(test/piggie/piggie) && { \
	test/phaul/phaul $$PID; \
	pkill -9 piggie; \
	}

test/test.coverage: test/*.go
	$(GO) test \
		-covermode=count \
		-coverpkg=./... \
		-mod=vendor \
		-tags coverage \
		-buildmode=pie -c -o $@ $^

test/phaul/phaul.coverage: test/phaul/*.go
	$(GO) test \
		-covermode=count \
		-coverpkg=./... \
		-mod=vendor \
		-tags coverage \
		-buildmode=pie -c -o $@ $^

coverage: $(COVERAGE_BINARIES) $(TEST_PAYLOAD)
	mkdir -p $(COVERAGE_PATH)
	mkdir -p image
	PID=$$(test/piggie/piggie) && { \
	test/test.coverage -test.coverprofile=coverprofile.integration.$$RANDOM -test.outputdir=${COVERAGE_PATH} COVERAGE dump $$PID image && \
	test/test.coverage -test.coverprofile=coverprofile.integration.$$RANDOM -test.outputdir=${COVERAGE_PATH} COVERAGE restore image; \
	pkill -9 piggie; \
	}
	rm -rf image
	PID=$$(test/piggie/piggie) && { \
	test/phaul/phaul.coverage -test.coverprofile=coverprofile.integration.$$RANDOM -test.outputdir=${COVERAGE_PATH} COVERAGE $$PID; \
	pkill -9 piggie; \
	}
	echo "mode: set" > .coverage/coverage.out && cat .coverage/coverprofile* | \
	grep -v mode: | sort -r | awk '{if($$1 != last) {print $$0;last=$$1}}' >> .coverage/coverage.out

clean:
	@rm -f $(TEST_BINARIES) $(COVERAGE_BINARIES) codecov
	@rm -rf image $(COVERAGE_PATH)

rpc/rpc.proto:
	curl -sSL https://raw.githubusercontent.com/checkpoint-restore/criu/master/images/rpc.proto -o $@

stats/stats.proto:
	curl -sSL https://raw.githubusercontent.com/checkpoint-restore/criu/master/images/stats.proto -o $@

rpc/rpc.pb.go: rpc/rpc.proto
	protoc --go_out=. --go_opt=M$^=rpc/ $^

stats/stats.pb.go: stats/stats.proto
	protoc --go_out=. --go_opt=M$^=stats/ $^

vendor:
	GO111MODULE=on $(GO) mod tidy
	GO111MODULE=on $(GO) mod vendor
	GO111MODULE=on $(GO) mod verify

codecov:
	curl -Os https://uploader.codecov.io/latest/linux/codecov
	chmod +x codecov
	./codecov -f '.coverage/coverage.out'

.PHONY: build test phaul-test test-bin clean lint vendor coverage codecov


@@ -1,96 +0,0 @@
[![test](https://github.com/checkpoint-restore/go-criu/workflows/ci/badge.svg?branch=master)](https://github.com/checkpoint-restore/go-criu/actions?query=workflow%3Aci)
[![verify](https://github.com/checkpoint-restore/go-criu/workflows/verify/badge.svg?branch=master)](https://github.com/checkpoint-restore/go-criu/actions?query=workflow%3Averify)
[![Go Reference](https://pkg.go.dev/badge/github.com/checkpoint-restore/go-criu.svg)](https://pkg.go.dev/github.com/checkpoint-restore/go-criu)
## go-criu -- Go bindings for CRIU
This repository provides Go bindings for [CRIU](https://criu.org/). The code is based on the Go-based PHaul
implementation from the CRIU repository. For easier inclusion into other Go projects, the
CRIU Go bindings have been moved to this repository.
The Go bindings provide an easy way to use the CRIU RPC calls from Go without the need
to set up all the infrastructure to make the actual RPC connection to CRIU.
The following example would print the version of CRIU:
```go
package main

import (
	"log"

	"github.com/checkpoint-restore/go-criu/v5"
)

func main() {
	c := criu.MakeCriu()
	version, err := c.GetCriuVersion()
	if err != nil {
		log.Fatalln(err)
	}
	log.Println(version)
}
```
or to just check if at least a certain CRIU version is installed:
```go
c := criu.MakeCriu()
result, err := c.IsCriuAtLeast(31100)
```
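A self-contained variant of that snippet, as a hedged sketch: the error handling and log messages here are illustrative additions, and `31100` follows the major*10000 + minor*100 + sublevel encoding used by `GetCriuVersion`, i.e. CRIU 3.11.0.
```go
package main

import (
	"log"

	"github.com/checkpoint-restore/go-criu/v5"
)

func main() {
	c := criu.MakeCriu()

	// 31100 encodes CRIU 3.11.0 (major*10000 + minor*100 + sublevel).
	ok, err := c.IsCriuAtLeast(31100)
	if err != nil {
		log.Fatalln(err)
	}
	if !ok {
		log.Fatalln("CRIU >= 3.11.0 is required")
	}
	log.Println("CRIU version is new enough")
}
```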
## Releases
The first go-criu release was 3.11 based on CRIU 3.11. The initial plan
was to follow CRIU so that go-criu would carry the same version number as
CRIU.
As go-criu is imported in other projects and as Go modules are expected
to follow Semantic Versioning, go-criu will also follow Semantic Versioning
starting with the 4.0.0 release.
The following table shows the relation between go-criu and criu versions:
| Major version | Latest release | CRIU version |
| -------------- | -------------- | ------------ |
| v5             | 5.2.0         | 3.16         |
| v5             | 5.0.0         | 3.15         |
| v4             | 4.1.0         | 3.14         |
## How to contribute
While bug fixes can first be identified via an "issue", that is not required.
It's ok to just open up a PR with the fix, but make sure you include the same
information you would have included in an issue - like how to reproduce it.
PRs for new features should include some background on what use cases the
new code is trying to address. When possible and when it makes sense, try to
break up larger PRs into smaller ones - it's easier to review smaller
code changes. But only if those smaller ones make sense as stand-alone PRs.
Regardless of the type of PR, all PRs should include:
* well documented code changes
* additional testcases. Ideally, they should fail w/o your code change applied
* documentation changes
Squash your commits into logical pieces of work that might want to be reviewed
separate from the rest of the PRs. Ideally, each commit should implement a
single idea, and the PR branch should pass the tests at every commit. GitHub
makes it easy to review the cumulative effect of many commits; so, when in
doubt, use smaller commits.
PRs that fix issues should include a reference like `Closes #XXXX` in the
commit message so that GitHub will automatically close the referenced issue
when the PR is merged.
Contributors must assert that they are in compliance with the [Developer
Certificate of Origin 1.1](http://developercertificate.org/). This is achieved
by adding a "Signed-off-by" line containing the contributor's name and e-mail
to every commit message. Your signature certifies that you wrote the patch or
otherwise have the right to pass it on as an open-source patch.
### License and copyright
Unless mentioned otherwise in a specific file's header, all code in
this project is released under the Apache 2.0 license.
The author of a change remains the copyright holder of their code
(no copyright assignment). The list of authors and contributors can be
retrieved from the git commit history and in some cases, the file headers.


@@ -1,45 +0,0 @@
package criu

import (
	"fmt"

	"github.com/checkpoint-restore/go-criu/v5/rpc"
)

// Feature checking in go-criu is based on the libcriu feature checking function.
// Feature checking allows the user to check if CRIU supports
// certain features. There are CRIU features which do not depend
// on the version of CRIU but on kernel features or architecture.
//
// One example is memory tracking. Memory tracking can be disabled
// in the kernel or there are architectures which do not support
// it (aarch64 for example). By using the feature check a libcriu
// user can easily query CRIU if a certain feature is available.
//
// The features which should be checked can be marked in the
// structure 'struct criu_feature_check'. Each structure member
// that is set to true will result in CRIU checking for the
// availability of that feature in the current combination of
// CRIU/kernel/architecture.
//
// Available features will be set to true when the function
// returns successfully. Missing features will be set to false.
func (c *Criu) FeatureCheck(features *rpc.CriuFeatures) (*rpc.CriuFeatures, error) {
	resp, err := c.doSwrkWithResp(
		rpc.CriuReqType_FEATURE_CHECK,
		nil,
		nil,
		features,
	)
	if err != nil {
		return nil, err
	}

	if resp.GetType() != rpc.CriuReqType_FEATURE_CHECK {
		return nil, fmt.Errorf("Unexpected CRIU RPC response")
	}

	return features, nil
}
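As a rough, non-authoritative sketch of how `FeatureCheck` might be called: the `MemTrack`/`LazyPages` fields on `rpc.CriuFeatures` are assumed from CRIU's `rpc.proto` (they are not shown in this diff), so treat the field names and getters as illustrative.
```go
package main

import (
	"log"

	"github.com/checkpoint-restore/go-criu/v5"
	"github.com/checkpoint-restore/go-criu/v5/rpc"
	"google.golang.org/protobuf/proto"
)

func main() {
	c := criu.MakeCriu()

	// Field names are assumed mappings of mem_track/lazy_pages from
	// CRIU's rpc.proto; they are illustrative, not confirmed by this diff.
	feat, err := c.FeatureCheck(&rpc.CriuFeatures{
		MemTrack:  proto.Bool(true),
		LazyPages: proto.Bool(true),
	})
	if err != nil {
		log.Fatalln(err)
	}
	if feat.GetMemTrack() {
		log.Println("memory tracking is available")
	}
	if feat.GetLazyPages() {
		log.Println("lazy pages (userfaultfd) are available")
	}
}
```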


@@ -1,264 +0,0 @@
package criu
import (
"errors"
"fmt"
"os"
"os/exec"
"strconv"
"syscall"
"github.com/checkpoint-restore/go-criu/v5/rpc"
"google.golang.org/protobuf/proto"
)
// Criu struct
type Criu struct {
swrkCmd *exec.Cmd
swrkSk *os.File
swrkPath string
}
// MakeCriu returns the Criu object required for most operations
func MakeCriu() *Criu {
return &Criu{
swrkPath: "criu",
}
}
// SetCriuPath allows setting the path to the CRIU binary
// if it is in a non standard location
func (c *Criu) SetCriuPath(path string) {
c.swrkPath = path
}
// Prepare sets up everything for the RPC communication to CRIU
func (c *Criu) Prepare() error {
fds, err := syscall.Socketpair(syscall.AF_LOCAL, syscall.SOCK_SEQPACKET, 0)
if err != nil {
return err
}
cln := os.NewFile(uintptr(fds[0]), "criu-xprt-cln")
syscall.CloseOnExec(fds[0])
srv := os.NewFile(uintptr(fds[1]), "criu-xprt-srv")
defer srv.Close()
args := []string{"swrk", strconv.Itoa(fds[1])}
// #nosec G204
cmd := exec.Command(c.swrkPath, args...)
err = cmd.Start()
if err != nil {
cln.Close()
return err
}
c.swrkCmd = cmd
c.swrkSk = cln
return nil
}
// Cleanup cleans up
func (c *Criu) Cleanup() {
if c.swrkCmd != nil {
c.swrkSk.Close()
c.swrkSk = nil
_ = c.swrkCmd.Wait()
c.swrkCmd = nil
}
}
func (c *Criu) sendAndRecv(reqB []byte) ([]byte, int, error) {
cln := c.swrkSk
_, err := cln.Write(reqB)
if err != nil {
return nil, 0, err
}
respB := make([]byte, 2*4096)
n, err := cln.Read(respB)
if err != nil {
return nil, 0, err
}
return respB, n, nil
}
func (c *Criu) doSwrk(reqType rpc.CriuReqType, opts *rpc.CriuOpts, nfy Notify) error {
resp, err := c.doSwrkWithResp(reqType, opts, nfy, nil)
if err != nil {
return err
}
respType := resp.GetType()
if respType != reqType {
return errors.New("unexpected CRIU RPC response")
}
return nil
}
func (c *Criu) doSwrkWithResp(reqType rpc.CriuReqType, opts *rpc.CriuOpts, nfy Notify, features *rpc.CriuFeatures) (*rpc.CriuResp, error) {
var resp *rpc.CriuResp
req := rpc.CriuReq{
Type: &reqType,
Opts: opts,
}
if nfy != nil {
opts.NotifyScripts = proto.Bool(true)
}
if features != nil {
req.Features = features
}
if c.swrkCmd == nil {
err := c.Prepare()
if err != nil {
return nil, err
}
defer c.Cleanup()
}
for {
reqB, err := proto.Marshal(&req)
if err != nil {
return nil, err
}
respB, respS, err := c.sendAndRecv(reqB)
if err != nil {
return nil, err
}
resp = &rpc.CriuResp{}
err = proto.Unmarshal(respB[:respS], resp)
if err != nil {
return nil, err
}
if !resp.GetSuccess() {
return resp, fmt.Errorf("operation failed (msg:%s err:%d)",
resp.GetCrErrmsg(), resp.GetCrErrno())
}
respType := resp.GetType()
if respType != rpc.CriuReqType_NOTIFY {
break
}
if nfy == nil {
return resp, errors.New("unexpected notify")
}
notify := resp.GetNotify()
switch notify.GetScript() {
case "pre-dump":
err = nfy.PreDump()
case "post-dump":
err = nfy.PostDump()
case "pre-restore":
err = nfy.PreRestore()
case "post-restore":
err = nfy.PostRestore(notify.GetPid())
case "network-lock":
err = nfy.NetworkLock()
case "network-unlock":
err = nfy.NetworkUnlock()
case "setup-namespaces":
err = nfy.SetupNamespaces(notify.GetPid())
case "post-setup-namespaces":
err = nfy.PostSetupNamespaces()
case "post-resume":
err = nfy.PostResume()
default:
err = nil
}
if err != nil {
return resp, err
}
req = rpc.CriuReq{
Type: &respType,
NotifySuccess: proto.Bool(true),
}
}
return resp, nil
}
// Dump dumps a process
func (c *Criu) Dump(opts *rpc.CriuOpts, nfy Notify) error {
return c.doSwrk(rpc.CriuReqType_DUMP, opts, nfy)
}
// Restore restores a process
func (c *Criu) Restore(opts *rpc.CriuOpts, nfy Notify) error {
return c.doSwrk(rpc.CriuReqType_RESTORE, opts, nfy)
}
// PreDump does a pre-dump
func (c *Criu) PreDump(opts *rpc.CriuOpts, nfy Notify) error {
return c.doSwrk(rpc.CriuReqType_PRE_DUMP, opts, nfy)
}
// StartPageServer starts the page server
func (c *Criu) StartPageServer(opts *rpc.CriuOpts) error {
return c.doSwrk(rpc.CriuReqType_PAGE_SERVER, opts, nil)
}
// StartPageServerChld starts the page server and returns PID and port
func (c *Criu) StartPageServerChld(opts *rpc.CriuOpts) (int, int, error) {
resp, err := c.doSwrkWithResp(rpc.CriuReqType_PAGE_SERVER_CHLD, opts, nil, nil)
if err != nil {
return 0, 0, err
}
return int(resp.Ps.GetPid()), int(resp.Ps.GetPort()), nil
}
// GetCriuVersion executes the VERSION RPC call and returns the version
// as an integer. Major * 10000 + Minor * 100 + SubLevel
func (c *Criu) GetCriuVersion() (int, error) {
resp, err := c.doSwrkWithResp(rpc.CriuReqType_VERSION, nil, nil, nil)
if err != nil {
return 0, err
}
if resp.GetType() != rpc.CriuReqType_VERSION {
return 0, fmt.Errorf("Unexpected CRIU RPC response")
}
version := int(*resp.GetVersion().MajorNumber) * 10000
version += int(*resp.GetVersion().MinorNumber) * 100
if resp.GetVersion().Sublevel != nil {
version += int(*resp.GetVersion().Sublevel)
}
if resp.GetVersion().Gitid != nil {
// taken from runc: if it is a git release -> increase minor by 1
version -= (version % 100)
version += 100
}
return version, nil
}
// IsCriuAtLeast checks whether the installed CRIU version is at least
// the given version (encoded as Major*10000 + Minor*100 + SubLevel)
func (c *Criu) IsCriuAtLeast(version int) (bool, error) {
criuVersion, err := c.GetCriuVersion()
if err != nil {
return false, err
}
if criuVersion >= version {
return true, nil
}
return false, nil
}
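
For orientation, a minimal, hedged sketch of how a caller might drive this client: probe the installed CRIU version (3.16.0 encodes as 3*10000 + 16*100 + 0 = 31600 under the scheme described above) and then request a dump over the swrk RPC channel. It assumes the package's MakeCriu constructor; the image directory, target PID, and option values are illustrative, not taken from this repository.

```go
package main

import (
    "fmt"
    "os"

    "github.com/checkpoint-restore/go-criu/v5"
    "github.com/checkpoint-restore/go-criu/v5/rpc"
    "google.golang.org/protobuf/proto"
)

func main() {
    c := criu.MakeCriu() // assumed constructor; uses "criu" from PATH unless SetCriuPath is called

    // 31600 == 3*10000 + 16*100 + 0, i.e. CRIU 3.16.
    if ok, err := c.IsCriuAtLeast(31600); err != nil || !ok {
        fmt.Fprintln(os.Stderr, "need criu >= 3.16:", err)
        return
    }

    imgDir, err := os.Open("/tmp/criu-images") // hypothetical image directory
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        return
    }
    defer imgDir.Close()

    opts := &rpc.CriuOpts{
        ImagesDirFd:  proto.Int32(int32(imgDir.Fd())),
        Pid:          proto.Int32(12345), // hypothetical target process
        LeaveRunning: proto.Bool(true),
        LogLevel:     proto.Int32(4),
        LogFile:      proto.String("dump.log"),
    }
    if err := c.Dump(opts, criu.NoNotify{}); err != nil {
        fmt.Fprintln(os.Stderr, "dump failed:", err)
    }
}
```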


@@ -1,62 +0,0 @@
package criu
// Notify is the interface for CRIU notification callbacks
type Notify interface {
PreDump() error
PostDump() error
PreRestore() error
PostRestore(pid int32) error
NetworkLock() error
NetworkUnlock() error
SetupNamespaces(pid int32) error
PostSetupNamespaces() error
PostResume() error
}
// NoNotify is a no-op implementation of the Notify interface
type NoNotify struct{}
// PreDump NoNotify
func (c NoNotify) PreDump() error {
return nil
}
// PostDump NoNotify
func (c NoNotify) PostDump() error {
return nil
}
// PreRestore NoNotify
func (c NoNotify) PreRestore() error {
return nil
}
// PostRestore NoNotify
func (c NoNotify) PostRestore(pid int32) error {
return nil
}
// NetworkLock NoNotify
func (c NoNotify) NetworkLock() error {
return nil
}
// NetworkUnlock NoNotify
func (c NoNotify) NetworkUnlock() error {
return nil
}
// SetupNamespaces NoNotify
func (c NoNotify) SetupNamespaces(pid int32) error {
return nil
}
// PostSetupNamespaces NoNotify
func (c NoNotify) PostSetupNamespaces() error {
return nil
}
// PostResume NoNotify
func (c NoNotify) PostResume() error {
return nil
}
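
A hedged illustration (not part of this package) of how NoNotify is meant to be used: embed it and override only the callbacks you care about, then pass the value as the Notify argument to Dump or Restore.

```go
package example

import (
    "log"

    "github.com/checkpoint-restore/go-criu/v5"
)

// restoreLogger reacts only to the "post-restore" notification; every other
// callback falls through to the embedded no-op NoNotify implementation.
type restoreLogger struct {
    criu.NoNotify
}

func (restoreLogger) PostRestore(pid int32) error {
    log.Printf("restored process is running with pid %d", pid)
    return nil
}

// restoreLogger satisfies the Notify interface defined above.
var _ criu.Notify = restoreLogger{}
```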

File diff suppressed because it is too large


@@ -1,239 +0,0 @@
// SPDX-License-Identifier: MIT
syntax = "proto2";
message criu_page_server_info {
optional string address = 1;
optional int32 port = 2;
optional int32 pid = 3;
optional int32 fd = 4;
}
message criu_veth_pair {
required string if_in = 1;
required string if_out = 2;
};
message ext_mount_map {
required string key = 1;
required string val = 2;
};
message join_namespace {
required string ns = 1;
required string ns_file = 2;
optional string extra_opt = 3;
}
message inherit_fd {
required string key = 1;
required int32 fd = 2;
};
message cgroup_root {
optional string ctrl = 1;
required string path = 2;
};
message unix_sk {
required uint32 inode = 1;
};
enum criu_cg_mode {
IGNORE = 0;
CG_NONE = 1;
PROPS = 2;
SOFT = 3;
FULL = 4;
STRICT = 5;
DEFAULT = 6;
};
enum criu_pre_dump_mode {
SPLICE = 1;
VM_READ = 2;
};
message criu_opts {
required int32 images_dir_fd = 1;
optional int32 pid = 2; /* if not set on dump, will dump requesting process */
optional bool leave_running = 3;
optional bool ext_unix_sk = 4;
optional bool tcp_established = 5;
optional bool evasive_devices = 6;
optional bool shell_job = 7;
optional bool file_locks = 8;
optional int32 log_level = 9 [default = 2];
optional string log_file = 10; /* No subdirs are allowed. Consider using work-dir */
optional criu_page_server_info ps = 11;
optional bool notify_scripts = 12;
optional string root = 13;
optional string parent_img = 14;
optional bool track_mem = 15;
optional bool auto_dedup = 16;
optional int32 work_dir_fd = 17;
optional bool link_remap = 18;
repeated criu_veth_pair veths = 19; /* DEPRECATED, use external instead */
optional uint32 cpu_cap = 20 [default = 0xffffffff];
optional bool force_irmap = 21;
repeated string exec_cmd = 22;
repeated ext_mount_map ext_mnt = 23; /* DEPRECATED, use external instead */
optional bool manage_cgroups = 24; /* backward compatibility */
repeated cgroup_root cg_root = 25;
optional bool rst_sibling = 26; /* swrk only */
repeated inherit_fd inherit_fd = 27; /* swrk only */
optional bool auto_ext_mnt = 28;
optional bool ext_sharing = 29;
optional bool ext_masters = 30;
repeated string skip_mnt = 31;
repeated string enable_fs = 32;
repeated unix_sk unix_sk_ino = 33; /* DEPRECATED, use external instead */
optional criu_cg_mode manage_cgroups_mode = 34;
optional uint32 ghost_limit = 35 [default = 0x100000];
repeated string irmap_scan_paths = 36;
repeated string external = 37;
optional uint32 empty_ns = 38;
repeated join_namespace join_ns = 39;
optional string cgroup_props = 41;
optional string cgroup_props_file = 42;
repeated string cgroup_dump_controller = 43;
optional string freeze_cgroup = 44;
optional uint32 timeout = 45;
optional bool tcp_skip_in_flight = 46;
optional bool weak_sysctls = 47;
optional bool lazy_pages = 48;
optional int32 status_fd = 49;
optional bool orphan_pts_master = 50;
optional string config_file = 51;
optional bool tcp_close = 52;
optional string lsm_profile = 53;
optional string tls_cacert = 54;
optional string tls_cacrl = 55;
optional string tls_cert = 56;
optional string tls_key = 57;
optional bool tls = 58;
optional bool tls_no_cn_verify = 59;
optional string cgroup_yard = 60;
optional criu_pre_dump_mode pre_dump_mode = 61 [default = SPLICE];
optional int32 pidfd_store_sk = 62;
optional string lsm_mount_context = 63;
/* optional bool check_mounts = 128; */
}
message criu_dump_resp {
optional bool restored = 1;
}
message criu_restore_resp {
required int32 pid = 1;
}
message criu_notify {
optional string script = 1;
optional int32 pid = 2;
}
enum criu_req_type {
EMPTY = 0;
DUMP = 1;
RESTORE = 2;
CHECK = 3;
PRE_DUMP = 4;
PAGE_SERVER = 5;
NOTIFY = 6;
CPUINFO_DUMP = 7;
CPUINFO_CHECK = 8;
FEATURE_CHECK = 9;
VERSION = 10;
WAIT_PID = 11;
PAGE_SERVER_CHLD = 12;
}
/*
* List of features which can be queried via
* CRIU_REQ_TYPE__FEATURE_CHECK
*/
message criu_features {
optional bool mem_track = 1;
optional bool lazy_pages = 2;
optional bool pidfd_store = 3;
}
/*
* Request -- each type corresponds to must-be-there
* request arguments of respective type
*/
message criu_req {
required criu_req_type type = 1;
optional criu_opts opts = 2;
optional bool notify_success = 3;
/*
* When set, the service won't close the connection but
* will wait for more requests to appear. This does not
* work for all request types.
*/
optional bool keep_open = 4;
/*
* 'features' can be used to query which features
* are supported by the installed criu/kernel
* via RPC.
*/
optional criu_features features = 5;
/* 'pid' is used for WAIT_PID */
optional uint32 pid = 6;
}
/*
* Response -- it states whether the request was served
* and additional request-specific information
*/
message criu_resp {
required criu_req_type type = 1;
required bool success = 2;
optional criu_dump_resp dump = 3;
optional criu_restore_resp restore = 4;
optional criu_notify notify = 5;
optional criu_page_server_info ps = 6;
optional int32 cr_errno = 7;
optional criu_features features = 8;
optional string cr_errmsg = 9;
optional criu_version version = 10;
optional int32 status = 11;
}
/* Answer for criu_req_type.VERSION requests */
message criu_version {
required int32 major_number = 1;
required int32 minor_number = 2;
optional string gitid = 3;
optional int32 sublevel = 4;
optional int32 extra = 5;
optional string name = 6;
}
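
For readers mapping this schema onto the Go client in criu.go, here is a sketch (assuming the rpc package generated from this file) of the VERSION round trip at the protobuf level; the Go field and getter names follow the usual proto-to-Go naming of the messages above.

```go
package example

import (
    "github.com/checkpoint-restore/go-criu/v5/rpc"
    "google.golang.org/protobuf/proto"
)

// versionRequest builds the same wire message that GetCriuVersion sends:
// a criu_req with type = VERSION and no opts attached.
func versionRequest() ([]byte, error) {
    t := rpc.CriuReqType_VERSION
    return proto.Marshal(&rpc.CriuReq{Type: &t})
}

// parseVersion decodes a criu_resp and extracts the criu_version fields.
func parseVersion(respB []byte) (int32, int32, int32, error) {
    resp := &rpc.CriuResp{}
    if err := proto.Unmarshal(respB, resp); err != nil {
        return 0, 0, 0, err
    }
    v := resp.GetVersion()
    return v.GetMajorNumber(), v.GetMinorNumber(), v.GetSublevel(), nil
}
```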


@@ -1,20 +0,0 @@
linters:
enable:
- gofmt
- goimports
- ineffassign
- misspell
- revive
- staticcheck
- structcheck
- unconvert
- unused
- varcheck
- vet
disable:
- errcheck
run:
timeout: 3m
skip-dirs:
- vendor


@@ -1,191 +0,0 @@
Apache License
Version 2.0, January 2004
https://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
Copyright The containerd Authors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


@@ -1,29 +0,0 @@
# console
[![PkgGoDev](https://pkg.go.dev/badge/github.com/containerd/console)](https://pkg.go.dev/github.com/containerd/console)
[![Build Status](https://github.com/containerd/console/workflows/CI/badge.svg)](https://github.com/containerd/console/actions?query=workflow%3ACI)
[![Go Report Card](https://goreportcard.com/badge/github.com/containerd/console)](https://goreportcard.com/report/github.com/containerd/console)
Golang package for dealing with consoles. Light on dependencies, with a simple API.
## Modifying the current process
```go
current := console.Current()
defer current.Reset()

if err := current.SetRaw(); err != nil {
    // handle error
}

ws, err := current.Size()
if err != nil {
    // handle error
}
current.Resize(ws)
```
## Project details
console is a containerd sub-project, licensed under the [Apache 2.0 license](./LICENSE).
As a containerd sub-project, you will find the:
* [Project governance](https://github.com/containerd/project/blob/main/GOVERNANCE.md),
* [Maintainers](https://github.com/containerd/project/blob/main/MAINTAINERS),
* and [Contributing guidelines](https://github.com/containerd/project/blob/main/CONTRIBUTING.md)
information in our [`containerd/project`](https://github.com/containerd/project) repository.


@@ -1,90 +0,0 @@
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"errors"
"io"
"os"
)
var (
ErrNotAConsole = errors.New("provided file is not a console")
ErrNotImplemented = errors.New("not implemented")
)
type File interface {
io.ReadWriteCloser
// Fd returns its file descriptor
Fd() uintptr
// Name returns its file name
Name() string
}
type Console interface {
File
// Resize resizes the console to the provided window size
Resize(WinSize) error
// ResizeFrom resizes the calling console to the size of the
// provided console
ResizeFrom(Console) error
// SetRaw sets the console in raw mode
SetRaw() error
// DisableEcho disables echo on the console
DisableEcho() error
// Reset restores the console to its original state
Reset() error
// Size returns the window size of the console
Size() (WinSize, error)
}
// WinSize specifies the window size of the console
type WinSize struct {
// Height of the console
Height uint16
// Width of the console
Width uint16
x uint16
y uint16
}
// Current returns the current process' console
func Current() (c Console) {
var err error
// Usually all three streams (stdin, stdout, and stderr)
// are open to the same console, but some might be redirected,
// so try all three.
for _, s := range []*os.File{os.Stderr, os.Stdout, os.Stdin} {
if c, err = ConsoleFromFile(s); err == nil {
return c
}
}
// By the design of this function, one of the std
// streams should always be a console.
panic(err)
}
// ConsoleFromFile returns a console using the provided file
// nolint:revive
func ConsoleFromFile(f File) (Console, error) {
if err := checkConsole(f); err != nil {
return nil, err
}
return newMaster(f)
}


@@ -1,281 +0,0 @@
//go:build linux
// +build linux
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"io"
"os"
"sync"
"golang.org/x/sys/unix"
)
const (
maxEvents = 128
)
// Epoller manages multiple epoll consoles using the edge-triggered epoll API so we
// don't have to deal with repeated wake-ups of EPOLLERR or EPOLLHUP.
// For more details, see:
// - https://github.com/systemd/systemd/pull/4262
// - https://github.com/moby/moby/issues/27202
//
// Example usage of Epoller and EpollConsole is as follows:
//
// epoller, _ := NewEpoller()
// epollConsole, _ := epoller.Add(console)
// go epoller.Wait()
// var (
// b bytes.Buffer
// wg sync.WaitGroup
// )
// wg.Add(1)
// go func() {
// io.Copy(&b, epollConsole)
// wg.Done()
// }()
// // perform I/O on the console
// epollConsole.Shutdown(epoller.CloseConsole)
// wg.Wait()
// epollConsole.Close()
type Epoller struct {
efd int
mu sync.Mutex
fdMapping map[int]*EpollConsole
closeOnce sync.Once
}
// NewEpoller returns an instance of epoller with a valid epoll fd.
func NewEpoller() (*Epoller, error) {
efd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
if err != nil {
return nil, err
}
return &Epoller{
efd: efd,
fdMapping: make(map[int]*EpollConsole),
}, nil
}
// Add creates an epoll console based on the provided console. The console will
// be registered with EPOLLET (i.e. using edge-triggered notification) and its
// file descriptor will be set to non-blocking mode. After this, the caller should
// use the returned console to perform I/O.
func (e *Epoller) Add(console Console) (*EpollConsole, error) {
sysfd := int(console.Fd())
// Set sysfd to non-blocking mode
if err := unix.SetNonblock(sysfd, true); err != nil {
return nil, err
}
ev := unix.EpollEvent{
Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLRDHUP | unix.EPOLLET,
Fd: int32(sysfd),
}
if err := unix.EpollCtl(e.efd, unix.EPOLL_CTL_ADD, sysfd, &ev); err != nil {
return nil, err
}
ef := &EpollConsole{
Console: console,
sysfd: sysfd,
readc: sync.NewCond(&sync.Mutex{}),
writec: sync.NewCond(&sync.Mutex{}),
}
e.mu.Lock()
e.fdMapping[sysfd] = ef
e.mu.Unlock()
return ef, nil
}
// Wait starts the loop to wait for its consoles' notifications and signals the
// appropriate console that it can perform I/O.
func (e *Epoller) Wait() error {
events := make([]unix.EpollEvent, maxEvents)
for {
n, err := unix.EpollWait(e.efd, events, -1)
if err != nil {
// EINTR: The call was interrupted by a signal handler before either
// any of the requested events occurred or the timeout expired
if err == unix.EINTR {
continue
}
return err
}
for i := 0; i < n; i++ {
ev := &events[i]
// the console is ready to be read from
if ev.Events&(unix.EPOLLIN|unix.EPOLLHUP|unix.EPOLLERR) != 0 {
if epfile := e.getConsole(int(ev.Fd)); epfile != nil {
epfile.signalRead()
}
}
// the console is ready to be written to
if ev.Events&(unix.EPOLLOUT|unix.EPOLLHUP|unix.EPOLLERR) != 0 {
if epfile := e.getConsole(int(ev.Fd)); epfile != nil {
epfile.signalWrite()
}
}
}
}
}
// CloseConsole unregisters the console's file descriptor from the epoll interface
func (e *Epoller) CloseConsole(fd int) error {
e.mu.Lock()
defer e.mu.Unlock()
delete(e.fdMapping, fd)
return unix.EpollCtl(e.efd, unix.EPOLL_CTL_DEL, fd, &unix.EpollEvent{})
}
func (e *Epoller) getConsole(sysfd int) *EpollConsole {
e.mu.Lock()
f := e.fdMapping[sysfd]
e.mu.Unlock()
return f
}
// Close closes the epoll fd
func (e *Epoller) Close() error {
closeErr := os.ErrClosed // default to "file already closed"
e.closeOnce.Do(func() {
closeErr = unix.Close(e.efd)
})
return closeErr
}
// EpollConsole acts like a console but registers its file descriptor with an
// epoll fd and uses epoll API to perform I/O.
type EpollConsole struct {
Console
readc *sync.Cond
writec *sync.Cond
sysfd int
closed bool
}
// Read reads up to len(p) bytes into p. It returns the number of bytes read
// (0 <= n <= len(p)) and any error encountered.
//
// If the console's read returns EAGAIN or EIO, we assume that it's a
// temporary error because the other side went away and wait for the signal
// generated by epoll event to continue.
func (ec *EpollConsole) Read(p []byte) (n int, err error) {
var read int
ec.readc.L.Lock()
defer ec.readc.L.Unlock()
for {
read, err = ec.Console.Read(p[n:])
n += read
if err != nil {
var hangup bool
if perr, ok := err.(*os.PathError); ok {
hangup = (perr.Err == unix.EAGAIN || perr.Err == unix.EIO)
} else {
hangup = (err == unix.EAGAIN || err == unix.EIO)
}
// If the other end disappears, assume this is temporary and wait for the
// signal to continue. However, if we didn't read anything and the console
// is already marked as closed, we should exit.
if hangup && !(n == 0 && len(p) > 0 && ec.closed) {
ec.readc.Wait()
continue
}
}
break
}
// if we didn't read anything, return io.EOF to end gracefully
if n == 0 && len(p) > 0 && err == nil {
err = io.EOF
}
// signal for others that we finished the read
ec.readc.Signal()
return n, err
}
// Write writes len(p) bytes from p to the console. It returns the number of bytes
// written from p (0 <= n <= len(p)) and any error encountered that caused
// the write to stop early.
//
// If a write to the console returns EAGAIN or EIO, we assume that it's a
// temporary error because the other side went away and wait for the signal
// generated by epoll event to continue.
func (ec *EpollConsole) Write(p []byte) (n int, err error) {
var written int
ec.writec.L.Lock()
defer ec.writec.L.Unlock()
for {
written, err = ec.Console.Write(p[n:])
n += written
if err != nil {
var hangup bool
if perr, ok := err.(*os.PathError); ok {
hangup = (perr.Err == unix.EAGAIN || perr.Err == unix.EIO)
} else {
hangup = (err == unix.EAGAIN || err == unix.EIO)
}
// if the other end disappears, assume this is temporary and wait for the
// signal to continue again.
if hangup {
ec.writec.Wait()
continue
}
}
// unrecoverable error, break the loop and return the error
break
}
if n < len(p) && err == nil {
err = io.ErrShortWrite
}
// signal for others that we finished the write
ec.writec.Signal()
return n, err
}
// Shutdown closes the file descriptor and signals any waiters on this fd.
// It accepts a callback which will be called with the console's fd. The
// callback typically will be used to do further cleanup such as unregister the
// console's fd from the epoll interface.
// Callers should call Shutdown and wait for all I/O operations to finish
// before closing the console.
func (ec *EpollConsole) Shutdown(close func(int) error) error {
ec.readc.L.Lock()
defer ec.readc.L.Unlock()
ec.writec.L.Lock()
defer ec.writec.L.Unlock()
ec.readc.Broadcast()
ec.writec.Broadcast()
ec.closed = true
return close(ec.sysfd)
}
// signalRead signals that the console is readable.
func (ec *EpollConsole) signalRead() {
ec.readc.L.Lock()
ec.readc.Signal()
ec.readc.L.Unlock()
}
// signalWrite signals that the console is writable.
func (ec *EpollConsole) signalWrite() {
ec.writec.L.Lock()
ec.writec.Signal()
ec.writec.L.Unlock()
}
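
The Epoller doc comment above sketches the intended flow; below is a compilable rendering of that sketch, under the assumption that the console comes from console.NewPty and that this runs on Linux.

```go
package main

import (
    "bytes"
    "fmt"
    "io"
    "os"
    "sync"

    "github.com/containerd/console"
)

func main() {
    c, _, err := console.NewPty() // the slave path is unused in this sketch
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        return
    }

    epoller, err := console.NewEpoller()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        return
    }
    epollConsole, err := epoller.Add(c)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        return
    }
    go epoller.Wait() // dispatch read/write readiness to registered consoles

    var (
        b  bytes.Buffer
        wg sync.WaitGroup
    )
    wg.Add(1)
    go func() {
        defer wg.Done()
        _, _ = io.Copy(&b, epollConsole) // returns once the console is shut down
    }()

    // ... perform I/O on the console here ...

    if err := epollConsole.Shutdown(epoller.CloseConsole); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
    wg.Wait()
    epollConsole.Close()
    fmt.Printf("captured %d bytes of console output\n", b.Len())
}
```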


@@ -1,36 +0,0 @@
//go:build !darwin && !freebsd && !linux && !netbsd && !openbsd && !solaris && !windows && !zos
// +build !darwin,!freebsd,!linux,!netbsd,!openbsd,!solaris,!windows,!zos
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
// NewPty creates a new pty pair
// The master is returned as the first console and a string
// with the path to the pty slave is returned as the second
func NewPty() (Console, string, error) {
return nil, "", ErrNotImplemented
}
// checkConsole checks if the provided file is a console
func checkConsole(f File) error {
return ErrNotAConsole
}
func newMaster(f File) (Console, error) {
return nil, ErrNotImplemented
}


@@ -1,157 +0,0 @@
//go:build darwin || freebsd || linux || netbsd || openbsd || zos
// +build darwin freebsd linux netbsd openbsd zos
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"golang.org/x/sys/unix"
)
// NewPty creates a new pty pair
// The master is returned as the first console and a string
// with the path to the pty slave is returned as the second
func NewPty() (Console, string, error) {
f, err := openpt()
if err != nil {
return nil, "", err
}
slave, err := ptsname(f)
if err != nil {
return nil, "", err
}
if err := unlockpt(f); err != nil {
return nil, "", err
}
m, err := newMaster(f)
if err != nil {
return nil, "", err
}
return m, slave, nil
}
type master struct {
f File
original *unix.Termios
}
func (m *master) Read(b []byte) (int, error) {
return m.f.Read(b)
}
func (m *master) Write(b []byte) (int, error) {
return m.f.Write(b)
}
func (m *master) Close() error {
return m.f.Close()
}
func (m *master) Resize(ws WinSize) error {
return tcswinsz(m.f.Fd(), ws)
}
func (m *master) ResizeFrom(c Console) error {
ws, err := c.Size()
if err != nil {
return err
}
return m.Resize(ws)
}
func (m *master) Reset() error {
if m.original == nil {
return nil
}
return tcset(m.f.Fd(), m.original)
}
func (m *master) getCurrent() (unix.Termios, error) {
var termios unix.Termios
if err := tcget(m.f.Fd(), &termios); err != nil {
return unix.Termios{}, err
}
return termios, nil
}
func (m *master) SetRaw() error {
rawState, err := m.getCurrent()
if err != nil {
return err
}
rawState = cfmakeraw(rawState)
rawState.Oflag = rawState.Oflag | unix.OPOST
return tcset(m.f.Fd(), &rawState)
}
func (m *master) DisableEcho() error {
rawState, err := m.getCurrent()
if err != nil {
return err
}
rawState.Lflag = rawState.Lflag &^ unix.ECHO
return tcset(m.f.Fd(), &rawState)
}
func (m *master) Size() (WinSize, error) {
return tcgwinsz(m.f.Fd())
}
func (m *master) Fd() uintptr {
return m.f.Fd()
}
func (m *master) Name() string {
return m.f.Name()
}
// checkConsole checks if the provided file is a console
func checkConsole(f File) error {
var termios unix.Termios
if tcget(f.Fd(), &termios) != nil {
return ErrNotAConsole
}
return nil
}
func newMaster(f File) (Console, error) {
m := &master{
f: f,
}
t, err := m.getCurrent()
if err != nil {
return nil, err
}
m.original = &t
return m, nil
}
// ClearONLCR sets the necessary tty_ioctl(4)s to ensure that a pty pair
// created by us acts normally. In particular, a not-very-well-known default of
// Linux unix98 ptys is that they have +onlcr by default. While this isn't a
// problem for terminal emulators, it matters here: because we relay data from
// the terminal, we would also relay that funky line discipline.
func ClearONLCR(fd uintptr) error {
return setONLCR(fd, false)
}
// SetONLCR sets the necessary tty_ioctl(4)s to ensure that a pty pair
// created by us acts as intended for a terminal emulator.
func SetONLCR(fd uintptr) error {
return setONLCR(fd, true)
}
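
A hedged usage sketch for the helpers above: allocate a pty pair, open the slave via the returned path (NewPty has already unlocked it), and clear the +onlcr default so relayed output does not pick up carriage returns. The path handling and error reporting are illustrative only.

```go
package main

import (
    "fmt"
    "os"

    "github.com/containerd/console"
)

func main() {
    master, slavePath, err := console.NewPty()
    if err != nil {
        fmt.Fprintln(os.Stderr, "newpty:", err)
        return
    }
    defer master.Close()

    slave, err := os.OpenFile(slavePath, os.O_RDWR, 0)
    if err != nil {
        fmt.Fprintln(os.Stderr, "open slave:", err)
        return
    }
    defer slave.Close()

    // Unix98 ptys come up with +onlcr; clear it so we don't have to strip \r.
    if err := console.ClearONLCR(slave.Fd()); err != nil {
        fmt.Fprintln(os.Stderr, "clear onlcr:", err)
        return
    }
    fmt.Println("slave pty ready at", slavePath)
}
```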


@@ -1,219 +0,0 @@
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"errors"
"fmt"
"os"
"golang.org/x/sys/windows"
)
var vtInputSupported bool
func (m *master) initStdios() {
// Note: We discard console mode warnings, because in/out can be redirected.
//
// TODO: Investigate opening CONOUT$/CONIN$ to handle this correctly
m.in = windows.Handle(os.Stdin.Fd())
if err := windows.GetConsoleMode(m.in, &m.inMode); err == nil {
// Validate that windows.ENABLE_VIRTUAL_TERMINAL_INPUT is supported, but do not set it.
if err = windows.SetConsoleMode(m.in, m.inMode|windows.ENABLE_VIRTUAL_TERMINAL_INPUT); err == nil {
vtInputSupported = true
}
// Unconditionally set the console mode back even on failure because SetConsoleMode
// remembers invalid bits on input handles.
windows.SetConsoleMode(m.in, m.inMode)
}
m.out = windows.Handle(os.Stdout.Fd())
if err := windows.GetConsoleMode(m.out, &m.outMode); err == nil {
if err := windows.SetConsoleMode(m.out, m.outMode|windows.ENABLE_VIRTUAL_TERMINAL_PROCESSING); err == nil {
m.outMode |= windows.ENABLE_VIRTUAL_TERMINAL_PROCESSING
} else {
windows.SetConsoleMode(m.out, m.outMode)
}
}
m.err = windows.Handle(os.Stderr.Fd())
if err := windows.GetConsoleMode(m.err, &m.errMode); err == nil {
if err := windows.SetConsoleMode(m.err, m.errMode|windows.ENABLE_VIRTUAL_TERMINAL_PROCESSING); err == nil {
m.errMode |= windows.ENABLE_VIRTUAL_TERMINAL_PROCESSING
} else {
windows.SetConsoleMode(m.err, m.errMode)
}
}
}
type master struct {
in windows.Handle
inMode uint32
out windows.Handle
outMode uint32
err windows.Handle
errMode uint32
}
func (m *master) SetRaw() error {
if err := makeInputRaw(m.in, m.inMode); err != nil {
return err
}
// Set StdOut and StdErr to raw mode; we ignore failures since
// windows.DISABLE_NEWLINE_AUTO_RETURN might not be supported on this version of
// Windows.
windows.SetConsoleMode(m.out, m.outMode|windows.DISABLE_NEWLINE_AUTO_RETURN)
windows.SetConsoleMode(m.err, m.errMode|windows.DISABLE_NEWLINE_AUTO_RETURN)
return nil
}
func (m *master) Reset() error {
var errs []error
for _, s := range []struct {
fd windows.Handle
mode uint32
}{
{m.in, m.inMode},
{m.out, m.outMode},
{m.err, m.errMode},
} {
if err := windows.SetConsoleMode(s.fd, s.mode); err != nil {
// we can't just abort on the first error, otherwise we might leave
// the console in an unexpected state.
errs = append(errs, fmt.Errorf("unable to restore console mode: %w", err))
}
}
if len(errs) > 0 {
return errs[0]
}
return nil
}
func (m *master) Size() (WinSize, error) {
var info windows.ConsoleScreenBufferInfo
err := windows.GetConsoleScreenBufferInfo(m.out, &info)
if err != nil {
return WinSize{}, fmt.Errorf("unable to get console info: %w", err)
}
winsize := WinSize{
Width: uint16(info.Window.Right - info.Window.Left + 1),
Height: uint16(info.Window.Bottom - info.Window.Top + 1),
}
return winsize, nil
}
func (m *master) Resize(ws WinSize) error {
return ErrNotImplemented
}
func (m *master) ResizeFrom(c Console) error {
return ErrNotImplemented
}
func (m *master) DisableEcho() error {
mode := m.inMode &^ windows.ENABLE_ECHO_INPUT
mode |= windows.ENABLE_PROCESSED_INPUT
mode |= windows.ENABLE_LINE_INPUT
if err := windows.SetConsoleMode(m.in, mode); err != nil {
return fmt.Errorf("unable to set console to disable echo: %w", err)
}
return nil
}
func (m *master) Close() error {
return nil
}
func (m *master) Read(b []byte) (int, error) {
return os.Stdin.Read(b)
}
func (m *master) Write(b []byte) (int, error) {
return os.Stdout.Write(b)
}
func (m *master) Fd() uintptr {
return uintptr(m.in)
}
// On Windows, a console can only be made from os.Std{in,out,err}, hence there
// isn't a single name we can use here. Returning a dummy "console" value in
// this case should be sufficient.
func (m *master) Name() string {
return "console"
}
// makeInputRaw puts the terminal (Windows Console) connected to the given
// file descriptor into raw mode
func makeInputRaw(fd windows.Handle, mode uint32) error {
// See
// -- https://msdn.microsoft.com/en-us/library/windows/desktop/ms686033(v=vs.85).aspx
// -- https://msdn.microsoft.com/en-us/library/windows/desktop/ms683462(v=vs.85).aspx
// Disable these modes
mode &^= windows.ENABLE_ECHO_INPUT
mode &^= windows.ENABLE_LINE_INPUT
mode &^= windows.ENABLE_MOUSE_INPUT
mode &^= windows.ENABLE_WINDOW_INPUT
mode &^= windows.ENABLE_PROCESSED_INPUT
// Enable these modes
mode |= windows.ENABLE_EXTENDED_FLAGS
mode |= windows.ENABLE_INSERT_MODE
mode |= windows.ENABLE_QUICK_EDIT_MODE
if vtInputSupported {
mode |= windows.ENABLE_VIRTUAL_TERMINAL_INPUT
}
if err := windows.SetConsoleMode(fd, mode); err != nil {
return fmt.Errorf("unable to set console to raw mode: %w", err)
}
return nil
}
func checkConsole(f File) error {
var mode uint32
if err := windows.GetConsoleMode(windows.Handle(f.Fd()), &mode); err != nil {
return err
}
return nil
}
func newMaster(f File) (Console, error) {
if f != os.Stdin && f != os.Stdout && f != os.Stderr {
return nil, errors.New("creating a console from a file is not supported on windows")
}
m := &master{}
m.initStdios()
return m, nil
}


@@ -1,46 +0,0 @@
//go:build freebsd && cgo
// +build freebsd,cgo
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"fmt"
"os"
)
/*
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
*/
import "C"
// openpt allocates a new pseudo-terminal and establishes a connection with its
// control device.
func openpt() (*os.File, error) {
fd, err := C.posix_openpt(C.O_RDWR)
if err != nil {
return nil, fmt.Errorf("posix_openpt: %w", err)
}
if _, err := C.grantpt(fd); err != nil {
C.close(fd)
return nil, fmt.Errorf("grantpt: %w", err)
}
return os.NewFile(uintptr(fd), ""), nil
}


@@ -1,37 +0,0 @@
//go:build freebsd && !cgo
// +build freebsd,!cgo
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"os"
)
//
// Implementing the functions below requires cgo support. Non-cgo stub
// versions are defined below to enable cross-compilation of source code
// that depends on these functions, but the resultant cross-compiled
// binaries cannot actually be used. If the stub function(s) below are
// actually invoked they will display an error message and cause the
// calling process to exit.
//
func openpt() (*os.File, error) {
panic("openpt() support requires cgo.")
}


@@ -1,31 +0,0 @@
//go:build darwin || linux || netbsd || openbsd
// +build darwin linux netbsd openbsd
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"os"
"golang.org/x/sys/unix"
)
// openpt allocates a new pseudo-terminal by opening the /dev/ptmx device
func openpt() (*os.File, error) {
return os.OpenFile("/dev/ptmx", unix.O_RDWR|unix.O_NOCTTY|unix.O_CLOEXEC, 0)
}


@@ -1,43 +0,0 @@
//go:build zos
// +build zos
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"fmt"
"os"
)
// openpt allocates a new pseudo-terminal by opening the first available /dev/ptypXX device
func openpt() (*os.File, error) {
var f *os.File
var err error
for i := 0; ; i++ {
ptyp := fmt.Sprintf("/dev/ptyp%04d", i)
f, err = os.OpenFile(ptyp, os.O_RDWR, 0600)
if err == nil {
break
}
if os.IsNotExist(err) {
return nil, err
}
// else probably Resource Busy
}
return f, nil
}


@@ -1,44 +0,0 @@
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"fmt"
"os"
"golang.org/x/sys/unix"
)
const (
cmdTcGet = unix.TIOCGETA
cmdTcSet = unix.TIOCSETA
)
// unlockpt unlocks the slave pseudoterminal device corresponding to the master pseudoterminal referred to by f.
// unlockpt should be called before opening the slave side of a pty.
func unlockpt(f *os.File) error {
return unix.IoctlSetPointerInt(int(f.Fd()), unix.TIOCPTYUNLK, 0)
}
// ptsname retrieves the name of the first available pts for the given master.
func ptsname(f *os.File) (string, error) {
n, err := unix.IoctlGetInt(int(f.Fd()), unix.TIOCPTYGNAME)
if err != nil {
return "", err
}
return fmt.Sprintf("/dev/pts/%d", n), nil
}


@@ -1,58 +0,0 @@
//go:build freebsd && cgo
// +build freebsd,cgo
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"fmt"
"os"
"golang.org/x/sys/unix"
)
/*
#include <stdlib.h>
#include <unistd.h>
*/
import "C"
const (
cmdTcGet = unix.TIOCGETA
cmdTcSet = unix.TIOCSETA
)
// unlockpt unlocks the slave pseudoterminal device corresponding to the master pseudoterminal referred to by f.
// unlockpt should be called before opening the slave side of a pty.
func unlockpt(f *os.File) error {
fd := C.int(f.Fd())
if _, err := C.unlockpt(fd); err != nil {
C.close(fd)
return fmt.Errorf("unlockpt: %w", err)
}
return nil
}
// ptsname retrieves the name of the first available pts for the given master.
func ptsname(f *os.File) (string, error) {
n, err := unix.IoctlGetInt(int(f.Fd()), unix.TIOCGPTN)
if err != nil {
return "", err
}
return fmt.Sprintf("/dev/pts/%d", n), nil
}


@@ -1,56 +0,0 @@
//go:build freebsd && !cgo
// +build freebsd,!cgo
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"fmt"
"os"
"golang.org/x/sys/unix"
)
const (
cmdTcGet = unix.TIOCGETA
cmdTcSet = unix.TIOCSETA
)
//
// Implementing the functions below requires cgo support. Non-cgo stub
// versions are defined below to enable cross-compilation of source code
// that depends on these functions, but the resultant cross-compiled
// binaries cannot actually be used. If the stub function(s) below are
// actually invoked they will display an error message and cause the
// calling process to exit.
//
// unlockpt unlocks the slave pseudoterminal device corresponding to the master pseudoterminal referred to by f.
// unlockpt should be called before opening the slave side of a pty.
func unlockpt(f *os.File) error {
panic("unlockpt() support requires cgo.")
}
// ptsname retrieves the name of the first available pts for the given master.
func ptsname(f *os.File) (string, error) {
n, err := unix.IoctlGetInt(int(f.Fd()), unix.TIOCGPTN)
if err != nil {
return "", err
}
return fmt.Sprintf("/dev/pts/%d", n), nil
}


@@ -1,51 +0,0 @@
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"fmt"
"os"
"unsafe"
"golang.org/x/sys/unix"
)
const (
cmdTcGet = unix.TCGETS
cmdTcSet = unix.TCSETS
)
// unlockpt unlocks the slave pseudoterminal device corresponding to the master pseudoterminal referred to by f.
// unlockpt should be called before opening the slave side of a pty.
func unlockpt(f *os.File) error {
var u int32
// XXX do not use unix.IoctlSetPointerInt here, see commit dbd69c59b81.
if _, _, err := unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.TIOCSPTLCK, uintptr(unsafe.Pointer(&u))); err != 0 {
return err
}
return nil
}
// ptsname retrieves the name of the first available pts for the given master.
func ptsname(f *os.File) (string, error) {
var u uint32
// XXX do not use unix.IoctlGetInt here, see commit dbd69c59b81.
if _, _, err := unix.Syscall(unix.SYS_IOCTL, f.Fd(), unix.TIOCGPTN, uintptr(unsafe.Pointer(&u))); err != 0 {
return "", err
}
return fmt.Sprintf("/dev/pts/%d", u), nil
}


@@ -1,45 +0,0 @@
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"bytes"
"os"
"golang.org/x/sys/unix"
)
const (
cmdTcGet = unix.TIOCGETA
cmdTcSet = unix.TIOCSETA
)
// unlockpt unlocks the slave pseudoterminal device corresponding to the master pseudoterminal referred to by f.
// unlockpt should be called before opening the slave side of a pty.
// unlockpt does not exist on NetBSD; it does not allocate controlling terminals on open
func unlockpt(f *os.File) error {
return nil
}
// ptsname retrieves the name of the first available pts for the given master.
func ptsname(f *os.File) (string, error) {
ptm, err := unix.IoctlGetPtmget(int(f.Fd()), unix.TIOCPTSNAME)
if err != nil {
return "", err
}
return string(ptm.Sn[:bytes.IndexByte(ptm.Sn[:], 0)]), nil
}


@@ -1,52 +0,0 @@
//go:build openbsd && cgo
// +build openbsd,cgo
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"os"
"golang.org/x/sys/unix"
)
//#include <stdlib.h>
import "C"
const (
cmdTcGet = unix.TIOCGETA
cmdTcSet = unix.TIOCSETA
)
// ptsname retrieves the name of the first available pts for the given master.
func ptsname(f *os.File) (string, error) {
ptspath, err := C.ptsname(C.int(f.Fd()))
if err != nil {
return "", err
}
return C.GoString(ptspath), nil
}
// unlockpt unlocks the slave pseudoterminal device corresponding to the master pseudoterminal referred to by f.
// unlockpt should be called before opening the slave side of a pty.
func unlockpt(f *os.File) error {
if _, err := C.grantpt(C.int(f.Fd())); err != nil {
return err
}
return nil
}


@@ -1,48 +0,0 @@
//go:build openbsd && !cgo
// +build openbsd,!cgo
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
//
// Implementing the functions below requires cgo support. Non-cgo stub
// versions are defined below to enable cross-compilation of source code
// that depends on these functions, but the resultant cross-compiled
// binaries cannot actually be used. If the stub function(s) below are
// actually invoked they will display an error message and cause the
// calling process to exit.
//
package console
import (
"os"
"golang.org/x/sys/unix"
)
const (
cmdTcGet = unix.TIOCGETA
cmdTcSet = unix.TIOCSETA
)
func ptsname(f *os.File) (string, error) {
panic("ptsname() support requires cgo.")
}
func unlockpt(f *os.File) error {
panic("unlockpt() support requires cgo.")
}


@@ -1,92 +0,0 @@
//go:build darwin || freebsd || linux || netbsd || openbsd || zos
// +build darwin freebsd linux netbsd openbsd zos
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"golang.org/x/sys/unix"
)
func tcget(fd uintptr, p *unix.Termios) error {
termios, err := unix.IoctlGetTermios(int(fd), cmdTcGet)
if err != nil {
return err
}
*p = *termios
return nil
}
func tcset(fd uintptr, p *unix.Termios) error {
return unix.IoctlSetTermios(int(fd), cmdTcSet, p)
}
func tcgwinsz(fd uintptr) (WinSize, error) {
var ws WinSize
uws, err := unix.IoctlGetWinsize(int(fd), unix.TIOCGWINSZ)
if err != nil {
return ws, err
}
// Translate from unix.Winsize to console.WinSize
ws.Height = uws.Row
ws.Width = uws.Col
ws.x = uws.Xpixel
ws.y = uws.Ypixel
return ws, nil
}
func tcswinsz(fd uintptr, ws WinSize) error {
// Translate from console.WinSize to unix.Winsize
var uws unix.Winsize
uws.Row = ws.Height
uws.Col = ws.Width
uws.Xpixel = ws.x
uws.Ypixel = ws.y
return unix.IoctlSetWinsize(int(fd), unix.TIOCSWINSZ, &uws)
}
func setONLCR(fd uintptr, enable bool) error {
var termios unix.Termios
if err := tcget(fd, &termios); err != nil {
return err
}
if enable {
// Set +onlcr so we can act like a real terminal
termios.Oflag |= unix.ONLCR
} else {
// Set -onlcr so we don't have to deal with \r.
termios.Oflag &^= unix.ONLCR
}
return tcset(fd, &termios)
}
func cfmakeraw(t unix.Termios) unix.Termios {
t.Iflag &^= (unix.IGNBRK | unix.BRKINT | unix.PARMRK | unix.ISTRIP | unix.INLCR | unix.IGNCR | unix.ICRNL | unix.IXON)
t.Oflag &^= unix.OPOST
t.Lflag &^= (unix.ECHO | unix.ECHONL | unix.ICANON | unix.ISIG | unix.IEXTEN)
t.Cflag &^= (unix.CSIZE | unix.PARENB)
t.Cflag |= unix.CS8
t.Cc[unix.VMIN] = 1
t.Cc[unix.VTIME] = 0
return t
}


@@ -1,39 +0,0 @@
/*
Copyright The containerd Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package console
import (
"os"
"strings"
"golang.org/x/sys/unix"
)
const (
cmdTcGet = unix.TCGETS
cmdTcSet = unix.TCSETS
)
// unlockpt is a no-op on zos.
func unlockpt(_ *os.File) error {
return nil
}
// ptsname retrieves the name of the first available pts for the given master.
func ptsname(f *os.File) (string, error) {
return "/dev/ttyp" + strings.TrimPrefix(f.Name(), "/dev/ptyp"), nil
}


@@ -28,7 +28,6 @@ import (
"strings"
"time"
"github.com/opencontainers/runc/libcontainer"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fs2"
"k8s.io/klog/v2"
@@ -89,10 +88,7 @@ func (h *Handler) GetStats() (*info.ContainerStats, error) {
}
klog.V(4).Infof("Ignoring errors when gathering stats for root cgroup since some controllers don't have stats on the root cgroup: %v", err)
}
libcontainerStats := &libcontainer.Stats{
CgroupStats: cgroupStats,
}
stats := newContainerStats(libcontainerStats, h.includedMetrics)
stats := newContainerStats(cgroupStats, h.includedMetrics)
if h.includedMetrics.Has(container.ProcessSchedulerMetrics) {
stats.Cpu.Schedstat, err = h.schedulerStatsFromProcs()
@@ -888,28 +884,6 @@ func setHugepageStats(s *cgroups.Stats, ret *info.ContainerStats) {
}
}
func setNetworkStats(libcontainerStats *libcontainer.Stats, ret *info.ContainerStats) {
ret.Network.Interfaces = make([]info.InterfaceStats, len(libcontainerStats.Interfaces))
for i := range libcontainerStats.Interfaces {
ret.Network.Interfaces[i] = info.InterfaceStats{
Name: libcontainerStats.Interfaces[i].Name,
RxBytes: libcontainerStats.Interfaces[i].RxBytes,
RxPackets: libcontainerStats.Interfaces[i].RxPackets,
RxErrors: libcontainerStats.Interfaces[i].RxErrors,
RxDropped: libcontainerStats.Interfaces[i].RxDropped,
TxBytes: libcontainerStats.Interfaces[i].TxBytes,
TxPackets: libcontainerStats.Interfaces[i].TxPackets,
TxErrors: libcontainerStats.Interfaces[i].TxErrors,
TxDropped: libcontainerStats.Interfaces[i].TxDropped,
}
}
// Add to base struct for backwards compatibility.
if len(ret.Network.Interfaces) > 0 {
ret.Network.InterfaceStats = ret.Network.Interfaces[0]
}
}
// read from pids path not cpu
func setThreadsStats(s *cgroups.Stats, ret *info.ContainerStats) {
if s != nil {
@@ -918,12 +892,12 @@ func setThreadsStats(s *cgroups.Stats, ret *info.ContainerStats) {
}
}
func newContainerStats(libcontainerStats *libcontainer.Stats, includedMetrics container.MetricSet) *info.ContainerStats {
func newContainerStats(cgroupStats *cgroups.Stats, includedMetrics container.MetricSet) *info.ContainerStats {
ret := &info.ContainerStats{
Timestamp: time.Now(),
}
if s := libcontainerStats.CgroupStats; s != nil {
if s := cgroupStats; s != nil {
setCPUStats(s, ret, includedMetrics.Has(container.PerCpuUsageMetrics))
if includedMetrics.Has(container.DiskIOMetrics) {
setDiskIoStats(s, ret)
@@ -939,8 +913,5 @@ func newContainerStats(libcontainerStats *libcontainer.Stats, includedMetrics co
setCPUSetStats(s, ret)
}
}
if len(libcontainerStats.Interfaces) > 0 {
setNetworkStats(libcontainerStats, ret)
}
return ret
}

View File

@@ -254,10 +254,8 @@ func (fs *realSysFs) IsBlockDeviceHidden(name string) (bool, error) {
if err != nil {
return false, fmt.Errorf("failed to read %s: %w", devHiddenPath, err)
}
if string(hidden) == "1" {
return true, nil
}
return false, nil
return strings.TrimSpace(string(hidden)) == "1", nil
}
func (fs *realSysFs) GetBlockDeviceScheduler(name string) (string, error) {

View File

@@ -1,318 +0,0 @@
# libcontainer
[![Go Reference](https://pkg.go.dev/badge/github.com/opencontainers/runc/libcontainer.svg)](https://pkg.go.dev/github.com/opencontainers/runc/libcontainer)
Libcontainer provides a native Go implementation for creating containers
with namespaces, cgroups, capabilities, and filesystem access controls.
It allows you to manage the lifecycle of the container and to perform additional
operations after the container is created.
#### Container
A container is a self-contained execution environment that shares the kernel of the
host system and is (optionally) isolated from other containers in the system.
#### Using libcontainer
Because containers are spawned in a two-step process, you need a binary that
will be executed as the init process for the container. In libcontainer, we use
the current binary (/proc/self/exe) as the init process, invoked with the
argument "init". We call this first step "bootstrap", so you always need an
"init" function as the entry point of the bootstrap.
In addition to the Go init function, the early-stage bootstrap is handled by importing
[nsenter](https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/README.md).
```go
import (
_ "github.com/opencontainers/runc/libcontainer/nsenter"
)
func init() {
if len(os.Args) > 1 && os.Args[1] == "init" {
runtime.GOMAXPROCS(1)
runtime.LockOSThread()
factory, _ := libcontainer.New("")
if err := factory.StartInitialization(); err != nil {
logrus.Fatal(err)
}
panic("--this line should have never been executed, congratulations--")
}
}
```
Then, to create a container, you first have to initialize an instance of a factory
that will handle the creation and initialization of containers.
```go
factory, err := libcontainer.New("/var/lib/container", libcontainer.Cgroupfs, libcontainer.InitArgs(os.Args[0], "init"))
if err != nil {
logrus.Fatal(err)
return
}
```
Once you have created an instance of the factory, you can build a configuration
struct describing how the container is to be created. A sample would look similar to this:
```go
defaultMountFlags := unix.MS_NOEXEC | unix.MS_NOSUID | unix.MS_NODEV
var devices []*configs.DeviceRule
for _, device := range specconv.AllowedDevices {
devices = append(devices, &device.Rule)
}
config := &configs.Config{
Rootfs: "/your/path/to/rootfs",
Capabilities: &configs.Capabilities{
Bounding: []string{
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE",
},
Effective: []string{
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE",
},
Permitted: []string{
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE",
},
Ambient: []string{
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE",
},
},
Namespaces: configs.Namespaces([]configs.Namespace{
{Type: configs.NEWNS},
{Type: configs.NEWUTS},
{Type: configs.NEWIPC},
{Type: configs.NEWPID},
{Type: configs.NEWUSER},
{Type: configs.NEWNET},
{Type: configs.NEWCGROUP},
}),
Cgroups: &configs.Cgroup{
Name: "test-container",
Parent: "system",
Resources: &configs.Resources{
MemorySwappiness: nil,
Devices: devices,
},
},
MaskPaths: []string{
"/proc/kcore",
"/sys/firmware",
},
ReadonlyPaths: []string{
"/proc/sys", "/proc/sysrq-trigger", "/proc/irq", "/proc/bus",
},
Devices: specconv.AllowedDevices,
Hostname: "testing",
Mounts: []*configs.Mount{
{
Source: "proc",
Destination: "/proc",
Device: "proc",
Flags: defaultMountFlags,
},
{
Source: "tmpfs",
Destination: "/dev",
Device: "tmpfs",
Flags: unix.MS_NOSUID | unix.MS_STRICTATIME,
Data: "mode=755",
},
{
Source: "devpts",
Destination: "/dev/pts",
Device: "devpts",
Flags: unix.MS_NOSUID | unix.MS_NOEXEC,
Data: "newinstance,ptmxmode=0666,mode=0620,gid=5",
},
{
Device: "tmpfs",
Source: "shm",
Destination: "/dev/shm",
Data: "mode=1777,size=65536k",
Flags: defaultMountFlags,
},
{
Source: "mqueue",
Destination: "/dev/mqueue",
Device: "mqueue",
Flags: defaultMountFlags,
},
{
Source: "sysfs",
Destination: "/sys",
Device: "sysfs",
Flags: defaultMountFlags | unix.MS_RDONLY,
},
},
UidMappings: []configs.IDMap{
{
ContainerID: 0,
HostID: 1000,
Size: 65536,
},
},
GidMappings: []configs.IDMap{
{
ContainerID: 0,
HostID: 1000,
Size: 65536,
},
},
Networks: []*configs.Network{
{
Type: "loopback",
Address: "127.0.0.1/0",
Gateway: "localhost",
},
},
Rlimits: []configs.Rlimit{
{
Type: unix.RLIMIT_NOFILE,
Hard: uint64(1025),
Soft: uint64(1025),
},
},
}
```
Once you have the configuration populated, you can create a container:
```go
container, err := factory.Create("container-id", config)
if err != nil {
logrus.Fatal(err)
return
}
```
To spawn bash as the initial process inside the container and have the
process's pid returned so you can wait on, signal, or kill the process:
```go
process := &libcontainer.Process{
Args: []string{"/bin/bash"},
Env: []string{"PATH=/bin"},
User: "daemon",
Stdin: os.Stdin,
Stdout: os.Stdout,
Stderr: os.Stderr,
Init: true,
}
err := container.Run(process)
if err != nil {
container.Destroy()
logrus.Fatal(err)
return
}
// wait for the process to finish.
_, err := process.Wait()
if err != nil {
logrus.Fatal(err)
}
// destroy the container.
container.Destroy()
```
Additional ways to interact with a running container are:
```go
// return all the pids for all processes running inside the container.
processes, err := container.Processes()
// get detailed cpu, memory, io, and network statistics for the container and
// its processes.
stats, err := container.Stats()
// pause all processes inside the container.
container.Pause()
// resume all paused processes.
container.Resume()
// send signal to container's init process.
container.Signal(signal)
// update container resource constraints.
container.Set(config)
// get current status of the container.
status, err := container.Status()
// get current container's state information.
state, err := container.State()
```
#### Checkpoint & Restore
libcontainer now integrates [CRIU](http://criu.org/) for checkpointing and restoring containers.
This lets you save the state of a process running inside a container to disk, and then restore
that state into a new process, on the same machine or on another machine.
`criu` version 1.5.2 or higher is required to use checkpoint and restore.
If you don't already have `criu` installed, you can build it from source, following the
[online instructions](http://criu.org/Installation). `criu` is also installed in the docker image
generated when building libcontainer with docker.
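For orientation only, here is a hedged sketch of a checkpoint/restore round trip with this API. It assumes a `container` obtained from the factory as in the examples above and follows the pre-1.2 libcontainer `Checkpoint`/`Restore` interface; the directories and option values are hypothetical.
```go
// Sketch: checkpoint a running container to disk, then restore it.
// Paths are illustrative; see CriuOpts for the full set of options.
opts := &libcontainer.CriuOpts{
	ImagesDirectory: "/var/lib/container/checkpoint", // where CRIU writes image files
	WorkDirectory:   "/var/lib/container/criu-work",  // CRIU logs and pid files
	LeaveRunning:    false,                           // stop the container after the dump
}
if err := container.Checkpoint(opts); err != nil {
	logrus.Fatal(err)
}

// Later, possibly on another machine that can see the same images directory:
process := &libcontainer.Process{Init: true}
if err := container.Restore(process, opts); err != nil {
	logrus.Fatal(err)
}
```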
## Copyright and license
Code and documentation copyright 2014 Docker, Inc.
The code and documentation are released under the [Apache 2.0 license](../LICENSE).
The documentation is also released under Creative Commons Attribution 4.0 International License.
You may obtain a copy of the license, titled CC-BY-4.0, at http://creativecommons.org/licenses/by/4.0/.

View File

@@ -1,465 +0,0 @@
## Container Specification - v1
This is the standard configuration for version 1 containers. It includes
namespaces, standard filesystem setup, a default Linux capability set, and
information about resource reservations. It also has information about any
populated environment settings for the processes running inside a container.
Along with the configuration of how a container is created the standard also
discusses actions that can be performed on a container to manage and inspect
information about the processes running inside.
The v1 profile is meant to be able to accommodate the majority of applications
with a strong security configuration.
### System Requirements and Compatibility
Minimum requirements:
* Kernel version - 3.10 recommended, 2.6.2x minimum (with backported patches)
* Mounted cgroups with each subsystem in its own hierarchy
### Namespaces
| Flag | Enabled |
| --------------- | ------- |
| CLONE_NEWPID | 1 |
| CLONE_NEWUTS | 1 |
| CLONE_NEWIPC | 1 |
| CLONE_NEWNET | 1 |
| CLONE_NEWNS | 1 |
| CLONE_NEWUSER | 1 |
| CLONE_NEWCGROUP | 1 |
Namespaces are created for the container via the `unshare` syscall.
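As an informal illustration (not part of the specification), the sketch below uses `golang.org/x/sys/unix` to unshare a few of the namespaces from the table above; it needs sufficient privileges (CAP_SYS_ADMIN or a user namespace) and only affects the calling process.
```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Detach this process into new UTS, IPC and mount namespaces.
	flags := unix.CLONE_NEWUTS | unix.CLONE_NEWIPC | unix.CLONE_NEWNS
	if err := unix.Unshare(flags); err != nil {
		fmt.Println("unshare failed:", err)
		return
	}
	fmt.Println("now running in new UTS, IPC and mount namespaces")
}
```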
### Filesystem
A root filesystem must be provided to a container for execution. The container
will use this root filesystem (rootfs) to jail and spawn processes inside of it,
where the binaries and system libraries are local to that directory. Any binaries
to be executed must be contained within this rootfs.
Mounts that happen inside the container are automatically cleaned up when the
container exits, as the mount namespace is destroyed and the kernel will
unmount all the mounts that were set up within that namespace.
For a container to execute properly there are certain filesystems that
are required to be mounted within the rootfs that the runtime will setup.
| Path | Type | Flags | Data |
| ----------- | ------ | -------------------------------------- | ---------------------------------------- |
| /proc | proc | MS_NOEXEC,MS_NOSUID,MS_NODEV | |
| /dev | tmpfs | MS_NOEXEC,MS_STRICTATIME | mode=755 |
| /dev/shm | tmpfs | MS_NOEXEC,MS_NOSUID,MS_NODEV | mode=1777,size=65536k |
| /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV | |
| /dev/pts | devpts | MS_NOEXEC,MS_NOSUID | newinstance,ptmxmode=0666,mode=620,gid=5 |
| /sys | sysfs | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY | |
After a container's filesystems are mounted within the newly created
mount namespace, `/dev` will need to be populated with a set of device nodes.
It is expected that a rootfs does not need to have any device nodes specified
for `/dev` within the rootfs, as the container will set up the correct devices
that are required for executing the container's process.
| Path | Mode | Access |
| ------------ | ---- | ---------- |
| /dev/null | 0666 | rwm |
| /dev/zero | 0666 | rwm |
| /dev/full | 0666 | rwm |
| /dev/tty | 0666 | rwm |
| /dev/random | 0666 | rwm |
| /dev/urandom | 0666 | rwm |
**ptmx**
`/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within
the container.
The use of a pseudo TTY is optional within a container, and a container should support both cases.
If a pseudo TTY is provided to the container, `/dev/console` will need to be
set up by bind-mounting the console into `/dev/` after it has been populated and mounted
in tmpfs.
| Source | Destination | UID GID | Mode | Type |
| --------------- | ------------ | ------- | ---- | ---- |
| *pty host path* | /dev/console | 0 0 | 0600 | bind |
After `/dev/null` has been set up, we check for any external links between
the container's io (STDIN, STDOUT, STDERR). If the container's io is pointing
to `/dev/null` outside the container, we close it and `dup2` the `/dev/null`
that is local to the container's rootfs.
After the container has `/proc` mounted, a few standard symlinks are set up
within `/dev/` for the io.
| Source | Destination |
| --------------- | ----------- |
| /proc/self/fd | /dev/fd |
| /proc/self/fd/0 | /dev/stdin |
| /proc/self/fd/1 | /dev/stdout |
| /proc/self/fd/2 | /dev/stderr |
A `pivot_root` is used to change the root for the process, effectively
jailing the process inside the rootfs.
```c
put_old = mkdir(...);
pivot_root(rootfs, put_old);
chdir("/");
unmount(put_old, MS_DETACH);
rmdir(put_old);
```
For containers running with a rootfs inside `ramfs`, a `MS_MOVE` combined
with a `chroot` is required, as `pivot_root` is not supported in `ramfs`.
```c
mount(rootfs, "/", NULL, MS_MOVE, NULL);
chroot(".");
chdir("/");
```
The `umask` is set back to `0022` after the filesystem setup has been completed.
### Resources
Cgroups are used to handle resource allocation for containers. This includes
system resources like cpu, memory, and device access.
| Subsystem | Enabled |
| ---------- | ------- |
| devices | 1 |
| memory | 1 |
| cpu | 1 |
| cpuacct | 1 |
| cpuset | 1 |
| blkio | 1 |
| perf_event | 1 |
| freezer | 1 |
| hugetlb | 1 |
| pids | 1 |
All cgroup subsystems are joined so that statistics can be collected from
each of them. Freezer does not expose any stats but is joined
so that containers can be paused and resumed.
The parent process of the container's init must place the init pid inside
the correct cgroups before initialization begins. This is done so
that no processes or threads escape the cgroups. This synchronization is
done via a pipe (specified in the runtime section below) on which the container's
init process blocks, waiting for the parent to finish setup.
### IntelRdt
Intel platforms with newer Xeon CPUs support Resource Director Technology (RDT).
Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA) are
two sub-features of RDT.
Cache Allocation Technology (CAT) provides a way for the software to restrict
cache allocation to a defined 'subset' of L3 cache which may be overlapping
with other 'subsets'. The different subsets are identified by class of
service (CLOS) and each CLOS has a capacity bitmask (CBM).
Memory Bandwidth Allocation (MBA) provides indirect and approximate throttle
over memory bandwidth for the software. A user controls the resource by
indicating the percentage of maximum memory bandwidth or memory bandwidth limit
in MBps unit if MBA Software Controller is enabled.
It can be used to handle L3 cache and memory bandwidth resources allocation
for containers if hardware and kernel support Intel RDT CAT and MBA features.
In Linux kernel 4.10 or newer, the interface is defined and exposed via the
"resource control" filesystem, which is a "cgroup-like" interface.
Compared with cgroups, it has a similar process management lifecycle and
interfaces in a container, but unlike the cgroups hierarchy, it has a
single-level filesystem layout.
The CAT and MBA features were introduced in the Linux 4.10 and 4.12 kernels,
respectively, via the "resource control" filesystem.
Intel RDT "resource control" filesystem hierarchy:
```
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| | |-- cbm_mask
| | |-- min_cbm_bits
| | |-- num_closids
| |-- MB
| |-- bandwidth_gran
| |-- delay_linear
| |-- min_bandwidth
| |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
|-- ...
|-- schemata
|-- tasks
```
For runc, we can make use of the `tasks` and `schemata` configuration for L3
cache and memory bandwidth resource constraints.
The file `tasks` has a list of tasks that belong to this group (e.g., the
"<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent.
The file `schemata` has a list of all the resources available to this group.
Each resource (L3 cache, memory bandwidth) has its own line and format.
L3 cache schema:
It has allocation bitmasks/values for L3 cache on each socket, which
contains L3 cache id and capacity bitmask (CBM).
```
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
```
For example, on a two-socket machine, the schema line could be "L3:0=ff;1=c0"
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
A valid L3 cache CBM is a *contiguous set of bits*, and the number of bits that
can be set is bounded by the maximum CBM length. The maximum number of bits in
the CBM varies among supported Intel CPU models, and the kernel checks validity
when writing. For example, the default value 0xfffff in the root group indicates
that the maximum CBM length is 20 bits, which maps to the entire L3 cache capacity.
Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00, etc.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which contains
L3 cache id and memory bandwidth.
```
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
```
For example, on a two-socket machine, the schema line could be "MB:0=20;1=70".
The minimum bandwidth percentage value for each CPU model is predefined and
can be looked up through "info/MB/min_bandwidth". The bandwidth granularity
that is allocated is also dependent on the CPU model and can be looked up at
"info/MB/bandwidth_gran". The available bandwidth control steps are:
min_bw + N * bw_gran. Intermediate values are rounded to the next control
step available on the hardware.
If the MBA Software Controller is enabled through the mount option "-o mba_MBps"
(`mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl`), we can specify memory
bandwidth in "MBps" (megabytes per second) units instead of percentages. The
kernel underneath uses a software feedback mechanism, or "Software Controller",
which reads the actual bandwidth using MBM counters and adjusts the memory
bandwidth percentages to ensure
"actual memory bandwidth < user-specified memory bandwidth".
For example, on a two-socket machine, the schema line could be
"MB:0=5000;1=7000" which means 5000 MBps memory bandwidth limit on socket 0
and 7000 MBps memory bandwidth limit on socket 1.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
```
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0x7ff and the max CBM length is 11 bits, and minimum memory bandwidth of 10%
with a memory bandwidth granularity of 10%.
Tasks inside the container only have access to the "upper" 7/11 of L3 cache
on socket 0 and the "lower" 5/11 L3 cache on socket 1, and may use a
maximum memory bandwidth of 20% on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"closID": "guaranteed_group",
"l3CacheSchema": "L3:0=7f0;1=1f",
"memBwSchema": "MB:0=20;1=70"
}
}
```
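As an informal illustration (not part of runc), placing a task into a resctrl group and writing its schemata from Go is ordinary file I/O over the hierarchy shown above; the group name, pid, and schema line below are hypothetical.
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical group for one container; assumes resctrl is mounted at /sys/fs/resctrl.
	group := filepath.Join("/sys/fs/resctrl", "container_1")
	if err := os.MkdirAll(group, 0o755); err != nil {
		fmt.Println("create group:", err)
		return
	}
	// Move a task into the group by writing its pid to the tasks file.
	if err := os.WriteFile(filepath.Join(group, "tasks"), []byte("12345"), 0o644); err != nil {
		fmt.Println("write tasks:", err)
		return
	}
	// Constrain L3 cache and memory bandwidth with schemata lines as described above.
	schema := "L3:0=7f0;1=1f\nMB:0=20;1=70\n"
	if err := os.WriteFile(filepath.Join(group, "schemata"), []byte(schema), 0o644); err != nil {
		fmt.Println("write schemata:", err)
	}
}
```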
### Security
The standard set of Linux capabilities that are set in a container
provides a good default for security and flexibility for applications.
| Capability | Enabled |
| -------------------- | ------- |
| CAP_NET_RAW | 1 |
| CAP_NET_BIND_SERVICE | 1 |
| CAP_AUDIT_READ | 1 |
| CAP_AUDIT_WRITE | 1 |
| CAP_DAC_OVERRIDE | 1 |
| CAP_SETFCAP | 1 |
| CAP_SETPCAP | 1 |
| CAP_SETGID | 1 |
| CAP_SETUID | 1 |
| CAP_MKNOD | 1 |
| CAP_CHOWN | 1 |
| CAP_FOWNER | 1 |
| CAP_FSETID | 1 |
| CAP_KILL | 1 |
| CAP_SYS_CHROOT | 1 |
| CAP_NET_BROADCAST | 0 |
| CAP_SYS_MODULE | 0 |
| CAP_SYS_RAWIO | 0 |
| CAP_SYS_PACCT | 0 |
| CAP_SYS_ADMIN | 0 |
| CAP_SYS_NICE | 0 |
| CAP_SYS_RESOURCE | 0 |
| CAP_SYS_TIME | 0 |
| CAP_SYS_TTY_CONFIG | 0 |
| CAP_AUDIT_CONTROL | 0 |
| CAP_MAC_OVERRIDE | 0 |
| CAP_MAC_ADMIN | 0 |
| CAP_NET_ADMIN | 0 |
| CAP_SYSLOG | 0 |
| CAP_DAC_READ_SEARCH | 0 |
| CAP_LINUX_IMMUTABLE | 0 |
| CAP_IPC_LOCK | 0 |
| CAP_IPC_OWNER | 0 |
| CAP_SYS_PTRACE | 0 |
| CAP_SYS_BOOT | 0 |
| CAP_LEASE | 0 |
| CAP_WAKE_ALARM | 0 |
| CAP_BLOCK_SUSPEND | 0 |
Additional security layers like [apparmor](https://wiki.ubuntu.com/AppArmor)
and [selinux](http://selinuxproject.org/page/Main_Page) can be used with
the containers. A container should support setting an apparmor profile or
selinux process and mount labels if provided in the configuration.
Standard apparmor profile:
```c
#include <tunables/global>
profile <profile_name> flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
network,
capability,
file,
umount,
deny @{PROC}/sys/fs/** wklx,
deny @{PROC}/sysrq-trigger rwklx,
deny @{PROC}/mem rwklx,
deny @{PROC}/kmem rwklx,
deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx,
deny @{PROC}/sys/kernel/*/** wklx,
deny mount,
deny /sys/[^f]*/** wklx,
deny /sys/f[^s]*/** wklx,
deny /sys/fs/[^c]*/** wklx,
deny /sys/fs/c[^g]*/** wklx,
deny /sys/fs/cg[^r]*/** wklx,
deny /sys/firmware/efi/efivars/** rwklx,
deny /sys/kernel/security/** rwklx,
}
```
*TODO: seccomp work is being done to find a good default config*
### Runtime and Init Process
During container creation the parent process needs to talk to the container's init
process and have a form of synchronization. This is accomplished by creating
a pipe that is passed to the container's init. When the init process first spawns,
it will block on its side of the pipe until the parent closes its side. This
gives the parent time to place the new process inside a cgroup hierarchy
and/or write any uid/gid mappings required for user namespaces.
The pipe is passed to the init process via FD 3.
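A minimal sketch of the init-process side of that handshake, assuming the synchronization pipe arrives as FD 3 as stated above (the parent side and error reporting are omitted; uses the standard `io` and `os` packages):
```go
// initWaitForParent blocks until the parent closes its end of the
// synchronization pipe passed to the init process as FD 3.
func initWaitForParent() error {
	syncPipe := os.NewFile(uintptr(3), "sync-pipe")
	defer syncPipe.Close()
	// ReadAll returns once the parent closes the write end, i.e. after the
	// cgroup placement and uid/gid mappings have been written.
	_, err := io.ReadAll(syncPipe)
	return err
}
```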
The application consuming libcontainer should be compiled statically. libcontainer
does not define any init process, and the arguments provided are used to `exec` the
process inside the application. There should be no long-running init within the
container spec.
If a pseudo tty is provided to a container, it will open and `dup2` the console
as the container's STDIN, STDOUT, and STDERR, as well as mounting the console
as `/dev/console`.
An extra set of mounts is provided to a container and set up for use. A container's
rootfs can contain some non-portable files that can cause side effects during
execution of a process. These files are usually created and populated with
container-specific information by the runtime.
**Extra runtime files:**
* /etc/hosts
* /etc/resolv.conf
* /etc/hostname
* /etc/localtime
#### Defaults
There are a few defaults that can be overridden by users; when omitted,
the following apply to processes within a container.
| Type | Value |
| ------------------- | ------------------------------ |
| Parent Death Signal | SIGKILL |
| UID | 0 |
| GID | 0 |
| GROUPS | 0, NULL |
| CWD | "/" |
| $HOME | Current user's home dir or "/" |
| Readonly rootfs | false |
| Pseudo TTY | false |
## Actions
After a container is created there is a standard set of actions that can
be done to the container. These actions are part of the public API for
a container.
| Action | Description |
| -------------- | ------------------------------------------------------------------ |
| Get processes | Return all the pids for processes running inside a container |
| Get Stats | Return resource statistics for the container as a whole |
| Wait | Waits on the container's init process ( pid 1 ) |
| Wait Process | Wait on any of the container's processes returning the exit status |
| Destroy | Kill the container's init process and remove any filesystem state |
| Signal | Send a signal to the container's init process |
| Signal Process | Send a signal to any of the container's processes |
| Pause | Pause all processes inside the container |
| Resume | Resume all processes inside the container if paused |
| Exec | Execute a new process inside of the container ( requires setns ) |
| Set | Setup configs of the container after it's created |
### Execute a new process inside of a running container
A user can execute a new process inside a running container. Any binaries to be
executed must be accessible within the container's rootfs.
The started process will run inside the container's rootfs. Any changes
made by the process to the container's filesystem will persist after the
process finishes executing.
The started process will join all the container's existing namespaces. When the
container is paused, the process will also be paused and will resume when
the container is unpaused. The started process will only run while the container's
primary process (PID 1) is running, and will not be restarted when the container
is restarted.
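As a hedged sketch using the libcontainer API shown earlier (the command and environment are arbitrary), an additional process is started with the same `Process` struct but with `Init` left false, so it joins the running container instead of becoming its init:
```go
// Sketch: run an extra process inside an already-running container.
// Init is false, so the process joins the container's namespaces and cgroups
// rather than becoming its init (PID 1).
ps := &libcontainer.Process{
	Args:   []string{"/bin/ps", "aux"},
	Env:    []string{"PATH=/bin:/usr/bin"},
	Stdin:  os.Stdin,
	Stdout: os.Stdout,
	Stderr: os.Stderr,
	Init:   false,
}
if err := container.Run(ps); err != nil {
	logrus.Fatal(err)
}
if _, err := ps.Wait(); err != nil {
	logrus.Fatal(err)
}
```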
#### Planned additions
The started process will have its own cgroups nested inside the container's
cgroups. This is used for process tracking and, optionally, resource allocation
handling for the new process. The freezer cgroup is required; the rest of the cgroups
are optional. The process executor must place its pid inside the correct
cgroups before starting the process. This is done so that no child processes or
threads can escape the cgroups.
When the process is stopped, the process executor will try (in a best-effort way)
to stop all its children and remove the sub-cgroups.

View File

@@ -1,16 +0,0 @@
package apparmor
import "errors"
var (
// IsEnabled returns true if apparmor is enabled for the host.
IsEnabled = isEnabled
// ApplyProfile will apply the profile with the specified name to the process after
// the next exec. It is only supported on Linux and produces an ErrApparmorNotEnabled
// on other platforms.
ApplyProfile = applyProfile
// ErrApparmorNotEnabled indicates that AppArmor is not enabled or not supported.
ErrApparmorNotEnabled = errors.New("apparmor: config provided but apparmor not supported")
)

View File

@@ -1,68 +0,0 @@
package apparmor
import (
"errors"
"fmt"
"os"
"sync"
"github.com/opencontainers/runc/libcontainer/utils"
)
var (
appArmorEnabled bool
checkAppArmor sync.Once
)
// isEnabled returns true if apparmor is enabled for the host.
func isEnabled() bool {
checkAppArmor.Do(func() {
if _, err := os.Stat("/sys/kernel/security/apparmor"); err == nil {
buf, err := os.ReadFile("/sys/module/apparmor/parameters/enabled")
appArmorEnabled = err == nil && len(buf) > 1 && buf[0] == 'Y'
}
})
return appArmorEnabled
}
func setProcAttr(attr, value string) error {
// Under AppArmor you can only change your own attr, so use /proc/self/
// instead of /proc/<tid>/ like libapparmor does
attrPath := "/proc/self/attr/apparmor/" + attr
if _, err := os.Stat(attrPath); errors.Is(err, os.ErrNotExist) {
// fall back to the old convention
attrPath = "/proc/self/attr/" + attr
}
f, err := os.OpenFile(attrPath, os.O_WRONLY, 0)
if err != nil {
return err
}
defer f.Close()
if err := utils.EnsureProcHandle(f); err != nil {
return err
}
_, err = f.WriteString(value)
return err
}
// changeOnExec reimplements aa_change_onexec from libapparmor in Go
func changeOnExec(name string) error {
if err := setProcAttr("exec", "exec "+name); err != nil {
return fmt.Errorf("apparmor failed to apply profile: %w", err)
}
return nil
}
// applyProfile will apply the profile with the specified name to the process after
// the next exec. It is only supported on Linux and produces an error on other
// platforms.
func applyProfile(name string) error {
if name == "" {
return nil
}
return changeOnExec(name)
}

View File

@@ -1,15 +0,0 @@
//go:build !linux
// +build !linux
package apparmor
func isEnabled() bool {
return false
}
func applyProfile(name string) error {
if name != "" {
return ErrApparmorNotEnabled
}
return nil
}

View File

@@ -1,123 +0,0 @@
//go:build linux
// +build linux
package capabilities
import (
"sort"
"strings"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/sirupsen/logrus"
"github.com/syndtr/gocapability/capability"
)
const allCapabilityTypes = capability.CAPS | capability.BOUNDING | capability.AMBIENT
var (
capabilityMap map[string]capability.Cap
capTypes = []capability.CapType{
capability.BOUNDING,
capability.PERMITTED,
capability.INHERITABLE,
capability.EFFECTIVE,
capability.AMBIENT,
}
)
func init() {
capabilityMap = make(map[string]capability.Cap, capability.CAP_LAST_CAP+1)
for _, c := range capability.List() {
if c > capability.CAP_LAST_CAP {
continue
}
capabilityMap["CAP_"+strings.ToUpper(c.String())] = c
}
}
// KnownCapabilities returns the list of the known capabilities.
// Used by `runc features`.
func KnownCapabilities() []string {
list := capability.List()
res := make([]string, len(list))
for i, c := range list {
res[i] = "CAP_" + strings.ToUpper(c.String())
}
return res
}
// New creates a new Caps from the given Capabilities config. Unknown Capabilities
// or Capabilities that are unavailable in the current environment are ignored,
// printing a warning instead.
func New(capConfig *configs.Capabilities) (*Caps, error) {
var (
err error
c Caps
)
unknownCaps := make(map[string]struct{})
c.caps = map[capability.CapType][]capability.Cap{
capability.BOUNDING: capSlice(capConfig.Bounding, unknownCaps),
capability.EFFECTIVE: capSlice(capConfig.Effective, unknownCaps),
capability.INHERITABLE: capSlice(capConfig.Inheritable, unknownCaps),
capability.PERMITTED: capSlice(capConfig.Permitted, unknownCaps),
capability.AMBIENT: capSlice(capConfig.Ambient, unknownCaps),
}
if c.pid, err = capability.NewPid2(0); err != nil {
return nil, err
}
if err = c.pid.Load(); err != nil {
return nil, err
}
if len(unknownCaps) > 0 {
logrus.Warn("ignoring unknown or unavailable capabilities: ", mapKeys(unknownCaps))
}
return &c, nil
}
// capSlice converts the slice of capability names in caps, to their numeric
// equivalent, and returns them as a slice. Unknown or unavailable capabilities
// are not returned, but appended to unknownCaps.
func capSlice(caps []string, unknownCaps map[string]struct{}) []capability.Cap {
var out []capability.Cap
for _, c := range caps {
if v, ok := capabilityMap[c]; !ok {
unknownCaps[c] = struct{}{}
} else {
out = append(out, v)
}
}
return out
}
// mapKeys returns the keys of input in sorted order
func mapKeys(input map[string]struct{}) []string {
var keys []string
for c := range input {
keys = append(keys, c)
}
sort.Strings(keys)
return keys
}
// Caps holds the capabilities for a container.
type Caps struct {
pid capability.Capabilities
caps map[capability.CapType][]capability.Cap
}
// ApplyBoundingSet sets the capability bounding set to those specified in the whitelist.
func (c *Caps) ApplyBoundingSet() error {
c.pid.Clear(capability.BOUNDING)
c.pid.Set(capability.BOUNDING, c.caps[capability.BOUNDING]...)
return c.pid.Apply(capability.BOUNDING)
}
// ApplyCaps sets all the capabilities for the current process from the config.
func (c *Caps) ApplyCaps() error {
c.pid.Clear(allCapabilityTypes)
for _, g := range capTypes {
c.pid.Set(g, c.caps[g]...)
}
return c.pid.Apply(allCapabilityTypes)
}

View File

@@ -1,4 +0,0 @@
//go:build !linux
// +build !linux
package capabilities

View File

@@ -1,86 +0,0 @@
package validate
import (
"errors"
"fmt"
"strings"
"github.com/opencontainers/runc/libcontainer/configs"
)
// rootlessEUID makes sure that the config can be applied when runc
// is being executed as a non-root user (euid != 0) in the current user namespace.
func (v *ConfigValidator) rootlessEUID(config *configs.Config) error {
if !config.RootlessEUID {
return nil
}
if err := rootlessEUIDMappings(config); err != nil {
return err
}
if err := rootlessEUIDMount(config); err != nil {
return err
}
// XXX: We currently can't verify the user config at all, because
// configs.Config doesn't store the user-related configs. So this
// has to be verified by setupUser() in init_linux.go.
return nil
}
func rootlessEUIDMappings(config *configs.Config) error {
if !config.Namespaces.Contains(configs.NEWUSER) {
return errors.New("rootless container requires user namespaces")
}
// We only require mappings if we are not joining another userns.
if path := config.Namespaces.PathOf(configs.NEWUSER); path == "" {
if len(config.UidMappings) == 0 {
return errors.New("rootless containers requires at least one UID mapping")
}
if len(config.GidMappings) == 0 {
return errors.New("rootless containers requires at least one GID mapping")
}
}
return nil
}
// rootlessEUIDMount verifies that the user isn't trying to set up any mounts they don't have
// the rights to do. In addition, it makes sure that no mount has a `uid=` or
// `gid=` option that doesn't resolve to root.
func rootlessEUIDMount(config *configs.Config) error {
// XXX: We could whitelist allowed devices at this point, but I'm not
// convinced that's a good idea. The kernel is the best arbiter of
// access control.
for _, mount := range config.Mounts {
// Check that the options list doesn't contain any uid= or gid= entries
// that don't resolve to root.
for _, opt := range strings.Split(mount.Data, ",") {
if strings.HasPrefix(opt, "uid=") {
var uid int
n, err := fmt.Sscanf(opt, "uid=%d", &uid)
if n != 1 || err != nil {
// Ignore unknown mount options.
continue
}
if _, err := config.HostUID(uid); err != nil {
return fmt.Errorf("cannot specify uid=%d mount option for rootless container: %w", uid, err)
}
}
if strings.HasPrefix(opt, "gid=") {
var gid int
n, err := fmt.Sscanf(opt, "gid=%d", &gid)
if n != 1 || err != nil {
// Ignore unknown mount options.
continue
}
if _, err := config.HostGID(gid); err != nil {
return fmt.Errorf("cannot specify gid=%d mount option for rootless container: %w", gid, err)
}
}
}
}
return nil
}

View File

@@ -1,306 +0,0 @@
package validate
import (
"errors"
"fmt"
"os"
"path/filepath"
"strings"
"sync"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/intelrdt"
selinux "github.com/opencontainers/selinux/go-selinux"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
type Validator interface {
Validate(*configs.Config) error
}
func New() Validator {
return &ConfigValidator{}
}
type ConfigValidator struct{}
type check func(config *configs.Config) error
func (v *ConfigValidator) Validate(config *configs.Config) error {
checks := []check{
v.cgroups,
v.rootfs,
v.network,
v.hostname,
v.security,
v.usernamespace,
v.cgroupnamespace,
v.sysctl,
v.intelrdt,
v.rootlessEUID,
}
for _, c := range checks {
if err := c(config); err != nil {
return err
}
}
// Relaxed validation rules for backward compatibility
warns := []check{
v.mounts, // TODO (runc v1.x.x): make this an error instead of a warning
}
for _, c := range warns {
if err := c(config); err != nil {
logrus.WithError(err).Warn("invalid configuration")
}
}
return nil
}
// rootfs validates if the rootfs is an absolute path and is not a symlink
// to the container's root filesystem.
func (v *ConfigValidator) rootfs(config *configs.Config) error {
if _, err := os.Stat(config.Rootfs); err != nil {
return fmt.Errorf("invalid rootfs: %w", err)
}
cleaned, err := filepath.Abs(config.Rootfs)
if err != nil {
return fmt.Errorf("invalid rootfs: %w", err)
}
if cleaned, err = filepath.EvalSymlinks(cleaned); err != nil {
return fmt.Errorf("invalid rootfs: %w", err)
}
if filepath.Clean(config.Rootfs) != cleaned {
return errors.New("invalid rootfs: not an absolute path, or a symlink")
}
return nil
}
func (v *ConfigValidator) network(config *configs.Config) error {
if !config.Namespaces.Contains(configs.NEWNET) {
if len(config.Networks) > 0 || len(config.Routes) > 0 {
return errors.New("unable to apply network settings without a private NET namespace")
}
}
return nil
}
func (v *ConfigValidator) hostname(config *configs.Config) error {
if config.Hostname != "" && !config.Namespaces.Contains(configs.NEWUTS) {
return errors.New("unable to set hostname without a private UTS namespace")
}
return nil
}
func (v *ConfigValidator) security(config *configs.Config) error {
// restrict sys without mount namespace
if (len(config.MaskPaths) > 0 || len(config.ReadonlyPaths) > 0) &&
!config.Namespaces.Contains(configs.NEWNS) {
return errors.New("unable to restrict sys entries without a private MNT namespace")
}
if config.ProcessLabel != "" && !selinux.GetEnabled() {
return errors.New("selinux label is specified in config, but selinux is disabled or not supported")
}
return nil
}
func (v *ConfigValidator) usernamespace(config *configs.Config) error {
if config.Namespaces.Contains(configs.NEWUSER) {
if _, err := os.Stat("/proc/self/ns/user"); os.IsNotExist(err) {
return errors.New("user namespaces aren't enabled in the kernel")
}
hasPath := config.Namespaces.PathOf(configs.NEWUSER) != ""
hasMappings := config.UidMappings != nil || config.GidMappings != nil
if !hasPath && !hasMappings {
return errors.New("user namespaces enabled, but no namespace path to join nor mappings to apply specified")
}
// The hasPath && hasMappings validation case is handled in specconv --
// we cache the mappings in Config during specconv in the hasPath case,
// so we cannot do that validation here.
} else {
if config.UidMappings != nil || config.GidMappings != nil {
return errors.New("user namespace mappings specified, but user namespace isn't enabled in the config")
}
}
return nil
}
func (v *ConfigValidator) cgroupnamespace(config *configs.Config) error {
if config.Namespaces.Contains(configs.NEWCGROUP) {
if _, err := os.Stat("/proc/self/ns/cgroup"); os.IsNotExist(err) {
return errors.New("cgroup namespaces aren't enabled in the kernel")
}
}
return nil
}
// convertSysctlVariableToDotsSeparator can return sysctl variables in dots separator format.
// The '/' separator is also accepted in place of a '.'.
// Convert the sysctl variables to dots separator format for validation.
// More info: sysctl(8), sysctl.d(5).
//
// For example:
// Input sysctl variable "net/ipv4/conf/eno2.100.rp_filter"
// will return the converted value "net.ipv4.conf.eno2/100.rp_filter"
func convertSysctlVariableToDotsSeparator(val string) string {
if val == "" {
return val
}
firstSepIndex := strings.IndexAny(val, "./")
if firstSepIndex == -1 || val[firstSepIndex] == '.' {
return val
}
f := func(r rune) rune {
switch r {
case '.':
return '/'
case '/':
return '.'
}
return r
}
return strings.Map(f, val)
}
// sysctl validates whether the specified sysctl keys are valid.
// /proc/sys isn't completely namespaced and depending on which namespaces
// are specified, a subset of sysctls are permitted.
func (v *ConfigValidator) sysctl(config *configs.Config) error {
validSysctlMap := map[string]bool{
"kernel.msgmax": true,
"kernel.msgmnb": true,
"kernel.msgmni": true,
"kernel.sem": true,
"kernel.shmall": true,
"kernel.shmmax": true,
"kernel.shmmni": true,
"kernel.shm_rmid_forced": true,
}
var (
netOnce sync.Once
hostnet bool
hostnetErr error
)
for s := range config.Sysctl {
s := convertSysctlVariableToDotsSeparator(s)
if validSysctlMap[s] || strings.HasPrefix(s, "fs.mqueue.") {
if config.Namespaces.Contains(configs.NEWIPC) {
continue
} else {
return fmt.Errorf("sysctl %q is not allowed in the hosts ipc namespace", s)
}
}
if strings.HasPrefix(s, "net.") {
// Is container using host netns?
// Here "host" means "current", not "initial".
netOnce.Do(func() {
if !config.Namespaces.Contains(configs.NEWNET) {
hostnet = true
return
}
path := config.Namespaces.PathOf(configs.NEWNET)
if path == "" {
// own netns, so hostnet = false
return
}
hostnet, hostnetErr = isHostNetNS(path)
})
if hostnetErr != nil {
return fmt.Errorf("invalid netns path: %w", hostnetErr)
}
if hostnet {
return fmt.Errorf("sysctl %q not allowed in host network namespace", s)
}
continue
}
if config.Namespaces.Contains(configs.NEWUTS) {
switch s {
case "kernel.domainname":
// This is namespaced and there's no explicit OCI field for it.
continue
case "kernel.hostname":
// This is namespaced but there's a conflicting (dedicated) OCI field for it.
return fmt.Errorf("sysctl %q is not allowed as it conflicts with the OCI %q field", s, "hostname")
}
}
return fmt.Errorf("sysctl %q is not in a separate kernel namespace", s)
}
return nil
}
func (v *ConfigValidator) intelrdt(config *configs.Config) error {
if config.IntelRdt != nil {
if config.IntelRdt.ClosID == "." || config.IntelRdt.ClosID == ".." || strings.Contains(config.IntelRdt.ClosID, "/") {
return fmt.Errorf("invalid intelRdt.ClosID %q", config.IntelRdt.ClosID)
}
if !intelrdt.IsCATEnabled() && config.IntelRdt.L3CacheSchema != "" {
return errors.New("intelRdt.l3CacheSchema is specified in config, but Intel RDT/CAT is not enabled")
}
if !intelrdt.IsMBAEnabled() && config.IntelRdt.MemBwSchema != "" {
return errors.New("intelRdt.memBwSchema is specified in config, but Intel RDT/MBA is not enabled")
}
}
return nil
}
func (v *ConfigValidator) cgroups(config *configs.Config) error {
c := config.Cgroups
if c == nil {
return nil
}
if (c.Name != "" || c.Parent != "") && c.Path != "" {
return fmt.Errorf("cgroup: either Path or Name and Parent should be used, got %+v", c)
}
r := c.Resources
if r == nil {
return nil
}
if !cgroups.IsCgroup2UnifiedMode() && r.Unified != nil {
return cgroups.ErrV1NoUnified
}
if cgroups.IsCgroup2UnifiedMode() {
_, err := cgroups.ConvertMemorySwapToCgroupV2Value(r.MemorySwap, r.Memory)
if err != nil {
return err
}
}
return nil
}
func (v *ConfigValidator) mounts(config *configs.Config) error {
for _, m := range config.Mounts {
if !filepath.IsAbs(m.Destination) {
return fmt.Errorf("invalid mount %+v: mount destination not absolute", m)
}
}
return nil
}
func isHostNetNS(path string) (bool, error) {
const currentProcessNetns = "/proc/self/ns/net"
var st1, st2 unix.Stat_t
if err := unix.Stat(currentProcessNetns, &st1); err != nil {
return false, &os.PathError{Op: "stat", Path: currentProcessNetns, Err: err}
}
if err := unix.Stat(path, &st2); err != nil {
return false, &os.PathError{Op: "stat", Path: path, Err: err}
}
return (st1.Dev == st2.Dev) && (st1.Ino == st2.Ino), nil
}

View File

@@ -1,41 +0,0 @@
package libcontainer
import (
"os"
"golang.org/x/sys/unix"
)
// mountConsole initializes the console inside the rootfs, mounting with the specified mount label
// and applying the correct ownership of the console.
func mountConsole(slavePath string) error {
oldMask := unix.Umask(0o000)
defer unix.Umask(oldMask)
f, err := os.Create("/dev/console")
if err != nil && !os.IsExist(err) {
return err
}
if f != nil {
f.Close()
}
return mount(slavePath, "/dev/console", "", "bind", unix.MS_BIND, "")
}
// dupStdio opens the slavePath for the console and dups the fds to the current
// process's stdio, fd 0,1,2.
func dupStdio(slavePath string) error {
fd, err := unix.Open(slavePath, unix.O_RDWR, 0)
if err != nil {
return &os.PathError{
Op: "open",
Path: slavePath,
Err: err,
}
}
for _, i := range []int{0, 1, 2} {
if err := unix.Dup3(fd, i, 0); err != nil {
return err
}
}
return nil
}

View File

@@ -1,130 +0,0 @@
// Package libcontainer provides a native Go implementation for creating containers
// with namespaces, cgroups, capabilities, and filesystem access controls.
// It allows you to manage the lifecycle of the container and to perform additional
// operations after the container is created.
package libcontainer
import (
"os"
"time"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runtime-spec/specs-go"
)
// Status is the status of a container.
type Status int
const (
// Created is the status that denotes the container exists but has not been run yet.
Created Status = iota
// Running is the status that denotes the container exists and is running.
Running
// Pausing is the status that denotes the container exists, it is in the process of being paused.
Pausing
// Paused is the status that denotes the container exists, but all its processes are paused.
Paused
// Stopped is the status that denotes the container does not have a created or running process.
Stopped
)
func (s Status) String() string {
switch s {
case Created:
return "created"
case Running:
return "running"
case Pausing:
return "pausing"
case Paused:
return "paused"
case Stopped:
return "stopped"
default:
return "unknown"
}
}
// BaseState represents the platform agnostic pieces relating to a
// running container's state
type BaseState struct {
// ID is the container ID.
ID string `json:"id"`
// InitProcessPid is the init process id in the parent namespace.
InitProcessPid int `json:"init_process_pid"`
// InitProcessStartTime is the init process start time in clock cycles since boot time.
InitProcessStartTime uint64 `json:"init_process_start"`
// Created is the unix timestamp for the creation time of the container in UTC
Created time.Time `json:"created"`
// Config is the container's configuration.
Config configs.Config `json:"config"`
}
// BaseContainer is a libcontainer container object.
//
// Each container is thread-safe within the same process. Since a container can
// be destroyed by a separate process, any function may return that the container
// was not found. BaseContainer includes methods that are platform agnostic.
type BaseContainer interface {
// Returns the ID of the container
ID() string
// Returns the current status of the container.
Status() (Status, error)
// State returns the current container's state information.
State() (*State, error)
// OCIState returns the current container's state information.
OCIState() (*specs.State, error)
// Returns the current config of the container.
Config() configs.Config
// Returns the PIDs inside this container. The PIDs are in the namespace of the calling process.
//
// Some of the returned PIDs may no longer refer to processes in the Container, unless
// the Container state is PAUSED in which case every PID in the slice is valid.
Processes() ([]int, error)
// Returns statistics for the container.
Stats() (*Stats, error)
// Set resources of container as configured
//
// We can use this to change resources when containers are running.
//
Set(config configs.Config) error
// Start a process inside the container. Returns error if process fails to
// start. You can track process lifecycle with passed Process structure.
Start(process *Process) (err error)
// Run immediately starts the process inside the container. Returns error if process
// fails to start. It does not block waiting for the exec fifo after start returns but
// opens the fifo after start returns.
Run(process *Process) (err error)
// Destroys the container, if it's in a valid state, after killing any
// remaining running processes.
//
// Any event registrations are removed before the container is destroyed.
// No error is returned if the container is already destroyed.
//
// Running containers must first be stopped using Signal(..).
// Paused containers must first be resumed using Resume(..).
Destroy() error
// Signal sends the provided signal code to the container's initial process.
//
// If all is specified the signal is sent to all processes in the container
// including the initial process.
Signal(s os.Signal, all bool) error
// Exec signals the container to exec the users process at the end of the init.
Exec() error
}

File diff suppressed because it is too large

View File

@@ -1,34 +0,0 @@
package libcontainer
import criu "github.com/checkpoint-restore/go-criu/v5/rpc"
type CriuPageServerInfo struct {
Address string // IP address of CRIU page server
Port int32 // port number of CRIU page server
}
type VethPairName struct {
ContainerInterfaceName string
HostInterfaceName string
}
type CriuOpts struct {
ImagesDirectory string // directory for storing image files
WorkDirectory string // directory to cd and write logs/pidfiles/stats to
ParentImage string // directory for storing parent image files in pre-dump and dump
LeaveRunning bool // leave container in running state after checkpoint
TcpEstablished bool // checkpoint/restore established TCP connections
ExternalUnixConnections bool // allow external unix connections
ShellJob bool // allow to dump and restore shell jobs
FileLocks bool // handle file locks, for safety
PreDump bool // call criu predump to perform iterative checkpoint
PageServer CriuPageServerInfo // allow to dump to criu page server
VethPairs []VethPairName // pass the veth to criu when restore
ManageCgroupsMode criu.CriuCgMode // dump or restore cgroup mode
EmptyNs uint32 // don't c/r properties for namespace from this mask
AutoDedup bool // auto deduplication for incremental dumps
LazyPages bool // restore memory pages lazily using userfaultfd
StatusFd int // fd for feedback when lazy server is ready
LsmProfile string // LSM profile used to restore the container
LsmMountContext string // LSM mount context value to use during restore
}

View File

@@ -1,17 +0,0 @@
//go:build !go1.20
// +build !go1.20
package libcontainer
import "golang.org/x/sys/unix"
func eaccess(path string) error {
// This check is similar to access(2) with X_OK except for
// setuid/setgid binaries where it checks against the effective
// (rather than real) uid and gid. It is not needed in go 1.20
// and beyond and will be removed later.
// Relies on code added in https://go-review.googlesource.com/c/sys/+/468877
// and older CLs linked from there.
return unix.Faccessat(unix.AT_FDCWD, path, unix.X_OK, unix.AT_EACCESS)
}

View File

@@ -1,10 +0,0 @@
//go:build go1.20
package libcontainer
func eaccess(path string) error {
// Not needed in Go 1.20+ as the functionality is already in there
// (added by https://go.dev/cl/416115, https://go.dev/cl/414824,
// and fixed in Go 1.20.2 by https://go.dev/cl/469956).
return nil
}

View File

@@ -1,13 +0,0 @@
package libcontainer
import "errors"
var (
ErrExist = errors.New("container with given ID already exists")
ErrInvalidID = errors.New("invalid container ID format")
ErrNotExist = errors.New("container does not exist")
ErrPaused = errors.New("container paused")
ErrRunning = errors.New("container still running")
ErrNotRunning = errors.New("container not running")
ErrNotPaused = errors.New("container not paused")
)

View File

@@ -1,30 +0,0 @@
package libcontainer
import (
"github.com/opencontainers/runc/libcontainer/configs"
)
type Factory interface {
// Creates a new container with the given id and starts the initial process inside it.
// id must be a string containing only letters, digits and underscores and must contain
// between 1 and 1024 characters, inclusive.
//
// The id must not already be in use by an existing container. Containers created using
// a factory with the same path (and filesystem) must have distinct ids.
//
// Returns the new container with a running process.
//
// On error, any partially created container parts are cleaned up (the operation is atomic).
Create(id string, config *configs.Config) (Container, error)
// Load takes an ID for an existing container and returns the container information
// from the state. This presents a read only view of the container.
Load(id string) (Container, error)
// StartInitialization is an internal API to libcontainer used during the reexec of the
// container.
StartInitialization() error
// Type returns info string about factory type (e.g. lxc, libcontainer...)
Type() string
}

View File

@@ -1,402 +0,0 @@
package libcontainer
import (
"encoding/json"
"errors"
"fmt"
"os"
"path/filepath"
"regexp"
"runtime/debug"
"strconv"
securejoin "github.com/cyphar/filepath-securejoin"
"github.com/moby/sys/mountinfo"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/cgroups/manager"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/configs/validate"
"github.com/opencontainers/runc/libcontainer/intelrdt"
"github.com/opencontainers/runc/libcontainer/utils"
"github.com/sirupsen/logrus"
)
const (
stateFilename = "state.json"
execFifoFilename = "exec.fifo"
)
var idRegex = regexp.MustCompile(`^[\w+-\.]+$`)
// InitArgs returns an options func to configure a LinuxFactory with the
// provided init binary path and arguments.
func InitArgs(args ...string) func(*LinuxFactory) error {
return func(l *LinuxFactory) (err error) {
if len(args) > 0 {
// Resolve relative paths to ensure that it's available
// after directory changes.
if args[0], err = filepath.Abs(args[0]); err != nil {
// The only error returned from filepath.Abs is
// the one from os.Getwd, i.e. a system error.
return err
}
}
l.InitArgs = args
return nil
}
}
// TmpfsRoot is an option func to mount LinuxFactory.Root to tmpfs.
func TmpfsRoot(l *LinuxFactory) error {
mounted, err := mountinfo.Mounted(l.Root)
if err != nil {
return err
}
if !mounted {
if err := mount("tmpfs", l.Root, "", "tmpfs", 0, ""); err != nil {
return err
}
}
return nil
}
// CriuPath returns an option func to configure a LinuxFactory with the
// provided criupath
func CriuPath(criupath string) func(*LinuxFactory) error {
return func(l *LinuxFactory) error {
l.CriuPath = criupath
return nil
}
}
// New returns a linux based container factory based in the root directory and
// configures the factory with the provided option funcs.
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
if root != "" {
if err := os.MkdirAll(root, 0o700); err != nil {
return nil, err
}
}
l := &LinuxFactory{
Root: root,
InitPath: "/proc/self/exe",
InitArgs: []string{os.Args[0], "init"},
Validator: validate.New(),
CriuPath: "criu",
}
for _, opt := range options {
if opt == nil {
continue
}
if err := opt(l); err != nil {
return nil, err
}
}
return l, nil
}
// LinuxFactory implements the default factory interface for linux based systems.
type LinuxFactory struct {
// Root directory for the factory to store state.
Root string
// InitPath is the path for calling the init responsibilities for spawning
// a container.
InitPath string
// InitArgs are arguments for calling the init responsibilities for spawning
// a container.
InitArgs []string
// CriuPath is the path to the criu binary used for checkpoint and restore of
// containers.
CriuPath string
// New{u,g}idmapPath is the path to the binaries used for mapping with
// rootless containers.
NewuidmapPath string
NewgidmapPath string
// Validator provides validation to container configurations.
Validator validate.Validator
}
func (l *LinuxFactory) Create(id string, config *configs.Config) (Container, error) {
if l.Root == "" {
return nil, errors.New("root not set")
}
if err := l.validateID(id); err != nil {
return nil, err
}
if err := l.Validator.Validate(config); err != nil {
return nil, err
}
containerRoot, err := securejoin.SecureJoin(l.Root, id)
if err != nil {
return nil, err
}
if _, err := os.Stat(containerRoot); err == nil {
return nil, ErrExist
} else if !os.IsNotExist(err) {
return nil, err
}
cm, err := manager.New(config.Cgroups)
if err != nil {
return nil, err
}
// Check that cgroup does not exist or empty (no processes).
// Note for cgroup v1 this check is not thorough, as there are multiple
// separate hierarchies, while both Exists() and GetAllPids() only use
// one for "devices" controller (assuming others are the same, which is
// probably true in almost all scenarios). Checking all the hierarchies
// would be too expensive.
if cm.Exists() {
pids, err := cm.GetAllPids()
// Reading PIDs can race with cgroups removal, so ignore ENOENT and ENODEV.
if err != nil && !errors.Is(err, os.ErrNotExist) && !errors.Is(err, unix.ENODEV) {
return nil, fmt.Errorf("unable to get cgroup PIDs: %w", err)
}
if len(pids) != 0 {
if config.Cgroups.Systemd {
// systemd cgroup driver can't add a pid to an
// existing systemd unit and will return an
// error anyway, so let's error out early.
return nil, fmt.Errorf("container's cgroup is not empty: %d process(es) found", len(pids))
}
// TODO: return an error.
logrus.Warnf("container's cgroup is not empty: %d process(es) found", len(pids))
logrus.Warn("DEPRECATED: running container in a non-empty cgroup won't be supported in runc 1.2; https://github.com/opencontainers/runc/issues/3132")
}
}
// Check that cgroup is not frozen. Do not use Exists() here
// since in cgroup v1 it only checks "devices" controller.
st, err := cm.GetFreezerState()
if err != nil {
return nil, fmt.Errorf("unable to get cgroup freezer state: %w", err)
}
if st == configs.Frozen {
return nil, errors.New("container's cgroup unexpectedly frozen")
}
if err := os.MkdirAll(containerRoot, 0o711); err != nil {
return nil, err
}
if err := os.Chown(containerRoot, unix.Geteuid(), unix.Getegid()); err != nil {
return nil, err
}
c := &linuxContainer{
id: id,
root: containerRoot,
config: config,
initPath: l.InitPath,
initArgs: l.InitArgs,
criuPath: l.CriuPath,
newuidmapPath: l.NewuidmapPath,
newgidmapPath: l.NewgidmapPath,
cgroupManager: cm,
intelRdtManager: intelrdt.NewManager(config, id, ""),
}
c.state = &stoppedState{c: c}
return c, nil
}
func (l *LinuxFactory) Load(id string) (Container, error) {
if l.Root == "" {
return nil, errors.New("root not set")
}
// When loading, we need to check that the id is valid.
if err := l.validateID(id); err != nil {
return nil, err
}
containerRoot, err := securejoin.SecureJoin(l.Root, id)
if err != nil {
return nil, err
}
state, err := l.loadState(containerRoot)
if err != nil {
return nil, err
}
r := &nonChildProcess{
processPid: state.InitProcessPid,
processStartTime: state.InitProcessStartTime,
fds: state.ExternalDescriptors,
}
cm, err := manager.NewWithPaths(state.Config.Cgroups, state.CgroupPaths)
if err != nil {
return nil, err
}
c := &linuxContainer{
initProcess: r,
initProcessStartTime: state.InitProcessStartTime,
id: id,
config: &state.Config,
initPath: l.InitPath,
initArgs: l.InitArgs,
criuPath: l.CriuPath,
newuidmapPath: l.NewuidmapPath,
newgidmapPath: l.NewgidmapPath,
cgroupManager: cm,
intelRdtManager: intelrdt.NewManager(&state.Config, id, state.IntelRdtPath),
root: containerRoot,
created: state.Created,
}
c.state = &loadedState{c: c}
if err := c.refreshState(); err != nil {
return nil, err
}
return c, nil
}
func (l *LinuxFactory) Type() string {
return "libcontainer"
}
// StartInitialization loads a container by opening the pipe fd from the parent to read the configuration and state.
// This is a low-level implementation detail of the reexec and should not be consumed externally.
func (l *LinuxFactory) StartInitialization() (err error) {
// Get the INITPIPE.
envInitPipe := os.Getenv("_LIBCONTAINER_INITPIPE")
pipefd, err := strconv.Atoi(envInitPipe)
if err != nil {
err = fmt.Errorf("unable to convert _LIBCONTAINER_INITPIPE: %w", err)
logrus.Error(err)
return err
}
pipe := os.NewFile(uintptr(pipefd), "pipe")
defer pipe.Close()
defer func() {
// We have an error during the initialization of the container's init,
// send it back to the parent process in the form of an initError.
if werr := writeSync(pipe, procError); werr != nil {
fmt.Fprintln(os.Stderr, err)
return
}
if werr := utils.WriteJSON(pipe, &initError{Message: err.Error()}); werr != nil {
fmt.Fprintln(os.Stderr, err)
return
}
}()
// Only init processes have FIFOFD.
fifofd := -1
envInitType := os.Getenv("_LIBCONTAINER_INITTYPE")
it := initType(envInitType)
if it == initStandard {
envFifoFd := os.Getenv("_LIBCONTAINER_FIFOFD")
if fifofd, err = strconv.Atoi(envFifoFd); err != nil {
return fmt.Errorf("unable to convert _LIBCONTAINER_FIFOFD: %w", err)
}
}
var consoleSocket *os.File
if envConsole := os.Getenv("_LIBCONTAINER_CONSOLE"); envConsole != "" {
console, err := strconv.Atoi(envConsole)
if err != nil {
return fmt.Errorf("unable to convert _LIBCONTAINER_CONSOLE: %w", err)
}
consoleSocket = os.NewFile(uintptr(console), "console-socket")
defer consoleSocket.Close()
}
logPipeFdStr := os.Getenv("_LIBCONTAINER_LOGPIPE")
logPipeFd, err := strconv.Atoi(logPipeFdStr)
if err != nil {
return fmt.Errorf("unable to convert _LIBCONTAINER_LOGPIPE: %w", err)
}
// Get mount files (O_PATH).
mountFds, err := parseMountFds()
if err != nil {
return err
}
// Clear the current process's environment to remove any libcontainer-specific
// env vars.
os.Clearenv()
defer func() {
if e := recover(); e != nil {
if ee, ok := e.(error); ok {
err = fmt.Errorf("panic from initialization: %w, %s", ee, debug.Stack())
} else {
err = fmt.Errorf("panic from initialization: %v, %s", e, debug.Stack())
}
}
}()
i, err := newContainerInit(it, pipe, consoleSocket, fifofd, logPipeFd, mountFds)
if err != nil {
return err
}
// If Init succeeds, syscall.Exec will not return, hence none of the defers will be called.
return i.Init()
}
func (l *LinuxFactory) loadState(root string) (*State, error) {
stateFilePath, err := securejoin.SecureJoin(root, stateFilename)
if err != nil {
return nil, err
}
f, err := os.Open(stateFilePath)
if err != nil {
if os.IsNotExist(err) {
return nil, ErrNotExist
}
return nil, err
}
defer f.Close()
var state *State
if err := json.NewDecoder(f).Decode(&state); err != nil {
return nil, err
}
return state, nil
}
func (l *LinuxFactory) validateID(id string) error {
if !idRegex.MatchString(id) || string(os.PathSeparator)+id != utils.CleanPath(string(os.PathSeparator)+id) {
return ErrInvalidID
}
return nil
}
// NewuidmapPath returns an option func to configure a LinuxFactory with the
// provided newuidmap binary path.
func NewuidmapPath(newuidmapPath string) func(*LinuxFactory) error {
return func(l *LinuxFactory) error {
l.NewuidmapPath = newuidmapPath
return nil
}
}
// NewgidmapPath returns an option func to configure a LinuxFactory with the
// provided newgidmap binary path.
func NewgidmapPath(newgidmapPath string) func(*LinuxFactory) error {
return func(l *LinuxFactory) error {
l.NewgidmapPath = newgidmapPath
return nil
}
}
func parseMountFds() ([]int, error) {
fdsJson := os.Getenv("_LIBCONTAINER_MOUNT_FDS")
if fdsJson == "" {
// Always return the nil slice if no fd is present.
return nil, nil
}
var mountFds []int
if err := json.Unmarshal([]byte(fdsJson), &mountFds); err != nil {
return nil, fmt.Errorf("error unmarshalling _LIBCONTAINER_MOUNT_FDS: %w", err)
}
return mountFds, nil
}
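// Illustrative usage sketch (not part of the original file): constructing a
// factory with the option funcs above and creating a container from a config.
// The root directory, criu path, and container id below are hypothetical.
func exampleCreateContainer(config *configs.Config) (Container, error) {
    factory, err := New("/run/example/libcontainer", CriuPath("/usr/local/sbin/criu"))
    if err != nil {
        return nil, err
    }
    // Create validates the id and config, sets up the cgroup manager, and
    // prepares the container's state directory under the factory root.
    return factory.Create("example-container", config)
}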

View File

@@ -1,641 +0,0 @@
package libcontainer
import (
"bytes"
"encoding/json"
"errors"
"fmt"
"io"
"net"
"os"
"path/filepath"
"strings"
"syscall"
"unsafe"
"github.com/containerd/console"
"github.com/opencontainers/runtime-spec/specs-go"
"github.com/sirupsen/logrus"
"github.com/vishvananda/netlink"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/capabilities"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/system"
"github.com/opencontainers/runc/libcontainer/user"
"github.com/opencontainers/runc/libcontainer/utils"
)
type initType string
const (
initSetns initType = "setns"
initStandard initType = "standard"
)
type pid struct {
Pid int `json:"stage2_pid"`
PidFirstChild int `json:"stage1_pid"`
}
// network is an internal struct used to setup container networks.
type network struct {
configs.Network
// TempVethPeerName is a unique temporary veth peer name that was placed into
// the container's namespace.
TempVethPeerName string `json:"temp_veth_peer_name"`
}
// initConfig is used for transferring parameters from Exec() to Init()
type initConfig struct {
Args []string `json:"args"`
Env []string `json:"env"`
Cwd string `json:"cwd"`
Capabilities *configs.Capabilities `json:"capabilities"`
ProcessLabel string `json:"process_label"`
AppArmorProfile string `json:"apparmor_profile"`
NoNewPrivileges bool `json:"no_new_privileges"`
User string `json:"user"`
AdditionalGroups []string `json:"additional_groups"`
Config *configs.Config `json:"config"`
Networks []*network `json:"network"`
PassedFilesCount int `json:"passed_files_count"`
ContainerId string `json:"containerid"`
Rlimits []configs.Rlimit `json:"rlimits"`
CreateConsole bool `json:"create_console"`
ConsoleWidth uint16 `json:"console_width"`
ConsoleHeight uint16 `json:"console_height"`
RootlessEUID bool `json:"rootless_euid,omitempty"`
RootlessCgroups bool `json:"rootless_cgroups,omitempty"`
SpecState *specs.State `json:"spec_state,omitempty"`
Cgroup2Path string `json:"cgroup2_path,omitempty"`
}
type initer interface {
Init() error
}
func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd, logFd int, mountFds []int) (initer, error) {
var config *initConfig
if err := json.NewDecoder(pipe).Decode(&config); err != nil {
return nil, err
}
if err := populateProcessEnvironment(config.Env); err != nil {
return nil, err
}
// Clean the RLIMIT_NOFILE cache in go runtime.
// Issue: https://github.com/opencontainers/runc/issues/4195
maybeClearRlimitNofileCache(config.Rlimits)
switch t {
case initSetns:
// mountFds must be nil in this case. We don't mount while doing runc exec.
if mountFds != nil {
return nil, errors.New("mountFds must be nil. Can't mount while doing runc exec.")
}
return &linuxSetnsInit{
pipe: pipe,
consoleSocket: consoleSocket,
config: config,
logFd: logFd,
}, nil
case initStandard:
return &linuxStandardInit{
pipe: pipe,
consoleSocket: consoleSocket,
parentPid: unix.Getppid(),
config: config,
fifoFd: fifoFd,
logFd: logFd,
mountFds: mountFds,
}, nil
}
return nil, fmt.Errorf("unknown init type %q", t)
}
// populateProcessEnvironment loads the provided environment variables into the
// current process's environment.
func populateProcessEnvironment(env []string) error {
for _, pair := range env {
p := strings.SplitN(pair, "=", 2)
if len(p) < 2 {
return errors.New("invalid environment variable: missing '='")
}
name, val := p[0], p[1]
if name == "" {
return errors.New("invalid environment variable: name cannot be empty")
}
if strings.IndexByte(name, 0) >= 0 {
return fmt.Errorf("invalid environment variable %q: name contains nul byte (\\x00)", name)
}
if strings.IndexByte(val, 0) >= 0 {
return fmt.Errorf("invalid environment variable %q: value contains nul byte (\\x00)", name)
}
if err := os.Setenv(name, val); err != nil {
return err
}
}
return nil
}
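// Illustrative only: the validation rules above in action. Each entry must be
// "NAME=value" with a non-empty name and no NUL bytes; the values here are
// hypothetical.
func examplePopulateEnv() error {
    return populateProcessEnvironment([]string{
        "PATH=/usr/sbin:/usr/bin:/sbin:/bin",
        "TERM=xterm",
    })
}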
// verifyCwd ensures that the current directory is actually inside the mount
// namespace root of the current process.
func verifyCwd() error {
// getcwd(2) on Linux detects if cwd is outside of the rootfs of the
// current mount namespace root, and in that case prefixes "(unreachable)"
// to the returned string. glibc's getcwd(3) and Go's Getwd() both detect
// when this happens and return ENOENT rather than returning a non-absolute
// path. In both cases we can therefore easily detect if we have an invalid
// cwd by checking the return value of getcwd(3). See getcwd(3) for more
// details, and CVE-2024-21626 for the security issue that motivated this
// check.
//
// We have to use unix.Getwd() here because os.Getwd() has a workaround for
// $PWD which involves doing stat(.), which can fail if the current
// directory is inaccessible to the container process.
if wd, err := unix.Getwd(); errors.Is(err, unix.ENOENT) {
return errors.New("current working directory is outside of container mount namespace root -- possible container breakout detected")
} else if err != nil {
return fmt.Errorf("failed to verify if current working directory is safe: %w", err)
} else if !filepath.IsAbs(wd) {
// We shouldn't ever hit this, but check just in case.
return fmt.Errorf("current working directory is not absolute -- possible container breakout detected: cwd is %q", wd)
}
return nil
}
// finalizeNamespace drops the caps, sets the correct user
// and working dir, and closes any leaked file descriptors
// before executing the command inside the namespace
func finalizeNamespace(config *initConfig) error {
// Ensure that all unwanted fds we may have accidentally
// inherited are marked close-on-exec so they stay out of the
// container
if err := utils.CloseExecFrom(config.PassedFilesCount + 3); err != nil {
return fmt.Errorf("error closing exec fds: %w", err)
}
// we only do chdir if it's specified
doChdir := config.Cwd != ""
if doChdir {
// First, attempt the chdir before setting up the user.
// This could allow us to access a directory that the user running runc can access
// but the container user cannot.
err := unix.Chdir(config.Cwd)
switch {
case err == nil:
doChdir = false
case os.IsPermission(err):
// If we hit an EPERM, we should attempt again after setting up user.
// This will allow us to successfully chdir if the container user has access
// to the directory, but the user running runc does not.
// This is useful in cases where the cwd is also a volume that's been chowned to the container user.
default:
return fmt.Errorf("chdir to cwd (%q) set in config.json failed: %w", config.Cwd, err)
}
}
caps := &configs.Capabilities{}
if config.Capabilities != nil {
caps = config.Capabilities
} else if config.Config.Capabilities != nil {
caps = config.Config.Capabilities
}
w, err := capabilities.New(caps)
if err != nil {
return err
}
// drop capabilities in bounding set before changing user
if err := w.ApplyBoundingSet(); err != nil {
return fmt.Errorf("unable to apply bounding set: %w", err)
}
// preserve existing capabilities while we change users
if err := system.SetKeepCaps(); err != nil {
return fmt.Errorf("unable to set keep caps: %w", err)
}
if err := setupUser(config); err != nil {
return fmt.Errorf("unable to setup user: %w", err)
}
// Change working directory AFTER the user has been set up, if we haven't done it yet.
if doChdir {
if err := unix.Chdir(config.Cwd); err != nil {
return fmt.Errorf("chdir to cwd (%q) set in config.json failed: %w", config.Cwd, err)
}
}
// Make sure our final working directory is inside the container.
if err := verifyCwd(); err != nil {
return err
}
if err := system.ClearKeepCaps(); err != nil {
return fmt.Errorf("unable to clear keep caps: %w", err)
}
if err := w.ApplyCaps(); err != nil {
return fmt.Errorf("unable to apply caps: %w", err)
}
return nil
}
// setupConsole sets up the console from inside the container, and sends the
// master pty fd to the config.Pipe (using cmsg). This is done to ensure that
// consoles are scoped to a container properly (see runc#814 and the many
// issues related to that). This has to be run *after* we've pivoted to the new
// rootfs (and the users' configuration is entirely set up).
func setupConsole(socket *os.File, config *initConfig, mount bool) error {
defer socket.Close()
// At this point, /dev/ptmx points to something that we would expect. We
// used to change the owner of the slave path, but since the /dev/pts mount
// can have gid=X set (at the user's option), touching the owner of the
// slave PTY is not necessary, as the kernel will handle that for us. Note,
// however, that setupUser (specifically fixStdioPermissions) *will* change
// the UID owner of the console to be the user the process will run as (so
// they can actually control their console).
pty, slavePath, err := console.NewPty()
if err != nil {
return err
}
// After we return from here, we don't need the console anymore.
defer pty.Close()
if config.ConsoleHeight != 0 && config.ConsoleWidth != 0 {
err = pty.Resize(console.WinSize{
Height: config.ConsoleHeight,
Width: config.ConsoleWidth,
})
if err != nil {
return err
}
}
// Mount the console inside our rootfs.
if mount {
if err := mountConsole(slavePath); err != nil {
return err
}
}
// While we can access console.master, using the API is a good idea.
if err := utils.SendFd(socket, pty.Name(), pty.Fd()); err != nil {
return err
}
// Now, dup over all the things.
return dupStdio(slavePath)
}
// syncParentReady sends to the given pipe a JSON payload which indicates that
// the init is ready to Exec the child process. It then waits for the parent to
// indicate that it is cleared to Exec.
func syncParentReady(pipe io.ReadWriter) error {
// Tell parent.
if err := writeSync(pipe, procReady); err != nil {
return err
}
// Wait for parent to give the all-clear.
return readSync(pipe, procRun)
}
// syncParentHooks sends to the given pipe a JSON payload which indicates that
// the parent should execute pre-start hooks. It then waits for the parent to
// indicate that it is cleared to resume.
func syncParentHooks(pipe io.ReadWriter) error {
// Tell parent.
if err := writeSync(pipe, procHooks); err != nil {
return err
}
// Wait for parent to give the all-clear.
return readSync(pipe, procResume)
}
// syncParentSeccomp sends to the given pipe a JSON payload which
// indicates that the parent should pick up the seccomp fd with pidfd_getfd()
// and send it to the seccomp agent over a unix socket. It then waits for
// the parent to indicate that it is cleared to resume and closes the seccompFd.
// If the seccompFd is -1, there isn't anything to sync with the parent, so it
// returns no error.
func syncParentSeccomp(pipe io.ReadWriter, seccompFd int) error {
if seccompFd == -1 {
return nil
}
// Tell parent.
if err := writeSyncWithFd(pipe, procSeccomp, seccompFd); err != nil {
unix.Close(seccompFd)
return err
}
// Wait for parent to give the all-clear.
if err := readSync(pipe, procSeccompDone); err != nil {
unix.Close(seccompFd)
return fmt.Errorf("sync parent seccomp: %w", err)
}
if err := unix.Close(seccompFd); err != nil {
return fmt.Errorf("close seccomp fd: %w", err)
}
return nil
}
// setupUser changes the groups, gid, and uid for the user inside the container
func setupUser(config *initConfig) error {
// Set up defaults.
defaultExecUser := user.ExecUser{
Uid: 0,
Gid: 0,
Home: "/",
}
passwdPath, err := user.GetPasswdPath()
if err != nil {
return err
}
groupPath, err := user.GetGroupPath()
if err != nil {
return err
}
execUser, err := user.GetExecUserPath(config.User, &defaultExecUser, passwdPath, groupPath)
if err != nil {
return err
}
var addGroups []int
if len(config.AdditionalGroups) > 0 {
addGroups, err = user.GetAdditionalGroupsPath(config.AdditionalGroups, groupPath)
if err != nil {
return err
}
}
// Rather than just erroring out later in setuid(2) and setgid(2), check
// that the user is mapped here.
if _, err := config.Config.HostUID(execUser.Uid); err != nil {
return errors.New("cannot set uid to unmapped user in user namespace")
}
if _, err := config.Config.HostGID(execUser.Gid); err != nil {
return errors.New("cannot set gid to unmapped user in user namespace")
}
if config.RootlessEUID {
// We cannot set any additional groups in a rootless container and thus
// we bail if the user asked us to do so. TODO: We currently can't do
// this check earlier, but if libcontainer.Process.User was typesafe
// this might work.
if len(addGroups) > 0 {
return errors.New("cannot set any additional groups in a rootless container")
}
}
// Before we change to the container's user, make sure that the process's
// STDIO is correctly owned by the user that we are switching to.
if err := fixStdioPermissions(execUser); err != nil {
return err
}
setgroups, err := os.ReadFile("/proc/self/setgroups")
if err != nil && !os.IsNotExist(err) {
return err
}
// This isn't allowed in an unprivileged user namespace since Linux 3.19.
// There's nothing we can do about /etc/group entries, so we silently
// ignore setting groups here (since the user didn't explicitly ask us to
// set the group).
allowSupGroups := !config.RootlessEUID && string(bytes.TrimSpace(setgroups)) != "deny"
if allowSupGroups {
suppGroups := append(execUser.Sgids, addGroups...)
if err := unix.Setgroups(suppGroups); err != nil {
return &os.SyscallError{Syscall: "setgroups", Err: err}
}
}
if err := system.Setgid(execUser.Gid); err != nil {
return err
}
if err := system.Setuid(execUser.Uid); err != nil {
return err
}
// if we didn't get HOME already, set it based on the user's HOME
if envHome := os.Getenv("HOME"); envHome == "" {
if err := os.Setenv("HOME", execUser.Home); err != nil {
return err
}
}
return nil
}
// fixStdioPermissions fixes the ownership of PID 1's STDIO within the container so it matches the specified user.
// The ownership needs to match because the STDIO is created outside of the container and needs to be
// localized.
func fixStdioPermissions(u *user.ExecUser) error {
var null unix.Stat_t
if err := unix.Stat("/dev/null", &null); err != nil {
return &os.PathError{Op: "stat", Path: "/dev/null", Err: err}
}
for _, file := range []*os.File{os.Stdin, os.Stdout, os.Stderr} {
var s unix.Stat_t
if err := unix.Fstat(int(file.Fd()), &s); err != nil {
return &os.PathError{Op: "fstat", Path: file.Name(), Err: err}
}
// Skip chown if uid is already the one we want or any of the STDIO descriptors
// were redirected to /dev/null.
if int(s.Uid) == u.Uid || s.Rdev == null.Rdev {
continue
}
// We only change the uid (as it is possible for the mount to
// prefer a different gid, and there's no reason for us to change it).
// The reason why we don't just leave the default uid=X mount setup is
// that users expect to be able to actually use their console. Without
// this code, you couldn't effectively run as a non-root user inside a
// container and also have a console set up.
if err := file.Chown(u.Uid, int(s.Gid)); err != nil {
// If we've hit an EINVAL then s.Gid isn't mapped in the user
// namespace. If we've hit an EPERM then the inode's current owner
// is not mapped in our user namespace (in particular,
// privileged_wrt_inode_uidgid() has failed). Read-only
// /dev can result in EROFS error. In any case, it's
// better for us to just not touch the stdio rather
// than bail at this point.
if errors.Is(err, unix.EINVAL) || errors.Is(err, unix.EPERM) || errors.Is(err, unix.EROFS) {
continue
}
return err
}
}
return nil
}
// setupNetwork sets up and initializes any network interface inside the container.
func setupNetwork(config *initConfig) error {
for _, config := range config.Networks {
strategy, err := getStrategy(config.Type)
if err != nil {
return err
}
if err := strategy.initialize(config); err != nil {
return err
}
}
return nil
}
func setupRoute(config *configs.Config) error {
for _, config := range config.Routes {
_, dst, err := net.ParseCIDR(config.Destination)
if err != nil {
return err
}
src := net.ParseIP(config.Source)
if src == nil {
return fmt.Errorf("invalid source for route: %s", config.Source)
}
gw := net.ParseIP(config.Gateway)
if gw == nil {
return fmt.Errorf("invalid gateway for route: %s", config.Gateway)
}
l, err := netlink.LinkByName(config.InterfaceName)
if err != nil {
return err
}
route := &netlink.Route{
Scope: netlink.SCOPE_UNIVERSE,
Dst: dst,
Src: src,
Gw: gw,
LinkIndex: l.Attrs().Index,
}
if err := netlink.RouteAdd(route); err != nil {
return err
}
}
return nil
}
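// Illustrative only: the shape of a route entry consumed by setupRoute above.
// All addresses and the interface name are hypothetical.
func exampleRoute() *configs.Route {
    return &configs.Route{
        Destination:   "10.0.0.0/24",
        Source:        "10.0.0.2",
        Gateway:       "10.0.0.1",
        InterfaceName: "eth0",
    }
}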
func maybeClearRlimitNofileCache(limits []configs.Rlimit) {
for _, rlimit := range limits {
if rlimit.Type == syscall.RLIMIT_NOFILE {
system.ClearRlimitNofileCache(&syscall.Rlimit{
Cur: rlimit.Soft,
Max: rlimit.Hard,
})
return
}
}
}
func setupRlimits(limits []configs.Rlimit, pid int) error {
for _, rlimit := range limits {
if err := unix.Prlimit(pid, rlimit.Type, &unix.Rlimit{Max: rlimit.Hard, Cur: rlimit.Soft}, nil); err != nil {
return fmt.Errorf("error setting rlimit type %v: %w", rlimit.Type, err)
}
}
return nil
}
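// Illustrative sketch: applying an RLIMIT_NOFILE limit to a process via
// setupRlimits above. The pid and limit values are hypothetical.
func exampleSetNofile(pid int) error {
    return setupRlimits([]configs.Rlimit{
        {Type: unix.RLIMIT_NOFILE, Hard: 1048576, Soft: 1048576},
    }, pid)
}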
const _P_PID = 1
//nolint:structcheck,unused
type siginfo struct {
si_signo int32
si_errno int32
si_code int32
// below here is a union; si_pid is the only field we use
si_pid int32
// Pad to 128 bytes as detailed in blockUntilWaitable
pad [96]byte
}
// isWaitable returns true if the process has exited, false otherwise.
// It is based on blockUntilWaitable in src/os/wait_waitid.go.
func isWaitable(pid int) (bool, error) {
si := &siginfo{}
_, _, e := unix.Syscall6(unix.SYS_WAITID, _P_PID, uintptr(pid), uintptr(unsafe.Pointer(si)), unix.WEXITED|unix.WNOWAIT|unix.WNOHANG, 0, 0)
if e != 0 {
return false, &os.SyscallError{Syscall: "waitid", Err: e}
}
return si.si_pid != 0, nil
}
// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.
func signalAllProcesses(m cgroups.Manager, s os.Signal) error {
var procs []*os.Process
if err := m.Freeze(configs.Frozen); err != nil {
logrus.Warn(err)
}
pids, err := m.GetAllPids()
if err != nil {
if err := m.Freeze(configs.Thawed); err != nil {
logrus.Warn(err)
}
return err
}
for _, pid := range pids {
p, err := os.FindProcess(pid)
if err != nil {
logrus.Warn(err)
continue
}
procs = append(procs, p)
if err := p.Signal(s); err != nil {
logrus.Warn(err)
}
}
if err := m.Freeze(configs.Thawed); err != nil {
logrus.Warn(err)
}
subreaper, err := system.GetSubreaper()
if err != nil {
// The error here means that PR_GET_CHILD_SUBREAPER is not
// supported because this code might run on a kernel older
// than 3.4. We don't want to throw an error in that case,
// and we simplify things, considering there is no subreaper
// set.
subreaper = 0
}
for _, p := range procs {
if s != unix.SIGKILL {
if ok, err := isWaitable(p.Pid); err != nil {
if !errors.Is(err, unix.ECHILD) {
logrus.Warn("signalAllProcesses: ", p.Pid, err)
}
continue
} else if !ok {
// Not ready to report so don't wait
continue
}
}
// In case a subreaper has been set up, this code must not
// wait for the process. Otherwise, we cannot be sure the
// current process will be reaped by the subreaper, while
// the subreaper might be waiting for this process in order
// to retrieve its exit code.
if subreaper == 0 {
if _, err := p.Wait(); err != nil {
if !errors.Is(err, unix.ECHILD) {
logrus.Warn("wait: ", err)
}
}
}
}
return nil
}

View File

@@ -1,45 +0,0 @@
package keys
import (
"errors"
"fmt"
"strconv"
"strings"
"golang.org/x/sys/unix"
)
type KeySerial uint32
func JoinSessionKeyring(name string) (KeySerial, error) {
sessKeyID, err := unix.KeyctlJoinSessionKeyring(name)
if err != nil {
return 0, fmt.Errorf("unable to create session key: %w", err)
}
return KeySerial(sessKeyID), nil
}
// ModKeyringPerm modifies permissions on a keyring by reading the current permissions,
// ANDing the bits with the given mask (clearing permissions), and setting
// additional permission bits.
func ModKeyringPerm(ringID KeySerial, mask, setbits uint32) error {
dest, err := unix.KeyctlString(unix.KEYCTL_DESCRIBE, int(ringID))
if err != nil {
return err
}
res := strings.Split(dest, ";")
if len(res) < 5 {
return errors.New("destination buffer for key description is too small")
}
// parse permissions
perm64, err := strconv.ParseUint(res[3], 16, 32)
if err != nil {
return err
}
perm := (uint32(perm64) & mask) | setbits
return unix.KeyctlSetperm(int(ringID), perm)
}
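// Illustrative sketch: join a fresh session keyring and tighten its permissions
// using the helpers above. The keyring name, mask, and setbits values are
// assumptions for illustration only.
func exampleSessionKeyring() error {
    ringID, err := JoinSessionKeyring("_ses")
    if err != nil {
        return err
    }
    // Keep the existing permission bits except the "other" byte (lowest 8 bits),
    // and set no additional bits.
    return ModKeyringPerm(ringID, 0xffffff00, 0)
}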

View File

@@ -1,56 +0,0 @@
package logs
import (
"bufio"
"encoding/json"
"io"
"github.com/sirupsen/logrus"
)
func ForwardLogs(logPipe io.ReadCloser) chan error {
done := make(chan error, 1)
s := bufio.NewScanner(logPipe)
logger := logrus.StandardLogger()
if logger.ReportCaller {
// Need a copy of the standard logger, but with ReportCaller
// turned off, as the logs are merely forwarded and their
// true source is not this file/line/function.
logNoCaller := *logrus.StandardLogger()
logNoCaller.ReportCaller = false
logger = &logNoCaller
}
go func() {
for s.Scan() {
processEntry(s.Bytes(), logger)
}
if err := logPipe.Close(); err != nil {
logrus.Errorf("error closing log source: %v", err)
}
// The only error we want to return is when reading from
// logPipe has failed.
done <- s.Err()
close(done)
}()
return done
}
func processEntry(text []byte, logger *logrus.Logger) {
if len(text) == 0 {
return
}
var jl struct {
Level logrus.Level `json:"level"`
Msg string `json:"msg"`
}
if err := json.Unmarshal(text, &jl); err != nil {
logrus.Errorf("failed to decode %q to json: %v", text, err)
return
}
logger.Log(jl.Level, jl.Msg)
}
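// Illustrative usage (not part of the original file): forward JSON-encoded log
// lines from a child's log pipe into the standard logrus logger and wait for
// the pipe to drain. logPipe is a hypothetical io.ReadCloser.
func exampleForward(logPipe io.ReadCloser) error {
    done := ForwardLogs(logPipe)
    // The channel yields the scanner error (nil on clean EOF) and is then
    // closed by the forwarding goroutine.
    return <-done
}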

View File

@@ -1,97 +0,0 @@
package libcontainer
import (
"fmt"
"math"
"github.com/vishvananda/netlink/nl"
"golang.org/x/sys/unix"
)
// List of known message types we want to send to the bootstrap program.
// The numbers are chosen at random so as not to conflict with known netlink types.
const (
InitMsg uint16 = 62000
CloneFlagsAttr uint16 = 27281
NsPathsAttr uint16 = 27282
UidmapAttr uint16 = 27283
GidmapAttr uint16 = 27284
SetgroupAttr uint16 = 27285
OomScoreAdjAttr uint16 = 27286
RootlessEUIDAttr uint16 = 27287
UidmapPathAttr uint16 = 27288
GidmapPathAttr uint16 = 27289
MountSourcesAttr uint16 = 27290
)
type Int32msg struct {
Type uint16
Value uint32
}
// Serialize serializes the message.
// Int32msg has the following representation
// | nlattr len | nlattr type |
// | uint32 value |
func (msg *Int32msg) Serialize() []byte {
buf := make([]byte, msg.Len())
native := nl.NativeEndian()
native.PutUint16(buf[0:2], uint16(msg.Len()))
native.PutUint16(buf[2:4], msg.Type)
native.PutUint32(buf[4:8], msg.Value)
return buf
}
func (msg *Int32msg) Len() int {
return unix.NLA_HDRLEN + 4
}
// Bytemsg has the following representation
// | nlattr len | nlattr type |
// | value | pad |
type Bytemsg struct {
Type uint16
Value []byte
}
func (msg *Bytemsg) Serialize() []byte {
l := msg.Len()
if l > math.MaxUint16 {
// We cannot return nil nor an error here, so we panic with
// a specific type instead, which is handled via recover in
// bootstrapData.
panic(netlinkError{fmt.Errorf("netlink: cannot serialize bytemsg of length %d (larger than UINT16_MAX)", l)})
}
buf := make([]byte, (l+unix.NLA_ALIGNTO-1) & ^(unix.NLA_ALIGNTO-1))
native := nl.NativeEndian()
native.PutUint16(buf[0:2], uint16(l))
native.PutUint16(buf[2:4], msg.Type)
copy(buf[4:], msg.Value)
return buf
}
func (msg *Bytemsg) Len() int {
return unix.NLA_HDRLEN + len(msg.Value) + 1 // null-terminated
}
type Boolmsg struct {
Type uint16
Value bool
}
func (msg *Boolmsg) Serialize() []byte {
buf := make([]byte, msg.Len())
native := nl.NativeEndian()
native.PutUint16(buf[0:2], uint16(msg.Len()))
native.PutUint16(buf[2:4], msg.Type)
if msg.Value {
native.PutUint32(buf[4:8], uint32(1))
} else {
native.PutUint32(buf[4:8], uint32(0))
}
return buf
}
func (msg *Boolmsg) Len() int {
return unix.NLA_HDRLEN + 4 // alignment
}
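// Illustrative sketch: serializing a few of the bootstrap attributes above into
// a single payload. The clone flags and newuidmap path are hypothetical, and
// the real bootstrap code additionally wraps such attributes in a netlink
// request header (InitMsg).
func exampleAttrPayload() []byte {
    msgs := []interface{ Serialize() []byte }{
        &Int32msg{Type: CloneFlagsAttr, Value: uint32(unix.CLONE_NEWNS | unix.CLONE_NEWPID)},
        &Boolmsg{Type: RootlessEUIDAttr, Value: false},
        &Bytemsg{Type: UidmapPathAttr, Value: []byte("/usr/bin/newuidmap")},
    }
    var payload []byte
    for _, m := range msgs {
        payload = append(payload, m.Serialize()...)
    }
    return payload
}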

View File

@@ -1,101 +0,0 @@
package libcontainer
import (
"io/fs"
"strconv"
"golang.org/x/sys/unix"
)
// mountError holds an error from a failed mount or unmount operation.
type mountError struct {
op string
source string
target string
procfd string
flags uintptr
data string
err error
}
// Error provides a string error representation.
func (e *mountError) Error() string {
out := e.op + " "
if e.source != "" {
out += e.source + ":" + e.target
} else {
out += e.target
}
if e.procfd != "" {
out += " (via " + e.procfd + ")"
}
if e.flags != uintptr(0) {
out += ", flags: 0x" + strconv.FormatUint(uint64(e.flags), 16)
}
if e.data != "" {
out += ", data: " + e.data
}
out += ": " + e.err.Error()
return out
}
// Unwrap returns the underlying error.
// This is a convention used by the Go 1.13+ standard library.
func (e *mountError) Unwrap() error {
return e.err
}
// mount is a simple unix.Mount wrapper. If procfd is not empty, it is used
// instead of target (and the target is only used to add context to an error).
func mount(source, target, procfd, fstype string, flags uintptr, data string) error {
dst := target
if procfd != "" {
dst = procfd
}
if err := unix.Mount(source, dst, fstype, flags, data); err != nil {
return &mountError{
op: "mount",
source: source,
target: target,
procfd: procfd,
flags: flags,
data: data,
err: err,
}
}
return nil
}
// unmount is a simple unix.Unmount wrapper.
func unmount(target string, flags int) error {
err := unix.Unmount(target, flags)
if err != nil {
return &mountError{
op: "unmount",
target: target,
flags: uintptr(flags),
err: err,
}
}
return nil
}
// syscallMode returns the syscall-specific mode bits from Go's portable mode bits.
// Copy from https://cs.opensource.google/go/go/+/refs/tags/go1.20.7:src/os/file_posix.go;l=61-75
func syscallMode(i fs.FileMode) (o uint32) {
o |= uint32(i.Perm())
if i&fs.ModeSetuid != 0 {
o |= unix.S_ISUID
}
if i&fs.ModeSetgid != 0 {
o |= unix.S_ISGID
}
if i&fs.ModeSticky != 0 {
o |= unix.S_ISVTX
}
// No mapping for Go's ModeTemporary (plan9 only).
return
}
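// Illustrative only: a plain bind mount through the wrapper above. procfd is
// left empty, so the target path is used directly; both paths are hypothetical.
func exampleBindMount(source, target string) error {
    return mount(source, target, "", "", unix.MS_BIND, "")
}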

View File

@@ -1,100 +0,0 @@
package libcontainer
import (
"bytes"
"fmt"
"os"
"path/filepath"
"strconv"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/types"
"github.com/vishvananda/netlink"
)
var strategies = map[string]networkStrategy{
"loopback": &loopback{},
}
// networkStrategy represents a specific network configuration for
// a container's networking stack
type networkStrategy interface {
create(*network, int) error
initialize(*network) error
detach(*configs.Network) error
attach(*configs.Network) error
}
// getStrategy returns the specific network strategy for the
// provided type.
func getStrategy(tpe string) (networkStrategy, error) {
s, exists := strategies[tpe]
if !exists {
return nil, fmt.Errorf("unknown strategy type %q", tpe)
}
return s, nil
}
// Returns the network statistics for the network interfaces represented by the NetworkRuntimeInfo.
func getNetworkInterfaceStats(interfaceName string) (*types.NetworkInterface, error) {
out := &types.NetworkInterface{Name: interfaceName}
// This can happen if the network runtime information is missing - possible if the
// container was created by an old version of libcontainer.
if interfaceName == "" {
return out, nil
}
type netStatsPair struct {
// Where to write the output.
Out *uint64
// The network stats file to read.
File string
}
// Ingress for host veth is from the container. Hence tx_bytes stat on the host veth is actually number of bytes received by the container.
netStats := []netStatsPair{
{Out: &out.RxBytes, File: "tx_bytes"},
{Out: &out.RxPackets, File: "tx_packets"},
{Out: &out.RxErrors, File: "tx_errors"},
{Out: &out.RxDropped, File: "tx_dropped"},
{Out: &out.TxBytes, File: "rx_bytes"},
{Out: &out.TxPackets, File: "rx_packets"},
{Out: &out.TxErrors, File: "rx_errors"},
{Out: &out.TxDropped, File: "rx_dropped"},
}
for _, netStat := range netStats {
data, err := readSysfsNetworkStats(interfaceName, netStat.File)
if err != nil {
return nil, err
}
*(netStat.Out) = data
}
return out, nil
}
// Reads the specified statistics available under /sys/class/net/<EthInterface>/statistics
func readSysfsNetworkStats(ethInterface, statsFile string) (uint64, error) {
data, err := os.ReadFile(filepath.Join("/sys/class/net", ethInterface, "statistics", statsFile))
if err != nil {
return 0, err
}
return strconv.ParseUint(string(bytes.TrimSpace(data)), 10, 64)
}
// loopback is a network strategy that provides a basic loopback device
type loopback struct{}
func (l *loopback) create(n *network, nspid int) error {
return nil
}
func (l *loopback) initialize(config *network) error {
return netlink.LinkSetUp(&netlink.Device{LinkAttrs: netlink.LinkAttrs{Name: "lo"}})
}
func (l *loopback) attach(n *configs.Network) (err error) {
return nil
}
func (l *loopback) detach(n *configs.Network) (err error) {
return nil
}
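// Illustrative sketch: resolving a strategy by type and bringing up the
// container's loopback interface. The network value is hypothetical.
func exampleLoopbackUp() error {
    strategy, err := getStrategy("loopback")
    if err != nil {
        return err
    }
    return strategy.initialize(&network{Network: configs.Network{Type: "loopback"}})
}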

View File

@@ -1,84 +0,0 @@
package libcontainer
import (
"errors"
"fmt"
"os"
"path/filepath"
"golang.org/x/sys/unix"
)
type PressureLevel uint
const (
LowPressure PressureLevel = iota
MediumPressure
CriticalPressure
)
func registerMemoryEvent(cgDir string, evName string, arg string) (<-chan struct{}, error) {
evFile, err := os.Open(filepath.Join(cgDir, evName))
if err != nil {
return nil, err
}
fd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
if err != nil {
evFile.Close()
return nil, err
}
eventfd := os.NewFile(uintptr(fd), "eventfd")
eventControlPath := filepath.Join(cgDir, "cgroup.event_control")
data := fmt.Sprintf("%d %d %s", eventfd.Fd(), evFile.Fd(), arg)
if err := os.WriteFile(eventControlPath, []byte(data), 0o700); err != nil {
eventfd.Close()
evFile.Close()
return nil, err
}
ch := make(chan struct{})
go func() {
defer func() {
eventfd.Close()
evFile.Close()
close(ch)
}()
buf := make([]byte, 8)
for {
if _, err := eventfd.Read(buf); err != nil {
return
}
// When a cgroup is destroyed, an event is sent to eventfd.
// So if the control path is gone, return instead of notifying.
if _, err := os.Lstat(eventControlPath); os.IsNotExist(err) {
return
}
ch <- struct{}{}
}
}()
return ch, nil
}
// notifyOnOOM returns a channel on which you can expect an OOM event;
// if the process died without an OOM, this channel will be closed.
func notifyOnOOM(dir string) (<-chan struct{}, error) {
if dir == "" {
return nil, errors.New("memory controller missing")
}
return registerMemoryEvent(dir, "memory.oom_control", "")
}
func notifyMemoryPressure(dir string, level PressureLevel) (<-chan struct{}, error) {
if dir == "" {
return nil, errors.New("memory controller missing")
}
if level > CriticalPressure {
return nil, fmt.Errorf("invalid pressure level %d", level)
}
levelStr := []string{"low", "medium", "critical"}[level]
return registerMemoryEvent(dir, "memory.pressure_level", levelStr)
}
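// Illustrative usage: block until the memory cgroup reports an OOM event, or
// until the channel is closed because the cgroup went away. The cgroup v1
// memory controller path below is hypothetical.
func exampleWaitForOOM() {
    ch, err := notifyOnOOM("/sys/fs/cgroup/memory/example-container")
    if err != nil {
        return
    }
    if _, ok := <-ch; ok {
        fmt.Println("container hit an OOM condition")
    }
}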

View File

@@ -1,80 +0,0 @@
package libcontainer
import (
"fmt"
"path/filepath"
"unsafe"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
func registerMemoryEventV2(cgDir, evName, cgEvName string) (<-chan struct{}, error) {
fd, err := unix.InotifyInit()
if err != nil {
return nil, fmt.Errorf("unable to init inotify: %w", err)
}
// Watch for OOM kill events.
evFd, err := unix.InotifyAddWatch(fd, filepath.Join(cgDir, evName), unix.IN_MODIFY)
if err != nil {
unix.Close(fd)
return nil, fmt.Errorf("unable to add inotify watch: %w", err)
}
// The cgroup filesystem emits no `unix.IN_DELETE|unix.IN_DELETE_SELF` events, so also watch the cgroup events file to detect when all processes have exited.
cgFd, err := unix.InotifyAddWatch(fd, filepath.Join(cgDir, cgEvName), unix.IN_MODIFY)
if err != nil {
unix.Close(fd)
return nil, fmt.Errorf("unable to add inotify watch: %w", err)
}
ch := make(chan struct{})
go func() {
var (
buffer [unix.SizeofInotifyEvent + unix.PathMax + 1]byte
offset uint32
)
defer func() {
unix.Close(fd)
close(ch)
}()
for {
n, err := unix.Read(fd, buffer[:])
if err != nil {
logrus.Warnf("unable to read event data from inotify, got error: %v", err)
return
}
if n < unix.SizeofInotifyEvent {
logrus.Warnf("we should read at least %d bytes from inotify, but got %d bytes.", unix.SizeofInotifyEvent, n)
return
}
offset = 0
for offset <= uint32(n-unix.SizeofInotifyEvent) {
rawEvent := (*unix.InotifyEvent)(unsafe.Pointer(&buffer[offset]))
offset += unix.SizeofInotifyEvent + rawEvent.Len
if rawEvent.Mask&unix.IN_MODIFY != unix.IN_MODIFY {
continue
}
switch int(rawEvent.Wd) {
case evFd:
oom, err := fscommon.GetValueByKey(cgDir, evName, "oom_kill")
if err != nil || oom > 0 {
ch <- struct{}{}
}
case cgFd:
pids, err := fscommon.GetValueByKey(cgDir, cgEvName, "populated")
if err != nil || pids == 0 {
return
}
}
}
}
}()
return ch, nil
}
// notifyOnOOMV2 returns a channel on which you can expect an OOM event;
// if the process died without an OOM, this channel will be closed.
func notifyOnOOMV2(path string) (<-chan struct{}, error) {
return registerMemoryEventV2(path, "memory.events", "cgroup.events")
}
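// Illustrative usage (cgroup v2): wait for an oom_kill increment reported via
// memory.events. The unified-hierarchy path below is hypothetical.
func exampleWaitForOOMV2() {
    ch, err := notifyOnOOMV2("/sys/fs/cgroup/system.slice/example.scope")
    if err != nil {
        return
    }
    <-ch
}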

View File

@@ -1,126 +0,0 @@
package libcontainer
import (
"errors"
"io"
"math"
"os"
"github.com/opencontainers/runc/libcontainer/configs"
)
var errInvalidProcess = errors.New("invalid process")
type processOperations interface {
wait() (*os.ProcessState, error)
signal(sig os.Signal) error
pid() int
}
// Process specifies the configuration and IO for a process inside
// a container.
type Process struct {
// The command to be run followed by any arguments.
Args []string
// Env specifies the environment variables for the process.
Env []string
// User will set the uid and gid of the executing process running inside the container
// local to the container's user and group configuration.
User string
// AdditionalGroups specifies the gids that should be added to supplementary groups
// in addition to those that the user belongs to.
AdditionalGroups []string
// Cwd will change the process's current working directory inside the container's rootfs.
Cwd string
// Stdin is a pointer to a reader which provides the standard input stream.
Stdin io.Reader
// Stdout is a pointer to a writer which receives the standard output stream.
Stdout io.Writer
// Stderr is a pointer to a writer which receives the standard error stream.
Stderr io.Writer
// ExtraFiles specifies additional open files to be inherited by the container
ExtraFiles []*os.File
// Initial sizings for the console
ConsoleWidth uint16
ConsoleHeight uint16
// Capabilities specify the capabilities to keep when executing the process inside the container.
// All capabilities not specified will be dropped from the process's capability mask.
Capabilities *configs.Capabilities
// AppArmorProfile specifies the profile to apply to the process and is
// changed at the time the process is execed
AppArmorProfile string
// Label specifies the label to apply to the process. It is commonly used by selinux
Label string
// NoNewPrivileges controls whether processes can gain additional privileges.
NoNewPrivileges *bool
// Rlimits specifies the resource limits, such as max open files, to set in the container
// If Rlimits are not set, the container will inherit rlimits from the parent process
Rlimits []configs.Rlimit
// ConsoleSocket provides the masterfd console.
ConsoleSocket *os.File
// Init specifies whether the process is the first process in the container.
Init bool
ops processOperations
LogLevel string
// SubCgroupPaths specifies sub-cgroups to run the process in.
// Map keys are controller names, map values are paths (relative to
// container's top-level cgroup).
//
// If empty, the default top-level container's cgroup is used.
//
// For cgroup v2, the only key allowed is "".
SubCgroupPaths map[string]string
}
// Wait waits for the process to exit.
// Wait releases any resources associated with the Process
func (p Process) Wait() (*os.ProcessState, error) {
if p.ops == nil {
return nil, errInvalidProcess
}
return p.ops.wait()
}
// Pid returns the process ID
func (p Process) Pid() (int, error) {
// math.MinInt32 is returned here, because it's an invalid value
// for the kill() system call.
if p.ops == nil {
return math.MinInt32, errInvalidProcess
}
return p.ops.pid(), nil
}
// Signal sends a signal to the Process.
func (p Process) Signal(sig os.Signal) error {
if p.ops == nil {
return errInvalidProcess
}
return p.ops.signal(sig)
}
// IO holds the process's STDIO
type IO struct {
Stdin io.WriteCloser
Stdout io.ReadCloser
Stderr io.ReadCloser
}
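// Illustrative sketch: a minimal Process definition for a container's init
// process. All field values are hypothetical.
func exampleProcess() *Process {
    return &Process{
        Args:   []string{"/bin/sh", "-c", "echo hello"},
        Env:    []string{"PATH=/usr/sbin:/usr/bin:/sbin:/bin", "TERM=xterm"},
        User:   "0:0",
        Cwd:    "/",
        Stdin:  os.Stdin,
        Stdout: os.Stdout,
        Stderr: os.Stderr,
        Init:   true,
    }
}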

View File

@@ -1,823 +0,0 @@
package libcontainer
import (
"encoding/json"
"errors"
"fmt"
"io"
"net"
"os"
"os/exec"
"path/filepath"
"strconv"
"time"
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fs2"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/intelrdt"
"github.com/opencontainers/runc/libcontainer/logs"
"github.com/opencontainers/runc/libcontainer/system"
"github.com/opencontainers/runc/libcontainer/utils"
"github.com/opencontainers/runtime-spec/specs-go"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
type parentProcess interface {
// pid returns the pid for the running process.
pid() int
// start starts the process execution.
start() error
// terminate sends a SIGKILL to the process and waits for the exit.
terminate() error
// wait waits on the process returning the process state.
wait() (*os.ProcessState, error)
// startTime returns the process start time.
startTime() (uint64, error)
signal(os.Signal) error
externalDescriptors() []string
setExternalDescriptors(fds []string)
forwardChildLogs() chan error
}
type filePair struct {
parent *os.File
child *os.File
}
type setnsProcess struct {
cmd *exec.Cmd
messageSockPair filePair
logFilePair filePair
cgroupPaths map[string]string
rootlessCgroups bool
manager cgroups.Manager
intelRdtPath string
config *initConfig
fds []string
process *Process
bootstrapData io.Reader
initProcessPid int
}
func (p *setnsProcess) startTime() (uint64, error) {
stat, err := system.Stat(p.pid())
return stat.StartTime, err
}
func (p *setnsProcess) signal(sig os.Signal) error {
s, ok := sig.(unix.Signal)
if !ok {
return errors.New("os: unsupported signal type")
}
return unix.Kill(p.pid(), s)
}
func (p *setnsProcess) start() (retErr error) {
defer p.messageSockPair.parent.Close()
// get the "before" value of oom kill count
oom, _ := p.manager.OOMKillCount()
err := p.cmd.Start()
// close the write-side of the pipes (controlled by child)
p.messageSockPair.child.Close()
p.logFilePair.child.Close()
if err != nil {
return fmt.Errorf("error starting setns process: %w", err)
}
waitInit := initWaiter(p.messageSockPair.parent)
defer func() {
if retErr != nil {
if newOom, err := p.manager.OOMKillCount(); err == nil && newOom != oom {
// Someone in this cgroup was killed, this _might_ be us.
retErr = fmt.Errorf("%w (possibly OOM-killed)", retErr)
}
werr := <-waitInit
if werr != nil {
logrus.WithError(werr).Warn()
}
err := ignoreTerminateErrors(p.terminate())
if err != nil {
logrus.WithError(err).Warn("unable to terminate setnsProcess")
}
}
}()
if p.bootstrapData != nil {
if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
return fmt.Errorf("error copying bootstrap data to pipe: %w", err)
}
}
err = <-waitInit
if err != nil {
return err
}
if err := p.execSetns(); err != nil {
return fmt.Errorf("error executing setns process: %w", err)
}
for _, path := range p.cgroupPaths {
if err := cgroups.WriteCgroupProc(path, p.pid()); err != nil && !p.rootlessCgroups {
// On cgroup v2 + nesting + domain controllers, WriteCgroupProc may fail with EBUSY.
// https://github.com/opencontainers/runc/issues/2356#issuecomment-621277643
// Try to join the cgroup of InitProcessPid.
if cgroups.IsCgroup2UnifiedMode() && p.initProcessPid != 0 {
initProcCgroupFile := fmt.Sprintf("/proc/%d/cgroup", p.initProcessPid)
initCg, initCgErr := cgroups.ParseCgroupFile(initProcCgroupFile)
if initCgErr == nil {
if initCgPath, ok := initCg[""]; ok {
initCgDirpath := filepath.Join(fs2.UnifiedMountpoint, initCgPath)
logrus.Debugf("adding pid %d to cgroups %v failed (%v), attempting to join %q (obtained from %s)",
p.pid(), p.cgroupPaths, err, initCg, initCgDirpath)
// NOTE: initCgDirPath is not guaranteed to exist because we didn't pause the container.
err = cgroups.WriteCgroupProc(initCgDirpath, p.pid())
}
}
}
if err != nil {
return fmt.Errorf("error adding pid %d to cgroups: %w", p.pid(), err)
}
}
}
if p.intelRdtPath != "" {
// if Intel RDT "resource control" filesystem path exists
_, err := os.Stat(p.intelRdtPath)
if err == nil {
if err := intelrdt.WriteIntelRdtTasks(p.intelRdtPath, p.pid()); err != nil {
return fmt.Errorf("error adding pid %d to Intel RDT: %w", p.pid(), err)
}
}
}
if err := utils.WriteJSON(p.messageSockPair.parent, p.config); err != nil {
return fmt.Errorf("error writing config to pipe: %w", err)
}
ierr := parseSync(p.messageSockPair.parent, func(sync *syncT) error {
switch sync.Type {
case procReady:
// Set rlimits, this has to be done here because we lose permissions
// to raise the limits once we enter a user-namespace
if err := setupRlimits(p.config.Rlimits, p.pid()); err != nil {
return fmt.Errorf("error setting rlimits for ready process: %w", err)
}
// Sync with child.
return writeSync(p.messageSockPair.parent, procRun)
case procHooks:
// This shouldn't happen.
panic("unexpected procHooks in setns")
case procSeccomp:
if p.config.Config.Seccomp.ListenerPath == "" {
return errors.New("listenerPath is not set")
}
seccompFd, err := recvSeccompFd(uintptr(p.pid()), uintptr(sync.Fd))
if err != nil {
return err
}
defer unix.Close(seccompFd)
bundle, annotations := utils.Annotations(p.config.Config.Labels)
containerProcessState := &specs.ContainerProcessState{
Version: specs.Version,
Fds: []string{specs.SeccompFdName},
Pid: p.cmd.Process.Pid,
Metadata: p.config.Config.Seccomp.ListenerMetadata,
State: specs.State{
Version: specs.Version,
ID: p.config.ContainerId,
Status: specs.StateRunning,
Pid: p.initProcessPid,
Bundle: bundle,
Annotations: annotations,
},
}
if err := sendContainerProcessState(p.config.Config.Seccomp.ListenerPath,
containerProcessState, seccompFd); err != nil {
return err
}
// Sync with child.
if err := writeSync(p.messageSockPair.parent, procSeccompDone); err != nil {
return err
}
return nil
default:
return errors.New("invalid JSON payload from child")
}
})
if err := unix.Shutdown(int(p.messageSockPair.parent.Fd()), unix.SHUT_WR); err != nil {
return &os.PathError{Op: "shutdown", Path: "(init pipe)", Err: err}
}
// Must be done after Shutdown so the child will exit and we can wait for it.
if ierr != nil {
_, _ = p.wait()
return ierr
}
return nil
}
// execSetns runs the process that executes C code to perform the setns calls.
// Because setns support requires the C process to fork off a child and perform the setns
// before the Go runtime boots, we wait on the process to die and receive the child's pid
// over the provided pipe.
func (p *setnsProcess) execSetns() error {
status, err := p.cmd.Process.Wait()
if err != nil {
_ = p.cmd.Wait()
return fmt.Errorf("error waiting on setns process to finish: %w", err)
}
if !status.Success() {
_ = p.cmd.Wait()
return &exec.ExitError{ProcessState: status}
}
var pid *pid
if err := json.NewDecoder(p.messageSockPair.parent).Decode(&pid); err != nil {
_ = p.cmd.Wait()
return fmt.Errorf("error reading pid from init pipe: %w", err)
}
// Clean up the zombie parent process
// On Unix systems FindProcess always succeeds.
firstChildProcess, _ := os.FindProcess(pid.PidFirstChild)
// Ignore the error in case the child has already been reaped for any reason
_, _ = firstChildProcess.Wait()
process, err := os.FindProcess(pid.Pid)
if err != nil {
return err
}
p.cmd.Process = process
p.process.ops = p
return nil
}
// terminate sends a SIGKILL to the forked process for the setns routine then waits to
// avoid the process becoming a zombie.
func (p *setnsProcess) terminate() error {
if p.cmd.Process == nil {
return nil
}
err := p.cmd.Process.Kill()
if _, werr := p.wait(); err == nil {
err = werr
}
return err
}
func (p *setnsProcess) wait() (*os.ProcessState, error) {
err := p.cmd.Wait()
// Return actual ProcessState even on Wait error
return p.cmd.ProcessState, err
}
func (p *setnsProcess) pid() int {
return p.cmd.Process.Pid
}
func (p *setnsProcess) externalDescriptors() []string {
return p.fds
}
func (p *setnsProcess) setExternalDescriptors(newFds []string) {
p.fds = newFds
}
func (p *setnsProcess) forwardChildLogs() chan error {
return logs.ForwardLogs(p.logFilePair.parent)
}
type initProcess struct {
cmd *exec.Cmd
messageSockPair filePair
logFilePair filePair
config *initConfig
manager cgroups.Manager
intelRdtManager *intelrdt.Manager
container *linuxContainer
fds []string
process *Process
bootstrapData io.Reader
sharePidns bool
}
func (p *initProcess) pid() int {
return p.cmd.Process.Pid
}
func (p *initProcess) externalDescriptors() []string {
return p.fds
}
// getChildPid receives the final child's pid over the provided pipe.
func (p *initProcess) getChildPid() (int, error) {
var pid pid
if err := json.NewDecoder(p.messageSockPair.parent).Decode(&pid); err != nil {
_ = p.cmd.Wait()
return -1, err
}
// Clean up the zombie parent process
// On Unix systems FindProcess always succeeds.
firstChildProcess, _ := os.FindProcess(pid.PidFirstChild)
// Ignore the error in case the child has already been reaped for any reason
_, _ = firstChildProcess.Wait()
return pid.Pid, nil
}
func (p *initProcess) waitForChildExit(childPid int) error {
status, err := p.cmd.Process.Wait()
if err != nil {
_ = p.cmd.Wait()
return err
}
if !status.Success() {
_ = p.cmd.Wait()
return &exec.ExitError{ProcessState: status}
}
process, err := os.FindProcess(childPid)
if err != nil {
return err
}
p.cmd.Process = process
p.process.ops = p
return nil
}
func (p *initProcess) start() (retErr error) {
defer p.messageSockPair.parent.Close() //nolint: errcheck
err := p.cmd.Start()
p.process.ops = p
// close the write-side of the pipes (controlled by child)
_ = p.messageSockPair.child.Close()
_ = p.logFilePair.child.Close()
if err != nil {
p.process.ops = nil
return fmt.Errorf("unable to start init: %w", err)
}
waitInit := initWaiter(p.messageSockPair.parent)
defer func() {
if retErr != nil {
// Find out if init is killed by the kernel's OOM killer.
// Get the count before killing init as otherwise cgroup
// might be removed by systemd.
oom, err := p.manager.OOMKillCount()
if err != nil {
logrus.WithError(err).Warn("unable to get oom kill count")
} else if oom > 0 {
// Does not matter what the particular error was,
// its cause is most probably OOM, so report that.
const oomError = "container init was OOM-killed (memory limit too low?)"
if logrus.GetLevel() >= logrus.DebugLevel {
// Only show the original error if debug is set,
// as it is not generally very useful.
retErr = fmt.Errorf(oomError+": %w", retErr)
} else {
retErr = errors.New(oomError)
}
}
werr := <-waitInit
if werr != nil {
logrus.WithError(werr).Warn()
}
// Terminate the process to ensure we can remove cgroups.
if err := ignoreTerminateErrors(p.terminate()); err != nil {
logrus.WithError(err).Warn("unable to terminate initProcess")
}
_ = p.manager.Destroy()
if p.intelRdtManager != nil {
_ = p.intelRdtManager.Destroy()
}
}
}()
// Do this before syncing with child so that no children can escape the
// cgroup. We don't need to worry about not doing this and not being root
// because we'd be using the rootless cgroup manager in that case.
if err := p.manager.Apply(p.pid()); err != nil {
return fmt.Errorf("unable to apply cgroup configuration: %w", err)
}
if p.intelRdtManager != nil {
if err := p.intelRdtManager.Apply(p.pid()); err != nil {
return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
}
}
if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
}
err = <-waitInit
if err != nil {
return err
}
childPid, err := p.getChildPid()
if err != nil {
return fmt.Errorf("can't get final child's PID from pipe: %w", err)
}
// Save the standard descriptor names before the container process
// can potentially move them (e.g., via dup2()). If we don't do this now,
// we won't know at checkpoint time which file descriptor to look up.
fds, err := getPipeFds(childPid)
if err != nil {
return fmt.Errorf("error getting pipe fds for pid %d: %w", childPid, err)
}
p.setExternalDescriptors(fds)
// Wait for our first child to exit
if err := p.waitForChildExit(childPid); err != nil {
return fmt.Errorf("error waiting for our first child to exit: %w", err)
}
if err := p.createNetworkInterfaces(); err != nil {
return fmt.Errorf("error creating network interfaces: %w", err)
}
if err := p.updateSpecState(); err != nil {
return fmt.Errorf("error updating spec state: %w", err)
}
if err := p.sendConfig(); err != nil {
return fmt.Errorf("error sending config to init process: %w", err)
}
var (
sentRun bool
sentResume bool
)
ierr := parseSync(p.messageSockPair.parent, func(sync *syncT) error {
switch sync.Type {
case procSeccomp:
if p.config.Config.Seccomp.ListenerPath == "" {
return errors.New("listenerPath is not set")
}
seccompFd, err := recvSeccompFd(uintptr(childPid), uintptr(sync.Fd))
if err != nil {
return err
}
defer unix.Close(seccompFd)
s, err := p.container.currentOCIState()
if err != nil {
return err
}
// initProcessStartTime hasn't been set yet.
s.Pid = p.cmd.Process.Pid
s.Status = specs.StateCreating
containerProcessState := &specs.ContainerProcessState{
Version: specs.Version,
Fds: []string{specs.SeccompFdName},
Pid: s.Pid,
Metadata: p.config.Config.Seccomp.ListenerMetadata,
State: *s,
}
if err := sendContainerProcessState(p.config.Config.Seccomp.ListenerPath,
containerProcessState, seccompFd); err != nil {
return err
}
// Sync with child.
if err := writeSync(p.messageSockPair.parent, procSeccompDone); err != nil {
return err
}
case procReady:
// Set rlimits, this has to be done here because we lose permissions
// to raise the limits once we enter a user-namespace
if err := setupRlimits(p.config.Rlimits, p.pid()); err != nil {
return fmt.Errorf("error setting rlimits for ready process: %w", err)
}
// call prestart and CreateRuntime hooks
if !p.config.Config.Namespaces.Contains(configs.NEWNS) {
// Setup cgroup before the hook, so that the prestart and CreateRuntime hook could apply cgroup permissions.
if err := p.manager.Set(p.config.Config.Cgroups.Resources); err != nil {
return fmt.Errorf("error setting cgroup config for ready process: %w", err)
}
if p.intelRdtManager != nil {
if err := p.intelRdtManager.Set(p.config.Config); err != nil {
return fmt.Errorf("error setting Intel RDT config for ready process: %w", err)
}
}
if len(p.config.Config.Hooks) != 0 {
s, err := p.container.currentOCIState()
if err != nil {
return err
}
// initProcessStartTime hasn't been set yet.
s.Pid = p.cmd.Process.Pid
s.Status = specs.StateCreating
hooks := p.config.Config.Hooks
if err := hooks[configs.Prestart].RunHooks(s); err != nil {
return err
}
if err := hooks[configs.CreateRuntime].RunHooks(s); err != nil {
return err
}
}
}
// generate a timestamp indicating when the container was started
p.container.created = time.Now().UTC()
p.container.state = &createdState{
c: p.container,
}
// NOTE: If the procRun state has been synced and the
// runc-create process has been killed for some reason,
// the runc-init[2:stage] process will be leaked. The
// runc command would also fail to parse the root directory
// because the container doesn't have state.json.
//
// In order to clean up the runc-init[2:stage] process via
// runc delete/stop, we should store the state before the
// procRun sync.
state, uerr := p.container.updateState(p)
if uerr != nil {
return fmt.Errorf("unable to store init state: %w", err)
}
p.container.initProcessStartTime = state.InitProcessStartTime
// Sync with child.
if err := writeSync(p.messageSockPair.parent, procRun); err != nil {
return err
}
sentRun = true
case procHooks:
// Set up the cgroup before the prestart hook, so that the prestart hook can apply cgroup permissions.
if err := p.manager.Set(p.config.Config.Cgroups.Resources); err != nil {
return fmt.Errorf("error setting cgroup config for procHooks process: %w", err)
}
if p.intelRdtManager != nil {
if err := p.intelRdtManager.Set(p.config.Config); err != nil {
return fmt.Errorf("error setting Intel RDT config for procHooks process: %w", err)
}
}
if len(p.config.Config.Hooks) != 0 {
s, err := p.container.currentOCIState()
if err != nil {
return err
}
// initProcessStartTime hasn't been set yet.
s.Pid = p.cmd.Process.Pid
s.Status = specs.StateCreating
hooks := p.config.Config.Hooks
if err := hooks[configs.Prestart].RunHooks(s); err != nil {
return err
}
if err := hooks[configs.CreateRuntime].RunHooks(s); err != nil {
return err
}
}
// Sync with child.
if err := writeSync(p.messageSockPair.parent, procResume); err != nil {
return err
}
sentResume = true
default:
return errors.New("invalid JSON payload from child")
}
return nil
})
if !sentRun {
return fmt.Errorf("error during container init: %w", ierr)
}
if p.config.Config.Namespaces.Contains(configs.NEWNS) && !sentResume {
return errors.New("could not synchronise after executing prestart and CreateRuntime hooks with container process")
}
if err := unix.Shutdown(int(p.messageSockPair.parent.Fd()), unix.SHUT_WR); err != nil {
return &os.PathError{Op: "shutdown", Path: "(init pipe)", Err: err}
}
// Must be done after Shutdown so the child will exit and we can wait for it.
if ierr != nil {
_, _ = p.wait()
return ierr
}
return nil
}
func (p *initProcess) wait() (*os.ProcessState, error) {
err := p.cmd.Wait()
// We should kill all processes in the cgroup when init has died, if we share the host PID namespace.
if p.sharePidns {
_ = signalAllProcesses(p.manager, unix.SIGKILL)
}
return p.cmd.ProcessState, err
}
func (p *initProcess) terminate() error {
if p.cmd.Process == nil {
return nil
}
err := p.cmd.Process.Kill()
if _, werr := p.wait(); err == nil {
err = werr
}
return err
}
func (p *initProcess) startTime() (uint64, error) {
stat, err := system.Stat(p.pid())
return stat.StartTime, err
}
func (p *initProcess) updateSpecState() error {
s, err := p.container.currentOCIState()
if err != nil {
return err
}
p.config.SpecState = s
return nil
}
func (p *initProcess) sendConfig() error {
// Send the config to the container's init process. We don't use a streaming
// JSON encoder here because the JSON decoder can misbehave in some cases, see:
// https://github.com/docker/docker/issues/14203#issuecomment-174177790
return utils.WriteJSON(p.messageSockPair.parent, p.config)
}
func (p *initProcess) createNetworkInterfaces() error {
for _, config := range p.config.Config.Networks {
strategy, err := getStrategy(config.Type)
if err != nil {
return err
}
n := &network{
Network: *config,
}
if err := strategy.create(n, p.pid()); err != nil {
return err
}
p.config.Networks = append(p.config.Networks, n)
}
return nil
}
func (p *initProcess) signal(sig os.Signal) error {
s, ok := sig.(unix.Signal)
if !ok {
return errors.New("os: unsupported signal type")
}
return unix.Kill(p.pid(), s)
}
func (p *initProcess) setExternalDescriptors(newFds []string) {
p.fds = newFds
}
func (p *initProcess) forwardChildLogs() chan error {
return logs.ForwardLogs(p.logFilePair.parent)
}
func recvSeccompFd(childPid, childFd uintptr) (int, error) {
pidfd, _, errno := unix.Syscall(unix.SYS_PIDFD_OPEN, childPid, 0, 0)
if errno != 0 {
return -1, fmt.Errorf("performing SYS_PIDFD_OPEN syscall: %w", errno)
}
defer unix.Close(int(pidfd))
seccompFd, _, errno := unix.Syscall(unix.SYS_PIDFD_GETFD, pidfd, childFd, 0)
if errno != 0 {
return -1, fmt.Errorf("performing SYS_PIDFD_GETFD syscall: %w", errno)
}
return int(seccompFd), nil
}
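// A self-contained sketch of the pidfd_open(2) + pidfd_getfd(2) sequence used
// by recvSeccompFd above, shown against our own process for simplicity. It
// assumes a Linux kernel >= 5.6; the name duplicateOwnFd is hypothetical.
func duplicateOwnFd(fd uintptr) (int, error) {
    pidfd, _, errno := unix.Syscall(unix.SYS_PIDFD_OPEN, uintptr(unix.Getpid()), 0, 0)
    if errno != 0 {
        return -1, fmt.Errorf("pidfd_open: %w", errno)
    }
    defer unix.Close(int(pidfd))
    // Duplicate the target descriptor into our own fd table.
    newFd, _, errno := unix.Syscall(unix.SYS_PIDFD_GETFD, pidfd, fd, 0)
    if errno != 0 {
        return -1, fmt.Errorf("pidfd_getfd: %w", errno)
    }
    return int(newFd), nil
}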
func sendContainerProcessState(listenerPath string, state *specs.ContainerProcessState, fd int) error {
conn, err := net.Dial("unix", listenerPath)
if err != nil {
return fmt.Errorf("failed to connect with seccomp agent specified in the seccomp profile: %w", err)
}
socket, err := conn.(*net.UnixConn).File()
if err != nil {
return fmt.Errorf("cannot get seccomp socket: %w", err)
}
defer socket.Close()
b, err := json.Marshal(state)
if err != nil {
return fmt.Errorf("cannot marshall seccomp state: %w", err)
}
err = utils.SendFds(socket, b, fd)
if err != nil {
return fmt.Errorf("cannot send seccomp fd to %s: %w", listenerPath, err)
}
return nil
}
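// A minimal sketch of what utils.SendFds is assumed to do under the hood: the
// JSON payload travels as regular data while the fd rides along as SCM_RIGHTS
// ancillary data. The name sendFdSketch is hypothetical and for illustration.
func sendFdSketch(conn *net.UnixConn, payload []byte, fd int) error {
    oob := unix.UnixRights(fd)
    _, _, err := conn.WriteMsgUnix(payload, oob, nil)
    return err
}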
func getPipeFds(pid int) ([]string, error) {
fds := make([]string, 3)
dirPath := filepath.Join("/proc", strconv.Itoa(pid), "/fd")
for i := 0; i < 3; i++ {
// XXX: This breaks if the path is not a valid symlink (which can
// happen in certain particularly unlucky mount namespace setups).
f := filepath.Join(dirPath, strconv.Itoa(i))
target, err := os.Readlink(f)
if err != nil {
// Ignore permission errors for rootless containers and other
// non-dumpable processes. If we can't get the fd for a particular
// file, there's not much we can do.
if os.IsPermission(err) {
continue
}
return fds, err
}
fds[i] = target
}
return fds, nil
}
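// A tiny sketch of the same /proc-based lookup run against the current
// process, showing what the saved descriptor names look like in practice
// (e.g. "/dev/pts/0" or "pipe:[12345]"). The name ownStdioTargets is
// hypothetical.
func ownStdioTargets() ([]string, error) {
    targets := make([]string, 3)
    for i := 0; i < 3; i++ {
        target, err := os.Readlink(filepath.Join("/proc/self/fd", strconv.Itoa(i)))
        if err != nil {
            return nil, err
        }
        targets[i] = target
    }
    return targets, nil
}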
// InitializeIO creates pipes for use with the process's stdio and returns the
// opposite side for each. Do not use this if you want to have a pseudoterminal
// set up for you by libcontainer (TODO: fix that too).
// TODO: This is mostly unnecessary, and should be handled by clients.
func (p *Process) InitializeIO(rootuid, rootgid int) (i *IO, err error) {
var fds []uintptr
i = &IO{}
// cleanup in case of an error
defer func() {
if err != nil {
for _, fd := range fds {
_ = unix.Close(int(fd))
}
}
}()
// STDIN
r, w, err := os.Pipe()
if err != nil {
return nil, err
}
fds = append(fds, r.Fd(), w.Fd())
p.Stdin, i.Stdin = r, w
// STDOUT
if r, w, err = os.Pipe(); err != nil {
return nil, err
}
fds = append(fds, r.Fd(), w.Fd())
p.Stdout, i.Stdout = w, r
// STDERR
if r, w, err = os.Pipe(); err != nil {
return nil, err
}
fds = append(fds, r.Fd(), w.Fd())
p.Stderr, i.Stderr = w, r
// change ownership of the pipes in case we are in a user namespace
for _, fd := range fds {
if err := unix.Fchown(int(fd), rootuid, rootgid); err != nil {
return nil, &os.PathError{Op: "fchown", Path: "fd " + strconv.Itoa(int(fd)), Err: err}
}
}
return i, nil
}
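// A short usage sketch for InitializeIO: the caller keeps the *IO side and
// normally hands the Process side to the container. Here both ends stay in one
// process purely to show which end is which; errors are elided and the current
// uid/gid are used so the fchown is a no-op.
//
//  p := &Process{}
//  pio, _ := p.InitializeIO(os.Getuid(), os.Getgid())
//  _, _ = pio.Stdin.Write([]byte("hello\n")) // caller writes ...
//  buf := make([]byte, 6)
//  _, _ = p.Stdin.Read(buf) // ... the container side reads it.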
// initWaiter returns a channel to wait on for making sure
// runc init has finished the initial setup.
func initWaiter(r io.Reader) chan error {
ch := make(chan error, 1)
go func() {
defer close(ch)
inited := make([]byte, 1)
n, err := r.Read(inited)
if err == nil {
if n < 1 {
err = errors.New("short read")
} else if inited[0] != 0 {
err = fmt.Errorf("unexpected %d != 0", inited[0])
} else {
ch <- nil
return
}
}
ch <- fmt.Errorf("waiting for init preliminary setup: %w", err)
}()
return ch
}
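// A small usage sketch of the handshake initWaiter expects: the child writes a
// single zero byte once its preliminary setup is done, and the parent waits on
// the returned channel. Both ends of an os.Pipe stand in for the real init
// pipe here, purely for illustration.
//
//  r, w, _ := os.Pipe()
//  ch := initWaiter(r)
//  _, _ = w.Write([]byte{0}) // what runc init does after preliminary setup
//  if err := <-ch; err != nil {
//      // handshake failed, or the byte read was not 0
//  }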

View File

@@ -1,128 +0,0 @@
package libcontainer
import (
"errors"
"os"
"os/exec"
"github.com/opencontainers/runc/libcontainer/system"
)
func newRestoredProcess(cmd *exec.Cmd, fds []string) (*restoredProcess, error) {
var err error
pid := cmd.Process.Pid
stat, err := system.Stat(pid)
if err != nil {
return nil, err
}
return &restoredProcess{
cmd: cmd,
processStartTime: stat.StartTime,
fds: fds,
}, nil
}
type restoredProcess struct {
cmd *exec.Cmd
processStartTime uint64
fds []string
}
func (p *restoredProcess) start() error {
return errors.New("restored process cannot be started")
}
func (p *restoredProcess) pid() int {
return p.cmd.Process.Pid
}
func (p *restoredProcess) terminate() error {
err := p.cmd.Process.Kill()
if _, werr := p.wait(); err == nil {
err = werr
}
return err
}
func (p *restoredProcess) wait() (*os.ProcessState, error) {
// TODO: how do we wait on the actual process?
// maybe use --exec-cmd in criu
err := p.cmd.Wait()
if err != nil {
var exitErr *exec.ExitError
if !errors.As(err, &exitErr) {
return nil, err
}
}
st := p.cmd.ProcessState
return st, nil
}
func (p *restoredProcess) startTime() (uint64, error) {
return p.processStartTime, nil
}
func (p *restoredProcess) signal(s os.Signal) error {
return p.cmd.Process.Signal(s)
}
func (p *restoredProcess) externalDescriptors() []string {
return p.fds
}
func (p *restoredProcess) setExternalDescriptors(newFds []string) {
p.fds = newFds
}
func (p *restoredProcess) forwardChildLogs() chan error {
return nil
}
// nonChildProcess represents a process where the calling process is not
// the parent process. This process is created when a factory loads a container from
// a persisted state.
type nonChildProcess struct {
processPid int
processStartTime uint64
fds []string
}
func (p *nonChildProcess) start() error {
return errors.New("restored process cannot be started")
}
func (p *nonChildProcess) pid() int {
return p.processPid
}
func (p *nonChildProcess) terminate() error {
return errors.New("restored process cannot be terminated")
}
func (p *nonChildProcess) wait() (*os.ProcessState, error) {
return nil, errors.New("restored process cannot be waited on")
}
func (p *nonChildProcess) startTime() (uint64, error) {
return p.processStartTime, nil
}
func (p *nonChildProcess) signal(s os.Signal) error {
proc, err := os.FindProcess(p.processPid)
if err != nil {
return err
}
return proc.Signal(s)
}
func (p *nonChildProcess) externalDescriptors() []string {
return p.fds
}
func (p *nonChildProcess) setExternalDescriptors(newFds []string) {
p.fds = newFds
}
func (p *nonChildProcess) forwardChildLogs() chan error {
return nil
}

File diff suppressed because it is too large

View File

@@ -1,113 +0,0 @@
package seccomp
import (
"fmt"
"sort"
"github.com/opencontainers/runc/libcontainer/configs"
)
var operators = map[string]configs.Operator{
"SCMP_CMP_NE": configs.NotEqualTo,
"SCMP_CMP_LT": configs.LessThan,
"SCMP_CMP_LE": configs.LessThanOrEqualTo,
"SCMP_CMP_EQ": configs.EqualTo,
"SCMP_CMP_GE": configs.GreaterThanOrEqualTo,
"SCMP_CMP_GT": configs.GreaterThan,
"SCMP_CMP_MASKED_EQ": configs.MaskEqualTo,
}
// KnownOperators returns the list of the known operations.
// Used by `runc features`.
func KnownOperators() []string {
var res []string
for k := range operators {
res = append(res, k)
}
sort.Strings(res)
return res
}
var actions = map[string]configs.Action{
"SCMP_ACT_KILL": configs.Kill,
"SCMP_ACT_ERRNO": configs.Errno,
"SCMP_ACT_TRAP": configs.Trap,
"SCMP_ACT_ALLOW": configs.Allow,
"SCMP_ACT_TRACE": configs.Trace,
"SCMP_ACT_LOG": configs.Log,
"SCMP_ACT_NOTIFY": configs.Notify,
"SCMP_ACT_KILL_THREAD": configs.KillThread,
"SCMP_ACT_KILL_PROCESS": configs.KillProcess,
}
// KnownActions returns the list of the known actions.
// Used by `runc features`.
func KnownActions() []string {
var res []string
for k := range actions {
res = append(res, k)
}
sort.Strings(res)
return res
}
var archs = map[string]string{
"SCMP_ARCH_X86": "x86",
"SCMP_ARCH_X86_64": "amd64",
"SCMP_ARCH_X32": "x32",
"SCMP_ARCH_ARM": "arm",
"SCMP_ARCH_AARCH64": "arm64",
"SCMP_ARCH_MIPS": "mips",
"SCMP_ARCH_MIPS64": "mips64",
"SCMP_ARCH_MIPS64N32": "mips64n32",
"SCMP_ARCH_MIPSEL": "mipsel",
"SCMP_ARCH_MIPSEL64": "mipsel64",
"SCMP_ARCH_MIPSEL64N32": "mipsel64n32",
"SCMP_ARCH_PPC": "ppc",
"SCMP_ARCH_PPC64": "ppc64",
"SCMP_ARCH_PPC64LE": "ppc64le",
"SCMP_ARCH_RISCV64": "riscv64",
"SCMP_ARCH_S390": "s390",
"SCMP_ARCH_S390X": "s390x",
}
// KnownArchs returns the list of the known archs.
// Used by `runc features`.
func KnownArchs() []string {
var res []string
for k := range archs {
res = append(res, k)
}
sort.Strings(res)
return res
}
// ConvertStringToOperator converts a string into a Seccomp comparison operator.
// Comparison operators use the names they are assigned by Libseccomp's header.
// Attempting to convert a string that is not a valid operator results in an
// error.
func ConvertStringToOperator(in string) (configs.Operator, error) {
if op, ok := operators[in]; ok {
return op, nil
}
return 0, fmt.Errorf("string %s is not a valid operator for seccomp", in)
}
// ConvertStringToAction converts a string into a Seccomp rule match action.
// Actions use the names they are assigned in Libseccomp's header.
// Attempting to convert a string that is not a valid action results in an
// error.
func ConvertStringToAction(in string) (configs.Action, error) {
if act, ok := actions[in]; ok {
return act, nil
}
return 0, fmt.Errorf("string %s is not a valid action for seccomp", in)
}
// ConvertStringToArch converts a string into a Seccomp comparison arch.
func ConvertStringToArch(in string) (string, error) {
if arch, ok := archs[in]; ok {
return arch, nil
}
return "", fmt.Errorf("string %s is not a valid arch for seccomp", in)
}
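// A brief usage sketch of the converters above, mapping the libseccomp-style
// names found in an OCI profile to this package's representation. The results
// follow directly from the maps defined in this file.
//
//  act, _ := ConvertStringToAction("SCMP_ACT_ERRNO")  // configs.Errno
//  op, _ := ConvertStringToOperator("SCMP_CMP_EQ")    // configs.EqualTo
//  arch, _ := ConvertStringToArch("SCMP_ARCH_X86_64") // "amd64"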

View File

@@ -1,721 +0,0 @@
//go:build cgo && seccomp
// +build cgo,seccomp
package patchbpf
import (
"bytes"
"encoding/binary"
"errors"
"fmt"
"io"
"os"
"runtime"
"unsafe"
libseccomp "github.com/seccomp/libseccomp-golang"
"github.com/sirupsen/logrus"
"golang.org/x/net/bpf"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/utils"
)
// #cgo pkg-config: libseccomp
/*
#include <errno.h>
#include <stdint.h>
#include <seccomp.h>
#include <linux/seccomp.h>
const uint32_t C_ACT_ERRNO_ENOSYS = SCMP_ACT_ERRNO(ENOSYS);
// Copied from <linux/seccomp.h>.
#ifndef SECCOMP_SET_MODE_FILTER
# define SECCOMP_SET_MODE_FILTER 1
#endif
const uintptr_t C_SET_MODE_FILTER = SECCOMP_SET_MODE_FILTER;
#ifndef SECCOMP_FILTER_FLAG_LOG
# define SECCOMP_FILTER_FLAG_LOG (1UL << 1)
#endif
const uintptr_t C_FILTER_FLAG_LOG = SECCOMP_FILTER_FLAG_LOG;
#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
# define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif
const uintptr_t C_FILTER_FLAG_NEW_LISTENER = SECCOMP_FILTER_FLAG_NEW_LISTENER;
#ifndef AUDIT_ARCH_RISCV64
#ifndef EM_RISCV
#define EM_RISCV 243
#endif
#define AUDIT_ARCH_RISCV64 (EM_RISCV|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
#endif
// We use the AUDIT_ARCH_* values because those are the ones used by the kernel
// and SCMP_ARCH_* sometimes has fake values (such as SCMP_ARCH_X32). But we
// use <seccomp.h> so we get libseccomp's fallback definitions of AUDIT_ARCH_*.
const uint32_t C_AUDIT_ARCH_I386 = AUDIT_ARCH_I386;
const uint32_t C_AUDIT_ARCH_X86_64 = AUDIT_ARCH_X86_64;
const uint32_t C_AUDIT_ARCH_ARM = AUDIT_ARCH_ARM;
const uint32_t C_AUDIT_ARCH_AARCH64 = AUDIT_ARCH_AARCH64;
const uint32_t C_AUDIT_ARCH_MIPS = AUDIT_ARCH_MIPS;
const uint32_t C_AUDIT_ARCH_MIPS64 = AUDIT_ARCH_MIPS64;
const uint32_t C_AUDIT_ARCH_MIPS64N32 = AUDIT_ARCH_MIPS64N32;
const uint32_t C_AUDIT_ARCH_MIPSEL = AUDIT_ARCH_MIPSEL;
const uint32_t C_AUDIT_ARCH_MIPSEL64 = AUDIT_ARCH_MIPSEL64;
const uint32_t C_AUDIT_ARCH_MIPSEL64N32 = AUDIT_ARCH_MIPSEL64N32;
const uint32_t C_AUDIT_ARCH_PPC = AUDIT_ARCH_PPC;
const uint32_t C_AUDIT_ARCH_PPC64 = AUDIT_ARCH_PPC64;
const uint32_t C_AUDIT_ARCH_PPC64LE = AUDIT_ARCH_PPC64LE;
const uint32_t C_AUDIT_ARCH_S390 = AUDIT_ARCH_S390;
const uint32_t C_AUDIT_ARCH_S390X = AUDIT_ARCH_S390X;
const uint32_t C_AUDIT_ARCH_RISCV64 = AUDIT_ARCH_RISCV64;
*/
import "C"
var retErrnoEnosys = uint32(C.C_ACT_ERRNO_ENOSYS)
// This syscall is used for multiplexing "large" syscalls on s390(x). Unknown
// syscalls will end up with this syscall number, so we need to explicitly
// return -ENOSYS for this syscall on those architectures.
const s390xMultiplexSyscall libseccomp.ScmpSyscall = 0
func isAllowAction(action configs.Action) bool {
switch action {
// Trace is considered an "allow" action because a good tracer should
// support future syscalls (by handling -ENOSYS on its own), and giving
// -ENOSYS will be disruptive for emulation.
case configs.Allow, configs.Log, configs.Trace:
return true
default:
return false
}
}
func parseProgram(rdr io.Reader) ([]bpf.RawInstruction, error) {
var program []bpf.RawInstruction
loop:
for {
// Read the next instruction. We have to use NativeEndian because
// seccomp_export_bpf outputs the program in *host* endian-ness.
var insn unix.SockFilter
if err := binary.Read(rdr, utils.NativeEndian, &insn); err != nil {
if errors.Is(err, io.EOF) {
// Parsing complete.
break loop
}
if errors.Is(err, io.ErrUnexpectedEOF) {
// Parsing stopped mid-instruction.
return nil, fmt.Errorf("program parsing halted mid-instruction: %w", err)
}
// All other errors.
return nil, fmt.Errorf("error parsing instructions: %w", err)
}
program = append(program, bpf.RawInstruction{
Op: insn.Code,
Jt: insn.Jt,
Jf: insn.Jf,
K: insn.K,
})
}
return program, nil
}
func disassembleFilter(filter *libseccomp.ScmpFilter) ([]bpf.Instruction, error) {
rdr, wtr, err := os.Pipe()
if err != nil {
return nil, fmt.Errorf("error creating scratch pipe: %w", err)
}
defer wtr.Close()
defer rdr.Close()
readerBuffer := new(bytes.Buffer)
errChan := make(chan error, 1)
go func() {
_, err := io.Copy(readerBuffer, rdr)
errChan <- err
close(errChan)
}()
if err := filter.ExportBPF(wtr); err != nil {
return nil, fmt.Errorf("error exporting BPF: %w", err)
}
// Close so that the reader actually gets EOF.
_ = wtr.Close()
if copyErr := <-errChan; copyErr != nil {
return nil, fmt.Errorf("error reading from ExportBPF pipe: %w", copyErr)
}
// Parse the instructions.
rawProgram, err := parseProgram(readerBuffer)
if err != nil {
return nil, fmt.Errorf("parsing generated BPF filter: %w", err)
}
program, ok := bpf.Disassemble(rawProgram)
if !ok {
return nil, errors.New("could not disassemble entire BPF filter")
}
return program, nil
}
type linuxAuditArch uint32
const invalidArch linuxAuditArch = 0
func scmpArchToAuditArch(arch libseccomp.ScmpArch) (linuxAuditArch, error) {
switch arch {
case libseccomp.ArchNative:
// Convert to actual native architecture.
arch, err := libseccomp.GetNativeArch()
if err != nil {
return invalidArch, fmt.Errorf("unable to get native arch: %w", err)
}
return scmpArchToAuditArch(arch)
case libseccomp.ArchX86:
return linuxAuditArch(C.C_AUDIT_ARCH_I386), nil
case libseccomp.ArchAMD64, libseccomp.ArchX32:
// NOTE: x32 is treated like x86_64 except all x32 syscalls have the
// 30th bit of the syscall number set to indicate that it's not a
// normal x86_64 syscall.
return linuxAuditArch(C.C_AUDIT_ARCH_X86_64), nil
case libseccomp.ArchARM:
return linuxAuditArch(C.C_AUDIT_ARCH_ARM), nil
case libseccomp.ArchARM64:
return linuxAuditArch(C.C_AUDIT_ARCH_AARCH64), nil
case libseccomp.ArchMIPS:
return linuxAuditArch(C.C_AUDIT_ARCH_MIPS), nil
case libseccomp.ArchMIPS64:
return linuxAuditArch(C.C_AUDIT_ARCH_MIPS64), nil
case libseccomp.ArchMIPS64N32:
return linuxAuditArch(C.C_AUDIT_ARCH_MIPS64N32), nil
case libseccomp.ArchMIPSEL:
return linuxAuditArch(C.C_AUDIT_ARCH_MIPSEL), nil
case libseccomp.ArchMIPSEL64:
return linuxAuditArch(C.C_AUDIT_ARCH_MIPSEL64), nil
case libseccomp.ArchMIPSEL64N32:
return linuxAuditArch(C.C_AUDIT_ARCH_MIPSEL64N32), nil
case libseccomp.ArchPPC:
return linuxAuditArch(C.C_AUDIT_ARCH_PPC), nil
case libseccomp.ArchPPC64:
return linuxAuditArch(C.C_AUDIT_ARCH_PPC64), nil
case libseccomp.ArchPPC64LE:
return linuxAuditArch(C.C_AUDIT_ARCH_PPC64LE), nil
case libseccomp.ArchS390:
return linuxAuditArch(C.C_AUDIT_ARCH_S390), nil
case libseccomp.ArchS390X:
return linuxAuditArch(C.C_AUDIT_ARCH_S390X), nil
case libseccomp.ArchRISCV64:
return linuxAuditArch(C.C_AUDIT_ARCH_RISCV64), nil
default:
return invalidArch, fmt.Errorf("unknown architecture: %v", arch)
}
}
type lastSyscallMap map[linuxAuditArch]map[libseccomp.ScmpArch]libseccomp.ScmpSyscall
// Figure out largest syscall number referenced in the filter for each
// architecture. We will be generating code based on the native architecture
// representation, but SCMP_ARCH_X32 means we have to track cases where the
// same architecture has different largest syscalls based on the mode.
func findLastSyscalls(config *configs.Seccomp) (lastSyscallMap, error) {
scmpArchs := make(map[libseccomp.ScmpArch]struct{})
for _, ociArch := range config.Architectures {
arch, err := libseccomp.GetArchFromString(ociArch)
if err != nil {
return nil, fmt.Errorf("unable to validate seccomp architecture: %w", err)
}
scmpArchs[arch] = struct{}{}
}
// On architectures like ppc64le, Docker inexplicably doesn't include the
// native architecture in the architecture list which results in no
// architectures being present in the list at all (rendering the ENOSYS
// stub a no-op). So, always include the native architecture.
if nativeScmpArch, err := libseccomp.GetNativeArch(); err != nil {
return nil, fmt.Errorf("unable to get native arch: %w", err)
} else if _, ok := scmpArchs[nativeScmpArch]; !ok {
logrus.Debugf("seccomp: adding implied native architecture %v to config set", nativeScmpArch)
scmpArchs[nativeScmpArch] = struct{}{}
}
logrus.Debugf("seccomp: configured architecture set: %s", scmpArchs)
// Only loop over architectures which are present in the filter. Any other
// architectures will get the libseccomp bad architecture action anyway.
lastSyscalls := make(lastSyscallMap)
for arch := range scmpArchs {
auditArch, err := scmpArchToAuditArch(arch)
if err != nil {
return nil, fmt.Errorf("cannot map architecture %v to AUDIT_ARCH_ constant: %w", arch, err)
}
if _, ok := lastSyscalls[auditArch]; !ok {
lastSyscalls[auditArch] = map[libseccomp.ScmpArch]libseccomp.ScmpSyscall{}
}
if _, ok := lastSyscalls[auditArch][arch]; ok {
// Because of ArchNative we may hit the same entry multiple times.
// Just skip it if we've seen this (linuxAuditArch, ScmpArch)
// combination before.
continue
}
// Find the largest syscall in the filter for this architecture.
var largestSyscall libseccomp.ScmpSyscall
for _, rule := range config.Syscalls {
sysno, err := libseccomp.GetSyscallFromNameByArch(rule.Name, arch)
if err != nil {
// Ignore unknown syscalls.
continue
}
if sysno > largestSyscall {
largestSyscall = sysno
}
}
if largestSyscall != 0 {
logrus.Debugf("seccomp: largest syscall number for arch %v is %v", arch, largestSyscall)
lastSyscalls[auditArch][arch] = largestSyscall
} else {
logrus.Warnf("could not find any syscalls for arch %v", arch)
delete(lastSyscalls[auditArch], arch)
}
}
return lastSyscalls, nil
}
// FIXME FIXME FIXME
//
// This solution is less than ideal. In the future it would be great to have
// per-arch information about which syscalls were added in which kernel
// versions so we can create far more accurate filter rules (handling holes in
// the syscall table and determining -ENOSYS requirements based on kernel
// minimum version alone).
//
// This implementation can in principle cause issues with syscalls like
// close_range(2) which were added out-of-order in the syscall table between
// kernel releases.
func generateEnosysStub(lastSyscalls lastSyscallMap) ([]bpf.Instruction, error) {
// A jump-table for each linuxAuditArch used to generate the initial
// conditional jumps -- measured from the *END* of the program so they
// remain valid after prepending to the tail.
archJumpTable := map[linuxAuditArch]uint32{}
// Generate our own -ENOSYS rules for each architecture. They have to be
// generated in reverse (prepended to the tail of the program) because the
// JumpIf jumps need to be computed from the end of the program.
programTail := []bpf.Instruction{
// Fall-through rules jump into the filter.
bpf.Jump{Skip: 1},
// Rules which jump to here get -ENOSYS.
bpf.RetConstant{Val: retErrnoEnosys},
}
// Generate the syscall -ENOSYS rules.
for auditArch, maxSyscalls := range lastSyscalls {
// The number of instructions from the tail of this section which need
// to be jumped in order to reach the -ENOSYS return. If the section
// does not jump, it will fall through to the actual filter.
baseJumpEnosys := uint32(len(programTail) - 1)
baseJumpFilter := baseJumpEnosys + 1
// Add the load instruction for the syscall number -- we jump here
// directly from the arch code so we need to do it here. Sadly we can't
// share this code between architecture branches.
section := []bpf.Instruction{
// load [0] (syscall number)
bpf.LoadAbsolute{Off: 0, Size: 4}, // NOTE: We assume sizeof(int) == 4.
}
switch len(maxSyscalls) {
case 0:
// No syscalls found for this arch -- skip it and move on.
continue
case 1:
// Get the only syscall and scmpArch in the map.
var (
scmpArch libseccomp.ScmpArch
sysno libseccomp.ScmpSyscall
)
for arch, no := range maxSyscalls {
sysno = no
scmpArch = arch
}
switch scmpArch {
// Return -ENOSYS for setup(2) on s390(x). This syscall is used for
// multiplexing "large syscall number" syscalls, but if the syscall
// number is not known to the kernel then the syscall number is
// left unchanged (and because it is sysno=0, you'll end up with
// EPERM for syscalls the kernel doesn't know about).
//
// The actual setup(2) syscall is never used by userspace anymore
// (and hasn't existed for decades) outside of this multiplexing
// scheme so returning -ENOSYS is fine.
case libseccomp.ArchS390, libseccomp.ArchS390X:
section = append(section, []bpf.Instruction{
// jne [setup=0],1
bpf.JumpIf{
Cond: bpf.JumpNotEqual,
Val: uint32(s390xMultiplexSyscall),
SkipTrue: 1,
},
// ret [ENOSYS]
bpf.RetConstant{Val: retErrnoEnosys},
}...)
}
// The simplest case just boils down to a single jgt instruction,
// with special handling if baseJumpEnosys is larger than 255 (and
// thus a long jump is required).
var sectionTail []bpf.Instruction
if baseJumpEnosys+1 <= 255 {
sectionTail = []bpf.Instruction{
// jgt [syscall],[baseJumpEnosys+1]
bpf.JumpIf{
Cond: bpf.JumpGreaterThan,
Val: uint32(sysno),
SkipTrue: uint8(baseJumpEnosys + 1),
},
// ja [baseJumpFilter]
bpf.Jump{Skip: baseJumpFilter},
}
} else {
sectionTail = []bpf.Instruction{
// jle [syscall],1
bpf.JumpIf{Cond: bpf.JumpLessOrEqual, Val: uint32(sysno), SkipTrue: 1},
// ja [baseJumpEnosys+1]
bpf.Jump{Skip: baseJumpEnosys + 1},
// ja [baseJumpFilter]
bpf.Jump{Skip: baseJumpFilter},
}
}
// If we're on x86 we need to add a check for x32 and if we're in
// the wrong mode we jump over the section.
if uint32(auditArch) == uint32(C.C_AUDIT_ARCH_X86_64) {
// Generate a prefix to check the mode.
switch scmpArch {
case libseccomp.ArchAMD64:
sectionTail = append([]bpf.Instruction{
// jset (1<<30),[len(tail)-1]
bpf.JumpIf{
Cond: bpf.JumpBitsSet,
Val: 1 << 30,
SkipTrue: uint8(len(sectionTail) - 1),
},
}, sectionTail...)
case libseccomp.ArchX32:
sectionTail = append([]bpf.Instruction{
// jset (1<<30),0,[len(tail)-1]
bpf.JumpIf{
Cond: bpf.JumpBitsNotSet,
Val: 1 << 30,
SkipTrue: uint8(len(sectionTail) - 1),
},
}, sectionTail...)
default:
return nil, fmt.Errorf("unknown amd64 native architecture %#x", scmpArch)
}
}
section = append(section, sectionTail...)
case 2:
// x32 and x86_64 are a unique case, we can't handle any others.
if uint32(auditArch) != uint32(C.C_AUDIT_ARCH_X86_64) {
return nil, fmt.Errorf("unknown architecture overlap on native arch %#x", auditArch)
}
x32sysno, ok := maxSyscalls[libseccomp.ArchX32]
if !ok {
return nil, fmt.Errorf("missing %v in overlapping x86_64 arch: %v", libseccomp.ArchX32, maxSyscalls)
}
x86sysno, ok := maxSyscalls[libseccomp.ArchAMD64]
if !ok {
return nil, fmt.Errorf("missing %v in overlapping x86_64 arch: %v", libseccomp.ArchAMD64, maxSyscalls)
}
// The x32 ABI indicates that a syscall is being made by an x32
// process by setting the 30th bit of the syscall number, but we
// need to do some special-casing depending on whether we need to
// do long jumps.
if baseJumpEnosys+2 <= 255 {
// For the simple case we want to have something like:
// jset (1<<30),1
// jgt [x86 syscall],[baseJumpEnosys+2],1
// jgt [x32 syscall],[baseJumpEnosys+1]
// ja [baseJumpFilter]
section = append(section, []bpf.Instruction{
// jset (1<<30),1
bpf.JumpIf{Cond: bpf.JumpBitsSet, Val: 1 << 30, SkipTrue: 1},
// jgt [x86 syscall],[baseJumpEnosys+2],1
bpf.JumpIf{
Cond: bpf.JumpGreaterThan,
Val: uint32(x86sysno),
SkipTrue: uint8(baseJumpEnosys + 2), SkipFalse: 1,
},
// jgt [x32 syscall],[baseJumpEnosys+1]
bpf.JumpIf{
Cond: bpf.JumpGreaterThan,
Val: uint32(x32sysno),
SkipTrue: uint8(baseJumpEnosys + 1),
},
// ja [baseJumpFilter]
bpf.Jump{Skip: baseJumpFilter},
}...)
} else {
// But if the [baseJumpEnosys+2] jump is larger than 255 we
// need to do a long jump like so:
// jset (1<<30),1
// jgt [x86 syscall],1,2
// jle [x32 syscall],1
// ja [baseJumpEnosys+1]
// ja [baseJumpFilter]
section = append(section, []bpf.Instruction{
// jset (1<<30),1
bpf.JumpIf{Cond: bpf.JumpBitsSet, Val: 1 << 30, SkipTrue: 1},
// jgt [x86 syscall],1,2
bpf.JumpIf{
Cond: bpf.JumpGreaterThan,
Val: uint32(x86sysno),
SkipTrue: 1, SkipFalse: 2,
},
// jle [x32 syscall],1
bpf.JumpIf{
Cond: bpf.JumpLessOrEqual,
Val: uint32(x32sysno),
SkipTrue: 1,
},
// ja [baseJumpEnosys+1]
bpf.Jump{Skip: baseJumpEnosys + 1},
// ja [baseJumpFilter]
bpf.Jump{Skip: baseJumpFilter},
}...)
}
default:
return nil, fmt.Errorf("invalid number of architecture overlaps: %v", len(maxSyscalls))
}
// Prepend this section to the tail.
programTail = append(section, programTail...)
// Update jump table.
archJumpTable[auditArch] = uint32(len(programTail))
}
// Add a dummy "jump to filter" for any architecture we might miss below.
// Such architectures will probably get the BadArch action of the filter
// regardless.
programTail = append([]bpf.Instruction{
// ja [end of stub and start of filter]
bpf.Jump{Skip: uint32(len(programTail))},
}, programTail...)
// Generate the jump rules for each architecture. This has to be done in
// reverse as well for the same reason as above. We add to programTail
// directly because the jumps are impacted by each architecture rule we add
// as well.
//
// TODO: Maybe we want to optimise to avoid long jumps here? So sort the
// architectures based on how large the jumps are going to be, or
// re-sort the candidate architectures each time to make sure that we
// pick the largest jump which is going to be smaller than 255.
for auditArch := range lastSyscalls {
// We jump forwards but the jump table is calculated from the *END*.
jump := uint32(len(programTail)) - archJumpTable[auditArch]
// Same routine as above -- this is a basic jeq check, complicated
// slightly if it turns out that we need to do a long jump.
if jump <= 255 {
programTail = append([]bpf.Instruction{
// jeq [arch],[jump]
bpf.JumpIf{
Cond: bpf.JumpEqual,
Val: uint32(auditArch),
SkipTrue: uint8(jump),
},
}, programTail...)
} else {
programTail = append([]bpf.Instruction{
// jne [arch],1
bpf.JumpIf{
Cond: bpf.JumpNotEqual,
Val: uint32(auditArch),
SkipTrue: 1,
},
// ja [jump]
bpf.Jump{Skip: jump},
}, programTail...)
}
}
// Prepend the load instruction for the architecture.
programTail = append([]bpf.Instruction{
// load [4] (architecture)
bpf.LoadAbsolute{Off: 4, Size: 4}, // NOTE: We assume sizeof(int) == 4.
}, programTail...)
// And that's all folks!
return programTail, nil
}
func assemble(program []bpf.Instruction) ([]unix.SockFilter, error) {
rawProgram, err := bpf.Assemble(program)
if err != nil {
return nil, fmt.Errorf("error assembling program: %w", err)
}
// Convert []bpf.RawInstruction to []unix.SockFilter for the kernel.
var filter []unix.SockFilter
for _, insn := range rawProgram {
filter = append(filter, unix.SockFilter{
Code: insn.Op,
Jt: insn.Jt,
Jf: insn.Jf,
K: insn.K,
})
}
return filter, nil
}
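// A minimal sketch of what assemble produces for the smallest possible stub:
// load the architecture word, then unconditionally return -ENOSYS. This only
// illustrates the bpf package plumbing and is not a filter anyone should load.
//
//  prog := []bpf.Instruction{
//      bpf.LoadAbsolute{Off: 4, Size: 4},    // load seccomp_data.arch
//      bpf.RetConstant{Val: retErrnoEnosys}, // return ERRNO(ENOSYS)
//  }
//  filter, err := assemble(prog) // -> []unix.SockFilter ready for seccomp(2)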
func generatePatch(config *configs.Seccomp) ([]bpf.Instruction, error) {
// Skip patching the generated cBPF when defaultErrnoRet is set and matches
// the -ENOSYS return used by the stub, since the stub would be redundant.
if config.DefaultErrnoRet != nil && *config.DefaultErrnoRet == uint(retErrnoEnosys) {
return nil, nil
}
// We only add the stub if the default action is not permissive.
if isAllowAction(config.DefaultAction) {
logrus.Debugf("seccomp: skipping -ENOSYS stub filter generation")
return nil, nil
}
lastSyscalls, err := findLastSyscalls(config)
if err != nil {
return nil, fmt.Errorf("error finding last syscalls for -ENOSYS stub: %w", err)
}
stubProgram, err := generateEnosysStub(lastSyscalls)
if err != nil {
return nil, fmt.Errorf("error generating -ENOSYS stub: %w", err)
}
return stubProgram, nil
}
func enosysPatchFilter(config *configs.Seccomp, filter *libseccomp.ScmpFilter) ([]unix.SockFilter, error) {
program, err := disassembleFilter(filter)
if err != nil {
return nil, fmt.Errorf("error disassembling original filter: %w", err)
}
patch, err := generatePatch(config)
if err != nil {
return nil, fmt.Errorf("error generating patch for filter: %w", err)
}
fullProgram := append(patch, program...)
logrus.Debugf("seccomp: prepending -ENOSYS stub filter to user filter...")
for idx, insn := range patch {
logrus.Debugf(" [%4.1d] %s", idx, insn)
}
logrus.Debugf(" [....] --- original filter ---")
fprog, err := assemble(fullProgram)
if err != nil {
return nil, fmt.Errorf("error assembling modified filter: %w", err)
}
return fprog, nil
}
func filterFlags(config *configs.Seccomp, filter *libseccomp.ScmpFilter) (flags uint, noNewPrivs bool, err error) {
// Ignore the error since pre-2.4 libseccomp is treated as API level 0.
apiLevel, _ := libseccomp.GetAPI()
noNewPrivs, err = filter.GetNoNewPrivsBit()
if err != nil {
return 0, false, fmt.Errorf("unable to fetch no_new_privs filter bit: %w", err)
}
if apiLevel >= 3 {
if logBit, err := filter.GetLogBit(); err != nil {
return 0, false, fmt.Errorf("unable to fetch SECCOMP_FILTER_FLAG_LOG bit: %w", err)
} else if logBit {
flags |= uint(C.C_FILTER_FLAG_LOG)
}
}
// TODO: Support seccomp flags not yet added to libseccomp-golang...
for _, call := range config.Syscalls {
if call.Action == configs.Notify {
flags |= uint(C.C_FILTER_FLAG_NEW_LISTENER)
break
}
}
return
}
func sysSeccompSetFilter(flags uint, filter []unix.SockFilter) (fd int, err error) {
fprog := unix.SockFprog{
Len: uint16(len(filter)),
Filter: &filter[0],
}
fd = -1 // only return a valid fd when C_FILTER_FLAG_NEW_LISTENER is set
// If no seccomp flags were requested we can use the old-school prctl(2).
if flags == 0 {
err = unix.Prctl(unix.PR_SET_SECCOMP,
unix.SECCOMP_MODE_FILTER,
uintptr(unsafe.Pointer(&fprog)), 0, 0)
} else {
fdptr, _, errno := unix.RawSyscall(unix.SYS_SECCOMP,
uintptr(C.C_SET_MODE_FILTER),
uintptr(flags), uintptr(unsafe.Pointer(&fprog)))
if errno != 0 {
err = errno
}
if flags&uint(C.C_FILTER_FLAG_NEW_LISTENER) != 0 {
fd = int(fdptr)
}
}
runtime.KeepAlive(filter)
runtime.KeepAlive(fprog)
return
}
// PatchAndLoad takes a seccomp configuration and a libseccomp filter which has
// been pre-configured with the set of rules in the seccomp config. It then
// patches said filter to handle -ENOSYS in a much nicer manner than the
// default libseccomp default action behaviour, and loads the patched filter
// into the kernel for the current process.
func PatchAndLoad(config *configs.Seccomp, filter *libseccomp.ScmpFilter) (int, error) {
// Generate a patched filter.
fprog, err := enosysPatchFilter(config, filter)
if err != nil {
return -1, fmt.Errorf("error patching filter: %w", err)
}
// Get the set of libseccomp flags set.
seccompFlags, noNewPrivs, err := filterFlags(config, filter)
if err != nil {
return -1, fmt.Errorf("unable to fetch seccomp filter flags: %w", err)
}
// Set no_new_privs if it was requested, though in runc we handle
// no_new_privs separately so warn if we hit this path.
if noNewPrivs {
logrus.Warnf("potentially misconfigured filter -- setting no_new_privs in seccomp path")
if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
return -1, fmt.Errorf("error enabling no_new_privs bit: %w", err)
}
}
// Finally, load the filter.
fd, err := sysSeccompSetFilter(seccompFlags, fprog)
if err != nil {
return -1, fmt.Errorf("error loading seccomp filter: %w", err)
}
return fd, nil
}

View File

@@ -1,4 +0,0 @@
//go:build !linux || !cgo || !seccomp
// +build !linux !cgo !seccomp
package patchbpf

View File

@@ -1,268 +0,0 @@
//go:build cgo && seccomp
// +build cgo,seccomp
package seccomp
import (
"errors"
"fmt"
libseccomp "github.com/seccomp/libseccomp-golang"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/seccomp/patchbpf"
)
var (
actTrace = libseccomp.ActTrace.SetReturnCode(int16(unix.EPERM))
actErrno = libseccomp.ActErrno.SetReturnCode(int16(unix.EPERM))
)
const (
// Linux system calls can have at most 6 arguments
syscallMaxArguments int = 6
)
// InitSeccomp installs the seccomp filters to be used in the container as
// specified in config.
// Returns the seccomp file descriptor if any of the filters include a
// SCMP_ACT_NOTIFY action, otherwise returns -1.
func InitSeccomp(config *configs.Seccomp) (int, error) {
if config == nil {
return -1, errors.New("cannot initialize Seccomp - nil config passed")
}
defaultAction, err := getAction(config.DefaultAction, config.DefaultErrnoRet)
if err != nil {
return -1, errors.New("error initializing seccomp - invalid default action")
}
// Ignore the error since pre-2.4 libseccomp is treated as API level 0.
apiLevel, _ := libseccomp.GetAPI()
for _, call := range config.Syscalls {
if call.Action == configs.Notify {
if apiLevel < 6 {
return -1, fmt.Errorf("seccomp notify unsupported: API level: got %d, want at least 6. Please try with libseccomp >= 2.5.0 and Linux >= 5.7", apiLevel)
}
// We can't allow the write syscall to notify the seccomp agent.
// After InitSeccomp() is called, we need syncParentSeccomp() to write the plain seccomp fd
// number, so the parent can send it to the seccomp agent. If we used SCMP_ACT_NOTIFY on write, we
// could never write the seccomp fd to the parent, so the seccomp agent would never receive
// the seccomp fd and runc would hang during initialization.
//
// Note that read()/close(), that are also used in syncParentSeccomp(), _can_ use SCMP_ACT_NOTIFY.
// Because we write the seccomp fd on the pipe to the parent, the parent is able to proceed and
// send the seccomp fd to the agent (it is another process and not subject to the seccomp
// filter). We will be blocked on read()/close() inside syncParentSeccomp() but if the seccomp
// agent allows those syscalls to proceed, initialization works just fine and the agent can
// handle future read()/close() syscalls as it wants.
if call.Name == "write" {
return -1, errors.New("SCMP_ACT_NOTIFY cannot be used for the write syscall")
}
}
}
// See comment on why write is not allowed. The same reason applies, as this can mean handling write too.
if defaultAction == libseccomp.ActNotify {
return -1, errors.New("SCMP_ACT_NOTIFY cannot be used as default action")
}
filter, err := libseccomp.NewFilter(defaultAction)
if err != nil {
return -1, fmt.Errorf("error creating filter: %w", err)
}
// Add extra architectures
for _, arch := range config.Architectures {
scmpArch, err := libseccomp.GetArchFromString(arch)
if err != nil {
return -1, fmt.Errorf("error validating Seccomp architecture: %w", err)
}
if err := filter.AddArch(scmpArch); err != nil {
return -1, fmt.Errorf("error adding architecture to seccomp filter: %w", err)
}
}
// Unset no new privs bit
if err := filter.SetNoNewPrivsBit(false); err != nil {
return -1, fmt.Errorf("error setting no new privileges: %w", err)
}
// Add a rule for each syscall
for _, call := range config.Syscalls {
if call == nil {
return -1, errors.New("encountered nil syscall while initializing Seccomp")
}
if err := matchCall(filter, call, defaultAction); err != nil {
return -1, err
}
}
seccompFd, err := patchbpf.PatchAndLoad(config, filter)
if err != nil {
return -1, fmt.Errorf("error loading seccomp filter into kernel: %w", err)
}
return seccompFd, nil
}
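// A hedged sketch of driving InitSeccomp directly with a hand-built config:
// allow everything except mount(2), which returns EPERM. Field names are taken
// from the configs types used in this file; fields not shown keep their zero
// values, and real callers derive this config from the OCI spec instead.
//
//  errnoRet := uint(unix.EPERM)
//  cfg := &configs.Seccomp{
//      DefaultAction: configs.Allow,
//      Architectures: []string{"SCMP_ARCH_X86_64"},
//      Syscalls: []*configs.Syscall{
//          {Name: "mount", Action: configs.Errno, ErrnoRet: &errnoRet},
//      },
//  }
//  seccompFd, err := InitSeccomp(cfg) // returns -1 unless SCMP_ACT_NOTIFY is used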
// Convert Libcontainer Action to Libseccomp ScmpAction
func getAction(act configs.Action, errnoRet *uint) (libseccomp.ScmpAction, error) {
switch act {
case configs.Kill, configs.KillThread:
return libseccomp.ActKillThread, nil
case configs.Errno:
if errnoRet != nil {
return libseccomp.ActErrno.SetReturnCode(int16(*errnoRet)), nil
}
return actErrno, nil
case configs.Trap:
return libseccomp.ActTrap, nil
case configs.Allow:
return libseccomp.ActAllow, nil
case configs.Trace:
if errnoRet != nil {
return libseccomp.ActTrace.SetReturnCode(int16(*errnoRet)), nil
}
return actTrace, nil
case configs.Log:
return libseccomp.ActLog, nil
case configs.Notify:
return libseccomp.ActNotify, nil
case configs.KillProcess:
return libseccomp.ActKillProcess, nil
default:
return libseccomp.ActInvalid, errors.New("invalid action, cannot use in rule")
}
}
// Convert Libcontainer Operator to Libseccomp ScmpCompareOp
func getOperator(op configs.Operator) (libseccomp.ScmpCompareOp, error) {
switch op {
case configs.EqualTo:
return libseccomp.CompareEqual, nil
case configs.NotEqualTo:
return libseccomp.CompareNotEqual, nil
case configs.GreaterThan:
return libseccomp.CompareGreater, nil
case configs.GreaterThanOrEqualTo:
return libseccomp.CompareGreaterEqual, nil
case configs.LessThan:
return libseccomp.CompareLess, nil
case configs.LessThanOrEqualTo:
return libseccomp.CompareLessOrEqual, nil
case configs.MaskEqualTo:
return libseccomp.CompareMaskedEqual, nil
default:
return libseccomp.CompareInvalid, errors.New("invalid operator, cannot use in rule")
}
}
// Convert Libcontainer Arg to Libseccomp ScmpCondition
func getCondition(arg *configs.Arg) (libseccomp.ScmpCondition, error) {
cond := libseccomp.ScmpCondition{}
if arg == nil {
return cond, errors.New("cannot convert nil to syscall condition")
}
op, err := getOperator(arg.Op)
if err != nil {
return cond, err
}
return libseccomp.MakeCondition(arg.Index, op, arg.Value, arg.ValueTwo)
}
// Add a rule to match a single syscall
func matchCall(filter *libseccomp.ScmpFilter, call *configs.Syscall, defAct libseccomp.ScmpAction) error {
if call == nil || filter == nil {
return errors.New("cannot use nil as syscall to block")
}
if len(call.Name) == 0 {
return errors.New("empty string is not a valid syscall")
}
// Convert the call's action to the libseccomp equivalent
callAct, err := getAction(call.Action, call.ErrnoRet)
if err != nil {
return fmt.Errorf("action in seccomp profile is invalid: %w", err)
}
if callAct == defAct {
// This rule is redundant, silently skip it
// to avoid error from AddRule.
return nil
}
// If we can't resolve the syscall, assume it is not supported
// by this kernel. Log it, but don't error out.
callNum, err := libseccomp.GetSyscallFromName(call.Name)
if err != nil {
logrus.Debugf("unknown seccomp syscall %q ignored", call.Name)
return nil
}
// Unconditional match - just add the rule
if len(call.Args) == 0 {
if err := filter.AddRule(callNum, callAct); err != nil {
return fmt.Errorf("error adding seccomp filter rule for syscall %s: %w", call.Name, err)
}
} else {
// If two or more conditions apply to the same argument,
// revert to the old behavior of adding each condition as a separate rule.
argCounts := make([]uint, syscallMaxArguments)
conditions := []libseccomp.ScmpCondition{}
for _, cond := range call.Args {
newCond, err := getCondition(cond)
if err != nil {
return fmt.Errorf("error creating seccomp syscall condition for syscall %s: %w", call.Name, err)
}
argCounts[cond.Index] += 1
conditions = append(conditions, newCond)
}
hasMultipleArgs := false
for _, count := range argCounts {
if count > 1 {
hasMultipleArgs = true
break
}
}
if hasMultipleArgs {
// Revert to old behavior
// Add each condition attached to a separate rule
for _, cond := range conditions {
condArr := []libseccomp.ScmpCondition{cond}
if err := filter.AddRuleConditional(callNum, callAct, condArr); err != nil {
return fmt.Errorf("error adding seccomp rule for syscall %s: %w", call.Name, err)
}
}
} else {
// No two conditions share the same argument,
// so use the new, proper behavior.
if err := filter.AddRuleConditional(callNum, callAct, conditions); err != nil {
return fmt.Errorf("error adding seccomp rule for syscall %s: %w", call.Name, err)
}
}
}
return nil
}
// Version returns major, minor, and micro.
func Version() (uint, uint, uint) {
return libseccomp.GetLibraryVersion()
}
// Enabled is true if seccomp support is compiled in.
const Enabled = true

View File

@@ -1,28 +0,0 @@
//go:build !linux || !cgo || !seccomp
// +build !linux !cgo !seccomp
package seccomp
import (
"errors"
"github.com/opencontainers/runc/libcontainer/configs"
)
var ErrSeccompNotEnabled = errors.New("seccomp: config provided but seccomp not supported")
// InitSeccomp does nothing because seccomp is not supported.
func InitSeccomp(config *configs.Seccomp) (int, error) {
if config != nil {
return -1, ErrSeccompNotEnabled
}
return -1, nil
}
// Version returns major, minor, and micro.
func Version() (uint, uint, uint) {
return 0, 0, 0
}
// Enabled is true if seccomp support is compiled in.
const Enabled = false

View File

@@ -1,149 +0,0 @@
package libcontainer
import (
"errors"
"fmt"
"os"
"os/exec"
"strconv"
"github.com/opencontainers/selinux/go-selinux"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/apparmor"
"github.com/opencontainers/runc/libcontainer/keys"
"github.com/opencontainers/runc/libcontainer/seccomp"
"github.com/opencontainers/runc/libcontainer/system"
"github.com/opencontainers/runc/libcontainer/utils"
)
// linuxSetnsInit performs the container's initialization for running a new process
// inside an existing container.
type linuxSetnsInit struct {
pipe *os.File
consoleSocket *os.File
config *initConfig
logFd int
}
func (l *linuxSetnsInit) getSessionRingName() string {
return "_ses." + l.config.ContainerId
}
func (l *linuxSetnsInit) Init() error {
if !l.config.Config.NoNewKeyring {
if err := selinux.SetKeyLabel(l.config.ProcessLabel); err != nil {
return err
}
defer selinux.SetKeyLabel("") //nolint: errcheck
// Do not inherit the parent's session keyring.
if _, err := keys.JoinSessionKeyring(l.getSessionRingName()); err != nil {
// Same justification as in standard_init_linux.go as to why we
// don't bail on ENOSYS.
//
// TODO(cyphar): And we should have logging here too.
if !errors.Is(err, unix.ENOSYS) {
return fmt.Errorf("unable to join session keyring: %w", err)
}
}
}
if l.config.CreateConsole {
if err := setupConsole(l.consoleSocket, l.config, false); err != nil {
return err
}
if err := system.Setctty(); err != nil {
return err
}
}
if l.config.NoNewPrivileges {
if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
return err
}
}
// Tell our parent that we're ready to exec. This must be done before the
// Seccomp rules have been applied, because we need to be able to read and
// write to a socket.
if err := syncParentReady(l.pipe); err != nil {
return fmt.Errorf("sync ready: %w", err)
}
if err := selinux.SetExecLabel(l.config.ProcessLabel); err != nil {
return err
}
defer selinux.SetExecLabel("") //nolint: errcheck
// Without NoNewPrivileges seccomp is a privileged operation, so we need to
// do this before dropping capabilities; otherwise do it as late as possible
// just before execve so as few syscalls take place after it as possible.
if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
seccompFd, err := seccomp.InitSeccomp(l.config.Config.Seccomp)
if err != nil {
return err
}
if err := syncParentSeccomp(l.pipe, seccompFd); err != nil {
return err
}
}
if err := finalizeNamespace(l.config); err != nil {
return err
}
if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
return err
}
// Check for the arg before waiting to make sure it exists and it is
// returned as a create time error.
name, err := exec.LookPath(l.config.Args[0])
if err != nil {
return err
}
// exec.LookPath in Go < 1.20 might return no error for an executable
// residing on a file system mounted with noexec flag, so perform this
// extra check now while we can still return a proper error.
// TODO: remove this once go < 1.20 is not supported.
if err := eaccess(name); err != nil {
return &os.PathError{Op: "eaccess", Path: name, Err: err}
}
// Set seccomp as close to execve as possible, so as few syscalls take
// place afterward (reducing the amount of syscalls that users need to
// enable in their seccomp profiles).
if l.config.Config.Seccomp != nil && l.config.NoNewPrivileges {
seccompFd, err := seccomp.InitSeccomp(l.config.Config.Seccomp)
if err != nil {
return fmt.Errorf("unable to init seccomp: %w", err)
}
if err := syncParentSeccomp(l.pipe, seccompFd); err != nil {
return err
}
}
logrus.Debugf("setns_init: about to exec")
// Close the log pipe fd so the parent's ForwardLogs can exit.
if err := unix.Close(l.logFd); err != nil {
return &os.PathError{Op: "close log pipe", Path: "fd " + strconv.Itoa(l.logFd), Err: err}
}
// Close all file descriptors we are not passing to the container. This is
// necessary because the execve target could use internal runc fds as the
// execve path, potentially giving access to binary files from the host
// (which can then be opened by container processes, leading to container
// escapes). Note that because this operation will close any open file
// descriptors that are referenced by (*os.File) handles from underneath
// the Go runtime, we must not do any file operations after this point
// (otherwise the (*os.File) finaliser could close the wrong file). See
// CVE-2024-21626 for more information as to why this protection is
// necessary.
//
// This is not needed for runc-dmz, because the extra execve(2) step means
// that all O_CLOEXEC file descriptors have already been closed and thus
// the second execve(2) from runc-dmz cannot access internal file
// descriptors from runc.
if err := utils.UnsafeCloseFrom(l.config.PassedFilesCount + 3); err != nil {
return err
}
return system.Exec(name, l.config.Args[0:], os.Environ())
}

View File

@@ -1,282 +0,0 @@
package libcontainer
import (
"errors"
"fmt"
"os"
"os/exec"
"strconv"
"github.com/opencontainers/runtime-spec/specs-go"
"github.com/opencontainers/selinux/go-selinux"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
"github.com/opencontainers/runc/libcontainer/apparmor"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/keys"
"github.com/opencontainers/runc/libcontainer/seccomp"
"github.com/opencontainers/runc/libcontainer/system"
"github.com/opencontainers/runc/libcontainer/utils"
)
type linuxStandardInit struct {
pipe *os.File
consoleSocket *os.File
parentPid int
fifoFd int
logFd int
mountFds []int
config *initConfig
}
func (l *linuxStandardInit) getSessionRingParams() (string, uint32, uint32) {
var newperms uint32
if l.config.Config.Namespaces.Contains(configs.NEWUSER) {
// With user ns we need 'other' search permissions.
newperms = 0x8
} else {
// Without user ns we need 'UID' search permissions.
newperms = 0x80000
}
// Create a unique per-session container name that we can join in setns;
// however, other containers can also join it.
return "_ses." + l.config.ContainerId, 0xffffffff, newperms
}
func (l *linuxStandardInit) Init() error {
if !l.config.Config.NoNewKeyring {
if err := selinux.SetKeyLabel(l.config.ProcessLabel); err != nil {
return err
}
defer selinux.SetKeyLabel("") //nolint: errcheck
ringname, keepperms, newperms := l.getSessionRingParams()
// Do not inherit the parent's session keyring.
if sessKeyId, err := keys.JoinSessionKeyring(ringname); err != nil {
// If keyrings aren't supported then it is likely we are on an
// older kernel (or inside an LXC container). While we could bail,
// the security feature we are using here is best-effort (it only
// really provides marginal protection since VFS credentials are
// the only significant protection of keyrings).
//
// TODO(cyphar): Log this so people know what's going on, once we
// have proper logging in 'runc init'.
if !errors.Is(err, unix.ENOSYS) {
return fmt.Errorf("unable to join session keyring: %w", err)
}
} else {
// Make session keyring searchable. If we've gotten this far we
// bail on any error -- we don't want to have a keyring with bad
// permissions.
if err := keys.ModKeyringPerm(sessKeyId, keepperms, newperms); err != nil {
return fmt.Errorf("unable to mod keyring permissions: %w", err)
}
}
}
if err := setupNetwork(l.config); err != nil {
return err
}
if err := setupRoute(l.config.Config); err != nil {
return err
}
// initialises the labeling system
selinux.GetEnabled()
// We don't need the mountFds after prepareRootfs() nor if it fails.
err := prepareRootfs(l.pipe, l.config, l.mountFds)
for _, m := range l.mountFds {
if m == -1 {
continue
}
if err := unix.Close(m); err != nil {
return fmt.Errorf("Unable to close mountFds fds: %w", err)
}
}
if err != nil {
return err
}
// Set up the console. This has to be done *before* we finalize the rootfs,
// but *after* we've given the user the chance to set up all of the mounts
// they wanted.
if l.config.CreateConsole {
if err := setupConsole(l.consoleSocket, l.config, true); err != nil {
return err
}
if err := system.Setctty(); err != nil {
return &os.SyscallError{Syscall: "ioctl(setctty)", Err: err}
}
}
// Finish the rootfs setup.
if l.config.Config.Namespaces.Contains(configs.NEWNS) {
if err := finalizeRootfs(l.config.Config); err != nil {
return err
}
}
if hostname := l.config.Config.Hostname; hostname != "" {
if err := unix.Sethostname([]byte(hostname)); err != nil {
return &os.SyscallError{Syscall: "sethostname", Err: err}
}
}
if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
return fmt.Errorf("unable to apply apparmor profile: %w", err)
}
for key, value := range l.config.Config.Sysctl {
if err := writeSystemProperty(key, value); err != nil {
return err
}
}
for _, path := range l.config.Config.ReadonlyPaths {
if err := readonlyPath(path); err != nil {
return fmt.Errorf("can't make %q read-only: %w", path, err)
}
}
for _, path := range l.config.Config.MaskPaths {
if err := maskPath(path, l.config.Config.MountLabel); err != nil {
return fmt.Errorf("can't mask path %s: %w", path, err)
}
}
pdeath, err := system.GetParentDeathSignal()
if err != nil {
return fmt.Errorf("can't get pdeath signal: %w", err)
}
if l.config.NoNewPrivileges {
if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
return &os.SyscallError{Syscall: "prctl(SET_NO_NEW_PRIVS)", Err: err}
}
}
// Tell our parent that we're ready to exec. This must be done before the
// Seccomp rules have been applied, because we need to be able to read and
// write to a socket.
if err := syncParentReady(l.pipe); err != nil {
return fmt.Errorf("sync ready: %w", err)
}
if err := selinux.SetExecLabel(l.config.ProcessLabel); err != nil {
return fmt.Errorf("can't set process label: %w", err)
}
defer selinux.SetExecLabel("") //nolint: errcheck
// Without NoNewPrivileges seccomp is a privileged operation, so we need to
// do this before dropping capabilities; otherwise do it as late as possible
// just before execve so as few syscalls take place after it as possible.
if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
seccompFd, err := seccomp.InitSeccomp(l.config.Config.Seccomp)
if err != nil {
return err
}
if err := syncParentSeccomp(l.pipe, seccompFd); err != nil {
return err
}
}
if err := finalizeNamespace(l.config); err != nil {
return err
}
// finalizeNamespace can change user/group which clears the parent death
// signal, so we restore it here.
if err := pdeath.Restore(); err != nil {
return fmt.Errorf("can't restore pdeath signal: %w", err)
}
// Compare the parent from the initial start of the init process and make
// sure that it did not change. If the parent changed, that means it died
// and we were reparented to something else, so we should just kill ourselves
// and not cause problems for someone else.
if unix.Getppid() != l.parentPid {
return unix.Kill(unix.Getpid(), unix.SIGKILL)
}
// Check for the arg before waiting to make sure it exists and it is
// returned as a create time error.
name, err := exec.LookPath(l.config.Args[0])
if err != nil {
return err
}
// exec.LookPath in Go < 1.20 might return no error for an executable
// residing on a file system mounted with noexec flag, so perform this
// extra check now while we can still return a proper error.
// TODO: remove this once go < 1.20 is not supported.
if err := eaccess(name); err != nil {
return &os.PathError{Op: "eaccess", Path: name, Err: err}
}
// Set seccomp as close to execve as possible, so as few syscalls take
// place afterward (reducing the amount of syscalls that users need to
// enable in their seccomp profiles). However, this needs to be done
// before closing the pipe since we need it to pass the seccompFd to
// the parent.
if l.config.Config.Seccomp != nil && l.config.NoNewPrivileges {
seccompFd, err := seccomp.InitSeccomp(l.config.Config.Seccomp)
if err != nil {
return fmt.Errorf("unable to init seccomp: %w", err)
}
if err := syncParentSeccomp(l.pipe, seccompFd); err != nil {
return err
}
}
// Close the pipe to signal that we have completed our init.
logrus.Debugf("init: closing the pipe to signal completion")
_ = l.pipe.Close()
// Close the log pipe fd so the parent's ForwardLogs can exit.
if err := unix.Close(l.logFd); err != nil {
return &os.PathError{Op: "close log pipe", Path: "fd " + strconv.Itoa(l.logFd), Err: err}
}
// Wait for the FIFO to be opened on the other side before exec-ing the
// user process. We open it through /proc/self/fd/$fd, because the fd that
// was given to us was an O_PATH fd to the fifo itself. Linux allows us to
// re-open an O_PATH fd through /proc.
fifoPath := "/proc/self/fd/" + strconv.Itoa(l.fifoFd)
fd, err := unix.Open(fifoPath, unix.O_WRONLY|unix.O_CLOEXEC, 0)
if err != nil {
return &os.PathError{Op: "open exec fifo", Path: fifoPath, Err: err}
}
if _, err := unix.Write(fd, []byte("0")); err != nil {
return &os.PathError{Op: "write exec fifo", Path: fifoPath, Err: err}
}
// Close the O_PATH fifo fd before exec because the kernel resets
// dumpable in the wrong order. This has been fixed in newer kernels, but
// we keep this to ensure CVE-2016-9962 doesn't re-emerge on older kernels.
// N.B. the core issue itself (passing dirfds to the host filesystem) has
// since been resolved.
// https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
_ = unix.Close(l.fifoFd)
s := l.config.SpecState
s.Pid = unix.Getpid()
s.Status = specs.StateCreated
if err := l.config.Config.Hooks[configs.StartContainer].RunHooks(s); err != nil {
return err
}
// Close all file descriptors we are not passing to the container. This is
// necessary because the execve target could use internal runc fds as the
// execve path, potentially giving access to binary files from the host
// (which can then be opened by container processes, leading to container
// escapes). Note that because this operation will close any open file
// descriptors that are referenced by (*os.File) handles from underneath
// the Go runtime, we must not do any file operations after this point
// (otherwise the (*os.File) finaliser could close the wrong file). See
// CVE-2024-21626 for more information as to why this protection is
// necessary.
//
// This is not needed for runc-dmz, because the extra execve(2) step means
// that all O_CLOEXEC file descriptors have already been closed and thus
// the second execve(2) from runc-dmz cannot access internal file
// descriptors from runc.
if err := utils.UnsafeCloseFrom(l.config.PassedFilesCount + 3); err != nil {
return err
}
return system.Exec(name, l.config.Args[0:], os.Environ())
}

View File

@@ -1,243 +0,0 @@
package libcontainer
import (
"fmt"
"os"
"path/filepath"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runtime-spec/specs-go"
"github.com/sirupsen/logrus"
"golang.org/x/sys/unix"
)
func newStateTransitionError(from, to containerState) error {
return &stateTransitionError{
From: from.status().String(),
To: to.status().String(),
}
}
// stateTransitionError is returned when an invalid state transition happens from one
// state to another.
type stateTransitionError struct {
From string
To string
}
func (s *stateTransitionError) Error() string {
return fmt.Sprintf("invalid state transition from %s to %s", s.From, s.To)
}
type containerState interface {
transition(containerState) error
destroy() error
status() Status
}
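// markRunning is an illustrative sketch, not part of the upstream file: it
// shows how callers drive the containerState machine defined above. The
// current state decides whether the move is legal; illegal moves return a
// stateTransitionError.
func markRunning(c *linuxContainer) error {
	return c.state.transition(&runningState{c: c})
}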
func destroy(c *linuxContainer) error {
if !c.config.Namespaces.Contains(configs.NEWPID) ||
c.config.Namespaces.PathOf(configs.NEWPID) != "" {
if err := signalAllProcesses(c.cgroupManager, unix.SIGKILL); err != nil {
logrus.Warn(err)
}
}
err := c.cgroupManager.Destroy()
if c.intelRdtManager != nil {
if ierr := c.intelRdtManager.Destroy(); err == nil {
err = ierr
}
}
if rerr := os.RemoveAll(c.root); err == nil {
err = rerr
}
c.initProcess = nil
if herr := runPoststopHooks(c); err == nil {
err = herr
}
c.state = &stoppedState{c: c}
return err
}
func runPoststopHooks(c *linuxContainer) error {
hooks := c.config.Hooks
if hooks == nil {
return nil
}
s, err := c.currentOCIState()
if err != nil {
return err
}
s.Status = specs.StateStopped
if err := hooks[configs.Poststop].RunHooks(s); err != nil {
return err
}
return nil
}
// stoppedState represents a container in a stopped/destroyed state.
type stoppedState struct {
c *linuxContainer
}
func (b *stoppedState) status() Status {
return Stopped
}
func (b *stoppedState) transition(s containerState) error {
switch s.(type) {
case *runningState, *restoredState:
b.c.state = s
return nil
case *stoppedState:
return nil
}
return newStateTransitionError(b, s)
}
func (b *stoppedState) destroy() error {
return destroy(b.c)
}
// runningState represents a container that is currently running.
type runningState struct {
c *linuxContainer
}
func (r *runningState) status() Status {
return Running
}
func (r *runningState) transition(s containerState) error {
switch s.(type) {
case *stoppedState:
if r.c.runType() == Running {
return ErrRunning
}
r.c.state = s
return nil
case *pausedState:
r.c.state = s
return nil
case *runningState:
return nil
}
return newStateTransitionError(r, s)
}
func (r *runningState) destroy() error {
if r.c.runType() == Running {
return ErrRunning
}
return destroy(r.c)
}
type createdState struct {
c *linuxContainer
}
func (i *createdState) status() Status {
return Created
}
func (i *createdState) transition(s containerState) error {
switch s.(type) {
case *runningState, *pausedState, *stoppedState:
i.c.state = s
return nil
case *createdState:
return nil
}
return newStateTransitionError(i, s)
}
func (i *createdState) destroy() error {
_ = i.c.initProcess.signal(unix.SIGKILL)
return destroy(i.c)
}
// pausedState represents a container that is currently paused. It cannot be
// destroyed in a paused state and must transition back to running first.
type pausedState struct {
c *linuxContainer
}
func (p *pausedState) status() Status {
return Paused
}
func (p *pausedState) transition(s containerState) error {
switch s.(type) {
case *runningState, *stoppedState:
p.c.state = s
return nil
case *pausedState:
return nil
}
return newStateTransitionError(p, s)
}
func (p *pausedState) destroy() error {
t := p.c.runType()
if t != Running && t != Created {
if err := p.c.cgroupManager.Freeze(configs.Thawed); err != nil {
return err
}
return destroy(p.c)
}
return ErrPaused
}
// restoredState is the same as the running state, but also has associated
// checkpoint information that may need to be destroyed when the container is
// stopped and destroy is called.
type restoredState struct {
imageDir string
c *linuxContainer
}
func (r *restoredState) status() Status {
return Running
}
func (r *restoredState) transition(s containerState) error {
switch s.(type) {
case *stoppedState, *runningState:
return nil
}
return newStateTransitionError(r, s)
}
func (r *restoredState) destroy() error {
if _, err := os.Stat(filepath.Join(r.c.root, "checkpoint")); err != nil {
if !os.IsNotExist(err) {
return err
}
}
return destroy(r.c)
}
// loadedState is used whenever a container is restored, loaded, or is having
// additional processes set up inside it; it should not be destroyed when it
// is exiting.
type loadedState struct {
c *linuxContainer
s Status
}
func (n *loadedState) status() Status {
return n.s
}
func (n *loadedState) transition(s containerState) error {
n.c.state = s
return nil
}
func (n *loadedState) destroy() error {
if err := n.c.refreshState(); err != nil {
return err
}
return n.c.state.destroy()
}

View File

@@ -1,13 +0,0 @@
package libcontainer
import (
"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/intelrdt"
"github.com/opencontainers/runc/types"
)
type Stats struct {
Interfaces []*types.NetworkInterface
CgroupStats *cgroups.Stats
IntelRdtStats *intelrdt.Stats
}

View File

@@ -1,126 +0,0 @@
package libcontainer
import (
"encoding/json"
"errors"
"fmt"
"io"
"github.com/opencontainers/runc/libcontainer/utils"
)
type syncType string
// Constants that are used for synchronisation between the parent and child
// during container setup. They come in pairs (with procError being a generic
// response which is followed by an &initError).
//
// [ child ] <-> [ parent ]
//
// procHooks --> [run hooks]
// <-- procResume
//
// procReady --> [final setup]
// <-- procRun
//
// procSeccomp --> [pick up seccomp fd with pidfd_getfd()]
// <-- procSeccompDone
const (
procError syncType = "procError"
procReady syncType = "procReady"
procRun syncType = "procRun"
procHooks syncType = "procHooks"
procResume syncType = "procResume"
procSeccomp syncType = "procSeccomp"
procSeccompDone syncType = "procSeccompDone"
)
type syncT struct {
Type syncType `json:"type"`
Fd int `json:"fd"`
}
// initError is used to wrap errors for passing them via JSON,
// as encoding/json can't unmarshal into error type.
type initError struct {
Message string `json:"message,omitempty"`
}
func (i initError) Error() string {
return i.Message
}
// writeSync is used to write to a synchronisation pipe. An error is returned
// if there was a problem writing the payload.
func writeSync(pipe io.Writer, sync syncType) error {
return writeSyncWithFd(pipe, sync, -1)
}
// writeSyncWithFd is used to write to a synchronisation pipe. An error is
// returned if there was a problem writing the payload.
func writeSyncWithFd(pipe io.Writer, sync syncType, fd int) error {
if err := utils.WriteJSON(pipe, syncT{sync, fd}); err != nil {
return fmt.Errorf("writing syncT %q: %w", string(sync), err)
}
return nil
}
// readSync is used to read from a synchronisation pipe. An error is returned
// if we got an initError, the pipe was closed, or we got an unexpected flag.
func readSync(pipe io.Reader, expected syncType) error {
var procSync syncT
if err := json.NewDecoder(pipe).Decode(&procSync); err != nil {
if errors.Is(err, io.EOF) {
return errors.New("parent closed synchronisation channel")
}
return fmt.Errorf("failed reading error from parent: %w", err)
}
if procSync.Type == procError {
var ierr initError
if err := json.NewDecoder(pipe).Decode(&ierr); err != nil {
return fmt.Errorf("failed reading error from parent: %w", err)
}
return &ierr
}
if procSync.Type != expected {
return errors.New("invalid synchronisation flag from parent")
}
return nil
}
// parseSync runs the given callback function on each syncT received from the
// child. It will return once io.EOF is returned from the given pipe.
func parseSync(pipe io.Reader, fn func(*syncT) error) error {
dec := json.NewDecoder(pipe)
for {
var sync syncT
if err := dec.Decode(&sync); err != nil {
if errors.Is(err, io.EOF) {
break
}
return err
}
// We handle this case outside fn for cleanliness reasons.
var ierr *initError
if sync.Type == procError {
if err := dec.Decode(&ierr); err != nil && !errors.Is(err, io.EOF) {
return fmt.Errorf("error decoding proc error from init: %w", err)
}
if ierr != nil {
return ierr
}
// Programmer error.
panic("No error following JSON procError payload.")
}
if err := fn(&sync); err != nil {
return err
}
}
return nil
}
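// handleChildSync is an illustrative sketch, not part of the upstream file: it
// shows the parent side of the handshake documented at the top of this file.
// parseSync feeds every syncT from the child to the callback; replying with
// the paired flag (procReady -> procRun, procHooks -> procResume) lets the
// child continue. The pipe is any read/write end of the synchronisation
// socket.
func handleChildSync(pipe io.ReadWriter) error {
	return parseSync(pipe, func(sync *syncT) error {
		switch sync.Type {
		case procReady:
			// The child finished its final setup; tell it to exec.
			return writeSync(pipe, procRun)
		case procHooks:
			// The real parent would run hooks here; just resume the child.
			return writeSync(pipe, procResume)
		default:
			return fmt.Errorf("unexpected sync type %q", string(sync.Type))
		}
	})
}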

View File

@@ -1,155 +0,0 @@
package types
import "github.com/opencontainers/runc/libcontainer/intelrdt"
// Event struct for encoding the event data to json.
type Event struct {
Type string `json:"type"`
ID string `json:"id"`
Data interface{} `json:"data,omitempty"`
}
// Stats is the runc-specific stats structure, kept stable for encoding and
// decoding stats.
type Stats struct {
CPU Cpu `json:"cpu"`
CPUSet CPUSet `json:"cpuset"`
Memory Memory `json:"memory"`
Pids Pids `json:"pids"`
Blkio Blkio `json:"blkio"`
Hugetlb map[string]Hugetlb `json:"hugetlb"`
IntelRdt IntelRdt `json:"intel_rdt"`
NetworkInterfaces []*NetworkInterface `json:"network_interfaces"`
}
type Hugetlb struct {
Usage uint64 `json:"usage,omitempty"`
Max uint64 `json:"max,omitempty"`
Failcnt uint64 `json:"failcnt"`
}
type BlkioEntry struct {
Major uint64 `json:"major,omitempty"`
Minor uint64 `json:"minor,omitempty"`
Op string `json:"op,omitempty"`
Value uint64 `json:"value,omitempty"`
}
type Blkio struct {
IoServiceBytesRecursive []BlkioEntry `json:"ioServiceBytesRecursive,omitempty"`
IoServicedRecursive []BlkioEntry `json:"ioServicedRecursive,omitempty"`
IoQueuedRecursive []BlkioEntry `json:"ioQueueRecursive,omitempty"`
IoServiceTimeRecursive []BlkioEntry `json:"ioServiceTimeRecursive,omitempty"`
IoWaitTimeRecursive []BlkioEntry `json:"ioWaitTimeRecursive,omitempty"`
IoMergedRecursive []BlkioEntry `json:"ioMergedRecursive,omitempty"`
IoTimeRecursive []BlkioEntry `json:"ioTimeRecursive,omitempty"`
SectorsRecursive []BlkioEntry `json:"sectorsRecursive,omitempty"`
}
type Pids struct {
Current uint64 `json:"current,omitempty"`
Limit uint64 `json:"limit,omitempty"`
}
type Throttling struct {
Periods uint64 `json:"periods,omitempty"`
ThrottledPeriods uint64 `json:"throttledPeriods,omitempty"`
ThrottledTime uint64 `json:"throttledTime,omitempty"`
}
type CpuUsage struct {
// Units: nanoseconds.
Total uint64 `json:"total,omitempty"`
Percpu []uint64 `json:"percpu,omitempty"`
PercpuKernel []uint64 `json:"percpu_kernel,omitempty"`
PercpuUser []uint64 `json:"percpu_user,omitempty"`
Kernel uint64 `json:"kernel"`
User uint64 `json:"user"`
}
type Cpu struct {
Usage CpuUsage `json:"usage,omitempty"`
Throttling Throttling `json:"throttling,omitempty"`
}
type CPUSet struct {
CPUs []uint16 `json:"cpus,omitempty"`
CPUExclusive uint64 `json:"cpu_exclusive"`
Mems []uint16 `json:"mems,omitempty"`
MemHardwall uint64 `json:"mem_hardwall"`
MemExclusive uint64 `json:"mem_exclusive"`
MemoryMigrate uint64 `json:"memory_migrate"`
MemorySpreadPage uint64 `json:"memory_spread_page"`
MemorySpreadSlab uint64 `json:"memory_spread_slab"`
MemoryPressure uint64 `json:"memory_pressure"`
SchedLoadBalance uint64 `json:"sched_load_balance"`
SchedRelaxDomainLevel int64 `json:"sched_relax_domain_level"`
}
type MemoryEntry struct {
Limit uint64 `json:"limit"`
Usage uint64 `json:"usage,omitempty"`
Max uint64 `json:"max,omitempty"`
Failcnt uint64 `json:"failcnt"`
}
type Memory struct {
Cache uint64 `json:"cache,omitempty"`
Usage MemoryEntry `json:"usage,omitempty"`
Swap MemoryEntry `json:"swap,omitempty"`
Kernel MemoryEntry `json:"kernel,omitempty"`
KernelTCP MemoryEntry `json:"kernelTCP,omitempty"`
Raw map[string]uint64 `json:"raw,omitempty"`
}
type L3CacheInfo struct {
CbmMask string `json:"cbm_mask,omitempty"`
MinCbmBits uint64 `json:"min_cbm_bits,omitempty"`
NumClosids uint64 `json:"num_closids,omitempty"`
}
type MemBwInfo struct {
BandwidthGran uint64 `json:"bandwidth_gran,omitempty"`
DelayLinear uint64 `json:"delay_linear,omitempty"`
MinBandwidth uint64 `json:"min_bandwidth,omitempty"`
NumClosids uint64 `json:"num_closids,omitempty"`
}
type IntelRdt struct {
// The read-only L3 cache information
L3CacheInfo *L3CacheInfo `json:"l3_cache_info,omitempty"`
// The read-only L3 cache schema in root
L3CacheSchemaRoot string `json:"l3_cache_schema_root,omitempty"`
// The L3 cache schema in 'container_id' group
L3CacheSchema string `json:"l3_cache_schema,omitempty"`
// The read-only memory bandwidth information
MemBwInfo *MemBwInfo `json:"mem_bw_info,omitempty"`
// The read-only memory bandwidth schema in root
MemBwSchemaRoot string `json:"mem_bw_schema_root,omitempty"`
// The memory bandwidth schema in 'container_id' group
MemBwSchema string `json:"mem_bw_schema,omitempty"`
// The memory bandwidth monitoring statistics from NUMA nodes in 'container_id' group
MBMStats *[]intelrdt.MBMNumaNodeStats `json:"mbm_stats,omitempty"`
// The cache monitoring technology statistics from NUMA nodes in 'container_id' group
CMTStats *[]intelrdt.CMTNumaNodeStats `json:"cmt_stats,omitempty"`
}
type NetworkInterface struct {
// Name is the name of the network interface.
Name string
RxBytes uint64
RxPackets uint64
RxErrors uint64
RxDropped uint64
TxBytes uint64
TxPackets uint64
TxErrors uint64
TxDropped uint64
}
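// encodeStatsEvent is an illustrative sketch, not part of the upstream file
// (it assumes "encoding/json" and "io" are added to this file's imports). It
// shows the shape of the JSON stream that wraps Stats in the Event envelope
// above: one object per line with "type", "id" and a "data" payload.
func encodeStatsEvent(w io.Writer, id string, s *Stats) error {
	return json.NewEncoder(w).Encode(Event{
		Type: "stats",
		ID:   id,
		Data: s,
	})
}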

View File

@@ -1,4 +0,0 @@
*~
*.swp
*.orig
tags

View File

@@ -1,4 +0,0 @@
# For documentation, see https://golangci-lint.run/usage/configuration/
linters:
enable:
- gofumpt

View File

@@ -1,42 +0,0 @@
libseccomp-golang: Releases
===============================================================================
https://github.com/seccomp/libseccomp-golang
* Version 0.10.0 - June 9, 2022
- Minimum supported version of libseccomp bumped to v2.3.1
- Add seccomp userspace notification API (ActNotify, filter.*Notif*)
- Add filter.{Get,Set}SSB (to support SCMP_FLTATR_CTL_SSB)
- Add filter.{Get,Set}Optimize (to support SCMP_FLTATR_CTL_OPTIMIZE)
- Add filter.{Get,Set}RawRC (to support SCMP_FLTATR_API_SYSRAWRC)
- Add ArchPARISC, ArchPARISC64, ArchRISCV64
- Add ActKillProcess and ActKillThread; deprecate ActKill
- Add go module support
- Return ErrSyscallDoesNotExist when unable to resolve a syscall
- Fix some functions to check for both kernel level API and libseccomp version
- Fix MakeCondition to use sanitizeCompareOp
- Fix AddRule to handle EACCES (from libseccomp >= 2.5.0)
- Updated the main docs and converted to README.md
- Added CONTRIBUTING.md, SECURITY.md, and administrative docs under doc/admin
- Add GitHub action CI, enable more linters
- test: test against various libseccomp versions
- test: fix and simplify execInSubprocess
- test: fix APILevelIsSupported
- Refactor the Errno(-1 * retCode) pattern
- Refactor/unify libseccomp version / API level checks
- Code cleanups (linter, formatting, spelling fixes)
- Cleanup: use errors.New instead of fmt.Errorf where appropriate
- Cleanup: remove duplicated cgo stuff, redundant linux build tag
* Version 0.9.1 - May 21, 2019
- Minimum supported version of libseccomp bumped to v2.2.0
- Use Libseccomp's `seccomp_version` API to retrieve library version
- Unconditionally set TSync attribute for filters, due to Go's heavily threaded nature
- Fix CVE-2017-18367 - Multiple syscall arguments were incorrectly combined with logical-OR, instead of logical-AND
- Fix a failure to build on Debian-based distributions due to CGo code
- Fix unit test failures on 32-bit architectures
- Improve several errors to be more verbose about their causes
- Add support for SCMP_ACT_LOG (with libseccomp versions 2.4.x and higher), permitting syscalls but logging their execution
- Add support for SCMP_FLTATR_CTL_LOG (with libseccomp versions 2.4.x and higher), logging not-allowed actions when they are denied
* Version 0.9.0 - January 5, 2017
- Initial tagged release

View File

@@ -1,120 +0,0 @@
How to Submit Patches to the libseccomp-golang Project
===============================================================================
https://github.com/seccomp/libseccomp-golang
This document is intended to act as a guide to help you contribute to the
libseccomp-golang project. It is not perfect, and there will always be
exceptions to the rules described here, but by following the instructions below
you should have a much easier time getting your work merged with the upstream
project.
## Test Your Code Using Existing Tests
A number of test and lint related recipes are provided in the Makefile. If
you want to run the standard regression tests, you can execute the following:
# make check
To run the lint checks, the 'golangci-lint' tool is needed; it can be found at:
* https://github.com/golangci/golangci-lint
## Add New Tests for New Functionality
Any submissions which add functionality, or significantly change the existing
code, should include additional tests to verify the proper operation of the
proposed changes.
## Explain Your Work
At the top of every patch you should include a description of the problem you
are trying to solve, how you solved it, and why you chose the solution you
implemented. If you are submitting a bug fix, it is also incredibly helpful
if you can describe/include a reproducer for the problem in the description as
well as instructions on how to test for the bug and verify that it has been
fixed.
## Sign Your Work
The sign-off is a simple line at the end of the patch description, which
certifies that you wrote it or otherwise have the right to pass it on as an
open-source patch. The "Developer's Certificate of Origin" pledge is taken
from the Linux Kernel and the rules are pretty simple:
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
... then you just add a line to the bottom of your patch description, with
your real name, saying:
Signed-off-by: Random J Developer <random@developer.example.org>
You can add this to your commit description in `git` with `git commit -s`
## Post Your Patches Upstream
The libseccomp project accepts both GitHub pull requests and patches sent via
the mailing list. GitHub pull requests are preferred. The sections below
explain how to contribute via either method. Please read each step and perform
all steps that apply to your chosen contribution method.
### Submitting via Email
Depending on how you decided to work with the libseccomp code base and what
tools you are using there are different ways to generate your patch(es).
However, regardless of what tools you use, you should always generate your
patches using the "unified" diff/patch format and the patches should always
apply to the libseccomp source tree using the following command from the top
directory of the libseccomp sources:
# patch -p1 < changes.patch
If you are not using git, stacked git (stgit), or some other tool which can
generate patch files for you automatically, you may find the following command
helpful in generating patches, where "libseccomp.orig/" is the unmodified
source code directory and "libseccomp/" is the source code directory with your
changes:
# diff -purN libseccomp.orig/ libseccomp/
When in doubt please generate your patch and try applying it to an unmodified
copy of the libseccomp sources; if it fails for you, it will fail for the rest
of us.
Finally, you will need to email your patches to the mailing list so they can
be reviewed and potentially merged into the main libseccomp repository. When
sending patches to the mailing list it is important to send your email in text
form, no HTML mail please, and ensure that your email client does not mangle
your patches. It should be possible to save your raw email to disk and apply
it directly to the libseccomp source code; if that fails then you likely have
a problem with your email client. When in doubt try a test first by sending
yourself an email with your patch and attempting to apply the emailed patch to
the libseccomp repository; if it fails for you, it will fail for the rest of
us trying to test your patch and include it in the main libseccomp repository.
### Submitting via GitHub
See [this guide](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request) if you've never done this before.

View File

@@ -1,22 +0,0 @@
Copyright (c) 2015 Matthew Heon <mheon@redhat.com>
Copyright (c) 2015 Paul Moore <pmoore@redhat.com>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -1,31 +0,0 @@
# libseccomp-golang
.PHONY: all check check-build check-syntax fix-syntax vet test lint
all: check-build
check: lint test
check-build:
go build
check-syntax:
gofmt -d .
fix-syntax:
gofmt -w .
vet:
go vet -v ./...
# Previous bugs have made the tests freeze until the timeout. Golang default
# timeout for tests is 10 minutes, which is too long, considering current tests
# can be executed in less than 1 second. Reduce the timeout, so problems can
# be noticed earlier in the CI.
TEST_TIMEOUT=10s
test:
go test -v -timeout $(TEST_TIMEOUT)
lint:
golangci-lint run .

View File

@@ -1,59 +0,0 @@
![libseccomp Golang Bindings](https://github.com/seccomp/libseccomp-artwork/blob/main/logo/libseccomp-color_text.png)
===============================================================================
https://github.com/seccomp/libseccomp-golang
[![Go Reference](https://pkg.go.dev/badge/github.com/seccomp/libseccomp-golang.svg)](https://pkg.go.dev/github.com/seccomp/libseccomp-golang)
[![validate](https://github.com/seccomp/libseccomp-golang/actions/workflows/validate.yml/badge.svg)](https://github.com/seccomp/libseccomp-golang/actions/workflows/validate.yml)
[![test](https://github.com/seccomp/libseccomp-golang/actions/workflows/test.yml/badge.svg)](https://github.com/seccomp/libseccomp-golang/actions/workflows/test.yml)
The libseccomp library provides an easy to use, platform independent, interface
to the Linux Kernel's syscall filtering mechanism. The libseccomp API is
designed to abstract away the underlying BPF based syscall filter language and
present a more conventional function-call based filtering interface that should
be familiar to, and easily adopted by, application developers.
The libseccomp-golang library provides a Go based interface to the libseccomp
library.
## Online Resources
The library source repository currently lives on GitHub at the following URLs:
* https://github.com/seccomp/libseccomp-golang
* https://github.com/seccomp/libseccomp
Documentation for this package is also available at:
* https://pkg.go.dev/github.com/seccomp/libseccomp-golang
## Verifying Releases
Starting with libseccomp-golang v0.10.0, the git tag corresponding to each
release should be signed by one of the libseccomp-golang maintainers. It is
recommended that before use you verify the release tags using the following
command:
% git tag -v <tag>
At present, only the following keys, specified via the fingerprints below, are
authorized to sign official libseccomp-golang release tags:
Paul Moore <paul@paul-moore.com>
7100 AADF AE6E 6E94 0D2E 0AD6 55E4 5A5A E8CA 7C8A
Tom Hromatka <tom.hromatka@oracle.com>
47A6 8FCE 37C7 D702 4FD6 5E11 356C E62C 2B52 4099
Kir Kolyshkin <kolyshkin@gmail.com>
C242 8CD7 5720 FACD CF76 B6EA 17DE 5ECB 75A1 100E
More information on GnuPG and git tag verification can be found at their
respective websites: https://git-scm.com/docs/git and https://gnupg.org.
## Installing the package
% go get github.com/seccomp/libseccomp-golang
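A minimal usage sketch (illustrative only; error handling is elided and the
blocked syscall is just an example): create a filter whose default action
allows everything, add a rule that makes one syscall fail with EPERM, and
load the filter into the current process:

	package main

	import seccomp "github.com/seccomp/libseccomp-golang"

	func main() {
		filter, _ := seccomp.NewFilter(seccomp.ActAllow)
		defer filter.Release()
		call, _ := seccomp.GetSyscallFromName("chroot")
		_ = filter.AddRule(call, seccomp.ActErrno.SetReturnCode(1)) // errno 1 == EPERM
		_ = filter.Load() // apply to the current process and its threads
	}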
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md).

View File

@@ -1,48 +0,0 @@
The libseccomp-golang Security Vulnerability Handling Process
===============================================================================
https://github.com/seccomp/libseccomp-golang
This document attempts to describe the process through which
sensitive security relevant bugs can be responsibly disclosed to the
libseccomp-golang project and how the project maintainers should handle these
reports. Just like the other libseccomp-golang process documents, this
document should be treated as a guiding document and not a hard, unyielding set
of regulations; the bug reporters and project maintainers are encouraged to
work together to address the issues as best they can, in a manner which works
best for all parties involved.
### Reporting Problems
Problems with the libseccomp-golang library that are not suitable for immediate
public disclosure should be emailed to the current libseccomp-golang
maintainers; the list is below. We typically request at most a 90-day time
period to address the issue before it is made public, but we will make every
effort to address the issue as quickly as possible and shorten the disclosure
window.
* Paul Moore, paul@paul-moore.com
* Tom Hromatka, tom.hromatka@oracle.com
* Kir Kolyshkin, kolyshkin@gmail.com
### Resolving Sensitive Security Issues
Upon disclosure of a bug, the maintainers should work together to investigate
the problem and decide on a solution. In order to prevent an early disclosure
of the problem, those working on the solution should do so privately and
outside of the traditional libseccomp-golang development practices. One
possible solution to this is to leverage the GitHub "Security" functionality to
create a private development fork that can be shared among the maintainers, and
optionally the reporter. A placeholder GitHub issue may be created, but
details should remain extremely limited until such time as the problem has been
fixed and responsibly disclosed. If a CVE, or other tag, has been assigned to
the problem, the GitHub issue title should include the vulnerability tag once
the problem has been disclosed.
### Public Disclosure
Whenever possible, responsible reporting and patching practices should be
followed, including notification to the linux-distros and oss-security mailing
lists.
* https://oss-security.openwall.org/wiki/mailing-lists/distros
* https://oss-security.openwall.org/wiki/mailing-lists/oss-security

File diff suppressed because it is too large

View File

@@ -1,884 +0,0 @@
// Internal functions for libseccomp Go bindings
// No exported functions
package seccomp
import (
"errors"
"fmt"
"syscall"
)
// Unexported C wrapping code - provides the C-Golang interface
// Get the seccomp header in scope
// Need stdlib.h for free() on cstrings
// To compile libseccomp-golang against a specific version of libseccomp:
// cd ../libseccomp && mkdir -p prefix
// ./configure --prefix=$PWD/prefix && make && make install
// cd ../libseccomp-golang
// PKG_CONFIG_PATH=$PWD/../libseccomp/prefix/lib/pkgconfig/ make
// LD_PRELOAD=$PWD/../libseccomp/prefix/lib/libseccomp.so.2.5.0 PKG_CONFIG_PATH=$PWD/../libseccomp/prefix/lib/pkgconfig/ make test
// #cgo pkg-config: libseccomp
/*
#include <errno.h>
#include <stdlib.h>
#include <seccomp.h>
#if (SCMP_VER_MAJOR < 2) || \
(SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR < 3) || \
(SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR == 3 && SCMP_VER_MICRO < 1)
#error This package requires libseccomp >= v2.3.1
#endif
#define ARCH_BAD ~0
const uint32_t C_ARCH_BAD = ARCH_BAD;
#ifndef SCMP_ARCH_PPC
#define SCMP_ARCH_PPC ARCH_BAD
#endif
#ifndef SCMP_ARCH_PPC64
#define SCMP_ARCH_PPC64 ARCH_BAD
#endif
#ifndef SCMP_ARCH_PPC64LE
#define SCMP_ARCH_PPC64LE ARCH_BAD
#endif
#ifndef SCMP_ARCH_S390
#define SCMP_ARCH_S390 ARCH_BAD
#endif
#ifndef SCMP_ARCH_S390X
#define SCMP_ARCH_S390X ARCH_BAD
#endif
#ifndef SCMP_ARCH_PARISC
#define SCMP_ARCH_PARISC ARCH_BAD
#endif
#ifndef SCMP_ARCH_PARISC64
#define SCMP_ARCH_PARISC64 ARCH_BAD
#endif
#ifndef SCMP_ARCH_RISCV64
#define SCMP_ARCH_RISCV64 ARCH_BAD
#endif
const uint32_t C_ARCH_NATIVE = SCMP_ARCH_NATIVE;
const uint32_t C_ARCH_X86 = SCMP_ARCH_X86;
const uint32_t C_ARCH_X86_64 = SCMP_ARCH_X86_64;
const uint32_t C_ARCH_X32 = SCMP_ARCH_X32;
const uint32_t C_ARCH_ARM = SCMP_ARCH_ARM;
const uint32_t C_ARCH_AARCH64 = SCMP_ARCH_AARCH64;
const uint32_t C_ARCH_MIPS = SCMP_ARCH_MIPS;
const uint32_t C_ARCH_MIPS64 = SCMP_ARCH_MIPS64;
const uint32_t C_ARCH_MIPS64N32 = SCMP_ARCH_MIPS64N32;
const uint32_t C_ARCH_MIPSEL = SCMP_ARCH_MIPSEL;
const uint32_t C_ARCH_MIPSEL64 = SCMP_ARCH_MIPSEL64;
const uint32_t C_ARCH_MIPSEL64N32 = SCMP_ARCH_MIPSEL64N32;
const uint32_t C_ARCH_PPC = SCMP_ARCH_PPC;
const uint32_t C_ARCH_PPC64 = SCMP_ARCH_PPC64;
const uint32_t C_ARCH_PPC64LE = SCMP_ARCH_PPC64LE;
const uint32_t C_ARCH_S390 = SCMP_ARCH_S390;
const uint32_t C_ARCH_S390X = SCMP_ARCH_S390X;
const uint32_t C_ARCH_PARISC = SCMP_ARCH_PARISC;
const uint32_t C_ARCH_PARISC64 = SCMP_ARCH_PARISC64;
const uint32_t C_ARCH_RISCV64 = SCMP_ARCH_RISCV64;
#ifndef SCMP_ACT_LOG
#define SCMP_ACT_LOG 0x7ffc0000U
#endif
#ifndef SCMP_ACT_KILL_PROCESS
#define SCMP_ACT_KILL_PROCESS 0x80000000U
#endif
#ifndef SCMP_ACT_KILL_THREAD
#define SCMP_ACT_KILL_THREAD 0x00000000U
#endif
#ifndef SCMP_ACT_NOTIFY
#define SCMP_ACT_NOTIFY 0x7fc00000U
#endif
const uint32_t C_ACT_KILL = SCMP_ACT_KILL;
const uint32_t C_ACT_KILL_PROCESS = SCMP_ACT_KILL_PROCESS;
const uint32_t C_ACT_KILL_THREAD = SCMP_ACT_KILL_THREAD;
const uint32_t C_ACT_TRAP = SCMP_ACT_TRAP;
const uint32_t C_ACT_ERRNO = SCMP_ACT_ERRNO(0);
const uint32_t C_ACT_TRACE = SCMP_ACT_TRACE(0);
const uint32_t C_ACT_LOG = SCMP_ACT_LOG;
const uint32_t C_ACT_ALLOW = SCMP_ACT_ALLOW;
const uint32_t C_ACT_NOTIFY = SCMP_ACT_NOTIFY;
// The libseccomp SCMP_FLTATR_CTL_LOG member of the scmp_filter_attr enum was
// added in v2.4.0
#if SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR < 4
#define SCMP_FLTATR_CTL_LOG _SCMP_FLTATR_MIN
#endif
// The following SCMP_FLTATR_* were added in libseccomp v2.5.0.
#if SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR < 5
#define SCMP_FLTATR_CTL_SSB _SCMP_FLTATR_MIN
#define SCMP_FLTATR_CTL_OPTIMIZE _SCMP_FLTATR_MIN
#define SCMP_FLTATR_API_SYSRAWRC _SCMP_FLTATR_MIN
#endif
const uint32_t C_ATTRIBUTE_DEFAULT = (uint32_t)SCMP_FLTATR_ACT_DEFAULT;
const uint32_t C_ATTRIBUTE_BADARCH = (uint32_t)SCMP_FLTATR_ACT_BADARCH;
const uint32_t C_ATTRIBUTE_NNP = (uint32_t)SCMP_FLTATR_CTL_NNP;
const uint32_t C_ATTRIBUTE_TSYNC = (uint32_t)SCMP_FLTATR_CTL_TSYNC;
const uint32_t C_ATTRIBUTE_LOG = (uint32_t)SCMP_FLTATR_CTL_LOG;
const uint32_t C_ATTRIBUTE_SSB = (uint32_t)SCMP_FLTATR_CTL_SSB;
const uint32_t C_ATTRIBUTE_OPTIMIZE = (uint32_t)SCMP_FLTATR_CTL_OPTIMIZE;
const uint32_t C_ATTRIBUTE_SYSRAWRC = (uint32_t)SCMP_FLTATR_API_SYSRAWRC;
const int C_CMP_NE = (int)SCMP_CMP_NE;
const int C_CMP_LT = (int)SCMP_CMP_LT;
const int C_CMP_LE = (int)SCMP_CMP_LE;
const int C_CMP_EQ = (int)SCMP_CMP_EQ;
const int C_CMP_GE = (int)SCMP_CMP_GE;
const int C_CMP_GT = (int)SCMP_CMP_GT;
const int C_CMP_MASKED_EQ = (int)SCMP_CMP_MASKED_EQ;
const int C_VERSION_MAJOR = SCMP_VER_MAJOR;
const int C_VERSION_MINOR = SCMP_VER_MINOR;
const int C_VERSION_MICRO = SCMP_VER_MICRO;
#if SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR >= 3
unsigned int get_major_version()
{
return seccomp_version()->major;
}
unsigned int get_minor_version()
{
return seccomp_version()->minor;
}
unsigned int get_micro_version()
{
return seccomp_version()->micro;
}
#else
unsigned int get_major_version()
{
return (unsigned int)C_VERSION_MAJOR;
}
unsigned int get_minor_version()
{
return (unsigned int)C_VERSION_MINOR;
}
unsigned int get_micro_version()
{
return (unsigned int)C_VERSION_MICRO;
}
#endif
// The libseccomp API level functions were added in v2.4.0
#if SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR < 4
const unsigned int seccomp_api_get(void)
{
// libseccomp-golang requires libseccomp v2.2.0, at a minimum, which
// supported API level 2. However, the kernel may not support API level
// 2 constructs which are the seccomp() system call and the TSYNC
// filter flag. Return the "reserved" value of 0 here to indicate that
// proper API level support is not available in libseccomp.
return 0;
}
int seccomp_api_set(unsigned int level)
{
return -EOPNOTSUPP;
}
#endif
typedef struct scmp_arg_cmp* scmp_cast_t;
void* make_arg_cmp_array(unsigned int length)
{
return calloc(length, sizeof(struct scmp_arg_cmp));
}
// Wrapper to add an scmp_arg_cmp struct to an existing arg_cmp array
void add_struct_arg_cmp(
struct scmp_arg_cmp* arr,
unsigned int pos,
unsigned int arg,
int compare,
uint64_t a,
uint64_t b
)
{
arr[pos].arg = arg;
arr[pos].op = compare;
arr[pos].datum_a = a;
arr[pos].datum_b = b;
return;
}
// The seccomp notify API functions were added in v2.5.0
#if SCMP_VER_MAJOR == 2 && SCMP_VER_MINOR < 5
struct seccomp_data {
int nr;
__u32 arch;
__u64 instruction_pointer;
__u64 args[6];
};
struct seccomp_notif {
__u64 id;
__u32 pid;
__u32 flags;
struct seccomp_data data;
};
struct seccomp_notif_resp {
__u64 id;
__s64 val;
__s32 error;
__u32 flags;
};
int seccomp_notify_alloc(struct seccomp_notif **req, struct seccomp_notif_resp **resp) {
return -EOPNOTSUPP;
}
int seccomp_notify_fd(const scmp_filter_ctx ctx) {
return -EOPNOTSUPP;
}
void seccomp_notify_free(struct seccomp_notif *req, struct seccomp_notif_resp *resp) {
}
int seccomp_notify_id_valid(int fd, uint64_t id) {
return -EOPNOTSUPP;
}
int seccomp_notify_receive(int fd, struct seccomp_notif *req) {
return -EOPNOTSUPP;
}
int seccomp_notify_respond(int fd, struct seccomp_notif_resp *resp) {
return -EOPNOTSUPP;
}
#endif
*/
import "C"
// Nonexported types
type scmpFilterAttr uint32
// Nonexported constants
const (
filterAttrActDefault scmpFilterAttr = iota
filterAttrActBadArch
filterAttrNNP
filterAttrTsync
filterAttrLog
filterAttrSSB
filterAttrOptimize
filterAttrRawRC
)
const (
// An error return from certain libseccomp functions
scmpError C.int = -1
// Comparison boundaries to check for architecture validity
archStart ScmpArch = ArchNative
archEnd ScmpArch = ArchRISCV64
// Comparison boundaries to check for action validity
actionStart ScmpAction = ActKillThread
actionEnd ScmpAction = ActKillProcess
// Comparison boundaries to check for comparison operator validity
compareOpStart ScmpCompareOp = CompareNotEqual
compareOpEnd ScmpCompareOp = CompareMaskedEqual
)
var (
// errBadFilter is thrown on bad filter context.
errBadFilter = errors.New("filter is invalid or uninitialized")
errDefAction = errors.New("requested action matches default action of filter")
// Constants representing library major, minor, and micro versions
verMajor = uint(C.get_major_version())
verMinor = uint(C.get_minor_version())
verMicro = uint(C.get_micro_version())
)
// Nonexported functions
// checkVersion returns an error if the libseccomp version being used
// is less than the one specified by major, minor, and micro arguments.
// Argument op is an arbitrary non-empty operation description, which
// is used as a part of the error message returned.
//
// Most users should use checkAPI instead.
func checkVersion(op string, major, minor, micro uint) error {
if (verMajor > major) ||
(verMajor == major && verMinor > minor) ||
(verMajor == major && verMinor == minor && verMicro >= micro) {
return nil
}
return &VersionError{
op: op,
major: major,
minor: minor,
micro: micro,
}
}
func ensureSupportedVersion() error {
return checkVersion("seccomp", 2, 3, 1)
}
// Get the API level
func getAPI() (uint, error) {
api := C.seccomp_api_get()
if api == 0 {
return 0, errors.New("API level operations are not supported")
}
return uint(api), nil
}
// Set the API level
func setAPI(api uint) error {
if retCode := C.seccomp_api_set(C.uint(api)); retCode != 0 {
e := errRc(retCode)
if e == syscall.EOPNOTSUPP {
return errors.New("API level operations are not supported")
}
return fmt.Errorf("could not set API level: %w", e)
}
return nil
}
// Filter helpers
// Filter finalizer - ensure that kernel context for filters is freed
func filterFinalizer(f *ScmpFilter) {
f.Release()
}
func errRc(rc C.int) error {
return syscall.Errno(-1 * rc)
}
// Get a raw filter attribute
func (f *ScmpFilter) getFilterAttr(attr scmpFilterAttr) (C.uint32_t, error) {
f.lock.Lock()
defer f.lock.Unlock()
if !f.valid {
return 0x0, errBadFilter
}
var attribute C.uint32_t
retCode := C.seccomp_attr_get(f.filterCtx, attr.toNative(), &attribute)
if retCode != 0 {
return 0x0, errRc(retCode)
}
return attribute, nil
}
// Set a raw filter attribute
func (f *ScmpFilter) setFilterAttr(attr scmpFilterAttr, value C.uint32_t) error {
f.lock.Lock()
defer f.lock.Unlock()
if !f.valid {
return errBadFilter
}
retCode := C.seccomp_attr_set(f.filterCtx, attr.toNative(), value)
if retCode != 0 {
return errRc(retCode)
}
return nil
}
// DOES NOT LOCK OR CHECK VALIDITY
// Assumes caller has already done this
// Wrapper for seccomp_rule_add_... functions
func (f *ScmpFilter) addRuleWrapper(call ScmpSyscall, action ScmpAction, exact bool, length C.uint, cond C.scmp_cast_t) error {
if length != 0 && cond == nil {
return errors.New("null conditions list, but length is nonzero")
}
var retCode C.int
if exact {
retCode = C.seccomp_rule_add_exact_array(f.filterCtx, action.toNative(), C.int(call), length, cond)
} else {
retCode = C.seccomp_rule_add_array(f.filterCtx, action.toNative(), C.int(call), length, cond)
}
if retCode != 0 {
switch e := errRc(retCode); e {
case syscall.EFAULT:
return fmt.Errorf("unrecognized syscall %#x", int32(call))
// libseccomp >= v2.5.0 returns EACCES, older versions return EPERM.
// TODO: remove EPERM once libseccomp < v2.5.0 is not supported.
case syscall.EPERM, syscall.EACCES:
return errDefAction
case syscall.EINVAL:
return errors.New("two checks on same syscall argument")
default:
return e
}
}
return nil
}
// Generic add function for filter rules
func (f *ScmpFilter) addRuleGeneric(call ScmpSyscall, action ScmpAction, exact bool, conds []ScmpCondition) error {
f.lock.Lock()
defer f.lock.Unlock()
if !f.valid {
return errBadFilter
}
if len(conds) == 0 {
if err := f.addRuleWrapper(call, action, exact, 0, nil); err != nil {
return err
}
} else {
argsArr := C.make_arg_cmp_array(C.uint(len(conds)))
if argsArr == nil {
return errors.New("error allocating memory for conditions")
}
defer C.free(argsArr)
for i, cond := range conds {
C.add_struct_arg_cmp(C.scmp_cast_t(argsArr), C.uint(i),
C.uint(cond.Argument), cond.Op.toNative(),
C.uint64_t(cond.Operand1), C.uint64_t(cond.Operand2))
}
if err := f.addRuleWrapper(call, action, exact, C.uint(len(conds)), C.scmp_cast_t(argsArr)); err != nil {
return err
}
}
return nil
}
// Generic Helpers
// Helper - Sanitize Arch token input
func sanitizeArch(in ScmpArch) error {
if in < archStart || in > archEnd {
return fmt.Errorf("unrecognized architecture %#x", uint(in))
}
if in.toNative() == C.C_ARCH_BAD {
return fmt.Errorf("architecture %v is not supported on this version of the library", in)
}
return nil
}
func sanitizeAction(in ScmpAction) error {
inTmp := in & 0x0000FFFF
if inTmp < actionStart || inTmp > actionEnd {
return fmt.Errorf("unrecognized action %#x", uint(inTmp))
}
if inTmp != ActTrace && inTmp != ActErrno && (in&0xFFFF0000) != 0 {
return errors.New("highest 16 bits must be zeroed except for Trace and Errno")
}
return nil
}
func sanitizeCompareOp(in ScmpCompareOp) error {
if in < compareOpStart || in > compareOpEnd {
return fmt.Errorf("unrecognized comparison operator %#x", uint(in))
}
return nil
}
func archFromNative(a C.uint32_t) (ScmpArch, error) {
switch a {
case C.C_ARCH_X86:
return ArchX86, nil
case C.C_ARCH_X86_64:
return ArchAMD64, nil
case C.C_ARCH_X32:
return ArchX32, nil
case C.C_ARCH_ARM:
return ArchARM, nil
case C.C_ARCH_NATIVE:
return ArchNative, nil
case C.C_ARCH_AARCH64:
return ArchARM64, nil
case C.C_ARCH_MIPS:
return ArchMIPS, nil
case C.C_ARCH_MIPS64:
return ArchMIPS64, nil
case C.C_ARCH_MIPS64N32:
return ArchMIPS64N32, nil
case C.C_ARCH_MIPSEL:
return ArchMIPSEL, nil
case C.C_ARCH_MIPSEL64:
return ArchMIPSEL64, nil
case C.C_ARCH_MIPSEL64N32:
return ArchMIPSEL64N32, nil
case C.C_ARCH_PPC:
return ArchPPC, nil
case C.C_ARCH_PPC64:
return ArchPPC64, nil
case C.C_ARCH_PPC64LE:
return ArchPPC64LE, nil
case C.C_ARCH_S390:
return ArchS390, nil
case C.C_ARCH_S390X:
return ArchS390X, nil
case C.C_ARCH_PARISC:
return ArchPARISC, nil
case C.C_ARCH_PARISC64:
return ArchPARISC64, nil
case C.C_ARCH_RISCV64:
return ArchRISCV64, nil
default:
return 0x0, fmt.Errorf("unrecognized architecture %#x", uint32(a))
}
}
// Only use with sanitized arches, no error handling
func (a ScmpArch) toNative() C.uint32_t {
switch a {
case ArchX86:
return C.C_ARCH_X86
case ArchAMD64:
return C.C_ARCH_X86_64
case ArchX32:
return C.C_ARCH_X32
case ArchARM:
return C.C_ARCH_ARM
case ArchARM64:
return C.C_ARCH_AARCH64
case ArchMIPS:
return C.C_ARCH_MIPS
case ArchMIPS64:
return C.C_ARCH_MIPS64
case ArchMIPS64N32:
return C.C_ARCH_MIPS64N32
case ArchMIPSEL:
return C.C_ARCH_MIPSEL
case ArchMIPSEL64:
return C.C_ARCH_MIPSEL64
case ArchMIPSEL64N32:
return C.C_ARCH_MIPSEL64N32
case ArchPPC:
return C.C_ARCH_PPC
case ArchPPC64:
return C.C_ARCH_PPC64
case ArchPPC64LE:
return C.C_ARCH_PPC64LE
case ArchS390:
return C.C_ARCH_S390
case ArchS390X:
return C.C_ARCH_S390X
case ArchPARISC:
return C.C_ARCH_PARISC
case ArchPARISC64:
return C.C_ARCH_PARISC64
case ArchRISCV64:
return C.C_ARCH_RISCV64
case ArchNative:
return C.C_ARCH_NATIVE
default:
return 0x0
}
}
// Only use with sanitized ops, no error handling
func (a ScmpCompareOp) toNative() C.int {
switch a {
case CompareNotEqual:
return C.C_CMP_NE
case CompareLess:
return C.C_CMP_LT
case CompareLessOrEqual:
return C.C_CMP_LE
case CompareEqual:
return C.C_CMP_EQ
case CompareGreaterEqual:
return C.C_CMP_GE
case CompareGreater:
return C.C_CMP_GT
case CompareMaskedEqual:
return C.C_CMP_MASKED_EQ
default:
return 0x0
}
}
func actionFromNative(a C.uint32_t) (ScmpAction, error) {
aTmp := a & 0xFFFF
switch a & 0xFFFF0000 {
case C.C_ACT_KILL_PROCESS:
return ActKillProcess, nil
case C.C_ACT_KILL_THREAD:
return ActKillThread, nil
case C.C_ACT_TRAP:
return ActTrap, nil
case C.C_ACT_ERRNO:
return ActErrno.SetReturnCode(int16(aTmp)), nil
case C.C_ACT_TRACE:
return ActTrace.SetReturnCode(int16(aTmp)), nil
case C.C_ACT_LOG:
return ActLog, nil
case C.C_ACT_ALLOW:
return ActAllow, nil
case C.C_ACT_NOTIFY:
return ActNotify, nil
default:
return 0x0, fmt.Errorf("unrecognized action %#x", uint32(a))
}
}
// Only use with sanitized actions, no error handling
func (a ScmpAction) toNative() C.uint32_t {
switch a & 0xFFFF {
case ActKillProcess:
return C.C_ACT_KILL_PROCESS
case ActKillThread:
return C.C_ACT_KILL_THREAD
case ActTrap:
return C.C_ACT_TRAP
case ActErrno:
return C.C_ACT_ERRNO | (C.uint32_t(a) >> 16)
case ActTrace:
return C.C_ACT_TRACE | (C.uint32_t(a) >> 16)
case ActLog:
return C.C_ACT_LOG
case ActAllow:
return C.C_ACT_ALLOW
case ActNotify:
return C.C_ACT_NOTIFY
default:
return 0x0
}
}
// Internal only, assumes safe attribute
func (a scmpFilterAttr) toNative() uint32 {
switch a {
case filterAttrActDefault:
return uint32(C.C_ATTRIBUTE_DEFAULT)
case filterAttrActBadArch:
return uint32(C.C_ATTRIBUTE_BADARCH)
case filterAttrNNP:
return uint32(C.C_ATTRIBUTE_NNP)
case filterAttrTsync:
return uint32(C.C_ATTRIBUTE_TSYNC)
case filterAttrLog:
return uint32(C.C_ATTRIBUTE_LOG)
case filterAttrSSB:
return uint32(C.C_ATTRIBUTE_SSB)
case filterAttrOptimize:
return uint32(C.C_ATTRIBUTE_OPTIMIZE)
case filterAttrRawRC:
return uint32(C.C_ATTRIBUTE_SYSRAWRC)
default:
return 0x0
}
}
func syscallFromNative(a C.int) ScmpSyscall {
return ScmpSyscall(a)
}
func notifReqFromNative(req *C.struct_seccomp_notif) (*ScmpNotifReq, error) {
scmpArgs := make([]uint64, 6)
for i := 0; i < len(scmpArgs); i++ {
scmpArgs[i] = uint64(req.data.args[i])
}
arch, err := archFromNative(req.data.arch)
if err != nil {
return nil, err
}
scmpData := ScmpNotifData{
Syscall: syscallFromNative(req.data.nr),
Arch: arch,
InstrPointer: uint64(req.data.instruction_pointer),
Args: scmpArgs,
}
scmpReq := &ScmpNotifReq{
ID: uint64(req.id),
Pid: uint32(req.pid),
Flags: uint32(req.flags),
Data: scmpData,
}
return scmpReq, nil
}
func (scmpResp *ScmpNotifResp) toNative(resp *C.struct_seccomp_notif_resp) {
resp.id = C.__u64(scmpResp.ID)
resp.val = C.__s64(scmpResp.Val)
resp.error = (C.__s32(scmpResp.Error) * -1) // kernel requires a negated value
resp.flags = C.__u32(scmpResp.Flags)
}
// checkAPI checks that both the API level and the seccomp version are equal to
// or greater than the specified minLevel and major, minor, micro,
// respectively, and returns an error otherwise. Argument op is an arbitrary
// non-empty operation description, used as a part of the error message
// returned.
func checkAPI(op string, minLevel uint, major, minor, micro uint) error {
// Ignore error from getAPI, as it returns level == 0 in case of error.
level, _ := getAPI()
if level >= minLevel {
return checkVersion(op, major, minor, micro)
}
return &VersionError{
op: op,
curAPI: level,
minAPI: minLevel,
major: major,
minor: minor,
micro: micro,
}
}
// Userspace Notification API
// Calls to C.seccomp_notify* hidden from seccomp.go
func notifSupported() error {
return checkAPI("seccomp notification", 6, 2, 5, 0)
}
func (f *ScmpFilter) getNotifFd() (ScmpFd, error) {
f.lock.Lock()
defer f.lock.Unlock()
if !f.valid {
return -1, errBadFilter
}
if err := notifSupported(); err != nil {
return -1, err
}
fd := C.seccomp_notify_fd(f.filterCtx)
return ScmpFd(fd), nil
}
func notifReceive(fd ScmpFd) (*ScmpNotifReq, error) {
var req *C.struct_seccomp_notif
var resp *C.struct_seccomp_notif_resp
if err := notifSupported(); err != nil {
return nil, err
}
// we only use the request here; the response is unused
if retCode := C.seccomp_notify_alloc(&req, &resp); retCode != 0 {
return nil, errRc(retCode)
}
defer func() {
C.seccomp_notify_free(req, resp)
}()
for {
retCode, errno := C.seccomp_notify_receive(C.int(fd), req)
if retCode == 0 {
break
}
if errno == syscall.EINTR {
continue
}
if errno == syscall.ENOENT {
return nil, errno
}
return nil, errRc(retCode)
}
return notifReqFromNative(req)
}
func notifRespond(fd ScmpFd, scmpResp *ScmpNotifResp) error {
var req *C.struct_seccomp_notif
var resp *C.struct_seccomp_notif_resp
if err := notifSupported(); err != nil {
return err
}
// we only use the response here; the request is discarded
if retCode := C.seccomp_notify_alloc(&req, &resp); retCode != 0 {
return errRc(retCode)
}
defer func() {
C.seccomp_notify_free(req, resp)
}()
scmpResp.toNative(resp)
for {
retCode, errno := C.seccomp_notify_respond(C.int(fd), resp)
if retCode == 0 {
break
}
if errno == syscall.EINTR {
continue
}
if errno == syscall.ENOENT {
return errno
}
return errRc(retCode)
}
return nil
}
func notifIDValid(fd ScmpFd, id uint64) error {
if err := notifSupported(); err != nil {
return err
}
for {
retCode, errno := C.seccomp_notify_id_valid(C.int(fd), C.uint64_t(id))
if retCode == 0 {
break
}
if errno == syscall.EINTR {
continue
}
if errno == syscall.ENOENT {
return errno
}
return errRc(retCode)
}
return nil
}
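// superviseNotifications is an illustrative sketch, not part of the upstream
// file: it shows how the exported notification API built on the helpers above
// is typically driven by a supervisor. It assumes f has already been loaded
// with at least one ActNotify rule; error handling is minimal and every
// notified syscall is simply denied with EPERM.
func superviseNotifications(f *ScmpFilter) error {
	fd, err := f.GetNotifFd()
	if err != nil {
		return err
	}
	for {
		req, err := NotifReceive(fd)
		if err != nil {
			return err
		}
		resp := &ScmpNotifResp{
			ID:    req.ID,
			Error: int32(syscall.EPERM), // positive errno; toNative negates it
			Val:   0,
			Flags: 0,
		}
		if err := NotifRespond(fd, resp); err != nil {
			return err
		}
	}
}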

View File

@@ -1,24 +0,0 @@
Copyright 2013 Suryandaru Triandana <syndtr@gmail.com>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -1,133 +0,0 @@
// Copyright (c) 2013, Suryandaru Triandana <syndtr@gmail.com>
// All rights reserved.
//
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.
// Package capability provides utilities for manipulating POSIX capabilities.
package capability
type Capabilities interface {
// Get checks whether a capability is present in the given
// capabilities set. The 'which' value should be one of EFFECTIVE,
// PERMITTED, INHERITABLE, BOUNDING or AMBIENT.
Get(which CapType, what Cap) bool
// Empty checks whether all capability bits of the given capabilities
// set are zero. The 'which' value should be one of EFFECTIVE,
// PERMITTED, INHERITABLE, BOUNDING or AMBIENT.
Empty(which CapType) bool
// Full checks whether all capability bits of the given capabilities
// set are one. The 'which' value should be one of EFFECTIVE,
// PERMITTED, INHERITABLE, BOUNDING or AMBIENT.
Full(which CapType) bool
// Set sets capabilities of the given capabilities sets. The
// 'which' value should be one or combination (OR'ed) of EFFECTIVE,
// PERMITTED, INHERITABLE, BOUNDING or AMBIENT.
Set(which CapType, caps ...Cap)
// Unset unsets capabilities of the given capabilities sets. The
// 'which' value should be one or combination (OR'ed) of EFFECTIVE,
// PERMITTED, INHERITABLE, BOUNDING or AMBIENT.
Unset(which CapType, caps ...Cap)
// Fill sets all bits of the given capabilities kind to one. The
// 'kind' value should be one or combination (OR'ed) of CAPS,
// BOUNDS or AMBS.
Fill(kind CapType)
// Clear sets all bits of the given capabilities kind to zero. The
// 'kind' value should be one or combination (OR'ed) of CAPS,
// BOUNDS or AMBS.
Clear(kind CapType)
// StringCap returns the current capabilities state of the given capabilities
// set as a string. The 'which' value should be one of EFFECTIVE,
// PERMITTED, INHERITABLE, BOUNDING or AMBIENT.
StringCap(which CapType) string
// String returns the current capabilities state as a string.
String() string
// Load loads the actual capabilities value. This will overwrite all
// outstanding changes.
Load() error
// Apply applies the capabilities settings, so all changes will take
// effect.
Apply(kind CapType) error
}
// NewPid initializes a new Capabilities object for given pid when
// it is nonzero, or for the current process if pid is 0.
//
// Deprecated: Replace with NewPid2. For example, replace:
//
// c, err := NewPid(0)
// if err != nil {
// return err
// }
//
// with:
//
// c, err := NewPid2(0)
// if err != nil {
// return err
// }
// err = c.Load()
// if err != nil {
// return err
// }
func NewPid(pid int) (Capabilities, error) {
c, err := newPid(pid)
if err != nil {
return c, err
}
err = c.Load()
return c, err
}
// NewPid2 initializes a new Capabilities object for given pid when
// it is nonzero, or for the current process if pid is 0. This
// does not load the process's current capabilities; to do that you
// must call Load explicitly.
func NewPid2(pid int) (Capabilities, error) {
return newPid(pid)
}
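// currentHasNetAdmin is an illustrative sketch, not part of the upstream file:
// it shows the Load-then-query pattern NewPid2 expects, checking whether the
// current process holds CAP_NET_ADMIN in its effective set.
func currentHasNetAdmin() (bool, error) {
	c, err := NewPid2(0) // pid 0 means "the current process"
	if err != nil {
		return false, err
	}
	if err := c.Load(); err != nil {
		return false, err
	}
	return c.Get(EFFECTIVE, CAP_NET_ADMIN), nil
}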
// NewFile initializes a new Capabilities object for given file path.
//
// Deprecated: Replace with NewFile2. For example, replace:
//
// c, err := NewFile(path)
// if err != nil {
// return err
// }
//
// with:
//
// c, err := NewFile2(path)
// if err != nil {
// return err
// }
// err = c.Load()
// if err != nil {
// return err
// }
func NewFile(path string) (Capabilities, error) {
c, err := newFile(path)
if err != nil {
return c, err
}
err = c.Load()
return c, err
}
// NewFile2 creates a new initialized Capabilities object for given
// file path. This does not load the process's current capabilities;
// to do that you must call Load explicitly.
func NewFile2(path string) (Capabilities, error) {
return newFile(path)
}

View File

@@ -1,642 +0,0 @@
// Copyright (c) 2013, Suryandaru Triandana <syndtr@gmail.com>
// All rights reserved.
//
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.
package capability
import (
"bufio"
"errors"
"fmt"
"io"
"os"
"strings"
"syscall"
)
var errUnknownVers = errors.New("unknown capability version")
const (
linuxCapVer1 = 0x19980330
linuxCapVer2 = 0x20071026
linuxCapVer3 = 0x20080522
)
var (
capVers uint32
capLastCap Cap
)
func init() {
var hdr capHeader
capget(&hdr, nil)
capVers = hdr.version
if initLastCap() == nil {
CAP_LAST_CAP = capLastCap
if capLastCap > 31 {
capUpperMask = (uint32(1) << (uint(capLastCap) - 31)) - 1
} else {
capUpperMask = 0
}
}
}
func initLastCap() error {
if capLastCap != 0 {
return nil
}
f, err := os.Open("/proc/sys/kernel/cap_last_cap")
if err != nil {
return err
}
defer f.Close()
b := make([]byte, 11)
_, err = f.Read(b)
if err != nil {
return err
}
fmt.Sscanf(string(b), "%d", &capLastCap)
return nil
}
func mkStringCap(c Capabilities, which CapType) (ret string) {
for i, first := Cap(0), true; i <= CAP_LAST_CAP; i++ {
if !c.Get(which, i) {
continue
}
if first {
first = false
} else {
ret += ", "
}
ret += i.String()
}
return
}
func mkString(c Capabilities, max CapType) (ret string) {
ret = "{"
for i := CapType(1); i <= max; i <<= 1 {
ret += " " + i.String() + "=\""
if c.Empty(i) {
ret += "empty"
} else if c.Full(i) {
ret += "full"
} else {
ret += c.StringCap(i)
}
ret += "\""
}
ret += " }"
return
}
func newPid(pid int) (c Capabilities, err error) {
switch capVers {
case linuxCapVer1:
p := new(capsV1)
p.hdr.version = capVers
p.hdr.pid = int32(pid)
c = p
case linuxCapVer2, linuxCapVer3:
p := new(capsV3)
p.hdr.version = capVers
p.hdr.pid = int32(pid)
c = p
default:
err = errUnknownVers
return
}
return
}
type capsV1 struct {
hdr capHeader
data capData
}
func (c *capsV1) Get(which CapType, what Cap) bool {
if what > 32 {
return false
}
switch which {
case EFFECTIVE:
return (1<<uint(what))&c.data.effective != 0
case PERMITTED:
return (1<<uint(what))&c.data.permitted != 0
case INHERITABLE:
return (1<<uint(what))&c.data.inheritable != 0
}
return false
}
func (c *capsV1) getData(which CapType) (ret uint32) {
switch which {
case EFFECTIVE:
ret = c.data.effective
case PERMITTED:
ret = c.data.permitted
case INHERITABLE:
ret = c.data.inheritable
}
return
}
func (c *capsV1) Empty(which CapType) bool {
return c.getData(which) == 0
}
func (c *capsV1) Full(which CapType) bool {
return (c.getData(which) & 0x7fffffff) == 0x7fffffff
}
func (c *capsV1) Set(which CapType, caps ...Cap) {
for _, what := range caps {
if what > 32 {
continue
}
if which&EFFECTIVE != 0 {
c.data.effective |= 1 << uint(what)
}
if which&PERMITTED != 0 {
c.data.permitted |= 1 << uint(what)
}
if which&INHERITABLE != 0 {
c.data.inheritable |= 1 << uint(what)
}
}
}
func (c *capsV1) Unset(which CapType, caps ...Cap) {
for _, what := range caps {
if what > 32 {
continue
}
if which&EFFECTIVE != 0 {
c.data.effective &= ^(1 << uint(what))
}
if which&PERMITTED != 0 {
c.data.permitted &= ^(1 << uint(what))
}
if which&INHERITABLE != 0 {
c.data.inheritable &= ^(1 << uint(what))
}
}
}
func (c *capsV1) Fill(kind CapType) {
if kind&CAPS == CAPS {
c.data.effective = 0x7fffffff
c.data.permitted = 0x7fffffff
c.data.inheritable = 0
}
}
func (c *capsV1) Clear(kind CapType) {
if kind&CAPS == CAPS {
c.data.effective = 0
c.data.permitted = 0
c.data.inheritable = 0
}
}
func (c *capsV1) StringCap(which CapType) (ret string) {
return mkStringCap(c, which)
}
func (c *capsV1) String() (ret string) {
return mkString(c, BOUNDING)
}
func (c *capsV1) Load() (err error) {
return capget(&c.hdr, &c.data)
}
func (c *capsV1) Apply(kind CapType) error {
if kind&CAPS == CAPS {
return capset(&c.hdr, &c.data)
}
return nil
}
type capsV3 struct {
hdr capHeader
data [2]capData
bounds [2]uint32
ambient [2]uint32
}
func (c *capsV3) Get(which CapType, what Cap) bool {
var i uint
if what > 31 {
i = uint(what) >> 5
what %= 32
}
switch which {
case EFFECTIVE:
return (1<<uint(what))&c.data[i].effective != 0
case PERMITTED:
return (1<<uint(what))&c.data[i].permitted != 0
case INHERITABLE:
return (1<<uint(what))&c.data[i].inheritable != 0
case BOUNDING:
return (1<<uint(what))&c.bounds[i] != 0
case AMBIENT:
return (1<<uint(what))&c.ambient[i] != 0
}
return false
}
func (c *capsV3) getData(which CapType, dest []uint32) {
switch which {
case EFFECTIVE:
dest[0] = c.data[0].effective
dest[1] = c.data[1].effective
case PERMITTED:
dest[0] = c.data[0].permitted
dest[1] = c.data[1].permitted
case INHERITABLE:
dest[0] = c.data[0].inheritable
dest[1] = c.data[1].inheritable
case BOUNDING:
dest[0] = c.bounds[0]
dest[1] = c.bounds[1]
case AMBIENT:
dest[0] = c.ambient[0]
dest[1] = c.ambient[1]
}
}
func (c *capsV3) Empty(which CapType) bool {
var data [2]uint32
c.getData(which, data[:])
return data[0] == 0 && data[1] == 0
}
func (c *capsV3) Full(which CapType) bool {
var data [2]uint32
c.getData(which, data[:])
if (data[0] & 0xffffffff) != 0xffffffff {
return false
}
return (data[1] & capUpperMask) == capUpperMask
}
func (c *capsV3) Set(which CapType, caps ...Cap) {
for _, what := range caps {
var i uint
if what > 31 {
i = uint(what) >> 5
what %= 32
}
if which&EFFECTIVE != 0 {
c.data[i].effective |= 1 << uint(what)
}
if which&PERMITTED != 0 {
c.data[i].permitted |= 1 << uint(what)
}
if which&INHERITABLE != 0 {
c.data[i].inheritable |= 1 << uint(what)
}
if which&BOUNDING != 0 {
c.bounds[i] |= 1 << uint(what)
}
if which&AMBIENT != 0 {
c.ambient[i] |= 1 << uint(what)
}
}
}
func (c *capsV3) Unset(which CapType, caps ...Cap) {
for _, what := range caps {
var i uint
if what > 31 {
i = uint(what) >> 5
what %= 32
}
if which&EFFECTIVE != 0 {
c.data[i].effective &= ^(1 << uint(what))
}
if which&PERMITTED != 0 {
c.data[i].permitted &= ^(1 << uint(what))
}
if which&INHERITABLE != 0 {
c.data[i].inheritable &= ^(1 << uint(what))
}
if which&BOUNDING != 0 {
c.bounds[i] &= ^(1 << uint(what))
}
if which&AMBIENT != 0 {
c.ambient[i] &= ^(1 << uint(what))
}
}
}
func (c *capsV3) Fill(kind CapType) {
if kind&CAPS == CAPS {
c.data[0].effective = 0xffffffff
c.data[0].permitted = 0xffffffff
c.data[0].inheritable = 0
c.data[1].effective = 0xffffffff
c.data[1].permitted = 0xffffffff
c.data[1].inheritable = 0
}
if kind&BOUNDS == BOUNDS {
c.bounds[0] = 0xffffffff
c.bounds[1] = 0xffffffff
}
if kind&AMBS == AMBS {
c.ambient[0] = 0xffffffff
c.ambient[1] = 0xffffffff
}
}
func (c *capsV3) Clear(kind CapType) {
if kind&CAPS == CAPS {
c.data[0].effective = 0
c.data[0].permitted = 0
c.data[0].inheritable = 0
c.data[1].effective = 0
c.data[1].permitted = 0
c.data[1].inheritable = 0
}
if kind&BOUNDS == BOUNDS {
c.bounds[0] = 0
c.bounds[1] = 0
}
if kind&AMBS == AMBS {
c.ambient[0] = 0
c.ambient[1] = 0
}
}
func (c *capsV3) StringCap(which CapType) (ret string) {
return mkStringCap(c, which)
}
func (c *capsV3) String() (ret string) {
return mkString(c, BOUNDING)
}
func (c *capsV3) Load() (err error) {
err = capget(&c.hdr, &c.data[0])
if err != nil {
return
}
var statusPath string
if c.hdr.pid == 0 {
statusPath = "/proc/self/status"
} else {
statusPath = fmt.Sprintf("/proc/%d/status", c.hdr.pid)
}
f, err := os.Open(statusPath)
if err != nil {
return
}
b := bufio.NewReader(f)
for {
line, e := b.ReadString('\n')
if e != nil {
if e != io.EOF {
err = e
}
break
}
if strings.HasPrefix(line, "CapB") {
fmt.Sscanf(line[4:], "nd: %08x%08x", &c.bounds[1], &c.bounds[0])
continue
}
if strings.HasPrefix(line, "CapA") {
fmt.Sscanf(line[4:], "mb: %08x%08x", &c.ambient[1], &c.ambient[0])
continue
}
}
f.Close()
return
}
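
// parseStatusLineExample is an illustrative sketch, not part of the original
// vendored source, showing the /proc/[pid]/status parsing that Load performs
// above. A bounding-set line looks like "CapBnd:\t0000003fffffffff"; after
// stripping the first four bytes ("CapB"), Sscanf with "nd: %08x%08x" splits
// the 64-bit hex mask into the high word (bits 32..63) and the low word
// (bits 0..31).
func parseStatusLineExample() (lower, upper uint32) {
	line := "CapBnd:\t0000003fffffffff"
	fmt.Sscanf(line[4:], "nd: %08x%08x", &upper, &lower)
	// upper == 0x0000003f, lower == 0xffffffff
	return lower, upper
}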
func (c *capsV3) Apply(kind CapType) (err error) {
if kind&BOUNDS == BOUNDS {
var data [2]capData
err = capget(&c.hdr, &data[0])
if err != nil {
return
}
if (1<<uint(CAP_SETPCAP))&data[0].effective != 0 {
for i := Cap(0); i <= CAP_LAST_CAP; i++ {
if c.Get(BOUNDING, i) {
continue
}
err = prctl(syscall.PR_CAPBSET_DROP, uintptr(i), 0, 0, 0)
if err != nil {
// Ignore EINVAL since the capability may not be supported in this system.
if errno, ok := err.(syscall.Errno); ok && errno == syscall.EINVAL {
err = nil
continue
}
return
}
}
}
}
if kind&CAPS == CAPS {
err = capset(&c.hdr, &c.data[0])
if err != nil {
return
}
}
if kind&AMBS == AMBS {
for i := Cap(0); i <= CAP_LAST_CAP; i++ {
action := pr_CAP_AMBIENT_LOWER
if c.Get(AMBIENT, i) {
action = pr_CAP_AMBIENT_RAISE
}
err := prctl(pr_CAP_AMBIENT, action, uintptr(i), 0, 0)
// Ignore EINVAL as not supported on kernels before 4.3
if errno, ok := err.(syscall.Errno); ok && errno == syscall.EINVAL {
err = nil
continue
}
}
}
return
}
func newFile(path string) (c Capabilities, err error) {
c = &capsFile{path: path}
return
}
type capsFile struct {
path string
data vfscapData
}
func (c *capsFile) Get(which CapType, what Cap) bool {
var i uint
if what > 31 {
if c.data.version == 1 {
return false
}
i = uint(what) >> 5
what %= 32
}
switch which {
case EFFECTIVE:
return (1<<uint(what))&c.data.effective[i] != 0
case PERMITTED:
return (1<<uint(what))&c.data.data[i].permitted != 0
case INHERITABLE:
return (1<<uint(what))&c.data.data[i].inheritable != 0
}
return false
}
func (c *capsFile) getData(which CapType, dest []uint32) {
switch which {
case EFFECTIVE:
dest[0] = c.data.effective[0]
dest[1] = c.data.effective[1]
case PERMITTED:
dest[0] = c.data.data[0].permitted
dest[1] = c.data.data[1].permitted
case INHERITABLE:
dest[0] = c.data.data[0].inheritable
dest[1] = c.data.data[1].inheritable
}
}
func (c *capsFile) Empty(which CapType) bool {
var data [2]uint32
c.getData(which, data[:])
return data[0] == 0 && data[1] == 0
}
func (c *capsFile) Full(which CapType) bool {
var data [2]uint32
c.getData(which, data[:])
if c.data.version == 0 {
return (data[0] & 0x7fffffff) == 0x7fffffff
}
if (data[0] & 0xffffffff) != 0xffffffff {
return false
}
return (data[1] & capUpperMask) == capUpperMask
}
func (c *capsFile) Set(which CapType, caps ...Cap) {
for _, what := range caps {
var i uint
if what > 31 {
if c.data.version == 1 {
continue
}
i = uint(what) >> 5
what %= 32
}
if which&EFFECTIVE != 0 {
c.data.effective[i] |= 1 << uint(what)
}
if which&PERMITTED != 0 {
c.data.data[i].permitted |= 1 << uint(what)
}
if which&INHERITABLE != 0 {
c.data.data[i].inheritable |= 1 << uint(what)
}
}
}
func (c *capsFile) Unset(which CapType, caps ...Cap) {
for _, what := range caps {
var i uint
if what > 31 {
if c.data.version == 1 {
continue
}
i = uint(what) >> 5
what %= 32
}
if which&EFFECTIVE != 0 {
c.data.effective[i] &= ^(1 << uint(what))
}
if which&PERMITTED != 0 {
c.data.data[i].permitted &= ^(1 << uint(what))
}
if which&INHERITABLE != 0 {
c.data.data[i].inheritable &= ^(1 << uint(what))
}
}
}
func (c *capsFile) Fill(kind CapType) {
if kind&CAPS == CAPS {
c.data.effective[0] = 0xffffffff
c.data.data[0].permitted = 0xffffffff
c.data.data[0].inheritable = 0
if c.data.version == 2 {
c.data.effective[1] = 0xffffffff
c.data.data[1].permitted = 0xffffffff
c.data.data[1].inheritable = 0
}
}
}
func (c *capsFile) Clear(kind CapType) {
if kind&CAPS == CAPS {
c.data.effective[0] = 0
c.data.data[0].permitted = 0
c.data.data[0].inheritable = 0
if c.data.version == 2 {
c.data.effective[1] = 0
c.data.data[1].permitted = 0
c.data.data[1].inheritable = 0
}
}
}
func (c *capsFile) StringCap(which CapType) (ret string) {
return mkStringCap(c, which)
}
func (c *capsFile) String() (ret string) {
return mkString(c, INHERITABLE)
}
func (c *capsFile) Load() (err error) {
return getVfsCap(c.path, &c.data)
}
func (c *capsFile) Apply(kind CapType) (err error) {
if kind&CAPS == CAPS {
return setVfsCap(c.path, &c.data)
}
return
}

View File

@@ -1,19 +0,0 @@
// Copyright (c) 2013, Suryandaru Triandana <syndtr@gmail.com>
// All rights reserved.
//
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.
// +build !linux
package capability
import "errors"
func newPid(pid int) (Capabilities, error) {
return nil, errors.New("not supported")
}
func newFile(path string) (Capabilities, error) {
return nil, errors.New("not supported")
}

View File

@@ -1,309 +0,0 @@
// Copyright (c) 2013, Suryandaru Triandana <syndtr@gmail.com>
// All rights reserved.
//
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.
package capability
type CapType uint
func (c CapType) String() string {
switch c {
case EFFECTIVE:
return "effective"
case PERMITTED:
return "permitted"
case INHERITABLE:
return "inheritable"
case BOUNDING:
return "bounding"
case CAPS:
return "caps"
case AMBIENT:
return "ambient"
}
return "unknown"
}
const (
EFFECTIVE CapType = 1 << iota
PERMITTED
INHERITABLE
BOUNDING
AMBIENT
CAPS = EFFECTIVE | PERMITTED | INHERITABLE
BOUNDS = BOUNDING
AMBS = AMBIENT
)
//go:generate go run enumgen/gen.go
type Cap int
// POSIX-draft defined capabilities and Linux extensions.
//
// Defined in https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h
const (
// In a system with the [_POSIX_CHOWN_RESTRICTED] option defined, this
// overrides the restriction of changing file ownership and group
// ownership.
CAP_CHOWN = Cap(0)
// Override all DAC access, including ACL execute access if
// [_POSIX_ACL] is defined. Excluding DAC access covered by
// CAP_LINUX_IMMUTABLE.
CAP_DAC_OVERRIDE = Cap(1)
// Overrides all DAC restrictions regarding read and search on files
// and directories, including ACL restrictions if [_POSIX_ACL] is
// defined. Excluding DAC access covered by CAP_LINUX_IMMUTABLE.
CAP_DAC_READ_SEARCH = Cap(2)
// Overrides all restrictions about allowed operations on files, where
// file owner ID must be equal to the user ID, except where CAP_FSETID
// is applicable. It doesn't override MAC and DAC restrictions.
CAP_FOWNER = Cap(3)
// Overrides the following restrictions that the effective user ID
// shall match the file owner ID when setting the S_ISUID and S_ISGID
// bits on that file; that the effective group ID (or one of the
// supplementary group IDs) shall match the file owner ID when setting
// the S_ISGID bit on that file; that the S_ISUID and S_ISGID bits are
// cleared on successful return from chown(2) (not implemented).
CAP_FSETID = Cap(4)
// Overrides the restriction that the real or effective user ID of a
// process sending a signal must match the real or effective user ID
// of the process receiving the signal.
CAP_KILL = Cap(5)
// Allows setgid(2) manipulation
// Allows setgroups(2)
// Allows forged gids on socket credentials passing.
CAP_SETGID = Cap(6)
// Allows set*uid(2) manipulation (including fsuid).
// Allows forged pids on socket credentials passing.
CAP_SETUID = Cap(7)
// Linux-specific capabilities
// Without VFS support for capabilities:
// Transfer any capability in your permitted set to any pid,
// remove any capability in your permitted set from any pid
// With VFS support for capabilities (neither of above, but)
// Add any capability from current's capability bounding set
// to the current process' inheritable set
// Allow taking bits out of capability bounding set
// Allow modification of the securebits for a process
CAP_SETPCAP = Cap(8)
// Allow modification of S_IMMUTABLE and S_APPEND file attributes
CAP_LINUX_IMMUTABLE = Cap(9)
// Allows binding to TCP/UDP sockets below 1024
// Allows binding to ATM VCIs below 32
CAP_NET_BIND_SERVICE = Cap(10)
// Allow broadcasting, listen to multicast
CAP_NET_BROADCAST = Cap(11)
// Allow interface configuration
// Allow administration of IP firewall, masquerading and accounting
// Allow setting debug option on sockets
// Allow modification of routing tables
// Allow setting arbitrary process / process group ownership on
// sockets
// Allow binding to any address for transparent proxying (also via NET_RAW)
// Allow setting TOS (type of service)
// Allow setting promiscuous mode
// Allow clearing driver statistics
// Allow multicasting
// Allow read/write of device-specific registers
// Allow activation of ATM control sockets
CAP_NET_ADMIN = Cap(12)
// Allow use of RAW sockets
// Allow use of PACKET sockets
// Allow binding to any address for transparent proxying (also via NET_ADMIN)
CAP_NET_RAW = Cap(13)
// Allow locking of shared memory segments
// Allow mlock and mlockall (which doesn't really have anything to do
// with IPC)
CAP_IPC_LOCK = Cap(14)
// Override IPC ownership checks
CAP_IPC_OWNER = Cap(15)
// Insert and remove kernel modules - modify kernel without limit
CAP_SYS_MODULE = Cap(16)
// Allow ioperm/iopl access
// Allow sending USB messages to any device via /proc/bus/usb
CAP_SYS_RAWIO = Cap(17)
// Allow use of chroot()
CAP_SYS_CHROOT = Cap(18)
// Allow ptrace() of any process
CAP_SYS_PTRACE = Cap(19)
// Allow configuration of process accounting
CAP_SYS_PACCT = Cap(20)
// Allow configuration of the secure attention key
// Allow administration of the random device
// Allow examination and configuration of disk quotas
// Allow setting the domainname
// Allow setting the hostname
// Allow calling bdflush()
// Allow mount() and umount(), setting up new smb connection
// Allow some autofs root ioctls
// Allow nfsservctl
// Allow VM86_REQUEST_IRQ
// Allow to read/write pci config on alpha
// Allow irix_prctl on mips (setstacksize)
// Allow flushing all cache on m68k (sys_cacheflush)
// Allow removing semaphores
// Used instead of CAP_CHOWN to "chown" IPC message queues, semaphores
// and shared memory
// Allow locking/unlocking of shared memory segment
// Allow turning swap on/off
// Allow forged pids on socket credentials passing
// Allow setting readahead and flushing buffers on block devices
// Allow setting geometry in floppy driver
// Allow turning DMA on/off in xd driver
// Allow administration of md devices (mostly the above, but some
// extra ioctls)
// Allow tuning the ide driver
// Allow access to the nvram device
// Allow administration of apm_bios, serial and bttv (TV) device
// Allow manufacturer commands in isdn CAPI support driver
// Allow reading non-standardized portions of pci configuration space
// Allow DDI debug ioctl on sbpcd driver
// Allow setting up serial ports
// Allow sending raw qic-117 commands
// Allow enabling/disabling tagged queuing on SCSI controllers and sending
// arbitrary SCSI commands
// Allow setting encryption key on loopback filesystem
// Allow setting zone reclaim policy
// Allow everything under CAP_BPF and CAP_PERFMON for backward compatibility
CAP_SYS_ADMIN = Cap(21)
// Allow use of reboot()
CAP_SYS_BOOT = Cap(22)
// Allow raising priority and setting priority on other (different
// UID) processes
// Allow use of FIFO and round-robin (realtime) scheduling on own
// processes and setting the scheduling algorithm used by another
// process.
// Allow setting cpu affinity on other processes
CAP_SYS_NICE = Cap(23)
// Override resource limits. Set resource limits.
// Override quota limits.
// Override reserved space on ext2 filesystem
// Modify data journaling mode on ext3 filesystem (uses journaling
// resources)
// NOTE: ext2 honors fsuid when checking for resource overrides, so
// you can override using fsuid too
// Override size restrictions on IPC message queues
// Allow more than 64hz interrupts from the real-time clock
// Override max number of consoles on console allocation
// Override max number of keymaps
// Control memory reclaim behavior
CAP_SYS_RESOURCE = Cap(24)
// Allow manipulation of system clock
// Allow irix_stime on mips
// Allow setting the real-time clock
CAP_SYS_TIME = Cap(25)
// Allow configuration of tty devices
// Allow vhangup() of tty
CAP_SYS_TTY_CONFIG = Cap(26)
// Allow the privileged aspects of mknod()
CAP_MKNOD = Cap(27)
// Allow taking of leases on files
CAP_LEASE = Cap(28)
CAP_AUDIT_WRITE = Cap(29)
CAP_AUDIT_CONTROL = Cap(30)
CAP_SETFCAP = Cap(31)
// Override MAC access.
// The base kernel enforces no MAC policy.
// An LSM may enforce a MAC policy, and if it does and it chooses
// to implement capability based overrides of that policy, this is
// the capability it should use to do so.
CAP_MAC_OVERRIDE = Cap(32)
// Allow MAC configuration or state changes.
// The base kernel requires no MAC configuration.
// An LSM may enforce a MAC policy, and if it does and it chooses
// to implement capability based checks on modifications to that
// policy or the data required to maintain it, this is the
// capability it should use to do so.
CAP_MAC_ADMIN = Cap(33)
// Allow configuring the kernel's syslog (printk behaviour)
CAP_SYSLOG = Cap(34)
// Allow triggering something that will wake the system
CAP_WAKE_ALARM = Cap(35)
// Allow preventing system suspends
CAP_BLOCK_SUSPEND = Cap(36)
// Allow reading the audit log via multicast netlink socket
CAP_AUDIT_READ = Cap(37)
// Allow system performance and observability privileged operations
// using perf_events, i915_perf and other kernel subsystems
CAP_PERFMON = Cap(38)
// CAP_BPF allows the following BPF operations:
// - Creating all types of BPF maps
// - Advanced verifier features
// - Indirect variable access
// - Bounded loops
// - BPF to BPF function calls
// - Scalar precision tracking
// - Larger complexity limits
// - Dead code elimination
// - And potentially other features
// - Loading BPF Type Format (BTF) data
// - Retrieve xlated and JITed code of BPF programs
// - Use bpf_spin_lock() helper
//
// CAP_PERFMON relaxes the verifier checks further:
// - BPF progs can use of pointer-to-integer conversions
// - speculation attack hardening measures are bypassed
// - bpf_probe_read to read arbitrary kernel memory is allowed
// - bpf_trace_printk to print kernel memory is allowed
//
// CAP_SYS_ADMIN is required to use bpf_probe_write_user.
//
// CAP_SYS_ADMIN is required to iterate system wide loaded
// programs, maps, links, BTFs and convert their IDs to file descriptors.
//
// CAP_PERFMON and CAP_BPF are required to load tracing programs.
// CAP_NET_ADMIN and CAP_BPF are required to load networking programs.
CAP_BPF = Cap(39)
// Allow checkpoint/restore related operations.
// Introduced in kernel 5.9
CAP_CHECKPOINT_RESTORE = Cap(40)
)
var (
// Highest valid capability of the running kernel.
CAP_LAST_CAP = Cap(63)
capUpperMask = ^uint32(0)
)

View File

@@ -1,138 +0,0 @@
// generated file; DO NOT EDIT - use go generate in directory with source
package capability
func (c Cap) String() string {
switch c {
case CAP_CHOWN:
return "chown"
case CAP_DAC_OVERRIDE:
return "dac_override"
case CAP_DAC_READ_SEARCH:
return "dac_read_search"
case CAP_FOWNER:
return "fowner"
case CAP_FSETID:
return "fsetid"
case CAP_KILL:
return "kill"
case CAP_SETGID:
return "setgid"
case CAP_SETUID:
return "setuid"
case CAP_SETPCAP:
return "setpcap"
case CAP_LINUX_IMMUTABLE:
return "linux_immutable"
case CAP_NET_BIND_SERVICE:
return "net_bind_service"
case CAP_NET_BROADCAST:
return "net_broadcast"
case CAP_NET_ADMIN:
return "net_admin"
case CAP_NET_RAW:
return "net_raw"
case CAP_IPC_LOCK:
return "ipc_lock"
case CAP_IPC_OWNER:
return "ipc_owner"
case CAP_SYS_MODULE:
return "sys_module"
case CAP_SYS_RAWIO:
return "sys_rawio"
case CAP_SYS_CHROOT:
return "sys_chroot"
case CAP_SYS_PTRACE:
return "sys_ptrace"
case CAP_SYS_PACCT:
return "sys_pacct"
case CAP_SYS_ADMIN:
return "sys_admin"
case CAP_SYS_BOOT:
return "sys_boot"
case CAP_SYS_NICE:
return "sys_nice"
case CAP_SYS_RESOURCE:
return "sys_resource"
case CAP_SYS_TIME:
return "sys_time"
case CAP_SYS_TTY_CONFIG:
return "sys_tty_config"
case CAP_MKNOD:
return "mknod"
case CAP_LEASE:
return "lease"
case CAP_AUDIT_WRITE:
return "audit_write"
case CAP_AUDIT_CONTROL:
return "audit_control"
case CAP_SETFCAP:
return "setfcap"
case CAP_MAC_OVERRIDE:
return "mac_override"
case CAP_MAC_ADMIN:
return "mac_admin"
case CAP_SYSLOG:
return "syslog"
case CAP_WAKE_ALARM:
return "wake_alarm"
case CAP_BLOCK_SUSPEND:
return "block_suspend"
case CAP_AUDIT_READ:
return "audit_read"
case CAP_PERFMON:
return "perfmon"
case CAP_BPF:
return "bpf"
case CAP_CHECKPOINT_RESTORE:
return "checkpoint_restore"
}
return "unknown"
}
// List returns list of all supported capabilities
func List() []Cap {
return []Cap{
CAP_CHOWN,
CAP_DAC_OVERRIDE,
CAP_DAC_READ_SEARCH,
CAP_FOWNER,
CAP_FSETID,
CAP_KILL,
CAP_SETGID,
CAP_SETUID,
CAP_SETPCAP,
CAP_LINUX_IMMUTABLE,
CAP_NET_BIND_SERVICE,
CAP_NET_BROADCAST,
CAP_NET_ADMIN,
CAP_NET_RAW,
CAP_IPC_LOCK,
CAP_IPC_OWNER,
CAP_SYS_MODULE,
CAP_SYS_RAWIO,
CAP_SYS_CHROOT,
CAP_SYS_PTRACE,
CAP_SYS_PACCT,
CAP_SYS_ADMIN,
CAP_SYS_BOOT,
CAP_SYS_NICE,
CAP_SYS_RESOURCE,
CAP_SYS_TIME,
CAP_SYS_TTY_CONFIG,
CAP_MKNOD,
CAP_LEASE,
CAP_AUDIT_WRITE,
CAP_AUDIT_CONTROL,
CAP_SETFCAP,
CAP_MAC_OVERRIDE,
CAP_MAC_ADMIN,
CAP_SYSLOG,
CAP_WAKE_ALARM,
CAP_BLOCK_SUSPEND,
CAP_AUDIT_READ,
CAP_PERFMON,
CAP_BPF,
CAP_CHECKPOINT_RESTORE,
}
}
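
// exampleSupportedNames is an illustrative sketch, not part of the original
// generated file: it collects the textual name of every capability known to
// this package that the running kernel also supports (i.e. up to CAP_LAST_CAP,
// which the package adjusts at init time on Linux).
func exampleSupportedNames() []string {
	var names []string
	for _, c := range List() {
		if c > CAP_LAST_CAP { // skip capabilities newer than the running kernel
			continue
		}
		names = append(names, c.String())
	}
	return names
}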

View File

@@ -1,154 +0,0 @@
// Copyright (c) 2013, Suryandaru Triandana <syndtr@gmail.com>
// All rights reserved.
//
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.
package capability
import (
"syscall"
"unsafe"
)
type capHeader struct {
version uint32
pid int32
}
type capData struct {
effective uint32
permitted uint32
inheritable uint32
}
func capget(hdr *capHeader, data *capData) (err error) {
_, _, e1 := syscall.Syscall(syscall.SYS_CAPGET, uintptr(unsafe.Pointer(hdr)), uintptr(unsafe.Pointer(data)), 0)
if e1 != 0 {
err = e1
}
return
}
func capset(hdr *capHeader, data *capData) (err error) {
_, _, e1 := syscall.Syscall(syscall.SYS_CAPSET, uintptr(unsafe.Pointer(hdr)), uintptr(unsafe.Pointer(data)), 0)
if e1 != 0 {
err = e1
}
return
}
// not yet in syscall
const (
pr_CAP_AMBIENT = 47
pr_CAP_AMBIENT_IS_SET = uintptr(1)
pr_CAP_AMBIENT_RAISE = uintptr(2)
pr_CAP_AMBIENT_LOWER = uintptr(3)
pr_CAP_AMBIENT_CLEAR_ALL = uintptr(4)
)
func prctl(option int, arg2, arg3, arg4, arg5 uintptr) (err error) {
_, _, e1 := syscall.Syscall6(syscall.SYS_PRCTL, uintptr(option), arg2, arg3, arg4, arg5, 0)
if e1 != 0 {
err = e1
}
return
}
const (
vfsXattrName = "security.capability"
vfsCapVerMask = 0xff000000
vfsCapVer1 = 0x01000000
vfsCapVer2 = 0x02000000
vfsCapFlagMask = ^vfsCapVerMask
vfsCapFlageffective = 0x000001
vfscapDataSizeV1 = 4 * (1 + 2*1)
vfscapDataSizeV2 = 4 * (1 + 2*2)
)
type vfscapData struct {
magic uint32
data [2]struct {
permitted uint32
inheritable uint32
}
effective [2]uint32
version int8
}
var (
_vfsXattrName *byte
)
func init() {
_vfsXattrName, _ = syscall.BytePtrFromString(vfsXattrName)
}
func getVfsCap(path string, dest *vfscapData) (err error) {
var _p0 *byte
_p0, err = syscall.BytePtrFromString(path)
if err != nil {
return
}
r0, _, e1 := syscall.Syscall6(syscall.SYS_GETXATTR, uintptr(unsafe.Pointer(_p0)), uintptr(unsafe.Pointer(_vfsXattrName)), uintptr(unsafe.Pointer(dest)), vfscapDataSizeV2, 0, 0)
if e1 != 0 {
if e1 == syscall.ENODATA {
dest.version = 2
return
}
err = e1
}
switch dest.magic & vfsCapVerMask {
case vfsCapVer1:
dest.version = 1
if r0 != vfscapDataSizeV1 {
return syscall.EINVAL
}
dest.data[1].permitted = 0
dest.data[1].inheritable = 0
case vfsCapVer2:
dest.version = 2
if r0 != vfscapDataSizeV2 {
return syscall.EINVAL
}
default:
return syscall.EINVAL
}
if dest.magic&vfsCapFlageffective != 0 {
dest.effective[0] = dest.data[0].permitted | dest.data[0].inheritable
dest.effective[1] = dest.data[1].permitted | dest.data[1].inheritable
} else {
dest.effective[0] = 0
dest.effective[1] = 0
}
return
}
func setVfsCap(path string, data *vfscapData) (err error) {
var _p0 *byte
_p0, err = syscall.BytePtrFromString(path)
if err != nil {
return
}
var size uintptr
if data.version == 1 {
data.magic = vfsCapVer1
size = vfscapDataSizeV1
} else if data.version == 2 {
data.magic = vfsCapVer2
if data.effective[0] != 0 || data.effective[1] != 0 {
data.magic |= vfsCapFlageffective
}
size = vfscapDataSizeV2
} else {
return syscall.EINVAL
}
_, _, e1 := syscall.Syscall6(syscall.SYS_SETXATTR, uintptr(unsafe.Pointer(_p0)), uintptr(unsafe.Pointer(_vfsXattrName)), uintptr(unsafe.Pointer(data)), size, 0, 0)
if e1 != 0 {
err = e1
}
return
}
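
// exampleHasEffectiveFlag is an illustrative sketch, not part of the original
// vendored source: it reads a path's security.capability xattr through
// getVfsCap and reports whether any capability bit would be effective on
// exec (getVfsCap fills dest.effective only when the xattr's effective flag
// is set). The path argument is hypothetical.
func exampleHasEffectiveFlag(path string) (bool, error) {
	var d vfscapData
	if err := getVfsCap(path, &d); err != nil {
		return false, err
	}
	return d.effective[0] != 0 || d.effective[1] != 0, nil
}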

41
vendor/golang.org/x/net/bpf/asm.go generated vendored
View File

@@ -1,41 +0,0 @@
// Copyright 2016 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package bpf
import "fmt"
// Assemble converts insts into raw instructions suitable for loading
// into a BPF virtual machine.
//
// Currently, no optimization is attempted; the assembled program flow
// is exactly as provided.
func Assemble(insts []Instruction) ([]RawInstruction, error) {
ret := make([]RawInstruction, len(insts))
var err error
for i, inst := range insts {
ret[i], err = inst.Assemble()
if err != nil {
return nil, fmt.Errorf("assembling instruction %d: %s", i+1, err)
}
}
return ret, nil
}
// Disassemble attempts to parse raw back into
// Instructions. Unrecognized RawInstructions are assumed to be an
// extension not implemented by this package, and are passed through
// unchanged to the output. The allDecoded value reports whether insts
// contains no RawInstructions.
func Disassemble(raw []RawInstruction) (insts []Instruction, allDecoded bool) {
insts = make([]Instruction, len(raw))
allDecoded = true
for i, r := range raw {
insts[i] = r.Disassemble()
if _, ok := insts[i].(RawInstruction); ok {
allDecoded = false
}
}
return insts, allDecoded
}
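
// exampleRoundTrip is an illustrative sketch, not part of the original
// package. It assembles a tiny classic-BPF filter that accepts IPv4
// Ethernet frames (EtherType 0x0800 at offset 12) and drops everything
// else, then disassembles the raw form again. It assumes the LoadAbsolute,
// JumpIf and RetConstant instruction types defined elsewhere in this
// package.
func exampleRoundTrip() ([]Instruction, bool, error) {
	prog := []Instruction{
		LoadAbsolute{Off: 12, Size: 2},                     // A = EtherType (16-bit load)
		JumpIf{Cond: JumpEqual, Val: 0x0800, SkipFalse: 1}, // not IPv4? skip the accept
		RetConstant{Val: 0xffff},                           // accept up to 64 KiB of the packet
		RetConstant{Val: 0},                                // drop
	}
	raw, err := Assemble(prog)
	if err != nil {
		return nil, false, err
	}
	insts, allDecoded := Disassemble(raw)
	return insts, allDecoded, nil
}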

View File

@@ -1,222 +0,0 @@
// Copyright 2016 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
package bpf
// A Register is a register of the BPF virtual machine.
type Register uint16
const (
// RegA is the accumulator register. RegA is always the
// destination register of ALU operations.
RegA Register = iota
// RegX is the indirection register, used by LoadIndirect
// operations.
RegX
)
// An ALUOp is an arithmetic or logic operation.
type ALUOp uint16
// ALU binary operation types.
const (
ALUOpAdd ALUOp = iota << 4
ALUOpSub
ALUOpMul
ALUOpDiv
ALUOpOr
ALUOpAnd
ALUOpShiftLeft
ALUOpShiftRight
aluOpNeg // Not exported because it's the only unary ALU operation, and gets its own instruction type.
ALUOpMod
ALUOpXor
)
// A JumpTest is a comparison operator used in conditional jumps.
type JumpTest uint16
// Supported operators for conditional jumps.
// K can be RegX for JumpIfX
const (
// K == A
JumpEqual JumpTest = iota
// K != A
JumpNotEqual
// K > A
JumpGreaterThan
// K < A
JumpLessThan
// K >= A
JumpGreaterOrEqual
// K <= A
JumpLessOrEqual
// K & A != 0
JumpBitsSet
// K & A == 0
JumpBitsNotSet
)
// An Extension is a function call provided by the kernel that
// performs advanced operations that are expensive or impossible
// within the BPF virtual machine.
//
// Extensions are only implemented by the Linux kernel.
//
// TODO: should we prune this list? Some of these extensions seem
// either broken or near-impossible to use correctly, whereas other
// (len, random, ifindex) are quite useful.
type Extension int
// Extension functions available in the Linux kernel.
const (
// extOffset is the negative maximum number of instructions used
// to load instructions by overloading the K argument.
extOffset = -0x1000
// ExtLen returns the length of the packet.
ExtLen Extension = 1
// ExtProto returns the packet's L3 protocol type.
ExtProto Extension = 0
// ExtType returns the packet's type (skb->pkt_type in the kernel)
//
// TODO: better documentation. How nice an API do we want to
// provide for these esoteric extensions?
ExtType Extension = 4
// ExtPayloadOffset returns the offset of the packet payload, or
// the first protocol header that the kernel does not know how to
// parse.
ExtPayloadOffset Extension = 52
// ExtInterfaceIndex returns the index of the interface on which
// the packet was received.
ExtInterfaceIndex Extension = 8
// ExtNetlinkAttr returns the netlink attribute of type X at
// offset A.
ExtNetlinkAttr Extension = 12
// ExtNetlinkAttrNested returns the nested netlink attribute of
// type X at offset A.
ExtNetlinkAttrNested Extension = 16
// ExtMark returns the packet's mark value.
ExtMark Extension = 20
// ExtQueue returns the packet's assigned hardware queue.
ExtQueue Extension = 24
// ExtLinkLayerType returns the packet's hardware address type
// (e.g. Ethernet, Infiniband).
ExtLinkLayerType Extension = 28
// ExtRXHash returns the packet's receive hash.
//
// TODO: figure out what this rxhash actually is.
ExtRXHash Extension = 32
// ExtCPUID returns the ID of the CPU processing the current
// packet.
ExtCPUID Extension = 36
// ExtVLANTag returns the packet's VLAN tag.
ExtVLANTag Extension = 44
// ExtVLANTagPresent returns non-zero if the packet has a VLAN
// tag.
//
// TODO: I think this might be a lie: it reads bit 0x1000 of the
// VLAN header, which changed meaning in recent revisions of the
// spec - this extension may now return meaningless information.
ExtVLANTagPresent Extension = 48
// ExtVLANProto returns 0x8100 if the frame has a VLAN header,
// 0x88a8 if the frame has a "Q-in-Q" double VLAN header, or some
// other value if no VLAN information is present.
ExtVLANProto Extension = 60
// ExtRand returns a uniformly random uint32.
ExtRand Extension = 56
)
// The following gives names to various bit patterns used in opcode construction.
const (
opMaskCls uint16 = 0x7
// opClsLoad masks
opMaskLoadDest = 0x01
opMaskLoadWidth = 0x18
opMaskLoadMode = 0xe0
// opClsALU & opClsJump
opMaskOperand = 0x08
opMaskOperator = 0xf0
)
const (
// +---------------+-----------------+---+---+---+
// | AddrMode (3b) | LoadWidth (2b) | 0 | 0 | 0 |
// +---------------+-----------------+---+---+---+
opClsLoadA uint16 = iota
// +---------------+-----------------+---+---+---+
// | AddrMode (3b) | LoadWidth (2b) | 0 | 0 | 1 |
// +---------------+-----------------+---+---+---+
opClsLoadX
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
// +---+---+---+---+---+---+---+---+
opClsStoreA
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
// +---+---+---+---+---+---+---+---+
opClsStoreX
// +---------------+-----------------+---+---+---+
// | Operator (4b) | OperandSrc (1b) | 1 | 0 | 0 |
// +---------------+-----------------+---+---+---+
opClsALU
// +-----------------------------+---+---+---+---+
// | TestOperator (4b) | 0 | 1 | 0 | 1 |
// +-----------------------------+---+---+---+---+
opClsJump
// +---+-------------------------+---+---+---+---+
// | 0 | 0 | 0 | RetSrc (1b) | 0 | 1 | 1 | 0 |
// +---+-------------------------+---+---+---+---+
opClsReturn
// +---+-------------------------+---+---+---+---+
// | 0 | 0 | 0 | TXAorTAX (1b) | 0 | 1 | 1 | 1 |
// +---+-------------------------+---+---+---+---+
opClsMisc
)
const (
opAddrModeImmediate uint16 = iota << 5
opAddrModeAbsolute
opAddrModeIndirect
opAddrModeScratch
opAddrModePacketLen // actually an extension, not an addressing mode.
opAddrModeMemShift
)
const (
opLoadWidth4 uint16 = iota << 3
opLoadWidth2
opLoadWidth1
)
// Operand for ALU and Jump instructions
type opOperand uint16
// Supported operand sources.
const (
opOperandConstant opOperand = iota << 3
opOperandX
)
// An jumpOp is a conditional jump condition.
type jumpOp uint16
// Supported jump conditions.
const (
opJumpAlways jumpOp = iota << 4
opJumpEqual
opJumpGT
opJumpGE
opJumpSet
)
const (
opRetSrcConstant uint16 = iota << 4
opRetSrcA
)
const (
opMiscTAX = 0x00
opMiscTXA = 0x80
)
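
// exampleLoadWordAbsoluteOpcode is an illustrative constant, not part of the
// original file. It shows how the pieces above compose into a raw opcode:
// the classic BPF "load word absolute" instruction is the A-register load
// class OR'ed with the absolute addressing mode and the 4-byte load width,
// which works out to 0x20, the kernel's BPF_LD|BPF_W|BPF_ABS encoding.
const exampleLoadWordAbsoluteOpcode = opClsLoadA | opAddrModeAbsolute | opLoadWidth4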

Some files were not shown because too many files have changed in this diff.