120 Commits

Author SHA1 Message Date
Daman Arora
d21ca8674c kube-proxy: add NodeTopologyConfig for tracking topology labels
This simplifies how the proxier receives updates for changes in node
labels. Instead of passing the complete Node object, we pass only
the proxy-relevant topology labels extracted from the complete list
of labels, and the downstream event handlers are notified only
when the topology labels change.

Signed-off-by: Daman Arora <aroradaman@gmail.com>
2025-07-21 17:00:44 -04:00
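A minimal sketch of the idea described in the commit message above, not the actual kube-proxy code: keep only the topology labels from a node's full label map, and notify a handler only when that subset changes. The names `topologyTracker` and `extractTopologyLabels` are hypothetical. The two label keys are well-known Kubernetes labels, but which labels kube-proxy really tracks is an assumption here.

```go
package main

import (
	"fmt"
	"maps"
)

// topologyKeys lists the label keys treated as topology-relevant.
// Assumption: the real set used by kube-proxy may differ.
var topologyKeys = []string{
	"topology.kubernetes.io/zone",
	"topology.kubernetes.io/region",
}

// extractTopologyLabels returns only the topology-relevant subset of labels.
func extractTopologyLabels(labels map[string]string) map[string]string {
	out := make(map[string]string)
	for _, k := range topologyKeys {
		if v, ok := labels[k]; ok {
			out[k] = v
		}
	}
	return out
}

// topologyTracker (hypothetical) remembers the last topology labels it saw
// and calls onChange only when the new ones differ.
type topologyTracker struct {
	last     map[string]string
	onChange func(map[string]string)
}

// update receives the node's full label set on every node event and
// reports whether the handler was notified. The first call always notifies.
func (t *topologyTracker) update(nodeLabels map[string]string) bool {
	cur := extractTopologyLabels(nodeLabels)
	if t.last != nil && maps.Equal(t.last, cur) {
		return false
	}
	t.last = cur
	t.onChange(cur)
	return true
}

func main() {
	t := &topologyTracker{onChange: func(l map[string]string) {
		fmt.Println("topology labels changed:", l)
	}}
	t.update(map[string]string{"topology.kubernetes.io/zone": "a", "team": "x"})
	// Only an unrelated label changed, so the handler is not called again.
	t.update(map[string]string{"topology.kubernetes.io/zone": "a", "team": "y"})
}
```

Filtering before comparing means that churn in unrelated labels never reaches the downstream handlers.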
Daman Arora
bc5088cbf3 Revert "Kube proxy node manager" 2025-07-15 19:34:05 +05:30
Daman Arora
af7abde0e5 kube-proxy: add NodeTopologyConfig for tracking topology labels
This simplifies how the proxier receives updates for changes in node
labels. Instead of passing the complete Node object, we pass only
the proxy-relevant topology labels extracted from the complete list
of labels, and the downstream event handlers are notified only
when the topology labels change.

Signed-off-by: Daman Arora <aroradaman@gmail.com>
2025-07-11 21:05:19 +05:30
Kubernetes Prow Robot
9538d53353 Merge pull request #132456 from aroradaman/nftables-etp-fix
nftables short-circuit local traffic to external addresses
2025-07-09 17:53:27 -07:00
Kubernetes Prow Robot
c3b06a5366 Merge pull request #131615 from danwinship/proxy-bfr
update BoundedFrequencyRunner for kube-proxy
2025-07-01 09:21:24 -07:00
Dan Winship
eae17c21b0 Change how BoundedFrequencyRunner retries work
Rather than having a RetryAfter function, do a retry (at a fixed
interval) if the work function returns an error.

Co-authored-by: Antonio Ojea <aojea@google.com>
2025-07-01 08:54:14 -04:00
Dan Winship
c16ee887ef Remove burst syncs from BoundedFrequencyRunner
Burst syncs are theoretically useful for dealing with a single change
that results in multiple Run() calls (eg, a Service and EndpointSlice
both changing), but 2 isn't enough to cover all cases, and a better
way of dealing with this problem is to just use a smaller
minSyncPeriod.

Co-authored-by: Antonio Ojea <aojea@google.com>
2025-07-01 08:54:14 -04:00
Antonio Ojea
6da9d363f3 Copy BoundedFrequencyRunner to kube-proxy 2025-07-01 08:53:54 -04:00
Daman Arora
7e3945808d nftables: remove filter-output-post-dnat chain
With filter-output chain already operating with priority
post DNAT, we can merge both the chains together.

Signed-off-by: Daman Arora <aroradaman@gmail.com>
2025-06-23 18:12:13 +05:30
Daman Arora
91f2256b34 update filter chains and priority
With this commit, the filter-input, filter-forward, and filter-output base chains
are hooked with priority 0. For filtering before DNAT, use filter-prerouting-pre-dnat
and filter-output-pre-dnat, which have a priority (-110) lower than that of DNAT.

Signed-off-by: Daman Arora <aroradaman@gmail.com>
2025-06-23 18:12:13 +05:30
Kubernetes Prow Robot
ef66667c8e Merge pull request #131243 from danwinship/kube-proxy-cleanup
Improve `kube-proxy --cleanup` / cleanup on kube-proxy mode switch
2025-05-06 09:29:13 -07:00
Kubernetes Prow Robot
0b8133816b Merge pull request #131477 from pohly/golangci-lint@v2
golangci-lint v2
2025-05-02 23:03:55 -07:00
Matthieu MOREL
4adb58565c chore: bump golangci-lint to v2
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2025-05-02 12:51:02 +02:00
Daman Arora
c7a870135a nftables: cleanup service chain checks
A packet can reach the service-xxxx chains by matching either the
service-ips or the service-nodeports verdict map. We masquerade off-cluster
traffic to ClusterIP (when masqueradeAll = false) by adding a rule in
service-xxxx that checks that the destination IP is the ClusterIP, that the
port and protocol match the service spec, and that the source IP doesn't
belong to PodCIDR, and masquerades on match.

If the packet reaches the service chain by match on service-ips map,
then ClusterIP, port and protocol are already matching service specs.
If it comes via external-xxxx chain then the destination IP will
never be ClusterIP. Therefore, we can simplify the masquerade
off-cluster traffic to ClusterIP check by simply matching on
destination ip and source ip.

Signed-off-by: Daman Arora <aroradaman@gmail.com>
2025-04-27 01:05:45 +05:30
Dan Winship
f9c1876b45 Make proxy CleanupLeftovers methods quieter
To make switching to/from nftables easier, kube-proxy runs iptables
and ipvs cleanup when starting in nftables mode, and runs nftables
cleanup when starting in iptables or ipvs mode. But there's no
guarantee that the node actually supports the mode we're trying to
clean up, so don't log errors if it doesn't.
2025-04-10 14:58:37 -04:00
Dan Winship
88f8e6697d Implement PreferSameNode traffic distribution in kube-proxy 2025-03-19 08:46:17 -04:00
Dan Winship
c85083589c Clarify hostname vs node name in kube-proxy
Various parts of kube-proxy passed around a "hostname", but it is
actually the name of the *node* kube-proxy is running on, which is not
100% guaranteed to be exactly the same as the hostname. Rename it
everywhere to make it clearer that (a) it is definitely safe to use
that name to refer to the Node, (b) it is not necessarily safe to use
that name with DNS, etc.
2025-03-19 08:46:15 -04:00
Dan Winship
303593cafe Fix some pkg/proxy comments
Remove a bunch of comments that are either inaccurate ("the proxier
can only be tested by e2e tests") or weirdly overspecific about
obvious details ("the proxier will not exit if an iptables call
fails").
2025-03-07 10:43:55 -05:00
Dan Winship
36f5820ad1 Remove some unused proxy args/fields
Remove the utilexec.Interface args from the iptables/ipvs constructors
(which have been unused since the conntrack cleanup code was ported to
netlink).

Remove the EventRecorder fields from the iptables/ipvs Proxiers, which
have been unused since we removed the port-opener code in 2022.

Remove the strictARP field from the ipvs Proxier, which has apparently
always been unused (strictARP is only read at construction time).
2025-03-07 10:43:45 -05:00
Dan Winship
13f0449e4c Fix up kube-proxy import ordering/organization. 2025-03-07 10:43:43 -05:00
Kubernetes Prow Robot
80026570aa Merge pull request #130119 from npinaeva/nft-restart
[kube-proxy: nftables] Optimize kube-proxy restart time
2025-03-04 10:17:44 -08:00
Nadia Pinaeva
cc0faf086d [kube-proxy:nftables] Skip EP chain updates on startup.
Endpoint chain contents are fairly predictable from their name and
existing affinity sets. Skip endpoint chain updates, when we can be sure
that rules in that chain are still correct.

Add unit test to verify first transaction is optimized.
Change baseRules ordering to make it accepted by nft.ParseDump.

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
2025-02-27 10:07:22 +01:00
Ryota Sakamoto
f484ae5bcb Fix kernel version check condition in nftables proxier
Signed-off-by: Ryota Sakamoto <skmt@amazon.com>
2025-02-24 18:45:16 +00:00
Nadia Pinaeva
7d5f3c5723 [kube-proxy:nftables] Read map/set elements on setup.
We used to flush and re-add all map/set elements on nftables
setup, but it is faster to read the existing elements and only
transact the diff.

Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
2025-02-18 11:28:41 +01:00
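The "read existing elements and only transact the diff" approach from the commit message above can be illustrated roughly as follows. This is a sketch on plain string slices, not the nftables proxier's code. `diffElements` and the sample element values are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// diffElements compares the elements already present with the desired ones
// and returns what must be added and what must be deleted, each sorted.
// Elements present in both are left untouched.
func diffElements(existing, desired []string) (toAdd, toDelete []string) {
	have := make(map[string]bool, len(existing))
	for _, e := range existing {
		have[e] = true
	}
	want := make(map[string]bool, len(desired))
	for _, d := range desired {
		want[d] = true
	}
	for d := range want {
		if !have[d] {
			toAdd = append(toAdd, d)
		}
	}
	for e := range have {
		if !want[e] {
			toDelete = append(toDelete, e)
		}
	}
	sort.Strings(toAdd)
	sort.Strings(toDelete)
	return toAdd, toDelete
}

func main() {
	existing := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	desired := []string{"10.0.0.2", "10.0.0.3", "10.0.0.4"}
	toAdd, toDelete := diffElements(existing, desired)
	fmt.Println("add:", toAdd, "delete:", toDelete)
}
```

When most elements survive a restart, the resulting transaction contains only the few adds and deletes rather than a flush followed by a full re-add.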
Kubernetes Prow Robot
d7774fce9a Merge pull request #129653 from danwinship/nftables-ga
KEP-3866 nftables kube-proxy to GA
2025-02-13 08:42:20 -08:00
Kubernetes Prow Robot
3a4c2a0bbb Merge pull request #129271 from aroradaman/dual_stack_healthz
Dual stack healthz server
2025-01-20 07:32:42 -08:00
Dan Winship
cba6300414 Document nftables kube-proxy's "public API" 2025-01-15 15:53:51 -05:00
olderTaoist
561c1d235a full sync per one hour with BFR 2025-01-14 09:24:38 +08:00
Daman Arora
d6c575532a pkg/proxy/healthcheck: rename 'proxier' to 'proxy'
KubeProxy operates with a single health server and two proxies,
one for each IP family. The use of the term 'proxier' in the
types and functions within pkg/proxy/healthcheck can be
misleading, as it may suggest the existence of two health
servers, one for each IP family.

Signed-off-by: Daman Arora <aroradaman@gmail.com>
2025-01-08 17:26:47 +05:30
Dan Winship
f5969adb14 Clean up NewServiceChangeTracker/NewEndpointsChangeTracker args
Remove the now-unused event recorders, and put the remaining args into
a sensible order that is consistent between the two.
2024-12-14 12:12:42 -05:00
Nadia Pinaeva
90e64a57c6 kube-proxy,nftables: add debug logging for failed transaction.
Use a rate limiter to avoid large output with a high rate.

Signed-off-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
2024-12-13 15:53:19 +01:00
Antonio Ojea
f93e6f3d3a kube-proxy implement dual stack metrics
Signed-off-by: Daman Arora <aroradaman@gmail.com>
Co-authored-by: Antonio Ojea <aojea@google.com>
2024-12-12 16:13:30 +05:30
Daman Arora
6657d220d3 proxy: cleanup UpdateServiceMapResult
Signed-off-by: Daman Arora <aroradaman@gmail.com>
2024-10-28 20:10:46 +05:30
Daman Arora
c398af07fa proxy: refactor UpdateEndpointsMapResult
Signed-off-by: Daman Arora <aroradaman@gmail.com>
2024-10-28 20:10:34 +05:30
Daman Arora
1ad8880c0f proxy/conntrack: reconciler
Signed-off-by: Daman Arora <aroradaman@gmail.com>
2024-10-28 20:08:53 +05:30
Paco Xu
0e10a3a28c Revert "re: kube-proxy: internal config: refactor HealthzAddress and MetricsAddress " 2024-10-21 11:36:59 +08:00
Kubernetes Prow Robot
4d32d7e5ad Merge pull request #127930 from aroradaman/kube-proxy-refactor-healthz-metrics-address
re: kube-proxy: internal config: refactor HealthzAddress and MetricsAddress
2024-10-17 16:03:11 +01:00
Daman Arora
48f1356b2f pkg/proxy: refactor NodePortAddresses to NodeAddressHandler
Signed-off-by: Daman Arora <aroradaman@gmail.com>
2024-10-14 21:49:29 +05:30
Aohan Yang
da5738d9aa Set feature gate emulation version during test 2024-10-10 19:26:31 +08:00
Matthieu MOREL
f736cca0e5 fix: enable expected-actual rule from testifylint in module k8s.io/kubernetes
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-09-27 07:56:31 +02:00
Daman Arora
c34b20fa63 proxy/conntrack: use proxier ip family for conntrack cleanup
Signed-off-by: Daman Arora <aroradaman@gmail.com>
2024-09-04 22:56:03 +05:30
Daman Arora
b0f823e6cc remove the conntrack binary dependency
kube-proxy needs to delete stale conntrack entries for UDP services to
avoid blackholing traffic. Instead of using the conntrack binary it
can use netlink calls directly, reducing the container image size and
the security surface.

Signed-off-by: Daman Arora <aroradaman@gmail.com>
Co-authored-by: Antonio Ojea <aojea@google.com>
2024-09-04 21:48:34 +05:30
Nadia Pinaeva
2ec3929134 [kube-proxy:nftables] Add partial sync unit test.
Signed-off-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
2024-07-23 17:32:30 +02:00
Nadia Pinaeva
3ccf5b8a55 [kube-proxy:nftables] Add partialSync mode to only transact changed
objects.
Change the order of operations to stop current iteration if no changes
to the service chains are needed.
Bump syncProxy frequency to 1 hour.
In a test kind cluster, creating 10K services with 2 endpoints each
takes ~25min before the fix and ~9min after. Maximum memory usage
during creation is ~650MiB and 260MiB respectively.
Another important metric is the time it takes to create 1 new service
when 10K services already exist. It used to take ~8min before the fix;
with partialSync it takes ~141ms.

Signed-off-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
2024-07-23 17:32:30 +02:00
Nadia Pinaeva
dc13e42f56 [kube-proxy:nftables] cleanup: remove unused parameter and fix typo.
Signed-off-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
2024-07-23 17:32:29 +02:00
Dan Winship
30bc1b59d7 Add unit tests to validate "bad IP/CIDR" handling in kube-proxy
Also, fix the handling of bad EndpointSlice IPs!
2024-07-18 10:55:13 -04:00
Dan Winship
f762e5c8de Remove an unnecessary comment in nftables output
(It's redundant with the chain name.)
2024-07-18 10:54:30 -04:00
Dan Winship
11f55eae96 Reduce some duplication in nftables unit tests 2024-07-18 10:53:36 -04:00
Dan Winship
b39fd03ee4 Allow disabling nftables kernel version check 2024-07-08 07:29:27 -04:00
Dan Winship
505f6833d9 Require kernel 5.13 for nftables kube-proxy 2024-07-01 10:07:27 -04:00