A packet can traverse the service-xxxx chains by matching on either the
service-ips or the service-nodeports verdict map. When masqueradeAll =
false, we masquerade off-cluster traffic to a ClusterIP by adding a rule
in service-xxxx that checks that the destination IP is the ClusterIP,
that the port and protocol match the service spec, and that the source
IP does not belong to the PodCIDR, and masquerades on match.
If the packet reaches the service chain via a match on the service-ips
map, then the ClusterIP, port, and protocol already match the service
spec. If it comes via an external-xxxx chain, the destination IP will
never be the ClusterIP. Therefore, we can simplify the "masquerade
off-cluster traffic to ClusterIP" check to matching only on the
destination IP and the source IP.
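For illustration only, here is a minimal Go sketch of the simplified
check; the function and variable names are invented for the example,
and the real proxier expresses this as an nftables rule rather than Go
code:

    package main

    import (
        "fmt"
        "net/netip"
    )

    // needsMasquerade models the simplified rule: once a packet has reached the
    // service chain via the service-ips map, its destination IP/port/protocol are
    // already known to match the service, so only the two IP checks remain.
    func needsMasquerade(src, dst, clusterIP netip.Addr, podCIDR netip.Prefix) bool {
        // dst == clusterIP rules out packets that arrived via an external-xxxx
        // chain (those never have the ClusterIP as destination), and
        // !podCIDR.Contains(src) means the traffic is off-cluster.
        return dst == clusterIP && !podCIDR.Contains(src)
    }

    func main() {
        podCIDR := netip.MustParsePrefix("10.244.0.0/16")
        clusterIP := netip.MustParseAddr("10.96.0.10")
        fmt.Println(needsMasquerade(netip.MustParseAddr("192.168.1.5"), clusterIP, clusterIP, podCIDR)) // off-cluster source: true
        fmt.Println(needsMasquerade(netip.MustParseAddr("10.244.1.7"), clusterIP, clusterIP, podCIDR))  // pod source: false
    }
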
Signed-off-by: Daman Arora <aroradaman@gmail.com>
To make switching to/from nftables easier, kube-proxy runs iptables
and ipvs cleanup when starting in nftables mode, and runs nftables
cleanup when starting in iptables or ipvs mode. But there's no
guarantee that the node actually supports the mode we're trying to
clean up, so don't log errors if it doesn't.
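Roughly, the intended behavior looks like the following sketch, under
the assumption that the cleanup helpers can report an "unsupported"
condition (the helper names and error value are invented for the
example):

    package main

    import (
        "errors"
        "fmt"
    )

    // errUnsupported stands in for "this node cannot run that backend at all".
    var errUnsupported = errors.New("backend not supported on this node")

    func cleanupIPTablesAndIPVS() error { return errUnsupported } // stub for the example
    func cleanupNFTables() error        { return nil }            // stub for the example

    // cleanupOtherBackends removes leftovers from the backends we are NOT using,
    // but stays quiet when the node simply has no support for them.
    func cleanupOtherBackends(mode string) {
        var err error
        if mode == "nftables" {
            err = cleanupIPTablesAndIPVS()
        } else { // iptables or ipvs
            err = cleanupNFTables()
        }
        if err != nil && !errors.Is(err, errUnsupported) {
            fmt.Println("cleanup failed:", err) // only real failures are worth logging
        }
    }

    func main() { cleanupOtherBackends("nftables") }
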
iptables and ipvs were both leaving the KUBE-MARK-MASQ chain behind
(even though the corresponding KUBE-POSTROUTING rule to actually do the
masquerade got deleted).
iptables was failing to clean up its KUBE-PROXY-FIREWALL chain (the
cleanup rules never got updated when that was split out of
KUBE-FIREWALL), and also not cleaning up its canary chain.
This also fixes it so that ipvs.CleanupLeftovers only deletes
ipvs/ipset stuff once, rather than first deleting all of it on behalf
of the IPv4 Proxier and then no-op "deleting" it all again on behalf
of the IPv6 Proxier.
Various parts of kube-proxy passed around a "hostname", but it is
actually the name of the *node* kube-proxy is running on, which is not
100% guaranteed to be exactly the same as the hostname. Rename it
everywhere to make it clearer that (a) it is definitely safe to use
that name to refer to the Node, (b) it is not necessarily safe to use
that name with DNS, etc.
The ServiceTrafficDistribution feature gate has been GA and enabled by
default since 1.33. Since it is also locked to its default, we can
remove the feature-gate usages in kube-proxy.
NOTE that as per
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/feature-gates.md#disablement-tests:
_"Disablement tests are only required to be preserved for components and
libraries that support compatibility version. Tests for node and kubelet are
unaffected by compatibility version."_
Basically all callers want dual-stack-if-possible, so simplify that.
Also, tweak the startup-time checking in kubelet to treat "no iptables
support" as interesting but not an error.
Remove a bunch of comments that are either inaccurate ("the proxier
can only be tested by e2e tests") or weirdly overspecific about
obvious details ("the proxier will not exit if an iptables call
fails").
Historically it took an exec argument so you could pass a FakeExec to
mock its behavior in unit tests, but it has a fake implementation now
that is much more useful for unit tests than trying to use the real
implementation with a fake exec. (The unit tests still use fake execs,
but they don't need to use a public constructor.) So remove the exec
args from the public constructors.
Remove the utilexec.Interface args from the iptables/ipvs constructors
(which have been unused since the conntrack cleanup code was ported to
netlink).
Remove the EventRecorder fields from the iptables/ipvs Proxiers, which
have been unused since we removed the port-opener code in 2022.
Remove the strictARP field from the ipvs Proxier, which has apparently
always been unused (strictARP is only consulted at construction time).
The conntrack reconciler maintains consistency between the conntrack
table on each node and the desired state of Kubernetes UDP services.
A valid entry matches a service's ClusterIP, LoadBalancerIP, or
ExternalIP plus the service port, or any IP with a matching NodePort,
and has a reverse source IP matching an active endpoint of that
service. Other entries are deleted. Services without endpoints and
traffic not handled by kube-proxy are ignored.
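A rough Go sketch of that classification (illustrative only; the types
and names below are not the real reconciler's):

    package main

    import (
        "fmt"
        "net/netip"
    )

    // serviceFrontends collects what kube-proxy programs for one UDP service.
    type serviceFrontends struct {
        serviceIPs  []netip.Addr // ClusterIP, LoadBalancerIPs, ExternalIPs
        servicePort uint16
        nodePort    uint16       // 0 if the service has no NodePort
        endpointIPs []netip.Addr // active endpoints
    }

    // classify decides what happens to one conntrack entry, given its original
    // destination (origDst:origPort) and the source IP of the reply direction.
    func classify(fe serviceFrontends, origDst netip.Addr, origPort uint16, replySrc netip.Addr) string {
        matchesFrontend := false
        for _, ip := range fe.serviceIPs {
            if origDst == ip && origPort == fe.servicePort {
                matchesFrontend = true
            }
        }
        // NodePort traffic matches on port alone: the destination may be any node IP.
        if fe.nodePort != 0 && origPort == fe.nodePort {
            matchesFrontend = true
        }
        if !matchesFrontend {
            return "ignore" // traffic not handled by kube-proxy
        }
        for _, ep := range fe.endpointIPs {
            if replySrc == ep {
                return "keep" // reverse source is an active endpoint of the service
            }
        }
        return "delete" // stale entry for a removed endpoint
    }

    func main() {
        fe := serviceFrontends{
            serviceIPs:  []netip.Addr{netip.MustParseAddr("10.96.0.10")},
            servicePort: 53,
            endpointIPs: []netip.Addr{netip.MustParseAddr("10.244.1.7")},
        }
        fmt.Println(classify(fe, netip.MustParseAddr("10.96.0.10"), 53, netip.MustParseAddr("10.244.1.7")))   // keep
        fmt.Println(classify(fe, netip.MustParseAddr("10.96.0.10"), 53, netip.MustParseAddr("10.244.9.9")))   // delete
        fmt.Println(classify(fe, netip.MustParseAddr("10.96.0.10"), 8080, netip.MustParseAddr("10.244.9.9"))) // ignore
    }
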
Co-authored-by: Daman Arora <aroradaman@gmail.com>
I fixed up the TestValidateEndpointsCreate path to show the matcher
instead of manual origin checking.
I picked TestValidateTopologySpreadConstraints because it was the last
failing test on my screen when I changed one of the commonly hard-coded
error strings. I fixed exactly those validation errors that were needed
to make this test pass. Some of the Origin values can be debated.
The `field/testing.Matcher` interface allows tests to configure the
criteria by which they want to match expected and actual errors. The
hope is that everyone will use Origin for Invalid errors.
There's some collateral impact for tests which use exact comparisons and
don't expect origins. These are all candidates for using the matcher.
Endpoint chain contents are fairly predictable from their name and
existing affinity sets. Skip endpoint chain updates when we can be sure
that the rules in that chain are still correct.
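A minimal sketch of the skip condition, under the assumption that an
endpoint chain's rules are fully determined by its name and its
affinity set (types and names are illustrative):

    package main

    import "fmt"

    // endpointChainState is what we already know about a programmed endpoint chain.
    type endpointChainState struct {
        exists      bool
        affinitySet string // affinity set the rules reference; "" when affinity is off
    }

    // needsUpdate returns false when the existing chain's rules are guaranteed to
    // still be correct, because they are derived only from the chain name and the
    // affinity set, and neither has changed.
    func needsUpdate(existing endpointChainState, wantAffinitySet string) bool {
        if !existing.exists {
            return true // chain has to be created and filled
        }
        return existing.affinitySet != wantAffinitySet
    }

    func main() {
        fmt.Println(needsUpdate(endpointChainState{exists: true, affinitySet: ""}, ""))           // false: skip the update
        fmt.Println(needsUpdate(endpointChainState{exists: true, affinitySet: ""}, "affinity-x")) // true: rewrite the chain
    }
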
Add a unit test to verify that the first transaction is optimized.
Change the baseRules ordering so that it is accepted by nft.ParseDump.
Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
We used to flush and re-add all map/set elements on nftables
setup, but it is faster to read the existing elements and only
transact the diff.
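Conceptually the diff looks like the sketch below (illustrative only,
not the actual knftables transaction code):

    package main

    import "fmt"

    // diffElements compares the elements already programmed in a set/map with the
    // desired ones and returns only what has to change in the transaction.
    func diffElements(existing, desired map[string]bool) (toAdd, toDelete []string) {
        for e := range desired {
            if !existing[e] {
                toAdd = append(toAdd, e)
            }
        }
        for e := range existing {
            if !desired[e] {
                toDelete = append(toDelete, e)
            }
        }
        return toAdd, toDelete
    }

    func main() {
        existing := map[string]bool{"10.96.0.10 . tcp . 53": true, "10.96.0.11 . tcp . 80": true}
        desired := map[string]bool{"10.96.0.10 . tcp . 53": true, "10.96.0.12 . tcp . 443": true}
        add, del := diffElements(existing, desired)
        fmt.Println("add:", add, "delete:", del) // only the diff is written, nothing is flushed
    }
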
Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
kubeproxy_conntrack_reconciler_deleted_entries_total can be used to
track the total number of conntrack entries deleted during conntrack
reconciliation.
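For illustration, such a counter could be declared with
k8s.io/component-base/metrics roughly as below; this is a sketch, and
the actual registration in kube-proxy's metrics package may differ in
detail:

    package main

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    // deletedEntries expands to kubeproxy_conntrack_reconciler_deleted_entries_total.
    var deletedEntries = metrics.NewCounter(&metrics.CounterOpts{
        Subsystem:      "kubeproxy",
        Name:           "conntrack_reconciler_deleted_entries_total",
        Help:           "Cumulative number of conntrack entries deleted by the conntrack reconciler.",
        StabilityLevel: metrics.ALPHA,
    })

    func main() {
        legacyregistry.MustRegister(deletedEntries)
        deletedEntries.Add(3) // e.g. three stale UDP entries removed in one reconcile pass
    }
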
Signed-off-by: Daman Arora <aroradaman@gmail.com>