This simplifies how the proxier receives updates for changes in node
labels. Instead of passing the complete Node object, we pass only the
proxy-relevant topology labels extracted from the full label set, and
the downstream event handlers are only notified when the topology
labels actually change.
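For illustration, the handler shape this implies might look roughly
like the following Go sketch; the interface name, method name, and
label keys are assumptions, not necessarily the actual kube-proxy API:

    // NodeTopologyHandler is a hypothetical handler that receives only the
    // proxy-relevant topology labels instead of the full *v1.Node object.
    type NodeTopologyHandler interface {
        // OnTopologyChange is invoked only when the topology labels change.
        OnTopologyChange(topologyLabels map[string]string)
    }

    // extractTopologyLabels keeps just the labels the proxy cares about
    // (the key list here is illustrative).
    func extractTopologyLabels(nodeLabels map[string]string) map[string]string {
        keep := []string{"topology.kubernetes.io/zone", "kubernetes.io/hostname"}
        out := map[string]string{}
        for _, k := range keep {
            if v, ok := nodeLabels[k]; ok {
                out[k] = v
            }
        }
        return out
    }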
Signed-off-by: Daman Arora <aroradaman@gmail.com>
Rather than having a RetryAfter function, retry (at a fixed interval)
whenever the work function returns an error.
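A rough sketch of that behavior, with hypothetical names and using only
the standard time package (the real runner's API differs):

    import "time"

    // runWorker is a hypothetical sketch: call work() and, on error, retry
    // at a fixed interval rather than asking the work function when to
    // retry (as a RetryAfter hook would).
    func runWorker(work func() error, retryInterval time.Duration, stop <-chan struct{}) {
        for {
            if err := work(); err == nil {
                return
            }
            select {
            case <-time.After(retryInterval):
                // fixed-interval retry on error
            case <-stop:
                return
            }
        }
    }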
Co-authored-by: Antonio Ojea <aojea@google.com>
Burst syncs are theoretically useful for dealing with a single change
that results in multiple Run() calls (e.g., a Service and an
EndpointSlice both changing), but a burst of 2 isn't enough to cover
all cases, and a better way of dealing with this problem is simply to
use a smaller minSyncPeriod.
Co-authored-by: Antonio Ojea <aojea@google.com>
With the filter-output chain already operating at a priority after
DNAT, we can merge the two chains.
Signed-off-by: Daman Arora <aroradaman@gmail.com>
With this commit the filter-input, filter-forward, and filter-output
base chains are hooked at priority 0. For filtering before DNAT, the
filter-prerouting-pre-dnat and filter-output-pre-dnat chains should be
used instead; they are hooked at priority -110, before DNAT.
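For illustration, the resulting hook priorities look roughly like this
in nft syntax (sketched as a Go string constant; the table name and
policies are assumptions, only the chain names and priorities come from
the text above):

    // baseChains is illustrative, not kube-proxy's actual generated
    // ruleset. DNAT hooks at priority -100 (dstnat), so the *-pre-dnat
    // chains at -110 are evaluated before it.
    const baseChains = `
    table ip kube-proxy {
        chain filter-input { type filter hook input priority 0; }
        chain filter-forward { type filter hook forward priority 0; }
        chain filter-output { type filter hook output priority 0; }
        chain filter-prerouting-pre-dnat { type filter hook prerouting priority -110; }
        chain filter-output-pre-dnat { type filter hook output priority -110; }
    }
    `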
Signed-off-by: Daman Arora <aroradaman@gmail.com>
A packet can traverse the service-xxxx chains by matching either the
service-ips or the service-nodeports verdict map. We masquerade
off-cluster traffic to a ClusterIP (when masqueradeAll = false) by
adding a rule in service-xxxx that matches when the destination IP is
the ClusterIP, the port and protocol match the service spec, and the
source IP is not in the PodCIDR.
If the packet reaches the service chain via the service-ips map, then
the ClusterIP, port, and protocol already match the service spec. If it
comes via the external-xxxx chain, the destination IP will never be the
ClusterIP. Therefore, the masquerade-off-cluster-traffic-to-ClusterIP
check can be simplified to matching only on destination IP and source
IP.
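For illustration, the simplification amounts to something like the
following (Go string literals containing nft syntax; the addresses,
port, and pod CIDR are made up):

    // Illustrative only; real chain names, IPs, and the pod CIDR differ.
    const (
        // Before: re-check ClusterIP, protocol, and port as well as source IP.
        masqRuleBefore = "ip daddr 172.30.0.41 tcp dport 80 ip saddr != 10.244.0.0/16 jump mark-for-masquerade"
        // After: destination and source IP suffice, because the service-ips
        // map lookup already matched ClusterIP, protocol, and port.
        masqRuleAfter = "ip daddr 172.30.0.41 ip saddr != 10.244.0.0/16 jump mark-for-masquerade"
    )

The external-xxxx path is unaffected, since traffic arriving through it
never has the ClusterIP as its destination.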
Signed-off-by: Daman Arora <aroradaman@gmail.com>
To make switching to/from nftables easier, kube-proxy runs iptables
and ipvs cleanup when starting in nftables mode, and runs nftables
cleanup when starting in iptables or ipvs mode. But there's no
guarantee that the node actually supports the mode we're trying to
clean up, so don't log errors if it doesn't.
Various parts of kube-proxy passed around a "hostname", but it is
actually the name of the *node* kube-proxy is running on, which is not
100% guaranteed to be exactly the same as the hostname. Rename it
everywhere to make it clearer that (a) it is definitely safe to use
that name to refer to the Node, (b) it is not necessarily safe to use
that name with DNS, etc.
Remove a bunch of comments that are either inaccurate ("the proxier
can only be tested by e2e tests") or weirdly overspecific about
obvious details ("the proxier will not exit if an iptables call
fails").
Remove the utilexec.Interface args from the iptables/ipvs constructors
(which have been unused since the conntrack cleanup code was ported to
netlink).
Remove the EventRecorder fields from the iptables/ipvs Proxiers, which
have been unused since we removed the port-opener code in 2022.
Remove the strictARP field from the ipvs Proxier, which has apparently
always been unused (strictARP is only consulted at construction time).
Endpoint chain contents are fairly predictable from their name and
existing affinity sets. Skip endpoint chain updates when we can be sure
that the rules in that chain are still correct.
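A minimal sketch of the check this enables, using k8s.io/apimachinery
sets (the helper name and signature are hypothetical):

    import "k8s.io/apimachinery/pkg/util/sets"

    // endpointChainNeedsUpdate is a hypothetical helper: an endpoint chain's
    // rules are a pure function of its name (which encodes the endpoint) and
    // the service's session affinity, so an existing chain only needs to be
    // rewritten when the affinity configuration changes.
    func endpointChainNeedsUpdate(existing sets.Set[string], chain string, affinityChanged bool) bool {
        return !existing.Has(chain) || affinityChanged
    }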
Add unit test to verify first transaction is optimized.
Change baseRules ordering to make it accepted by nft.ParseDump.
Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
We used to flush and re-add all map/set elements on nftables
setup, but it is faster to read the existing elements and only
transact the diff.
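A minimal sketch of the diff, assuming element keys can be compared as
strings (again using k8s.io/apimachinery sets; names are illustrative):

    import "k8s.io/apimachinery/pkg/util/sets"

    // elementDiff returns the elements to add and to delete so that only
    // the difference is transacted, instead of flushing and re-adding
    // every map/set element on startup.
    func elementDiff(existing, desired sets.Set[string]) (toAdd, toDelete []string) {
        return desired.Difference(existing).UnsortedList(),
            existing.Difference(desired).UnsortedList()
    }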
Signed-off-by: Nadia Pinaeva <npinaeva@redhat.com>
KubeProxy operates with a single health server and two proxies,
one for each IP family. The use of the term 'proxier' in the
types and functions within pkg/proxy/healthcheck can be
misleading, as it may suggest the existence of two health
servers, one for each IP family.
Signed-off-by: Daman Arora <aroradaman@gmail.com>
kube-proxy needs to delete stale conntrack entries for UDP services to
avoid blackholing traffic. Instead of using the conntrack binary, it
can make netlink calls directly, reducing the container image size and
the security surface.
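A hedged sketch of the direct netlink approach, assuming the
github.com/vishvananda/netlink conntrack API; the filter type and
helper are illustrative, not kube-proxy's actual code:

    import (
        "github.com/vishvananda/netlink"
        "golang.org/x/sys/unix"
    )

    // udpServiceFilter is a hypothetical CustomConntrackFilter matching UDP
    // flows destined to a given service IP.
    type udpServiceFilter struct{ serviceIP string }

    func (f *udpServiceFilter) MatchConntrackFlow(flow *netlink.ConntrackFlow) bool {
        return flow.Forward.Protocol == unix.IPPROTO_UDP &&
            flow.Forward.DstIP.String() == f.serviceIP
    }

    // deleteStaleUDPEntries removes matching conntrack entries over netlink
    // instead of exec'ing the conntrack binary.
    func deleteStaleUDPEntries(serviceIP string) (uint, error) {
        return netlink.ConntrackDeleteFilter(netlink.ConntrackTable,
            unix.AF_INET, &udpServiceFilter{serviceIP: serviceIP})
    }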
Signed-off-by: Daman Arora <aroradaman@gmail.com>
Co-authored-by: Antonio Ojea <aojea@google.com>
objects.
Change the order of operations to stop the current iteration early when
no changes to the service chains are needed.
Bump the periodic syncProxy interval to 1 hour.
In a test kind cluster, creating 10K services with 2 endpoints each
takes ~25min before the fix and ~9min after. Maximum memory usage
during creation is ~650MiB and ~260MiB respectively.
Another important metric is the time it takes to create 1 new service
when 10K services already exist: it used to take ~8min before the fix;
with partialSync it takes ~141ms.
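A compact sketch of the reordered loop (all names below are
hypothetical):

    // serviceUpdate and syncPartial are illustrative; they only sketch the
    // reordered per-service loop: do the cheap "did this service's chains
    // change?" check first and skip the rest of the iteration otherwise.
    type serviceUpdate struct {
        chainsChanged bool
        endpoints     []string
    }

    func syncPartial(changed map[string]serviceUpdate, writeServiceChains, writeEndpointChain func(string)) {
        for name, upd := range changed {
            if !upd.chainsChanged {
                continue // nothing to transact for this service
            }
            writeServiceChains(name)
            for _, ep := range upd.endpoints {
                writeEndpointChain(ep)
            }
        }
    }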
Signed-off-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>