RL8.9 nftables unexpected behavior with ipv6 masquerading and iproute2

I have a Router with two IPv6 NICs, let them be eth0 (default route) and tun0, and a Host with a self-assigned real IPv6 address in eth0's range. All traffic from the Host goes through eth0, no problem, but I want to pass some traffic via tun0 according to an IP set in nftables.
Because the Host has its own IPv6 address and tun0 has a different one, masquerading is required, along with packet marking according to the IP set. So the rules are:

chain nat_POSTROUTING {
    type nat hook postrouting priority srcnat; policy accept;
    meta nfproto ipv6 oifname "tun0" masquerade
}

chain mangle_PREROUTING {
    type filter hook prerouting priority mangle; policy accept;
    ip6 daddr @addr-tun0 counter meta mark set 4
}
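For context, the set @addr-tun0 that the mangle rule references could be declared roughly like this; the family and table name (ip6 mangle) are my assumptions, not taken from the original ruleset:

```shell
# Hypothetical declaration of the set used by the mangle rule above.
# The family/table ("ip6 mangle") are assumptions; adjust to the real ruleset.
nft add table ip6 mangle
nft add set ip6 mangle addr-tun0 '{ type ipv6_addr; flags interval; }'
# Populate it with destinations that should be marked for tun0:
nft add element ip6 mangle addr-tun0 '{ bbbb::100 }'
```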

The iproute2 rule is:

32762: from all fwmark 0x4 lookup tbltun0
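For completeness, the rule and table could be created with something like the following; the table number 100 and the rt_tables mapping are illustrative assumptions:

```shell
# Map a table number to the name tbltun0 (the number 100 is arbitrary)
echo '100 tbltun0' >> /etc/iproute2/rt_tables

# Send packets marked with 4 through tbltun0
ip -6 rule add fwmark 4 table tbltun0 priority 32762

# Give the table its default route via the tunnel
ip -6 route add default dev tun0 table tbltun0
```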

As a result, packets are correctly routed via tun0, BUT there is no masquerading: the source address is the Host's real IPv6 address. If I delete the mangle mark rule and instead add the iproute2 rule

32760: from <Host’s ipv6> lookup tbltun0

all traffic from the Host (as expected) goes via tun0 and the source address is NATed to the tun0 IP.
It seems I do not understand something.

The same idea works perfectly with IPv4 traffic.

Let's say the subnet on tun0 is a::/x (I don't use IPv6, so this is only pseudo-syntax):

  • tun0 has address a::1
  • the set addr-tun0 has address a::2
  • another machine in that subnet has address a::3

A packet arrives (from eth0) with destination a::y

  • if y is 1, then the packet is routed to localhost (via input filter)
  • if y is 3, then the packet is sent out from tun0 (after forward filter and nat postrouting)
  • if y is 2, then the packet is routed according to the routes in tbltun0;
    you want those routes to send the packet out from tun0 (after forward filter and nat postrouting)

Why the additional mangling, when the default ought to "do the right thing"?

On the router, eth0 is the WAN interface with real address aaaa::1.
tun0 has fd00:1::2; the other end of the tunnel has fd00:1::1.
Another computer has real address aaaa::2 and wants to connect to bbbb::100.
If I do nothing, the packet will go out from eth0 as aaaa::2 → bbbb::100, and the answer comes back to eth0 from bbbb::100 to aaaa::2.

I want this packet to go via tun0. So I need to create an iproute2 rule so that packets marked with 4 are routed according to tbltun0 (via tun0), and add the bbbb::100 element to the set @addr-tun0.
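Adding the destination to the set might look like this (the ip6 mangle table name is an assumption, not from the original config):

```shell
# Route traffic for bbbb::100 through the tunnel by adding it to the set
# that the mark rule matches on (family/table names are assumptions)
nft add element ip6 mangle addr-tun0 '{ bbbb::100 }'
```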

Then the packet from aaaa::2 goes via tun0 to bbbb::100; bbbb::100 sends the answer to aaaa::2, and this answer returns ANOTHER way, to eth0, not tun0. That's why masquerading is required: it changes the source address from aaaa::2 to fd00:1::2. The answer then returns the same way.

But masquerading somehow does not work if I add the marking rule. The packet goes to tun0 with source address aaaa::2 instead of fd00:1::2.

Remark:
I have a home router that does have IPv6. It gets a 2001:abcd::/64 subnet for my LAN from the ISP.
My machine gets four IPv6 routes:

$ ip -6 ro
::1 dev lo proto kernel metric 256 pref medium
2001:abcd::/64 dev enp7s0 proto ra metric 100 pref medium
fe80::/64 dev enp7s0 proto kernel metric 1024 pref medium
default via fe80::cdef dev enp7s0 proto ra metric 100 pref medium

That is, the default route is to the link-local address of the router, not to any of its public IPv6 addresses.


There obviously must be at least one router between the aaaa:: and bbbb:: subnets.
To bbbb::100 via whom? Is it aaaa::1?

Where is the other end of the tunnel? Is it within aaaa::, or outside?
Do you control/configure the other end?

Let's say the tun0 end is at Router2. The other end's IP is fd00:1::1; Router2 also has its real IP, let's say cccc::1.
If bbbb::100 receives a packet from aaaa::1, it sends the answer and it arrives at Router1's eth0, because Router1 has the aaaa::/64 net.
If bbbb::100 receives a packet from cccc::1, the answer comes to Router2 and can be forwarded via tun0 to Router1, because Router2's net is cccc::/64.
If I want to send some packets via tun0, I need to define an iproute2 rule (mark or from…) and a table tbltun0 with "default dev tun0". Then the route is:
Host → Router1(lan) → Router1(tun0) → Router2(tun0) → Router2(eth0) → DstHost

Then the answer must return the same way:
DstHost → Router2(eth0) → Router2(tun0) → Router1(tun0) → Router1(lan) → Host
And for that the source address needs to be changed (masquerading) from aaaa::1 to fd00:1::2.

Without the source change, the return path is:
DstHost → Router1(eth0) → Router1(lan) → Host

I have no problem with routing. If I use source-based routing instead of packet marking, traffic goes via the tunnel and the source address is correctly replaced by fd00:1::2. BUT in this case all traffic from aaaa::1 goes via the tunnel, not according to the list.
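The source-based variant described here is a single rule, sketched below; the host address aaaa::2 and the priority value are taken from the earlier examples and are illustrative:

```shell
# Route ALL traffic from the host through tbltun0, regardless of destination
# (host address and priority are placeholders from this thread's examples)
ip -6 rule add from aaaa::2 table tbltun0 priority 32760
```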

Yesterday I discovered that masquerading works if the set is small (2 elements). But if the set has about 10k elements (a regional list), there is no masquerading.

That sounds like a genuine issue, and if it is, then it probably exists in RHEL too.

What routes do you have in the tbltun0?

default dev tun0 metric 1024 pref medium

Which alone sends everything via tun0.

What if you had there:

fd00:1::/64 dev tun0
aaaa::/64 dev eth0
bbbb::100 via fd00:1::1
default via aaaa::x

That is, default via eth0 and only select targets via tun0.
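The suggested routes would be added to the table with something like the following; the addresses are the same placeholders used above (aaaa::x stands in for the real eth0 gateway):

```shell
# Selective tbltun0: only chosen targets go via tun0, the rest via eth0.
# All addresses are placeholders from the discussion above.
ip -6 route add fd00:1::/64 dev tun0 table tbltun0
ip -6 route add aaaa::/64 dev eth0 table tbltun0
ip -6 route add bbbb::100 via fd00:1::1 table tbltun0
ip -6 route add default via aaaa::x dev eth0 table tbltun0
```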

It is not just a default route, it is the default for THIS table, so nothing goes via tun0 except fd00:1::/64.

So I have:
default via xxxx::1 dev eth0 proto static metric 100 pref medium
default dev tun0 table tbltun0 metric 1024 pref medium

The second one is only in the iproute2 table! It is used only when the iproute2 rule is triggered.

I have changed

default dev tun0 table tbltun0 metric 1024 pref medium
to
default via fd00:1::1 dev tun0 table tbltun0 metric 1024 pref medium

and it seems that solved the problem.
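In command form, the change described above amounts to (a sketch, using the addresses from this thread):

```shell
# Replace the gateway-less default with one that names the tunnel peer
ip -6 route del default dev tun0 table tbltun0
ip -6 route add default via fd00:1::1 dev tun0 table tbltun0
```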


Let me rephrase – for the source-routing case:
If there is a rule that all packets from host aaaa::2 are handled by table tbltun0 (rather than table main), then the routes in table tbltun0 must handle all destinations. If table tbltun0 has only one route in it – the catch-all default – then obviously all packets (from aaaa::2) must use that one route.


When an interface (e.g. eno1) gets an address, for example aaaa::8, an implicit route to link-local neighbours is created (in the main table): aaaa::/x dev eno1
That route says that when the destination (e.g. aaaa::42) is a member of aaaa::, then it is sufficient to use a Neighbor Discovery query (the IPv6 equivalent of an ARP broadcast) "Who is aaaa::42" on eno1 to get the MAC address of that destination, and then throw the packet out from eno1.

If the destination is not a member of aaaa::, then it is in some other subnet. To reach other subnets, there must be a route that matches the destination and names the gateway to use. The gateway is a router that is a member of aaaa:: and of another subnet (which may be the destination's subnet, or may have the next gateway on the path towards the destination).

The default via fd00:1::1 has the "default", which matches any address, and the "via fd00:1::1". The latter says to pass the packet to gateway fd00:1::1 (in the hope that it knows what to do with it). Hence a Neighbor Discovery query "Who is fd00:1::1" is sent (to tun0, since fd00:1::1 is a link-local neighbour reachable via tun0). With the MAC for fd00:1::1 the packet can be passed on.

With that background,

the plain dev tun0 made no sense, as it did not name any gateway, but rather said "just ask for the MAC of bbbb::100 from tun0". bbbb::100 is not link-local to the tun0 interface, so it cannot answer, and routers do not pretend to be someone else. (Network bridges do, to some extent.) I'm surprised that it "worked" at all.

The via fd00:1::1 dev tun0 looks like a proper route “to remote destinations”.

A tunnel is a point-to-point connection; it does not need ARP (IPv4) or Neighbor Discovery (IPv6), because tunnel interfaces do not have MAC addresses. A tunnel has its own IP and a peer IP. Packets are just sent to the "other end" (the peer IP), and the other end decides what to do next.
So all tunnels work with "default dev tun0", without "via".
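For reference, a point-to-point address assignment that records the peer explicitly looks like this (addresses taken from the example earlier in the thread):

```shell
# Assign the local tunnel address with an explicit peer address;
# with the peer recorded, "default dev tun0" needs no via-gateway
ip -6 addr add fd00:1::2 peer fd00:1::1 dev tun0
```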


That explains it.

Except you just discovered that yours did not. (Perhaps someone forgot to tell nftables about point-to-point?)

I do not understand the situation.
Everything else is the same: an iproute2 table with a default route, and an nftables postrouting masquerade for oifname tun0.

  1. No mark rule in nftables; iproute2 rule "from <ip> lookup tbltun0".
    All traffic from that IP goes via the tunnel with the correct source address (masquerading works).

  2. Mark rule with a small set.
    Masquerading works; traffic for the ranges in the set goes via tun0.

  3. Mark rule with a big set.
    nftables does not do its job – no masquerade. iproute2 correctly routes marked traffic to tun0.

3a) Adding "via" to the default route in tbltun0 seems to let nftables understand that the source address should be changed.

And that's only for IPv6; IPv4 traffic goes correctly.
So I have found a solution for myself – use a small set, because right now that's enough for me.
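One way to narrow down where the big-set case fails is a counter in front of the masquerade rule (a debugging sketch; the "ip6 nat" table name is an assumption, adjust to the real ruleset):

```shell
# Insert a counter at the top of the postrouting chain to see whether
# packets reach it with the expected outgoing interface at all
nft insert rule ip6 nat nat_POSTROUTING oifname "tun0" counter
nft list chain ip6 nat nat_POSTROUTING
```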


Logically, the size of an nftables set (which is used only in prerouting) should have no effect in postrouting. IIRC, there at least were issues with huge (either ipset or nftables) sets – very slow, or something. This could be one such case.

Alas, RHEL 8 has already reached 8.10 and is in the "Maintenance phase" (until 2029). Hence it (and Rocky 8) will receive only critical fixes. Furthermore, one would have to reproduce the issue on RHEL 8 in order to create a bug report – there is no more CentOS Stream 8. (On the bright side, free-of-charge RHEL licensing does exist.) Anyway, an issue that occurs only with IPv6 P2P tunnels and has a workaround (the via gw_ip route) is hardly a "critical bug".

If the same issue is present also in RHEL 9/10, then there is more reason to report it to Red Hat.


As I noted before, not all traffic would go via the tunnel if the table tbltun0 had appropriate routes to direct some traffic via eth0.

All traffic from a SPECIFIC host goes via the tunnel because I decided so! I added the iproute2 rule:
from specific-ip6 lookup tbltun0
which means traffic from the Host to any destination goes by the routes in tbltun0.

That was set up to check that the table route works correctly and masquerading works.

Yes.

My bad. I got the impression that you emphasized the "all" as if it were an issue, while it was a completely insignificant detail. What you actually said was that the packets that were routed to the tunnel did get the expected SNAT treatment.


One could – for completeness – test how a table like the one below would behave:

ip rule add from specific-ip6 lookup tbltun0
for DST in $addr_tun0
do ip route add "$DST" dev tun0 table tbltun0
done
ip route add from default via xxxx::1 dev eth0 table tbltun0

where $addr_tun0 is a big list of destinations (in an appropriate format).

That puts the big list into a routing table, rather than into an nftables set. Traffic from specific-ip6 to any address not in addr_tun0 will use eth0, just like in the setup with the mark.

10000+ routes… I am not sure that's a good idea :wink:
But even if the system does not die, what happens if the destination is somewhere else, not among these routes? Traffic will be rejected.
If I try to route traffic by iproute2 alone, then I need this:

for DST in $addr_tun0
do ip rule add from specific-ip6 to "$DST" table tbltun0
done

That will create 10k rules. If the destination IP satisfies none of them, it will be routed by the default.

I do agree; the mere thought scares me. However, be it a table of routes or an nftables set, a list of 10000+ entries has to be loaded into kernel memory and searched against destination addresses. You have found that 10000+ entries in a set disturb postrouting. The (hypothetical) question is what is disturbed by 10000+ routes.


That is why I had above (alas, with a syntax error):

ip route add default via xxxx::1 dev eth0 table tbltun0

If the destination is somewhere else, then it matches the default route and follows the via xxxx::1 dev eth0, just like all the other packets that do not enter table tbltun0.

One more time: I want to route some traffic via tun0 and the rest via the default eth0. 10k routes in the table are no better than one default, because the decision to send via tun0 is made by the iproute2 rule ("from source-ip…" or "fwmark…"). If the nftables set is not used, that means 10k rules, not 10k routes.

That means passing ALL traffic via tbltun0. If there is no route to the destination in the table, traffic will be rejected.

I did answer your question about "if the destination is somewhere else", which was about the case of exactly one rule (from specific-ip6 lookup tbltun0) and no mark.


No, it means that all traffic coming from the address specific-ip6 will be handled according to the routes that table tbltun0 has. The routes in that table do not all technically have to direct to interface tun0. One can have a "default route" in there that handles all the destinations that the more specific routes do not.

It is naturally up to you which routes you add to that table.


I don’t think that 10k rules are any cheaper than 10k routes.


Overall, I am not saying that you should use 10k routes or 10k rules. I am saying that it would be of academic interest to know whether the kernel handles those better or worse than 10k addresses in an nftables set.