We have this problem too.  It occurs in two of our three meshes.  It is much more frequent lately.  I do not know whether it is merely coincidental that we recently upgraded from 21.01 to 21.02.

My current solution is to maintain a pair of OpenSSH tunnels between each DHCP server (in which gw_mode='server') and each client (in which gw_mode='client').  If a DHCP server finds itself with no clients that are (still?) in contact with it, it reboots.  If a client finds itself with no DHCP server that is (still?) in contact with it, it reboots.  It's a ridiculously heavy solution which is a lot of trouble to set up in a secure manner, but it has the advantage that each node can detect whether it is in contact with the node(s) with which it has one or more critical relationships.
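The reboot rule above can be sketched roughly as follows.  This is only a minimal Python illustration of the decision logic, not the actual scripts in use.  The names should_reboot, ssh_banner_probe and PEERS, and the addresses and ports, are all hypothetical.  The probe assumes each tunnel forwards a local port to the peer's SSH daemon, so that receiving the "SSH-" identification string shows the far end really answered.

```python
import socket
from typing import Callable, Iterable, Tuple

Peer = Tuple[str, int]  # (host, port) of a tunnel endpoint; values are hypothetical


def ssh_banner_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if an SSH identification string arrives from (host, port).

    Assumption: the port is forwarded through the tunnel to the peer's SSH
    daemon, so a banner proves end-to-end contact rather than only showing
    that the local end of the tunnel is listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            # Simplification: expects the first four bytes in a single read.
            return sock.recv(4) == b"SSH-"
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def should_reboot(peers: Iterable[Peer],
                  probe: Callable[[str, int], bool] = ssh_banner_probe) -> bool:
    """Return True only when this node has peers and none of them answers."""
    peer_list = list(peers)
    if not peer_list:
        # Nothing configured: do not reboot merely because the list is empty.
        return False
    return not any(probe(host, port) for host, port in peer_list)


# Hypothetical usage, e.g. run periodically from cron on each node:
#   PEERS = [("127.0.0.1", 2201), ("127.0.0.1", 2202)]
#   if should_reboot(PEERS):
#       subprocess.run(["reboot"])
```

A server would list the tunnel endpoints of its clients, and a client would list those of its servers; the rule is the same in both directions.  Requiring several consecutive failed checks before rebooting would be a reasonable refinement, but that is an addition beyond what is described above.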

I suspect this problem is actually a driver issue.  These are all Archer [CA]7 v [245] routers (affordable!) with QCA "wave1" radios.  I haven't been able to use the -CT (Candela Technologies) driver for those radios in a mesh; perhaps I haven't understood the advice I've received about that, or perhaps the advice just doesn't work.  Therefore, I have to use the stock (QCA) driver's inherent 802.11s implementation, which has quirks.  For example, it always fails, usually within hours or minutes, if I have tweaked the radio's built-in MAC address.  Therefore, I suspect the QCA firmware may be insufficiently hardened against the depredations of real-world environments.

On the other hand, this could be a real OpenWrt bug.  I have no explanation as to why it is suddenly so much more frequent.  If anyone can suggest debugging instrumentation that I haven't already tried, I'll be grateful for the advice.

