[OpenWrt-Devel] [PATCH][RFC] libubus: added max retry mechanism

Felix Fietkau nbd at openwrt.org
Sun Aug 3 08:15:50 EDT 2014

On 2014-08-01 18:01, Alexandru Ardelean wrote:
> Back with some findings.
> Please check this: 
> https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899
That looks very wrong to me. If that triggers, you're leaking the fd.

> The current problem [high CPU usage] seems to have appeared after some
> modifications we made to the code which were related to a recent
> proposal patch I sent [ that got rejected ].
> Specifically, this one : [PATCH] libubus: handle NULL ubus_context in
> all sync requests by creating a local instance
> [Some context]
> We modified our code in some parts to allocate a ubus_context object for
> certain sync calls, and free it right after we're done with them [the
> calls].
> This was done in order to avoid them getting mixed with async calls on
> the same ubus_context, and avoid races [ in our logic ].
> We then got this high CPU usage issue.
> strace showed something like this [ this is just one cycle; strace was
> spewing this continuously ]:
> ============================================================================
> clock_gettime(CLOCK_MONOTONIC, {46921, 889670571}) = 0
> clock_gettime(CLOCK_MONOTONIC, {46921, 889680749}) = 0
> epoll_wait(4, {{EPOLLIN, {u32=31068896, u64=31068896}}}, 10, -1) = 1
> recvmsg(0, 0x7fff5e7ce638, 0)           = -1 EAGAIN (Resource
> temporarily unavailable)
That looks odd, recvmsg is polling fd 0. Did you close stdin?

> ============================================================================
> Maybe we were also a bit too aggressive with our ubus_context alloc +
> free approach. We were doing them for each call.
> When we did some modifications to be a bit less aggressive, the high CPU
> usage was no longer reproducible [ for the case we used to reproduce it ].
> Also, for all our main/static ubus_context objects we add a
> connection_lost callback of our own, which essentially just does a
> ubus_reconnect() call and some setup again.
> We could try to use the ubus_auto_connect() call in the future; we
> implemented our connection_lost callback before that call existed.
> Seems that after the ubus_free() call the socket would get closed, which
> would also call our connection_lost callback, which would reconnect.
> This seemed to work fine the first few times [ for our case ].
> It was sometime later [ in our reproducing case, 1-2 minutes ] that this
> connection_lost cb + ubus_reconnect() call would cause strace to start
> spewing continuously.
> If we would have left the ubus_default_connection_lost() callback, it
> would have called uloop_end() and then procd would restart our process,
> and all would be well again for a while.
> But we did not want that.
Are you sure that this is a bug in libubus? If this is indeed the call
from within libubus, then it appears that something may have corrupted
the ubus_context. I think you need to take another look at the code.

> [End Some context]
> After trying out a few things, and tracing the code, it seemed that not
> closing the socket in ubus_reconnect() helped.
> I will say directly that I do not yet understand the libubus code 100%,
> and I am trying not to learn libubox yet, but I think that not closing
> the socket if ctx->sock.eof is true prevents a race condition where we
> would close some other context's socket.
> Not sure about this, which is why I am asking some opinions.
> Maybe Felix or someone can point us in a better direction with the
> investigation.
> Still, it makes sense to not close an eof-ed socket, and for our current
> case, I do not see the process taking up all the CPU anymore.
I completely disagree. Why would it make sense to leak the fd if eof is
set? The fact that it somehow prevents your 100% CPU loop could be
something unrelated.

- Felix