[OpenWrt-Devel] [PATCH][RFC] libubus: added max retry mechanism

Alexandru Ardelean ardeleanalex at gmail.com
Fri Aug 1 12:01:52 EDT 2014


Back with some findings.

Please check this:
https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899

The current problem [high CPU usage] seems to have appeared after some
modifications we made to our code, related to a recent proposal patch I
sent [that got rejected].
Specifically, this one: [PATCH] libubus: handle NULL ubus_context in all
sync requests by creating a local instance

[Some context]
We modified our code in some places to allocate a ubus_context object for
certain sync calls, and free it right after we're done with them [the
calls].
This was done to avoid the sync calls getting mixed with async calls on
the same ubus_context, and to avoid races [in our logic].
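
Roughly, the per-call pattern was along the lines of the sketch below
[a simplified illustration, not our actual code; the object path, method
name and timeout are placeholders]:
============================================================================
/* Allocate a private ubus_context for a single sync request and free it
 * right afterwards [hypothetical helper, for illustration only]. */
#include <libubus.h>

static int do_sync_call(const char *obj_path, const char *method,
                        struct blob_attr *msg)
{
        struct ubus_context *ctx;
        uint32_t id;
        int ret;

        ctx = ubus_connect(NULL);       /* fresh context just for this call */
        if (!ctx)
                return UBUS_STATUS_CONNECTION_FAILED;

        ret = ubus_lookup_id(ctx, obj_path, &id);
        if (!ret)
                ret = ubus_invoke(ctx, id, method, msg, NULL, NULL, 3000);

        ubus_free(ctx);                 /* closes this context's socket */
        return ret;
}
============================================================================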

We then got this high CPU usage issue.

strace showed something like this [ this is just one cycle; strace was
spewing this continuously ]:
============================================================================
clock_gettime(CLOCK_MONOTONIC, {46921, 889670571}) = 0
clock_gettime(CLOCK_MONOTONIC, {46921, 889680749}) = 0
epoll_wait(4, {{EPOLLIN, {u32=31068896, u64=31068896}}}, 10, -1) = 1
recvmsg(0, 0x7fff5e7ce638, 0)           = -1 EAGAIN (Resource temporarily
unavailable)
============================================================================

Maybe we were also a bit too aggressive with our ubus_context alloc + free
approach: we were doing it for every call.
When we made the code a bit less aggressive, the high CPU usage was no
longer reproducible [for the case we used to reproduce it].

Also, for all our main/static ubus_context objects we install a
connection_lost callback of our own, which essentially just calls
ubus_reconnect() and redoes some setup.
We could try to use the ubus_auto_connect() call in the future; we
implemented our connection_lost callback before ubus_auto_connect() was
available.
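
For reference, the handler is essentially along these lines [simplified
sketch; the function name is made up, and the retry policy plus the
re-registration of our objects are left out]:
============================================================================
#include <unistd.h>
#include <libubus.h>

/* Installed as ctx->connection_lost on our main/static contexts,
 * right after ubus_connect(). */
static void my_connection_lost(struct ubus_context *ctx)
{
        /* keep retrying until ubusd accepts us again */
        while (ubus_reconnect(ctx, NULL) != UBUS_STATUS_OK)
                sleep(1);

        /* hook the reconnected socket back into uloop */
        ubus_add_uloop(ctx);
}
============================================================================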

It seems that after the ubus_free() call the socket would get closed,
which would also trigger our connection_lost callback, which would then
reconnect.
This seemed to work fine the first few times [for our case].
It was some time later [in our reproducing case, 1-2 minutes] that this
connection_lost callback + ubus_reconnect() combination would make strace
start spewing the output above continuously.

If we had left the default ubus_default_connection_lost() callback in
place, it would have called uloop_end(), procd would have restarted our
process, and all would have been well again for a while.
But we did not want that.

[End Some context]

After trying out a few things and tracing the code, it seemed that not
closing the socket in ubus_reconnect() helped.
I will say up front that I do not understand the libubus code 100% yet,
and I am trying to avoid digging into libubox for now, but I think that
not closing the socket when ctx->sock.eof is true prevents a race
condition where we would close some other context's socket.
I am not sure about this, which is why I am asking for opinions.
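
For illustration, the idea is roughly the fragment below [a paraphrased
sketch of the relevant part of ubus_reconnect(); the commit linked above
is the actual change]:
============================================================================
        /* If uloop already flagged the descriptor as EOF, skip the
         * close() so we cannot end up closing an fd that may meanwhile
         * belong to another context. */
        if (ctx->sock.fd >= 0) {
                if (ctx->sock.registered)
                        uloop_fd_delete(&ctx->sock);

                if (!ctx->sock.eof)
                        close(ctx->sock.fd);
        }
============================================================================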

Maybe Felix or someone can point us in a better direction with the
investigation.
Still, it makes sense to not close an eof-ed socket, and for our current
case, I do not see the process taking up all the CPU anymore.

Thanks
Alex


On Tue, Jul 29, 2014 at 4:41 PM, Alexandru Ardelean <ardeleanalex at gmail.com>
wrote:

> I'll see what I can find.
>
> At the time I created this patch I had it reproducing more often, but not
> often enough to trace a cause.
> The patch seems to help, but [as you've said] the error seems like
> something else.
> Seems to be reproducing again, after a while, and in a certain
> configuration.
>
> I'll check it out.
> This high CPU usage thing got to annoy me enough that I gave it a bit more
> attention.
>
> Thanks
> Alex
>
> On Tue, Jul 29, 2014 at 4:30 PM, Felix Fietkau <nbd at openwrt.org> wrote:
>
>> On 2014-07-29 15:23, Alexandru Ardelean wrote:
>> > Quick info follow-up:  strace-ing on the PIDs (with high CPU usage)
>> > shows a lot of EAGAIN errors from recvfrom() and a lot of recvfrom() +
>> > poll() + clock_gettime_monotonic() calls.
>> >
>> > Thinking about it, would it make more sense to add a short wait instead
>> > [to reduce the throttle] and let it "infinitely" loop?  ["infinitely"
>> > loop == loop until it does not error with EAGAIN]
>> I think it's much more likely that this is a bug, caused by errors on
>> the ubus socket not being caught or handled properly.
>> If that is the case, then both approaches are wrong, and we should just
>> handle the error correctly.
>> Can you figure out what triggers this behavior?
>>
>> - Felix
>>
>
>