Ticket #1102 (new defect)

Opened 5 years ago

Last modified 5 years ago

[NEEDINFO] Sporadic hangs in PREPARE_WRITE stage

Reported by: scriptor Owned by: dgollub
Priority: normal Milestone:
Component: OpenSync Version: 0.38
Severity: normal Keywords: hangs, PREPARE_WRITE
Cc:

Description (last modified by dgollub) (diff)

Hello,

every now and then I face hangs with osynctool, when it comes to the PREPARE_WRITE stage. A test that takes usually no more than 10 seconds, lasts for 1 hour and finally gets killed by the ctest timeout.

For some time I have had the impression, that the message systems gets out of step, but now I am not sure about this, any more.

I attach a *.tar.bz2 with trace files, a README and the relevant part of LastTest?.log.

Attachments

56.tar.bz2 (76.3 KB) - added by scriptor 5 years ago.
A tar.bz2 file.

Change History

Changed 5 years ago by scriptor

A tar.bz2 file.

comment:1 Changed 5 years ago by scriptor

Well, that 56.tar.bz2 should have been a osynctool_add_mod_del_todo_ldif_and_sync_13.04.2009_00:56.tar.bz2

comment:2 Changed 5 years ago by dgollub

  • Description modified (diff)
  • Summary changed from Sporadic hangs in PREPARE_WRITE stage to [NEEDINFO] Sporadic hangs in PREPARE_WRITE stage

I just did a brief check of the attached trace files - why do you think this is a hang inside PREPARE_WRITE?

comment:3 Changed 5 years ago by scriptor

When something like this appears on the screen:

src/ldap_plugin.c:2266: Got 0 unmodified entries from LDAP server (note).

get_changes is over. At least from the perspective of the LDAP format plugin.

In other situations with this error, I could see from the trace files, that OSYNC_ENGINE_COMMAND_PREPARE_WRITE has already been entered. Otherwise I would not have found it in libopensync.

But I must admit, in this particular case here, it is not quite clear.

Well, I certainly get similar situations, which I can post here.

Unfortunately, this is one of the errors, that are not really reproducible. Regardless of with or without load.

You mean, it does not even get into PREPARE_WRITE stage? Or that the hang occurs regardless of the stage? At least for me it didn't do so, so far. But who knows.

If you know, how to debug this further (anywhere in ipc?) tell me.

Bye, bye.

comment:4 Changed 5 years ago by scriptor

BTW: One reason for hangs are timeouts set by libopensync that are too short. In those cases a corresponding error message occurs. "Disconnect." can be found in the trace files.

Now, I have changed the code in my local working copy as I have shown in

http://www.opensync.org/ticket/1100

and at least this particular reason for timeouts and hangs has been eliminated.

But the remaining cases do not seem to be affected by any timeouts, as I have increased several other ones in my local working copy.

Bye, bye.

comment:5 Changed 5 years ago by scriptor

In the meantime I have faced a few further hangs, and indeed, they are of a different nature than the ones I faced earlier:

All of them reach SYNC_DONE and DISCONNECT. So PREPARE_WRITE is wrong.

The space of time from the first entry in the trace files and the last entry is only 3 seconds. But the timeout of 5 minutes is met. So for the remaining 4 minutes and 57 seconds there is nothing in the trace files.

Unfortunately cmake sends a SIGKILL which I cannot catch...

Bye, bye

comment:6 Changed 5 years ago by scriptor

The hang occurs in osync_client_proxy_shutdown() in libopensync/opensync/client/opensync_client_proxy.c right on the comment "We now wait for the HUP on our sending queue", i.e. in message = osync_queue_get_message(proxy->outgoing);

BTW: replacing g_async_queue_pop() by g_async_queue_timed_pop() in osync_queue_get_message() in libopensync/opensync/client/opensync_client_proxy.c helps to avoid hangs and transform them into regular error conditions.

The HUPS seem to get sent before it comes to that message = osync_queue_get_message(proxy->outgoing); line, don't they?

comment:7 Changed 5 years ago by scriptor

Answering to myself: Well, yes, but the queue simply vanishes sometimes too early.

This is the culprit:

osyncClientDisconnectCallback() libopensync/opensync/client/opensync_client.c.

Using timeouts for ensuring proper communication between threads is the wrong way. May I suggest a solution with g_cond_broadcast or g_cond_signal or the like?

In the meantime I use the following as a workaround:

1742     /* Gives some time if anyone wants to grab the HUP message from the out     going part of the pipe */
1743     // g_usleep(200);  // orig
1744     g_usleep(200000);
1745     /* Now we can safely disconnect our outgoing queue */
1746     osync_queue_disconnect(client->outgoing, NULL);
Note: See TracTickets for help on using tickets.