Hi,

We have a two-node cluster running Heartbeat 2.1.4 on RHEL 5. Both nodes run OpenVZ, the VE data lives on a shared GFS disk, and the VEs are managed as ManageVE resources, some on node1 and some on node2.

We are experiencing strange behavior: sometimes the VEs of one node appear started on both nodes. If we monitor with hb_gui, one of the nodes is shown as stopped even though heartbeat is running there. The logs on the node that ends up with all the resources started look like this:

........................
heartbeat[24950]: 2008/12/19_18:07:54 info: time_longclock: clock_t wrapped around (uptime).
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 4294966306 ms (> 510 ms) before being called (GSource: 0x9744758)
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:54 notice: crmd_ha_status_callback: Status update: Node parmenides.upc.es now has status [dead]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112728
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: node parmenides.upc.es: is dead
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: no mbr_track info
heartbeat[24950]: 2008/12/19_18:07:54 info: Link parmenides.upc.es:eth1 dead.
cib[25144]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 4294966546 ms (> 510 ms) before being called (GSource: 0x9744988)
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: LOST: parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: mem_handle_event: instance=41, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112704
cib[25144]: 2008/12/19_18:07:54 info: cib_ccm_msg_callback: PEER: heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:54 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=41)
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 4294967216 ms (> 30000 ms) before being called (GSource: 0x9744a08)
ccm[25142]: 2008/12/19_18:07:54 info: Break tie for 2 nodes cluster
crmd[25148]: 2008/12/19_18:07:54 info: erase_node_from_join: Removed dead node parmenides.upc.es from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64112637
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: NEW MEMBERSHIP: trans=41, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 4294962206 ms (> 5000 ms) before being called (GSource: 0x9744908)
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: CURRENT: heraclito.upc.es [nodeid=0, born=41]
heartbeat[24950]: 2008/12/19_18:07:54 info: Gmain_timeout_dispatch: started at 4359079925 should have started at 64113138
crmd[25148]: 2008/12/19_18:07:54 info: ccm_event_detail: LOST: parmenides.upc.es [nodeid=1, born=40]
heartbeat[24950]: 2008/12/19_18:07:54 WARN: Gmain_timeout_dispatch: Dispatch function for memory stats was delayed 4244815296 ms (> 5000 ms) before being called (GSource: 0x9744888)
..................
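One detail that makes me doubt this is plain overload: the first message says clock_t wrapped around, and most of the huge dispatch delays are within a few seconds of 2^32 ms. A quick sanity check (just arithmetic, nothing specific to our setup):

  $ echo $((2**32))                # the unsigned 32-bit wrap boundary, in ms
  4294967296
  $ echo $((2**32 - 4294966306))   # the "send local status" delay, distance from the boundary
  990

So those "delays" look like small negative time differences wrapped into huge unsigned values (and clock_t itself wraps after 2^32 ticks, which at the usual 100 ticks per second is roughly 497 days of uptime), not real 49-day stalls. About a second later, the lost node comes back: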
heartbeat[24950]: 2008/12/19_18:07:55 CRIT: Cluster node parmenides.upc.es returning after partition.
heartbeat[24950]: 2008/12/19_18:07:55 info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain
heartbeat[24950]: 2008/12/19_18:07:55 WARN: Deadtime value may be too small.
heartbeat[24950]: 2008/12/19_18:07:55 info: See FAQ for information on tuning deadtime.
heartbeat[24950]: 2008/12/19_18:07:55 info: URL: http://linux-ha.org/FAQ#heavy_load
heartbeat[24950]: 2008/12/19_18:07:55 info: Link parmenides.upc.es:eth1 up.
heartbeat[24950]: 2008/12/19_18:07:55 info: Status update for node parmenides.upc.es: status active
crmd[25148]: 2008/12/19_18:07:55 notice: crmd_ha_status_callback: Status update: Node parmenides.upc.es now has status [active]
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: no mbr_track info
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[25144]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: mem_handle_event: instance=42, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: heraclito.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=42)
cib[25144]: 2008/12/19_18:07:56 info: cib_ccm_msg_callback: PEER: parmenides.upc.es
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW MEMBERSHIP: trans=42, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT: heraclito.upc.es [nodeid=0, born=38]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: CURRENT: parmenides.upc.es [nodeid=1, born=42]
crmd[25148]: 2008/12/19_18:07:56 info: ccm_event_detail: NEW: parmenides.upc.es [nodeid=1, born=42]
.................

It looks as if one of the nodes is declared lost and then recovers, but the cluster never returns to a normal state: the resources that belong to the recovered node stay started on the other node as well. The only way I have found to resolve the situation is to take some manual action on the cluster, for instance stopping a resource. When I do that, the cluster seems to notice what has happened, detects that some resources are active on both nodes, and corrects the situation.

What is happening? Is heavy load really the cause, as the log messages suggest? And why doesn't the cluster recover its normal status on its own? Our resource definition and the exact manual commands we use are in the P.S. below.

Thanks for your help.

Frank
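P.S. In case the configuration matters: each VE is defined as an ocf:heartbeat:ManageVE primitive, roughly like this (the IDs and the veid are illustrative, not our literal config):

  <primitive id="VE_101" class="ocf" provider="heartbeat" type="ManageVE">
    <instance_attributes id="VE_101_attrs">
      <attributes>
        <nvpair id="VE_101_veid" name="veid" value="101"/>
      </attributes>
    </instance_attributes>
  </primitive>

with location constraints so that some VEs prefer node1 and the rest prefer node2.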
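P.P.S. The "manual action" that unsticks the cluster is nothing more than stopping and restarting one resource by hand, along these lines (resource name illustrative):

  # mark the resource stopped; the CRM then notices the duplicate copy
  crm_resource -r VE_101 -p target_role -v stopped
  # start it again once the extra copy has been cleaned up
  crm_resource -r VE_101 -p target_role -v started
  # confirm where it is running now
  crm_resource -W -r VE_101

Given the "Deadtime value may be too small" warning, we are also wondering whether simply raising the timers in ha.cf would help, for example something like (values illustrative, not our current settings):

  keepalive 2
  warntime 10
  deadtime 30
  initdead 60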