[Linux-cluster] clvmd hangs

Wed May 16 15:30:31 UTC 2007

Thanks for your help. My answers are inline below, under each quoted question.

David Teigland wrote:
> On Thu, May 03, 2007 at 11:27:08AM +0200, Sebastian Walter wrote:
>   
>> Sebastian Walter wrote:
>>     
>>> Thanks for your help. These are /proc/cluster/services:
>>>
>>> ###master
>>> Service          Name                              GID LID State     Code
>>> Fence Domain:    "default"                           6   2 run       -
>>> [3 2 1]
>>>
>>> DLM Lock Space:  "clvmd"                             5   3 join      S-6,20,3
>>> [3 2 1]
>>>
>>> ### node1:
>>> Service          Name                              GID LID State     Code
>>> Fence Domain:    "default"                           6   2 run       -
>>> [3 2 1]
>>>
>>> DLM Lock Space:  "clvmd"                             5   3 update    U-4,1,1
>>> [2 3 1]
>>>
>>>       
> This says that the dlm is stuck in recovery on all the nodes.
> Which version of the code are you using?
>   
ccsd 1.07
cman_tool 1.0.11
fenced 1.32.25
clvmd 2.02.06, protocol 0.2.1
> Has this happened more than once?
>   
This happens every time.
> Does the cluster have quorum? (cman_tool status)
>   
Yes:
[root@xx ~]# cman_tool status
Protocol version: 5.0.1
Config version: 28
Cluster name: xx
Cluster ID: 338
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 2
Expected_votes: 1
Total_votes: 2
Quorum: 1  
Active subsystems: 3
Node name: xx.xx.xx.xx
Node ID: 1
Node addresses: xx.xx.xx.xx

> What does /proc/cluster/dlm_debug show from all nodes?
>   
[root@master ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,3,0
clvmd move use event 3
clvmd recover event 3 (first)
clvmd add nodes

[root@compute-0-2 ~]# cat /proc/cluster/dlm_debug
clvmd move flags 0,1,0 ids 0,2,0
clvmd move use event 2
clvmd recover event 2 (first)
clvmd add nodes
clvmd total nodes 1
clvmd rebuild resource directory
clvmd rebuilt 0 resources
clvmd recover event 2 done
clvmd move flags 0,0,1 ids 0,2,2
clvmd process held requests
clvmd processed 0 requests
clvmd recover event 2 finished
clvmd move flags 1,0,0 ids 2,2,2
clvmd move flags 0,1,0 ids 2,3,2
clvmd move use event 3
clvmd recover event 3
clvmd add node 1

(I reduced the cluster to 2 nodes; the problem is the same.)
> What are the dlm threads waiting on? (ps ax -o pid,stat,wchan,cmd | grep dlm)
>   
[root@xx ~]# ps ax -o pid,stat,wchan,cmd|grep dlm
28397 S<   dlm_as [dlm_astd]
28398 S<   dlm_re [dlm_recvd]
28399 S<   dlm_se [dlm_sendd]
28400 S<   dlm_wa [dlm_recoverd]

[root@compute-0-2 ~]# ps ax -o pid,stat,wchan,cmd|grep dlm
 4930 S<   dlm_as [dlm_astd]
 4931 S<   dlm_re [dlm_recvd]
 4932 S<   dlm_se [dlm_sendd]
 4933 S<   dlm_wa [dlm_recoverd]
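
To compare service states across nodes without reading each listing by eye, here is a minimal sketch that scans saved /proc/cluster/services text and reports every service whose State column is not "run". It assumes "run" is the settled state (as the Fence Domain rows above suggest) and that rows look like the ones pasted in this thread. The names stuck_services and SERVICE_ROW are made up for illustration and are not part of any cluster tool.

```python
import re

# Hypothetical helper, not part of cman/dlm. Matches a service row such as:
#   DLM Lock Space:  "clvmd"      5   3 join
# Captures the service kind, quoted name, GID, LID and State columns.
SERVICE_ROW = re.compile(
    r'^(?P<kind>[^":]+):\s+"(?P<name>[^"]+)"\s+'
    r'(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)'
)


def stuck_services(services_text):
    """Return (kind, name, state) for every service whose state is not 'run'.

    Assumption: 'run' is the normal settled state; anything else
    (for example 'join' or 'update') is reported.
    """
    stuck = []
    for line in services_text.splitlines():
        match = SERVICE_ROW.match(line.strip())
        if match and match.group("state") != "run":
            stuck.append(
                (match.group("kind"), match.group("name"), match.group("state"))
            )
    return stuck


# Sample copied from the master node listing quoted earlier in this thread.
sample = '''Service          Name                              GID LID State     Code
Fence Domain:    "default"                           6   2 run       -
[3 2 1]

DLM Lock Space:  "clvmd"                             5   3 join
S-6,20,3
[3 2 1]
'''

print(stuck_services(sample))
```

Header rows, member lists such as [3 2 1], and wrapped Code values are skipped because they do not match the row pattern, so the function works whether or not the Code column has wrapped onto its own line.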

Sebastian