[Linux-cluster] LVS not failing over properly

Mon Aug 25 22:22:55 UTC 2008

I have an LVS-NAT implementation in the lab that mostly works.  I have a 
primary and a hot backup LVS node, with two web servers behind them.  I can 
point my web browser at the virtual IP and get the Apache test page just 
fine.  The httpd access logs on the two real web servers show that the 
load is being distributed. 

The problem appears when I test failover of the LVS nodes.  I shut the 
primary node down, and the backup at least attempts to take over and, 
according to its log, seems to do so successfully:

Aug 25 18:21:44 lb2 pulse[5064]: partner dead: activating lvs
Aug 25 18:21:44 lb2 lvs[5083]: starting virtual service glassfish active: 80
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record for 10.11.12.10 on eth1.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Withdrawing address record for 10.11.12.10 on eth1.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record for 10.11.12.10 on eth1.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record for 10.100.13.220 on eth0.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Withdrawing address record for 10.100.13.220 on eth0.
Aug 25 18:21:44 lb2 avahi-daemon[3136]: Registering new address record for 10.100.13.220 on eth0.
Aug 25 18:21:44 lb2 lvs[5083]: create_monitor for glassfish/gf1 running as pid 5094
Aug 25 18:21:44 lb2 nanny[5094]: starting LVS client monitor for 10.100.13.220:80
Aug 25 18:21:44 lb2 nanny[5095]: starting LVS client monitor for 10.100.13.220:80
Aug 25 18:21:44 lb2 lvs[5083]: create_monitor for glassfish/gf2 running as pid 5095
Aug 25 18:21:44 lb2 nanny[5094]: making 10.11.12.1:80 available
Aug 25 18:21:44 lb2 nanny[5095]: making 10.11.12.2:80 available
Aug 25 18:21:49 lb2 pulse[5085]: gratuitous lvs arps finished

However, once the backup has taken over, attempts to refresh the page 
from my web browser fail.  The lvs.cf is synchronized between the LVS 
nodes.  Here's a copy of the config:

serial_no = 49
primary = 10.100.13.96
primary_private = 10.11.12.8
service = lvs
backup_active = 1
backup = 10.100.13.87
backup_private = 10.11.12.9
heartbeat = 1
heartbeat_port = 539
keepalive = 6
deadtime = 10
network = nat
nat_router = 10.11.12.10 eth1:1
nat_nmask = 255.255.255.0
debug_level = NONE
monitor_links = 1
virtual glassfish {
     active = 1
     address = 10.100.13.220 eth0:1
     vip_nmask = 255.255.255.0
     port = 80
     send = "GET / HTTP/1.0\r\n\r\n"
     expect = "HTTP"
     use_regex = 0
     load_monitor = none
     scheduler = wlc
     protocol = tcp
     timeout = 6
     reentry = 15
     quiesce_server = 0
     server gf1 {
         address = 10.11.12.1
         active = 1
         weight = 1
     }
     server gf2 {
         address = 10.11.12.2
         active = 1
         weight = 1
     }
}
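
With LVS-NAT the real servers have to send their replies back through the 
nat_router address, so one sanity check on the addressing above is that 
gf1 and gf2 sit in the same subnet as nat_router.  The following is only a 
rough Python sketch of that check, not part of piranha: the parser, the 
function names and the trimmed copy of the config are assumptions made for 
illustration.

```python
import ipaddress
import re

# Trimmed copy of the lvs.cf above; only the lines the check needs.
LVS_CF = """\
nat_router = 10.11.12.10 eth1:1
nat_nmask = 255.255.255.0
virtual glassfish {
     address = 10.100.13.220 eth0:1
     server gf1 {
         address = 10.11.12.1
     }
     server gf2 {
         address = 10.11.12.2
     }
}
"""


def parse_lvs_cf(text):
    """Return (top-level settings, {server name: address})."""
    settings, servers = {}, {}
    stack = []  # (kind, name) for each block that is currently open
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        block = re.match(r"(virtual|server)\s+(\S+)\s*\{$", line)
        if block:
            stack.append((block.group(1), block.group(2)))
        elif line == "}":
            stack.pop()
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if not stack:
                settings[key] = value
            elif stack[-1][0] == "server" and key == "address":
                servers[stack[-1][1]] = value
    return settings, servers


def servers_outside_nat_subnet(settings, servers):
    """List real servers whose address is not in the nat_router subnet."""
    router_ip = settings["nat_router"].split()[0]  # drop the "eth1:1" part
    subnet = ipaddress.ip_network(
        f"{router_ip}/{settings['nat_nmask']}", strict=False
    )
    return [
        name
        for name, addr in servers.items()
        if ipaddress.ip_address(addr) not in subnet
    ]


settings, servers = parse_lvs_cf(LVS_CF)
print(servers_outside_nat_subnet(settings, servers))
```

An empty list means both real servers are inside the nat_router subnet.  
It does not prove that their default gateway actually points at 
10.11.12.10; that still has to be checked on gf1 and gf2 themselves.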

I believe the problem lies in the ARP handling, but I'm not sure how to 
diagnose this.  There are no firewalls between my browser and the LVS 
nodes, and I'm using a fairly dumb 100 Mb switch (I also tried a smarter 
switch).
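
One way to test the ARP theory is to compare the MAC address that a client 
has cached for 10.100.13.220 before and after the failover.  The sketch 
below assumes a Linux client on the same subnet as the VIP (if a router 
sits in between, it is the router's ARP cache that matters) and parses 
text in the /proc/net/arp layout.  The function names are hypothetical and 
the MAC addresses in the sample are made up.

```python
HEADER = "IP address       HW type     Flags       HW address            Mask     Device\n"

# Hypothetical snapshots of /proc/net/arp; the MAC addresses are invented.
BEFORE = HEADER + "10.100.13.220    0x1         0x2         00:00:5e:00:53:01     *        eth0\n"
AFTER = HEADER + "10.100.13.220    0x1         0x2         00:00:5e:00:53:02     *        eth0\n"


def mac_for(arp_table_text, ip):
    """Return the MAC cached for ip in /proc/net/arp-style text, or None."""
    for line in arp_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: IP address, HW type, Flags, HW address, Mask, Device
        if len(fields) >= 6 and fields[0] == ip:
            return fields[3]
    return None


def vip_moved(before_text, after_text, vip):
    """True if both snapshots have an entry for vip and the MAC changed."""
    before = mac_for(before_text, vip)
    after = mac_for(after_text, vip)
    return before is not None and after is not None and before != after


print(vip_moved(BEFORE, AFTER, "10.100.13.220"))
```

On a real client the two snapshots would come from reading /proc/net/arp 
once while the primary is active and once after the backup has logged 
"gratuitous lvs arps finished".  If the MAC for the VIP is unchanged, the 
client is still sending to the dead primary, which would point at the 
gratuitous ARPs not reaching or not being honored by the client.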

Any help would be greatly appreciated.

Thanks,

James