Date: Wed, 21 Apr 2010 12:02:16 +0300
From: Mikolaj Golub <to.my.trociny@gmail.com>
To: freebsd-fs <freebsd-fs@freebsd.org>
Subject: HAST: primary might get stuck when there are connectivity problems with secondary
Message-ID: <86r5m9dvqf.fsf@zhuzha.ua1>
--=-=-=

Hi,

I can make the HAST primary get stuck by making the secondary inaccessible (network packets are lost) for some period of time. I run HAST in VirtualBox hosts, so to emulate a network outage I just change the bridge interface in the VirtualBox configuration.

Below are details for one example.

On the primary, before the outage, sockstat shows:

root hastd 1571 10 tcp4 172.20.66.201:41841 172.20.66.202:8457
root hastd 1571 11 tcp4 172.20.66.201:57596 172.20.66.202:8457

During the outage and after it sockstat shows the same, while netstat shows:

tcp4 0    0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 8307 172.20.66.201.41841 172.20.66.202.8457 ESTABLISHED

(note the non-zero value for the send buffer) and then later:

tcp4 0 0 172.20.66.201.57596 172.20.66.202.8457 ESTABLISHED
tcp4 0 0 172.20.66.201.41841 172.20.66.202.8457 CLOSED

Restoring the network after this changes nothing. The primary stays stuck. There are no messages in the log, and "dirty" in the status output does not change:

[root@hasta ~]# hastctl status
storage:
  role: primary
  provname: storage
  localpath: /dev/ad4
  extentsize: 2097152
  keepdirty: 64
  remoteaddr: 172.20.66.202
  replication: memsync
  status: complete
  dirty: 2097152 bytes

On the secondary, all this time we have:

tcp4 0 0 172.20.66.202.8457 172.20.66.201.57596 ESTABLISHED
tcp4 0 0 172.20.66.202.8457 172.20.66.201.41841 ESTABLISHED

The last messages in the log:

Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Request received from the kernel: READ(13565952, 65536).
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411bc0) Moving request to the send queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Got free request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Waiting for request from the kernel.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: (0x28411bc0) Moving request to the done queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Got request.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: (0x28411bc0) Moving request to the free queue.
Apr 21 10:50:21 hasta hastd: [storage] (primary) ggate_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Request received from the kernel: READ(1812529152, 65536).
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b80) Moving request to the send queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: Taking free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Got free request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_recv: (0x28411b00) Waiting for request from the kernel.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: (0x28411b80) Moving request to the done queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) local_send: Taking request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Got request.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: (0x28411b80) Moving request to the free queue.
Apr 21 10:51:00 hasta hastd: [storage] (primary) ggate_send: Taking request.

The backtrace of the stuck hastd is in the attachment.

I interpret this in the following way. Although the network is down, hast_proto_send() in remote_send_thread() returns success (the sent data are stored in the kernel buffer). The kernel then tries to send the data, eventually fails after a timeout, and closes the socket.
hastd is not aware of this. At this time remote_send_thread() is blocked in "Taking request", and the sync thread is waiting for status from the secondary about the sent data, but the secondary does not send it because it did not receive any data.

Restarting hastd on the secondary usually helps.

A workaround is to set net.inet.tcp.keepidle to some small value (e.g. 300 sec) on the secondary. Then the secondary will notice much earlier that the peer has closed the connection and will restart the worker itself:

Apr 21 11:52:21 hastb hastd: [storage] (secondary) Unable to receive request header: Connection reset by peer.
Apr 21 11:52:21 hastb hastd: [storage] (secondary) Worker process (pid=1398) exited ungracefully: status=19200.

-- 
Mikolaj Golub

--=-=-=
Content-Type: application/octet-stream
Content-Disposition: attachment; filename=bt.log
Content-Transfer-Encoding: base64

VGhyZWFkIDggKFRocmVhZCAyODQwNDE0MCAoTFdQIDEwMDA3OCkpOgojMCAgMHgyODIzZWRkNyBp biBfX2Vycm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMSAgMHgyODIzZTliOCBpbiBfX2Vy cm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMiAgMHgyODRjMzUyMCBpbiA/PyAoKQojMyAg MHgwMDAwMDAwOCBpbiA/PyAoKQojNCAgMHgwMDAwMDAwMSBpbiA/PyAoKQojNSAgMHgyODRjMzUw MCBpbiA/PyAoKQojNiAgMHgwMDAwMDAwMCBpbiA/PyAoKQojNyAgMHgyODBhNWEwMCBpbiA/PyAo KQojOCAgMHhiZmJmZTk4MCBpbiA/PyAoKQojOSAgMHgyODIzZDMxZiBpbiBwdGhyZWFkX3NldGNh bmNlbHN0YXRlICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMTAgMHgyODIzY2JiZSBpbiBwdGhy ZWFkX2NvbmRfc2lnbmFsICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMTEgMHgwODA1OGU3OCBp biBjdl93YWl0IChjdj0weDgwNjdlMmMsIGxvY2s9MHg4MDY3ZTI4KSBhdCBzeW5jaC5oOjEyNQoj MTIgMHgwODA1Yjc1ZSBpbiBjdl90aW1lZHdhaXQgKGN2PTB4ODA2N2UyYywgbG9jaz0weDgwNjdl MjgsIHRpbWVvdXQ9MCkgYXQgc3luY2guaDoxMzUKIzEzIDB4MDgwNWI3MmMgaW4gZ3VhcmRfdGhy ZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vzci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6MTc4 NwojMTQgMHgwODA1ODIwNiBpbiBoYXN0ZF9wcmltYXJ5IChyZXM9MHgyODRjYWIwMCkgYXQgL3Vz ci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6NzczCiMxNSAweDA4MDRjNGU4IGluIGNvbnRyb2xf
c2V0X3JvbGUgKGNmZz0weDgwNjY1MDAsIG52b3V0PTB4Mjg0ZWIwYjAsIHJvbGU9MiAnXDAwMics IHJlcz0weDI4NGNhYjAwLCAKICAgIG5hbWU9MHgyODQ4MTQ0MiAic3RvcmFnZSIsIG5vPTApIGF0 IC91c3Ivc3JjL3NiaW4vaGFzdGQvY29udHJvbC5jOjExNAojMTYgMHgwODA0Y2QwMSBpbiBjb250 cm9sX2hhbmRsZSAoY2ZnPTB4ODA2NjUwMCkgYXQgL3Vzci9zcmMvc2Jpbi9oYXN0ZC9jb250cm9s LmM6MzMyCiMxNyAweDA4MDRmMDdjIGluIG1haW5fbG9vcCAoKSBhdCAvdXNyL3NyYy9zYmluL2hh c3RkL2hhc3RkLmM6NDI1CiMxOCAweDA4MDRmM2U4IGluIG1haW4gKGFyZ2M9MCwgYXJndj0weGJm YmZlZGE0KSBhdCAvdXNyL3NyYy9zYmluL2hhc3RkL2hhc3RkLmM6NTIxCgpUaHJlYWQgNyAoVGhy ZWFkIDI4NDA0MjgwIChMV1AgMTAwMTExKSk6CiMwICAweDI4MzQ0NzczIGluIGlvY3RsICgpIGZy b20gL2xpYi9saWJjLnNvLjcKIzEgIDB4MDgwNTg4YzQgaW4gZ2dhdGVfcmVjdl90aHJlYWQgKGFy Zz0weDI4NGNhYjAwKSBhdCAvdXNyL3NyYy9zYmluL2hhc3RkL3ByaW1hcnkuYzo4OTQKIzIgIDB4 MjgyMzQyOGYgaW4gcHRocmVhZF9nZXRwcmlvICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMyAg MHgwMDAwMDAwMCBpbiA/PyAoKQoKVGhyZWFkIDYgKFRocmVhZCAyODQwNDNjMCAoTFdQIDEwMDEx MikpOgojMCAgMHgyODIzZWRkNyBpbiBfX2Vycm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwoj MSAgMHgyODIzZTliOCBpbiBfX2Vycm9yICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojMiAgMHgy ODRjMzJhMCBpbiA/PyAoKQojMyAgMHgwMDAwMDAwOCBpbiA/PyAoKQojNCAgMHgwMDAwMDAwMSBp biA/PyAoKQojNSAgMHgyODRjMzI4MCBpbiA/PyAoKQojNiAgMHgwMDAwMDAwMCBpbiA/PyAoKQoj NyAgMHhiZjhmZGU5NCBpbiA/PyAoKQojOCAgMHgyODIzOGRiNSBpbiBwdGhyZWFkX3J3bG9ja191 bmxvY2sgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiM5ICAweDI4MjNjYmJlIGluIHB0aHJlYWRf Y29uZF9zaWduYWwgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMCAweDA4MDU4ZTc4IGluIGN2 X3dhaXQgKGN2PTB4Mjg0YzkwODAsIGxvY2s9MHgyODRjOTA3OCkgYXQgc3luY2guaDoxMjUKIzEx IDB4MDgwNThmMzcgaW4gbG9jYWxfc2VuZF90aHJlYWQgKGFyZz0weDI4NGNhYjAwKSBhdCAvdXNy L3NyYy9zYmluL2hhc3RkL3ByaW1hcnkuYzoxMDMyCiMxMiAweDI4MjM0MjhmIGluIHB0aHJlYWRf Z2V0cHJpbyAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzEzIDB4MDAwMDAwMDAgaW4gPz8gKCkK ClRocmVhZCA1IChUaHJlYWQgMjg0MDQ1MDAgKExXUCAxMDAxMTMpKToKIzAgIDB4MjgyM2VkZDcg aW4gX19lcnJvciAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzEgIDB4MjgyM2U5YjggaW4gX19l 
cnJvciAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzIgIDB4Mjg0YzMzYTAgaW4gPz8gKCkKIzMg IDB4MDAwMDAwMDggaW4gPz8gKCkKIzQgIDB4MDAwMDAwMDEgaW4gPz8gKCkKIzUgIDB4Mjg0YzMz ODAgaW4gPz8gKCkKIzYgIDB4MDAwMDAwMDAgaW4gPz8gKCkKIzcgIDB4MDAwMDAwMDAgaW4gPz8g KCkKIzggIDB4ZDJlZGUzODkgaW4gPz8gKCkKIzkgIDB4MjgyM2QzMWYgaW4gcHRocmVhZF9zZXRj YW5jZWxzdGF0ZSAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzEwIDB4MjgyM2NiYmUgaW4gcHRo cmVhZF9jb25kX3NpZ25hbCAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzExIDB4MDgwNThlNzgg aW4gY3Zfd2FpdCAoY3Y9MHgyODRjOTA4NCwgbG9jaz0weDI4NGM5MDdjKSBhdCBzeW5jaC5oOjEy NQojMTIgMHgwODA1OTUwZiBpbiByZW1vdGVfc2VuZF90aHJlYWQgKGFyZz0weDI4NGNhYjAwKSBh dCAvdXNyL3NyYy9zYmluL2hhc3RkL3ByaW1hcnkuYzoxMTE3CiMxMyAweDI4MjM0MjhmIGluIHB0 aHJlYWRfZ2V0cHJpbyAoKSBmcm9tIC9saWIvbGlidGhyLnNvLjMKIzE0IDB4MDAwMDAwMDAgaW4g Pz8gKCkKClRocmVhZCA0IChUaHJlYWQgMjg0MDQ2NDAgKExXUCAxMDAxMTQpKToKIzAgIDB4Mjgy ZmFhNTcgaW4gcmVjdmZyb20gKCkgZnJvbSAvbGliL2xpYmMuc28uNwojMSAgMHgyODI4MGJlMiBp biByZWN2ICgpIGZyb20gL2xpYi9saWJjLnNvLjcKIzIgIDB4MDgwNWMyODcgaW4gcHJvdG9fY29t bW9uX3JlY3YgKGZkPTExLCBkYXRhPTB4YmY2ZmJmMjcgIiIsIHNpemU9NSkKICAgIGF0IC91c3Iv c3JjL3NiaW4vaGFzdGQvcHJvdG9fY29tbW9uLmM6NzgKIzMgIDB4MDgwNWQ0ZjAgaW4gdGNwNF9y ZWN2IChjdHg9MHgyODQ3ZjIyMCwgZGF0YT0weGJmNmZiZjI3ICIiLCBzaXplPTUpCiAgICBhdCAv dXNyL3NyYy9zYmluL2hhc3RkL3Byb3RvX3RjcDQuYzozMjUKIzQgIDB4MDgwNWJkZjEgaW4gcHJv dG9fcmVjdiAoY29ubj0weDI4NGViMTUwLCBkYXRhPTB4YmY2ZmJmMjcsIHNpemU9NSkgYXQgL3Vz ci9zcmMvc2Jpbi9oYXN0ZC9wcm90by5jOjE5OAojNSAgMHgwODA0ZGRhZSBpbiBoYXN0X3Byb3Rv X3JlY3ZfaGRyIChjb25uPTB4Mjg0ZWIxNTAsIG52cD0weGJmNmZiZjdjKSBhdCAvdXNyL3NyYy9z YmluL2hhc3RkL2hhc3RfcHJvdG8uYzoyOTgKIzYgIDB4MDgwNTllZjkgaW4gcmVtb3RlX3JlY3Zf dGhyZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vzci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6 MTI4MgojNyAgMHgyODIzNDI4ZiBpbiBwdGhyZWFkX2dldHByaW8gKCkgZnJvbSAvbGliL2xpYnRo ci5zby4zCiM4ICAweDAwMDAwMDAwIGluID8/ICgpCgpUaHJlYWQgMyAoVGhyZWFkIDI4NDA0Nzgw IChMV1AgMTAwMTE1KSk6CiMwICAweDI4MjNlZGQ3IGluIF9fZXJyb3IgKCkgZnJvbSAvbGliL2xp 
YnRoci5zby4zCiMxICAweDI4MjNlOWI4IGluIF9fZXJyb3IgKCkgZnJvbSAvbGliL2xpYnRoci5z by4zCiMyICAweDI4NGMzNGEwIGluID8/ICgpCiMzICAweDAwMDAwMDA4IGluID8/ICgpCiM0ICAw eDAwMDAwMDAxIGluID8/ICgpCiM0ICAweDAwMDAwMDAxIGluID8/ICgpCiM1ICAweDI4NGMzNDgw IGluID8/ICgpCiM2ICAweDAwMDAwMDAwIGluID8/ICgpCiM3ICAweDAwMDAwMDAwIGluID8/ICgp CiM4ICAweDAwMDAwMDAwIGluID8/ICgpCiM5ICAweDI4MjNkMzFmIGluIHB0aHJlYWRfc2V0Y2Fu Y2Vsc3RhdGUgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMCAweDI4MjNjYmJlIGluIHB0aHJl YWRfY29uZF9zaWduYWwgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMSAweDA4MDU4ZTc4IGlu IGN2X3dhaXQgKGN2PTB4ODA2N2UxNCwgbG9jaz0weDgwNjdlMTApIGF0IHN5bmNoLmg6MTI1CiMx MiAweDA4MDVhNDBiIGluIGdnYXRlX3NlbmRfdGhyZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vz ci9zcmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6MTM4MwojMTMgMHgyODIzNDI4ZiBpbiBwdGhyZWFk X2dldHByaW8gKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxNCAweDAwMDAwMDAwIGluID8/ICgp CgpUaHJlYWQgMiAoVGhyZWFkIDI4NDA0OGMwIChMV1AgMTAwMTE2KSk6CiMwICAweDI4MjNlZGQ3 IGluIF9fZXJyb3IgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxICAweDI4MjNlOWI4IGluIF9f ZXJyb3IgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMyICAweDI4NGMzMWEwIGluID8/ICgpCiMz ICAweDAwMDAwMDA4IGluID8/ICgpCiM0ICAweDAwMDAwMDAxIGluID8/ICgpCiM1ICAweDI4NGMz MTgwIGluID8/ICgpCiM2ICAweDAwMDAwMDAwIGluID8/ICgpCiM3ICAweDAwMDAwMDAwIGluID8/ ICgpCiM4ICAweGJmNGY5ZWE4IGluID8/ICgpCiM5ICAweDI4MjNkMzFmIGluIHB0aHJlYWRfc2V0 Y2FuY2Vsc3RhdGUgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMCAweDI4MjNjYmJlIGluIHB0 aHJlYWRfY29uZF9zaWduYWwgKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxMSAweDA4MDU4ZTc4 IGluIGN2X3dhaXQgKGN2PTB4ODA2N2UyMCwgbG9jaz0weDgwNjdlMWMpIGF0IHN5bmNoLmg6MTI1 CiMxMiAweDA4MDVhN2NjIGluIHN5bmNfdGhyZWFkIChhcmc9MHgyODRjYWIwMCkgYXQgL3Vzci9z cmMvc2Jpbi9oYXN0ZC9wcmltYXJ5LmM6MTQ3MgojMTMgMHgyODIzNDI4ZiBpbiBwdGhyZWFkX2dl dHByaW8gKCkgZnJvbSAvbGliL2xpYnRoci5zby4zCiMxNCAweDAwMDAwMDAwIGluID8/ICgpCgpU aHJlYWQgMSAoVGhyZWFkIDI4NDA0YTAwIChMV1AgMTAwMTE3KSk6CiMwICAweDI4MmZhYTU1IGlu IHJlY3Zmcm9tICgpIGZyb20gL2xpYi9saWJjLnNvLjcKIzEgIDB4MjgyODBiZTIgaW4gcmVjdiAo 
KSBmcm9tIC9saWIvbGliYy5zby43CiMyICAweDA4MDVjMjg3IGluIHByb3RvX2NvbW1vbl9yZWN2 IChmZD05LCBkYXRhPTB4YmYzZjhmNDcgIioiLCBzaXplPTUpCiAgICBhdCAvdXNyL3NyYy9zYmlu L2hhc3RkL3Byb3RvX2NvbW1vbi5jOjc4CiMzICAweDA4MDVjNmFlIGluIHNwX3JlY3YgKGN0eD0w eDI4NGViMTAwLCBkYXRhPTB4YmYzZjhmNDcgIioiLCBzaXplPTUpCiAgICBhdCAvdXNyL3NyYy9z YmluL2hhc3RkL3Byb3RvX3NvY2tldHBhaXIuYzoxNzcKIzQgIDB4MDgwNWJkZjEgaW4gcHJvdG9f cmVjdiAoY29ubj0weDI4NGViMGYwLCBkYXRhPTB4YmYzZjhmNDcsIHNpemU9NSkgYXQgL3Vzci9z cmMvc2Jpbi9oYXN0ZC9wcm90by5jOjE5OAojNSAgMHgwODA0ZGRhZSBpbiBoYXN0X3Byb3RvX3Jl Y3ZfaGRyIChjb25uPTB4Mjg0ZWIwZjAsIG52cD0weGJmM2Y4ZjgwKSBhdCAvdXNyL3NyYy9zYmlu L2hhc3RkL2hhc3RfcHJvdG8uYzoyOTgKIzYgIDB4MDgwNGNlMjcgaW4gY3RybF90aHJlYWQgKGFy Zz0weDI4NGNhYjAwKSBhdCAvdXNyL3NyYy9zYmluL2hhc3RkL2NvbnRyb2wuYzozNzMKIzcgIDB4 MjgyMzQyOGYgaW4gcHRocmVhZF9nZXRwcmlvICgpIGZyb20gL2xpYi9saWJ0aHIuc28uMwojOCAg MHgwMDAwMDAwMCBpbiA/PyAoKQo= --=-=-=--
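
The keepalive workaround described in the message is a system-wide sysctl. As a rough illustration of the same idea applied to a single socket, here is a minimal sketch. enable_keepalive() is a hypothetical helper, not a hastd function, and the per-socket idle time is only set where the platform defines TCP_KEEPIDLE, which not every system does.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper (an assumption for illustration, not part of hastd):
 * turn on TCP keepalive probes for one socket so that a dead peer is
 * eventually reported as an error on the next send or receive.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
int
enable_keepalive(int fd, int idle_secs)
{
	int on = 1;

	/* Ask the kernel to probe the peer when the connection is idle. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
		return (-1);
#ifdef TCP_KEEPIDLE
	/* Idle time in seconds before the first probe, where supported. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs,
	    sizeof(idle_secs)) == -1)
		return (-1);
#else
	/* No per-socket knob here; the system-wide default applies. */
	(void)idle_secs;
#endif
	return (0);
}
```

A caller would invoke something like enable_keepalive(fd, 300) once on each connected socket. On its own this only shortens the time until the kernel reports the broken connection; a thread blocked waiting on a condition variable still has to be woken up by whichever thread sees the error.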