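【Problem Description】
In an environment where 4 replicas are distributed across two data centers, the primary node needs to be switched from one center to the other. Transactional operations were not stopped at the time, and reelect() failed with a timeout error.
The output below shows error -13 (timeout) being returned when a re-election is run on group2.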
db.getRG("group2").reelect({Seconds:60})
(shell):1 uncaught exception: -13
Timeout error
Takes 60.087692s.
Does reelect() require transactional operations to be stopped first?
Coordinator node log:
2019-06-05-05.38.07.968651 Level:ERROR
PID:1523 TID:50811
Function:doOnGroups Line:280
File:SequoiaDB/engine/coord/coordOperator.cpp
Message:
Failed to execute command[2004] on node[{ GroupID:1001, NodeID:1004, ServiceID:2(SHARD) }], rc: -13
2019-06-05-05.38.07.968698 Level:ERROR
PID:1523 TID:50811
Function:_executeOnGroups Line:294
File:SequoiaDB/engine/coord/coordCommandBase.cpp
Message:
Do command[2004] on groups failed, rc: -13
2019-06-05-05.38.07.968715 Level:ERROR
PID:1523 TID:50811
Function:execute Line:2650
File:SequoiaDB/engine/coord/coordCommandNode.cpp
Message:
Failed to execute on group[group2], rc: -13
2019-06-05-05.38.07.968720 Level:ERROR
PID:1523 TID:50811
Function:_onQueryReqMsg Line:1786
File:SequoiaDB/engine/pmd/pmdProcessor.cpp
Message:
Execute operator[reelect] failed, rc: -13
2019-06-05-05.38.07.968728 Level:ERROR
PID:1523 TID:50811
Function:processMsg Line:1869
File:SequoiaDB/engine/pmd/pmdProcessor.cpp
Message:
Error processing Agent request, rc=-13
2019-06-05-05.38.07.968758 Level:WARNING
PID:1523 TID:50811
Function:_onMsgEnd Line:334
File:SequoiaDB/engine/pmd/pmdSession.cpp
Message:
Session[127.0.0.1:58768] process msg[opCode=2004, len: 136, TID: 50811, requestID: 315] failed, rc: -13
Log of the current primary node:
2019-06-05-05.38.07.311799 Level:ERROR
PID:1619 TID:47624
Function:run Line:140
File:SequoiaDB/engine/cls/clsReelection.cpp
Message:
reelection is out of time
2019-06-05-05.38.07.313671 Level:ERROR
PID:1619 TID:47624
Function:reelect Line:1246
File:SequoiaDB/engine/cls/clsReplicateSet.cpp
Message:
failed to reelect:-13
2019-06-05-05.38.07.315833 Level:ERROR
PID:1619 TID:47624
Function:doit Line:522
File:SequoiaDB/engine/cls/clsCommand.cpp
Message:
failed to reelect:-13
2019-06-05-05.38.07.317062 Level:ERROR
PID:1619 TID:47624
Function:rtnRunCommand Line:1585
File:SequoiaDB/engine/rtn/rtn.cpp
Message:
run command[reelect] failed[rc=-13]
2019-06-05-05.38.07.318500 Level:ERROR
PID:1619 TID:47624
Function:_onQueryReqMsg Line:1629
File:SequoiaDB/engine/cls/clsShardSession.cpp
Message:
Run command[reelect] failed, rc: -13
2019-06-05-05.38.07.319501 Level:ERROR
PID:1619 TID:47624
Function:_onOPMsg Line:676
File:SequoiaDB/engine/cls/clsShardSession.cpp
Message:
Session[Type:Shard,NetID:1,R-TID:50811,R-IP:192.168.64.128,R-Port:11810] process OP[type:2004] failed[rc:-13]
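【Solution】
1. The failure is unrelated to transactional operations; the only requirement is that all write operations have completed.
2. When reelect() switches the primary node, it requires that at least one secondary node has an LSN consistent with that of the old primary, and that all write operations have completed.
3. While reelect() is running, new requests are blocked, but if the requests already in progress cannot be processed within the specified time, a timeout error may be reported. According to the logs, this run timed out while waiting for write operations to complete.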
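The two conditions above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not SequoiaDB's own logic: `canReelect` and `reelectWithRetry` are hypothetical helper names, the `{ name, lsn }` node objects are a made-up shape that the caller would fill in from whatever monitoring source is available, and retrying with a longer `Seconds` value is one possible way to give in-flight writes more time to finish, not a documented recommendation.

```javascript
// Hypothetical helper, not part of the SequoiaDB API.
// `primary` and each entry of `secondaries` are plain objects of the
// assumed form { name: string, lsn: number }.
function canReelect(primary, secondaries) {
  // Point 2: at least one secondary must have caught up to the primary's LSN.
  return secondaries.some(function (node) {
    return node.lsn >= primary.lsn;
  });
}

// Hypothetical retry wrapper. `doReelect` is supplied by the caller and
// performs the re-election with a timeout in seconds, for example:
//   function (s) { db.getRG("group2").reelect({ Seconds: s }); }
// It is expected to throw on failure, as the shell does with -13 above.
function reelectWithRetry(doReelect, timeouts) {
  var lastError = null;
  for (var i = 0; i < timeouts.length; i++) {
    try {
      doReelect(timeouts[i]);
      return timeouts[i]; // the timeout value that succeeded
    } catch (e) {
      // Point 3: pending writes may simply need longer; remember the
      // failure and try the next (longer) timeout. Any error is retried.
      lastError = e;
    }
  }
  if (lastError === null) {
    throw new Error("no timeout values were supplied");
  }
  throw lastError;
}
```

In the shell, this might be invoked as `reelectWithRetry(function (s) { db.getRG("group2").reelect({ Seconds: s }); }, [60, 120, 300])`, using only the `reelect({Seconds: ...})` call shown in the problem description.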
【References】
LSN description: http://pmr.sequoiadb.com:8181/browse/SEQUOIADB-1511