I've recently been running a SequoiaDB cluster (3.0), and after it has been up for a while the status of the nodes can no longer be retrieved. SSHing into the server occasionally reports "too many open files", so I suspected a ulimit problem, but I had already applied the Linux settings recommended on the website. The OS is CentOS 6.4. Using lsof | awk 'NR>1 {++S[$2]} END { for(a in S) {print a,"\t",S[a]}}'|sort -n -k 2|tail -n 1 to find the process with the most open files system-wide, I saw that sdbcm already has 59999 open, right at the process's configured limit:
sdbadmin 3880 1 0 Apr04 ? 00:46:48 sdbcm(11790)
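The lsof-based count can be cross-checked by reading /proc directly: each entry under /proc/&lt;pid&gt;/fd is one open descriptor, so counting them gives the same number without scanning the whole lsof table (3880 is the sdbcm PID from the ps output above):

```shell
# Count open file descriptors of sdbcm (PID 3880) straight from /proc.
# Each symlink in /proc/<pid>/fd corresponds to one open descriptor.
ls /proc/3880/fd 2>/dev/null | wc -l
```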
Checking the process's limits with cat /proc/3880/limits shows:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 10485760 unlimited bytes
Max core file size 0 0 bytes
Max resident set unlimited unlimited bytes
Max processes 514852 514852 processes
Max open files 60000 60000 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 514852 514852 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
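The "Max open files" cap above is inherited from the session that started sdbcm. Raising it only postpones exhaustion if the leak itself persists, but as a stopgap the per-user limit can be raised via limits.conf (the values below are examples I chose, not SequoiaDB's recommendation):

```shell
# Raise the open-file limit for the sdbadmin user. Add lines like these
# to /etc/security/limits.conf (example values, not an official
# recommendation), then restart sdbcm from a FRESH sdbadmin login so
# the new limit is inherited by the process:
#   sdbadmin  soft  nofile  655360
#   sdbadmin  hard  nofile  655360
# Verify in the new session:
ulimit -n
```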
Listing the descriptors with lsof -p 3880, they are all "can't identify protocol":
sdbcm 3880 sdbadmin *274u sock 0,6 0t0 5163452 can't identify protocol
sdbcm 3880 sdbadmin *275u sock 0,6 0t0 5163500 can't identify protocol
sdbcm 3880 sdbadmin *276u sock 0,6 0t0 5163576 can't identify protocol
sdbcm 3880 sdbadmin *277u sock 0,6 0t0 5163790 can't identify protocol
sdbcm 3880 sdbadmin *278u sock 0,6 0t0 5163824 can't identify protocol
sdbcm 3880 sdbadmin *279u sock 0,6 0t0 5163924 can't identify protocol
sdbcm 3880 sdbadmin *280u sock 0,6 0t0 5164106 can't identify protocol
sdbcm 3880 sdbadmin *281u sock 0,6 0t0 5164324 can't identify protocol
sdbcm 3880 sdbadmin *282u sock 0,6 0t0 5164487 can't identify protocol
sdbcm 3880 sdbadmin *283u sock 0,6 0t0 5164516 can't identify protocol
sdbcm 3880 sdbadmin *284u sock 0,6 0t0 5164614 can't identify protocol
sdbcm 3880 sdbadmin *285u sock 0,6 0t0 5164675 can't identify protocol
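As I understand it, "can't identify protocol" means lsof found a socket fd whose inode no longer appears in any /proc/net table (tcp, udp, unix, ...), which is the usual signature of leaked, never-connected or half-torn-down sockets. A rough way to count such orphans without lsof (approximate: it matches the inode as a bare word, which can occasionally collide with other numeric fields):

```shell
# Count socket fds of a process whose inodes are absent from the
# /proc/net tables -- i.e. sockets lsof would report as
# "can't identify protocol". Approximate word-match on the inode.
count_orphan_sockets() {
    pid=$1
    orphans=0
    for fd in /proc/"$pid"/fd/*; do
        link=$(readlink "$fd" 2>/dev/null) || continue
        case $link in
        "socket:["*)
            inode=${link#socket:\[}
            inode=${inode%\]}
            # A live socket's inode shows up in some /proc/net table.
            grep -qsw "$inode" /proc/net/tcp /proc/net/tcp6 \
                /proc/net/udp /proc/net/udp6 /proc/net/unix ||
                orphans=$((orphans + 1))
            ;;
        esac
    done
    echo "$orphans"
}
count_orphan_sockets 3880   # sdbcm PID from the ps output above
```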
Tracing the process with strace -p 3880, the following shows up every few seconds:
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, {0, 866362860}) = ? ERESTART_RESTARTBLOCK (To be restarted)
restart_syscall() = ? ERESTART_RESTARTBLOCK (To be restarted)
restart_syscall() = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, {0, 930218339}) = ? ERESTART_RESTARTBLOCK (To be restarted)
restart_syscall() = ? ERESTART_RESTARTBLOCK (To be restarted)
restart_syscall() = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
nanosleep({1, 0}, 0x7ffff5ade720) = 0
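The trace above only covers the one thread strace attached to, and all it does is a one-second nanosleep polling loop; the sockets are most likely being created in another thread. A next step (my own suggestion) is to list the TIDs and then re-attach with -f so every thread is followed:

```shell
# sdbcm is multithreaded: each TID appears as a directory under
# /proc/<pid>/task. Count the threads, then trace all of them at once.
ls /proc/3880/task | wc -l          # number of threads in sdbcm
# Follow every thread, filtering for socket lifecycle syscalls
# (needs ptrace permission; run manually):
#   strace -f -e trace=socket,connect,close -p 3880
```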
That's as far as I've gotten, and now I'm stuck... Has anyone run into something similar? Any help would be appreciated.
Also, I found the following errors in /opt/sequoiadb/conf/log/sdbcm.log:
2019-04-08-12.05.13.989948 Level:EVENT
PID:3880 TID:49479
Function:pmdEDUEntryPoint Line:1980
File:SequoiaDB/engine/pmd/pmdEDUMgr.cpp
Message:
Start thread[49479] for EDU[ID:40392, type:OMAAgent, Name:]
2019-04-08-12.05.14.197672 Level:ERROR
PID:3880 TID:49479
Function:ossGetDiskInfo Line:1162
File:SequoiaDB/engine/oss/ossUtil.cpp
Message:
Failed to statvfs, errno: 13, rc = -10
2019-04-08-12.05.14.209070 Level:EVENT
PID:3880 TID:3890
Function:dispatchMsg Line:938
File:SequoiaDB/engine/pmd/pmdAsyncSession.cpp
Message:
Session[Type:OMAgent,NetID:40381,R-TID:11393,R-IP:10.1.1.66,R-Port:40602] recieved disconnect message
2019-04-08-12.05.15.209452 Level:EVENT
PID:3880 TID:49479
Function:pmdEDUEntryPoint Line:2089
File:SequoiaDB/engine/pmd/pmdEDUMgr.cpp
Message:
Terminating thread[49479] for EDU[ID:40392, Type:OMAAgent, Name: Type:OMAgent,NetID:40381,R-TID:11393,R-IP:10.1.1.66,R-Port:40602]
The messages above keep repeating in a loop...
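errno 13 is EACCES (Permission denied), so statvfs seems to be failing on a path the sdbadmin user cannot access. A quick sketch of my own to find the offending mount point: stat -f exercises the same statfs/statvfs machinery, so running it over every mount as the sdbadmin user should reveal which one triggers the error:

```shell
# Try a statvfs-style query on every mounted filesystem as the current
# user and report the ones that fail. Run this as the sdbadmin user;
# a "Permission denied" mount here would explain the errno 13 in the log.
while read -r dev mnt fstype rest; do
    stat -f "$mnt" >/dev/null 2>&1 ||
        echo "statvfs-style check failed on: $mnt ($fstype, device $dev)"
done < /proc/mounts
```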
I just found that the 2.8.6 release notes mention this, under other improvements:
Fixed System.getDiskInfo reporting error -10 when run on Ubuntu 17
ossGetDiskInfo had a handle leak
But I'm running version 3.0; can this bug really still exist there?
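Until it's confirmed whether the 2.8.6 fix is actually in this 3.0 build, one way to verify the leak is still active is to sample sdbcm's descriptor count over time and watch whether it grows steadily (my own monitoring sketch, not a SequoiaDB tool):

```shell
# Log the open-FD count of sdbcm once per invocation; run from cron
# (e.g. every 5 minutes) and compare samples. A steadily growing count
# confirms the handle leak is still present. 3880 is the sdbcm PID
# from the ps output above; substitute your own.
PID=3880
echo "$(date '+%F %T') pid=$PID open_fds=$(ls /proc/$PID/fd 2>/dev/null | wc -l)"
```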