Skip to content

Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" [JIRA: RIAK-3118] #218

@bashoshaun

Description

@bashoshaun

A customer occasionally runs into a problem where their node will not start back up after it's been shutdown to perform a backup.

On startup, they see this kind of output in console.log:

2016-12-11 01:14:47.674 [error] <0.1611.0>@riak_kv_vnode:init:512 Failed to start riak_kv_eleveldb_backend backend for index 1381575766539369164864420818427111292018498732032 error: {db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}
2016-12-11 01:14:47.678 [notice] <0.1611.0>@riak:stop:43 "backend module failed to start."
2016-12-11 01:14:47.678 [error] <0.1611.0> gen_fsm <0.1611.0> in state started terminated with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184
2016-12-11 01:14:47.678 [error] <0.1611.0> CRASH REPORT Process <0.1611.0> with 1 neighbours exited with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in gen_fsm:terminate/7 line 600
2016-12-11 01:14:47.679 [error] <0.163.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.1611.0> exit with reason no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in context child_terminated
2016-12-11 01:14:47.679 [error] <0.160.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.1585.0> exit with reason {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1184}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,908}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},{gen_fsm,sync_send_event,[<0.1611.0>,wait_for_init,...]}} in context child_terminated
2016-12-11 01:14:47.683 [info] <0.500.0>@riak_kv_app:prep_stop:232 Stopping application riak_kv - marked service down.

I've attached the full console.log file for context.

The customer has also implemented some basic debugging in their startup script as it was initially thought a process may have been holding a lock on these
console.log.txt

files. This is an example of what they looked like when the node was up and running normally:

lsof:

...
beam.smp  177432           riak  mem       REG              253,8      32526336 1744834583 /data/riak/leveldb/1267395951122892374379757940871151681107879002112/000369.log
beam.smp  177432           riak  mem       REG              253,8      27365376 1879049262 /data/riak/leveldb/68507889249886074290797726533575766546371837952/000366.log
beam.smp  177432           riak  mem       REG              253,8      27631616  805307404 /data/riak/leveldb/125597796958124469533129165311555572001681702912/000367.log
beam.smp  177432           riak  mem       REG              253,8      32116736 1476398089 /data/riak/leveldb/1438665674247607560106752257205091097473808596992/000382.log
beam.smp  177432           riak  mem       REG              253,8      32694272 1476398103 /data/riak/leveldb/1039036320289938793410432185759232459286639542272/000382.log
...

After the stop, there are no leveldb entries in the output of lsof. Similarly, there are no beam processes running.

So when Riak restarts, there's no obvious reason why there should be a failure. When the customer then manually attempts a restart after realising the node is down, it comes up fine.

Customer is running RHEL7.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions