concurrency - Wrong process getting killed on other node?
I wrote a simple program ("controller") to run some computation on a separate node ("worker"). The reason is that if the worker node runs out of memory, the controller still works:
    -module(controller).
    -compile(export_all).

    %% Print a message prefixed with the current time.
    p(Msg, Args) -> io:format("~p " ++ Msg, [time() | Args]).

    %% Kill process P after 5 minutes, whatever happened in the meantime.
    progress_monitor(P, N) ->
        timer:sleep(5*60*1000),
        p("killing worker using strategy #~p~n", [N]),
        exit(P, took_to_long).

    start() -> start(1).
    start(Strat) ->
        P = spawn('worker@localhost', worker, start, [Strat, self(), 60000000000]),
        p("starting worker using strategy #~p~n", [Strat]),
        spawn(controller, progress_monitor, [P, Strat]),
        monitor(process, P),
        receive
            {'DOWN', _, _, P, Info} ->
                p("worker using strategy #~p died. reason: ~p~n", [Strat, Info]);
            X ->
                p("got result: ~p~n", [X])
        end,
        case Strat of
            4 -> p("out of strategies. giving up~n", []);
            _ -> timer:sleep(5000), % wait for the node to come back up
                 start(Strat + 1)
        end.
To test it, I deliberately wrote 3 factorial implementations that use lots of memory and crash, and a fourth implementation that uses tail recursion to avoid taking too much space:
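    -module(worker).
    -compile(export_all).

    %% Run the chosen strategy and send the result back to P.
    start(1, P, N) -> P ! factorial1(N);
    start(2, P, N) -> P ! factorial2(N);
    start(3, P, N) -> P ! factorial3(N);
    start(4, P, N) -> P ! factorial4(N, 1).

    %% Body-recursive: the stack grows with N.
    factorial1(0) -> 1;
    factorial1(N) -> N*factorial1(N-1).

    %% Same as factorial1, written with a case expression.
    factorial2(N) ->
        case N of
            0 -> 1;
            _ -> N*factorial2(N-1)
        end.

    %% Builds the whole list 1..N before folding over it.
    factorial3(N) -> lists:foldl(fun(X, Y) -> X*Y end, 1, lists:seq(1, N)).

    %% Tail-recursive with an accumulator.
    factorial4(0, A) -> A;
    factorial4(N, A) -> factorial4(N-1, A*N).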
Note that even with the tail-recursive version, factorial4, I'm calling it with 60000000000, which would take days on my machine.

Here is the output of running the controller:
    $ erl -sname 'controller@localhost'
    Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

    Eshell V5.10.1  (abort with ^G)
    (controller@localhost)1> c(worker).
    {ok,worker}
    (controller@localhost)2> c(controller).
    {ok,controller}
    (controller@localhost)3> controller:start().
    {23,24,28} starting worker using strategy #1
    {23,25,13} worker using strategy #1 died. reason: noconnection
    {23,25,18} starting worker using strategy #2
    {23,26,2} worker using strategy #2 died. reason: noconnection
    {23,26,7} starting worker using strategy #3
    {23,26,40} worker using strategy #3 died. reason: noconnection
    {23,26,45} starting worker using strategy #4
    {23,29,28} killing worker using strategy #1
    {23,29,29} worker using strategy #4 died. reason: took_to_long
    {23,29,29} out of strategies. giving up
    ok
It almost works, but worker #4 was killed too early (it should have been close to 23:31:45, not 23:29:29). Looking deeper, only worker #1 was attempted to be killed, and no others. So worker #4 should not have died, yet it did. Why? We can even see that the reason was took_to_long, and that progress_monitor #1 started at 23:24:28, about 5 minutes before 23:29:29. So it looks like progress_monitor #1 killed worker #4 instead of worker #1. Why did it kill the wrong process?
Here is the output of the worker when I ran the controller:
    $ while true; do erl -sname 'worker@localhost'; done
    Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

    Eshell V5.10.1  (abort with ^G)
    (worker@localhost)1>
    Crash dump was written to: erl_crash.dump
    eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
    Aborted
    Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

    Eshell V5.10.1  (abort with ^G)
    (worker@localhost)1>
    Crash dump was written to: erl_crash.dump
    eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
    Aborted
    Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

    Eshell V5.10.1  (abort with ^G)
    (worker@localhost)1>
    Crash dump was written to: erl_crash.dump
    eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
    Aborted
    Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

    Eshell V5.10.1  (abort with ^G)
    (worker@localhost)1>
There are several issues, and you eventually experienced creation number wrap-around.
Since you do not cancel the progress_monitor process, it will always send an exit signal after 5 minutes.
The computation is long and/or the VM is slow, hence process 4 is still running 5 minutes after the progress monitor for process 1 was started.
The 4 worker nodes were started sequentially with the same name worker@localhost, and the creation numbers of the first and the fourth node are the same.
Creation numbers (the creation field in references and pids) are a mechanism to prevent pids and references created by a crashed node from being interpreted by a new node with the same name. This is exactly what you expect in your code: when you try to kill worker 1 after its node is long gone, you don't intend to kill a process in a restarted node.
When a node sends a pid or a reference, it encodes its creation number. When it receives a pid or a reference from another node, it checks that the creation number in the pid matches its own creation number. The creation number is attributed by epmd following the 1, 2, 3 sequence.
Here, unfortunately, when the 4th node gets the exit message, the creation number matches because the sequence has wrapped. Since each node spawns a process after doing the exact same thing before (initializing Erlang), the pid of the worker on node 4 matches the pid of the worker on node 1.
As a result, the controller eventually kills worker 4 believing it is worker 1.
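The wrap-around can be illustrated with a small model. This is only a sketch, not how the runtime represents pids: it assumes the creation number cycles through 1, 2, 3 as described above, and it models a remote pid as a hypothetical tuple {Node, Creation, LocalId}. The module name creation_wrap and the local id 42 are made up for the illustration.

```erlang
-module(creation_wrap).
-export([creation/1, worker_pid/1, demo/0]).

%% Assumption taken from the explanation above: the creation number
%% handed to the Nth start of the node cycles through 1, 2, 3.
creation(Start) -> ((Start - 1) rem 3) + 1.

%% Hypothetical model of a worker pid as seen from the controller.
%% The local id (42 here, an arbitrary value) is the same on every
%% start because each fresh node goes through the same boot sequence.
worker_pid(Start) ->
    {'worker@localhost', creation(Start), 42}.

%% Compare the pid of worker 1 with the pids of workers 2, 3 and 4.
demo() ->
    [P1, P2, P3, P4] = [worker_pid(N) || N <- lists:seq(1, 4)],
    [P1 =:= P2, P1 =:= P3, P1 =:= P4].
```

Under these assumptions, creation_wrap:demo() returns [false,false,true]: the pids of workers 2 and 3 differ from worker 1 through their creation number, while the pid of worker 4 is indistinguishable from it.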
To avoid this, you will need something more robust than the creation number if there can be 4 workers within the lifespan of a pid or a reference in the controller.
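Independently of the creation number, one way to avoid sending a stale exit signal is to not keep a separate progress monitor alive once the worker is gone. The sketch below is one possible approach, not the original code: it puts the timeout in the same receive that waits for the result, so an exit signal is only sent while the controller is still waiting for that particular worker. The module name controller_timeout and the function run/2 are hypothetical, and the worker is spawned locally to keep the example self-contained; in the controller above it would be spawned on 'worker@localhost' and monitored as before.

```erlang
-module(controller_timeout).
-export([run/2]).

%% Run Fun in a monitored process and wait at most TimeoutMs milliseconds.
%% Returns {ok, Result}, {error, Reason} if the process died first,
%% or {error, timeout} after killing a process that ran for too long.
run(Fun, TimeoutMs) ->
    Parent = self(),
    Tag = make_ref(),  % unique tag so unrelated messages are not mistaken for results
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    receive
        {Tag, Result} ->
            %% Done in time: drop the monitor and any pending 'DOWN' message.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The process died; no exit signal will be sent to it later.
            {error, Reason}
    after TimeoutMs ->
        %% Only here, while still waiting for this process, is it killed.
        erlang:demonitor(MRef, [flush]),
        exit(Pid, took_too_long),
        {error, timeout}
    end.
```

For example, controller_timeout:run(fun() -> 1*2*3 end, 1000) returns {ok,6}, while a function that sleeps longer than the timeout yields {error,timeout}. Because the timeout is tied to the receive, a worker that has already been reported as down is never targeted again.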