How to address unclean undeploy exception

Vivek Bhide
Hi,

In one of my operators, I have some big LRU cache objects (which are not transient and hence are checkpointed). When that operator restarts for any reason, I see the 'unclean undeploy' exception in the container logs.
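To illustrate why a non-transient cache makes every checkpoint heavier, here is a minimal sketch in plain Java. It uses `java.io` serialization purely as a stand-in for the platform's checkpoint serializer (an assumption; Apex's own checkpoint mechanism is not shown), and the names `CheckpointSizeDemo`, `CachedState` and `TransientCachedState` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class CheckpointSizeDemo {
    /** Hypothetical operator state: the cache is a normal field, so it is saved with every snapshot. */
    static class CachedState implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, Long> cache = new HashMap<>();
    }

    /** Same state, but the cache is transient, so serialization skips it. */
    static class TransientCachedState implements Serializable {
        private static final long serialVersionUID = 1L;
        transient Map<String, Long> cache = new HashMap<>();
    }

    /** Serializes the object in memory and returns the number of bytes written. */
    static int serializedSize(Serializable state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        CachedState full = new CachedState();
        TransientCachedState slim = new TransientCachedState();
        for (int i = 0; i < 100_000; i++) {
            full.cache.put("key-" + i, (long) i);
            slim.cache.put("key-" + i, (long) i);
        }
        // The snapshot of the non-transient variant grows with the cache; the transient one does not.
        System.out.println("non-transient cache: " + serializedSize(full) + " bytes");
        System.out.println("transient cache:     " + serializedSize(slim) + " bytes");
    }
}
```

Note that a transient cache is not restored after a restart, so this only suits data that can be rebuilt from its source.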

Unfortunately I don't have the stack trace with me, but is there any configuration that can be set to make sure that the container undeploy waits until checkpointing is complete?

I am also curious about how container undeploy and redeploy are handled (they are triggered when any of the upstream operators restarts). I see that the undeploy is often interrupted if it takes a bit longer. Is there any documentation I can refer to in order to understand this in more detail?

Regards
Vivek
Re: How to address unclean undeploy exception

Sandesh Hegde
A managed state operator is preferable for maintaining a large LRU cache.
(In reply to Vivek Bhide's message of Wed, Jul 5, 2017, 7:57 PM.)
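As a rough illustration of the idea behind this suggestion, the sketch below keeps only a bounded "hot set" in memory and spills evicted entries to a separate store, so the in-memory part stays small regardless of how many keys have been seen. This is plain Java and not the Apex Malhar managed state API: `SpillingLruCache` is a hypothetical class, and the `Map` passed in as `store` merely stands in for whatever durable storage a real managed state implementation would use.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache: a bounded in-memory LRU hot set that spills evicted entries to a store. */
class SpillingLruCache {
    private final Map<String, Long> store;          // stand-in for durable storage (assumption)
    private final LinkedHashMap<String, Long> hot;  // access-ordered, bounded hot set

    SpillingLruCache(final int capacity, final Map<String, Long> store) {
        this.store = store;
        // accessOrder = true makes iteration order least-recently-used first.
        this.hot = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                if (size() <= capacity) {
                    return false;
                }
                // Spill the least recently used entry instead of dropping it.
                store.put(eldest.getKey(), eldest.getValue());
                return true;
            }
        };
    }

    /** Increments the counter for a key, reloading it from the store if it was spilled. */
    long increment(String key) {
        Long current = hot.get(key);
        if (current == null) {
            current = store.remove(key);
        }
        long next = (current == null ? 0L : current) + 1;
        hot.put(key, next);
        return next;
    }

    /** Number of entries currently held in memory. */
    int hotSize() {
        return hot.size();
    }

    public static void main(String[] args) {
        Map<String, Long> store = new HashMap<>();
        SpillingLruCache cache = new SpillingLruCache(2, store);
        cache.increment("a");
        cache.increment("b");
        cache.increment("c");                      // "a" is spilled to the store
        System.out.println(cache.increment("a"));  // reloaded from the store, prints 2
        System.out.println(cache.hotSize());       // prints 2
    }
}
```

With this shape, only the small hot set would need to travel with an operator's checkpointed state, while the bulk of the data lives in the store.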
Re: How to address unclean undeploy exception

Vivek Bhide
Below is the stack trace. Also, can you please point me to some sample operators that use managed state, or to general guidelines on how to use it?

Regards
Vivek

2017-07-06 18:02:05,006 ERROR engine.StreamingContainer (StreamingContainer.java:run(1456)) - Operator set [OperatorDeployInfo[id=7,name=usageCountCalculator,type=GENERIC,checkpoint={ffffffffffffffff, 0, 0},inputs=[OperatorDeployInfo.InputDeployInfo[portName=inputPort,streamId=sendToAccessCounter,sourceNodeId=6,sourcePortName=accessCountPort,locality=<null>,partitionMask=0,partitionKeys=<null>]],outputs=[OperatorDeployInfo.OutputDeployInfo[portName=outputPort,streamId=sinkToHdfs,bufferServer=brdn2204.target.com]]]] stopped running due to an exception.
com.datatorrent.netlet.NetletThrowable$NetletRuntimeException: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
        at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:343)
        at com.datatorrent.netlet.AbstractClient$1.offer(AbstractClient.java:333)
        at com.datatorrent.netlet.AbstractClient.send(AbstractClient.java:279)
        at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:236)
        at com.datatorrent.netlet.AbstractLengthPrependerClient.write(AbstractLengthPrependerClient.java:190)
        at com.datatorrent.stram.stream.BufferServerPublisher.put(BufferServerPublisher.java:164)
        at com.datatorrent.stram.engine.GenericNode.run(GenericNode.java:469)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1428)
Caused by: java.lang.UnsupportedOperationException: Client does not own the socket any longer!
        ... 8 more
2017-07-06 18:02:05,020 WARN  ipc.Client (Client.java:call(1460)) - interrupted waiting to send rpc request to server
java.lang.InterruptedException
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1092)
        at org.apache.hadoop.ipc.Client.call(Client.java:1455)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:241)
        at com.sun.proxy.$Proxy12.reportError(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.datatorrent.stram.RecoverableRpcProxy.invoke(RecoverableRpcProxy.java:157)
        at com.sun.proxy.$Proxy12.reportError(Unknown Source)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1459)
2017-07-06 18:02:05,021 WARN  stram.RecoverableRpcProxy (RecoverableRpcProxy.java:invoke(168)) - RPC failure, will retry after 10000 ms (remaining 29998 ms)
java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.ipc.Client.call(Client.java:1461)
        at org.apache.hadoop.ipc.Client.call(Client.java:1396)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:241)
        at com.sun.proxy.$Proxy12.reportError(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.datatorrent.stram.RecoverableRpcProxy.invoke(RecoverableRpcProxy.java:157)
        at com.sun.proxy.$Proxy12.reportError(Unknown Source)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1459)
Caused by: java.lang.InterruptedException
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
        at java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1092)
        at org.apache.hadoop.ipc.Client.call(Client.java:1455)
        ... 10 more
2017-07-06 18:02:05,022 WARN  engine.StreamingContainer (StreamingContainer.java:teardownNode(1372)) - node 7/usageCountCalculator took longer to exit, resulting in unclean undeploy!
2017-07-06 18:02:07,590 INFO  server.Server (Server.java:onMessage(599)) - Received subscriber request: SubscribeRequestTuple{version=1.0, identifier=tcp://brdn2204.target.com:40013/7.outputPort.1, windowId=595ec0d8000000b3, type=sinkToHdfs/8.input, upstreamIdentifier=7.outputPort.1, mask=0, partitions=null, bufferSize=1024}
2017-07-06 18:02:07,606 INFO  engine.StreamingContainer (StreamingContainer.java:processHeartbeatResponse(825)) - Deploy request: [OperatorDeployInfo[id=7,name=usageCountCalculator,type=GENERIC,checkpoint={595ec0d8000000b3, 0, 0},inputs=[OperatorDeployInfo.InputDeployInfo[portName=inputPort,streamId=sendToAccessCounter,sourceNodeId=6,sourcePortName=accessCountPort,locality=<null>,partitionMask=0,partitionKeys=<null>]],outputs=[OperatorDeployInfo.OutputDeployInfo[portName=outputPort,streamId=sinkToHdfs,bufferServer=brdn2204.target.com]]]]
2017-07-06 18:02:08,058 INFO  server.Server (Server.java:onMessage(555)) - Received publisher request: PublishRequestTuple{version=1.0, identifier=7.outputPort.1, windowId=595ec0d8000000