How the application recovery works when its started with -originalAppId

Vivek Bhide
Hi All,

I have implemented an LRUCache in one of our operators, and this cache is not transient. The LRUCache is a simple extension of LinkedHashMap with its removeEldestEntry method overridden. We found that the default Kryo serializer, which Apex uses for checkpointing, doesn't work properly for LinkedHashMap. I tried @Map and some other serializer classes and found that the only serializer that works is the default Java serializer. As a result, we explicitly set the serializer for this LRUCache to JavaSerializer (using @Bind at the variable declaration).

This resolved the serialization issue, but now application recovery fails when the application is killed and restarted using -originalAppId through the Apex CLI. We get an error for the LRUCache class while restoring the operator:

com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
Serialization trace:
callerContextCache (com.tgt.dqs.datausageingest.operator.UsageCountCalculatorOperator)
        at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:47)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)

Caused by: java.lang.ClassNotFoundException: com.tgt.dqs.datausageingest.common.LRUCache
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:348)
        at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:626)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:45)

I checked the apa content (using get-app-package-info from the CLI) and could see that the classpath is set to lib/*.jar, so I changed appPackage.xml to include the application jar in lib, but it made no difference.

How does application recovery really work in Apex, and what can be done to resolve this issue?

Regards
Vivek

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
It will try to deserialize the old state (from the prior application's checkpoints) with the new jars from the apa that you are trying to launch. So if there are structural incompatibilities, the deserialization will fail.

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
Hi Pramod,

I get this error even when I try to resubmit the exact same apa. Is there any other angle of this problem that I should look at?

Regards
Vivek

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
Digging in a little, it looks like the Java serializer is not using the same class loader that is specified for Kryo and that the rest of Kryo uses. Let me see what other possibilities there are.

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
What is the error you are getting with the default serialization?

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
Hi Pramod,

As I mentioned, we have an LRUCache (a LinkedHashMap of <String, String>) which needs to be serialized, and it is initialized in the operator constructor. We found that when the operator is serialized for checkpointing, the content of this LRUCache is not serialized; instead, just an empty LinkedHashMap is serialized. We verified this by implementing CheckpointNotificationListener in the operator and logging the state of the cache. Note that Kryo works well with HashMap but not with LinkedHashMap, which is weird.

So after looking for alternatives for proper serialization, we found that the default JavaSerializer does the job, and hence the cache is now set to be serialized with JavaSerializer as below:

@Bind(value=JavaSerializer.class)
public LRUCache<String, String> callerContextCache;

But now, while verifying that the operator recovers with the correct state by killing the existing Apex instance and restarting it with -originalAppId, we found that deserialization fails.

As mentioned in my original post, I tried using @Map from Kryo and also changing the apa packaging to put the application jar in 'lib' (since get-app-package-info showed the classpath to be lib/*.jar, and with the original packaging assembly the application jar ends up under 'app'), but it still made no difference.

Regards
Vivek
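The observation that plain Java serialization preserves the cache can be checked outside Apex with JDK streams alone. The sketch below is only an illustration under assumptions: the LRUCache is rewritten from the description in this thread, and JavaSerdeRoundTrip and its roundTrip helper are hypothetical names, not part of Apex or Kryo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

class JavaSerdeRoundTrip {
  // Hypothetical helper: writes the cache to a byte array and reads it back
  // using only java.io object streams (no Kryo involved).
  @SuppressWarnings("unchecked")
  static LRUCache<String, String> roundTrip(LRUCache<String, String> cache) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(cache);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (LRUCache<String, String>) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("round trip failed", e);
    }
  }

  public static void main(String[] args) {
    LRUCache<String, String> cache = new LRUCache<>(10, false);
    cache.put("ABC", "ABCDE");
    cache.put("GHI", "GHIJK");
    LRUCache<String, String> restored = roundTrip(cache);
    System.out.println(restored + " capacity=" + restored.getCapacity());
  }
}

// Assumed shape of the cache, following the description in this thread.
class LRUCache<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private int capacity; // maximum number of entries kept

  LRUCache() {
    super();
  }

  LRUCache(int capacity, boolean accessOrder) {
    super(capacity + 1, 1.0f, accessOrder);
    this.capacity = capacity;
  }

  int getCapacity() {
    return capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity;
  }
}
```

Both the entries and the capacity field come back after the round trip, which is consistent with JavaSerializer "doing the job" for this class.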

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
I have already pasted the stack trace in my original post. Also, can you please confirm which classpath Kryo uses versus the default Java serializer, and where exactly it is set for Kryo?

Regards
Vivek

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
It doesn't look like a classpath issue but rather an issue with class loading. Since an app is a self-contained unit and multiple apps can be launched from a single Apex CLI session, Apex uses a separate classloader for each app and provides this classloader to Kryo to instantiate the classes while deserializing the prior state. Kryo uses this classloader for the most part, but when part of the deserialization is handed off to Java, Java does not use the same classloader and hence fails to find the class. This is not the best method, but you can create a custom Java serializer by copying the code of Kryo's JavaSerializer class and changing the read method: where it creates an ObjectInputStream (line 42), create an instance of a class that extends ObjectInputStream and overrides the resolveClass method to return a class loaded with the current thread's classloader, which can be obtained by calling Thread.currentThread().getContextClassLoader().

Thanks
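The resolveClass override described above can be sketched with plain JDK classes. This is a minimal illustration, not Kryo's or Apex's code: ContextLoaderObjectInputStream and ContextLoaderSerdeDemo are hypothetical names, and the fallback to super.resolveClass is an added assumption so that primitive type descriptors and a missing context loader are still handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.util.ArrayList;

class ContextLoaderSerdeDemo {
  // Hypothetical helper: serialize with a plain ObjectOutputStream, then
  // deserialize through the context-loader-aware stream defined below.
  static Object roundTrip(Serializable value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in = new ContextLoaderObjectInputStream(
          new ByteArrayInputStream(bytes.toByteArray()))) {
        return in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("round trip failed", e);
    }
  }

  public static void main(String[] args) {
    ArrayList<String> original = new ArrayList<>();
    original.add("ABC");
    original.add("GHI");
    System.out.println(roundTrip(original));
  }
}

// Resolves classes through the current thread's context class loader
// instead of ObjectInputStream's default loader lookup.
class ContextLoaderObjectInputStream extends ObjectInputStream {
  ContextLoaderObjectInputStream(InputStream in) throws IOException {
    super(in);
  }

  @Override
  protected Class<?> resolveClass(ObjectStreamClass desc)
      throws IOException, ClassNotFoundException {
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    if (loader != null) {
      try {
        return Class.forName(desc.getName(), false, loader);
      } catch (ClassNotFoundException e) {
        // Fall through: primitives and anything the context loader
        // cannot see are left to the default resolution.
      }
    }
    return super.resolveClass(desc);
  }
}
```

In a copy of Kryo's JavaSerializer, the read method would construct a stream like this in place of the plain ObjectInputStream.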

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
Thanks, Pramod. This seems to have done the trick. I will check again when I have some data to process to see if it goes well. I am quite confident that it will.

Just curious: is this the best way to handle this issue, or is there a more elegant way to address it?

Regards
Vivek

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
I would dig deeper into why serde of the LinkedHashMap is failing. There is additional logging you can enable in Kryo to get more insight. You can even try a standalone Kryo test to see whether the problem is with the LinkedHashMap itself or with some other object that was added to it. You could also try a newer version of Kryo to check whether the serde works there because some bug was fixed. Once you have more insight into the cause, we will be in a better position to determine the best approach.

Thanks

Re: How the application recovery works when its started with -originalAppId

Thomas Weise-2
There are a couple of bugs that were recently identified that look related to this:


Perhaps the fix for the first item is what you need?

Thomas


Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
I believe those relate to different problems. This is a scenario where part of the deserialization is being outsourced to an external deserializer that is not using the correct class loader. The suggested fix to the behavior of the external serde seems to have worked, though I plan to follow up with Kryo on this issue.

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
Vivek,

Also a slightly more portable modification to what I suggested earlier is to use kryo.getClassLoader() instead of Thread.currentThread().getContextClassLoader() in the JavaSerializer.

Thanks
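A variant of the same idea takes the class loader as a constructor argument instead of reading it from the thread. In a custom Kryo serializer the caller would pass kryo.getClassLoader(), as suggested above; the sketch below stays JDK-only, so the loader is simply whatever the caller supplies. LoaderAwareObjectInputStream and LoaderAwareSerdeDemo are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.util.HashMap;

class LoaderAwareSerdeDemo {
  // Hypothetical helper: serialize a value to bytes with plain JDK streams.
  static byte[] toBytes(Serializable value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new IllegalStateException("serialization failed", e);
    }
  }

  // Hypothetical helper: deserialize, resolving classes via the given loader.
  static Object fromBytes(byte[] data, ClassLoader loader) {
    try (ObjectInputStream in =
        new LoaderAwareObjectInputStream(new ByteArrayInputStream(data), loader)) {
      return in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("deserialization failed", e);
    }
  }

  public static void main(String[] args) {
    HashMap<String, String> state = new HashMap<>();
    state.put("ABC", "ABCDE");
    byte[] data = toBytes(state);
    System.out.println(fromBytes(data, LoaderAwareSerdeDemo.class.getClassLoader()));
  }
}

// Resolves classes through an explicitly supplied class loader.
class LoaderAwareObjectInputStream extends ObjectInputStream {
  private final ClassLoader loader;

  LoaderAwareObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
    super(in);
    this.loader = loader;
  }

  @Override
  protected Class<?> resolveClass(ObjectStreamClass desc)
      throws IOException, ClassNotFoundException {
    try {
      return Class.forName(desc.getName(), false, loader);
    } catch (ClassNotFoundException e) {
      // Primitive descriptors and classes unknown to the supplied loader
      // fall back to the default resolution.
      return super.resolveClass(desc);
    }
  }
}
```

Passing the loader in keeps the stream independent of whichever thread happens to run the deserialization, which is the portability point made above.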

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
Thank You Pramod and Thomas for all your inputs.

Hi Pramod,
Jira https://issues.apache.org/jira/browse/APEXMALHAR-2526, which Thomas referred to, seems to be in line with what you suggested as a possible solution. I see there is a new class, KryoJavaSerializer.java (new in Malhar and not present in the 3.7.0 version that I am using), which does the work, though it is not related to this particular issue.

Regarding my statement that Kryo does not work with LinkedHashMap: to put it precisely, it doesn't work for a class that extends LinkedHashMap. In my case that is the LRUCache class. It does work with the standard LinkedHashMap, and I verified this with the past few versions of Kryo. Below is the class I tested with:

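import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class KryoSerDeTest {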
  public static void main(String[] args) throws FileNotFoundException {
    TestClass clazz = new TestClass();
    clazz.getCache().put("ABC", "ABCDE");
    clazz.getCache().put("GHI", "GHIJK");

    Kryo kryo = new Kryo();
    Output output = new Output(new FileOutputStream("file.bin"));
    kryo.writeObject(output, clazz);
    output.close();
    Input input = new Input(new FileInputStream("file.bin"));
    TestClass clazz1 = kryo.readObject(input, TestClass.class);
    input.close();

    System.out.println(clazz1.getCache().get("ABC"));
    System.out.println(clazz1.getCache().get("GHI"));
  }
}

class TestClass {
 
  LRUCache<String, String> cache;

  public TestClass() {
    cache = new LRUCache<String, String>(10, false);
  }

  public LRUCache<String, String> getCache() {
    return cache;
  }

  public void setCache(LRUCache<String, String> cache) {
    this.cache = cache;
  }
}

class LRUCache<K, V> extends LinkedHashMap<K, V> {

  private static final long serialVersionUID = 1L;
  public int capacity; // Maximum number of items in the cache.

  public int getCapacity() {
    return capacity;
  }

  public void setCapacity(int capacity) {
    this.capacity = capacity;
  }

  public LRUCache() {
    super();
  }

  public LRUCache(int capacity, boolean accessOrder) {
    super(capacity + 1, 1.0f, accessOrder); // Pass 'true' for accessOrder.
    setCapacity(capacity);
  }

  @Override
  public boolean removeEldestEntry(Map.Entry<K, V> entry) {
    return (size() > getCapacity());
  }
}


In the above example, if you replace all references to LRUCache in TestClass with LinkedHashMap, everything works, but not with LRUCache.

I will update my workaround with your suggestion.

Regards
Vivek
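One possible explanation, offered as an assumption and not confirmed anywhere in this thread: if a generic map deserializer creates the LRUCache through its no-arg constructor and re-inserts the entries with put(), without restoring the capacity field first, capacity is still 0 and removeEldestEntry evicts every entry as soon as it is added. The JDK-only sketch below reproduces just that effect; EvictionOnRebuildDemo and its two helpers are hypothetical names, and no Kryo code is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EvictionOnRebuildDemo {
  // Mimics the assumed behavior: no-arg constructor, then put() per entry.
  // capacity keeps its default of 0, so each inserted entry is evicted.
  static LRUCache<String, String> rebuildWithPut(Map<String, String> source) {
    LRUCache<String, String> rebuilt = new LRUCache<>();
    for (Map.Entry<String, String> e : source.entrySet()) {
      rebuilt.put(e.getKey(), e.getValue());
    }
    return rebuilt;
  }

  // Same rebuild, but with capacity restored before the entries go in.
  static LRUCache<String, String> rebuildWithCapacity(Map<String, String> source, int capacity) {
    LRUCache<String, String> rebuilt = new LRUCache<>();
    rebuilt.setCapacity(capacity);
    for (Map.Entry<String, String> e : source.entrySet()) {
      rebuilt.put(e.getKey(), e.getValue());
    }
    return rebuilt;
  }

  public static void main(String[] args) {
    Map<String, String> source = new LinkedHashMap<>();
    source.put("ABC", "ABCDE");
    source.put("GHI", "GHIJK");
    System.out.println("no capacity restored: " + rebuildWithPut(source));
    System.out.println("capacity restored:    " + rebuildWithCapacity(source, 10));
  }
}

// Same shape as the LRUCache posted above.
class LRUCache<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;
  private int capacity; // maximum number of entries kept

  LRUCache() {
    super();
  }

  LRUCache(int capacity, boolean accessOrder) {
    super(capacity + 1, 1.0f, accessOrder);
    this.capacity = capacity;
  }

  int getCapacity() {
    return capacity;
  }

  void setCapacity(int capacity) {
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity;
  }
}
```

If this is what happens, it would also explain why a plain LinkedHashMap survives the same test: its removeEldestEntry never evicts. Whether Kryo's default map handling actually follows this path for the version in use would need to be confirmed with a standalone test.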

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
Can you share the code for your class that extends LinkedHashMap?

Thanks

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
Please refer to the LRUCache class in the code I pasted above. It's exactly what I am using.

Regards
Vivek

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
Provide an empty constructor as well.

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
It is present in the class, just after the setCapacity method.

Re: How the application recovery works when its started with -originalAppId

Pramod Immaneni
My bad let me see..

Re: How the application recovery works when its started with -originalAppId

Vivek Bhide
Hi Pramod and Thomas

Below are my findings so far on this issue:

1. The fix suggested by Pramod and the fix made as a part of https://issues.apache.org/jira/browse/APEXMALHAR-2526 do the same thing.
2. In the comments for https://issues.apache.org/jira/browse/APEXMALHAR-2526 I found that the new class KryoJavaSerializer.java was created to fix the problem, but Kryo has already fixed this issue on their end too (in Dec 2016).
3. When I checked the Kryo version in apex-core, I found that it expects version 2.24, which is quite old; the latest Kryo version (4.0.1) includes the fix.
4. So in the end, the only changes I needed were to update the Kryo version to the latest and continue using JavaSerializer. After implementing these, the issue with application recovery was resolved.

I see that story https://issues.apache.org/jira/browse/APEXCORE-768 has already been created to update the Kryo version and remove the class KryoJavaSerializer.java.

Again, this still doesn't answer the question of why Kryo does not serialize the custom implementation of LinkedHashMap in the first place with its default serialization.

Regards
Vivek