How to specify more than one fields as a dedup keyExpression

Vivek Bhide
Hi,

I want to specify more than one attribute of a tuple in the keyExpression for
BoundedDedupOperator. I tried writing it in several ways, but none of them
work.

Do BoundedDedup and TimeDedup support multiple fields for deduping? If
yes, how should they be specified in properties.xml?

Regards
Vivek



--
Sent from: http://apache-apex-users-list.78494.x6.nabble.com/

Re: How to specify more than one fields as a dedup keyExpression

Munagala Ramanath-2
Could you provide some details on what types your multiple keys are, what expression variants you tried, and what the result was?

The base class of BoundedDedupOperator is AbstractDeduper; within that class you'll see a method getKey(). You should be able to override it to retrieve the desired fields, combine them in some domain-specific way, and return the result as a Slice.

Ram


Re: How to specify more than one fields as a dedup keyExpression

Vivek Bhide
Thanks, Ram, for your suggestions.

The field types I am trying are basic primitive types. In fact, I was
just playing around with the dedup example that is available in the Malhar git
repository. I added one more field, id1, with a getter and setter to the
'TestEvent' class from the test case and want to dedup on the combination of
both fields.

The operator fails in the activate() method while creating the keyGetter,
which is then used in getKey().

Below are a few of the expressions I tried.
Default value: <value>id</value>
Combinations tried:
<value>id,id1</value>
<value>getId(),getId1()</value>
<value>{$}.getId() &amp;&amp; {$}.getId1()</value>
<value>"getId()","getId1()"</value>
<value>{$}.getId(),{$}.getId1()</value>
<value>{{$}.getId(),{$}.getId1()}</value>

Below is the stack trace of the exception I got most of the time:

2017-10-23 16:48:52,775 [2/Deduper:BoundedDedupOperator] WARN util.LoggerUtil shouldFetchLogFileInformation - Log information is unavailable. To enable log information log4j/logging should be configured with single FileAppender that has immediateFlush set to true and log level set to ERROR or greater.
2017-10-23 16:48:52,775 [2/Deduper:BoundedDedupOperator] ERROR engine.StreamingContainer run - Abandoning deployment of operator OperatorDeployInfo[id=2,name=Deduper,type=GENERIC,checkpoint={ffffffffffffffff, 0, 0},inputs=[OperatorDeployInfo.InputDeployInfo[portName=input,streamId=Generator to Dedup,sourceNodeId=1,sourcePortName=output,locality=<null>,partitionMask=0,partitionKeys=<null>]],outputs=[OperatorDeployInfo.OutputDeployInfo[portName=unique,streamId=Dedup Unique to Console,bufferServer=localhost], OperatorDeployInfo.OutputDeployInfo[portName=duplicate,streamId=Dedup Duplicate to Console,bufferServer=localhost], OperatorDeployInfo.OutputDeployInfo[portName=expired,streamId=Dedup Expired to Console,bufferServer=localhost]]] due to setup failure.
java.lang.RuntimeException: org.codehaus.commons.compiler.CompileException: Line 1, Column 101: ')' expected instead of ','
        at com.datatorrent.lib.util.PojoUtils.compileExpression(PojoUtils.java:778)
        at com.datatorrent.lib.util.PojoUtils.compileExpression(PojoUtils.java:746)
        at com.datatorrent.lib.util.PojoUtils.createGetter(PojoUtils.java:603)
        at com.datatorrent.lib.util.PojoUtils.createGetter(PojoUtils.java:235)
        at com.datatorrent.lib.util.PojoUtils.createGetter(PojoUtils.java:225)
        at org.apache.apex.malhar.lib.dedup.BoundedDedupOperator.activate(BoundedDedupOperator.java:121)
        at com.datatorrent.stram.engine.Node.activate(Node.java:644)
        at com.datatorrent.stram.engine.GenericNode.activate(GenericNode.java:212)
        at com.datatorrent.stram.engine.StreamingContainer.setupNode(StreamingContainer.java:1364)
        at com.datatorrent.stram.engine.StreamingContainer.access$100(StreamingContainer.java:129)
        at com.datatorrent.stram.engine.StreamingContainer$2.run(StreamingContainer.java:1413)




Re: How to specify more than one fields as a dedup keyExpression

Munagala Ramanath-2
It needs to be an expression that combines both (or all) values: try "id + id1"

Ram



Re: How to specify more than one fields as a dedup keyExpression

Vlad Rozov-2
I don't think the Apex expression evaluator is that smart :). Try "{$}.getId() + {$}.getId1()" or provide a getter that returns a pair object.

Thank you,

Vlad


Re: How to specify more than one fields as a dedup keyExpression

Chinmay Kolhatkar
{$.id} + {$.id1} should work.

- Chinmay.


Re: How to specify more than one fields as a dedup keyExpression

Vivek Bhide
Thank you, everyone, for the replies. The solution that Chinmay suggested works, but
I now see another discrepancy.

After adding more than one field as the dedup key, my expectation was that
the dedup decision would be made on the combination of these two keys. I ran the test
case multiple times with BoundedDedupOperator and found that events
are marked as Duplicate, but when I search for the corresponding Unique entry in
sysout, that entry is nowhere to be found. It does not happen for all
entries, but for most of the entries marked as Duplicate.

Is my expectation of the dedup behavior correct, and is this the right way to
validate that it is working as expected?
Dedup_test_case_output.txt
<http://apache-apex-users-list.78494.x6.nabble.com/file/t127/Dedup_test_case_output.txt>  
Sample entries :

Present as Unique and Duplicate :
Duplicate: TestEvent [id=75, id1=64, eventTime=Wed Oct 25 12:09:44 PDT 2017]
Unique: TestEvent [id=75, id1=64, eventTime=Wed Oct 25 12:09:18 PDT 2017]

Only present as Duplicate :
Duplicate: TestEvent [id=23, id1=77, eventTime=Wed Oct 25 12:09:04 PDT 2017]
Duplicate: TestEvent [id=44, id1=63, eventTime=Wed Oct 25 12:09:40 PDT 2017]





Re: How to specify more than one fields as a dedup keyExpression

Munagala Ramanath-2
The mapping: tuple -> dedup-key needs to be 1-1; if multiple tuples are mapped to the same dedup key you'll see problems like this. In your case multiple tuples can be mapped to the same value of "id + id1". For example, all tuples with (id, id1) being any of these pairs will all map to a value of 3: (0, 3), (1, 2), (2, 1), (3, 0).
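
To make the collision concrete, here is a minimal Java sketch. The class and method names are made up for illustration and are not part of Apex or Malhar; it only assumes that the expression evaluates to the integer sum of the two fields, as described above, and groups the four example pairs by that sum:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of Apex or Malhar.
class SumKeyCollision {

    // Mimics what "id + id1" evaluates to when both fields are ints.
    static int sumKey(int id, int id1) {
        return id + id1;
    }

    // Groups (id, id1) pairs by their summed key.
    static Map<Integer, List<String>> groupBySum(int[][] pairs) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (int[] p : pairs) {
            groups.computeIfAbsent(sumKey(p[0], p[1]), k -> new ArrayList<>())
                  .add("(" + p[0] + ", " + p[1] + ")");
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] pairs = {{0, 3}, {1, 2}, {2, 1}, {3, 0}};
        // All four distinct pairs end up under the single key 3.
        System.out.println(groupBySum(pairs));
    }
}
```

The same effect would explain a Duplicate entry with no matching Unique entry: an earlier tuple with a different (id, id1) pair but the same sum would have been the one reported as Unique.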

A simple way to get a unique dedup key is to convert all of your dedup fields to strings and catenate them. So in your event/tuple class, define a getDedupKey() method and within it, compute this string and return it. Then, you can use the expression {$}.getDedupKey(). Something along those lines should work.

Ram
 

Re: How to specify more than one fields as a dedup keyExpression

Vlad Rozov-2
It would be better to return, as the key, an instance of a class that implements hashCode() and equals() instead of a String. Even for a key made of two integers, the benefit over millions of tuples may be significant.
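
A minimal sketch of such a key class, in plain Java. The names DedupKey and DedupKeyDemo are hypothetical, and whether the deduper accepts an arbitrary object as a key (and how it serializes it) is an assumption that is not confirmed in this thread:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical composite key; class and field names are illustrative only.
final class DedupKey {
    private final int id;
    private final int id1;

    DedupKey(int id, int id1) {
        this.id = id;
        this.id1 = id1;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof DedupKey)) {
            return false;
        }
        DedupKey other = (DedupKey) o;
        // Two keys are equal only if both fields match.
        return id == other.id && id1 == other.id1;
    }

    @Override
    public int hashCode() {
        // Works on the ints directly, so no String is built per tuple.
        return 31 * id + id1;
    }
}

class DedupKeyDemo {
    public static void main(String[] args) {
        Set<DedupKey> seen = new HashSet<>();
        System.out.println(seen.add(new DedupKey(0, 3))); // true: first time this pair is seen
        System.out.println(seen.add(new DedupKey(1, 2))); // true: different pair, same sum
        System.out.println(seen.add(new DedupKey(0, 3))); // false: genuine duplicate
    }
}
```

The tuple class could then expose the key through a getter (for example a hypothetical getDedupKey() returning new DedupKey(id, id1)) and reference that getter in the key expression.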

Looking at the source code, it looks like support for the {$} notation was removed. I did not test it, but if that is the case, it would be good to bring it back.
 
Thank you,

Vlad


Re: How to specify more than one fields as a dedup keyExpression

Vivek Bhide
In reply to this post by Munagala Ramanath-2
Thanks, Ram. So are you saying that with more than one integer field in the dedup key,
the expression calculates the sum of the two fields, whereas with strings it
concatenates them (because of + overloading)?

Regards
Vivek




Re: How to specify more than one fields as a dedup keyExpression

Munagala Ramanath-2
Yes, if you use "+" in your expression, the numeric sum will be computed if the fields are
integers and the catenation if they are strings; the former will not yield the desired uniqueness.
But now that I think about it some more, even the latter will not. Here's why: if the fields in one
record are "Hello" and "World" and in another record are "He" and "lloWorld", both will give you
the same catenated value, "HelloWorld", and will be considered duplicates.

If you can identify a character that is guaranteed not to occur in the string fields, you can use
it as a separator, and that will give you the desired uniqueness. For example, if "#" is such a
character, then the two cases above will give you distinct strings, "Hello#World" and "He#lloWorld",
and there is no problem.
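
The difference can be checked with a short Java sketch. The helper names below are hypothetical and not part of Apex or Malhar, and "#" is assumed never to occur in the field values:

```java
// Hypothetical helper; not part of Apex or Malhar.
class SeparatedKey {

    // Plain catenation: the boundary between the two fields is lost.
    static String plainKey(String a, String b) {
        return a + b;
    }

    // Catenation with a separator that is assumed never to occur in the fields.
    static String separatedKey(String a, String b) {
        return a + "#" + b;
    }

    public static void main(String[] args) {
        // Two different records produce the same plain key.
        System.out.println(plainKey("Hello", "World").equals(plainKey("He", "lloWorld")));
        // With the separator the keys differ.
        System.out.println(separatedKey("Hello", "World").equals(separatedKey("He", "lloWorld")));
        // Integer fields converted to strings need the separator too: (1, 23) vs (12, 3).
        System.out.println(separatedKey(String.valueOf(1), String.valueOf(23)));
        System.out.println(separatedKey(String.valueOf(12), String.valueOf(3)));
    }
}
```

A getDedupKey() method on the tuple class could return such a separated string and be referenced from the key expression as suggested earlier in the thread.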

Ram
