JDBC poll operator performance

Thomas Weise-2
Hi,

It seems the poll operator performs unnecessary operations when the "key" column values in the source table are monotonically increasing. There should be no need to sort or to run count queries; filtering on the key range should be sufficient.

If, say, the key column is a timestamp that is set by a trigger, one could use:

SELECT ... WHERE UPDATE_DATE > '<LAST_SEEN_DATE>'

instead of operating with ORDER BY, OFFSET and LIMIT.
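
A minimal sketch of this idea, written against Python's sqlite3 module only so that it runs without an external database. The table name "events", the helper poll_new_rows, and the sample rows are assumptions for illustration and are not taken from the operator:

```python
import sqlite3

def poll_new_rows(conn, last_seen):
    """Fetch only rows whose key is greater than the last key already seen."""
    # Hypothetical table and column names. The key is bound as a
    # parameter rather than spliced into the SQL text.
    cur = conn.execute(
        "SELECT id, update_date FROM events WHERE update_date > ?",
        (last_seen,),
    )
    rows = cur.fetchall()
    # Row order is not guaranteed without ORDER BY, so compute the
    # maximum key explicitly instead of trusting the last row returned.
    new_last_seen = max((r[1] for r in rows), default=last_seen)
    return rows, new_last_seen

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, update_date TEXT)")
conn.execute("CREATE INDEX idx_events_update_date ON events (update_date)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2017-06-27 09:00:00"), (2, "2017-06-27 09:05:00")],
)

# First poll starts from an empty key, so it sees every existing row.
first_batch, last_seen = poll_new_rows(conn, "")

# A later poll only fetches rows added after the last seen key.
conn.execute("INSERT INTO events VALUES (3, '2017-06-27 09:10:00')")
second_batch, last_seen = poll_new_rows(conn, last_seen)
```

Note that a strict ">" comparison assumes the key values are unique; if several rows can share one timestamp, rows committed later with the same value as the last seen key would be skipped.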

Thanks


Re: JDBC poll operator performance

Bhupesh Chawda
IMO we would still need to sort: even though the keys are monotonically increasing, the database may not return the rows in that order. This depends on the implementation and file format of the given DB.

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: [hidden email] | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



(In reply to Thomas Weise's message of Tue, Jun 27, 2017 at 9:16 AM.)



Re: JDBC poll operator performance

Hitesh Kapoor
I agree with Bhupesh: a DB does not guarantee that your data will be retrieved in a specific or sorted order if no 'order by' clause is given in the query.
IMO, in the case of our poll operator, we will have to sort the records for the non-poller partitions to ensure that all records are emitted and that no record is emitted by two different partitions.
I think we can do away with sorting for the poller partition by using the idea that Thomas has suggested.

--Hitesh

(In reply to Bhupesh Chawda's message of Tue, Jun 27, 2017 at 10:48 AM.)




Re: JDBC poll operator performance

Thomas Weise-2
Records can be distributed between partitions based on key ranges; no sorting is needed for that.

You may need sorting for repeatable reads within a partition. But even then, the query should filter so that it does not fetch what was already loaded. Without a WHERE clause, every poll repeats an unnecessary full index scan.
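
The two points above can be sketched as follows, again with Python's sqlite3 module so that the example is self-contained. The table "records", the integer key "id", the helper fetch_partition, and the batch size are assumptions for illustration; this is not how the operator is currently implemented:

```python
import sqlite3

def fetch_partition(conn, lower, upper, last_loaded=None, batch_size=2):
    """Fetch the next batch for a partition owning keys in [lower, upper)."""
    # Resume strictly after the last key this partition already loaded;
    # on the first fetch, start at the inclusive lower bound of the range.
    if last_loaded is None:
        lower_clause, start = "id >= ?", lower
    else:
        lower_clause, start = "id > ?", last_loaded
    sql = (
        "SELECT id, payload FROM records "
        f"WHERE {lower_clause} AND id < ? "
        # ORDER BY is kept only to make reads repeatable within the
        # partition; the range filter alone keeps partitions disjoint.
        "ORDER BY id LIMIT ?"
    )
    return conn.execute(sql, (start, upper, batch_size)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 9)],
)

# Two partitions with disjoint key ranges: [1, 5) and [5, 9).
batch_a1 = fetch_partition(conn, 1, 5)
batch_a2 = fetch_partition(conn, 1, 5, last_loaded=batch_a1[-1][0])
batch_b1 = fetch_partition(conn, 5, 9)
```

Because each query is bounded by the partition's key range and by the last key loaded, no OFFSET is needed, and an index on the key can serve the filter without rescanning rows that were already fetched.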

The operator has other deficiencies, such as poor error handling in the poll thread; also, the tuple conversion does not work with all column expressions. I'm going to submit tickets and fix some of these issues.

The documentation also fails to mention that it won't work with Oracle, because that dialect is not supported in the open source version of jOOQ.

Thomas 


(In reply to Hitesh Kapoor's message of Tue, Jun 27, 2017 at 2:12 AM.)