Before I start re-inventing the wheel: AbstractFileInputOperator

Aaron Bossert
Folks,

I have been working with com.datatorrent.lib.io.fs.AbstractFileInputOperator to accommodate a slightly different use case, but I keep running into inefficiencies. Before detailing my use case: I do have this working, but it feels horribly inefficient and definitely far from elegant. My use case needs to:

  • Scan multiple directories (not one as is expected)
  • Accept changes to directories to be scanned on the fly
  • Accept multiple file types (based on checking magic bytes/number)
  • Assume that files may be in any of the following conditions:
    • "Raw"
    • Compressed
    • Archived
    • Compressed and Archived
  • Associate provenance (e.g. customer and sensor) with events extracted from these files

My existing solution was to provide my own implementation of AbstractFileInputOperator.DirectoryScanner, and also to emit arrays/lists of events rather than Strings (lines from each file), because most of my input file types are binary.
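
A minimal sketch of the scanning requirements (several directories, changeable on the fly, provenance attached to each file), using only the JDK. It does not extend or call AbstractFileInputOperator.DirectoryScanner: MultiDirScanner, Provenance, and Found are hypothetical names, and a real operator would presumably also need to work through the Hadoop FileSystem API and checkpoint the set of files already seen.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

/** Hypothetical stand-in for a custom scanner; it is not a Malhar class. */
final class MultiDirScanner {
  /** Provenance attached to every file found under a registered directory. */
  static final class Provenance {
    final String customer;
    final String sensor;
    Provenance(String customer, String sensor) {
      this.customer = customer;
      this.sensor = sensor;
    }
  }

  /** A newly discovered file together with the provenance of its directory. */
  static final class Found {
    final Path file;
    final Provenance provenance;
    Found(Path file, Provenance provenance) {
      this.file = file;
      this.provenance = provenance;
    }
  }

  // Directories may be added or removed between scans.
  private final Map<Path, Provenance> dirs = new ConcurrentHashMap<>();
  // Files already handed out, so that each file is reported only once.
  private final Set<Path> seen = new HashSet<>();

  void addDirectory(Path dir, Provenance p) { dirs.put(dir, p); }

  void removeDirectory(Path dir) { dirs.remove(dir); }

  /** Returns regular files not reported by an earlier call, across all registered directories. */
  synchronized List<Found> scan() {
    List<Found> out = new ArrayList<>();
    for (Map.Entry<Path, Provenance> e : dirs.entrySet()) {
      try (Stream<Path> listing = Files.list(e.getKey())) {
        listing.filter(Files::isRegularFile)
               .filter(seen::add) // add() returns false for files seen before
               .sorted()
               .forEach(f -> out.add(new Found(f, e.getValue())));
      } catch (IOException ex) {
        throw new UncheckedIOException(ex);
      }
    }
    return out;
  }
}
```

Keeping the directory-to-provenance mapping inside the scanner means every file leaves the scan already tagged, so downstream code does not have to re-derive customer and sensor from the path.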

I am seeing several mismatches between my use case and AbstractFileInputOperator, but I also see a ton of existing work within it that I would prefer not to redo (partitioning, fault tolerance, etc.). Is there a more appropriate class or interface I should be looking at? Or is it appropriate to create a new interface for a directory scanner that accounts for multiple directories and can deal with compressed and archived files? (Things like openFile would then need to support returning a list of input streams, at a minimum, to accommodate these files.) I just want to make sure I am not overdoing things in a quest for more efficient and clean code.
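
To make the "list of input streams" idea concrete, here is a rough sketch under stated assumptions. StreamExpander and its methods are hypothetical names, not a Malhar API; only gzip and zip are handled because the JDK ships readers for them (tar would need an extra library); and archive entries are buffered in memory, which would not suit large files.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: turns one physical file into one stream per logical file. */
final class StreamExpander {
  /** Reads up to 4 leading bytes from a mark-capable stream, then rewinds it. */
  private static byte[] peek(InputStream in) throws IOException {
    in.mark(4);
    byte[] head = new byte[4];
    int n = 0;
    while (n < 4) {
      int r = in.read(head, n, 4 - n);
      if (r < 0) break;
      n += r;
    }
    in.reset();
    return Arrays.copyOf(head, n);
  }

  /** gzip magic number: 0x1f 0x8b. */
  static boolean isGzip(byte[] h) {
    return h.length >= 2 && (h[0] & 0xff) == 0x1f && (h[1] & 0xff) == 0x8b;
  }

  /** zip local file header signature: 'P' 'K' 0x03 0x04. */
  static boolean isZip(byte[] h) {
    return h.length >= 4 && h[0] == 'P' && h[1] == 'K' && h[2] == 3 && h[3] == 4;
  }

  /** Returns one InputStream per logical file found in raw, unwrapping nested layers. */
  static List<InputStream> expand(InputStream raw) {
    try {
      BufferedInputStream in = new BufferedInputStream(raw);
      byte[] head = peek(in);
      if (isGzip(head)) {
        // Compressed: unwrap one layer and sniff the result again.
        return expand(new GZIPInputStream(in));
      }
      List<InputStream> out = new ArrayList<>();
      if (isZip(head)) {
        // Archived: buffer each entry, then expand it in case it is compressed too.
        try (ZipInputStream zip = new ZipInputStream(in)) {
          for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
            if (!e.isDirectory()) {
              out.addAll(expand(new ByteArrayInputStream(readAll(zip))));
            }
          }
        }
      } else {
        out.add(in); // "raw" file: hand back the rewound stream as-is
      }
      return out;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Drains the current stream (or zip entry) into a byte array. */
  private static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    for (int r = in.read(chunk); r >= 0; r = in.read(chunk)) {
      buf.write(chunk, 0, r);
    }
    return buf.toByteArray();
  }
}
```

Because expand() sniffs again after each layer is removed, the four conditions listed earlier (raw, compressed, archived, compressed and archived) all go through the same call.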

--

M. Aaron Bossert
(571) 242-4021
Punch Cyber Analytics Group


Re: Before I start re-inventing the wheel: AbstractFileInputOperator

Thomas Weise-2
Did you already look at FileSplitter/BlockReader? Would that better support your customization requirements?

--
sent from mobile


(In reply to Aaron Bossert's message of Thu, Jun 21, 2018, 9:29 PM.)


Re: Before I start re-inventing the wheel: AbstractFileInputOperator

Aaron Bossert
Thomas,

You know, I did read through the source for that, but I guess my imagination didn't kick in. Maybe I got stuck on the name being a file splitter, as opposed to the potential to re-purpose it for archived files. Thanks for the pointer. I will give that a go before re-inventing the wheel.

(In reply to Thomas Weise's message of Fri, Jun 22, 2018, 2:08 AM.)