Camel file component throughput when readLock=changed

I'm using Apache Camel to transfer files from an input directory to a message broker. The files are written via SFTP. To avoid consuming incomplete files that are still in transit, I've set readLock=changed and readLockCheckInterval=3000.

As an example, this is how one of my tests looks:

    <route>
        <from uri="file:inbox?readLock=changed&amp;readLockCheckInterval=3000"/>
        <log message="copying ${file:name}"/>
        <to uri="file:outbox"/>
    </route>

I test this with

    (echo line 1; sleep 2; echo line 2) > inbox/test

and the file gets copied faithfully when readLockCheckInterval=3000. However, this doesn't scale, because the file component waits three seconds before processing each file. So when I test with

    for n in $(seq 1 100); do (echo line 1; sleep 2; echo line 2) > inbox/$n & done

it takes Camel five minutes to move the files from inbox to outbox.

I've read the chapter on parallel processing in the Camel in Action book, but its examples focus on parallelizing the processing of lines within a single consumed file. I couldn't find a way to parallelize the consumer itself.

A throughput of around one file per second would be fine for my use case. I just don't like being forced to risk incomplete data to achieve it. The readLock=changed setting feels like a hack anyway, but we can't tell the customer to copy the files and then move them into place, so there doesn't seem to be another option.

How can I improve throughput without sacrificing integrity in the face of network delays?



Solution 1:[1]

Instead of readLockCheckInterval=3000 I now use readLockMinAge=3s, and the throughput is fine. This is how my test route looks now:

    <route>
        <from uri="file:inbox?readLock=changed&amp;readLockMinAge=3s"/>
        <to uri="file:outbox"/>
    </route>

As it turns out, I wasn't the only one in this situation, and there was a ticket to introduce exactly this kind of minimum-age check. I was just too impatient when reading the file component documentation, where this option is already explained.
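The two options can also be combined. As a rough sketch (the specific interval value below is illustrative, not from the original post): readLockMinAge=3s makes the consumer ignore files younger than three seconds, while a lower readLockCheckInterval lets it re-check candidate files more often instead of blocking for the full three seconds per file:

    <route>
        <!-- only pick up files at least 3 seconds old; re-check candidates every 500 ms -->
        <from uri="file:inbox?readLock=changed&amp;readLockMinAge=3s&amp;readLockCheckInterval=500"/>
        <to uri="file:outbox"/>
    </route>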

Solution 2:[2]

If the producer is faster than your consumer and you want to keep up, you have to parallelize the file consumption. You can do that by deploying your consumer multiple times, with all instances polling the same folder. That way you can process as many files in parallel as you have consumer instances.

However, the distributed file consumers introduce a new problem: several of them can try to consume the same file concurrently.

To solve this you need to use a distributed idempotent repository to make sure the same file is not consumed multiple times across all instances.

For this to work with the file component, I guess you have to set readLock to idempotent-changed. That way each consumer should wait until the file no longer changes, and when several consumers try to read the same file, only the first one "wins". All others skip it because the file is already recorded in the idempotent repository.
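A minimal sketch of what that might look like in Spring XML, under some assumptions: the bean id fileRepo is made up for illustration, the in-memory repository class shown here (whose package differs between Camel 2.x and 3.x) only works within a single JVM, and for multiple consumer instances you would swap in a cluster-wide implementation (for example a JDBC- or Hazelcast-backed idempotent repository) so all instances share the same state:

    <!-- illustrative repository bean; with multiple JVMs this must be replaced by a
         distributed implementation, otherwise each instance keeps its own file list -->
    <bean id="fileRepo"
          class="org.apache.camel.processor.idempotent.MemoryIdempotentRepository"/>

    <route>
        <!-- idempotent-changed: wait until the file stops changing, and record it in the
             shared repository so only one consumer instance ends up processing it -->
        <from uri="file:inbox?readLock=idempotent-changed&amp;idempotentRepository=#fileRepo"/>
        <to uri="file:outbox"/>
    </route>

Presumably the readLockMinAge trick from Solution 1 can still be combined with this, since idempotent-changed builds on the same changed-detection mechanism, but that is an assumption worth verifying against the file component documentation for your Camel version.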

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1 (Stack Overflow)
[2] Solution 2 (Stack Overflow)