How to write the contents of a Scala stream to a file?
I have a Scala stream of bytes that I'd like to write to a file. The stream has too much data to buffer all of it in memory.
As a first attempt, I created an InputStream similar to this:
class MyInputStream(data: Stream[Byte]) extends InputStream {
  private val iterator = data.iterator
  // read() must return an unsigned value in 0-255, or -1 at end of stream,
  // so mask the (signed) Byte to avoid bytes >= 0x80 being mistaken for EOF.
  override def read(): Int = if (iterator.hasNext) iterator.next() & 0xff else -1
}
Then I use Apache Commons to write the file:
val source = new MyInputStream(dataStream)
val target = new FileOutputStream(file)
try {
  IOUtils.copy(source, target)
} finally {
  target.close()
}
This works, but I'm not too happy with the performance. I'm guessing that calling MyInputStream.read
for every byte introduces a lot of overhead. Is there a better way?
Solution 1:[1]
You might (or might not!) be mistaken that the read side is the source of your performance troubles. It could be the fact that you are using an unbuffered FileOutputStream(...), forcing a separate system call for every byte written.
Here's my take, quick 'n simple:
def writeBytes(data: Stream[Byte], file: File): Unit = {
  val target = new BufferedOutputStream(new FileOutputStream(file))
  try data.foreach(target.write(_)) finally target.close()
}
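For example, a self-contained usage sketch (the sample data and filename here are hypothetical, just to make it runnable):

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream}

object WriteBytesDemo extends App {
  def writeBytes(data: Stream[Byte], file: File): Unit = {
    val target = new BufferedOutputStream(new FileOutputStream(file))
    try data.foreach(target.write(_)) finally target.close()
  }

  // Hypothetical sample data: 100,000 bytes produced lazily.
  val data: Stream[Byte] = Stream.tabulate(100000)(i => (i % 256).toByte)
  writeBytes(data, new File("writeBytes-demo.bin"))
}
```

The BufferedOutputStream batches the per-byte writes into larger system calls, which is the point of this answer.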
Solution 2:[2]
I'd recommend the java.nio.file package. With Files.write you can write an Array[Byte] to a Path constructed from a filename.

It's up to you how to provide the Bytes. You can turn the Stream into an Array with .toArray, or you can take bytes off the stream one (or a handful) at a time and turn them into arrays.

Here's a simple code block demonstrating the .toArray method.
import java.nio.file.{Files, Paths}
val filename: String = "output.bin"
val bytes: Stream[Byte] = ...
Files.write(Paths.get(filename), bytes.toArray)
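If the stream is too large for .toArray, the "handful at a time" variant the answer mentions can be sketched with the same API; the chunk size, filename, and sample data below are my own assumptions:

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}

object ChunkedWriteDemo extends App {
  // Hypothetical sample data: 10,000 bytes produced lazily.
  val bytes: Stream[Byte] = Stream.tabulate(10000)(i => (i % 256).toByte)

  val path = Paths.get("chunked-output.bin")
  Files.deleteIfExists(path)
  Files.createFile(path)

  // Write the stream in 4K chunks so only one chunk is in memory at a time.
  bytes.grouped(4096).foreach { chunk =>
    Files.write(path, chunk.toArray, StandardOpenOption.APPEND)
  }
}
```

Note that appending per chunk reopens the file on every Files.write call, so very small chunks would trade memory for open/close overhead.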
Solution 3:[3]
You should implement the bulk read override in your InputStream implementation:

override def read(b: Array[Byte], off: Int, len: Int): Int

IOUtils.copy uses that signature to read/write in 4K chunks.
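A sketch of what that override could look like, building on the question's MyInputStream (the loop body here is my own, not from the answer):

```scala
import java.io.InputStream

class MyInputStream(data: Stream[Byte]) extends InputStream {
  private val iterator = data.iterator

  override def read(): Int =
    if (iterator.hasNext) iterator.next() & 0xff else -1

  // Bulk override: fill up to `len` bytes in one call so copy loops
  // like IOUtils.copy avoid a virtual method call per byte.
  override def read(b: Array[Byte], off: Int, len: Int): Int = {
    var i = 0
    while (i < len && iterator.hasNext) {
      b(off + i) = iterator.next()
      i += 1
    }
    // Per the InputStream contract: -1 at end of stream, 0 only if len == 0.
    if (i == 0 && len > 0) -1 else i
  }
}
```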
Solution 4:[4]
Given that the Stream's iterator reading one byte at a time might be the bottleneck, I've devised a way to write a stream to an OutputStream that does not rely on it and is hopefully more efficient:
import scala.annotation.tailrec

object StreamCopier {
  def copy(data: Stream[Byte], output: OutputStream): Unit = {
    @tailrec
    def write(d: Stream[Byte]): Unit = if (d.nonEmpty) {
      val (head, tail) = d.splitAt(4 * 1024)
      val bytes = head.toArray
      output.write(bytes, 0, bytes.length)
      write(tail)
    }
    write(data)
  }
}
EDIT: Fixed a bug by replacing data with d inside the tail-recursive write function.
This approach uses splitAt to split the stream into the first ~4K and the remainder, writes that head to the OutputStream in one call, and recurses on the tail of the stream until splitAt returns an empty stream.
Since you have performance benchmarks in place, I'll leave it to you to judge whether that turns out to be more efficient.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Xavier Guihot
Solution 2 | Aphex
Solution 3 | Arne Claassen
Solution 4 |