How to write the contents of a Scala stream to a file?

I have a Scala stream of bytes that I'd like to write to a file. The stream has too much data to buffer all of it in memory.

As a first attempt, I created an InputStream similar to this:

import java.io.InputStream

class MyInputStream(data: Stream[Byte]) extends InputStream {
  private val iterator = data.iterator
  // Mask with 0xff so a negative byte isn't confused with the -1 end-of-stream marker
  override def read(): Int = if (iterator.hasNext) iterator.next() & 0xff else -1
}

Then I use Apache Commons to write the file:

import java.io.FileOutputStream
import org.apache.commons.io.IOUtils

val source = new MyInputStream(dataStream)
val target = new FileOutputStream(file)
try {
  IOUtils.copy(source, target)
} finally {
  target.close()
}

This works, but I'm not too happy with the performance. I'm guessing that calling MyInputStream.read for every byte introduces a lot of overhead. Is there a better way?



Solution 1:[1]

You might (or might not!) be mistaken that the read side is the source of your performance troubles. It could be the fact that you are using an unbuffered FileOutputStream(...), forcing a separate system call for every byte written.

Here's my take, quick 'n simple:

import java.io.{BufferedOutputStream, File, FileOutputStream}

def writeBytes(data: Stream[Byte], file: File): Unit = {
  val target = new BufferedOutputStream(new FileOutputStream(file))
  try data.foreach(target.write(_)) finally target.close()
}

Solution 2:[2]

I'd recommend the java.nio.file package. With Files.write you can write an Array[Byte] to a Path constructed from a filename.

It's up to you how to provide the bytes. You can turn the Stream into an Array with .toArray (note that this materializes the entire stream in memory, so it only suits data that fits), or you can take bytes off the stream one (or a handful) at a time and turn them into arrays, as sketched after the code block below.

Here's a simple code block demonstrating the .toArray method.

import java.nio.file.{Files, Paths}

val filename: String = "output.bin"
val bytes: Stream[Byte] = ...
Files.write(Paths.get(filename), bytes.toArray)
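
For the "handful at a time" route, here's a minimal sketch (my own addition, not part of the original answer; the writeChunked name and the 4096-byte chunk size are assumptions):

import java.nio.file.{Files, Paths}

def writeChunked(bytes: Stream[Byte], filename: String, chunkSize: Int = 4096): Unit = {
  val out = Files.newOutputStream(Paths.get(filename))
  // grouped is lazy, so only one chunk is forced into an Array at a time
  try bytes.grouped(chunkSize).foreach(chunk => out.write(chunk.toArray))
  finally out.close()
}

One caveat with any such helper: the bytes parameter keeps a reference to the head of the Stream, so the memoized prefix stays reachable until the call returns; the real memory ceiling depends on how the stream is built and passed in.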

Solution 3:[3]

You should implement the bulk read override in your InputStream implementation:

override def read(b: Array[Byte], off: Int, len: Int): Int

IOUtils.copy uses that signature to read/write in 4K chunks.
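
Here's a minimal sketch of what that override could look like for the MyInputStream from the question (this implementation is my own, not the answerer's):

import java.io.InputStream

class MyInputStream(data: Stream[Byte]) extends InputStream {
  private val iterator = data.iterator

  override def read(): Int =
    if (iterator.hasNext) iterator.next() & 0xff else -1

  // Bulk read: copy up to len bytes into b starting at off and return the
  // number of bytes copied, or -1 once the stream is exhausted
  override def read(b: Array[Byte], off: Int, len: Int): Int = {
    if (!iterator.hasNext) return -1
    var i = 0
    while (i < len && iterator.hasNext) {
      b(off + i) = iterator.next()
      i += 1
    }
    i
  }
}

With this in place, each IOUtils.copy iteration moves a full buffer instead of a single byte.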

Solution 4:[4]

Given that the Stream's iterator reading one byte at a time might be the bottleneck, I've devised a way to write a stream to an OutputStream that does not rely on it and is hopefully more efficient:

import java.io.OutputStream

object StreamCopier {
  def copy(data: Stream[Byte], output: OutputStream): Unit = {
    @annotation.tailrec
    def write(d: Stream[Byte]): Unit = if (d.nonEmpty) {
      // Force at most ~4K of the stream into memory at a time
      val (head, tail) = d.splitAt(4 * 1024)
      val bytes = head.toArray
      output.write(bytes, 0, bytes.length)
      write(tail)
    }
    write(data)
  }
}

EDIT: Fixed a bug by replacing data with d inside the tail-recursive write function.

This approach uses splitAt to split the stream into the first ~4K and the remainder, writes that head to the OutputStream, and recurses on the tail of the stream until splitAt returns an empty stream.

Since you have performance benchmarks in place, I'll leave it to you to judge whether this turns out to be more efficient.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Xavier Guihot
Solution 2: Aphex
Solution 3: Arne Claassen
Solution 4: