Reading a large compressed file using Apache Commons Compress
I'm trying to read a bz2 file using Apache Commons Compress.
The following code works for a small file. However, for a large file (over 500 MB), it stops after reading a few thousand lines, without reporting any error.
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.commons.compress.compressors.CompressorInputStream;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

try {
    InputStream fin = new FileInputStream("/data/file.bz2");
    BufferedInputStream bis = new BufferedInputStream(fin);
    CompressorInputStream input = new CompressorStreamFactory()
            .createCompressorInputStream(bis);
    BufferedReader br = new BufferedReader(
            new InputStreamReader(input, "UTF-8"));
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
} catch (Exception e) {
    e.printStackTrace();
}
Is there another good way to read a large compressed file?
Solution 1:[1]
I was having the same problem with a large file, until I noticed that CompressorStreamFactory has a couple of overloaded constructors that take a boolean decompressUntilEOF parameter.
Simply changing to the following may be all that's missing...
CompressorInputStream input = new CompressorStreamFactory(true)
        .createCompressorInputStream(bis);
Clearly, whoever wrote this factory seems to think it's better, at certain points, to create a new compressor input stream over the same underlying buffered input stream, so that the new one picks up where the last one left off, rather than letting a single stream decompress data all the way to the end of the file. The background is that large bz2 files are often several bz2 streams concatenated together (parallel compressors such as pbzip2 write their output that way), and with the default of false the decompressor stops quietly at the end of the first stream, which is exactly the symptom in the question. I've no doubt the authors are cleverer than me, and I haven't worked out what trap I'm setting for future me by setting this parameter to true. Maybe someone will tell me in the comments! :-)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tim Hirst |