How to read data from nested zip files in Java without using temporary files?
I am trying to extract files out of a nested zip archive and process them in memory.
What this question is not about:
How to read a zip file in Java: NO, the question is how to read a zip file within a zip file within a zip and so on and so forth (as in nested zip files).
Write temporary results on disk: NO, I'm asking about doing it all in memory. I found many answers using the not-so-efficient technique of writing results temporarily to disk, but that's not what I want to do.
Example:
Zipfile -> Zipfile1 -> Zipfile2 -> Zipfile3
Goal: extract the data found in each of the nested zip files, all in memory and using Java.
ZipFile is the answer, you say? No, it is not. It only works for the first level, that is, for:
Zipfile -> Zipfile1
But once you get to Zipfile2, and perform a:
ZipInputStream z = new ZipInputStream(zipFile.getInputStream( zipEntry) ) ;
you will get a NullPointerException.
My code:
public class ZipHandler {
String findings = new String();
ZipFile zipFile = null;
public void init(String fileName) throws AppException{
try {
//read file into stream
zipFile = new ZipFile(fileName);
Enumeration<?> enu = zipFile.entries();
exctractInfoFromZip(enu);
zipFile.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
//The idea was recursively extract entries using ZipFile
public void exctractInfoFromZip(Enumeration<?> enu) throws IOException, AppException{
try {
while (enu.hasMoreElements()) {
ZipEntry zipEntry = (ZipEntry) enu.nextElement();
String name = zipEntry.getName();
long size = zipEntry.getSize();
long compressedSize = zipEntry.getCompressedSize();
System.out.printf("name: %-20s | size: %6d | compressed size: %6d\n",
name, size, compressedSize);
// directory ?
if (zipEntry.isDirectory()) {
System.out.println("dir found:" + name);
findings+=", " + name;
continue;
}
if (name.toUpperCase().endsWith(".ZIP") || name.toUpperCase().endsWith(".GZ")) {
String fileType = name.substring(
name.lastIndexOf(".")+1, name.length());
System.out.println("File type:" + fileType);
System.out.println("zipEntry: " + zipEntry);
if (fileType.equalsIgnoreCase("ZIP")) {
//ZipFile here returns a NULL pointer when you try to get the first nested zip
ZipInputStream z = new ZipInputStream(zipFile.getInputStream(zipEntry) ) ;
System.out.println("Opening ZIP as stream: " + name);
findings+=", " + name;
exctractInfoFromZip(zipInputStreamToEnum(z));
} else if (fileType.equalsIgnoreCase("GZ")) {
//ZipFile here returns a NULL pointer when you try to get the first nested zip
GZIPInputStream z = new GZIPInputStream(zipFile.getInputStream(zipEntry) ) ;
System.out.println("Opening GZ as stream: " + name);
findings+=", " + name;
exctractInfoFromZip(gZipInputStreamToEnum(z));
} else
throw new AppException("extension not recognized!");
} else {
System.out.println(name);
findings+=", " + name;
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("Findings " + findings);
}
public Enumeration<?> zipInputStreamToEnum(ZipInputStream zStream) throws IOException{
List<ZipEntry> list = new ArrayList<ZipEntry>();
while (zStream.available() != 0) {
list.add(zStream.getNextEntry());
}
return Collections.enumeration(list);
}
Solution 1:[1]
I have not tried it, but using ZipInputStream you can read any InputStream that contains a ZIP file as data. Iterate through the entries and, when you find the correct entry, use the ZipInputStream to create another nested ZipInputStream.
The following code demonstrates this. Imagine we have a readme.txt inside 0.zip, which is again zipped in 1.zip, which is zipped in 2.zip. Now we read some text from readme.txt:
try (FileInputStream fin = new FileInputStream("D:/2.zip")) {
ZipInputStream firstZip = new ZipInputStream(fin);
ZipInputStream zippedZip = new ZipInputStream(findEntry(firstZip, "1.zip"));
ZipInputStream zippedZippedZip = new ZipInputStream(findEntry(zippedZip, "0.zip"));
ZipInputStream zippedZippedZippedReadme = findEntry(zippedZippedZip, "readme.txt");
InputStreamReader reader = new InputStreamReader(zippedZippedZippedReadme);
char[] cbuf = new char[1024];
int read = reader.read(cbuf);
System.out.println(new String(cbuf, 0, read));
} // end of try-with-resources: closing fin also cuts off the nested streams
.....
public static ZipInputStream findEntry(ZipInputStream in, String name) throws IOException {
ZipEntry entry = null;
while ((entry = in.getNextEntry()) != null) {
if (entry.getName().equals(name)) {
return in;
}
}
return null;
}
Note that the code is really ugly: it does not close anything, nor does it check for errors. It is just a minimized version that demonstrates how it works.
Theoretically there is no limit on how many ZipInputStreams you can cascade into one another. The data is never written into a temporary file. The decompression is only performed on demand, when you read from each InputStream.
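As a rough illustration of the cascading idea, here is a minimal, self-contained sketch that walks nested archives recursively instead of hard-coding each level. The class name `NestedZipWalker`, the helper `zipOf`, and the `!/` path separator are made up for this example, and the archive is built in memory only so that the sketch runs without any files on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class NestedZipWalker {

    /** Lists every entry reachable from the given ZIP stream, descending into nested .zip entries. */
    public static List<String> walk(InputStream in, String prefix) throws IOException {
        List<String> paths = new ArrayList<>();
        // Deliberately not closed here: closing a nested ZipInputStream
        // would also close the outer stream it reads from.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String path = prefix + entry.getName();
            paths.add(path);
            boolean isZip = entry.getName().toLowerCase(Locale.ROOT).endsWith(".zip");
            if (!entry.isDirectory() && isZip) {
                // While positioned on an entry, 'zip' yields only that entry's bytes,
                // so it can serve directly as the source of the nested archive.
                paths.addAll(walk(zip, path + "!/"));
            }
        }
        return paths;
    }

    /** Hypothetical helper: builds a one-entry ZIP archive in memory. */
    public static byte[] zipOf(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(content);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // readme.txt inside 0.zip inside 1.zip inside the outermost archive
        byte[] zip0 = zipOf("readme.txt", "hello".getBytes(StandardCharsets.UTF_8));
        byte[] zip1 = zipOf("0.zip", zip0);
        byte[] zip2 = zipOf("1.zip", zip1);
        for (String path : walk(new ByteArrayInputStream(zip2), "")) {
            System.out.println(path);
        }
    }
}
```

Run as is, this should list `1.zip`, `1.zip!/0.zip` and `1.zip!/0.zip!/readme.txt`. To process file contents rather than names, read from `zip` at the point where the path is added, before calling `getNextEntry()` again.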
Solution 2:[2]
This is the way I found to unzip a file in memory.
The code is not clean at all, but I understand the rule is to post something that works, so here it is; hopefully it helps.
What I do is use a recursive method to navigate the complex ZIP file, extract the folders, inner zip files and regular files, and save the results in memory to work with them later.
The main things I found and want to share with you:
1. ZipFile is useless if you have nested zip files.
2. You have to use the basic zip InputStream and OutputStream classes.
3. I only use recursive programming to unzip nested zips.
package course.hernan;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import org.apache.commons.io.IOUtils;
public class FileReader {
private static final int BUFFER_SIZE = 2048;
public static void main(String[] args) {
try {
File f = new File("DIR/inputs.zip");
FileInputStream fis = new FileInputStream(f);
BufferedInputStream bis = new BufferedInputStream(fis);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BufferedOutputStream bos = new BufferedOutputStream(baos);
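byte[] buffer = new byte[BUFFER_SIZE];
int count;
// Write only the bytes actually read; the last read usually fills just part of the buffer
while ((count = bis.read(buffer, 0, BUFFER_SIZE)) != -1) {
bos.write(buffer, 0, count);
}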
bos.flush();
bos.close();
bis.close();
//This STACK has the output byte array information
Deque<Map<Integer, Object[]>> outputDataStack = ZipHandler1.unzip(baos);
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
package course.hernan;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import org.apache.commons.lang3.StringUtils;
public class ZipHandler1 {
private static final int BUFFER_SIZE = 2048;
private static final String ZIP_EXTENSION = ".zip";
public static final Integer FOLDER = 1;
public static final Integer ZIP = 2;
public static final Integer FILE = 3;
public static Deque<Map<Integer, Object[]>> unzip(ByteArrayOutputStream zippedOutputFile) {
try {
ZipInputStream inputStream = new ZipInputStream(
new BufferedInputStream(new ByteArrayInputStream(
zippedOutputFile.toByteArray())));
ZipEntry entry;
Deque<Map<Integer, Object[]>> result = new ArrayDeque<Map<Integer, Object[]>>();
while ((entry = inputStream.getNextEntry()) != null) {
LinkedHashMap<Integer, Object[]> map = new LinkedHashMap<Integer, Object[]>();
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
System.out.println("\tExtracting entry: " + entry);
int count;
byte[] data = new byte[BUFFER_SIZE];
if (!entry.isDirectory()) {
BufferedOutputStream out = new BufferedOutputStream(
outputStream, BUFFER_SIZE);
while ((count = inputStream.read(data, 0, BUFFER_SIZE)) != -1) {
out.write(data, 0, count);
}
out.flush();
out.close();
// recursively unzip files
if (entry.getName().toUpperCase().endsWith(ZIP_EXTENSION.toUpperCase())) {
map.put(ZIP, new Object[] {entry.getName(), unzip(outputStream)});
result.add(map);
//result.addAll();
} else {
map.put(FILE, new Object[] {entry.getName(), outputStream});
result.add(map);
}
} else {
map.put(FOLDER, new Object[] {entry.getName(), unzip(outputStream)});
result.add(map);
}
}
inputStream.close();
return result;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
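For comparison, the same recursive, buffer-everything-in-memory approach can be sketched with a flatter result type. This is only an illustrative variant, not the code above: the class name `InMemoryUnzipper`, the helper `zipOf`, and the choice of a `Map` from full path to file bytes are assumptions made for the example. Directories are skipped, and each nested zip is expanded under its own name used as a path prefix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class InMemoryUnzipper {

    /** Expands a ZIP given as bytes into path -> content, recursing into nested .zip entries. */
    public static Map<String, byte[]> unzip(byte[] zipBytes, String prefix) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        byte[] buffer = new byte[4096];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no data
                }
                // Copy the current entry into memory, writing only the bytes actually read
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                String path = prefix + entry.getName();
                if (path.toLowerCase(Locale.ROOT).endsWith(".zip")) {
                    // Nested archive: expand it, using its own path as the prefix
                    files.putAll(unzip(content.toByteArray(), path + "/"));
                } else {
                    files.put(path, content.toByteArray());
                }
            }
        }
        return files;
    }

    /** Hypothetical helper: builds a one-entry ZIP archive in memory. */
    public static byte[] zipOf(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(content);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] inner = zipOf("readme.txt", "hello".getBytes(StandardCharsets.UTF_8));
        byte[] outer = zipOf("inner.zip", inner);
        unzip(outer, "").forEach((path, data) ->
                System.out.println(path + " -> " + new String(data, StandardCharsets.UTF_8)));
    }
}
```

The trade-off is memory: every nested archive is held fully in a byte array before it is expanded, which is simple but can get expensive for large or deeply nested inputs.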
Solution 3:[3]
Thanks to JMax. In my case, the PDF file I read out of the archive came back different from what I expected: it was bigger than the original and could not be opened. I finally found my mistake: read() does not always fill the whole buffer, yet I was writing the entire buffer on every iteration. The following is the erroneous code:
while((n = zippedZippedZippedReadme.read(buffer)) != -1) {
fos.write(buffer);
}
Here is the corrected code:
try (FileInputStream fin = new FileInputStream("1.zip")) {
ZipInputStream firstZip = new ZipInputStream(fin);
ZipInputStream zippedZip = new ZipInputStream(findEntry(firstZip, "0.zip"));
ZipInputStream zippedZippedZippedReadme = findEntry(zippedZip, "test.pdf");
long startTime = System.currentTimeMillis();
byte[] buffer = new byte[4096];
File outputFile = new File("test.pdf");
try (FileOutputStream fos = new FileOutputStream(outputFile)) {
int n;
while((n = zippedZippedZippedReadme.read(buffer)) != -1) {
fos.write(buffer, 0, n);
}
fos.flush();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("time consuming:" + (System.currentTimeMillis() - startTime)/1000.0);
}
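To make the size difference concrete, here is a small sketch that copies the same 5000 bytes twice through a 4096-byte buffer, once writing the whole buffer and once writing only the bytes that were read. The class and method names (`BufferCopyDemo`, `copyBuggy`, `copyCorrect`) are invented for this illustration, and a `ByteArrayInputStream` stands in for the zip stream so the numbers are reproducible.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferCopyDemo {

    /** Buggy: ignores how many bytes read() returned and always writes the full buffer. */
    public static byte[] copyBuggy(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (in.read(buffer) != -1) {
            out.write(buffer); // also writes stale bytes left over from earlier reads
        }
        return out.toByteArray();
    }

    /** Correct: writes only the n bytes that the last read() actually delivered. */
    public static byte[] copyCorrect(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[5000]; // not a multiple of the buffer size
        System.out.println("buggy copy:   " + copyBuggy(new ByteArrayInputStream(source)).length + " bytes");
        System.out.println("correct copy: " + copyCorrect(new ByteArrayInputStream(source)).length + " bytes");
    }
}
```

With this input the buggy copy comes out at 8192 bytes (two full buffers) while the correct copy is 5000 bytes. With a ZipInputStream the short reads can happen at any point, not only at the end, so the corruption may be spread throughout the file.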
Hope this helps!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |
Solution 3 | yilin |