How to list all embedded files from a Microsoft Office document using Apache POI?

Is there any way to list all embedded objects (doc, ..., txt) in an Office file (doc, docx, xls, xlsx, ppt, pptx, ...)?

I am using the Apache POI (Java) library to extract text from Office files. I don't need to extract all the text from embedded objects; a log file with the file names of all embedded documents would be enough (something like: String objectFileNames = getEmbeddedFileNames(fileInputStream)).

Example: I have a Word Document "test.doc" which contains another file called "excel.xls". I'd like to write the file name of excel.xls (in this case) into a log file.

I tried this using some sample code from the Apache homepage (https://poi.apache.org/text-extraction.html), but my code always returns the same output ("Footer Text: Header Text").

What I tried is:

private static void test(String inputfile, String outputfile) throws Exception {

    String[] extractedText = new String[100];
    int emb = 0; // counter for embedded objects

    InputStream fis = new FileInputStream(inputfile);
    PrintWriter out = new PrintWriter(outputfile); // write text to file (txt)

    System.out.println("Embedded search started. Inputfile: " + inputfile);

    // Based on Apache sample code
    emb = 0; // reset counter

    POIFSFileSystem emb_fileSystem = new POIFSFileSystem(fis);
    // Firstly, get an extractor for the Workbook
    POIOLE2TextExtractor oleTextExtractor =
        ExtractorFactory.createExtractor(emb_fileSystem);
    // Then a List of extractors for any embedded Excel, Word, PowerPoint
    // or Visio objects embedded into it.
    POITextExtractor[] embeddedExtractors =
        ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);

    for (POITextExtractor textExtractor : embeddedExtractors) {
        // If the embedded object was an Excel spreadsheet.
        if (textExtractor instanceof ExcelExtractor) {
            ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
            extractedText[emb] = excelExtractor.getText();
        }
        // A Word Document
        else if (textExtractor instanceof WordExtractor) {
            WordExtractor wordExtractor = (WordExtractor) textExtractor;
            String[] paragraphText = wordExtractor.getParagraphText();
            for (String paragraph : paragraphText) {
                extractedText[emb] = paragraph;
            }
            // Display the document's header and footer text
            System.out.println("Footer text: " + wordExtractor.getFooterText());
            System.out.println("Header text: " + wordExtractor.getHeaderText());
        }
        // PowerPoint Presentation.
        else if (textExtractor instanceof PowerPointExtractor) {
            PowerPointExtractor powerPointExtractor =
                (PowerPointExtractor) textExtractor;
            extractedText[emb] = powerPointExtractor.getText();
            emb++;
            extractedText[emb] = powerPointExtractor.getNotes();
        }
        // Visio Drawing
        else if (textExtractor instanceof VisioTextExtractor) {
            VisioTextExtractor visioTextExtractor =
                (VisioTextExtractor) textExtractor;
            extractedText[emb] = visioTextExtractor.getText();
        }
        emb++; // count embedded objects
    } // close for-each loop over the embedded extractors

    // Write results to the txt file; stop at the first empty slot
    for (int x = 0; x < extractedText.length; x++) {
        if (extractedText[x] != null) {
            System.out.println(extractedText[x]);
            out.println(extractedText[x]);
        }
        else {
            break;
        }
    }
    out.close();

}

The input file is an xls that contains a doc file as an embedded object; the output file is a txt.

Thanks if anyone can help me.

Solution 1:^[1]

I don't think embedded OLE objects keep their original file name, so I don't think what you want is really possible.

I believe what Microsoft writes about embedded images also applies to OLE-Objects:

You might notice that the file name of the image file has been changed from Eagle1.gif to image1.gif. This is done to address privacy concerns, in that a malicious person could derive a competitive advantage from the name of parts in a document, such as an image file. For example, an author might choose to protect the contents of a document by encrypting the textual part of the document file. However, if two images are inserted named old_widget.gif and new_reenforced_widget.gif, even though the text is protected, a malicious person could learn the fact that the widget is being upgraded. Using generic image file names such as image1 and image2 adds another layer of protection to Office Open XML Formats files.

However, you could try the following (for Word 2007 files, aka XWPFDocument, aka ".docx"; other MS Office formats work similarly):

try (FileInputStream fis = new FileInputStream("mydoc.docx")) {
    XWPFDocument document = new XWPFDocument(fis);
    listEmbeds(document);
}


private static void listEmbeds(XWPFDocument doc) throws OpenXML4JException {
    // Note: newer POI releases appear to name this method getAllEmbeddedParts();
    // check the Javadoc of the version you use.
    List<PackagePart> embeddedDocs = doc.getAllEmbedds();
    if (embeddedDocs != null && !embeddedDocs.isEmpty()) {
        for (PackagePart pPart : embeddedDocs) {
            // Part name (a path inside the package) and MIME content type
            System.out.println(pPart.getPartName() + ", " + pPart.getContentType());
        }
    }
}

The pPart.getPartName() is the closest I could find to a file name of an embedded file.
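
To see what those part names look like without POI, here is a minimal sketch that reads a .docx/.xlsx/.pptx as a plain ZIP archive with the JDK only. It assumes the embedded objects sit in an "embeddings" folder (word/embeddings/, xl/embeddings/, ppt/embeddings/), which is where Office normally writes them; a file produced by another tool may store them elsewhere, and unlike POI this does not follow the package relationships. The class and method names (EmbeddedPartLister, listEmbeddedParts, baseName) are made up for illustration. It does not work for the old binary formats (doc, xls, ppt), which are not ZIP archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class EmbeddedPartLister {

    // Returns the package-internal paths of all entries stored in an
    // "embeddings" folder. Assumption: the producer follows the usual
    // Office layout (word/embeddings/, xl/embeddings/, ppt/embeddings/).
    static List<String> listEmbeddedParts(InputStream ooxml) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!entry.isDirectory() && name.contains("/embeddings/")) {
                    parts.add(name);
                }
            }
        }
        return parts;
    }

    // Strips the folder prefix and keeps only the last path segment.
    static String baseName(String partName) {
        return partName.substring(partName.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to an OOXML file, e.g. a .docx
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String part : listEmbeddedParts(in)) {
                System.out.println(baseName(part) + "  (" + part + ")");
            }
        }
    }
}
```

The names you get this way are the generic ones Office assigned when the object was embedded, not the original file names, so they are only useful for logging how many objects there are and of what kind.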

Solution 2:^[2]

Using Apache POI, you cannot get the original names of the embedded files. However, if you really need the original names, you can use the Aspose API: Aspose.Cells for Excel files, Aspose.Slides for presentation files, and Aspose.Words for Word files can all extract embedded files. You'll get the file name if the OLE object is linked; otherwise you won't get the original file name with Aspose either.

See the example below....

public void getDocEmbedded(InputStream stream) throws Exception {
    Document doc = new Document(stream);

    // Collect every shape in the document, including nested ones
    NodeCollection<?> shapes = doc.getChildNodes(NodeType.SHAPE, true);
    System.out.println(shapes.getCount());
    int itemcount = 0;
    for (int i = 0; i < shapes.getCount(); i++) {
        Shape shape = (Shape) shapes.get(i);

        OleFormat oleFormat = shape.getOleFormat();
        if (oleFormat != null) {
            // Only embedded (not linked) objects that are shown as an icon
            if (!oleFormat.isLink() && oleFormat.getOleIcon()) {
                itemcount++;
                String progId = oleFormat.getProgId();
                System.out.println("Extension: " + oleFormat.getSuggestedExtension()
                        + ", file name: " + oleFormat.getIconCaption());
                ByteArrayOutputStream baos = new ByteArrayOutputStream();

                byte[] bytearray = oleFormat.getRawData();
                if (bytearray == null) {
                    oleFormat.save(baos);
                    bytearray = baos.toByteArray();
                }
                // TODO: do whatever you want with the byte array
            }
        }
    }
}

I'm using oleFormat.getSuggestedExtension() to get the embedded file extension and oleFormat.getIconCaption() to get the embedded file names.

Solution 3:^[3]

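import java.io.File;
import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class GetEmbedded {

    public static void main(String[] args) throws Exception {
        String path = "SomeExcelFile.xlsx";
        // try-with-resources closes the workbook when done
        try (XSSFWorkbook workbook = new XSSFWorkbook(new FileInputStream(new File(path)))) {
            // Print the content type of each embedded part
            for (PackagePart pPart : workbook.getAllEmbedds()) {
                String contentType = pPart.getContentType();
                System.out.println("Embedded content type: " + contentType);
            }
        }
    }
}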
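
If the log should show something more readable than the raw content type, a small lookup table can translate the common Office MIME types into file extensions. This is only a sketch: the class name EmbeddedTypeNames and the method extensionFor are hypothetical, the table covers just a handful of well-known types, and anything not listed falls back to "unknown". The oleObject type is the generic wrapper used for embedded OLE data, so "bin" there says nothing about what is inside.

```java
import java.util.HashMap;
import java.util.Map;

class EmbeddedTypeNames {

    // Content type -> usual file extension, for a few common Office types only.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("application/msword", "doc");
        EXTENSIONS.put("application/vnd.ms-excel", "xls");
        EXTENSIONS.put("application/vnd.ms-powerpoint", "ppt");
        EXTENSIONS.put(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "docx");
        EXTENSIONS.put(
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "xlsx");
        EXTENSIONS.put(
            "application/vnd.openxmlformats-officedocument.presentationml.presentation", "pptx");
        EXTENSIONS.put(
            "application/vnd.openxmlformats-officedocument.oleObject", "bin");
    }

    // Returns the extension for a known content type, or "unknown" otherwise.
    static String extensionFor(String contentType) {
        return EXTENSIONS.getOrDefault(contentType, "unknown");
    }

    public static void main(String[] args) {
        // Each argument is a content type, e.g. the value of pPart.getContentType()
        for (String contentType : args) {
            System.out.println(contentType + " -> " + extensionFor(contentType));
        }
    }
}
```

In the loop above you would pass pPart.getContentType() to extensionFor and write the result to the log next to the part name.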

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	Ruhul Hussain
Solution 3	Swetha Prem
