Is there a way to read and collect an EMF image file in Python? Can we read an EMF image with OpenCV? How can it be converted to JPG or PNG?
I have been searching for a solution for a long time but could not find one. There are several similar questions and answers, but they did not help me.
Basically:
- I have some Word documents (xxx.docx) that contain images.
- Each image is in WMF format (as I found when checking manually) and basically contains tabular information.
- I need to collect that table.
So the task reduces to collecting the image and extracting the table from it using computer vision.
[1] When I try to collect the image, python-docx does not detect it as an image. I then found that the "aspose.words" library can detect it as an image object (even though it is not in a usual image format) and can write it out in EMF format (xxx.emf). [If there is any other way, please mention it.]
[2] Now I have the image (xxx.emf) in a folder, so the next task is to get the content of the image, which is entirely tabular information. However, I cannot read this format in Python.
So getting the EMF image and reading it is not my goal; the goal is to get the table from the image into Excel. Please help me with these steps, or suggest other approaches that meet the requirement. If anyone needs the docx, it is available in a repo. Thank you.
Solution 1:[1]
Word and Excel files (.docx, .xlsx) are actually just ZIP archives. You can unzip them with 7zip:
7z x 36C77022Q0250.docx
That gives you the following content:
ls -lR word
drwx------ 3 mark staff 96 10 May 00:38 _rels
-rw-r--r-- 1 mark staff 48763 28 Apr 21:35 document.xml
-rw-r--r-- 1 mark staff 1290 28 Apr 21:35 fontTable.xml
-rw-r--r-- 1 mark staff 2838 28 Apr 21:35 footer1.xml
-rw-r--r-- 1 mark staff 2865 28 Apr 21:35 footer2.xml
-rw-r--r-- 1 mark staff 1246 28 Apr 21:35 header1.xml
-rw-r--r-- 1 mark staff 1246 28 Apr 21:35 header2.xml
drwx------ 3 mark staff 96 10 May 00:38 media
-rw-r--r-- 1 mark staff 755 28 Apr 21:35 settings.xml
-rw-r--r-- 1 mark staff 49239 28 Apr 21:35 styles.xml
drwx------ 3 mark staff 96 10 May 00:38 theme
word/_rels:
total 8
-rw-r--r-- 1 mark staff 1307 28 Apr 21:35 document.xml.rels
word/media:
total 320
-rw-r--r-- 1 mark staff 162672 28 Apr 21:35 image1.wmf <--- HERE IT IS
You can see your WMF file there. Copy it to the current directory and rename it for simpler access:
cp word/media/image1.wmf image.emf
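Since the question asks about Python, the same extraction can be sketched with only the standard library. This is a minimal sketch, not part of the original answer: the function names `extract_media` and `sniff_metafile` are hypothetical, and the header check is a heuristic that covers only the common EMF and WMF signatures. It may be useful here because the file extension alone does not say whether the picture is really WMF or EMF.

```python
import zipfile
from pathlib import Path


def sniff_metafile(data: bytes) -> str:
    """Guess the metafile type from its header bytes (heuristic only)."""
    # EMF: the first record has type 1 and the signature " EMF" sits at offset 40.
    if data[:4] == b"\x01\x00\x00\x00" and data[40:44] == b" EMF":
        return "emf"
    # Placeable WMF starts with the key 0x9AC6CDD7, stored little-endian.
    if data[:4] == b"\xd7\xcd\xc6\x9a":
        return "wmf"
    # Plain WMF: type 1 or 2, followed by a header size of 9 words.
    if data[:4] in (b"\x01\x00\x09\x00", b"\x02\x00\x09\x00"):
        return "wmf"
    return "unknown"


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ out of a .docx archive.

    Returns a list of (output path, guessed type) tuples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            # Skip directory entries; keep only the embedded media files.
            if name.startswith("word/media/") and not name.endswith("/"):
                data = zf.read(name)
                target = out_dir / Path(name).name
                target.write_bytes(data)
                results.append((target, sniff_metafile(data)))
    return results
```

Calling `extract_media("36C77022Q0250.docx", "media_out")` would write the embedded pictures into a `media_out` folder (a name chosen here for illustration) and report the guessed type of each one.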
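You can then convert that to a PNG with either Inkscape or LibreOffice (the -e option shown here is from Inkscape 0.9x; Inkscape 1.x uses -o / --export-filename instead):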
inkscape -d 288 -e file.png image.emf
libreoffice --headless --convert-to png image.emf
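To drive the LibreOffice conversion from a Python script, one option is to call the same command through `subprocess`. The following is a sketch under assumptions: the helper names `build_convert_command` and `convert` are hypothetical, the executable is assumed to be on the PATH as `libreoffice` or `soffice`, and the `--outdir` option is added so the output location is predictable.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir=".", fmt="png", exe="libreoffice"):
    """Return the argument list for a headless LibreOffice conversion."""
    return [exe, "--headless", "--convert-to", fmt,
            "--outdir", str(out_dir), str(src)]


def convert(src, out_dir=".", fmt="png"):
    """Run the conversion and return the path LibreOffice is expected to write."""
    # The binary is called "soffice" on some installations (assumption).
    exe = shutil.which("libreoffice") or shutil.which("soffice")
    if exe is None:
        raise FileNotFoundError("LibreOffice executable not found on PATH")
    subprocess.run(build_convert_command(src, out_dir, fmt, exe), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(src).stem + "." + fmt)
```

With LibreOffice installed, `convert("image.emf")` should leave an `image.png` in the current directory. As with the command-line version, the rendering depends on the fonts available on the machine.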
The rendering came out slightly off on my system, probably because I lack your fonts.
Solution 2:[2]
I don't know much about Python, but I've implemented the WMF/EMF/EMF+ classes in Apache POI. I would use the position of each text record to give it some meaning, i.e. to assign it to a row and a column. The rest is for you to figure out, e.g. by keeping only lines with the same number of columns.
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.poi.hemf.usermodel.HemfPicture;
import org.apache.poi.hwmf.record.HwmfText.WmfExtTextOut;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.junit.jupiter.api.Test;

public class TestWmfExtract {

    @Test
    void blub() throws IOException {
        // Rows keyed by Y coordinate; each row maps a snapped X coordinate to its text.
        Map<Double, Map<Double,String>> tab = new TreeMap<>();

        try (InputStream is = new FileInputStream("36C77022Q0250.docx");
             XWPFDocument doc = new XWPFDocument(is);
             InputStream is2 = doc.getAllPictures().get(0).getPackagePart().getInputStream()
        ) {
            HemfPicture emf = new HemfPicture(is2);

            // Keep only the text-output records of the metafile.
            Stream<WmfExtTextOut> st = emf.getRecords().stream()
                .filter(r -> r instanceof WmfExtTextOut)
                .map(WmfExtTextOut.class::cast);

            for (WmfExtTextOut hr : (Iterable<WmfExtTextOut>) (st::iterator)) {
                Point2D p2d = hr.getReference();
                String txt = hr.getText(StandardCharsets.UTF_16LE);
                Rectangle2D bi = (Rectangle2D)hr.getGenericProperties().get("boundsIgnored").get();
                // Prefer the center of the bounds as X; fall back to the reference point.
                double x = bi != null ? bi.getCenterX() : p2d.getX();
                // Snap X to a grid of 20 units so nearby fragments share a column.
                x = 20. * Math.round(x / 20.);
                tab.computeIfAbsent(p2d.getY(), (d) -> new TreeMap<>()).put(x, txt);
            }

            // Sorted list of all distinct column positions.
            List<Double> colX = tab.values().stream().flatMap((m) -> m.keySet().stream())
                .distinct().sorted().collect(Collectors.toList());

            try (Workbook wb = new XSSFWorkbook();
                 FileOutputStream fos = new FileOutputStream("tab-out.xlsx")) {
                Sheet sh = wb.createSheet();
                int rowIdx = 0;
                // One spreadsheet row per distinct Y coordinate.
                for (Map<Double, String> cols : tab.values()) {
                    Row row = sh.createRow(rowIdx);
                    cols.forEach((x, txt) -> row.createCell(colX.indexOf(x)).setCellValue(txt));
                    rowIdx++;
                }
                wb.write(fos);
            }
        }
    }
}
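The row and column grouping used above does not depend on Java. Below is a rough Python sketch of the same idea, assuming the text fragments and their coordinates have already been obtained by some other means (this sketch does not parse EMF/WMF itself). The names `fragments_to_rows` and `rows_to_csv` and the sample fragments are hypothetical, and CSV is used in place of .xlsx to stay within the standard library.

```python
import csv
import io
from collections import defaultdict


def fragments_to_rows(fragments, x_step=20.0):
    """Arrange (x, y, text) fragments into table rows.

    Fragments with the same y form a row; x is snapped to a grid of
    x_step units so that nearby fragments fall into the same column.
    """
    table = defaultdict(dict)  # y -> {snapped x -> text}
    for x, y, text in fragments:
        # Note: Python's round() rounds exact halves to even, unlike Java's Math.round.
        snapped = x_step * round(x / x_step)
        table[y][snapped] = text
    # Every distinct snapped x becomes one column, ordered left to right.
    columns = sorted({x for cells in table.values() for x in cells})
    return [[table[y].get(x, "") for x in columns] for y in sorted(table)]


def rows_to_csv(rows):
    """Serialize the rows as CSV text, which Excel can open."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample data: (x, y, text) for a 2x2 table.
fragments = [
    (101.0, 10.0, "Item"), (298.0, 10.0, "Qty"),
    (99.0, 30.0, "Bolt"), (301.0, 30.0, "4"),
]
rows = fragments_to_rows(fragments)
print(rows_to_csv(rows), end="")
```

Cells that are missing in a row are left empty, so rows with a different number of fragments still line up with the shared column list.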
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |