Extract Keywords from Office Documents with SharePoint Flow

I am trying to implement a document management system using SharePoint. One major issue is that colleagues cannot find documents in the current setup (a local file server). They have asked for a system that scans uploaded documents, automatically looks for keywords in them, and then populates a "Meta" column.

I have had some success with OCR on image files, but so far I have had no success getting keywords out of Office documents (doc, xls, etc.).

Is there a way to set up a flow to do this task for me?

Any help is much appreciated.

I tried "Get file metadata" and Azure "Text analysis", but it seems to take the raw data of the files (XML, I assume) and reports that the document is too large to analyse.



Solution 1:[1]

There is something vague about this requirement - how is a keyword defined in a document?

Therefore, the first obvious solution would be to have keywords assigned to each file when it is uploaded. You can build a process for this with Flow, with tasks, reminders and so on.

Automating this with OCR means you first need an OCR service that works with MS Flow, and there you have only one choice: ElasticOCR. Then, in your flow, feed the document content to the ElasticOCR action (keep in mind that OCR is not 100% accurate), analyze the generated text according to your keyword definition, and finally write the metadata back to the library in the corresponding columns.
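To make the "analyze according to your keyword definition" step concrete, here is a minimal sketch in Python. It assumes one possible definition of a keyword, namely the most frequent words that are not stop words. The function name `extract_keywords`, the `STOP_WORDS` list and the default parameters are all hypothetical, and this is not part of Flow or ElasticOCR; it only illustrates the kind of logic that would sit between the OCR output and the metadata column.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); a real setup would need a fuller, language-specific list.
STOP_WORDS = {
    "the", "and", "for", "with", "that", "this",
    "are", "was", "from", "have", "not",
}


def extract_keywords(text: str, top_n: int = 5, min_length: int = 3) -> list[str]:
    """Return the top_n most frequent words in text.

    Assumed keyword definition: a lowercase alphabetic word of at least
    min_length characters that is not in STOP_WORDS.
    """
    # Lowercase the text and keep only runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    candidates = [
        w for w in words if len(w) >= min_length and w not in STOP_WORDS
    ]
    # Count occurrences and keep the most frequent words.
    counts = Counter(candidates)
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = (
        "Invoice for the project. "
        "The invoice covers project costs and invoice fees."
    )
    # Join the keywords into one string, e.g. for a single "Meta" column.
    print("; ".join(extract_keywords(sample, top_n=2)))
```

A frequency count is only one choice; if the keywords come from a fixed company vocabulary, matching the text against that list would be a better fit for the same step.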

When I worked on a similar requirement, we asked uploaders to publish their documents with a short abstract (a column from the content type). The assumption is that the abstract contains the keywords; it is stored in a multi-line column, which makes it searchable site-wide.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 chris_to