Google Apps Script OCR PDF to Text Page Number Limitation

I am very new to Google Apps Script. I have some PDF files in a folder on Google Drive, and I am trying to convert each PDF to a Google Doc and extract specific text. The PDF has more than 200 pages, but the resulting Google Doc is limited to 80 pages. Is there a limit on the number of pages you can run OCR on, or am I missing something?

My code below:

//#####GLOBALS#####

const FOLDER_ID = "1rlAL4WrnxQ6pEY2uOmzWA_csUIDdBjVK"; //Folder ID of all PDFs
const SS = "1XS_YUUdu9FK_bBumK3lFu9fU_M9w7NGydZqOzu9vTyE";//The spreadsheet ID
const SHEET = "Extracted";//The sheet tab name

/*########################################################

  • Main run file: extracts student IDs from PDFs and their
  • section from the PDF name from multiple documents.
  • Displays a list of students and sections in a Google Sheet.

*/

function extractInfo(){
  const ss = SpreadsheetApp.getActiveSpreadsheet()
  //Get all PDF files:
  const folder = DriveApp.getFolderById(FOLDER_ID);
  //const files = folder.getFiles();
  const files = folder.getFilesByType("application/pdf");
  
  let allInfo = []
  //Iterate through each PDF file in the folder
  while(files.hasNext()){
    Logger.log('first call');
    let file = files.next();
    let fileID = file.getId();
   
    const doc = getTextFromPDF(fileID);
    const invDate = extractInvDate(doc.text);
    
        
    allInfo = allInfo.concat(invDate);

    Logger.log("Length of allInfo array: ");
    Logger.log(allInfo.length);

  }
  // allInfo.length is 80 here, even though the PDF has more than 200 pages
  // with the required text (invoice date) on each page.
  importToSpreadsheet(allInfo);
};


/*########################################################
 * Extracts the text from a PDF and stores it in memory.
 * Also extracts the file name.
 *
 * param {string} : fileID : file ID of the PDF that the text will be extracted from.
 *
 * returns {Object} : Contains the file name and the PDF text.
 *
 */
function getTextFromPDF(fileID) {
  var blob = DriveApp.getFileById(fileID).getBlob()
  var resource = {
    title: blob.getName(),
    mimeType: blob.getContentType()
  };
  var options = {
    ocr: true, 
    ocrLanguage: "en"
  };
  // Convert the pdf to a Google Doc with ocr.
  var file = Drive.Files.insert(resource, blob, options);

  // Get the text from the newly created document.
  var doc = DocumentApp.openById(file.id);
  var text = doc.getBody().getText();
  var title = doc.getName();
  
  // Delete the document once the text has been stored.
  Drive.Files.remove(doc.getId());
  
  return {
    name:title,
    text:text
  };
}


function extractInvDate(text){
  const regexp = /Invoice Date:/g;//commented out \d{2}\/\d{2}\/\d{4}/gi;
  // match() returns null when nothing is found; fall back to an empty
  // array so the caller's concat() never receives undefined.
  return text.match(regexp) || [];
};


function importToSpreadsheet(data){
  // getRange() needs at least one row, so skip empty results.
  if (data.length === 0) return;
  const sheet = SpreadsheetApp.openById(SS).getSheetByName(SHEET);
  
  const range = sheet.getRange(3,1,data.length,1);
  
  for (let j = 0; j < data.length; j++){
    Logger.log(j);
    range.getCell(j+1,1).setValue(data[j]);
  }
  //range.sort([2,1]);
}
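
The match-and-write steps above can be checked in isolation, without any Apps Script services. The following is a minimal sketch, not the original code: `findInvoiceDates` and `toColumn` are hypothetical helper names, and the date pattern assumes dates written as two digits, two digits, four digits separated by slashes, as in the commented-out regular expression.

```javascript
// Hypothetical helper: collects the date that follows each "Invoice Date:" label.
// Assumes dates look like 01/02/2020; always returns an array, never null.
function findInvoiceDates(text) {
  const regexp = /Invoice Date:\s*(\d{2}\/\d{2}\/\d{4})/g;
  return Array.from(text.matchAll(regexp), (m) => m[1]);
}

// Hypothetical helper: turns a flat list into one-column rows,
// i.e. [a, b] becomes [[a], [b]].
function toColumn(values) {
  return values.map((v) => [v]);
}

const sample = "Invoice Date: 01/02/2020\nTotal: 10\nInvoice Date: 03/04/2020";
const dates = findInvoiceDates(sample); // ["01/02/2020", "03/04/2020"]
const rows = toColumn(dates);           // [["01/02/2020"], ["03/04/2020"]]
```

In Apps Script, a non-empty `rows` array has the two-dimensional shape that `Range.setValues` expects, so something like `sheet.getRange(3, 1, rows.length, 1).setValues(rows)` could write the whole column in one call instead of one `setValue` per cell.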


Solution 1:[1]

The limitation is with the Drive.Files.insert function.

Once you have the blob, you can fetch its contents as a string, but the string also contains MIME details, so you may need to process it. Sample code is below; modify it as needed.

var blob =  DriveApp.getFileById(fileID).getBlob()
var txt = blob.getDataAsString()
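
Before building on `getDataAsString()`, it may help to check what the returned string actually contains. The following is a minimal sketch under the assumption that the raw string is passed in as `raw`; `inspectRawPdf` is a hypothetical helper, not part of any Google API. PDF files often store page text in compressed streams, and scanned PDFs store images, so the marker may not appear in the raw string at all. This check only reports whether it does.

```javascript
// Hypothetical helper: reports whether a raw string starts with the PDF
// header and how many times a marker (e.g. "Invoice Date:") occurs in it.
function inspectRawPdf(raw, marker) {
  const looksLikePdf = raw.startsWith("%PDF-");
  // Guard against an empty marker, which would split between every character.
  const markerCount = marker ? raw.split(marker).length - 1 : 0;
  return { looksLikePdf, markerCount };
}

// Stand-in string for illustration only; real input would be `txt` from above.
const fakeRaw = "%PDF-1.4\nInvoice Date: a\nInvoice Date: b";
const report = inspectRawPdf(fakeRaw, "Invoice Date:");
```

If `markerCount` stays at zero for a real file, the text is probably not stored in readable form, and the string would need further processing before any pattern matching.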

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1