'How do I replace a string in a PDF file using NodeJS?
I have a template PDF file, and I want to replace some marker strings to generate new PDF files and save them. What's the best/simplest way to do this? I don't need to add graphics or anything fancy, just a simple text replacement, so I don't want anything too complicated.
Thanks!
Edit: Just found HummusJS, I'll see if I can make progress and post it here.
Solution 1:[1]
I found this question by searching, so I think it deserves the answer. I found the answer by BrighTide here: https://github.com/galkahana/HummusJS/issues/71#issuecomment-275956347
Basically, there is this very powerful Hummus package which uses library written in C++ (crossplatform of course). I think the answer given in that github comment can be functionalized like this:
var hummus = require('hummus');
/**
* Returns a byteArray string
*
* @param {string} str - input string
*/
function strToByteArray(str) {
var myBuffer = [];
var buffer = new Buffer(str);
for (var i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
return myBuffer;
}
function replaceText(sourceFile, targetFile, pageNumber, findText, replaceText) {
var writer = hummus.createWriterToModify(sourceFile, {
modifiedFilePath: targetFile
});
var sourceParser = writer.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
var pageObject = sourceParser.parsePage(pageNumber);
var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();
var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');
//read the original block of text data
var data = [];
var readStream = sourceParser.startReadingFromStream(textStream);
while(readStream.notEnded()){
Array.prototype.push.apply(data, readStream.read(10000));
}
var string = new Buffer(data).toString().replace(findText, replaceText);
//Create and write our new text object
var objectsContext = writer.getObjectsContext();
objectsContext.startModifiedIndirectObject(textObjectId);
var stream = objectsContext.startUnfilteredPDFStream();
stream.getWriteStream().write(strToByteArray(string));
objectsContext.endPDFStream(stream);
objectsContext.endIndirectObject();
writer.end();
}
// replaceText('source.pdf', 'output.pdf', 0, /REPLACEME/g, 'My New Custom Text');
UPDATE:
The version used at the time of writing an example was 1.0.83
, things might change recently.
UPDATE 2:
Recently I got an issue with another PDF file which had a different font. For some reason the text got split into small chunks, i.e. string QWERTYUIOPASDFGHJKLZXCVBNM1234567890-
got represented as -286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-)
I had no idea what else to do rather than make up a regex.. So instead of this one line:
var string = new Buffer(data).toString().replace(findText, replaceText);
I have something like this now:
var string = Buffer.from(data).toString();
var characters = REPLACE_ME;
var match = [];
for (var a = 0; a < characters.length; a++) {
match.push('(-?[0-9]+)?(\\()?' + characters[a] + '(\\))?');
}
string = string.replace(new RegExp(match.join('')), function(m, m1) {
// m1 holds the first item which is a space
return m1 + '( ' + REPLACE_WITH_THIS + ')';
});
Solution 2:[2]
Building on Alex's (and other's) solution, I noticed an issue where some non-text data were becoming corrupted. I tracked this down to encoding/decoding the PDF text as utf-8 instead of as a binary string. Anyways here's a modified solution that:
- Avoids corrupting non-text data
- Uses streams instead of files
- Allows multiple patterns/replacements
- Uses the MuhammaraJS package which is a maintained fork of HummusJS (should be able to swap in HummusJS just fine as well)
- Is written in TypeScript (feel free to remove the types for JS)
import muhammara from "muhammara";
interface Pattern {
searchValue: RegExp | string;
replaceValue: string;
}
/**
* Modify a PDF by replacing text in it
*/
const modifyPdf = ({
sourceStream,
targetStream,
patterns,
}: {
sourceStream: muhammara.ReadStream;
targetStream: muhammara.WriteStream;
patterns: Pattern[];
}): void => {
const modPdfWriter = muhammara.createWriterToModify(sourceStream, targetStream, { compress: false });
const numPages = modPdfWriter
.createPDFCopyingContextForModifiedFile()
.getSourceDocumentParser()
.getPagesCount();
for (let page = 0; page < numPages; page++) {
const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
const objectsContext = modPdfWriter.getObjectsContext();
const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
const textStream = copyingContext
.getSourceDocumentParser()
.queryDictionaryObject(pageObject.getDictionary(), "Contents");
const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();
let data: number[] = [];
const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
while (readStream.notEnded()) {
const readData = readStream.read(10000);
data = data.concat(readData);
}
const pdfPageAsString = Buffer.from(data).toString("binary"); // key change 1
let modifiedPdfPageAsString = pdfPageAsString;
for (const pattern of patterns) {
modifiedPdfPageAsString = modifiedPdfPageAsString.replaceAll(pattern.searchValue, pattern.replaceValue);
}
// Create what will become our new text object
objectsContext.startModifiedIndirectObject(textObjectID);
const stream = objectsContext.startUnfilteredPDFStream();
stream.getWriteStream().write(strToByteArray(modifiedPdfPageAsString));
objectsContext.endPDFStream(stream);
objectsContext.endIndirectObject();
}
modPdfWriter.end();
};
/**
* Create a byte array from a string, as muhammara expects
*/
const strToByteArray = (str: string): number[] => {
const myBuffer = [];
const buffer = Buffer.from(str, "binary"); // key change 2
for (let i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
return myBuffer;
};
And then to use it:
/**
* Fill a PDF with template data
*/
export const fillPdf = async (sourceBuffer: Buffer): Promise<Buffer> => {
const sourceStream = new muhammara.PDFRStreamForBuffer(sourceBuffer);
const targetStream = new muhammara.PDFWStreamForBuffer();
modifyPdf({
sourceStream,
targetStream,
patterns: [{ searchValue: "home", replaceValue: "emoh" }], // TODO use actual patterns
});
return targetStream.buffer;
};
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | miguelmorin |
Solution 2 |