Detect and crop a box in .pdf or image as individual images
I have a multi-page .pdf (scanned images) containing handwriting that I would like to crop and store as new, separate images. For example, in the visual below I would like to extract the handwriting inside the 2 boxes as separate images. How can I do this automatically for a large, multi-page .pdf using Python?
I tried using the PyPDF2 package to crop one of the handwriting boxes based on (x,y) coordinates. However, this approach doesn't work for me because the boundaries/coordinates of the handwriting boxes won't always be the same on each page of the pdf. I believe detecting the boxes would be a better approach for auto-cropping. Not sure if it's useful, but below is the code I used for the (x,y) coordinate approach:
from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("data/samples.pdf")
writer = PdfFileWriter()

# Loop through all pages in the pdf and crop each one to the same (x,y) coordinates
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    page.cropBox.setLowerLeft((42, 115))
    page.cropBox.setUpperRight((500, 245))
    writer.addPage(page)

with open("samples_cropped.pdf", "wb") as fp:
    writer.write(fp)
Thank you in advance for your help
Solution 1:[1]
Here's a simple approach using OpenCV:
- Convert image to grayscale and Gaussian blur
- Threshold image
- Find contours
- Iterate through contours and filter using contour area
- Extract ROI
After extracting the ROI, you can save each as a separate image and then perform OCR text extraction using pytesseract or some other tool.
You mention this:
> The boundaries/coordinates of the handwriting boxes won't always be the same for each page in the pdf.
Your current approach of using fixed (x,y) coordinates isn't very robust, since the boxes could be anywhere on the image. A better approach is to detect the boxes by filtering contours with a minimum area threshold. Depending on how small or large a box you want to detect, you can adjust the min_area variable. If you want additional filtering to prevent false positives, you can add aspect ratio as another filter: calculate the aspect ratio (width divided by height) of each contour's bounding rectangle, and if it is within bounds (say 0.8 to 1.2 for a roughly square ROI), treat it as a valid box.
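The two filters described above could be combined into a small predicate. This is only a sketch: the name `is_box_candidate` and its `ratio_bounds` parameter are hypothetical, and the default bounds are the illustrative 0.8 to 1.2 range from the text, which would need widening for wide handwriting boxes.

```python
def is_box_candidate(w, h, area, min_area=10000, ratio_bounds=(0.8, 1.2)):
    """Return True if a contour passes both the area and aspect-ratio filters.

    w, h -- width and height of the contour's bounding rectangle
    area -- the contour area (e.g. from cv2.contourArea)
    min_area and ratio_bounds are tuning values, not fixed constants.
    """
    if h == 0 or area <= min_area:
        return False
    aspect_ratio = w / float(h)
    low, high = ratio_bounds
    return low <= aspect_ratio <= high


# A large, nearly square box passes both filters.
print(is_box_candidate(300, 280, 300 * 280))  # True
# A long thin line is large enough but fails the aspect-ratio check.
print(is_box_candidate(900, 12, 900 * 12))    # False
# A small speck fails the area check.
print(is_box_candidate(40, 40, 40 * 40))      # False
```

In a contour loop, `w` and `h` would come from `cv2.boundingRect(c)` and `area` from `cv2.contourArea(c)`.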
import cv2

# Load the image, convert to grayscale, blur, and threshold
image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
thresh = cv2.threshold(blurred, 230, 255, cv2.THRESH_BINARY_INV)[1]

# Find contours (the return signature differs between OpenCV versions)
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Iterate through contours and filter for ROI
image_number = 0
min_area = 10000
for c in cnts:
    area = cv2.contourArea(c)
    if area > min_area:
        x, y, w, h = cv2.boundingRect(c)
        cv2.rectangle(image, (x, y), (x + w, y + h), (36, 255, 12), 2)
        ROI = original[y:y+h, x:x+w]
        cv2.imwrite("ROI_{}.png".format(image_number), ROI)
        image_number += 1

cv2.imshow('image', image)
cv2.waitKey(0)
Solution 2:[2]
Crop a pdf page or image into individual images using a defined bounding box
Using OpenCV to detect and crop the box can be overkill for small projects. If you already know the bounding area, the following code works well and saves the cropped image at the same resolution as the original pdf or image.
from PIL import Image

def ImageCrop():
    img = Image.open("page_1.jpg")
    # Known bounding box of the region to keep, in pixels
    left = 90
    top = 580
    right = 1600
    bottom = 2000
    img_res = img.crop((left, top, right, bottom))
    # Output path; the file name here is only an example
    outfile4 = "page_1_cropped.jpg"
    img_res.save(outfile4, 'JPEG')

ImageCrop()
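If the known box comes from PDF coordinates, like the cropBox values in the question, it has to be converted before it can be passed to Pillow's `crop`. PDF user space defaults to 72 units per inch with the origin at the bottom-left, while image pixels are counted from the top-left. The sketch below shows one way to do the conversion. The helper name `pdf_box_to_pixels`, the 792-point (US Letter) page height, and the 200 DPI render resolution are all assumptions for illustration, and it assumes the page origin is at (0, 0).

```python
def pdf_box_to_pixels(lower_left, upper_right, page_height_pt, dpi):
    """Convert a PDF-space box to a Pillow crop box (left, top, right, bottom)."""
    scale = dpi / 72.0  # PDF user space defaults to 72 units per inch
    x0, y0 = lower_left
    x1, y1 = upper_right
    left = round(x0 * scale)
    right = round(x1 * scale)
    # PDF y grows upward from the bottom edge; image y grows downward from the top
    top = round((page_height_pt - y1) * scale)
    bottom = round((page_height_pt - y0) * scale)
    return (left, top, right, bottom)


# The question's crop box, on an assumed 792 pt tall page rendered at 200 DPI
box = pdf_box_to_pixels((42, 115), (500, 245), page_height_pt=792, dpi=200)
print(box)  # (117, 1519, 1389, 1881)
```

The resulting tuple can be passed directly to `img.crop(box)`, provided the page image was rendered at the same DPI.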
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | |