Django PDF parser on POST request
I hope you can help, because I have been struggling with this problem for a long time. A POST request comes from the frontend with a single PDF file. I need to take a screenshot of its first page, extract its metadata, and save it all in the database. At the moment I have overridden the POST method and am intercepting the request payload that contains the PDF. I pass this file to my parsing function, but apart from the file name, the function cannot find the file. What could be the problem?
view
import fitz  # PyMuPDF
from rest_framework.views import APIView
from rest_framework.parsers import MultiPartParser, FormParser
from rest_framework.response import Response
from rest_framework import status
from .serializers import FileSerializer

def parce_pdf(test):
    doc = fitz.open(test)  # open document
    pixel = doc[0]  # page pdf
    pix = pixel.get_pixmap()  # render page to an image
    pix.save("media/page.png")  # store image as a PNG
    print(doc.metadata)

class FileView(APIView):
    parser_classes = (MultiPartParser, FormParser)

    def post(self, request, *args, **kwargs):
        file_serializer = FileSerializer(data=request.data)
        # test = request.data['file'].content_type
        test = request.data['file']
        # print(request.data.get)
        print(test)
        print(request.accepted_media_type)
        parce_pdf(test)
        print(file_serializer)
        # print(test)
        if file_serializer.is_valid():
            file_serializer.save()
            return Response(file_serializer.data, status=status.HTTP_201_CREATED)
        else:
            return Response(file_serializer.errors, status=status.HTTP_400_BAD_REQUEST)
serializer
from rest_framework import serializers
from .models import File

class FileSerializer(serializers.ModelSerializer):
    class Meta:
        model = File
        fields = ('file', 'remark', 'timestamp')
models
from django.db import models

class File(models.Model):
    file = models.FileField(blank=False, null=False)
    remark = models.CharField(max_length=20)
    timestamp = models.DateTimeField(auto_now_add=True)
Solution 1:[1]
The problem is that you are passing only the file name from the front end. It should be either a file object or a stream.
Try this after changing the front-end code.
def parce_pdf(test):  # test is a file stream object
    doc = fitz.open('pdf', test)  # open document
    pixel = doc[0]  # page pdf
    pix = pixel.get_pixmap()  # render page to an image
    pix.save("media/page.png")  # store image as a PNG
    print(doc.metadata)

# method of FileView
def post(self, request, *args, **kwargs):
    file_serializer = FileSerializer(data=request.data)
    # test = request.data['file'].content_type
    test = request.FILES['file']  # send file object from front-end
    print(test)
    print(request.accepted_media_type)
    parce_pdf(test)
    print(file_serializer)
    # print(test)
    if file_serializer.is_valid():
        file_serializer.save()
        return Response(file_serializer.data, status=status.HTTP_201_CREATED)
    else:
        return Response(file_serializer.errors, status=status.HTTP_400_BAD_REQUEST)
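The difference between a file name and a file stream can be seen without Django or PyMuPDF. The following is only a rough sketch under stated assumptions: `read_upload_bytes` is a hypothetical helper, and `io.BytesIO` stands in for the uploaded file object that `request.FILES['file']` would provide. It reads the raw bytes, checks for the usual `%PDF-` header, and rewinds the stream so that later consumers still see the whole file.

```python
import io

PDF_MAGIC = b"%PDF-"  # PDF files normally begin with this marker


def read_upload_bytes(upload):
    """Return the full content of a file-like upload as bytes.

    `upload` is assumed to be any object with read() and seek(),
    such as the value of request.FILES['file'] in a Django view.
    """
    upload.seek(0)        # start from the beginning, whatever was read before
    data = upload.read()  # the actual file content, not just its name
    upload.seek(0)        # rewind so later readers see the whole file
    if not data.startswith(PDF_MAGIC):
        raise ValueError("uploaded data does not look like a PDF")
    return data


# io.BytesIO stands in for the uploaded file object here.
fake_upload = io.BytesIO(b"%PDF-1.4\n% placeholder content, not a valid document\n")
pdf_bytes = read_upload_bytes(fake_upload)
print(len(pdf_bytes), fake_upload.tell())
```

Printing the upload object, as the view does, shows only its name; the content becomes available only through `read()`.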
Solution 2:[2]
Thank you, Nanthakumar J J, for your answer. I have just a small update to the parce_pdf function: because test is a file stream object, the line doc = fitz.open('pdf', test) produced an error.
The updated function is:
def parce_pdf(test):
    doc = fitz.open(stream=test.read(), filetype="pdf")  # this line is important
    raw_text = ""
    for page in doc:
        # get_text() is the current name; older PyMuPDF releases spelled it getText()
        raw_text = raw_text + str(page.get_text())
    return raw_text
This function is working very well in my Django project.
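The original question asked for a screenshot of the first page plus the metadata rather than the text. A minimal sketch that combines the stream-based `fitz.open` call with the rendering calls from the question might look like the following. The function name `render_first_page`, its `png_path` parameter, and the header check are assumptions made for illustration, and PyMuPDF must be installed for the rendering path to run.

```python
def render_first_page(pdf_bytes, png_path="media/page.png"):
    """Save page 1 of a PDF as a PNG and return the document metadata.

    Hypothetical helper: `pdf_bytes` is the raw upload content, e.g. the
    result of request.FILES['file'].read() in the view.
    """
    if not pdf_bytes.startswith(b"%PDF-"):
        # Fail early with a clear error instead of a parser exception.
        raise ValueError("expected PDF data")

    import fitz  # PyMuPDF; imported lazily so the check above needs no dependency

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")  # open from memory
    try:
        page = doc[0]                      # first page
        pix = page.get_pixmap()            # render page to an image
        pix.save(png_path)                 # store image as a PNG
        return dict(doc.metadata or {})    # copy metadata before closing
    finally:
        doc.close()
```

In the view, this could be called as `metadata = render_first_page(request.FILES['file'].read())` before the serializer is validated; rewinding the upload with `seek(0)` afterwards is a cheap precaution so the saved file is complete.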
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Nanthakumar J J |
Solution 2 | MEHEDI BIN HAFIZ |