Django PDF parser on POST request

I would appreciate your help, because I have been struggling with this problem for a long time. A POST request arrives from the frontend with one PDF file. I need to take a screenshot of its first page, extract its metadata, and save both in the database. At the moment I have overridden the post method and intercept the request data that contains the PDF, then pass the file to my parsing function. However, apart from the file name, the function cannot find the file. What could be the problem?

view

import fitz  # PyMuPDF
from rest_framework.views import APIView
from rest_framework.parsers import MultiPartParser, FormParser
from rest_framework.response import Response
from rest_framework import status
from .serializers import FileSerializer


def parce_pdf(test):
    doc = fitz.open(test)  # open document
    pixel = doc[0]  # page pdf
    pix = pixel.get_pixmap()  # render page to an image
    pix.save("media/page.png")  # store image as a PNG
    print(doc.metadata)


class FileView(APIView):
    parser_classes = (MultiPartParser, FormParser)

    def post(self, request, *args, **kwargs):
        file_serializer = FileSerializer(data=request.data)
        # test = request.data['file'].content_type
        test = request.data['file']
        # print(request.data.get)
        print(test)
        print(request.accepted_media_type)

        parce_pdf(test)
        print(file_serializer)
        # print(test)
        if file_serializer.is_valid():
            file_serializer.save()
            return Response(file_serializer.data, status=status.HTTP_201_CREATED)
        else:
            return Response(file_serializer.errors, status=status.HTTP_400_BAD_REQUEST)

serializer

from rest_framework import serializers
from .models import File


class FileSerializer(serializers.ModelSerializer):
    class Meta():
        model = File
        fields = ('file', 'remark', 'timestamp')

models

from django.db import models


class File(models.Model):
    file = models.FileField(blank=False, null=False)
    remark = models.CharField(max_length=20)
    timestamp = models.DateTimeField(auto_now_add=True)




Solution 1:[1]

The problem is that only the file name is being passed from the front end, not the file content. It should be either a file object or a stream.

Try this after changing the front-end code so that it sends the file itself.

def parce_pdf(test): # test is a file stream object
    doc = fitz.open('pdf', test)  # open document
    pixel = doc[0]  # page pdf
    pix = pixel.get_pixmap()  # render page to an image
    pix.save("media/page.png")  # store image as a PNG
    print(doc.metadata)


def post(self, request, *args, **kwargs):
    file_serializer = FileSerializer(data=request.data)
    # test = request.data['file'].content_type
    test = request.FILES['file'] # send file object from front-end
    print(test)
    print(request.accepted_media_type)
    parce_pdf(test)
    print(file_serializer)
    # print(test)
    if file_serializer.is_valid():
        file_serializer.save()
        return Response(file_serializer.data, status=status.HTTP_201_CREATED)
    else:
        return Response(file_serializer.errors, status=status.HTTP_400_BAD_REQUEST)
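
As a rough illustration of the difference between a file name and a file stream, the sketch below uses a stand-in class (FakeUpload, hypothetical, not part of Django or DRF). It prints as its name but carries its content in the stream, which is assumed here to resemble how an uploaded file object behaves. It also shows that a stream is exhausted after one read() and has to be rewound with seek(0) before it can be read again.

```python
import io


class FakeUpload(io.BytesIO):
    """Hypothetical stand-in for an uploaded file: a byte stream with a name."""

    def __init__(self, name, content):
        super().__init__(content)
        self.name = name

    def __str__(self):
        # Printing the object shows only the name, not the content.
        return self.name


def read_all_and_rewind(upload):
    """Return the whole stream as bytes and leave it positioned at the start."""
    upload.seek(0)
    data = upload.read()
    upload.seek(0)  # rewind so later code (e.g. a serializer) can read it again
    return data


upload = FakeUpload("report.pdf", b"%PDF-1.7\n% placeholder bytes, not a real PDF\n")
print(str(upload))       # the name only
first = upload.read()    # the actual bytes
second = upload.read()   # empty: the stream is now at its end
data = read_all_and_rewind(upload)
```

The point for the view is that seeing a file name in print(test) does not mean only a name was received; the content may still be available through read().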

Solution 2:[2]

Thank you, Nanthakumar J J, for your answer. I have just a small update to the parce_pdf function: because test is a file stream object, the line doc = fitz.open('pdf', test) produced an error. The updated function is:

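def parce_pdf(test):
    # test is a file stream object, so its bytes are passed via stream=
    doc = fitz.open(stream=test.read(), filetype="pdf")  # this line is important
    raw_text = ""
    for page in doc:
        raw_text = raw_text + page.get_text()  # getText() in older PyMuPDF releases
    return raw_text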

This function is working very well in my Django project.
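
To tie this back to the original goal (a screenshot of the first page plus the metadata), here is a minimal sketch that works from the uploaded bytes rather than a path. The helper names looks_like_pdf and render_first_page are made up for illustration, and the check for the %PDF- header is only a cheap sanity test, not real validation. It assumes a PyMuPDF version that provides get_pixmap() and Pixmap.tobytes().

```python
def looks_like_pdf(data):
    """Cheap sanity check: PDF files start with the marker %PDF-."""
    return data[:5] == b"%PDF-"


def render_first_page(data):
    """Return (png_bytes, metadata) for the first page of a PDF given as bytes."""
    if not looks_like_pdf(data):
        raise ValueError("uploaded data does not look like a PDF")

    import fitz  # PyMuPDF; imported late so the header check works without it

    doc = fitz.open(stream=data, filetype="pdf")  # open from memory, not from a path
    try:
        pix = doc[0].get_pixmap()  # render the first page to an image
        return pix.tobytes("png"), doc.metadata
    finally:
        doc.close()
```

In the view this might be called as png, meta = render_first_page(request.FILES['file'].read()). How the PNG and the metadata are then stored depends on the model and is left open here.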

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nanthakumar J J
Solution 2 MEHEDI BIN HAFIZ