Scrapy: How to output items in a specific JSON format

I output the scraped data in JSON format. The default Scrapy exporter outputs a list of dicts, like this:

[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]

But I want to export the data in a specific format like this:

{
"Shop Name":"Shop 1",
"Location":"XXXXXXXXX",
"Contact":"XXXX-XXXXX",
"Products":
[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]
}

Please advise me of any solution. Thank you.



Solution 1:[1]

This is well documented in the Scrapy documentation on item exporters.

from scrapy.exporters import JsonItemExporter


class ItemPipeline(object):

    file = None

    def open_spider(self, spider):
        # JsonItemExporter writes bytes, so open the file in binary mode
        self.file = open('item.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

This will create a JSON file containing your items.
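The exporter alone does not change the overall shape of the file, though. To produce the nested format from the question, one option is a plain pipeline that buffers the items and wraps them in a shop-level object when the spider closes. This is a minimal sketch; the shop metadata values below are placeholders, and in practice they would come from the scraped page:

```python
import json


class NestedJsonPipeline:
    """Buffer scraped items, then write one shop-level JSON object."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # The shop metadata values here are placeholders; replace them
        # with values your spider actually scrapes.
        output = {
            "Shop Name": "Shop 1",
            "Location": "XXXXXXXXX",
            "Contact": "XXXX-XXXXX",
            "Products": self.items,
        }
        with open("items.json", "w", encoding="utf-8") as f:
            json.dump(output, f, indent=4)
```

Because the whole object is serialized in one call at `close_spider`, the output is always valid JSON, at the cost of holding all items in memory until the crawl finishes.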

Solution 2:[2]

I was trying to export pretty printed JSON and this is what worked for me.

I created a pipeline that looked like this:

import json


class JsonPipeline(object):

    def open_spider(self, spider):
        # Text mode: json.dumps returns str, not bytes
        self.file = open('your_file_name.json', 'w')
        self.file.write('[')
        self.first_item = True

    def close_spider(self, spider):
        self.file.write(']')
        self.file.close()

    def process_item(self, item, spider):
        # Write a separator before every item except the first,
        # so the file stays valid JSON (no trailing comma)
        if not self.first_item:
            self.file.write(',\n')
        self.first_item = False

        self.file.write(json.dumps(
            dict(item),
            sort_keys=True,
            indent=4,
            separators=(',', ': ')
        ))
        return item

It's similar to the example in the Scrapy docs (https://doc.scrapy.org/en/latest/topics/item-pipeline.html), except that it prints each JSON property indented and on its own line.

See the section on pretty printing at https://docs.python.org/2/library/json.html
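As a standalone illustration of the pretty-printing options used in the pipeline above:

```python
import json

item = {"Product Name": "Product1", "Price": "20.5"}

# indent=4 puts each key on its own line, sort_keys orders the keys
# alphabetically, and separators controls the punctuation spacing.
line = json.dumps(item, sort_keys=True, indent=4, separators=(',', ': '))
print(line)
# {
#     "Price": "20.5",
#     "Product Name": "Product1"
# }
```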

Solution 3:[3]

There is one more possible solution: generate the spider's output as JSON directly from the command line. (The -a flag passes an optional spider argument and is not required for the export; note that -o appends to the output file if it already exists.)

scrapy crawl "name_of_your_spider" -a NAME_OF_ANY_ARGUMENT=VALUE_OF_THE_ARGUMENT -o output_data.json

Solution 4:[4]

Another way to export the scraped/crawled output from a Scrapy spider as JSON is to enable feed exports, a built-in Scrapy capability that can be turned on or off as required. You can do this by defining custom_settings on the spider, which overrides the project-wide settings for that specific spider.

So, for any spider named 'sample_spider':

class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    allowed_domains = []

    custom_settings = {
        'FEED_URI': 'sample_spider_crawled_data.json',
        'FEED_FORMAT': 'json',
        'FEED_EXPORTERS': {
            'json': 'scrapy.exporters.JsonItemExporter',
        },
        'FEED_EXPORT_ENCODING': 'utf-8',
    }
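As a side note, FEED_URI and FEED_FORMAT were deprecated in Scrapy 2.1 in favour of a single FEEDS setting. Assuming Scrapy >= 2.1, the equivalent custom_settings would look roughly like this (a sketch, not tested against every Scrapy release):

```python
# Assumes Scrapy >= 2.1, where the FEEDS dict replaces the
# deprecated FEED_URI / FEED_FORMAT settings.
custom_settings = {
    'FEEDS': {
        'sample_spider_crawled_data.json': {
            'format': 'json',
            'encoding': 'utf-8',
            'indent': 4,
        },
    },
}
```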

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Gocht
Solution 2: Max
Solution 3: dhruv sharma
Solution 4: dhruv sharma