'Search and filter text from a column using Pyspark

I am new to Data Scraping. I am reading the data from a file having JSON objects as one row

 {"name": "Soul Sweet \u2018Taters (Step-by-Step!)", "ingredients": "4 whole Medium Sweet Potatoes\n1 cup Sugar\n1 cup Milk\n2 whole Eggs\n1 teaspoon Vanilla Extract\n1 teaspoon Salt\n1 cup Brown Sugar\n1 cup Pecans\n1/2 cup Flour\n3/4 sticks Butter", "cookTime": "PT30M","prepTime": "PT45M"}
        
 {"name": "Cranberry-Pomegranate Sauce", "ingredients": "1 bag (about 12 To 16 Oz) Fresh Cranberries\n16 ounces, fluid Pomegranate Juice\n3/4 cups Sugar, More Or Less To Taste", "cookTime": "PT15M","prepTime": "PT2M"}
            
 {"name": "Whiskey Maple Cream Sauce", "ingredients": "1-1/2 cup Heavy Cream\n5 Tablespoons Pure Maple Syrup\n3 Tablespoons Light Corn Syrup\n1 Tablespoon Whiskey (can Add More If Desired)","cookTime": "PT15M","prepTime": "PT5M"}

I am looking for assistance on below:

  1. Filter rows that contain text sugar in the Ingredients column.
  2. convert ISO prep time and ISO cook time to human-readable format to add a new column total time(prep time+cook time) for filtered rows.The ISO time is in minutes and even in Hours

Expected Output

name total_cook_time
Soul Sweet \u2018Taters (Step-by-Step!) 75M
Cranberry-Pomegranate Sauce 17M

My sample code

import json
from datetime import datetime
currentdate=datetime.today().strftime('%Y/%m/%d')
absolutepath='project/sniper/'+'/'+currentdate+'/*.json' 
new_data = []
totaltime = {}
data = [json.loads(line) for line in open('absolutepath', 'r')]
for d in data:
   if 'sugar' in d.get('ingredients').lower() # case senstitive
         new_data.append(d.get('name'))
         total_time = int(d['prepTime'].replace('PT', '').replace('M','')) + int(d['cookTime'].replace('PT', '').replace('M',''))
         totaltime[d['name']] = total_time


Solution 1:[1]

This is one way to do it, as long as the durations are the same format. Otherwise more logic and/or regex could be friendlier. Assumes you are working with a list of recipes called 'data'.

The new_data list holds recipes with Sugar and totaltime dictionary is the recipe name total time.

new_data = []
totaltime = {}
for d in data:
     # print(d.get('ingredients')
     if 'Sugar' in d.get('ingredients'): # case senstitive
         new_data.append(d.get('ingredients'))
         total_time = int(d['prepTime'].replace('PT', '').replace('M','')) + int(d['cookTime'].replace('PT', '').replace('M',''))
         totaltime[d['name']] = total_time

totaltime
{'Soul Sweet ‘Taters (Step-by-Step!)': 75, 'Cranberry-Pomegranate Sauce': 17}

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jonathan Leon