'Search and filter text from a column using Pyspark
I am new to Data Scraping. I am reading the data from a file having JSON objects as one row
{"name": "Soul Sweet \u2018Taters (Step-by-Step!)", "ingredients": "4 whole Medium Sweet Potatoes\n1 cup Sugar\n1 cup Milk\n2 whole Eggs\n1 teaspoon Vanilla Extract\n1 teaspoon Salt\n1 cup Brown Sugar\n1 cup Pecans\n1/2 cup Flour\n3/4 sticks Butter", "cookTime": "PT30M","prepTime": "PT45M"}
{"name": "Cranberry-Pomegranate Sauce", "ingredients": "1 bag (about 12 To 16 Oz) Fresh Cranberries\n16 ounces, fluid Pomegranate Juice\n3/4 cups Sugar, More Or Less To Taste", "cookTime": "PT15M","prepTime": "PT2M"}
{"name": "Whiskey Maple Cream Sauce", "ingredients": "1-1/2 cup Heavy Cream\n5 Tablespoons Pure Maple Syrup\n3 Tablespoons Light Corn Syrup\n1 Tablespoon Whiskey (can Add More If Desired)","cookTime": "PT15M","prepTime": "PT5M"}
I am looking for assistance on below:
- Filter rows that contain text
sugar
in theIngredients
column. - convert ISO prep time and ISO cook time to human-readable format to add a new column total time(prep time+cook time) for filtered rows.The ISO time is in minutes and even in Hours
Expected Output
name | total_cook_time |
---|---|
Soul Sweet \u2018Taters (Step-by-Step!) | 75M |
Cranberry-Pomegranate Sauce | 17M |
My sample code
import json
from datetime import datetime
currentdate=datetime.today().strftime('%Y/%m/%d')
absolutepath='project/sniper/'+'/'+currentdate+'/*.json'
new_data = []
totaltime = {}
data = [json.loads(line) for line in open('absolutepath', 'r')]
for d in data:
if 'sugar' in d.get('ingredients').lower() # case senstitive
new_data.append(d.get('name'))
total_time = int(d['prepTime'].replace('PT', '').replace('M','')) + int(d['cookTime'].replace('PT', '').replace('M',''))
totaltime[d['name']] = total_time
Solution 1:[1]
This is one way to do it, as long as the durations are the same format. Otherwise more logic and/or regex could be friendlier. Assumes you are working with a list of recipes called 'data'.
The new_data list holds recipes with Sugar and totaltime dictionary is the recipe name total time.
new_data = []
totaltime = {}
for d in data:
# print(d.get('ingredients')
if 'Sugar' in d.get('ingredients'): # case senstitive
new_data.append(d.get('ingredients'))
total_time = int(d['prepTime'].replace('PT', '').replace('M','')) + int(d['cookTime'].replace('PT', '').replace('M',''))
totaltime[d['name']] = total_time
totaltime
{'Soul Sweet ‘Taters (Step-by-Step!)': 75, 'Cranberry-Pomegranate Sauce': 17}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Jonathan Leon |