How to extract data from grib files in AWS without downloading?
I'm looking to access a grib file to extract parameters (such as temperature, etc.) from within the cloud without ever having to store the file locally. I've heard this can be done with the cfgrib API, but I can't find any example documentation (I checked the source documentation here, but it doesn't include anything about accessing files in the cloud).
From experience working with pygrib, I know that the API reads a grib file in as a bytes representation, and cfgrib appears to handle it similarly. After some research and trial and error, I've come up with this code that tries to read a byte string representation of the file:
import boto3
import boto
from botocore.config import Config
from botocore import UNSIGNED
import pygrib
import cfgrib

if __name__ == '__main__':
    # Define boto config
    my_config = Config(
        signature_version=UNSIGNED,
        retries={
            'max_attempts': 10,
            'mode': 'standard'
        }
    )
    session = boto3.Session(profile_name='default')
    s3 = session.resource('s3')
    my_bucket = s3.Bucket('nbmdata')

    # Get a unique key for each file in s3
    file_keys = []
    for my_bucket_object in my_bucket.objects.all():
        file_keys.append(my_bucket_object.key)

    # Extract each file as a binary string (without downloading)
    grib_files = []
    for key in file_keys:
        s3 = boto.connect_s3()
        bucket = s3.lookup('bucket')  # Removed bucket name
        key = bucket.lookup(key)
        your_bytes = key.get_contents_as_string(headers={'Range': 'bytes=73-1024'})
        grib_files.append(your_bytes)

    # Interpret binary string into pygrib
    for grib_file in grib_files:
        grbs = pygrib.open(grib_file)
This appears to ALMOST work. I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 7: invalid continuation byte
I get the same error when I try to swap this out with cfgrib. What am I missing here?
Solution 1:[1]
I realize you are using the legacy boto library in order to use that particular get_contents_as_string method, but if you are just trying to get the bytes, will this work? I think that method is trying to decode as utf-8, so you may need to specify encoding=None to get a bytes array.
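A minimal sketch of that variant, assuming the legacy boto Key.get_contents_as_string signature (which accepts an encoding keyword) and the same key object and byte range as in the question:

    # Hedged sketch: with legacy boto, encoding=None should return the raw
    # bytes instead of attempting a text decode.
    your_bytes = key.get_contents_as_string(
        headers={'Range': 'bytes=73-1024'},  # same ranged read as the question
        encoding=None,                       # ask for raw bytes, not a str
    )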
But in boto3, I use this without decoding the file streams, and then print the first 50 bytes of each file.
import boto3

grib_files = []
for key in file_keys:
    response = boto3.client('s3').get_object(Bucket='nbmdata', Key=key)
    grib_files.append(response['Body'].read())

for grib in grib_files:
    print(grib[0:50])
b'GRIB\x00\x00\x00\x02\x00\x00\x00\x00\x00\x16\xa7\x7f\x00\x00\x00\x15\x01\x00\x07\x00\x0e\x01\x01\x01\x07\xe5\x05\x1b\x03\x00\x00\x00\x01\x00\x00\x00Q\x03\x00\x009$\xc5\x00\x00\x00'
b'GRIB\x00\x00\x00\x02\x00\x00\x00\x00\x00\x16\x8b\xa8\x00\x00\x00\x15\x01\x00\x07\x00\x0e\x01\x01\x01\x07\xe5\x05\x1b\x03\x00\x00\x00\x01\x00\x00\x00Q\x03\x00\x009$\xc5\x00\x00\x00'
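A quick way to confirm these payloads really are raw GRIB data (a small illustrative sketch; the byte layout is from the GRIB2 spec, not from the original answer):

    for grib in grib_files:
        # Section 0 of every GRIB message starts with the ASCII magic b'GRIB';
        # in GRIB2, byte offset 7 holds the edition number (2 in the output above).
        assert grib[:4] == b'GRIB', 'payload is not a GRIB message'
        print('GRIB edition:', grib[7])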
If you try to decode these with utf-8, it throws the same error you are receiving: the payload is raw binary GRIB data, not text. From here, I don't know how to decode and process it further, so maybe this helps?
Solution 2:[2]
Try something like this. I was using the GEFS data hosted on AWS instead, and it worked great. I believe the NBM data is also on AWS and can be found here: https://registry.opendata.aws/noaa-nbm/. No account should be needed, so it would just be a matter of changing the s3_object value to the path/filename of the file you want from https://noaa-nbm-pds.s3.amazonaws.com/index.html
import pygrib
import boto3
from botocore import UNSIGNED
from botocore.config import Config
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = 'noaa-nbm-pds'
s3_object = 'path/to/filename'
obj = s3.get_object(Bucket=bucket_name, Key=s3_object)['Body'].read()
grbs = pygrib.fromstring(obj)
# this should print: <class 'pygrib._pygrib.gribmessage'>
print(type(grbs))
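One caveat worth noting: pygrib.fromstring parses a single GRIB message from a bytes object. If the object you fetch holds a multi-message file, one option (a hedged sketch, and one that does briefly touch local disk via a temporary file that is deleted on close) is to bridge to pygrib.open, which expects a file path:

    import tempfile

    import pygrib

    # 'obj' is the bytes payload read from S3 above.
    with tempfile.NamedTemporaryFile(suffix='.grib2') as tmp:
        tmp.write(obj)
        tmp.flush()
        grbs = pygrib.open(tmp.name)  # iterate over every message in the file
        for grb in grbs:
            print(grb)
        grbs.close()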
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jonathan Leon |
| Solution 2 | Thomas Cannon |