How to efficiently extract time-series data from a netCDF file?

I want to extract time series from a single netCDF file. I need daily temperatures for more than 500 cities from 2004 to 2016; more precisely, three time series per city, one for each of three coordinate points.

The following program works, but it is very slow: it takes more than 8 hours to obtain the time series for one location. I have already tried splitting the coordinates into several CSV files and running the program separately on each file, but that is not very efficient. Maybe I should split the netCDF file (5 GB) into smaller files to reduce the time spent reading, but I don't know how to do that.

from netCDF4 import Dataset
from datetime import datetime
import pandas as pd
import os
import numpy as np


os.chdir('D:PATH/tmp/')

date_range = pd.date_range(start = "2004-01-01", end = "2016-12-31", freq ='D')

df = pd.DataFrame(0.0, columns = ['Temp1','Temp2','Temp3'], index = date_range)

cities = pd.read_csv(r'D:\PATH\cities_coordinates.csv', sep =',')


cities['NUTS_ID']= cities['NUTS_ID'].map(str)
  
for index, row in cities.iterrows():
    location = row['NUTS_ID']

    location_latitude1 = row['lat1']
    location_longitude1 = row['lon1']
   
    location_latitude2 = row['lat2']
    location_longitude2 = row['lon2']
   
    location_latitude3 = row['lat3']
    location_longitude3 = row['lon3']


    for day in date_range:
      
        data = Dataset("D:/PATH/temperature.nc",'r')
       
       
        # Store the lat and lon data of the netCDF file in variables
        lat = data.variables['latitude'][:]
        lon = data.variables['longitude'][:]
   
   
        # Squared difference between the specified lat, lon and the lat, lon of the netCDF
        sq_diff_lat1 = (lat - location_latitude1)**2
        sq_diff_lon1 = (lon - location_longitude1)**2
       
        sq_diff_lat2 = (lat - location_latitude2)**2
        sq_diff_lon2 = (lon - location_longitude2)**2
   
        sq_diff_lat3 = (lat - location_latitude3)**2
        sq_diff_lon3 = (lon - location_longitude3)**2
   

        # Identify the index of the min value for lat and lon
        min_index_lat1 = sq_diff_lat1.argmin()
        min_index_lon1 = sq_diff_lon1.argmin()
   
        min_index_lat2 = sq_diff_lat2.argmin()
        min_index_lon2 = sq_diff_lon2.argmin()
       
        min_index_lat3 = sq_diff_lat3.argmin()
        min_index_lon3 = sq_diff_lon3.argmin()
   
        # Accessing the temperature data
        tx = data.variables['tx']
       
        start = '2004-01-01'
        end = '2016-12-31'
        d_range = pd.date_range(start = start, end = end, freq='D')
       
        for t_index in np.arange(0, len(d_range)):
             print('Recording the value for: '+str(d_range[t_index]))
             # Single .loc call: chained indexing (df.loc[...][...]) may assign to a copy
             df.loc[d_range[t_index], 'Temp1'] = tx[t_index, min_index_lat1, min_index_lon1]
             df.loc[d_range[t_index], 'Temp2'] = tx[t_index, min_index_lat2, min_index_lon2]
             df.loc[d_range[t_index], 'Temp3'] = tx[t_index, min_index_lat3, min_index_lon3]
                      
    df.to_csv(location +'.csv')
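
Note that in the script above, the loop `for day in date_range` never uses `day`: it reopens the file and repeats the full extraction once per day, and the inner loop then reads one value at a time. A minimal sketch of a possible alternative follows, assuming `tx` is indexed as `[time, latitude, longitude]` as in the script. The helper names `nearest_index` and `extract_series` are hypothetical, and the sketch is illustrative rather than a tested replacement for the program.

```python
import numpy as np


def nearest_index(axis_values, target):
    """Return the index of the grid value closest to `target`."""
    return int(np.abs(np.asarray(axis_values) - target).argmin())


def extract_series(var, lat, lon, points):
    """Read one full time series per (lat, lon) point.

    `var` is anything indexable as var[time, lat, lon], for example a
    netCDF4 variable or a NumPy array (assumed layout, as in the script).
    Each point costs a single slice read instead of one read per day.
    """
    series = []
    for p_lat, p_lon in points:
        i = nearest_index(lat, p_lat)
        j = nearest_index(lon, p_lon)
        # netCDF4 may return masked arrays; replace masked cells with NaN
        # (assumes floating-point data).
        series.append(np.ma.filled(var[:, i, j], np.nan))
    # One column per point, one row per time step.
    return np.column_stack(series)
```

With netCDF4, `var` would be `data.variables['tx']` taken from a `Dataset` that is opened once, before the loop over cities, and `lat` and `lon` would be read once from `'latitude'` and `'longitude'`. The returned array could then be wrapped as `pd.DataFrame(out, index=date_range, columns=['Temp1', 'Temp2', 'Temp3'])` before calling `to_csv`.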

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
