
python - Applying numpy.polyfit to xarray Dataset

Does Xarray support numpy computation functions such as polyfit? Or is there an efficient way to apply such functions to datasets?

Example: I want to calculate the slope of a line fitted to two variables (Temperature and Height) in order to compute a lapse rate. I have a dataset (below) containing these two variables with dimensions (vertical, time, xgrid_0, ygrid_0).

<xarray.Dataset>
Dimensions:    (PressLev: 7, time: 48, xgrid_0: 685, ygrid_0: 485)
Coordinates:
    gridlat_0  (ygrid_0, xgrid_0) float32 44.6896 44.6956 44.7015 44.7075 ...
    gridlon_0  (ygrid_0, xgrid_0) float32 -129.906 -129.879 -129.851 ...
  * ygrid_0    (ygrid_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
  * xgrid_0    (xgrid_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
  * time       (time) datetime64[ns] 2016-08-15T01:00:00 2016-08-15T02:00:00 ...
  * PressLev   (PressLev) int64 0 1 2 3 4 5 6
Data variables:
    Temperature  (PressLev, time, ygrid_0, xgrid_0) float64 289.4 289.4 289.4 ...
    Height       (PressLev, time, ygrid_0, xgrid_0) float64 85.23 85.13 84.98 ...

If I extract the Temperature and Height for a given time, xgrid_0, and ygrid_0, I can use the numpy.polyfit function.

import numpy as np

# Quick way to make a DataArray of -9999 values with the correct (time, ygrid_0, xgrid_0) dims/coords
ds_LR = ds.Temperature.isel(PressLev=0) * 0 - 9999

for cts in np.arange(0, len(ds.time)):
    for cx in ds.xgrid_0.values:
        for cy in ds.ygrid_0.values:
            x_temp = ds.Temperature[:, cts, cy, cx]  # Vertical profile of air temperature
            y_hgt  = ds.Height[:, cts, cy, cx]       # Heights of those temperature values
            s      = np.polyfit(y_hgt, x_temp, 1)    # Fit a line to temperature vs. height
            ds_LR[cts, cy, cx] = s[0]                # Store the slope (first element)

But this is slow and inefficient. Any suggestions for a better approach?



1 Reply


As far as I can tell, this is becoming a pretty common question among xarray users (myself included), and it's closely related to this GitHub issue. Generally, a NumPy implementation of some function exists (in your case, np.polyfit()), but it's not clear how best to apply that calculation to every grid cell, possibly over multiple dimensions.

In a geoscience context, there are two main use cases: one with a simple solution and one that is more complicated.

(1) Easy case:

You have an xr.DataArray of temp, which is a function of (time, lat, lon), and you want to find the trend in time in each gridbox. The easiest way to do this is to stack the (lat, lon) coords into one new coord, group by that coord, and then use the .apply() method.

Inspired by this Gist from Ryan Abernathey: <3

import numpy as np
import xarray as xr

# Example data
da = xr.DataArray(np.random.randn(20, 180, 360),
                  dims=('time', 'lat', 'lon'),
                  coords={'time': np.linspace(0, 19, 20),
                          'lat': np.linspace(-90, 90, 180),
                          'lon': np.linspace(0, 359, 360)})

# define a function to compute a linear trend of a timeseries
def linear_trend(x):
    pf = np.polyfit(x.time, x, 1)
    # need to return an xr.DataArray for groupby
    return xr.DataArray(pf[0])

# stack lat and lon into a single dimension called allpoints
stacked = da.stack(allpoints=['lat','lon'])
# apply the function over allpoints to calculate the trend at each point
trend = stacked.groupby('allpoints').apply(linear_trend)
# unstack back to lat lon coordinates
trend_unstacked = trend.unstack('allpoints')
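For reference (this check isn't in the original Gist), the unstacked trend should come back as a 2-D field of slopes, one value per grid box:

print(trend_unstacked.dims)   # ('lat', 'lon')
print(trend_unstacked.shape)  # (180, 360)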

Downsides: This method becomes very slow for larger arrays and doesn't easily lend itself to other problems that may feel quite similar in essence. This leads us to...

(2) Harder case (and the OP's question):

You have an xr.Dataset with variables temp and height, each a function of (plev, time, lat, lon), and you want to find the regression of temp against height (the lapse rate) for each (time, lat, lon) point.

The easiest way around this is to use xr.apply_ufunc(), which gives you some degree of vectorization and dask-compatibility. (Speed!)

# Example DataArrays
da1 = xr.DataArray(np.random.randn(20, 20, 180, 360),
                   dims=('plev', 'time', 'lat', 'lon'),
                   coords={'plev': np.linspace(0, 19, 20),
                           'time': np.linspace(0, 19, 20),
                           'lat': np.linspace(-90, 90, 180),
                           'lon': np.linspace(0, 359, 360)})

# Create dataset
ds = xr.Dataset({'Temp': da1, 'Height': da1})

As before, we create a function to compute the linear trend we need:

def linear_trend(x, y):
    # with vectorize=True, x and y arrive here as 1-D NumPy arrays along 'plev'
    pf = np.polyfit(x, y, 1)
    # return only the slope of the fit
    return xr.DataArray(pf[0])

Now, we can use xr.apply_ufunc() to regress the two DataArrays of temp and height against each other along the plev dimension!

%%time
slopes = xr.apply_ufunc(linear_trend,
                        ds.Height, ds.Temp,
                        vectorize=True,
                        input_core_dims=[['plev'], ['plev']],  # reduce along 'plev'
                        )

However, this approach is also quite slow and, as before, won't scale up well for larger arrays.

CPU times: user 2min 44s, sys: 2.1 s, total: 2min 46s
Wall time: 2min 48s
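
As a quick sanity check (not part of the original answer), the returned slopes should keep the broadcast dimensions and drop the 'plev' core dimension:

print(slopes.dims)   # ('time', 'lat', 'lon')
print(slopes.shape)  # (20, 180, 360)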

Speed it up:

To speed up this calculation, we can convert our height and temp data to dask.arrays using xr.DataArray.chunk(). This splits the data into small, manageable chunks, which then lets us parallelize the calculation by passing dask='parallelized' to apply_ufunc().

N.B. You must be careful not to chunk along the dimension that you're applying the regression to!

dask_height = ds.Height.chunk({'lat':10, 'lon':10, 'time': 10})
dask_temp   = ds.Temp.chunk({'lat':10, 'lon':10, 'time': 10})
dask_height

<xarray.DataArray 'Height' (plev: 20, time: 20, lat: 180, lon: 360)>
dask.array<xarray-<this-array>, shape=(20, 20, 180, 360), dtype=float64, chunksize=(20, 10, 10, 10), chunktype=numpy.ndarray>
Coordinates:
  * plev     (plev) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  * lat      (lat) float64 -90.0 -88.99 -87.99 -86.98 ... 86.98 87.99 88.99 90.0
  * lon      (lon) int64 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359

Now, do the calculation again!

%%time
slopes_dask = xr.apply_ufunc(linear_trend,
                             dask_height, dask_temp,
                             vectorize=True,
                             dask='parallelized',
                             input_core_dims=[['plev'], ['plev']], # reduce along 'plev'
                             output_dtypes=['d'],
                             )
CPU times: user 6.55 ms, sys: 2.39 ms, total: 8.94 ms
Wall time: 9.24 ms

SIGNIFICANT SPEEDUP!

Hope this helps! I learnt a lot trying to answer it :)

Best

EDIT: As pointed out in the comments, to really compare the processing time between the dask and non-dask methods, you should use:

%%time
slopes_dask.compute()

which gives you a comparable wall time to the non-dask method.

However, it's important to point out that operating lazily on the data (i.e. not loading it until you absolutely need it) is much preferred when working with the kind of large datasets you encounter in climate analysis. So I'd still suggest the dask method: you can chain many different operations on the input array, each taking only a few ms, and then only at the end do you wait a few minutes to get your finished product out. :)
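
As a minimal sketch of that lazy workflow (the follow-on reductions here are purely illustrative, not from the original answer), you can keep chaining operations on slopes_dask and only call .compute() once at the very end:

# Each line only extends the dask task graph; nothing is computed yet.
mean_lapse_rate  = slopes_dask.mean('time')     # time-mean slope in each grid box
zonal_lapse_rate = mean_lapse_rate.mean('lon')  # zonal average of the time-mean slope

# Only this call triggers the (parallel) computation.
result = zonal_lapse_rate.compute()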

