pyXpcm: Ocean Profile Classification Model

https://zenodo.org/badge/DOI/10.5281/zenodo.3906236.svg

pyXpcm is a Python package to create and work with ocean Profile Classification Models; it consumes and produces Xarray objects. Xarray objects are N-D labeled arrays and datasets in Python.

An ocean Profile Classification Model automatically assembles ocean profiles into clusters according to the similarity of their vertical structure. The geospatial properties of these clusters can be used to address a large variety of oceanographic problems: front detection, water mass identification, natural region contouring (gyres, eddies), reference profile selection for QC validation, etc. The vertical structure of these clusters furthermore provides a highly synthetic representation of large ocean areas that can be used for dimensionality reduction and coherent intercomparison of ocean data (re)analyses or simulations.

Documentation

Getting Started

Overview

What is an ocean PCM?

An ocean PCM is a Profile Classification Model for ocean data, a statistical procedure to classify ocean vertical profiles into a finite set of “clusters”. Depending on the dataset, such clusters can show space/time coherence that can be used in many different ways to study the ocean.

Statistic method

It consists in conducting unsupervised classification (clustering) of vertical profiles of one or more ocean variables.

Each level of the vertical axis of each ocean variable is considered a feature. One ocean vertical profile, with its ocean variables, is considered a sample.
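
To make this sample/feature layout concrete, here is a minimal sketch of what clustering such a collection amounts to, using synthetic profiles and plain scikit-learn (this is illustrative only, not pyXpcm itself):

import numpy as np
from sklearn.mixture import GaussianMixture

n_profiles, n_levels = 500, 30   # 500 samples, each with 30 vertical levels (features)
rng = np.random.default_rng(0)
zshape = np.linspace(0, 1, n_levels)
# Two synthetic stratification regimes for a single variable (e.g. temperature)
warm = 20 - 10 * zshape + rng.normal(0, 0.5, (n_profiles // 2, n_levels))
cold = 10 - 5 * zshape + rng.normal(0, 0.5, (n_profiles // 2, n_levels))
X = np.vstack([warm, cold])      # plain 2D collection: (n_samples, n_features)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)          # one class label per profile
print(X.shape, np.bincount(labels))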

All the details of the Profile Classification Modelling (PCM) statistical methodology can be found in Maze et al, 2017.

Illustration

Given a collection of Argo temperature profiles in the North Atlantic, a PCM analysis is applied and produces an optimal set of 8 ocean temperature profile classes. The PCM clusters synthesize the structural information of heat distribution in the North Atlantic. Each cluster objectively defines an ocean region where the dynamics give rise to a unique vertical stratification pattern.

_images/graphical-abstract.png

Maze et al, 2017 applied it to the North Atlantic with Argo temperature data. Jones et al, 2019, later applied it to the Southern Ocean, also with Argo temperature data. Rosso et al (in prep) applied it to the Southern Indian Ocean using both temperature and salinity Argo data.

pyXpcm

pyXpcm is a Python implementation of the PCM method that consumes and produces Xarray objects (xarray.Dataset and xarray.DataArray), hence the x.

With pyXpcm you can conduct a PCM analysis for a collection of profiles (gridded or not), of one or more ocean variables, stored in an xarray.Dataset. pyXpcm also provides basic statistics and plotting functions to get you started with your analysis.

The philosophy of the pyXpcm toolbox is to create a PCM that can be used from and on different ocean datasets and variables. To achieve this, a PCM is created with information about the ocean variables to classify and the vertical axis of these variables. This PCM can then be fitted on, and subsequently classify, ocean profiles from any dataset, as long as it contains the PCM variables.

The pyXpcm procedure is to preprocess data (stack, scale, reduce and combine) and then to fit a classifier on it. Once the model is fitted, pyXpcm can classify data. The library borrows much of its vocabulary and logic from Scikit-learn, but does not inherit from sklearn.BaseEstimator.
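
In a nutshell, and as detailed in the User guide below, a typical pyXpcm session looks like this (a condensed sketch using the tutorial dataset):

import numpy as np
import pyxpcm
from pyxpcm.models import pcm

ds = pyxpcm.tutorial.open_dataset('argo').load()            # any dataset holding the PCM features
z = np.arange(0., -1000., -10.)                             # PCM vertical axis
m = pcm(K=8, features={'temperature': z, 'salinity': z})    # create the model
m.fit(ds, features={'temperature': 'TEMP', 'salinity': 'PSAL'}, dim='DEPTH')
m.predict(ds, features={'temperature': 'TEMP', 'salinity': 'PSAL'}, inplace=True)  # adds PCM_LABELS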

Installation

Required dependencies

  • Python 3.6

  • Xarray 0.12

  • Dask 0.16

  • Scikit-learn 0.19

Note that Scikit-learn is the default statistics backend, but if dask_ml is installed you can use it as well (see the API reference).
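
For instance, and assuming the backend option of the pcm constructor documented in the API reference, the backend could be selected at model creation (a sketch, not run here):

import numpy as np
from pyxpcm.models import pcm

z = np.arange(0., -1000., -10.)
m_sk = pcm(K=8, features={'temperature': z})                      # scikit-learn backend (default)
m_dk = pcm(K=8, features={'temperature': z}, backend='dask_ml')   # requires dask_ml to be installed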

For full plotting functionality (see the Plotting API) the following packages are required:

  • Matplotlib 3.0 (mandatory)

  • Cartopy 0.17 (for some methods only)

  • Seaborn 0.9.0 (for some methods only)

Instructions

For the latest public release:

pip install pyxpcm

For the latest development version:

pip install git+http://github.com/obidam/pyxpcm.git

Pre-trained PCM

We’ll try to list here pre-trained PCM models ready for you to use and classify your data with.

Features                      | K [1] | Relevant domain | Training data    | Access link | Reference
------------------------------|-------|-----------------|------------------|-------------|-------------------
Temperature (0-1400m, 5m)     | 8     | North Atlantic  | Argo (2000-2014) | Archimer    | Maze et al (2017)
Temperature (15-980db, 5db)   | 8     | Southern Ocean  | Argo (2001-2017) | Zenodo      | Jones et al (2019)

User guide

Standard procedure

Here is a standard procedure with pyXpcm. This will show you how to create a model, how to fit/train it, how to classify data and how to visualise results.

Create a model

Let’s import the Profile Classification Model (PCM) constructor:

[2]:
from pyxpcm.models import pcm

A PCM can be created independently of any dataset using the class constructor.

To be created, a PCM requires a number of classes (or clusters) and a dictionary defining the list of features and their vertical axis:

[3]:
import numpy as np

z = np.arange(0., -1000., -10.)
pcm_features = {'temperature': z, 'salinity': z}

We can now instantiate a PCM, say with 8 classes:

[4]:
m = pcm(K=8, features=pcm_features)
m
[4]:
<pcm 'gmm' (K: 8, F: 2)>
Number of class: 8
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: False
Feature: 'temperature'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>

Here we created a PCM with 8 classes (K=8) and 2 features (F=2) that are temperature and salinity profiles defined between the surface and 1000m depth.

We furthermore note the list of transform methods that will be used to preprocess each of the features (see the preprocessing documentation page for more details).

Note that the number of classes and features are PCM properties accessible at pyxpcm.pcm.K and pyxpcm.pcm.F.
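
For instance:

print(m.K)  # 8
print(m.F)  # 2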

Load training data

pyXpcm is able to work with both gridded datasets (eg: model outputs with longitude,latitude,time dimensions) and collection of profiles (eg: Argo, XBT, CTD section profiles).

In this example, let's import a sample of North Atlantic Argo data that comes with pyXpcm:

[5]:
import pyxpcm
ds = pyxpcm.tutorial.open_dataset('argo').load()
print(ds)
<xarray.Dataset>
Dimensions:    (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * DEPTH      (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Dimensions without coordinates: N_PROF
Data variables:
    LATITUDE   (N_PROF) float32 ...
    LONGITUDE  (N_PROF) float32 ...
    TIME       (N_PROF) datetime64[ns] ...
    DBINDEX    (N_PROF) float64 ...
    TEMP       (N_PROF, DEPTH) float32 ...
    PSAL       (N_PROF, DEPTH) float32 ...
    SIG0       (N_PROF, DEPTH) float32 ...
    BRV2       (N_PROF, DEPTH) float32 ...
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Fit the model on data

Fitting can be done on any dataset coherent with the PCM definition, in the sense that it must contain the feature variables of the PCM.

To tell the PCM model how to identify features in any xarray.Dataset, we need to provide a dictionary of variable names mapping:

[6]:
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}

which means that the PCM feature temperature is to be found in the dataset variable TEMP.

We also need to specify the name of the vertical dimension of the dataset variables:

[7]:
features_zdim='DEPTH'

Now we're ready to fit the model on this dataset:

[8]:
m.fit(ds, features=features_in_ds, dim=features_zdim)
m
[8]:
<pcm 'gmm' (K: 8, F: 2)>
Number of class: 8
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
         log likelihood of the training set: 38.784750

Note

pyXpcm can also identify PCM features and axis within an xarray.Dataset using variable attributes. From the example above we can set:

ds['TEMP'].attrs['feature_name'] = 'temperature'
ds['PSAL'].attrs['feature_name'] = 'salinity'
ds['DEPTH'].attrs['axis'] = 'Z'

And then simply call the fit method without arguments:

m.fit(ds)

Note that if the data follow the CF convention, the vertical dimension axis attribute should already be set to Z.
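
For instance, with the tutorial dataset used above you can check that the vertical coordinate already carries this attribute:

print(ds['DEPTH'].attrs['axis'])  # 'Z'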

Classify data

Now that the PCM is fitted, we can predict classification labels:

[9]:
m.predict(ds, features=features_in_ds, inplace=True)
ds
[9]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 ... -1400.0 -1405.0
Data variables:
    LATITUDE    (N_PROF) float32 ...
    LONGITUDE   (N_PROF) float32 ...
    TIME        (N_PROF) datetime64[ns] ...
    DBINDEX     (N_PROF) float64 ...
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Prediction labels are automatically added to the dataset as PCM_LABELS because the option inplace was set to True. We didn’t specify the dim option because our dataset is CF compliant.
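
Conversely, if inplace is not set to True, predict returns the labels as a new xarray.DataArray instead of modifying the dataset (a sketch; see the API reference for details):

labels = m.predict(ds, features=features_in_ds, inplace=False)
print(type(labels))   # xarray.DataArray of labels, along the N_PROF dimension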

pyXpcm uses a GMM classifier by default, which is a fuzzy classifier. So we can also predict the probability of each class for all profiles, the so-called posteriors:

[10]:
m.predict_proba(ds, features=features_in_ds, inplace=True)
ds
[10]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560, pcm_class: 8)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 ... -1400.0 -1405.0
Dimensions without coordinates: pcm_class
Data variables:
    LATITUDE    (N_PROF) float32 27.122 27.818 27.452 ... 4.243 4.15 4.44
    LONGITUDE   (N_PROF) float32 -74.86 -75.6 -74.949 ... -1.263 -0.821 -0.002
    TIME        (N_PROF) datetime64[ns] 2008-06-23T13:07:30 ... 2013-03-09T14:52:58
    DBINDEX     (N_PROF) float64 1.484e+04 1.622e+04 ... 1.063e+04
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3
    PCM_POST    (pcm_class, N_PROF) float64 0.0 0.0 0.0 ... 1.388e-19 6.625e-20
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

which are added to the dataset as the PCM_POST variable. The probability of classes for each profile comes with a new dimension, named pcm_class by default, that goes from 0 to K-1.
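
Since pcm_class is a regular dimension, the posteriors of a single class can be extracted with standard xarray indexing, for instance:

post_k0 = ds['PCM_POST'].isel(pcm_class=0)   # probability of belonging to class 0, per profile
print(post_k0.shape)                         # (7560,)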

Note

You can delete variables added by pyXpcm to the xarray.Dataset with the pyxpcm.xarray.pyXpcmDataSetAccessor.drop_all() method:

ds.pyxpcm.drop_all()

Or you can split pyXpcm variables out of the original xarray.Dataset:

ds_pcm, ds = ds.pyxpcm.split()

It is important to note that once the PCM is fitted, you can predict labels for any dataset, as long as it has the PCM features.

For instance, let’s predict labels for a gridded dataset:

[12]:
ds_gridded = pyxpcm.tutorial.open_dataset('isas_snapshot').load()
ds_gridded
[12]:
<xarray.Dataset>
Dimensions:      (depth: 152, latitude: 53, longitude: 61)
Coordinates:
  * latitude     (latitude) float32 30.023445 30.455408 ... 49.41288 49.737103
  * longitude    (longitude) float32 -70.0 -69.5 -69.0 ... -41.0 -40.5 -40.0
  * depth        (depth) float32 -1.0 -3.0 -5.0 ... -1980.0 -2000.0
Data variables:
    TEMP         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    TEMP_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    TEMP_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    PSAL         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    PSAL_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    PSAL_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    SST          (latitude, longitude) float32 dask.array<chunksize=(53, 61), meta=np.ndarray>
[13]:
m.predict(ds_gridded, features={'temperature':'TEMP','salinity':'PSAL'}, dim='depth', inplace=True)
ds_gridded
[13]:
<xarray.Dataset>
Dimensions:      (depth: 152, latitude: 53, longitude: 61)
Coordinates:
  * latitude     (latitude) float64 30.02 30.46 30.89 ... 49.09 49.41 49.74
  * longitude    (longitude) float64 -70.0 -69.5 -69.0 ... -41.0 -40.5 -40.0
  * depth        (depth) float32 -1.0 -3.0 -5.0 ... -1980.0 -2000.0
Data variables:
    TEMP         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    TEMP_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    TEMP_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    PSAL         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    PSAL_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    PSAL_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61), meta=np.ndarray>
    SST          (latitude, longitude) float32 dask.array<chunksize=(53, 61), meta=np.ndarray>
    PCM_LABELS   (latitude, longitude) float64 1.0 1.0 1.0 1.0 ... 7.0 7.0 7.0 7.0

where you can see the addition of the PCM_LABELS variable.

Vertical structure of classes

One key outcome of the PCM analysis is the vertical structure of each class. This can be computed using the pyxpcm.stat.quantile() method.

Below we compute the 5, 50 and 95% quantiles of temperature and salinity for each class:

[14]:
for vname in ['TEMP', 'PSAL']:
    ds = ds.pyxpcm.quantile(m, q=[0.05, 0.5, 0.95], of=vname, outname=vname + '_Q', keep_attrs=True, inplace=True)
ds
[14]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560, pcm_class: 8, quantile: 3)
Coordinates:
  * pcm_class   (pcm_class) int64 0 1 2 3 4 5 6 7
  * N_PROF      (N_PROF) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 ... -1400.0 -1405.0
  * quantile    (quantile) float64 0.05 0.5 0.95
Data variables:
    LATITUDE    (N_PROF) float32 27.122 27.818 27.452 ... 4.243 4.15 4.44
    LONGITUDE   (N_PROF) float32 -74.86 -75.6 -74.949 ... -1.263 -0.821 -0.002
    TIME        (N_PROF) datetime64[ns] 2008-06-23T13:07:30 ... 2013-03-09T14:52:58
    DBINDEX     (N_PROF) float64 1.484e+04 1.622e+04 ... 1.063e+04
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 23.601229 23.601229 ... 27.685583
    BRV2        (N_PROF, DEPTH) float32 0.00029447526 ... 4.500769e-06
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3
    PCM_POST    (pcm_class, N_PROF) float64 0.0 0.0 0.0 ... 1.388e-19 6.625e-20
    TEMP_Q      (pcm_class, quantile, DEPTH) float64 3.07 3.07 3.07 ... 4.844 4.83 4.814
    PSAL_Q      (pcm_class, quantile, DEPTH) float64 34.03 34.03 34.03 ... 35.01 35.01
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Quantiles can be plotted using the pyxpcm.plot.quantile() method.

[15]:
fig, ax = m.plot.quantile(ds['TEMP_Q'], maxcols=4, figsize=(10, 8), sharey=True)
_images/example_38_0.png

Geographic distribution of classes

Warning

To follow this section you’ll need to have Cartopy installed and working.

A map of labels can now easily be plotted:

[16]:
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-80,1,-1,66]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5,5), dpi=120, facecolor='w', edgecolor='k', subplot_kw=subplot_kw)

kmap = m.plot.cmap()
sc = ax.scatter(ds['LONGITUDE'], ds['LATITUDE'], s=3, c=ds['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K)
cl = m.plot.colorbar(ax=ax)

gl = m.plot.latlongrid(ax, dx=10)
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.set_title('LABELS of the training set')
plt.show()
_images/example_42_0.png

Since we predicted labels for 2 datasets, we can superimpose them:

[17]:
proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-75,-35,25,55]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5,5), dpi=120, facecolor='w', edgecolor='k', subplot_kw=subplot_kw)

kmap = m.plot.cmap()
sc = ax.pcolor(ds_gridded['longitude'], ds_gridded['latitude'], ds_gridded['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K)
sc = ax.scatter(ds['LONGITUDE'], ds['LATITUDE'], s=10, c=ds['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K, edgecolors=[0.3]*3, linewidths=0.3)
cl = m.plot.colorbar(ax=ax)

gl = m.plot.latlongrid(ax, dx=10)
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.set_title('LABELS of the training set (dots) and another product (shade)')
plt.show()
_images/example_44_0.png

Posteriors are defined for each data point and give the probability of that point belonging to each of the classes. They can be plotted this way:

[18]:
import seaborn as sns

cmap = sns.light_palette("blue", as_cmap=True)
proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-80,1,-1,66]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = m.plot.subplots(figsize=(10,22), maxcols=2, subplot_kw=subplot_kw)

for k in m:
    sc = ax[k].scatter(ds['LONGITUDE'], ds['LATITUDE'], s=3, c=ds['PCM_POST'].sel(pcm_class=k),
                       cmap=cmap, transform=proj, vmin=0, vmax=1)
    cl = plt.colorbar(sc, ax=ax[k], fraction=0.03)
    gl = m.plot.latlongrid(ax[k], fontsize=8, dx=20, dy=10)
    ax[k].add_feature(cfeature.LAND)
    ax[k].add_feature(cfeature.COASTLINE)
    ax[k].set_title('PCM Posteriors k=%i' % k)
_images/example_46_0.png

PCM properties

The PCM class does a lot of data preprocessing under the hood in order to classify profiles.

Here is how to access PCM preprocessed results and data.

Import and set-up

Import the library and toy data

[2]:
import numpy as np
import pyxpcm
from pyxpcm.models import pcm

Let’s work with a standard PCM of temperature and salinity, from the surface down to -1000m:

[3]:
# Define a vertical axis to work with
z = np.arange(0.,-1000,-10.)

# Define features to use
features_pcm = {'temperature': z, 'salinity': z}

# Instantiate the PCM
m = pcm(K=4, features=features_pcm, maxvar=2)
print(m)
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: False
Feature: 'temperature'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>

Note that here we used a strong dimensionality reduction to limit the dimensions and the size of the plots to come (maxvar=2 tells the PCM to use the first 2 PCA components of each variable).

Now we can load a dataset to be used for fitting.

[4]:
ds = pyxpcm.tutorial.open_dataset('argo').load()

Fit and predict model on data:

[5]:
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}
ds = ds.pyxpcm.fit_predict(m, features=features_in_ds, inplace=True)
print(ds)
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 5 6 ... 7554 7555 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Data variables:
    LATITUDE    (N_PROF) float32 ...
    LONGITUDE   (N_PROF) float32 ...
    TIME        (N_PROF) datetime64[ns] ...
    DBINDEX     (N_PROF) float64 ...
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 1 1 1 1 ... 2 2 2 2 2 2 2 2 2 2 2
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182

Scaler properties

[6]:
fig, ax = m.plot.scaler()

# More options:
# m.plot.scaler(style='darkgrid')
# m.plot.scaler(style='darkgrid', subplot_kw={'ylim':[-1000,0]})
_images/pcm_prop_11_0.png

Reducer properties

Plot the eigenvectors of the PCA reducer (or nothing if no reducer is used):

[7]:
fig, ax = m.plot.reducer()
# Equivalent to:
# pcmplot.reducer(m)

# More options:
# m.plot.reducer(pcalist = range(0,4));
# m.plot.reducer(pcalist = [0], maxcols=1);
# m.plot.reducer(pcalist = range(0,4), style='darkgrid',  plot_kw={'linewidth':1.5}, subplot_kw={'ylim':[-1400,0]}, figsize=(12,10));
_images/pcm_prop_13_0.png

Scatter plot of features, as seen by the classifier

You can access pre-processed data for your own plots/analysis through the pyxpcm.pcm.preprocessing() method:

[8]:
X, sampling_dims = m.preprocessing(ds, features=features_in_ds)
X
[8]:
<xarray.DataArray (n_samples: 7560, n_features: 4)>
array([[ 1.9281656 , -0.09149919,  1.7340997 , -0.27024782],
       [ 2.314077  ,  0.10684185,  2.0836833 , -0.18765019],
       [ 1.6755121 , -0.17313023,  1.5637012 , -0.43244886],
       ...,
       [-0.802601  , -0.5783772 , -1.5761338 , -0.31184074],
       [-0.9552184 , -0.6094395 , -1.8049222 , -0.4273216 ],
       [-0.8925139 , -0.6237318 , -1.7922652 , -0.46551177]],
      dtype=float32)
Coordinates:
  * n_samples   (n_samples) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * n_features  (n_features) <U13 'temperature_0' 'temperature_1' 'salinity_0' 'salinity_1'

pyXpcm returns a 2-dimensional xarray.DataArray, for which pairwise relationships can easily be visualised with the pyxpcm.plot.preprocessed() method (this requires Seaborn):

[9]:
g = m.plot.preprocessed(ds, features=features_in_ds, style='darkgrid')

# A posteriori adjustements:
# g.set(xlim=(-3,3),ylim=(-3,3))
# g.savefig('toto.png')
_images/pcm_prop_17_0.png
[10]:
# Combine KDE with histograms (very slow plot, so commented here):
g = m.plot.preprocessed(ds, features=features_in_ds, kde=True)
_images/pcm_prop_18_0.png

Save and load PCM from local files

PCM instances are lightweight Python objects and can easily be saved to and loaded from files. pyXpcm uses the netCDF file format because it makes it easy to attach metadata to numerical arrays.

Import and set-up

Import the library and toy data

[2]:
import numpy as np
import pyxpcm
from pyxpcm.models import pcm

# Load tutorial data:
ds = pyxpcm.tutorial.open_dataset('argo').load()

Saving a model

Let’s first create a PCM and fit it onto the tutorial dataset:

[3]:
# Define a vertical axis to work with
z = np.arange(0.,-1000,-10.)

# Define features to use
features_pcm = {'temperature': z, 'salinity': z}

# Instantiate the PCM:
m = pcm(K=4, features=features_pcm)

# Fit:
m.fit(ds, features={'temperature': 'TEMP', 'salinity': 'PSAL'})
[3]:
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
         log likelihood of the training set: 33.335059

We can now save the fitted model to a local file:

[4]:
m.to_netcdf('my_pcm.nc')

Loading a model

To load a PCM from file, use:

[5]:
m_loaded = pyxpcm.load_netcdf('my_pcm.nc')
m_loaded
[5]:
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
         Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
         Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
         Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
         log likelihood of the training set: 33.335059

Features preprocessing explained

Note

pyXpcm automatically performs all the data preprocessing for you: you should not have to manipulate your data before being able to fit or classify. This page explains the preprocessing steps performed under the hood by pyXpcm.

The Profile Classification Model (PCM) requires data to be preprocessed in order to match the model vertical axis, to scale feature dimensions with each other and to reduce the dimensionality of the problem. Preprocessing is done internally by pyXpcm, and each step can be parameterised.

The PCM preprocessing operations are organised into 4 steps:

_images/Preprocessing_pipeline_2lines.png

Stack

This step masks, extracts, flattens and transforms any ND-array set of feature variables (e.g. temperature, salinity) into a plain 2D-array collection of vertical profiles usable by machine learning methods.

Mask

This step computes a mask of the input data that rejects all profiles that are full of NaNs over the depth range of the feature vertical axis. This ensures that all feature variables can be successfully retrieved to fill in the plain 2D-array collection of profiles.

This operation is conducted by pyxpcm.xarray.pyXpcmDataSetAccessor.mask(), so that the mask can be computed (and plotted) this way:

[2]:
mask = ds.pyxpcm.mask(m)
print(mask)
<xarray.DataArray 'pcm_MASK' (latitude: 53, longitude: 61)>
dask.array<eq, shape=(53, 61), dtype=bool, chunksize=(53, 61), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float32 30.023445 30.455408 ... 49.41288 49.737103
  * longitude  (longitude) float32 -70.0 -69.5 -69.0 -68.5 ... -41.0 -40.5 -40.0
[3]:
mask.plot();
_images/preprocessing_9_0.png
Ravel

For an ND-array to be used as a feature, it has to be ravelled (flattened) along the N-1 dimensions that are not the vertical one. This operation thus transforms any ND-array into a 2D-array (sampling and vertical_axis dimensions) and additionally drops profiles according to the PCM mask determined above.

This operation is conducted by pyxpcm.pcm.ravel().

The output 2D-array is a xarray.DataArray that can be chunked along the sampling dimension with the PCM constructor option chunk_size:

[4]:
m = pcm(K=3, features=features_pcm, chunk_size=1e3).fit(ds)

By default, chunk_size='auto'.

[5]:
X, z, sampling_dims = m.ravel(ds['TEMP'], dim='depth', feature_name='TEMP')
X
[5]:
<xarray.DataArray 'TEMP' (sampling: 2289, depth: 152)>
dask.array<shape=(2289, 152), dtype=float32, chunksize=(1000, 152), chunktype=numpy.ndarray>
Coordinates:
  * depth      (depth) float32 -1.0 -3.0 -5.0 ... -1980.0 -2000.0
  * sampling   (sampling) MultiIndex
    - latitude   (sampling) float64 30.02 30.02 30.02 ... 49.74 49.74
    - longitude  (sampling) float64 -70.0 -69.5 -69.0 ... -40.5 -40.0
Attributes:
    long_name:      Temperature
    standard_name:  sea_water_temperature
    units:          degree_Celsius
    valid_min:      -23000
    valid_max:      20000

See the chunksize of the dask.array.Array for this feature.

Interpolate

Even if the input data vertical axis is within the range of the PCM feature axis, it may not be defined on the same level values. In this step, if the input data are not defined on the same vertical axis as the PCM, an interpolation is triggered. The interpolation follows these rules:

  • If the PCM axis levels are all found in the input data vertical axis, then a simple intersection is used.

  • If the PCM axis starts at the surface (0 value) but the input data does not, the first non-NaN value is replicated to the surface, as in a mixed layer.

  • If the PCM axis levels are not in the input data vertical axis, a linear interpolation through the xarray.DataArray.interp() method is triggered for each profile.

The entire interpolation process is managed by a pyxpcm.utils.Vertical_Interpolator instance that is created at the time of PCM instantiation.
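
To illustrate the last two rules with plain xarray, here is a minimal sketch with made-up values (pyXpcm does all of this internally through its Vertical_Interpolator; note that xarray.DataArray.interp() requires scipy):

import numpy as np
import xarray as xr

# A single profile on its own vertical levels (negative, oriented downward):
profile = xr.DataArray([20.0, 19.5, 18.0, 15.0, 10.0],
                       dims='depth',
                       coords={'depth': [-5.0, -10.0, -20.0, -50.0, -100.0]})

# A hypothetical PCM feature axis starting at the surface:
pcm_z = np.arange(0.0, -101.0, -10.0)

# Rule 2: the PCM axis starts at the surface (0) but the data do not, so the
# first valid value is replicated to the surface (mixed layer assumption):
surface = profile.isel(depth=0).assign_coords(depth=0.0)
profile = xr.concat([surface, profile], dim='depth').sortby('depth')

# Rule 3: PCM levels absent from the data axis are linearly interpolated:
profile_on_pcm_axis = profile.interp(depth=pcm_z)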

Scale

Each variable can be normalised at each vertical level. This step ensures that structures/patterns located at depth in the profile will be considered by the classifier similarly to those close to the surface.

Scaling is defined at PCM creation (pyxpcm.models.pcm) with the scaling option. It is an integer value with the following meaning (see the example after this list):

  • 0: No scaling

  • 1: Center on sample mean and scale by sample std

  • 2: Center on sample mean only
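
For example, a minimal sketch using the scaling option listed in the API reference (the number of classes and the vertical axis are only illustrative):

import numpy as np
from pyxpcm.models import pcm

z = np.arange(0., -1000., -10.)
features_pcm = {'temperature': z, 'salinity': z}

# scaling=1: center each vertical level on the sample mean and scale by the sample std
m = pcm(K=4, features=features_pcm, scaling=1)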

Reduce

[TBC]

Combine

[TBC]

Debugging and performances

Import and set-up

Import the library and toy data

[2]:
import numpy as np
import pyxpcm
from pyxpcm.models import pcm

# Load a dataset to work with:
ds = pyxpcm.tutorial.open_dataset('argo').load()

# Define vertical axis and features to use:
z = np.arange(0.,-1000.,-10.)
features_pcm = {'temperature': z, 'salinity': z}
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}

Debugging

Use the debug option to print log messages

[3]:
# Instantiate a new PCM:
m = pcm(K=8, features=features_pcm, debug=True)

# Fit with log:
m.fit(ds, features=features_in_ds);
> Start preprocessing for action 'fit'

        > Preprocessing xarray dataset 'TEMP' as PCM feature 'temperature'
         X RAVELED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (282,))]
                Output axis is in the input axis, not need to interpolate, simple intersection
         X INTERPOLATED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (100,))]
         X SCALED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
         X REDUCED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
        temperature pre-processed with success,  [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
        Homogenisation for fit of temperature

        > Preprocessing xarray dataset 'PSAL' as PCM feature 'salinity'
         X RAVELED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (282,))]
                Output axis is in the input axis, not need to interpolate, simple intersection
         X INTERPOLATED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (100,))]
         X SCALED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
         X REDUCED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
        salinity pre-processed with success,  [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
        Homogenisation for fit of salinity
        Features array shape and type for xarray: (7560, 30) <class 'numpy.ndarray'> <class 'memoryview'>
> Preprocessing done, working with final X (<class 'xarray.core.dataarray.DataArray'>) array of shape: (7560, 30)  and sampling dimensions: ['N_PROF']

Performance / Optimisation

Use the timeit and timeit_verb options to measure the execution time of PCM operations

Times are accessible as a pandas.DataFrame through the timeit property of the pyXpcm instance.

The m.plot.timeit() plotting method allows for a simple visualisation of these times.

Time readings during execution
[4]:
# Create a PCM and execute methods:
m = pcm(K=8, features=features_pcm, timeit=True, timeit_verb=1)
m.fit(ds, features=features_in_ds);
  fit.1-preprocess.1-mask: 26 ms
  fit.1-preprocess.2-feature_temperature.1-ravel: 48 ms
  fit.1-preprocess.2-feature_temperature.2-interp: 2 ms
  fit.1-preprocess.2-feature_temperature.3-scale_fit: 14 ms
  fit.1-preprocess.2-feature_temperature.4-scale_transform: 8 ms
  fit.1-preprocess.2-feature_temperature.5-reduce_fit: 22 ms
  fit.1-preprocess.2-feature_temperature.6-reduce_transform: 5 ms
  fit.1-preprocess.2-feature_temperature.total: 103 ms
  fit.1-preprocess: 103 ms
  fit.1-preprocess.3-homogeniser: 1 ms
  fit.1-preprocess.2-feature_salinity.1-ravel: 42 ms
  fit.1-preprocess.2-feature_salinity.2-interp: 1 ms
  fit.1-preprocess.2-feature_salinity.3-scale_fit: 12 ms
  fit.1-preprocess.2-feature_salinity.4-scale_transform: 10 ms
  fit.1-preprocess.2-feature_salinity.5-reduce_fit: 14 ms
  fit.1-preprocess.2-feature_salinity.6-reduce_transform: 3 ms
  fit.1-preprocess.2-feature_salinity.total: 85 ms
  fit.1-preprocess: 85 ms
  fit.1-preprocess.3-homogeniser: 1 ms
  fit.1-preprocess.4-xarray: 1 ms
  fit.1-preprocess: 225 ms
  fit.fit: 2206 ms
  fit.score: 10 ms
  fit: 2442 ms
A posteriori execution time analysis
[5]:
# Create a PCM and execute methods:
m = pcm(K=8, features=features_pcm, timeit=True, timeit_verb=0)
m.fit(ds, features=features_in_ds);
m.predict(ds, features=features_in_ds);
m.fit_predict(ds, features=features_in_ds);

Execution times are accessible as a pandas.DataFrame through the pyxpcm.pcm.timeit property

[6]:
m.timeit
[6]:
Method       Sub-method    Sub-sub-method         Sub-sub-sub-method
fit          1-preprocess  1-mask                 total                  19.836187
                           2-feature_temperature  1-ravel                32.550097
                                                  2-interp                0.828981
                                                  3-scale_fit             9.926319
                                                  4-scale_transform       5.232811
                                                                           ...
fit_predict  fit           total                                        737.503052
             score         total                                          8.855104
             predict       total                                          9.073734
             xarray        total                                          8.368969
             total                                                      882.753134
Length: 66, dtype: float64

Visualisation help

To facilitate your analysis of execution times, you can use pyxpcm.plot.timeit().

Main steps by method
[7]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-method', style='darkgrid') # Default group/split
df
[7]:
Sub-method 1-preprocess fit predict score xarray
Method
fit 556.826353 1123.903990 NaN 10.521889 NaN
fit_predict 416.832924 737.503052 9.073734 8.855104 8.368969
predict 402.269125 NaN 9.886980 9.831905 8.663177
_images/debug_perf_16_1.png
Preprocessing main steps by method
[8]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-sub-method')
df
[8]:
Sub-sub-method 1-mask 2-feature_salinity 2-feature_temperature 3-homogeniser 4-xarray
Method
fit 19.836187 122.781515 130.059242 2.494097 2.145052
fit_predict 20.085096 93.130827 89.278936 3.419876 1.243114
predict 24.972916 73.968887 98.288298 1.749992 1.240015
_images/debug_perf_18_1.png
Preprocessing details by method
[9]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-sub-sub-method')
df
[9]:
Sub-sub-sub-method 1-ravel 2-interp 3-scale_fit 4-scale_transform 5-reduce_fit 6-reduce_transform
Method
fit 63.939333 1.705170 18.492460 10.242701 25.922775 5.970001
fit_predict 67.147017 1.942873 0.005245 12.181759 0.003815 9.781122
predict 67.429066 1.815319 0.004053 10.471821 0.003099 6.281853
_images/debug_perf_20_1.png
Preprocessing details by features
[10]:
fig, ax, df = m.plot.timeit(split='Sub-sub-sub-method', group='Sub-sub-method', unit='s')
df
[10]:
Sub-sub-sub-method 1-ravel 2-interp 3-scale_fit 4-scale_transform 5-reduce_fit 6-reduce_transform
Sub-sub-method
2-feature_salinity 0.093312 0.002782 0.008571 0.017127 0.012878 0.010059
2-feature_temperature 0.105203 0.002681 0.009930 0.015770 0.013052 0.011974
_images/debug_perf_22_1.png

Help & reference

Bibliography

  • Maze, Guillaume and Mercier, Herlé and Fablet, Ronan and Tandeo, Pierre and Lopez Radcenco, Manuel and Lenca, Philippe and Feucher, Charlène and Le Goff, Clément. Coherent heat patterns revealed by unsupervised classification of Argo temperature profiles in the North Atlantic Ocean. Progress in Oceanography (2017). 10.1016/j.pocean.2016.12.008.

  • Maze, Guillaume and Mercier, Herlé and Cabanes, Cécile. Profile Classification Models. Mercator Ocean Journal (2017). https://archimer.ifremer.fr/doc/00387/49816/.

  • Jones, Daniel C. and Holt, Harry J. and Meijers, Andrew J. S. and Shuckburgh, Emily. Unsupervised Clustering of Southern Ocean Argo Float Temperature Profiles. Journal of Geophysical Research: Oceans (2019). 10.1029/2018jc014629.

  • Rosso, Isabella and Mazloff, Matthew R. and Talley, Lynne D. and Purkey, Sarah G. and Freeman, Natalie M. and Maze, Guillaume. Water Mass and Biogeochemical Variability in the Kerguelen Sector of the Southern Ocean: A Machine Learning Approach for a Mixing Hot Spot. Journal of Geophysical Research: Oceans (2020). https://doi.org/10.1029/2019JC015877.

  • Boehme, Lars and Rosso, Isabella. Classifying Oceanographic Structures in the Amundsen Sea, Antarctica. Geophysical Research Letters (2021). https://doi.org/10.1029/2020GL089412.

  • Thomas, S. D. A. and Jones, D. C. and Faul, A. and Mackie, E. and Pauthenet, E.. Defining Southern Ocean fronts using unsupervised classification. Ocean Science (2021). 10.5194/os-17-1545-2021.

  • Li, Xiaojuan and Mao, Zhihua and Zheng, Hongrui and Zhang, Wei and Yuan, Dapeng and Li, Youzhi and Wang, Zheng and Liu, Yunxin. Process-Oriented Estimation of Chlorophyll-a Vertical Profile in the Mediterranean Sea Using MODIS and Oceanographic Float Products. Frontiers in Marine Science (2022). 10.3389/fmars.2022.933680.

  • Sambe, Fumika and Suga, Toshio. Unsupervised Clustering of Argo Temperature and Salinity Profiles in the Mid-Latitude Northwest Pacific Ocean and Revealed Influence of the Kuroshio Extension Variability on the Vertical Structure Distribution. Journal of Geophysical Research: Oceans (2022). https://doi.org/10.1029/2021JC018138.

What’s New

Upcoming release

v0.4.1 (21 Feb. 2020)

  • Improved documentation

  • Improved unit testing

  • Bug fix:
    • Fix a bug in the preprocessing step using the dask_ml backend that would cause an error for data already in dask arrays

v0.4.0 (1 Nov. 2019)

Warning

The API has changed, breaking backward compatibility.

  • Enhancements:

    • Multiple-features classification

    • ND-Array classification (so that you can directly classify profiles from gridded products, e.g. a latitude/longitude/time grid, and not only a collection of profiles already in a 2D array)

    • pyXpcm methods can be accessed through the xarray.Dataset accessor namespace pyxpcm

    • Allow choosing the statistics backend (sklearn, dask_ml or user-defined)

    • Save/load PCM to/from netcdf files

  • pyXpcm now consumes xarray/dask objects all along, not only on the user front-end. This adds a small overhead with small datasets but allows a PCM to handle larger and more complex datasets.

v0.3 (5 Apr. 2019)

  • Removed support for python 2.7

  • Added more data input consistency checks

  • Fixed bugs in interpolation and plotting methods

  • Added custom colormap and colorbar to plot module

v0.2 (26 Mar. 2019)

  • Upgrade to python 3.6 (compatible 2.7)

  • Added test for continuous coverage

  • Added score and bic methods

  • Improved vocabulary consistency in methods

v0.1.3 (12 Nov. 2018)

  • Initial release.

API reference

This page provides an auto-generated summary of pyXpcm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

Top-level PCM functions

Creating a PCM

pcm(K, features[, scaling, reduction, ...])

Profile Classification Model class constructor

pyxpcm.load_netcdf(ncfile)

Load a PCM model from netcdf file

Attributes

pcm.K

Return the number of classes

pcm.F

Return the number of features

pcm.features

Return the features definition dictionary

Computation

pcm.fit(ds[, features, dim])

Estimate PCM parameters

pcm.fit_predict(ds[, features, dim, ...])

Estimate PCM parameters and predict classes.

pcm.predict(ds[, features, dim, inplace, name])

Predict labels for profile samples

pcm.predict_proba(ds[, features, dim, ...])

Predict posterior probability of each components given the data

pcm.score(ds[, features, dim])

Compute the per-sample average log-likelihood of the given data

pcm.bic(ds[, features, dim])

Compute Bayesian information criterion for the current model on the input dataset
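
As an illustration, pcm.bic() can be used to compare models fitted with different numbers of classes. This is only a minimal sketch; ds, features_pcm and features_in_ds are assumed to be defined as in the tutorial sections above:

import numpy as np
from pyxpcm.models import pcm

bics = []
Ks = range(2, 10)
for K in Ks:
    m = pcm(K=K, features=features_pcm)
    m.fit(ds, features=features_in_ds)
    bics.append(m.bic(ds, features=features_in_ds))

best_K = list(Ks)[int(np.argmin(bics))]  # the lower the BIC, the better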

Low-level PCM properties and functions

pcm.timeit

Return a pandas.DataFrame with execution times of methods called on this instance

pcm.ravel(da[, dim, feature_name])

Extract from N-d array a X(feature,sample) 2-d array and vertical dimension z

pcm.unravel(ds, sampling_dims, X)

Create a DataArray from a numpy array and sampling dimensions

Plotting

pcm.plot

Access plotting functions

Plot PCM Contents

plot.quantile(m, da[, xlim, classdimname, ...])

Plot q-th quantiles of a dataArray for each PCM components

plot.scaler(m[, style, plot_kw, subplot_kw])

Plot PCM scalers properties

plot.reducer(m[, pcalist, style, maxcols, ...])

Plot PCM reducers properties

plot.preprocessed(m, ds[, features, dim, n, ...])

Plot preprocessed features as pairwise scatter plots

plot.timeit(m[, group, split, subplot_kw, ...])

Plot PCM registered timing of operations

Tools

plot.cmap(m, name[, palette, usage])

Return categorical colormaps

plot.colorbar(m[, cmap])

Add a colorbar to the current plot with centered ticks on discrete colors

plot.subplots(m[, maxcols, K, subplot_kw])

Return (figure, axis) with one subplot per cluster

plot.latlongrid(ax[, dx, dy, fontsize])

Add latitude/longitude grid lines and labels to a cartopy geoaxes

Statistics

pcm.stat

Access statistics functions

stat.quantile(ds[, q, of, using, outname, ...])

Compute q-th quantile of a xarray.DataArray for each PCM components

stat.robustness(ds[, name, classdimname, ...])

Compute classification robustness

stat.robustness_digit(ds[, name, ...])

Digitize classification robustness
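
A typical usage sketch combining prediction and these statistics (assuming a fitted PCM m and a dataset ds with features defined as in the tutorials; parameter names follow the accessor documentation below):

# Add labels and posterior probabilities to the dataset:
ds = m.predict(ds, features=features_in_ds, inplace=True)
ds = m.predict_proba(ds, features=features_in_ds, inplace=True)

# Median and 5/95% envelopes of temperature for each class:
ds = ds.pyxpcm.quantile(m, q=[0.05, 0.5, 0.95], of='TEMP', outname='TEMP_Q', inplace=True)

# Classification robustness, based on the 'PCM_POST' posteriors:
ds = ds.pyxpcm.robustness(m, inplace=True)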

Save/load PCM models

pcm.to_netcdf(ncfile, **ka)

Save a PCM to a netcdf file

pyxpcm.load_netcdf(ncfile)

Load a PCM model from netcdf file

Helper

tutorial.open_dataset(name)

Open a dataset from the pyXpcm online data repository (requires internet).

Xarray pyxpcm name space

Provide an accessor to enhance interoperability between xarray and pyxpcm.

Provide a scope named pyxpcm as an accessor to xarray.Dataset objects.

class pyxpcm.xarray.pyXpcmDataSetAccessor[source]

Class registered under scope pyxpcm to access xarray.Dataset objects.
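
For example, once pyxpcm is imported, the methods below are available directly on any xarray.Dataset through the pyxpcm scope (a minimal sketch; ds is assumed to contain 'TEMP' and 'PSAL' variables, as in the tutorials):

import numpy as np
import pyxpcm  # importing pyxpcm registers the 'pyxpcm' accessor on xarray.Dataset
from pyxpcm.models import pcm

z = np.arange(0., -1000., -10.)
m = pcm(K=4, features={'temperature': z, 'salinity': z})

# Same operation as m.fit_predict(ds, ...), but called from the dataset side:
labels = ds.pyxpcm.fit_predict(m, features={'temperature': 'TEMP', 'salinity': 'PSAL'})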

add(da)[source]

Add a xarray.DataArray to this xarray.Dataset

bic(this_pcm, **kwargs)[source]

Compute Bayesian information criterion for the current model on the input dataset

Only for a GMM classifier

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

Returns:
bic: float

The lower the better

drop_all()[source]

Remove xarray.DataArray created with pyXpcm from this xarray.Dataset

feature_dict(this_pcm, features=None)[source]

Return dictionary of features for this xarray.Dataset and a PCM

Parameters:
pcm: pyxpcm.pcmmodel.pcm
features: dict

Keys are PCM feature names, values are the corresponding xarray.Dataset variable names

Returns:
dict()

Dictionary where keys are PCM feature names and values the corresponding xarray.Dataset variables

fit(this_pcm, **kwargs)[source]

Estimate PCM parameters

For a PCM, the fit method consists in the following operations:

  • pre-processing
    • interpolation to the feature_axis levels of the model

    • scaling

    • reduction

  • estimate classifier parameters

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

Returns:
self
fit_predict(this_pcm, **kwargs)[source]

Estimate PCM parameters and predict classes.

This method adds these properties to the PCM object:

  • llh: The log likelihood of the model with regard to new data

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

inplace: boolean, False by default

If False, return a xarray.DataArray with predicted labels. If True, return the input xarray.Dataset with labels added as a new xarray.DataArray.

name: string (‘PCM_LABELS’)

Name of the DataArray holding labels.

Returns:
xarray.DataArray

Component labels (if option ‘inplace’ = False)

or
xarray.Dataset

Input dataset with component labels as a ‘PCM_LABELS’ new xarray.DataArray (if option ‘inplace’ = True)

mask(this_pcm, features=None, dim=None)[source]

Create a mask where all PCM features are defined

Create a mask where all feature profiles are not null over the PCM feature axis.

Parameters:
this_pcm: pyxpcm.pcmmodel.pcm
features: dict()

Definitions of this_pcm features in the xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the xarray.Dataset. If not specified or set to None, dim is identified as the xarray.DataArray variables with attributes ‘axis’ set to ‘z’.

Returns:
xarray.DataArray
predict(this_pcm, inplace=False, **kwargs)[source]

Predict labels for profile samples

This method adds these properties to the PCM object:

  • llh: The log likelihood of the model with regard to new data

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

inplace: boolean, False by default

If False, return a xarray.DataArray with predicted labels. If True, return the input xarray.Dataset with labels added as a new xarray.DataArray.

name: str, default is ‘PCM_LABELS’

Name of the xarray.DataArray with labels

Returns:
xarray.DataArray

Component labels (if option ‘inplace’ = False)

or
xarray.Dataset

Input dataset with Component labels as a ‘PCM_LABELS’ new xarray.DataArray (if option ‘inplace’ = True)

predict_proba(this_pcm, **kwargs)[source]

Predict posterior probability of each components given the data

This method adds these properties to the PCM instance:

  • llh: The log likelihood of the model with regard to new data

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

inplace: boolean, False by default

If False, return a xarray.DataArray with predicted probabilities. If True, return the input xarray.Dataset with probabilities added as a new xarray.DataArray.

name: str, default is ‘PCM_POST’

Name of the DataArray with prediction probability (posteriors)

classdimname: str, default is ‘pcm_class’

Name of the dimension holding classes

Returns:
xarray.DataArray

Probability of each Gaussian (state) in the model given each sample (if option ‘inplace’ = False)

or
xarray.Dataset

Input dataset with Component Probability as a ‘PCM_POST’ new xarray.DataArray (if option ‘inplace’ = True)

quantile(this_pcm, inplace=False, **kwargs)[source]

Compute q-th quantile of a xarray.DataArray for each PCM components

Parameters:
q: float in the range of [0,1] (or sequence of floats)

Quantiles to compute, which must be between 0 and 1 inclusive.

of: str

Name of the xarray.Dataset variable to compute quantiles for.

using: str

Name of the xarray.Dataset variable with classification labels to use. Use ‘PCM_LABELS’ by default.

outname: ‘PCM_QUANT’ or str

Name of the xarray.DataArray with quantile

keep_attrs: boolean, False by default

Whether or not to preserve xarray.Dataset attributes in the new quantile variable.

Returns:
xarray.Dataset with shape (K, n_quantiles, N_z=n_features)
or
xarray.DataArray with shape (K, n_quantiles, N_z=n_features)
robustness(this_pcm, inplace=False, **kwargs)[source]

Compute classification robustness

Parameters:
name: str, default is ‘PCM_POST’

Name of the xarray.DataArray with prediction probability (posteriors)

classdimname: str, default is ‘pcm_class’

Name of the dimension holding classes

outname: ‘PCM_ROBUSTNESS’ or str

Name of the xarray.DataArray with robustness

inplace: boolean, False by default

If False, return a xarray.DataArray with robustness. If True, return the input xarray.Dataset with robustness added as a new xarray.DataArray.

Returns:
xarray.Dataset if inplace=True
or
xarray.DataArray if inplace=False
robustness_digit(this_pcm, inplace=False, **kwargs)[source]

Digitize classification robustness

Parameters:
ds: xarray.Dataset

Input dataset

name: str, default is ‘PCM_POST’

Name of the xarray.DataArray with prediction probability (posteriors)

classdimname: str, default is ‘pcm_class’

Name of the dimension holding classes

outname: ‘PCM_ROBUSTNESS_CAT’ or str

Name of the xarray.DataArray with robustness categories

inplace: boolean, False by default

If False, return a xarray.DataArray with robustness categories. If True, return the input xarray.Dataset with robustness categories added as a new xarray.DataArray.

Returns:
xarray.Dataset if inplace=True
or
xarray.DataArray if inplace=False
sampling_dim(this_pcm, features=None, dim=None)[source]

Return the list of dimensions to be stacked for sampling

Parameters:
pcm: pyxpcm.pcm
features: None (default) or dict()

Keys are PCM feature names, values are the corresponding xarray.Dataset variable names. If set to None, all PCM features are used.

dim: None (default) or str()

The xarray.Dataset dimension to use as vertical axis in all features. If set to None, it is automatically set to the dimension with an attribute axis set to Z.

Returns:
dict()

Dictionary where keys are xarray.Dataset variable names of features, and values are another dictionary with the list of sampling dimensions under the DIM_SAMPLING key and the name of the vertical axis under the DIM_VERTICAL key.

score(this_pcm, **kwargs)[source]

Compute the per-sample average log-likelihood of the given data

Parameters:
ds: xarray.Dataset

The dataset to work with

features: dict()

Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.

dim: str

Name of the vertical dimension in the input xarray.Dataset

Returns:
log_likelihood: float

In the case of a GMM classifier, this is the Log likelihood of the Gaussian mixture given data

split()[source]

Split pyXpcm variables from the original xarray.Dataset

Returns:
list of xarray.Dataset, xarray.Dataset

Two datasets: one with the pyXpcm variables, one with the original xarray.Dataset