pyXpcm: Ocean Profile Classification Model
pyXpcm is a Python package to create and work with ocean Profile Classification Models that consume and produce Xarray objects. Xarray objects are N-D labeled arrays and datasets in Python.
An ocean Profile Classification Model automatically assembles ocean profiles into clusters according to the similarity of their vertical structure. The geospatial properties of these clusters can be used to address a large variety of oceanographic problems: front detection, water mass identification, natural region contouring (gyres, eddies), reference profile selection for QC validation, etc. The vertical structure of these clusters furthermore provides a highly synthetic representation of large ocean areas that can be used for dimensionality reduction and coherent intercomparisons of ocean data (re)analyses or simulations.
Documentation
Getting Started
Overview
What is an ocean PCM?
An ocean PCM is a Profile Classification Model for ocean data, a statistical procedure to classify ocean vertical profiles into a finite set of “clusters”. Depending on the dataset, such clusters can show space/time coherence that can be used in many different ways to study the ocean.
Statistical method
It consists in conducting unsupervised classification (clustering) of vertical profiles of one or more ocean variables.
Each level of the vertical axis of each ocean variable is considered a feature. One ocean vertical profile, with all of its ocean variables, is considered a sample.
All the details of the Profile Classification Modelling (PCM) statistical methodology can be found in Maze et al, 2017.
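As a sketch of this samples/features layout (hypothetical shapes, plain numpy, not pyXpcm code): a collection of N profiles of two variables on Z vertical levels becomes an (N, 2×Z) matrix of samples by features.

```python
import numpy as np

# Hypothetical example: 100 profiles, 50 vertical levels each
n_profiles, n_levels = 100, 50
temp = np.random.rand(n_profiles, n_levels)  # stand-in temperature profiles
psal = np.random.rand(n_profiles, n_levels)  # stand-in salinity profiles

# Each vertical level of each variable is a feature,
# each profile is a sample: stack into (n_samples, n_features)
X = np.hstack([temp, psal])
print(X.shape)  # (100, 100)
```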
Illustration
Given a collection of Argo temperature profiles in the North Atlantic, a PCM analysis is applied and produces an optimal set of 8 ocean temperature profile classes. The PCM clusters synthesize the structural information of heat distribution in the North Atlantic. Each cluster objectively defines an ocean region where the dynamics give rise to a unique vertical stratification pattern.

Maze et al, 2017 applied it to the North Atlantic with Argo temperature data. Jones et al, 2019 later applied it to the Southern Ocean, also with Argo temperature data. Rosso et al (in prep) applied it to the Southern Indian Ocean using both temperature and salinity Argo data.
pyXpcm
pyXpcm is a Python implementation of the PCM method that consumes and produces Xarray objects (xarray.Dataset and xarray.DataArray), hence the x.
With pyXpcm you can conduct a PCM analysis on a collection of profiles (gridded or not), of one or more ocean variables, stored in an xarray.Dataset.
pyXpcm also provides basic statistics and plotting functions to get you started with your analysis.
The philosophy of the pyXpcm toolbox is to create a PCM that can be used from and on different ocean datasets and variables. To achieve this, a PCM is created with information about the ocean variables to classify and their vertical axis. This PCM can then be fitted on, and subsequently classify, ocean profiles from any dataset, as long as it contains the PCM variables.
The pyXpcm procedure is to preprocess the data (stack, scale, reduce and combine) and then to fit a classifier on it. Once the model is fitted, pyXpcm can classify data. The library borrows much of its language and logic from Scikit-learn but does not inherit from sklearn.BaseEstimator.
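As an illustration of this preprocess-then-classify logic, here is a minimal stand-alone sketch of a scale → reduce → classify chain built directly with Scikit-learn on synthetic data; it mirrors the spirit of the pyXpcm procedure, not its actual internals:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Synthetic "profiles": 500 samples with 30 vertical levels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))

# Scale each feature, reduce dimensionality, then fit a GMM classifier
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=0.99).fit_transform(X_scaled)  # keep 99% of variance
gmm = GaussianMixture(n_components=8, random_state=0).fit(X_reduced)

labels = gmm.predict(X_reduced)            # hard classification
posteriors = gmm.predict_proba(X_reduced)  # fuzzy classification
print(labels.shape, posteriors.shape)      # (500,) (500, 8)
```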
Installation
Required dependencies
Python 3.6
Xarray 0.12
Dask 0.16
Scikit-learn 0.19
Note that Scikit-learn is the default statistics backend, but if Dask-ML is installed you can use it as well (see the API reference).
For full plotting functionality (see the Plotting API) the following packages are required:
Matplotlib 3.0 (mandatory)
Cartopy 0.17 (for some methods only)
Seaborn 0.9.0 (for some methods only)
Instructions
For the latest public release:
pip install pyxpcm
For the latest development version:
pip install git+http://github.com/obidam/pyxpcm.git
Pre-trained PCM
We’ll try to list here pre-trained PCM models, ready for you to use to classify your data.
| Features | K [1] | Relevant domain | Training data | Access link | Reference |
|---|---|---|---|---|---|
| Temperature (0-1400m, 5m) | 8 | North Atlantic | Argo (2000-2014) | | |
| Temperature (15-980db, 5db) | 8 | Southern Ocean | Argo (2001-2017) | | |
User guide
Standard procedure
Here is a standard procedure based on pyXpcm. It shows you how to create a model, how to fit/train it, how to classify data and how to visualise results.
Create a model
Let’s import the Profile Classification Model (PCM) constructor:
[2]:
from pyxpcm.models import pcm
A PCM can be created independently of any dataset using the class constructor.
To be created, a PCM requires a number of classes (or clusters) and a dictionary defining the list of features and their vertical axis:
[3]:
import numpy as np
z = np.arange(0., -1000., -10.)
pcm_features = {'temperature': z, 'salinity': z}
We can now instantiate a PCM, say with 8 classes:
[4]:
m = pcm(K=8, features=pcm_features)
m
[4]:
<pcm 'gmm' (K: 8, F: 2)>
Number of class: 8
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: False
Feature: 'temperature'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
Here we created a PCM with 8 classes (K=8) and 2 features (F=2) that are temperature and salinity profiles defined between the surface and 1000m depth.
We furthermore note the list of transform methods that will be used to preprocess each of the features (see the preprocessing documentation page for more details).
Note that the number of classes and features are PCM properties accessible at pyxpcm.pcm.K
and pyxpcm.pcm.F
.
Load training data
pyXpcm works with both gridded datasets (e.g. model outputs with longitude/latitude/time dimensions) and collections of profiles (e.g. Argo, XBT, CTD section profiles).
In this example, let's import a sample of North Atlantic Argo data that comes with pyXpcm:
[5]:
import pyxpcm
ds = pyxpcm.tutorial.open_dataset('argo').load()
print(ds)
<xarray.Dataset>
Dimensions: (DEPTH: 282, N_PROF: 7560)
Coordinates:
* DEPTH (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Dimensions without coordinates: N_PROF
Data variables:
LATITUDE (N_PROF) float32 ...
LONGITUDE (N_PROF) float32 ...
TIME (N_PROF) datetime64[ns] ...
DBINDEX (N_PROF) float64 ...
TEMP (N_PROF, DEPTH) float32 ...
PSAL (N_PROF, DEPTH) float32 ...
SIG0 (N_PROF, DEPTH) float32 ...
BRV2 (N_PROF, DEPTH) float32 ...
Attributes:
Sample test prepared by: G. Maze
Institution: Ifremer/LOPS
Data source DOI: 10.17882/42182
Fit the model on data
Fitting can be done on any dataset coherent with the PCM definition, in the sense that it must contain the feature variables of the PCM.
To tell the PCM model how to identify features in any xarray.Dataset, we need to provide a dictionary mapping variable names:
[6]:
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}
which means that the PCM feature temperature is to be found in the dataset variable TEMP.
We also need to specify the vertical dimension of the dataset variables:
[7]:
features_zdim='DEPTH'
Now we're ready to fit the model on this dataset:
[8]:
m.fit(ds, features=features_in_ds, dim=features_zdim)
m
[8]:
<pcm 'gmm' (K: 8, F: 2)>
Number of class: 8
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
log likelihood of the training set: 38.784750
Note
pyXpcm can also identify PCM features and axes within an xarray.Dataset through variable attributes. For the example above we can set:
ds['TEMP'].attrs['feature_name'] = 'temperature'
ds['PSAL'].attrs['feature_name'] = 'salinity'
ds['DEPTH'].attrs['axis'] = 'Z'
And then simply call the fit method without arguments:
m.fit(ds)
Note that if the data follow the CF conventions, the vertical dimension's axis attribute should already be set to Z.
Classify data
Now that the PCM is fitted, we can predict classification labels:
[9]:
m.predict(ds, features=features_in_ds, inplace=True)
ds
[9]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Data variables:
    LATITUDE    (N_PROF) float32 27.122 27.818 27.452 ... 4.243 4.15 4.44
    LONGITUDE   (N_PROF) float32 -74.86 -75.6 -74.949 ... -0.821 -0.002
    TIME        (N_PROF) datetime64[ns] 2008-06-23T13:07:30 ... 2013-03-09T14:52:58
    DBINDEX     (N_PROF) float64 1.484e+04 1.622e+04 ... 1.063e+04
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182
Prediction labels are automatically added to the dataset as PCM_LABELS because the option inplace was set to True. We didn't specify the dim option because our dataset is CF compliant.
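Since PCM_LABELS is a regular dataset variable, standard array operations apply to it; for instance, counting the number of profiles assigned to each class (sketched here with plain numpy on a stand-in label array):

```python
import numpy as np

# Stand-in for the PCM_LABELS variable added by predict()
labels = np.array([1, 1, 3, 3, 3, 0])

# Number of profiles assigned to each class
classes, counts = np.unique(labels, return_counts=True)
for k, n in zip(classes, counts):
    print(f"class {k}: {n} profiles")
```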
pyXpcm uses a GMM (Gaussian Mixture Model) classifier by default, which is a fuzzy classifier, so we can also predict the probability of each class for all profiles, the so-called posteriors:
[10]:
m.predict_proba(ds, features=features_in_ds, inplace=True)
ds
[10]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560, pcm_class: 8)
Coordinates:
  * N_PROF      (N_PROF) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Dimensions without coordinates: pcm_class
Data variables:
    LATITUDE    (N_PROF) float32 27.122 27.818 27.452 ... 4.243 4.15 4.44
    LONGITUDE   (N_PROF) float32 -74.86 -75.6 -74.949 ... -0.821 -0.002
    TIME        (N_PROF) datetime64[ns] 2008-06-23T13:07:30 ... 2013-03-09T14:52:58
    DBINDEX     (N_PROF) float64 1.484e+04 1.622e+04 ... 1.063e+04
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 ...
    BRV2        (N_PROF, DEPTH) float32 ...
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3
    PCM_POST    (pcm_class, N_PROF) float64 0.0 0.0 0.0 ... 1.388e-19 6.625e-20
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182
which are added to the dataset as the PCM_POST variable. The class probabilities for each profile come with a new dimension, pcm_class by default, that goes from 0 to K-1.
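Posteriors sum to 1 over the pcm_class dimension for each profile; the hard label corresponds to the class of maximum posterior, and that maximum can serve as a per-profile measure of classification robustness. A sketch with a stand-in array:

```python
import numpy as np

# Stand-in posteriors: (K classes, N profiles), as in PCM_POST
post = np.array([[0.7, 0.1, 0.0],
                 [0.2, 0.8, 0.5],
                 [0.1, 0.1, 0.5]])

assert np.allclose(post.sum(axis=0), 1.0)  # probabilities per profile sum to 1
labels = post.argmax(axis=0)     # hard labels, as produced by predict()
confidence = post.max(axis=0)    # how "robust" each classification is
print(labels)      # [0 1 1]  (ties resolved to the lowest class index)
print(confidence)  # [0.7 0.8 0.5]
```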
Note
You can delete variables added by pyXpcm to the xarray.Dataset with the pyxpcm.xarray.pyXpcmDataSetAccessor.drop_all() method:
ds.pyxpcm.drop_all()
Or you can split pyXpcm variables out of the original xarray.Dataset:
ds_pcm, ds = ds.pyxpcm.split()
It is important to note that once the PCM is fitted, you can predict labels for any dataset, as long as it has the PCM features.
For instance, let’s predict labels for a gridded dataset:
[12]:
ds_gridded = pyxpcm.tutorial.open_dataset('isas_snapshot').load()
ds_gridded
[12]:
<xarray.Dataset>
Dimensions:      (depth: 152, latitude: 53, longitude: 61)
Coordinates:
  * latitude     (latitude) float32 30.023445 30.455408 ... 49.41288 49.737103
  * longitude    (longitude) float32 -70.0 -69.5 -69.0 ... -41.0 -40.5 -40.0
  * depth        (depth) float32 -1.0 -3.0 -5.0 ... -1960.0 -1980.0 -2000.0
Data variables:
    TEMP         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    TEMP_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    TEMP_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    PSAL         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    PSAL_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    PSAL_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    SST          (latitude, longitude) float32 dask.array<chunksize=(53, 61)>
[13]:
m.predict(ds_gridded, features={'temperature':'TEMP','salinity':'PSAL'}, dim='depth', inplace=True)
ds_gridded
[13]:
<xarray.Dataset>
Dimensions:      (depth: 152, latitude: 53, longitude: 61)
Coordinates:
  * latitude     (latitude) float64 30.02 30.46 30.89 ... 49.09 49.41 49.74
  * longitude    (longitude) float64 -70.0 -69.5 -69.0 ... -41.0 -40.5 -40.0
  * depth        (depth) float32 -1.0 -3.0 -5.0 ... -1960.0 -1980.0 -2000.0
Data variables:
    TEMP         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    TEMP_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    TEMP_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    PSAL         (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    PSAL_ERR     (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    PSAL_PCTVAR  (depth, latitude, longitude) float32 dask.array<chunksize=(152, 53, 61)>
    SST          (latitude, longitude) float32 dask.array<chunksize=(53, 61)>
    PCM_LABELS   (latitude, longitude) float64 1.0 1.0 1.0 1.0 ... nan ... 7.0 7.0 7.0 7.0
where you can see the addition of the PCM_LABELS variable.
Vertical structure of classes
One key outcome of the PCM analysis is the vertical structure of each class.
This can be computed using the pyxpcm.stat.quantile() method.
Below we compute the 5, 50 and 95% quantiles of temperature and salinity for each class:
[14]:
for vname in ['TEMP', 'PSAL']:
ds = ds.pyxpcm.quantile(m, q=[0.05, 0.5, 0.95], of=vname, outname=vname + '_Q', keep_attrs=True, inplace=True)
ds
[14]:
<xarray.Dataset>
Dimensions:     (DEPTH: 282, N_PROF: 7560, pcm_class: 8, quantile: 3)
Coordinates:
  * pcm_class   (pcm_class) int64 0 1 2 3 4 5 6 7
  * N_PROF      (N_PROF) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * DEPTH       (DEPTH) float32 0.0 -5.0 -10.0 ... -1400.0 -1405.0
  * quantile    (quantile) float64 0.05 0.5 0.95
Data variables:
    LATITUDE    (N_PROF) float32 27.122 27.818 27.452 ... 4.15 4.44
    LONGITUDE   (N_PROF) float32 -74.86 -75.6 ... -0.821 -0.002
    TIME        (N_PROF) datetime64[ns] 2008-06-23T13:07:30 ... 2013-03-09T14:52:58
    DBINDEX     (N_PROF) float64 1.484e+04 1.622e+04 ... 1.063e+04
    TEMP        (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
    PSAL        (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
    SIG0        (N_PROF, DEPTH) float32 23.601229 23.601229 ... 27.685583
    BRV2        (N_PROF, DEPTH) float32 0.00029447526 ... 4.500769e-06
    PCM_LABELS  (N_PROF) int64 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3
    PCM_POST    (pcm_class, N_PROF) float64 0.0 0.0 0.0 ... 1.388e-19 6.625e-20
    TEMP_Q      (pcm_class, quantile, DEPTH) float64 3.07 3.07 3.07 ... 4.844 4.83 4.814
    PSAL_Q      (pcm_class, quantile, DEPTH) float64 34.03 34.03 34.03 ... 35.01 35.01
Attributes:
    Sample test prepared by:  G. Maze
    Institution:              Ifremer/LOPS
    Data source DOI:          10.17882/42182
Quantiles can be plotted using the pyxpcm.plot.quantile() method.
[15]:
fig, ax = m.plot.quantile(ds['TEMP_Q'], maxcols=4, figsize=(10, 8), sharey=True)

Geographic distribution of classes
Warning
To follow this section you’ll need to have Cartopy installed and working.
A map of labels can now easily be plotted:
[16]:
proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-80,1,-1,66]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5,5), dpi=120, facecolor='w', edgecolor='k', subplot_kw=subplot_kw)
kmap = m.plot.cmap()
sc = ax.scatter(ds['LONGITUDE'], ds['LATITUDE'], s=3, c=ds['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K)
cl = m.plot.colorbar(ax=ax)
gl = m.plot.latlongrid(ax, dx=10)
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.set_title('LABELS of the training set')
plt.show()

Since we predicted labels for 2 datasets, we can superimpose them:
[17]:
proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-75,-35,25,55]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5,5), dpi=120, facecolor='w', edgecolor='k', subplot_kw=subplot_kw)
kmap = m.plot.cmap()
sc = ax.pcolor(ds_gridded['longitude'], ds_gridded['latitude'], ds_gridded['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K)
sc = ax.scatter(ds['LONGITUDE'], ds['LATITUDE'], s=10, c=ds['PCM_LABELS'], cmap=kmap, transform=proj, vmin=0, vmax=m.K, edgecolors=[0.3]*3, linewidths=0.3)
cl = m.plot.colorbar(ax=ax)
gl = m.plot.latlongrid(ax, dx=10)
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.set_title('LABELS of the training set (dots) and another product (shade)')
plt.show()

Posteriors are defined for each data point and give the probability of that point belonging to each of the classes. They can be plotted this way:
[18]:
cmap = sns.light_palette("blue", as_cmap=True)
proj = ccrs.PlateCarree()
subplot_kw={'projection': proj, 'extent': np.array([-80,1,-1,66]) + np.array([-0.1,+0.1,-0.1,+0.1])}
fig, ax = m.plot.subplots(figsize=(10,22), maxcols=2, subplot_kw=subplot_kw)
for k in m:
sc = ax[k].scatter(ds['LONGITUDE'], ds['LATITUDE'], s=3, c=ds['PCM_POST'].sel(pcm_class=k),
cmap=cmap, transform=proj, vmin=0, vmax=1)
cl = plt.colorbar(sc, ax=ax[k], fraction=0.03)
gl = m.plot.latlongrid(ax[k], fontsize=8, dx=20, dy=10)
ax[k].add_feature(cfeature.LAND)
ax[k].add_feature(cfeature.COASTLINE)
ax[k].set_title('PCM Posteriors k=%i' % k)

PCM properties
The PCM class does a lot of data preprocessing under the hood in order to classify profiles.
Here is how to access PCM preprocessed results and data.
Import and set-up
Import the library and toy data
[2]:
import pyxpcm
from pyxpcm.models import pcm
Let’s work with a standard PCM of temperature and salinity, from the surface down to -1000m:
[3]:
# Define a vertical axis to work with
z = np.arange(0.,-1000,-10.)
# Define features to use
features_pcm = {'temperature': z, 'salinity': z}
# Instantiate the PCM
m = pcm(K=4, features=features_pcm, maxvar=2)
print(m)
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: False
Feature: 'temperature'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
Note that here we used a strong dimensionality reduction to limit the dimensions and size of the plots to come (maxvar=2 tells the PCM to use the first 2 PCAs of each variable).
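As an aside, the effect of keeping only the first 2 principal components can be sketched with scikit-learn directly. This is a standalone illustration on synthetic data, not the pyXpcm internals:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic "profiles": 500 samples x 100 vertical levels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))

# Keep only the first 2 principal components,
# which is what maxvar=2 asks the PCM reducer to do per feature
reducer = PCA(n_components=2)
Xr = reducer.fit_transform(X)
print(Xr.shape)  # (500, 2)
```

Each feature is thus represented by a handful of coefficients instead of its full vertical axis, which is what keeps the classifier problem small.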
Now we can load a dataset to be used for fitting.
[4]:
ds = pyxpcm.tutorial.open_dataset('argo').load()
Fit and predict model on data:
[5]:
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}
ds = ds.pyxpcm.fit_predict(m, features=features_in_ds, inplace=True)
print(ds)
<xarray.Dataset>
Dimensions: (DEPTH: 282, N_PROF: 7560)
Coordinates:
* N_PROF (N_PROF) int64 0 1 2 3 4 5 6 ... 7554 7555 7556 7557 7558 7559
* DEPTH (DEPTH) float32 0.0 -5.0 -10.0 -15.0 ... -1395.0 -1400.0 -1405.0
Data variables:
LATITUDE (N_PROF) float32 ...
LONGITUDE (N_PROF) float32 ...
TIME (N_PROF) datetime64[ns] ...
DBINDEX (N_PROF) float64 ...
TEMP (N_PROF, DEPTH) float32 27.422163 27.422163 ... 4.391791
PSAL (N_PROF, DEPTH) float32 36.35267 36.35267 ... 34.910286
SIG0 (N_PROF, DEPTH) float32 ...
BRV2 (N_PROF, DEPTH) float32 ...
PCM_LABELS (N_PROF) int64 1 1 1 1 1 1 1 1 1 1 1 1 ... 2 2 2 2 2 2 2 2 2 2 2
Attributes:
Sample test prepared by: G. Maze
Institution: Ifremer/LOPS
Data source DOI: 10.17882/42182
Scaler properties
[6]:
fig, ax = m.plot.scaler()
# More options:
# m.plot.scaler(style='darkgrid')
# m.plot.scaler(style='darkgrid', subplot_kw={'ylim':[-1000,0]})

Reducer properties
Plot the eigenvectors for a PCA reducer (nothing is plotted if no reducer is used):
[7]:
fig, ax = m.plot.reducer()
# Equivalent to:
# pcmplot.reducer(m)
# More options:
# m.plot.reducer(pcalist = range(0,4));
# m.plot.reducer(pcalist = [0], maxcols=1);
# m.plot.reducer(pcalist = range(0,4), style='darkgrid', plot_kw={'linewidth':1.5}, subplot_kw={'ylim':[-1400,0]}, figsize=(12,10));

Scatter plot of features, as seen by the classifier
You can access the pre-processed data for your own plots/analyses through the pyxpcm.pcm.preprocessing() method:
[8]:
X, sampling_dims = m.preprocessing(ds, features=features_in_ds)
X
[8]:
<xarray.DataArray (n_samples: 7560, n_features: 4)>
array([[ 1.9281656 , -0.09149919,  1.7340997 , -0.27024782],
       [ 2.314077  ,  0.10684185,  2.0836833 , -0.18765019],
       [ 1.6755121 , -0.17313023,  1.5637012 , -0.43244886],
       ...,
       [-0.802601  , -0.5783772 , -1.5761338 , -0.31184074],
       [-0.9552184 , -0.6094395 , -1.8049222 , -0.4273216 ],
       [-0.8925139 , -0.6237318 , -1.7922652 , -0.46551177]], dtype=float32)
Coordinates:
  * n_samples   (n_samples) int64 0 1 2 3 4 ... 7556 7557 7558 7559
  * n_features  (n_features) <U13 'temperature_0' 'temperature_1' 'salinity_0' 'salinity_1'
pyXpcm returns a 2-dimensional xarray.DataArray for which pairwise relationships can easily be visualised with the pyxpcm.plot.preprocessed() method (this requires Seaborn):
[9]:
g = m.plot.preprocessed(ds, features=features_in_ds, style='darkgrid')
# A posteriori adjustements:
# g.set(xlim=(-3,3),ylim=(-3,3))
# g.savefig('toto.png')

[10]:
# Combine KDE with histograms (very slow plot):
g = m.plot.preprocessed(ds, features=features_in_ds, kde=True)

Save and load PCM from local files
PCM instances are lightweight Python objects and can easily be saved to and loaded from files. pyXpcm uses the netCDF file format because it makes it easy to add metadata to numerical arrays.
Import and set-up
Import the library and toy data
[2]:
import pyxpcm
from pyxpcm.models import pcm
# Load tutorial data:
ds = pyxpcm.tutorial.open_dataset('argo').load()
Saving a model
Let’s first create a PCM and fit it onto the tutorial dataset:
[3]:
# Define a vertical axis to work with
z = np.arange(0.,-1000,-10.)
# Define features to use
features_pcm = {'temperature': z, 'salinity': z}
# Instantiate the PCM:
m = pcm(K=4, features=features_pcm)
# Fit:
m.fit(ds, features={'temperature': 'TEMP', 'salinity': 'PSAL'})
[3]:
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
log likelihood of the training set: 33.335059
We can now save the fitted model to a local file:
[4]:
m.to_netcdf('my_pcm.nc')
Loading a model
To load a PCM from file, use:
[5]:
m_loaded = pyxpcm.load_netcdf('my_pcm.nc')
m_loaded
[5]:
<pcm 'gmm' (K: 4, F: 2)>
Number of class: 4
Number of feature: 2
Feature names: odict_keys(['temperature', 'salinity'])
Fitted: True
Feature: 'temperature'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Feature: 'salinity'
Interpoler: <class 'pyxpcm.utils.Vertical_Interpolator'>
Scaler: 'normal', <class 'sklearn.preprocessing._data.StandardScaler'>
Reducer: True, <class 'sklearn.decomposition._pca.PCA'>
Classifier: 'gmm', <class 'sklearn.mixture._gaussian_mixture.GaussianMixture'>
log likelihood of the training set: 33.335059
Features preprocessing explained
Note
pyXpcm automatically performs all the data preprocessing for you. You should not have to manipulate your data before being able to fit or classify. This page explains the preprocessing steps performed under the hood by pyXpcm.
The Profile Classification Model (PCM) requires data to be preprocessed in order to match the model vertical axis, to scale feature dimensions with each other and to reduce the dimensionality of the problem. Preprocessing is done internally by pyXpcm. Each step can be parameterised.
The PCM preprocessing operations are organised into 4 steps:

Stack
This step masks, extracts, flattens and transforms any ND-array set of feature variables (e.g. temperature, salinity) into a plain 2D-array collection of vertical profiles usable by machine learning methods.
Mask
This step computes a mask of the input data that rejects all profiles that are full of NaNs over the depth range of the feature vertical axis. This ensures that all feature variables can be successfully retrieved to fill the plain 2D-array collection of profiles.
This operation is conducted by pyxpcm.xarray.pyXpcmDataSetAccessor.mask(), so that the mask can be computed (and plotted) this way:
[2]:
mask = ds.pyxpcm.mask(m)
print(mask)
<xarray.DataArray 'pcm_MASK' (latitude: 53, longitude: 61)>
dask.array<eq, shape=(53, 61), dtype=bool, chunksize=(53, 61), chunktype=numpy.ndarray>
Coordinates:
* latitude (latitude) float32 30.023445 30.455408 ... 49.41288 49.737103
* longitude (longitude) float32 -70.0 -69.5 -69.0 -68.5 ... -41.0 -40.5 -40.0
[3]:
mask.plot();

Ravel
For an ND-array to be used as a feature, it has to be ravelled (flattened) along the N-1 dimensions that are not the vertical one. This operation transforms any ND-array into a 2D-array (with sampling and vertical_axis dimensions) and additionally drops profiles according to the PCM mask determined above.
This operation is conducted by pyxpcm.pcm.ravel().
The output 2D-array is a xarray.DataArray that can be chunked along the sampling dimension with the PCM constructor option chunk_size:
[4]:
m = pcm(K=3, features=features_pcm, chunk_size=1e3).fit(ds)
By default, chunk_size='auto'.
[5]:
X, z, sampling_dims = m.ravel(ds['TEMP'], dim='depth', feature_name='TEMP')
X
[5]:
<xarray.DataArray (sampling: 2289, depth: 152)>
dask.array<shape=(2289, 152), dtype=float32, chunksize=(1000, 152), chunktype=numpy.ndarray>
Coordinates:
  * depth      (depth) float32 -1.0 -3.0 -5.0 ... -1980.0 -2000.0
  * sampling   (sampling) MultiIndex (latitude, longitude)
  - latitude   (sampling) float64 30.02 30.02 30.02 ... 49.74 49.74
  - longitude  (sampling) float64 -70.0 -69.5 -69.0 ... -40.5 -40.0
Attributes:
    long_name:      Temperature
    standard_name:  sea_water_temperature
    units:          degree_Celsius
    valid_min:      -23000
    valid_max:      20000
Note the chunk size of the dask.array.Array underlying this feature.
Interpolate
Even if the input data vertical axis is in the range of the PCM feature axis, it may not be defined on the same level values. In this step, if the input data are not defined on the same vertical axis as the PCM, an interpolation is triggered. The interpolation is conducted following these rules:
If the PCM axis levels are found in the input data vertical axis, a simple intersection is used.
If the PCM axis starts at the surface (0 value) and the input data do not, the 1st non-NaN value is replicated to the surface, as a mixed layer.
If the PCM axis levels are not in the input data vertical axis, a linear interpolation through the xarray.DataArray.interp() method is triggered for each profile.
The entire interpolation process is managed by a pyxpcm.utils.Vertical_Interpolator instance that is created at PCM instantiation.
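The surface-replication and linear-interpolation rules can be sketched with plain numpy on a synthetic profile. This is a minimal illustration, not the actual Vertical_Interpolator code; it relies on np.interp clamping to the first value above the shallowest input level, which mimics the mixed-layer rule:

```python
import numpy as np

# A profile on a coarse input axis (negative, downward-oriented depths)
z_in = np.array([-10., -50., -100., -500., -1000.])
temp = np.array([20., 18., 15., 8., 4.])

# A PCM axis starting at the surface (0 m)
z_pcm = np.array([0., -10., -50., -100., -500., -1000.])

# np.interp needs increasing coordinates, so interpolate on -z;
# above -10 m it clamps to the first value, replicating it to the surface
temp_i = np.interp(-z_pcm, -z_in, temp)
print(temp_i[0])  # 20.0, replicated from the -10 m level
```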
Scale
Each variable can be normalised at each vertical level. This step ensures that structures/patterns located at depth in the profile will be considered by the classifier similarly to those close to the surface.
Scaling is defined at PCM creation (pyxpcm.models.pcm) with the option scale. It is an integer value with the following meaning:
0: No scaling
1: Center on sample mean and scale by sample std
2: Center on sample mean only
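Since pyXpcm relies on scikit-learn's StandardScaler for this step, options 1 and 2 can be sketched with it directly. A minimal standalone illustration on toy data, not the pyXpcm internals:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy "levels" with very different magnitudes
X = np.array([[1., 10.], [2., 20.], [3., 30.]])

# scale=1: center on sample mean and scale by sample std
X1 = StandardScaler(with_mean=True, with_std=True).fit_transform(X)

# scale=2: center on sample mean only
X2 = StandardScaler(with_mean=True, with_std=False).fit_transform(X)

print(X1.std(axis=0))  # both columns now have unit variance
print(X2)              # columns centered, original spread kept
```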
Reduce
[TBC]
Combine
[TBC]
Debugging and performances
Import and set-up
Import the library and toy data
[2]:
import pyxpcm
from pyxpcm.models import pcm
# Load a dataset to work with:
ds = pyxpcm.tutorial.open_dataset('argo').load()
# Define vertical axis and features to use:
z = np.arange(0.,-1000.,-10.)
features_pcm = {'temperature': z, 'salinity': z}
features_in_ds = {'temperature': 'TEMP', 'salinity': 'PSAL'}
Debugging
Use option debug
to print log messages
[3]:
# Instantiate a new PCM:
m = pcm(K=8, features=features_pcm, debug=True)
# Fit with log:
m.fit(ds, features=features_in_ds);
> Start preprocessing for action 'fit'
> Preprocessing xarray dataset 'TEMP' as PCM feature 'temperature'
X RAVELED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (282,))]
Output axis is in the input axis, not need to interpolate, simple intersection
X INTERPOLATED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (100,))]
X SCALED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
X REDUCED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
temperature pre-processed with success, [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
Homogenisation for fit of temperature
> Preprocessing xarray dataset 'PSAL' as PCM feature 'salinity'
X RAVELED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (282,))]
Output axis is in the input axis, not need to interpolate, simple intersection
X INTERPOLATED with success [<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((7560,), (100,))]
X SCALED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
X REDUCED with success) [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
salinity pre-processed with success, [<class 'xarray.core.dataarray.DataArray'>, <class 'numpy.ndarray'>, None]
Homogenisation for fit of salinity
Features array shape and type for xarray: (7560, 30) <class 'numpy.ndarray'> <class 'memoryview'>
> Preprocessing done, working with final X (<class 'xarray.core.dataarray.DataArray'>) array of shape: (7560, 30) and sampling dimensions: ['N_PROF']
Performance / Optimisation
Use the timeit and timeit_verb options to measure the computation time of PCM operations.
Times are accessible as a pandas DataFrame through the timeit property of the pyXpcm instance.
The m.plot.timeit() plot method allows for a simple visualisation of these times.
Time readings during execution
[4]:
# Create a PCM and execute methods:
m = pcm(K=8, features=features_pcm, timeit=True, timeit_verb=1)
m.fit(ds, features=features_in_ds);
fit.1-preprocess.1-mask: 26 ms
fit.1-preprocess.2-feature_temperature.1-ravel: 48 ms
fit.1-preprocess.2-feature_temperature.2-interp: 2 ms
fit.1-preprocess.2-feature_temperature.3-scale_fit: 14 ms
fit.1-preprocess.2-feature_temperature.4-scale_transform: 8 ms
fit.1-preprocess.2-feature_temperature.5-reduce_fit: 22 ms
fit.1-preprocess.2-feature_temperature.6-reduce_transform: 5 ms
fit.1-preprocess.2-feature_temperature.total: 103 ms
fit.1-preprocess: 103 ms
fit.1-preprocess.3-homogeniser: 1 ms
fit.1-preprocess.2-feature_salinity.1-ravel: 42 ms
fit.1-preprocess.2-feature_salinity.2-interp: 1 ms
fit.1-preprocess.2-feature_salinity.3-scale_fit: 12 ms
fit.1-preprocess.2-feature_salinity.4-scale_transform: 10 ms
fit.1-preprocess.2-feature_salinity.5-reduce_fit: 14 ms
fit.1-preprocess.2-feature_salinity.6-reduce_transform: 3 ms
fit.1-preprocess.2-feature_salinity.total: 85 ms
fit.1-preprocess: 85 ms
fit.1-preprocess.3-homogeniser: 1 ms
fit.1-preprocess.4-xarray: 1 ms
fit.1-preprocess: 225 ms
fit.fit: 2206 ms
fit.score: 10 ms
fit: 2442 ms
A posteriori execution time analysis
[5]:
# Create a PCM and execute methods:
m = pcm(K=8, features=features_pcm, timeit=True, timeit_verb=0)
m.fit(ds, features=features_in_ds);
m.predict(ds, features=features_in_ds);
m.fit_predict(ds, features=features_in_ds);
Execution times are accessible as a dataframe through the pyxpcm.pcm.timeit property:
[6]:
m.timeit
[6]:
Method Sub-method Sub-sub-method Sub-sub-sub-method
fit 1-preprocess 1-mask total 19.836187
2-feature_temperature 1-ravel 32.550097
2-interp 0.828981
3-scale_fit 9.926319
4-scale_transform 5.232811
...
fit_predict fit total 737.503052
score total 8.855104
predict total 9.073734
xarray total 8.368969
total 882.753134
Length: 66, dtype: float64
Visualisation help
To facilitate your analysis of execution times, you can use pyxpcm.plot.timeit().
Main steps by method
[7]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-method', style='darkgrid') # Default group/split
df
[7]:
Sub-method   1-preprocess          fit   predict      score    xarray
Method
fit            556.826353  1123.903990       NaN  10.521889       NaN
fit_predict    416.832924   737.503052  9.073734   8.855104  8.368969
predict        402.269125          NaN  9.886980   9.831905  8.663177

Preprocessing main steps by method
[8]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-sub-method')
df
[8]:
Sub-sub-method     1-mask  2-feature_salinity  2-feature_temperature  3-homogeniser  4-xarray
Method
fit             19.836187          122.781515             130.059242       2.494097  2.145052
fit_predict     20.085096           93.130827              89.278936       3.419876  1.243114
predict         24.972916           73.968887              98.288298       1.749992  1.240015

Preprocessing details by method
[9]:
fig, ax, df = m.plot.timeit(group='Method', split='Sub-sub-sub-method')
df
[9]:
Sub-sub-sub-method    1-ravel  2-interp  3-scale_fit  4-scale_transform  5-reduce_fit  6-reduce_transform
Method
fit                 63.939333  1.705170    18.492460          10.242701     25.922775            5.970001
fit_predict         67.147017  1.942873     0.005245          12.181759      0.003815            9.781122
predict             67.429066  1.815319     0.004053          10.471821      0.003099            6.281853

Preprocessing details by features
[10]:
fig, ax, df = m.plot.timeit(split='Sub-sub-sub-method', group='Sub-sub-method', unit='s')
df
[10]:
Sub-sub-sub-method      1-ravel  2-interp  3-scale_fit  4-scale_transform  5-reduce_fit  6-reduce_transform
Sub-sub-method
2-feature_salinity     0.093312  0.002782     0.008571           0.017127      0.012878            0.010059
2-feature_temperature  0.105203  0.002681     0.009930           0.015770      0.013052            0.011974

Help & reference
Bibliography
Maze, Guillaume and Mercier, Herlé and Fablet, Ronan and Tandeo, Pierre and Lopez Radcenco, Manuel and Lenca, Philippe and Feucher, Charlène and Le Goff, Clément. Coherent heat patterns revealed by unsupervised classification of Argo temperature profiles in the North Atlantic Ocean. Progress in Oceanography (2017). 10.1016/j.pocean.2016.12.008.
Maze, Guillaume and Mercier, Herlé and Cabanes, Cécile. Profile Classification Models. Mercator Ocean Journal (2017). https://archimer.ifremer.fr/doc/00387/49816/.
Jones, Daniel C. and Holt, Harry J. and Meijers, Andrew J. S. and Shuckburgh, Emily. Unsupervised Clustering of Southern Ocean Argo Float Temperature Profiles. Journal of Geophysical Research: Oceans (2019). 10.1029/2018jc014629.
Rosso, Isabella and Mazloff, Matthew R. and Talley, Lynne D. and Purkey, Sarah G. and Freeman, Natalie M. and Maze, Guillaume. Water Mass and Biogeochemical Variability in the Kerguelen Sector of the Southern Ocean: A Machine Learning Approach for a Mixing Hot Spot. Journal of Geophysical Research: Oceans (2020). https://doi.org/10.1029/2019JC015877.
Boehme, Lars and Rosso, Isabella. Classifying Oceanographic Structures in the Amundsen Sea, Antarctica. Geophysical Research Letters (2021). https://doi.org/10.1029/2020GL089412.
Thomas, S. D. A. and Jones, D. C. and Faul, A. and Mackie, E. and Pauthenet, E.. Defining Southern Ocean fronts using unsupervised classification. Ocean Science (2021). 10.5194/os-17-1545-2021.
Li, Xiaojuan and Mao, Zhihua and Zheng, Hongrui and Zhang, Wei and Yuan, Dapeng and Li, Youzhi and Wang, Zheng and Liu, Yunxin. Process-Oriented Estimation of Chlorophyll-a Vertical Profile in the Mediterranean Sea Using MODIS and Oceanographic Float Products. Frontiers in Marine Science (2022). 10.3389/fmars.2022.933680.
Sambe, Fumika and Suga, Toshio. Unsupervised Clustering of Argo Temperature and Salinity Profiles in the Mid-Latitude Northwest Pacific Ocean and Revealed Influence of the Kuroshio Extension Variability on the Vertical Structure Distribution. Journal of Geophysical Research: Oceans (2022). https://doi.org/10.1029/2021JC018138.
What’s New
Upcoming release
Fix bug with latest numpy and integer management (issue 42, PR 43), by G. Maze
v0.4.1 (21 Feb. 2020)
Improved documentation
Improved unit testing
Bug fix:
- Fix a bug in the preprocessing step using the dask_ml backend that would cause an error for data already in dask arrays
v0.4.0 (1 Nov. 2019)
Warning
The API has changed, breaking backward compatibility.
Enhancements:
Multiple-features classification
ND-array classification (so that you can classify profiles directly from gridded products, e.g. on a latitude/longitude/time grid, and not only from a collection of profiles already in a 2D array)
pyXpcm methods can be accessed through the xarray.Dataset accessor namespace pyxpcm
Allow choosing the statistics backend (sklearn, dask_ml or user-defined)
Save/load PCM to/from netcdf files
pyXpcm now consumes xarray/dask objects throughout, not only at the user front-end. This adds a small overhead for small datasets but allows a PCM to handle larger and more complex datasets.
v0.3 (5 Apr. 2019)
Removed support for python 2.7
Added more data input consistency checks
Fix bug in interpolation and plotting methods
Added custom colormap and colorbar to plot module
v0.2 (26 Mar. 2019)
Upgrade to python 3.6 (compatible with 2.7)
Added test for continuous coverage
Added score and bic methods
Improved vocabulary consistency in methods
v0.1.3 (12 Nov. 2018)
Initial release.
API reference
This page provides an auto-generated summary of pyXpcm’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
Top-level PCM functions
Creating a PCM
Profile Classification Model class constructor
Load a PCM model from a netcdf file
Attributes
Return the number of classes
Return the number of features
Return the features definition dictionary
Computation
Estimate PCM parameters
Estimate PCM parameters and predict classes
Predict labels for profile samples
Predict the posterior probability of each component given the data
Compute the per-sample average log-likelihood of the given data
Compute the Bayesian information criterion for the current model on the input dataset
Low-level PCM properties and functions
Extract from an N-d array a 2-d X(feature, sample) array and the vertical dimension z
Create a DataArray from a numpy array and sampling dimensions
Plotting
Access plotting functions
Plot PCM Contents
Plot the q-th quantiles of a DataArray for each PCM component
Plot PCM scaler properties
Plot PCM reducer properties
Plot preprocessed features as pairwise scatter plots
Plot the PCM registered timing of operations
Tools
Return categorical colormaps
Add a colorbar to the current plot with ticks centered on discrete colors
Return (figure, axis) with one subplot per cluster
Add latitude/longitude grid lines and labels to a cartopy geoaxes
Statistics
Access statistics functions
Compute the q-th quantile of a xarray.DataArray for each PCM component
Compute classification robustness
Digitize classification robustness
Save/load PCM models
Save a PCM to a netcdf file
Load a PCM model from a netcdf file
Helper
Open a dataset from the pyXpcm online data repository (requires internet).
Xarray pyxpcm name space
Provide an accessor to enhance interoperability between xarray and pyxpcm.
Provide a scope named pyxpcm as an accessor to xarray.Dataset objects.
- class pyxpcm.xarray.pyXpcmDataSetAccessor[source]
Class registered under the scope pyxpcm to access xarray.Dataset objects.
- add(da)[source]
Add a xarray.DataArray to this xarray.Dataset
- bic(this_pcm, **kwargs)[source]
Compute Bayesian information criterion for the current model on the input dataset
Only for a GMM classifier
- Parameters:
- ds: :class:`xarray.Dataset`
The dataset to work with
- features: dict()
Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.
- dim: str
Name of the vertical dimension in the input
xarray.Dataset
- Returns:
- bic: float
The lower the better
- drop_all()[source]
Remove the xarray.DataArray variables created with pyXpcm from this xarray.Dataset
- feature_dict(this_pcm, features=None)[source]
Return a dictionary of features for this xarray.Dataset and a PCM
- Parameters:
- pcm
pyxpcm.pcmmodel.pcm
- featuresdict
Keys are PCM feature names, values are the corresponding xarray.Dataset variable names
- Returns:
- dict()
Dictionary where keys are PCM feature names and values are the corresponding xarray.Dataset variables
- fit(this_pcm, **kwargs)[source]
Estimate PCM parameters
For a PCM, the fit method consists of the following operations:
- pre-processing:
interpolation to the feature_axis levels of the model
scaling
reduction
- estimation of the classifier parameters
- Parameters:
- ds: :class:`xarray.Dataset`
The dataset to work with
- features: dict()
Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.
- dim: str
Name of the vertical dimension in the input
xarray.Dataset
- Returns:
- self
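As a minimal sketch of the attribute-based identification mentioned above (the variable name 'TEMP' and dimension 'DEPTH' are illustrative, not required names), a dataset can be tagged so that the `features` and `dim` arguments may be omitted:

```python
import numpy as np
import xarray as xr

# Toy collection of 10 profiles on 50 vertical levels
ds = xr.Dataset(
    {'TEMP': (('N_PROF', 'DEPTH'), np.random.rand(10, 50))},
    coords={'DEPTH': ('DEPTH', np.linspace(0., -1000., 50))},
)

# Tag the variable with the PCM feature it maps to:
ds['TEMP'].attrs['feature_name'] = 'temperature'
# Tag the vertical dimension:
ds['DEPTH'].attrs['axis'] = 'Z'

# With these attributes set, calls like ds.pyxpcm.fit(m) can identify
# features and the vertical dimension without explicit arguments.
```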
- fit_predict(this_pcm, **kwargs)[source]
Estimate PCM parameters and predict classes.
This method adds these properties to the PCM object:
llh: the log-likelihood of the model with regard to the new data
- Parameters:
- ds: :class:`xarray.Dataset`
The dataset to work with
- features: dict()
Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.
- dim: str
Name of the vertical dimension in the input
xarray.Dataset
- inplace: boolean, False by default
If False, return a xarray.DataArray with predicted labels. If True, return the input xarray.Dataset with labels added as a new xarray.DataArray
- name: str, default is ‘PCM_LABELS’
Name of the DataArray holding labels.
- Returns:
xarray.DataArray
Component labels (if option ‘inplace’ = False)
- or
xarray.Dataset
Input dataset with component labels added as a new ‘PCM_LABELS’ xarray.DataArray (if option ‘inplace’ = True)
- mask(this_pcm, features=None, dim=None)[source]
Create a mask where all PCM features are defined
Create a mask where all feature profiles are not null over the PCM feature axis.
- Parameters:
- :class:`pyxpcm.pcmmodel.pcm`
- features: dict()
Definitions of this_pcm features in the xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.
- dim: str
Name of the vertical dimension in the xarray.Dataset. If not specified or set to None, dim is identified as the xarray.DataArray variable with attribute ‘axis’ set to ‘z’.
- Returns:
- predict(this_pcm, inplace=False, **kwargs)[source]
Predict labels for profile samples
This method adds these properties to the PCM object:
llh: the log-likelihood of the model with regard to the new data
- Parameters:
- ds: :class:`xarray.Dataset`
The dataset to work with
- features: dict()
Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.
- dim: str
Name of the vertical dimension in the input
xarray.Dataset
- inplace: boolean, False by default
If False, return a xarray.DataArray with predicted labels. If True, return the input xarray.Dataset with labels added as a new xarray.DataArray
- name: str, default is ‘PCM_LABELS’
Name of the
xarray.DataArray
with labels
- Returns:
xarray.DataArray
Component labels (if option ‘inplace’ = False)
- or
xarray.Dataset
Input dataset with component labels added as a new ‘PCM_LABELS’ xarray.DataArray (if option ‘inplace’ = True)
- predict_proba(this_pcm, **kwargs)[source]
Predict posterior probability of each components given the data
This method adds these properties to the PCM instance:
llh: the log-likelihood of the model with regard to the new data
- Parameters:
- ds: :class:`xarray.Dataset`
The dataset to work with
- features: dict()
Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.
- dim: str
Name of the vertical dimension in the input
xarray.Dataset
- inplace: boolean, False by default
If False, return a xarray.DataArray with predicted probabilities. If True, return the input xarray.Dataset with probabilities added as a new xarray.DataArray
- name: str, default is ‘PCM_POST’
Name of the DataArray with prediction probability (posteriors)
- classdimname: str, default is ‘pcm_class’
Name of the dimension holding classes
- Returns:
xarray.DataArray
Probability of each Gaussian (state) in the model given each sample (if option ‘inplace’ = False)
- or
xarray.Dataset
Input dataset with component probabilities added as a new ‘PCM_POST’ xarray.DataArray (if option ‘inplace’ = True)
- quantile(this_pcm, inplace=False, **kwargs)[source]
Compute the q-th quantile of a xarray.DataArray for each PCM component
- Parameters:
- q: float in the range of [0,1] (or sequence of floats)
Quantiles to compute, which must be between 0 and 1 inclusive.
- of: str
Name of the xarray.Dataset variable to compute quantiles for.
- using: str
Name of the xarray.Dataset variable with classification labels to use. ‘PCM_LABELS’ is used by default.
- outname: ‘PCM_QUANT’ or str
Name of the xarray.DataArray with quantiles
- keep_attrs: boolean, False by default
Whether to preserve the ‘of’ xarray.Dataset variable attributes in the new quantile variable.
- Returns:
xarray.Dataset with shape (K, n_quantiles, N_z=n_features)
- or
xarray.DataArray with shape (K, n_quantiles, N_z=n_features)
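To make the per-class quantile computation concrete, here is a numpy-only sketch of what it amounts to (this is not pyXpcm's internal code; shapes and names are illustrative): given labels from predict() and a profile collection X(sample, level), compute the median profile of each class.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_samples, n_levels = 8, 100, 30

# Toy profile collection and class labels (every class guaranteed non-empty)
X = rng.random((n_samples, n_levels))
labels = np.arange(n_samples) % K
rng.shuffle(labels)

# q=0.5: the median profile of each of the K classes
quant = np.stack([np.quantile(X[labels == k], 0.5, axis=0) for k in range(K)])
# quant has shape (K, n_levels): one typical vertical profile per class
```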
- robustness(this_pcm, inplace=False, **kwargs)[source]
Compute classification robustness
- Parameters:
- name: str, default is ‘PCM_POST’
Name of the xarray.DataArray with prediction probabilities (posteriors)
- classdimname: str, default is ‘pcm_class’
Name of the dimension holding classes
- outname: ‘PCM_ROBUSTNESS’ or str
Name of the xarray.DataArray with robustness
- inplace: boolean, False by default
If False, return a xarray.DataArray with robustness. If True, return the input xarray.Dataset with robustness added as a new xarray.DataArray
- Returns:
xarray.Dataset if inplace=True
- or
xarray.DataArray if inplace=False
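Conceptually, robustness rescales each sample's largest posterior probability so that a uniform posterior (1/K everywhere) maps to 0 and a certain assignment maps to 1. A plausible numpy sketch of such a metric (pyXpcm's exact formula may differ), assuming posteriors of shape (n_samples, K):

```python
import numpy as np

def robustness(posteriors, K):
    """Rescale the max posterior from its floor 1/K up to 1 onto [0, 1]."""
    return (posteriors.max(axis=-1) - 1.0 / K) * K / (K - 1.0)

post = np.array([
    [0.70, 0.20, 0.10],   # confident assignment
    [0.34, 0.33, 0.33],   # ambiguous assignment
])
r = robustness(post, K=3)
# r[0] is close to 0.55, r[1] is close to 0
```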
- robustness_digit(this_pcm, inplace=False, **kwargs)[source]
Digitize classification robustness
- Parameters:
- ds: :class:`xarray.Dataset`
Input dataset
- name: str, default is ‘PCM_POST’
Name of the xarray.DataArray with prediction probabilities (posteriors)
- classdimname: str, default is ‘pcm_class’
Name of the dimension holding classes
- outname: ‘PCM_ROBUSTNESS_CAT’ or str
Name of the xarray.DataArray with robustness categories
- inplace: boolean, False by default
If False, return a xarray.DataArray with robustness categories. If True, return the input xarray.Dataset with robustness categories added as a new xarray.DataArray
- Returns:
xarray.Dataset if inplace=True
- or
xarray.DataArray if inplace=False
- sampling_dim(this_pcm, features=None, dim=None)[source]
Return the list of dimensions to be stacked for sampling
- Parameters:
- pcm
pyxpcm.pcm
- features: None (default) or dict()
Keys are PCM feature names, values are the corresponding xarray.Dataset variable names. If set to None, all PCM features are used.
- dim: None (default) or str
The xarray.Dataset dimension to use as the vertical axis in all features. If set to None, it is automatically set to the dimension with an attribute axis set to Z.
- Returns:
- dict()
Dictionary where keys are xarray.Dataset variable names of features and values are another dictionary with the list of sampling dimensions under the DIM_SAMPLING key and the name of the vertical axis under the DIM_VERTICAL key.
- score(this_pcm, **kwargs)[source]
Compute the per-sample average log-likelihood of the given data
- Parameters:
- ds: :class:`xarray.Dataset`
The dataset to work with
- features: dict()
Definitions of PCM features in the input xarray.Dataset. If not specified or set to None, features are identified using xarray.DataArray attributes ‘feature_name’.
- dim: str
Name of the vertical dimension in the input
xarray.Dataset
- Returns:
- log_likelihood: float
In the case of a GMM classifier, this is the log-likelihood of the Gaussian mixture given the data
- split()[source]
Split pyXpcm variables from the original
xarray.Dataset
- Returns:
- list of (xarray.Dataset, xarray.Dataset)
Two datasets: one with the pyXpcm variables, one with the original dataset