{
“cells”: [
{

“cell_type”: “raw”, “metadata”: {

“papermill”: {
“duration”: 0.013553, “end_time”: “2020-02-11T21:00:30.433971”, “exception”: false, “start_time”: “2020-02-11T21:00:30.420418”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “source”: [

“.. _preprocessing:”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.009143, “end_time”: “2020-02-11T21:00:30.453308”, “exception”: false, “start_time”: “2020-02-11T21:00:30.444165”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“# Features preprocessing”

]

}, {

“cell_type”: “code”, “execution_count”: 1, “metadata”: {

“nbsphinx”: “hidden”, “papermill”: {
“duration”: 2.487577, “end_time”: “2020-02-11T21:00:32.948821”, “exception”: false, “start_time”: “2020-02-11T21:00:30.461244”,
“status”: “completed”

}, “tags”: []

}, “outputs”: [], “source”: [

“# Hidden notebook set-up\n”, “\n”, “import os, sys\n”, “import numpy as np\n”, “import pandas as pd\n”, “import xarray as xr\n”, “import matplotlib.pyplot as plt\n”, “%matplotlib inline\n”, “sys.path.insert(0, os.path.abspath('/Users/gmaze/git/github/gmaze/pyxpcm'))\n”, “\n”, “import pyxpcm\n”, “from pyxpcm.models import pcm\n”, “\n”, “import seaborn as sns\n”, “import cartopy.crs as ccrs\n”, “import cartopy.feature as cfeature\n”, “import matplotlib.ticker as mticker\n”, “import matplotlib as mpl\n”, “\n”, “# Load sample data:\n”, “ds = pyxpcm.tutorial.open_dataset('isas_snapshot').load()\n”, “\n”, “# Define vertical axis and features to use:\n”, “z = ds['depth'].where(ds['depth']>=-1200, drop=True)\n”, “features_pcm = {'TEMP': z}\n”, “\n”, “m = pcm(K=3, features=features_pcm)”

]

}, {

“cell_type”: “raw”, “metadata”: {

“papermill”: {
“duration”: 0.006653, “end_time”: “2020-02-11T21:00:32.963525”, “exception”: false, “start_time”: “2020-02-11T21:00:32.956872”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “source”: [

“The Profile Classification Model (PCM) requires data to be preprocessed in order to match the vertical axis of the model, to scale feature dimensions relative to each other and to reduce the dimensionality of the problem. Some of these steps are mandatory, and all of them can be parameterised by the user.\n”, “\n”, “The PCM preprocessing operations are organised into 4 steps:\n”, “\n”, “.. image:: _static/Preprocessing_pipeline_2lines.png\n”, “   :width: 100%\n”, “   :align: center”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.006608, “end_time”: “2020-02-11T21:00:32.977136”, “exception”: false, “start_time”: “2020-02-11T21:00:32.970528”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“## Stack”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.00741, “end_time”: “2020-02-11T21:00:32.991539”, “exception”: false, “start_time”: “2020-02-11T21:00:32.984129”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“This step masks, extracts, flattens and transforms any ND-array set of feature variables (e.g. temperature, salinity) into a plain 2D-array collection of vertical profiles usable by machine learning methods.”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.006983, “end_time”: “2020-02-11T21:00:33.006119”, “exception”: false, “start_time”: “2020-02-11T21:00:32.999136”,
“status”: “completed”

}, “tags”: [], “toc-hr-collapsed”: false

}, “source”: [

“### Mask”

]

}, {

“cell_type”: “raw”, “metadata”: {

“papermill”: {
“duration”: 0.00649, “end_time”: “2020-02-11T21:00:33.019490”, “exception”: false, “start_time”: “2020-02-11T21:00:33.013000”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “source”: [

“This step computes a mask of the input data that rejects all profiles containing only NaNs over the depth range of the feature vertical axis. This ensures that all feature variables can be successfully retrieved to fill in the plain 2D-array collection of profiles.\n”, “\n”, “This operation is conducted by pyxpcm.xarray.pyXpcmDataSetAccessor.mask(), so the mask can be computed (and plotted) this way:”

]

}, {

“cell_type”: “code”, “execution_count”: 2, “metadata”: {

“papermill”: {
“duration”: 0.021824, “end_time”: “2020-02-11T21:00:33.048361”, “exception”: false, “start_time”: “2020-02-11T21:00:33.026537”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “outputs”: [

{

“name”: “stdout”, “output_type”: “stream”, “text”: [

“<xarray.DataArray 'pcm_MASK' (latitude: 53, longitude: 61)>\n”, “dask.array<eq, shape=(53, 61), dtype=bool, chunksize=(53, 61), chunktype=numpy.ndarray>\n”, “Coordinates:\n”, “  * longitude (longitude) float32 -70.0 -69.5 -69.0 -68.5 ... -41.0 -40.5 -40.0\n”, “  * latitude (latitude) float32 30.023445 30.455408 ... 49.41288 49.737103\n”

]

}

], “source”: [

“mask = ds.pyxpcm.mask(m)\n”, “print(mask)”

]

}, {

“cell_type”: “code”, “execution_count”: 3, “metadata”: {

“papermill”: {
“duration”: 0.221273, “end_time”: “2020-02-11T21:00:33.276820”, “exception”: false, “start_time”: “2020-02-11T21:00:33.055547”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “outputs”: [

{
“data”: {

“image/png”: “n”, “text/plain”: [

“<Figure size 432x288 with 2 Axes>”

]

}, “metadata”: {

“needs_background”: “light”

}, “output_type”: “display_data”

}

], “source”: [

“mask.plot();”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.007343, “end_time”: “2020-02-11T21:00:33.291512”, “exception”: false, “start_time”: “2020-02-11T21:00:33.284169”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“### Ravel”

]

}, {

“cell_type”: “raw”, “metadata”: {

“papermill”: {
“duration”: 0.006591, “end_time”: “2020-02-11T21:00:33.305412”, “exception”: false, “start_time”: “2020-02-11T21:00:33.298821”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “source”: [

“For an ND-array to be used as a feature, it has to be ravelled (flattened) along the N-1 dimensions that are not the vertical one. This operation thus transforms any ND-array into a 2D-array (with sampling and vertical_axis dimensions) and additionally drops the profiles rejected by the PCM mask determined above.\n”, “\n”, “This operation is conducted by pyxpcm.pcm.ravel().\n”, “\n”, “The output 2D-array is an xarray.DataArray that can be chunked along the sampling dimension with the PCM constructor option chunk_size:”

]

}, {

“cell_type”: “code”, “execution_count”: 4, “metadata”: {

“papermill”: {
“duration”: 0.326244, “end_time”: “2020-02-11T21:00:33.638566”, “exception”: false, “start_time”: “2020-02-11T21:00:33.312322”,
“status”: “completed”

}, “tags”: []

}, “outputs”: [], “source”: [

“m = pcm(K=3, features=features_pcm, chunk_size=1e3).fit(ds)”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.007209, “end_time”: “2020-02-11T21:00:33.653698”, “exception”: false, “start_time”: “2020-02-11T21:00:33.646489”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“By default, chunk_size='auto'.”

]

}, {

“cell_type”: “code”, “execution_count”: 5, “metadata”: {

“papermill”: {
“duration”: 0.034957, “end_time”: “2020-02-11T21:00:33.696022”, “exception”: false, “start_time”: “2020-02-11T21:00:33.661065”,
“status”: “completed”

}, “tags”: []

}, “outputs”: [

{
“data”: {
“text/html”: [
“<pre>&lt;xarray.DataArray &#x27;TEMP&#x27; (sampling: 2289, depth: 152)&gt;\n”, “dask.array&lt;rechunk-merge, shape=(2289, 152), dtype=float32, chunksize=(1000, 152), chunktype=numpy.ndarray&gt;\n”, “Coordinates:\n”, “  * depth (depth) float32 -1.0 -3.0 -5.0 -10.0 ... -1960.0 -1980.0 -2000.0\n”, “  * sampling (sampling) MultiIndex\n”, “  - latitude (sampling) float64 30.02 30.02 30.02 30.02 ... 49.74 49.74 49.74\n”, “  - longitude (sampling) float64 -70.0 -69.5 -69.0 -68.5 ... -41.0 -40.5 -40.0\n”, “Attributes:\n”, “    long_name: Temperature\n”, “    standard_name: sea_water_temperature\n”, “    units: degree_Celsius\n”, “    valid_min: -23000\n”, “    valid_max: 20000</pre>”

], “text/plain”: [

“<xarray.DataArray 'TEMP' (sampling: 2289, depth: 152)>\n”, “dask.array<rechunk-merge, shape=(2289, 152), dtype=float32, chunksize=(1000, 152), chunktype=numpy.ndarray>\n”, “Coordinates:\n”, “  * depth (depth) float32 -1.0 -3.0 -5.0 -10.0 ... -1960.0 -1980.0 -2000.0\n”, “  * sampling (sampling) MultiIndex\n”, “  - latitude (sampling) float64 30.02 30.02 30.02 30.02 ... 49.74 49.74 49.74\n”, “  - longitude (sampling) float64 -70.0 -69.5 -69.0 -68.5 ... -41.0 -40.5 -40.0\n”, “Attributes:\n”, “    long_name: Temperature\n”, “    standard_name: sea_water_temperature\n”, “    units: degree_Celsius\n”, “    valid_min: -23000\n”, “    valid_max: 20000”

]

}, “execution_count”: 5, “metadata”: {}, “output_type”: “execute_result”

}

], “source”: [

“X, z, sampling_dims = m.ravel(ds['TEMP'], dim='depth', feature_name='TEMP')\n”, “X”

]

}, {

“cell_type”: “raw”, “metadata”: {

“papermill”: {
“duration”: 0.007559, “end_time”: “2020-02-11T21:00:33.711220”, “exception”: false, “start_time”: “2020-02-11T21:00:33.703661”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “source”: [

“Note the chunksize of the dask.array.Array for this feature: 1000 profiles along the sampling dimension, as requested with chunk_size.”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.007467, “end_time”: “2020-02-11T21:00:33.726199”, “exception”: false, “start_time”: “2020-02-11T21:00:33.718732”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“### Interpolate”

]

}, {

“cell_type”: “raw”, “metadata”: {

“papermill”: {
“duration”: 0.007477, “end_time”: “2020-02-11T21:00:33.741072”, “exception”: false, “start_time”: “2020-02-11T21:00:33.733595”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “source”: [

“Even if the vertical axis of the input data covers the range of the PCM feature axis, the two may not be defined on the same level values. In this step, if the input data are not defined on the same vertical axis as the PCM, an interpolation is triggered. The interpolation follows these rules:\n”, “\n”, “- If PCM axis levels are found in the input data vertical axis, then a simple intersection is used.\n”, “- If the PCM axis starts at the surface (0 value) but the input data do not, the first non-NaN value is replicated up to the surface, as a mixed layer.\n”, “- If PCM axis levels are not in the input data vertical axis, a linear interpolation through the xarray.DataArray.interp() method is triggered for each profile.\n”, “\n”, “The entire interpolation process is managed by a pyxpcm.utils.Vertical_Interpolator instance that is created at the time of PCM instantiation.”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.007292, “end_time”: “2020-02-11T21:00:33.756155”, “exception”: false, “start_time”: “2020-02-11T21:00:33.748863”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“## Scale”

]

}, {

“cell_type”: “raw”, “metadata”: {

“papermill”: {
“duration”: 0.007415, “end_time”: “2020-02-11T21:00:33.771215”, “exception”: false, “start_time”: “2020-02-11T21:00:33.763800”,
“status”: “completed”

}, “raw_mimetype”: “text/restructuredtext”, “tags”: []

}, “source”: [

“Each variable can be normalised along each vertical level. This step ensures that structures/patterns located at depth in the profile are considered similarly to those close to the surface by the classifier.\n”, “\n”, “Scaling is defined at PCM creation (pyxpcm.models.pcm) with the option scale. It is an integer value with the following meaning:\n”, “\n”, “ - 0: No scaling\n”, “ - 1: Center on the sample mean and scale by the sample standard deviation\n”, “ - 2: Center on the sample mean only”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.00727, “end_time”: “2020-02-11T21:00:33.786008”, “exception”: false, “start_time”: “2020-02-11T21:00:33.778738”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“## Reduce\n”, “\n”, “[TBC]”

]

}, {

“cell_type”: “markdown”, “metadata”: {

“papermill”: {
“duration”: 0.007432, “end_time”: “2020-02-11T21:00:33.801243”, “exception”: false, “start_time”: “2020-02-11T21:00:33.793811”,
“status”: “completed”

}, “tags”: []

}, “source”: [

“## Combine\n”, “\n”, “[TBC]”

]

}

], “metadata”: {

“kernelspec”: {
“display_name”: “obidam36”, “language”: “python”, “name”: “obidam36”

}, “language_info”: {

“codemirror_mode”: {
“name”: “ipython”, “version”: 3

}, “file_extension”: “.py”, “mimetype”: “text/x-python”, “name”: “python”, “nbconvert_exporter”: “python”, “pygments_lexer”: “ipython3”, “version”: “3.6.7”

}, “papermill”: {

“duration”: 5.112727, “end_time”: “2020-02-11T21:00:34.335987”,
“environment_variables”: {}, “exception”: null, “input_path”: “preprocessing.ipynb”, “output_path”: “../preprocessing.ipynb”, “parameters”: {},
“start_time”: “2020-02-11T21:00:29.223260”,
“version”: “1.2.1”

}, “toc-showmarkdowntxt”: false

}, “nbformat”: 4, “nbformat_minor”: 4

}