[CF-metadata] Feedback requested on proposed CF Simple Geometries

Discussion:

Ben Koziol - NOAA Affiliate

2016-09-07 18:13:32 UTC

Greetings,

As part of an EarthCube project for advancing netCDF-CF [1], we are
developing an approach to represent simple geometries in enhanced netCDF-4
with a variable length array backport for netCDF-3. Simple geometries, for
example, may be used to associate stream discharge with river lines or
surface runoff with watershed polygons. We've drafted an initial approach
and reference implementation on the GitHub netCDF-CF-simple-geometry
project [2] and would greatly appreciate feedback from the CF community.
We'd like to make sure our scope is appropriate and our approach is
acceptable.

Scope

- The result of this effort will be a standard that the CF timeSeries
  feature type could use to specify spatial coordinates (define a simple
  geometry) for a timeSeries variable.

- For those familiar with the OGC WKT standard geometry types [3], we will
  include Point, LineString, Polygon, Multipoint, MultiLineString, and
  MultiPolygon (WKT primitives and multipart geometries).

We anticipate that the six chosen geometry types will cover the needs of
most people generating netCDF data. These types also align with other
geospatial data formats such as GeoJSON and ESRI Shapefile. If our approach
is well received by the CF community, we may later adapt it to include
parametric shapes such as circles and ellipses.

Simple Geometry Encoding Method

Driven by the possibility that different features will require different
numbers of coordinates to describe their geometries, our approach uses
variable length (VLEN) arrays in enhanced netCDF-4 and continuous ragged
arrays (CRAs) in netCDF-3. We describe the VLEN netCDF-4 approach first.
The netCDF-3 CRA description follows.

In our approach, a VLEN coordinate_index variable identifies the indices
of geometry coordinates in separate coordinate arrays. The
coordinate_index variable includes a coordinates attribute, which stores the
names of the coordinate variables, and a geom_type attribute, which indicates
the geometry type.

For multipart geometries, the coordinate index variable may include one or
more negative integer flags, each indicating the start of a new geometry
"part" for the current feature. The first geometry part is not preceded by
the negative integer flag. The variable shall include an attribute named
multipart_break_value identifying the flag's value.

For polygon geometries with holes (also called "interiors"), the coordinate
index values shall include a negative integer flagging the start of each
hole. In this case, the variable shall include a hole_break_value attribute
to indicate the flag value.

Other attributes on the coordinate index variable describe clockwise or
anticlockwise node order for polygons and polygon closure convention. For
additional details, see the wiki [4]. With these concepts defined, an
example for multipolygons with holes is shown below. You can copy the WKT
description below into Wicket [5] if you'd like to see what the geometry in
this example looks like.

Well-Known Text (WKT): MULTIPOLYGON(((0 0, 20 0, 20 20, 0 20, 0 0), (1 1,
10 5, 19 1, 1 1), (5 15, 7 19, 9 15, 5 15), (11 15, 13 19, 15 15, 11 15)),
((5 25, 9 25, 7 29, 5 25)), ((11 25, 15 25, 13 29, 11 25)))

Common Data Language (CDL) for netCDF-4 VLEN Arrays:

netcdf multipolygon_example {

types:

int64(*) geom_VLType ;

dimensions:

node = 25 ;

geom = 1 ;

variables:

geom_VLType coordinate_index(geom) ;

string coordinate_index:geom_type = "multipolygon" ;

string coordinate_index:coordinates = "x y" ;

coordinate_index:multipart_break_value = -1 ;

coordinate_index:hole_break_value = -2 ;

string coordinate_index:outer_ring_order = "anticlockwise" ;

string coordinate_index:closure_convention = "last_node_equals_first" ;

double x(node) ;

double y(node) ;

data:

coordinate_index =

{0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2, 13, 14, 15, 16,
-1, 17, 18, 19, 20, -1, 21, 22, 23, 24} ;

x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
11, 15, 13, 11 ;

y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
29, 25, 25, 25, 29, 25 ;

}
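
As a rough illustration of how a reader might interpret the break values in the example above, here is a minimal Python sketch. The function name decode_geometry and its parameters are hypothetical and are not part of the proposal or of any netCDF library; the data are copied from the CDL above as plain lists.

```python
def decode_geometry(index, x, y, multipart_break=-1, hole_break=-2):
    """Split one feature's coordinate_index into polygons -> rings -> (x, y) nodes.

    Hypothetical helper: the break values mirror the multipart_break_value and
    hole_break_value attributes in the CDL example.
    """
    polygons = [[[]]]  # first polygon, holding its (so far empty) exterior ring
    for i in index:
        if i == multipart_break:
            polygons.append([[]])    # start a new polygon part with its exterior ring
        elif i == hole_break:
            polygons[-1].append([])  # start a new interior ring in the current polygon
        else:
            polygons[-1][-1].append((x[i], y[i]))
    return polygons


coordinate_index = [0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2,
                    13, 14, 15, 16, -1, 17, 18, 19, 20, -1, 21, 22, 23, 24]
x = [0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
     11, 15, 13, 11]
y = [0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
     29, 25, 25, 25, 29, 25]

polygons = decode_geometry(coordinate_index, x, y)
for rings in polygons:
    print(len(rings) - 1, "hole(s), exterior ring:", rings[0])
```

The nested result has the same shape as the WKT MULTIPOLYGON text: a list of polygons, each a list of rings, with the exterior ring first.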

You'll find additional examples for VLEN geometry storage on our wiki [6].

Variable Length (VLEN) Arrays in NetCDF-3

To support netCDF-3, we created a VLEN approach for netCDF-3 [7]. Inspired
by CF continuous ragged arrays (CRAs), our approach drops the CRA count
variable in favor of a stop variable that stores the stop index for each
geometry within an array of geometry coordinates. This improves random
access to the CRA "elements" by avoiding the need to sum the counts
preceding the target element index. The stop variable includes a
contiguous_ragged_dimension attribute whose value is the name of the
dimension for which stop indices apply (similar to the CRA sample_dimension
attribute). An example showing how strings can be stored with this approach
is shown below.

Common Data Language (CDL) for netCDF-3 CRAs:

netcdf dwarf_planets {

dimensions:

dwarf_planet = 5 ; // number of dwarf planets described in this file

dwarf_planet_chars = 28 ; // total number of characters for all planet names

variables:

char dwarf_planet_name(dwarf_planet_chars) ;

int dwarf_planet_name_stop(dwarf_planet) ;

dwarf_planet_name_stop:contiguous_ragged_dimension = "dwarf_planet_chars" ;

data:

dwarf_planet_name = "PlutoCeresErisHaumeaMakemake" ;

dwarf_planet_name_stop = 5, 10, 14, 20, 28 ;

}
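
The random-access argument can be sketched in a few lines of Python. This is only an illustration: the helper name element is hypothetical, and it assumes the stop values are zero-based, exclusive end offsets, which is what the data above suggest.

```python
from itertools import accumulate


def element(data, stops, i):
    """Return element i of a ragged array described by exclusive stop indices."""
    start = stops[i - 1] if i > 0 else 0  # previous stop is this element's start
    return data[start:stops[i]]


dwarf_planet_name = "PlutoCeresErisHaumeaMakemake"
dwarf_planet_name_stop = [5, 10, 14, 20, 28]

dwarf_planets = [element(dwarf_planet_name, dwarf_planet_name_stop, i)
                 for i in range(len(dwarf_planet_name_stop))]
print(dwarf_planets)

# With a CRA count variable, the same offsets need a running sum first.
counts = [5, 5, 4, 6, 8]
stops_from_counts = list(accumulate(counts))
```

Reading a single element needs only two lookups in the stop variable, whereas a count variable requires summing all preceding counts.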

For the above geometry example, the VLEN coordinate_index netCDF-4 array is
replaced by a netCDF-3 CRA.

netcdf multipolygon_example {

dimensions:

node = 25 ;

indices = 30;

geom = 1 ;

variables:

int coordinate_index(indices) ;

coordinate_index:geom_type = "multipolygon" ;

coordinate_index:coordinates = "x y" ;

coordinate_index:multipart_break_value = -1 ;

coordinate_index:hole_break_value = -2 ;

coordinate_index:outer_ring_order = "anticlockwise" ;

coordinate_index:closure_convention = "last_node_equals_first" ;

int coordinate_index_stop(geom) ;

coordinate_index_stop:contiguous_ragged_dimension = "indices" ;

double x(node) ;

double y(node) ;

data:

coordinate_index = 0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2,
13, 14, 15, 16, -1, 17, 18, 19, 20, -1, 21, 22, 23, 24 ;

coordinate_index_stop = 30 ;

x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
11, 15, 13, 11 ;

y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
29, 25, 25, 25, 29, 25 ;

}
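
Going the other way, the flat arrays in this example could be produced from nested rings roughly as follows. This is a sketch under assumptions: encode_geometry is a hypothetical name, the input is a single feature given as polygons -> rings -> (x, y) tuples taken from the WKT above, and no netCDF file is written.

```python
def encode_geometry(polygons, multipart_break=-1, hole_break=-2):
    """Flatten one multipolygon into coordinate_index, x and y lists."""
    index, x, y = [], [], []
    for p, rings in enumerate(polygons):
        if p > 0:
            index.append(multipart_break)  # no flag before the first part
        for r, ring in enumerate(rings):
            if r > 0:
                index.append(hole_break)   # rings after the first are holes
            for node_x, node_y in ring:
                index.append(len(x))       # index of the node about to be stored
                x.append(node_x)
                y.append(node_y)
    return index, x, y


multipolygon = [
    [[(0, 0), (20, 0), (20, 20), (0, 20), (0, 0)],
     [(1, 1), (10, 5), (19, 1), (1, 1)],
     [(5, 15), (7, 19), (9, 15), (5, 15)],
     [(11, 15), (13, 19), (15, 15), (11, 15)]],
    [[(5, 25), (9, 25), (7, 29), (5, 25)]],
    [[(11, 25), (15, 25), (13, 29), (11, 25)]],
]

coordinate_index, x, y = encode_geometry(multipolygon)
coordinate_index_stop = [len(coordinate_index)]  # one geometry, so one stop value
print(len(coordinate_index), len(x), coordinate_index_stop)
```

With more than one geometry, each feature's index list would be appended to the same flat array and its running length recorded as the next stop value.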

The CRA method could of course be used in place of VLEN in netCDF-4. See
our wiki page on GitHub [7] for more details and examples.

Questions for the CF Community

1. Are our VLEN netCDF-3 and netCDF-4 approaches acceptable? What changes
   would you recommend?

2. Are the geometry types point, line, polygon, and their multipart
   equivalents sufficient for the community?

Thank you very much for considering our ideas and helping us with your
valuable feedback!

[1] http://earthcube.org/group/advancing-netcdf-cf

[2] https://github.com/bekozi/netCDF-CF-simple-geometry

[3] https://en.wikipedia.org/wiki/Well-known_text

[4] https://github.com/bekozi/netCDF-CF-simple-geometry/wiki

[5] https://arthur-e.github.io/Wicket/sandbox-gmaps3.html

[6] https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/Examples---VLen-Ragged-Arrays

[7] https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3

--
Ben Koziol
NESII/CIRES/NOAA Earth System Research Laboratory
***@noaa.gov
802.392.4522
http://www.esrl.noaa.gov/nesii/

Jonathan Gregory

2016-09-22 10:40:07 UTC

Dear Ben

Thank you for your thoughtful and interesting proposal. I have quite a lot of
questions and comments about it.

* You explain that the need is to specify spatial coordinates with a simple
geometry for a timeSeries variable. For example, this could be for the
discharge as a function of time across some line in a river (your example), or
I suppose it could be an average temperature as a function of time for the
Atlantic Ocean, where you wanted to supply the polygon which drew the outline
of the basin. Have I got the idea? Timeseries like this can be stored in CF,
but their geographical extent is usually described only in words e.g. a region
name of atlantic_ocean, and this is fine for applications like CMIP where you
want to compare data from different data sources in which the Atlantic Ocean
may have different exact shapes (different AOGCMs, in particular). An array of
region names is also possible, so I don't think we need a new convention to
contain your dwarf planet example.

* Sect 9.1 on discrete sampling geometries says it cannot yet be used for cases
"where geo-positioning cannot be described as a discrete point location.
Problematic examples include time series that refer to a geographical region
(e.g. the northern hemisphere) ...". Actually I think that's not quite right.
The existing convention *can* describe regions which are contiguous, and
rectangular or polygonal, using its usual bounds convention (Sect 7.1). I think
we should consider changing this text, because it seems unnecessarily
restrictive. For example, a timeSeries for the average temperature in the
Northern Hemisphere can be stored like this:

dimensions:
region=1;
nv=2;
time=UNLIMITED;
variables:
float temperature(region,time);
temperature:standard_name="surface_temperature";
temperature:units="K";
temperature:coordinates="lat lon";
temperature:cell_methods="time: mean area: mean";
float lat(region);
lat:standard_name="latitude";
lat:units="degrees_north";
lat:bounds="lat_bounds";
float lat_bounds(region,nv);
float lon(region);
lon:standard_name="longitude";
lon:units="degrees_east";
lon:bounds="lon_bounds";
float lon_bounds(region,nv);
data:
lat_bounds=0,90;
lon_bounds=0,360;

which means the region is 0-90N and 0-360E. If the regions were irregular
polygons in latitude and longitude, nv would be the number of vertices and the
lat and lon bounds would trace the outline of the polygon e.g. nv=3, lat=0,90,0
and lon=0,0,90 describes the eighth of the sphere which is bounded by the
meridians at 0E and 90E and the Equator. I think, therefore, we do not need an
additional convention for points or polygonal regions. However, we would need
new conventions for a timeseries where each value applies to a set of
discontiguous regions or regions with holes in them, a set of points, a line or
a set of lines. I guess that these are included in the geometry types you list
(LineString, Multipoint, MultiLineString, and MultiPolygon). Do you have
definite use-cases for all of these? (I ask this because we don't add new
functionality to CF until there is a definite and common need for it in
practice.)

* I suspect that geometries of this kind can be described by the ugrid
convention http://ugrid-conventions.github.io/ugrid-conventions, which is
compliant with CF. Their purpose is to describe a set of connected points,
edges or faces at which values are given, whereas in your case you'd give a
single value for the whole set, but the description of the geometry itself
might be similar. Have you had a look at whether ugrid could meet your needs?
If it almost does so, perhaps a better thing to do would be to propose
additions to ugrid. We would like to avoid having more than one way to describe
such geometries.

If you decide to make use of ugrid instead, the rest of my comments may
not be relevant!

* So far CF does not say anything about the use of netCDF-4 features (i.e. not
the classic model). We have often discussed allowing them but the general
argument is also made that there has to be a compelling case for providing a
new way to do something which can already be done. (Steve Hankin often made
this argument, but since he's mostly retired I'll make it now in his name :-)
If there are two ways to do something, software has to support both of them. We
already have ways to encode ragged arrays, so is there a compelling case for
needing the netCDF-4 vlen array as well? We already have a way to encode
strings too, as character arrays. I think this is probably a discussion we
should have again in a different thread, so I'll just talk about your classic
encoding. The same points apply to both encodings.

* Your approach uses a coordinate_index variable to identify indices of
geometry coordinates e.g.

dimensions:
indices = 30;
node = 25 ;
geom = 1 ;
variables:
int coordinate_index(indices) ;
coordinate_index:coordinates = "x y" ;
double x(node) ;
double y(node) ;
data:
coordinate_index = 0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2,
13, 14, 15, 16, -1, 17, 18, 19, 20, -1, 21, 22, 23, 24 ;
x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
11, 15, 13, 11 ;
y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
29, 25, 25, 25, 29, 25 ;

where the -1 and -2 indices indicate where exterior and interior polygons
begin, and the first polygon has an implied -1 at the start. Is that right?
Given this example, I wonder why you need the index array, because none of the
coordinate indices (values >=0) is repeated, so no space is saved in the x
and y arrays. I guess this would be the usual case. If polygons did touch or
lines crossed, a few points would be in common, but not so many that it seems to
need the complication of the index array. A simpler way to do it would be

int outside_inside(node); // -1 for exterior, -2 for interior
double x(node) ;
double y(node) ;
outside_inside=-1,-1,-1,-1,-1, -2,-2,-2,-2, -2,-2,-2,-2, -2,-2,-2,-2,
-1,-1,-1,-1, -1,-1,-1,-1;
x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
11, 15, 13, 11 ;
y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
29, 25, 25, 25, 29, 25 ;

which needs only one dimension, or you could use the CF ragged array
convention (Sect 9.3.3):

segment=6;
node=25;
int count(segment);
count:sample_dimension="node";
int outside_inside(segment); // -1 for exterior, -2 for interior
double x(node) ;
double y(node) ;
outside_inside=-1,-2,-2,-2,-1,-1;
count=5,4,4,4,4,4;
x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
11, 15, 13, 11 ;
y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
29, 25, 25, 25, 29, 25 ;
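
A reader of this count-based layout could recover the rings along these lines. The sketch is illustrative only: rings_from_counts is a hypothetical name, and it assumes one count and one flag per ring, which for the six rings of the WKT example gives counts of 5, 4, 4, 4, 4, 4.

```python
from itertools import accumulate


def rings_from_counts(count, outside_inside, x, y):
    """Pair each ring's exterior/interior flag with its (x, y) nodes."""
    stops = list(accumulate(count))  # running sum gives each ring's end offset
    starts = [0] + stops[:-1]
    return [(flag, list(zip(x[a:b], y[a:b])))
            for flag, a, b in zip(outside_inside, starts, stops)]


count = [5, 4, 4, 4, 4, 4]
outside_inside = [-1, -2, -2, -2, -1, -1]  # -1 for exterior, -2 for interior
x = [0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
     11, 15, 13, 11]
y = [0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25,
     29, 25, 25, 25, 29, 25]

rings = rings_from_counts(count, outside_inside, x, y)
for flag, nodes in rings:
    print("exterior" if flag == -1 else "interior", nodes)
```

Note that the counts must sum to the size of the node dimension, which is a cheap consistency check when reading such a file.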

* You provide the attributes multipart_break_value and hole_break_value to
specify the values (-1 and -2 above) for the outside vs inside distinction. Do
you need the generality of being able to choose these values? It would seem
simpler to use a character array and specify in the convention which letters
should be used e.g.

char outside_inside(segment);
outside_inside="OIIIOO";

That makes it more readable, perhaps.

* Similarly, you propose attributes for clockwise/anticlockwise node order and
for the polygon closure convention. Do these need to be freely choosable? You
could specify clockwise, like the existing CF bounds convention, and that the
polygons are closed. In the latter case, you could omit the last vertex of each
polygon since it must be the same as the first, and that would save a bit of
space. If you specify these choices, the attributes aren't needed.

* If this convention is going to be used for discrete sampling geometries,
an additional dimension is needed, because in a single data variable you
might have data for several of these geometries. That is, you need an array
of ragged arrays. Again, I wonder whether this suggests trying to use ugrid.
It might be that you could name each one as a mesh, and specify the geometry
for the set of timeSeries as an array of mesh names. That would be a very
easy change to the existing Sect 9.

Best wishes

Jonathan

Arctur, David K

2016-09-22 14:43:21 UTC

Dear Jonathan,

I can't speak to the technical details, but I can mention some motivation for simple geometries. Among other applications, NetCDF-CF is now being used as an intermediate and output data format in the US National Weather Service's National Water Model (NWM). The model forecasts streamflow rates in about 2.7 million stream segments averaging 2 km, throughout the continental US, at multiple time horizons (3 hr, 18 hr, 10 days) every hour, and produces a 30-day ensemble forecast less frequently. Many applications can benefit from detailed polyline and polygon geometries. While ugrid could also be used, the simple geometries approach presented is simpler to implement.

Regards,
David Arctur

Chris Barker

2016-09-22 16:00:21 UTC

Sorry, not enough time to really read this all carefully, but a couple of comments:

Post by Jonathan Gregory
If the regions were irregular
polygons in latitude and longitude, nv would be the number of vertices and the
lat and lon bounds would trace the outline of the polygon e.g. nv=3, lat=0,90,0
and lon=0,0,90 describes the eighth of the sphere which is bounded by the
meridians at 0E and 90E and the Equator. I think, therefore, we do not need an
additional convention for points or polygonal regions.

this seems fine for this simple example, but burying a bunch of coordinates
of a complex polygon in a text string in an attribute is really not a good
idea -- the coordinates of a polygon should be in the array data one way or
another, rather than having to parse out attribute strings.

* I suspect that geometries of this kind can be described by the ugrid

I'm not so sure -- UGRID is about defining a bunch of polygons that all
share vertices, and are all of the same order (usually all triangles, or
quads, or maybe hexes). If they are a mixture, you still store the full set
(say, six vertices), while marking some as unused. But it's not that well
set up for a bunch of polygons of different order.

Not too bad if there are only one or two complex polygons, but it would be
a bit weird -- you'd have vertices and boundaries, but no faces. And you'd
lose the order of the vertices (though that could probably be added to the
UGRID standard).

Post by Jonathan Gregory
whereas in your case you'd give a
single value for the whole set, but the description of the geometry itself
might be similar. Have you had a look at whether ugrid could meet your needs?
If it almost does so, perhaps a better thing to do would be to propose
additions to ugrid. We would like to avoid having more than one way to describe
such geometries.

Ben has been involved in UGRID, so I'm sure he's thought this out. For my
part, I think it's really a different problem, though it would be nice if
it were as similar to UGRID as possible.

* So far CF does not say anything about the use of netCDF-4 features (i.e.

Maybe it's time to embrace netcdf4? It's been a while! Though maybe for CF
2.* -- any movement on that?

Post by Jonathan Gregory
If there are two ways to do something, software has to support both of them. We
already have ways to encode ragged arrays, so is there a compelling case for
needing the netCDF-4 vlen array as well?

I think the ragged array option is fine -- though I haven't looked at vlen
arrays enough to know if they offer a compelling alternative. One issue is
that the programming environments that we use to work with the data may not
have an equivalent of vlen arrays.

Post by Jonathan Gregory
* Similarly, you propose attributes for clockwise/anticlockwise node order and
for the polygon closure convention.

This should match the OGC conventions as much as is practical.

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov

Ethan Davis

2016-09-22 18:17:34 UTC

Hi all,

Just a quick note on Chris' CF 2 question (until I have a bit more time to
think more fully on this discussion) ...

The EC netCDF-CF project that Ben mentioned is working on a number of CF
extension efforts that are looking to use features of the netCDF enhanced
data model. Those efforts will all target CF 2 rather than CF 1.x. However,
as with the Simple Geometries, we should also expect suggestions for
changes to CF 1.x spinning out of these efforts.

The CF-2 discussion has been pretty quiet for a while now. However, I expect
it will be more active as these various CF extension efforts start seeking
more community input and making proposals.

Cheers,

Ethan

Jonathan Gregory

2016-09-22 16:26:26 UTC

Dear Chris

Post by Chris Barker

Post by Jonathan Gregory
If the regions were irregular
polygons in latitude and longitude, nv would be the number of vertices and
the lat and lon bounds would trace the outline of the polygon e.g. nv=3,
lat=0,90,0 and lon=0,0,90 describes the eighth of the sphere which is bounded
by the meridians at 0E and 90E and the Equator. I think, therefore, we do not
need an additional convention for points or polygonal regions.

To avoid confusion:

I didn't suggest parsing attribute strings. The same numbers that Ben would put
in his x and y auxiliary coordinate variables for a single polygon can appear
in coordinate bounds variables according to the existing convention.

Post by Chris Barker
* I suspect that geometries of this kind can be described by the ugrid

I'm not so sure -- UGRID is about defining a bunch of polygons that all
share vertices, and are all of the same order (usually all triangles, or
quads, or maybe hexes). If they are a mixture, you still store the full set
(say, six vertices), while marking some as unused. But it's not that well
set up for a bunch of polygons of different order.
Not too bad if there are only one or two complex polygons, but it would be
a bit weird -- you'd have vertices and boundaries, but no faces. And you'd
lose the order of the vertices (though that could probably be added to the
UGRID standard)

OK. I didn't investigate this, but it would be good to know about it. If
ugrid can do something like this, but not all of it, maybe ugrid could be
extended. If ugrid seems too complicated for these cases, maybe a "light"
version of ugrid could be proposed for them. I think we should avoid having
two partially overlapping conventions.

Post by Chris Barker
* So far CF does not say anything about the use of netCDF-4 features (i.e.

Maybe it's time to embrace netcdf4? It's been a while! Though maybe for CF
2.* -- any movement on that?

I think, as we generally do, that we should adopt netCDF-4 features if there
is a definite need to do so. I mean something you can't do with an existing
mechanism, or which is done so much more easily with a new mechanism that it
justifies the extra effort of requiring alternatives to be programmed in
software. I'm not arguing against it in general, but I think it has to be
argued for each specific need within the convention.

CF2 is not well-defined. I have to admit to being nervous about that. I am
very much opposed to an idea of "starting all over again" and maintaining
two conventions in parallel (since old data would continue to exist for a long
time and so the old CF would have to be supported), and I also think
backwards-incompatibility has to be strongly justified. So I favour
step-by-step evolution.
Another idea we've discussed, which I'm comfortable with, is of defining
"strict" compliance to the convention, which a data-writer could optionally
adhere to. This could exclude older features we wanted to deprecate. However
this is really not the subject of the discussion - it's another thread.

Post by Chris Barker
I think the ragged array option is fine -- though I haven't looked at vlen
arrays enough to know if they offer a compelling alternative. One issue is
that the programming environments that we use to work with the data may not
have an equivalent of vlen arrays.

That's a good point, and a reason why we have to be cautious in general about
adopting netCDF-4 features.

Best wishes

Jonathan

Chris Barker

2016-09-22 18:11:17 UTC

Permalink

Post by Jonathan Gregory
I didn't suggest parsing attribute strings. The same numbers that Ben would put
in his x and y auxiliary coordinate variables for a single polygon can appear
in coordinate bounds variables according to the existing convention.

OK then, sorry for the confusion, probably me reading it too fast...

Post by Jonathan Gregory
OK. I didn't investigate this, but it would be good to know about it. If
ugrid can do something like this, but not all of it, maybe ugrid could be
extended.

sure.

Post by Jonathan Gregory
If ugrid seems too complicated for these cases, maybe a "light"
version of ugrid could be proposed for them. I think we should avoid having
two partially overlapping conventions.

I agree -- but it seems like these are really different use cases to me --
sure there are similarities, but a different enough focus that a different
standard may make sense -- though hopefully UGRID can inform the "new" one,
so as to not have different ways to accomplish the parts that are the same.

Post by Jonathan Gregory
CF2 is not well-defined.

I thought it wasn't defined at all. But I think we all share your concerns
about that.

-CHB
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov

Bert Jagers

2016-09-26 11:44:37 UTC

Permalink

Dear Jonathan, Chris, and everybody,

Concerning the reference to UGRID:

Yes, there are similarities between UGRID and the Simple Geometries, but there are differences as well.
We discussed merging or diverging these two proposals before Ben sent his mail to this group, but it's good to hear the opinion of the broader community on this topic.

1) As Chris indicates, UGRID is about defining a mesh or grid, i.e. a bunch of polygons that all share vertices. These polygons tend to be of similar order (usually 3 or 4 nodes per polygon, but 5 or 6 or even more might also occur). Since the variation in the number of nodes is usually limited, we have so far accepted that the full matrix is stored with missing values for entries that aren't needed for defining a polygon with fewer than the maximum number of nodes. I know that there are numerical models that use mostly triangles but accept complex elements with over 1000 nodes, and for such cases the face_node_connectivity matrix becomes awfully sparse.

2) UGRID uses an indirection layer because nodes are shared, and the fact that nodes are shared is an important element of the mesh topology/connectivity. So UGRID primarily defines node coordinates, and defines edges (i.e. lines) and faces (i.e. polygons) by referring to the node indices. The current Simple Geometries proposal inherits the indirection layer from UGRID, but it is not immediately clear that this indirection is useful. If we think of an arbitrary set of polygons, then nodes will rarely be shared; however, in the case of a coverage of the continent by complex hydrological catchments, one may expect that almost all interior nodes will be shared by two or more catchment polygons.

3) If the polygons form the basis of a hydrological model and connectivity between these complex faces is important, then users may prefer to store the polygons using UGRID, and then UGRID may have to support faces with holes and multi-part faces as well. It's a useful extension which we have considered for the results of our water quality model, which tends to run on an aggregated version of the hydrodynamic mesh. In this case the hydrodynamic mesh is composed of simple polygons (typically 3-6 nodes), while the water quality data is defined on complex aggregates thereof (20-node polygons), possibly with holes; multiple parts are rare (although the shared edges between such complex polygons may be composed of multiple (disconnected) edges). By keeping Simple Geometries aligned with UGRID one could make the full list of features available to numerical modellers. Since the indirection is an important part of UGRID, this would for consistency be included in Simple Geometries as well, although it adds little value and looks like overhead as indicated by Jonathan.

4) The other alternative, as given by Jonathan, would be to link this to the bounds definition. I would like to remark that the current Cell Boundaries section doesn't say anything about storing cells with a varying number of nodes. In UGRID we have assumed that the missing value can be used to fill up the arrays of small polygons, but this isn't listed in the CF conventions as an option. Anyway, this option looks more natural if the sharing of nodes is rare or not important. In this case one loses the option to mark the outer and inner boundaries using flags, but there are alternative methods as already proposed (in that case we would need to check how the inside/outside flag variable is referenced from the bounds). However, this would not solve the need for such features in UGRID, and by diverging from UGRID's indirection the solution couldn't be reused inside UGRID as needed.

5) Besides inventing our own storage format (either in line with UGRID or CF), a third way was discussed, namely to store the simple geometry shapes as ASCII or binary blobs in an extended format NetCDF 4 file. Since there are good starting points within UGRID and CF for storing polygons, we haven't really considered this third option yet, since it would be less easily readable without using GIS libraries like GDAL. However, the main strength of this approach would be that other standard simple features such as circles, donuts, and circular arcs would be automatically covered consistently by this method.

6) Actually, I have a related use case for storing a network of polylines (rather than straight edges) in UGRID compatible format for 1D hydraulic models. In this case I need to store polylines representing river branches: their connectivity at bifurcations and confluences is important, but so is their overall length - river chainage - and hence I can't split them into the base edges. This network defines basically the 1D coordinate system to be used by the actual 1D (UGRID) simulation mesh which will be defined on top of this channel network. Because of the link to 1D numerical modelling, this will be discussed in a separate thread in the UGRID community first.

Best regards,

Bert


David Blodgett

2016-09-26 15:37:55 UTC

Permalink

Dear Jonathan, Chris, Bert and others,

Thanks for the responses. This is all very thoughtful and good discussion.

Ben, Tim and I are preparing a more thorough response than any one of us could prepare on our own and will share it with the list soon.

Regards,

- Dave Blodgett


Ben Koziol - NOAA Affiliate

2016-09-27 17:28:36 UTC

Permalink

Jonathan and CF-Metadata List,

Thanks for the suggestions and discussion. We've attempted to respond to
the major questions and concerns using Jonathan's mail as a template.
Apologies in advance if we missed anything outstanding or did not
appropriately acknowledge contributions in this thread.

Post by Jonathan Gregory
You explain that the need is to specify spatial coordinates with a simple
geometry for a timeSeries variable. For example, this could be for the
discharge as a function of time across some line in a river (your example),
or I suppose it could be an average temperature as a function of time for
the Atlantic Ocean, where you wanted to supply the polygon which drew the
outline of the basin. Have I got the idea?

Yes, you have this mostly right. It's common to have a collection of points
(weather stations), lines (stream reaches), or polygons (hydrologic
catchments) with an associated time series.

Post by Jonathan Gregory
Timeseries like this can be stored in CF, but their geographical extent is
usually described only in words e.g. a region name of atlantic_ocean, and
this is fine for applications like CMIP where you want to compare data from
different data sources in which the Atlantic Ocean may have different exact
shapes (different AOGCMs, in particular). An array of region names is also
possible, so I don't think we need a new convention to contain your dwarf
planet example.

The dwarf planet example is intended to describe our generalized approach
to continuous ragged arrays that may be used for arbitrarily-sized data
arrays. For some (including me), using a string instead of a numeric
example helps illustrate the concept. It is an idiosyncratic example in
many ways. Sorry for the confusion.

Post by Jonathan Gregory
Sect 9.1 on discrete sampling geometries says it cannot yet be used for
cases "where geo-positioning cannot be described as a discrete point
location. Problematic examples include time series that refer to a
geographical region (e.g. the northern hemisphere) ...". Actually I think
that's not quite right. The existing convention *can* describe regions
which are contiguous, and rectangular or polygonal, using its usual bounds
convention (Sect 7.1). I think we should consider changing this text,
because it seems unnecessarily restrictive.

Your explanation makes sense, and this should be captured in the DSG
convention text.

Post by Jonathan Gregory
If the regions were irregular polygons in latitude and longitude, nv would
be the number of vertices and the lat and lon bounds would trace the
outline of the polygon e.g. nv=3, lat=0,90,0 and lon=0,0,90 describes the
eighth of the sphere which is bounded by the meridians at 0E and 90E and
the Equator. I think, therefore, we do not need an additional convention
for points or polygonal regions.

Many earth science datasets (excluding triangular, hexagonal, etc. meshes)
representable as polygons and lines have differing node counts. "nv" could
not efficiently capture watershed A with 5 nodes and watershed B with 100.
Additionally, the cell bounds concept does not include the structure and
semantics needed to support MultiLines, MultiPolygons, or polygons with
holes/interiors.
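
To put rough numbers on the node-count argument, the following sketch compares a padded layout (every polygon allotted nv slots) with a ragged layout (vertices stored back to back plus one count per polygon). The two node counts come from the watershed example above; the array contents and variable names are placeholders invented for the illustration.

```python
import numpy as np

# Node counts from the example in the text: watershed A and watershed B.
node_counts = np.array([5, 100])

# Padded layout: every polygon is allotted nv = max(node_counts) slots.
nv = int(node_counts.max())
padded_slots = node_counts.size * nv
used_slots = int(node_counts.sum())
wasted_fraction = 1 - used_slots / padded_slots

# Ragged layout: vertices stored back to back, plus one count per polygon.
x_flat = np.arange(used_slots, dtype=float)   # stand-in coordinate values
starts = np.concatenate(([0], np.cumsum(node_counts)[:-1]))
polygons = [x_flat[s:s + n] for s, n in zip(starts, node_counts)]

print(padded_slots, used_slots, round(wasted_fraction, 3))
```

With only these two polygons nearly half of the padded slots hold fill values, and the gap grows as the spread in node counts widens.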

Post by Jonathan Gregory
However, we would need new conventions for a timeseries where each value
applies to a set of discontiguous regions or regions with holes in them, a
set of points, a line or a set of lines. I guess that these are included in
the geometry types you list (LineString, Multipoint, MultiLineString, and
MultiPolygon).

Yes.

Post by Jonathan Gregory
Do you have definite use-cases for all of these? (I ask this because we
don't add new functionality to CF until there is a definite and common need
for it in practice.)

David Arctur described the primary motivation for developing the simple
geometries approach: "Among other applications, NetCDF-CF is now being used
as an intermediate & output data format in the US National Weather
Service's National Water Model (NWM). This forecasts streamflow rates in
about 2.7 million stream segments averaging 2km, throughout the continental
US, at multiple time horizons (3 hr, 18 hr, 10 days) every hour, and an
ensemble for 30-day forecast less frequently." These data also contain
multi-geometries primarily in the form of MultiLineStrings and
MultiPolygons.

To this we would add that working with GIS datasets of this magnitude is
difficult with current NetCDF metadata conventions, often yielding an
unwieldy hybrid of NetCDF data and other software such as ESRI ArcGIS and
PostGIS. ESRI ArcGIS and PostGIS are not usable on many HPC platforms where
models like the NWM reside.

Post by Jonathan Gregory
I suspect that geometries of this kind can be described by the ugrid
convention http://ugrid-conventions.github.io/ugrid-conventions, which is
compliant with CF. Their purpose is to describe a set of connected points,
edges or faces at which values are given, whereas in your case you'd give a
single value for the whole set, but the description of the geometry itself
might be similar. Have you had a look at whether ugrid could meet your
needs? If it almost does so, perhaps a better thing to do would be to
propose additions to ugrid. We would like to avoid having more than one way
to describe such geometries.

Bert Jagers and Chris Barker have already commented on this. It is
important to note that UGRID is the *primary* inspiration behind this
proposed approach. That should have been mentioned in the original mail.
The genesis of this work was with full knowledge of UGRID.

This proposed CF addition is meant to align more closely with the community
standards behind the GIS feature types used by the OGC community. To
accommodate the feature types described by this proposal, UGRID would need
to incorporate:

1. Ragged arrays for coordinate index vectors.
2. Encoding method for multi-geometries.
3. Support for point geometries.

The simple features proposal does not expect node sharing amongst
adjacent/contiguous elements and, in all fairness, this is not a
requirement of UGRID but rather a recommendation. The simple features
approach does inherit from UGRID as Bert indicated in that it is possible
to implement node sharing via coordinate index indirection.
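
A small sketch of what node sharing via index indirection could look like for two adjacent polygons. The two squares, the helper build_index and its return layout are all invented for this illustration and do not reproduce the proposal's actual variables.

```python
import numpy as np

# Two unit squares sharing the edge x = 1 (hypothetical data).
square_a = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_b = [(1, 0), (2, 0), (2, 1), (1, 1)]

def build_index(polygons):
    """Store each distinct node once and describe polygons by node index."""
    node_ids = {}      # (x, y) -> position in the node coordinate arrays
    index_lists = []
    for ring in polygons:
        index_lists.append(
            [node_ids.setdefault(pt, len(node_ids)) for pt in ring]
        )
    xs = np.array([p[0] for p in node_ids], dtype=float)
    ys = np.array([p[1] for p in node_ids], dtype=float)
    return xs, ys, index_lists

xs, ys, index_lists = build_index([square_a, square_b])
print(len(xs), index_lists)
```

Eight vertices collapse to six stored nodes here; when polygons rarely touch, the index array is pure overhead, which is the trade-off discussed in this thread.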

We agree with David Arctur that the simple features approach is easier to
implement than UGRID. No offense intended to UGRID which is a powerful
convention indeed.

It really is up to the community if they would rather see simple features
represented in an amended CF-compatible UGRID or an addition to CF. We are
of the opinion that a simple features specification would be very useful.

Post by Jonathan Gregory
So far CF does not say anything about the use of netCDF-4 features (i.e.
not the classic model). We have often discussed allowing them but the
general argument is also made that there has to be a compelling case for
providing a new way to do something which can already be done. (Steve
Hankin often made this argument, but since he's mostly retired I'll make it
now in his name :-) If there are two ways to do something, software has to
support both of them. We already have ways to encode ragged arrays, so is
there a compelling case for needing the netCDF-4 vlen array as well? We
already have a way to encode strings too, as character arrays. I think this
is probably a discussion we should have again in a different thread, so
I'll just talk about your classic encoding. The same points apply to both
encodings.

Yes, let's leave that conversation for another time. We mostly want to be
forward compatible, understanding that vlen provides a simpler and, some
would say, more elegant way of handling ragged array data.

Post by Jonathan Gregory
Your approach uses a coordinate_index variable to identify indices of
geometry coordinates where the -1 and -2 indices indicates where exterior
and interior polygons begin, and the first polygon has an implied -1 at the
start. Is that right? Given this example, I wonder why you need the index
array, because none of the coordinates indices (values >=0) is repeated, so
no space is saved in the x and y arrays. I guess this would be the usual
case. If polygons did touch or lines crossed, a few points would be in
common, but not so many that seems to need the complication of the index
array. A simpler way to do it would be ... which needs only one dimension,
or you could use the CF ragged array convention (Sect 9.3.3)...

Our example may not be complete enough to fully demonstrate the use case we
are trying to describe. The example given, inspired by the DSG Continuous
Ragged Array encoding, uses a 'stop' variable rather than a âcountâ
variable. It may not be apparent that each âsimple featureâ may actually be
multiple polygons (with or without hole polygons) or lines. Regarding the
âoutside_insideâ example you provided, we should show an example where the
geometry count (dimension) is more than 1 and a geometry has multiple
polygons prior to the 'stop' coordinate. The word encoding example was
meant to convey this, but may not have been sufficient. Here is an example
with three geometries:
https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example
.
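
As an illustration of the 'stop' versus 'count' distinction, here is a
minimal Python sketch. The index values, the three-geometry layout, and the
helper names slice_by_stop and slice_by_count are assumptions made up for
this note and are not taken from the wiki example. Stop indices are treated
as exclusive, which is what the stop values in the original post suggest.

```python
# Hypothetical contiguous ragged array holding three geometries:
#   geometry 0: two polygons separated by a multipart break (-1)
#   geometry 1: a single polygon
#   geometry 2: a polygon with one hole, flagged by a hole break (-2)
coordinate_index = [
    0, 1, 2, 3, -1, 4, 5, 6, 7,
    8, 9, 10, 11,
    12, 13, 14, 15, -2, 16, 17, 18, 19,
]
stops = [9, 13, 22]   # exclusive stop index of each geometry
counts = [9, 4, 9]    # the equivalent CRA-style counts


def slice_by_stop(values, stops, i):
    """Return geometry i using stop indices: two lookups, no summation."""
    start = stops[i - 1] if i > 0 else 0
    return values[start:stops[i]]


def slice_by_count(values, counts, i):
    """Return geometry i using counts: all preceding counts are summed."""
    start = sum(counts[:i])
    return values[start:start + counts[i]]


if __name__ == "__main__":
    for i in range(len(stops)):
        print(i, slice_by_stop(coordinate_index, stops, i))
```

Both helpers return the same slices. The difference is that the stop form
locates geometry i directly, while the count form has to sum every count
that precedes it.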

We are hesitant to add an additional integer variable to indicate
'inside_outside' as it will introduce (in our minds) extraneous, duplicated
value variables. Why repeat -1 5,000 times when introducing a -1 at
multi-geometry breaks accomplishes the same task? One could also argue for
additional variables containing multi-geometry breaks, but again, this is
extraneous. As an example, using break values and ragged arrays similar to
what we describe, the 2.7 million catchment dataset mentioned by Dave
Arctur (which contains MultiPolygons) results in a ~10 GB uncompressed,
netCDF-4 file. Adding 'inside_outside' variables to describe the breaks
and/or holes will make this file larger. We could reduce the file size by
removing repeated nodes via the coordinate indexing method.
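
To illustrate how break values carry the part and hole structure without a
per-node flag variable, here is a minimal Python sketch that decodes the
coordinate_index sequence from the multipolygon example in the original
post. The function name decode_parts and the nested-list output layout are
assumptions for illustration and are not part of the proposal.

```python
def decode_parts(coordinate_index, multipart_break=-1, hole_break=-2):
    """Split one feature's index sequence into parts and rings.

    Returns a list of parts; each part is a list of rings, where the
    first ring is the exterior and any further rings are holes.
    """
    parts = [[[]]]  # first part, first ring: no leading break value
    for value in coordinate_index:
        if value == multipart_break:
            parts.append([[]])      # start a new part with a new exterior
        elif value == hole_break:
            parts[-1].append([])    # start a new hole in the current part
        else:
            parts[-1][-1].append(value)
    return parts


# Index sequence copied from the multipolygon example in the original post.
coordinate_index = [
    0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2, 13, 14, 15, 16,
    -1, 17, 18, 19, 20, -1, 21, 22, 23, 24,
]

parts = decode_parts(coordinate_index)
n_breaks = sum(1 for v in coordinate_index if v < 0)
n_nodes = sum(1 for v in coordinate_index if v >= 0)

if __name__ == "__main__":
    for number, rings in enumerate(parts):
        print("part", number, "exterior", rings[0], "holes", rings[1:])
    print(n_breaks, "break values for", n_nodes, "node indices")
```

In this example the structure is carried by 5 break values, whereas a
per-node flag variable would need one entry for each of the 25 nodes.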

Post by Jonathan Gregory
You provide the attributes multipart_break_value and hole_break_value to
specify the values (-1 and -2 above) for the outside vs inside
distinction. Do you need the generality of being able to choose these
values? It would seem simpler to use a character array and specify in the
convention which letters should be used e.g. ... That makes it more
readable, perhaps.

Those values could be fixed. We would recommend they always be appended to
the variable as attributes, however. We also tend to think of them as fill
values which are customizable in CF. With regard to the character array, it
again seems like a lot of repetition.

Post by Jonathan Gregory
Similarly, you propose attributes for clockwise/anticlockwise node order
and for the polygon closure convention. Do these need to be freely
choosable? You could specify clockwise, like the existing CF bounds
convention, and that the polygons are closed. In the latter case, you could
omit the last vertex of each polygon since it must be the same as the
first, and that would save a bit of space. If you specify these choices,
the attributes aren't needed.

These attributes would be considered optional. If controls are in place for
ordering, it may be specified on the polygon variables. Many GIS software
packages don't care about ordering and repeated nodes, but modelling and
regridding codes tend to be more picky.
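
For codes that do care about ordering and closure, the checks are cheap. The
following is a minimal Python sketch using the shoelace formula on two rings
from the multipolygon example in the original post. The helper names
signed_area, ring_order, and close_ring are hypothetical, and planar
(non-spherical) coordinates are assumed.

```python
def signed_area(ring):
    """Shoelace formula: positive for anticlockwise rings, negative for clockwise.

    Works for closed rings (last node equals first) and open rings alike,
    because the wrap-around term is zero when the ring is already closed.
    """
    total = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def ring_order(ring):
    """Classify node order; a degenerate zero-area ring is reported as clockwise."""
    return "anticlockwise" if signed_area(ring) > 0 else "clockwise"


def close_ring(ring):
    """Append the first node if the ring was stored without its closing node."""
    if ring and ring[0] != ring[-1]:
        return ring + [ring[0]]
    return list(ring)


# Exterior ring and first hole from the WKT in the original post.
exterior = [(0, 0), (20, 0), (20, 20), (0, 20), (0, 0)]
hole = [(1, 1), (10, 5), (19, 1), (1, 1)]

if __name__ == "__main__":
    print(ring_order(exterior), signed_area(exterior))
    print(ring_order(hole), signed_area(hole))
    print(close_ring(exterior[:-1]) == exterior)
```

A reader could run a check like this when the ordering or closure
attributes are absent, rather than relying on a fixed convention.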

Post by Jonathan Gregory
If this convention is going to be used for discrete sampling geometries, an
additional dimension is needed, because in a single data variable you might
have data for several of these geometries. That is, you need an array of
ragged arrays. Again, I wonder whether this suggests trying to use ugrid.
It might be you could name each one as a mesh, and specify the geometry
for the set of timeSeries as an array of mesh names. That would be a very
easy change to the existing Sect 9.

The multiple geometry example may help:
https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/VLEN-Arrays-in-NetCDF-3#multipolygon-example
.

Again, thanks all for the feedback and looking forward to continued
discussion.

--
Ben Koziol
NESII/CIRES/NOAA Earth System Research Laboratory
***@noaa.gov
802.392.4522
http://www.esrl.noaa.gov/nesii/

Chris Barker

2016-09-27 21:52:41 UTC

Thanks for all the great input, Bert.

Post by Bert Jagers
5) Besides inventing our own storage format (either in line with UGRID or
CF), a third way was discussed namely: to store the simple geometry shapes
as ascii or binary blobs in an extended format NetCDF 4 file.

I think binary blobs are a really bad idea (and what would be the format of
those blobs? Shapefiles? Or maybe Well-Known Binary?).

But Well-Known Text or GeoJSON might be reasonable.

I'd still rather have it done "properly" with netCDF arrays.

-Chris
--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

***@noaa.gov

Ryan May

2016-09-27 22:06:11 UTC

Post by Chris Barker
Thanks for all the great input, Bert.

I think binary blobs are a really bad idea (and what would be the format of
those blobs? Shapefiles? Or maybe Well-Known Binary?).

I agree--it sounds almost absurd to take a file format whose claim to fame
is being self-describing, and use it to store data in a format that is no
longer self-describing. I also lean (though less strongly) towards applying
this same argument to well-known text.

Ryan

--
Ryan May, Ph.D.
Software Engineer
UCAR/Unidata
Boulder, CO

Jonathan Gregory

2016-09-23 14:12:03 UTC

Dear Ethan

Post by Ethan Davis
The EC netCDF-CF project that Ben mentioned is working on a number of CF
extension efforts that are looking to use features of the netCDF enhanced
data model. Those efforts will all target CF 2 rather than CF 1.x. However,
as with the Simple Geometries, we should also expect suggestions for
changes to CF 1.x spinning out of these efforts.
The CF-2 discussion has been pretty quiet for a while now. However, I expect
it will be more active as these various CF extension efforts start seeking
more community input and making proposals.

This implies forking CF. I don't see the need to do that. If we did make a
major backward-incompatible change to CF, I agree that it would be logical to
call it CF2.x, and subsequent development would be based on it, but I don't see
why EarthCube proposals like Ben's shouldn't be accommodated in CF1.x.

Best wishes

Jonathan

Hedley, Mark

2016-09-26 10:03:36 UTC

Hello Ben

I think this is fascinating and fantastic work which is likely to prove very useful for a range of domains.

Post by Ben Koziol - NOAA Affiliate
Questions for the CF Community
1. Are our VLEN netCDF-3 and netCDF-4 approaches acceptable? What changes would you recommend?
2. Are the geometry types point, line, polygon, and their multipart equivalents sufficient for the community?

but I do think these are really valuable areas to get feedback on.

all the best
mark
