[Perldl] a non-PDL-specific issue #2: dimensions lack any semantic information

Ivan Shmakov ivan at main.uusia.org
Thu Feb 3 11:56:25 HST 2011


	0.  Introduction
	1.  A trivial example
	2.  Typical solutions
	3.  PDL vs. NetCDF Operators
	4.  Conclusion

    0.  Introduction

	The second problem I'd like to discuss is different from the
	first in that there /are/ known solutions for some particular
	cases, yet there isn't, to the best of my knowledge, a single
	approach, generic enough to become a part of a generic array
	processor, such as PDL.

	This problem is of a greater perceived significance, as it may
	easily lead to increased development (debugging) time.  Also,
	since this problem is, at least in part, solved by other
	software, it readily makes PDL seem inferior to such software.

	The problem is that the dimensions of a PDL variable lack any
	information whatsoever on the /meaning/ of the indices.

    1.  A trivial example

	Consider, e. g., that there are two PDL variables, $t1 and $t2,
	which contain series of regularly-sampled temperatures at two
	distinct locations, which we've loaded from some data file or
	files.  Consider also that we, for some reason, need to
	compute the difference between the temperatures sampled at the
	corresponding moments of time.  Can it be as simple as, say, the
	following?

    my $tdiff = $t1 - $t2;

	Unfortunately, it can't, as we're yet to be sure that the
	corresponding elements of $t1 and $t2 were sampled at the same
	time.  IOW, we're yet to be sure that the /mapping/ of indices
	to the values of a /physical quantity/ (time) is the same for
	both of the variables.

    2.  Typical solutions

	How is this problem typically solved?  First of all, we need a
	way to encode the mapping.  In the most common case, the mapping
	is assumed to be linear, and thus can be defined as a pair of
	scalars: the step, and the offset.  Only having ensured that
	both are the same for both of the variables, we can proceed with
	the computation.  Otherwise, we may choose to use subsampling or
	interpolation in order to get the mappings to match each other.
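
	The check described above can be sketched in plain Perl.  This
	is only an illustration under stated assumptions: make_series
	and series_diff are hypothetical helpers, not part of PDL, and
	ordinary arrays stand in for PDL variables.  The mapping is
	taken to be time(i) = offset + step * i.

```perl
use strict;
use warnings;

# Hypothetical container: samples plus a linear index-to-time mapping,
# time(i) = offset + step * i.  Plain arrays stand in for PDL variables.
sub make_series {
    my ($offset, $step, @values) = @_;
    return { offset => $offset, step => $step, values => [@values] };
}

# Subtract element-wise, but only after checking that both series share
# the same mapping and length; otherwise refuse rather than guess.
# (Exact comparison is used here; real data may need a tolerance.)
sub series_diff {
    my ($s1, $s2) = @_;
    die "offsets differ\n" unless $s1->{offset} == $s2->{offset};
    die "steps differ\n"   unless $s1->{step}   == $s2->{step};
    die "lengths differ\n"
        unless @{ $s1->{values} } == @{ $s2->{values} };
    my @diff = map { $s1->{values}[$_] - $s2->{values}[$_] }
               0 .. $#{ $s1->{values} };
    return make_series($s1->{offset}, $s1->{step}, @diff);
}

my $t1    = make_series(0, 60, 10, 12, 14);
my $t2    = make_series(0, 60, 9, 9, 10);
my $tdiff = series_diff($t1, $t2);
print join(' ', @{ $tdiff->{values} }), "\n";    # prints "1 3 4"
```

	Refusing to compute on a mismatch is just one possible policy;
	a fuller version could resample or interpolate instead.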

	For multi-dimensional data, the order of indices also becomes
	significant.  Software differs in how it addresses this
	problem.  In particular, the raster engine of
	the GRASS GIS assumes that, roughly speaking, the minor
	dimension corresponds to the west to east direction, while the
	major one corresponds to the south to north direction.  No other
	number of dimensions but two is allowed.  (Such a solution is
	clearly /not/ for a generic array processor, like PDL.)

	Many major image formats employ more or less the same solution,
	by requiring, e. g., that the inner dimension corresponds to the
	primary color (red, green or blue), the middle dimension
	corresponds to the left to right direction, and the major one
	corresponds to the top to bottom direction.  (TIFF is among the
	notable exceptions, as it allows different layouts.)

	NetCDF, a prominent multi-dimensional data format, allows the
	individual dimensions to be explicitly named.  The software
	processing NetCDF files may then choose to /orient/ (i. e.,
	permute the dimensions of) the variables involved in a
	computation so that their dimensions having the same name will
	have matching positions in the list of indices.
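
	The orientation step amounts to computing a permutation from
	the dimension names.  A minimal sketch in plain Perl follows;
	orient_permutation is a hypothetical helper, and the dimension
	names are made up for the example.

```perl
use strict;
use warnings;

# Given the dimension names a variable currently has (@$have) and the
# order we want (@$want), return the permutation p such that new
# dimension i is old dimension p[i].
sub orient_permutation {
    my ($have, $want) = @_;
    die "dimension counts differ\n" unless @$have == @$want;
    my %pos;
    @pos{@$have} = 0 .. $#$have;    # name => current position
    my @perm;
    for my $name (@$want) {
        die "no dimension named '$name'\n" unless exists $pos{$name};
        push @perm, $pos{$name};
    }
    return @perm;
}

my @perm = orient_permutation([qw(lon lat time)], [qw(time lat lon)]);
print "@perm\n";    # prints "2 1 0"
```

	With PDL itself, such a list looks suitable for passing to the
	reorder method, though that pairing is an assumption and is not
	exercised here.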

	Some NetCDF-related materials mention the concept of a
	/coordinate variable/: a one-dimensional variable associated
	with the named dimension, which holds the values of some
	physical quantity corresponding to the whole range of the index
	values.  This feature allows for completely arbitrary mappings.
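
	With coordinate variables, the temperature difference from the
	trivial example can be computed by matching coordinate values
	rather than index positions.  The following is a sketch only:
	diff_by_coordinate is a hypothetical helper, plain arrays stand
	in for PDL variables, and times are assumed to be exact
	(e. g., integer) values so that they can serve as hash keys.

```perl
use strict;
use warnings;

# Each series pairs a coordinate variable (times) with its data values.
# Subtract only where both series have a sample at the same time.
sub diff_by_coordinate {
    my ($times1, $vals1, $times2, $vals2) = @_;
    my %at2;
    @at2{@$times2} = @$vals2;       # time => value of the second series
    my (@times, @diff);
    for my $i (0 .. $#$times1) {
        my $t = $times1->[$i];
        next unless exists $at2{$t};
        push @times, $t;
        push @diff,  $vals1->[$i] - $at2{$t};
    }
    return (\@times, \@diff);       # result carries its own coordinate
}

my ($times, $diff) = diff_by_coordinate(
    [0, 60, 120, 180], [10, 12, 14, 15],
    [60, 120, 240],    [9, 10, 11],
);
print join(' ', @$times), ' : ', join(' ', @$diff), "\n";
# prints "60 120 : 3 4"
```

	Dropping unmatched samples is one choice among several;
	interpolating onto a common coordinate would be another.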

    3.  PDL vs. NetCDF Operators

	The NetCDF Operators (NCO) implement the support for the named
	dimensions feature of NetCDF.  (And also for the NetCDF Climate
	and Forecast (CF) Metadata Conventions, which I'm not yet
	familiar with.)

	Thus, e. g., the user invoking the following command may be sure
	that the right thing is done, irrespective of the internal
	layout of the multi-dimensional data that inhabits the source
	files:
$ ncbo --op_typ=sub data1.nc data2.nc difference.nc 

	A mere convenience?  It is more than that, for both the
	developers and the data providers.

	For the former, this behavior means that the software based on
	the semantically-aware building blocks like the one above will
	not require modification should a data provider suddenly change
	the internal layout of a dataset.

	For the latter, it, conversely, gives more freedom to change the
	internal layout as it becomes necessary, without any of: giving
	early warnings to the users of the data, providing the data in
	both flavors, or risking the loss of compatibility.

	Unfortunately, reading the contents of a NetCDF variable into a
	PDL variable results in the loss of such semantic information.
	Although this information may be read and tracked separately,
	doing so puts an extra burden on the developer and reduces the
	readability of the code, perhaps to the point where it becomes
	impractical to pursue layout-independence.

	Previously, I've noted that there's a problem with software
	relying on some particular ordering of dimensions in the
	datasets created by some other software: such software is
	limited as to the datasets it can be applied to without
	modification, and it also constrains the data provider to the
	once-created data layout.  In fact, the same reasoning applies
	to the building blocks the software is made from: the functions.

	Thus, as of the current version of PDL, the order of the
	dimensions becomes a part of the function's signature, with all
	the (negative) consequences thereof.
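
	For contrast, here is a sketch of a function whose interface
	refers to a dimension by name rather than by position.
	Everything in it is hypothetical: the tagged-variable hash
	layout, the mean_over helper, and the dimension names.  It is
	restricted to two dimensions to stay short, and nested plain
	arrays stand in for PDL variables.

```perl
use strict;
use warnings;

# Hypothetical tagged 2-D variable: {dims} names the indices of {data},
# outermost first, so $var->{data}[$i][$j] has $i along dims->[0].
# Average over the dimension called $name, wherever it happens to sit.
sub mean_over {
    my ($var, $name) = @_;
    my ($axis) = grep { $var->{dims}[$_] eq $name } 0 .. 1;
    die "no dimension named '$name'\n" unless defined $axis;
    my $rows = $var->{data};
    my ($n0, $n1) = (scalar @$rows, scalar @{ $rows->[0] });
    my @sum;
    for my $i (0 .. $n0 - 1) {
        for my $j (0 .. $n1 - 1) {
            # $k indexes the dimension that is kept (not averaged away).
            my $k = $axis == 0 ? $j : $i;
            $sum[$k] += $rows->[$i][$j];
        }
    }
    my $count = $axis == 0 ? $n0 : $n1;
    return [ map { $_ / $count } @sum ];
}

# The same data in two layouts gives the same answer.
my $v  = { dims => [qw(time station)], data => [[1, 2], [3, 4], [5, 6]] };
my $vt = { dims => [qw(station time)], data => [[1, 3, 5], [2, 4, 6]] };
print join(' ', @{ mean_over($v,  'time') }), "\n";    # prints "3 4"
print join(' ', @{ mean_over($vt, 'time') }), "\n";    # prints "3 4"
```

	The caller no longer needs to know which index is time; only
	the name is part of the function's interface.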

    4.  Conclusion

	Now, my question is: does it seem feasible to add semantic
	information to the PDL dimensions?

	The mere association of coordinate variables of some kind with
	the dimensions of regular PDL variables shouldn't be hard to
	implement.  However, the necessity to maintain this information
	throughout the computation may put some extra burden on the
	implementations of the PDL functions.

	Also, there's the question of how the behavior should be altered
	in the presence of semantically-tagged variables.  E. g., if the
	only dimension of $a is time, and the only dimension of $b is
	power, should $a + $b result in a variable having both of these
	dimensions?  (IOW, should an implicit cross product be


FSF associate member #7257