Linux and the domain specific language

DOTADIW.  Wait…  If you don’t know what ‘DOTADIW’ means, you have virtually no context to move forward.  You might guess it’s an acronym.  You might Google it.  Otherwise, full stop.

DOTADIW means “do one thing and do it well.”  Google it, and you’ll see it referred to as the “Unix philosophy.”  Once you’re well familiar with how one thing is done well, you’re golden.  You’re an expert at that one thing, and DOTADIW is for you, for one thing.

When doing one thing well, there is no interest in doing other things well, nor in how other things are done.  There is no interest in compromising on doing something very well just so that it’s more like other things.  Hence, one thing “done well” is very much unlike other things “done well,” and being an expert at one thing does not help much with other things.

This is the world of Linux executables:  arcane command-line flags which summarize (often to the point of belying) how things work.  This is the truth of regular expressions, of object-relational mapping, and of domain specific languages.  Once you’ve learned the abstraction, you’ve freed yourself from the implementation details.

However, examining the details is often how one learns.  Knowledge of regular expressions does not significantly aid object-relational mapping.  With Linux, a thorough understanding of “make” does not help with “sed” and “awk.”  Hence we see many language-specific implementations of common tasks.  Java and Python developers seldom run and parse “ls” to get a directory listing – it’s simply easier to learn their language’s native implementation.  The difficulty of mastering the details of simple commands (DSLs, if you will) exceeds the difficulty of re-implementing them under the warm blanket of familiarity.
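As a quick illustration (in Go, and purely a sketch of my own – the file and directory names are arbitrary), compare shelling out to “ls” and parsing its text against calling the standard library directly:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// The DSL route: run ls and parse its text output.  Fragile – it assumes
	// ls exists, that no filename contains a newline, and that no flag or
	// locale setting changes the output format.
	if out, err := exec.Command("ls", "-1").Output(); err == nil {
		fmt.Println("via ls:", strings.Split(strings.TrimSpace(string(out)), "\n"))
	}

	// The native route: one documented call, typed results, nothing to parse.
	entries, err := os.ReadDir(".")
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	fmt.Println("via os.ReadDir:", names)
}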

This is, of course, because simple commands often hide immense complexities.  An abstraction may help you or hurt you (consider the variety of regular expressions purporting to parse phone numbers, or ORM’s reputation as the “Vietnam of Computer Science”).  [In truth, there is no guarantee of positive ends, only of summary action.]  Hence, the hiding of complexity – particularly as it pertains to learning domain specific languages – should not be considered a virtue.

This claim applies to “Linux proficiency”: familiarity with a proven, ragtag assemblage of GNU applications that must be mastered individually, not as a group.  There is little shared foundation (beyond knowledge of basic shell features) that makes new commands easier to master.

… and I simply find that interesting.

Downloading RPMs for local installation

Recently I’ve been doing a fair amount of CentOS dev-ops type work, and one of the items that had me stuck was downloading a list of RPMs for later installation on another arbitrary box.

The trick here is that we need to download all dependencies, but only install the ones we originally needed.  It turns out that there’s no way to resolve dependencies from a pile of local RPMs, so using a local repo is the only option.

On the downloading box:

repotrack $(cat rpms.txt)
createrepo .

On the target box:

Copy the download directory (including the generated repodata/) over, then create local.repo pointing at it.  The file:// path below is just a placeholder – use wherever the RPMs actually landed:

[local]
name=local packages
baseurl=file:///path/to/rpms
gpgcheck=0

yum install -c local.repo --disablerepo='*' --enablerepo=local $(cat rpms.txt)

Web applications in Go (golang)

This afternoon I authored a very simple web application in Go.  I’ve developed DLLs and command-line tools with Go in the past, but this was my first real Go web app outside of demos and tutorials.

I wanted to build an application to track my daughter’s first words.  The required persistence is simple: word + date pairs.  You don’t want to repeatedly enter words, so I included a prefix-search JSON web service which dynamically shows the user any previously entered words beginning with the search term: auto-dissuade, if you will.  This feature also acts as a mechanism to review what words were previously stored.

With Go, as with Java or Node.js, it’s easy to find and use an embedded web server.  This sets it apart from .NET, where a reasonable embedded web server option has existed only recently.  However, what genuinely makes Go stand apart is the application packaging.  Idiomatic Go strongly favors static compilation into a single .EXE file.  Consequently (static linking and copyleft licenses mix poorly), most Go libraries are distributed under business-friendly BSD / MIT style licenses.

What struck me, however, was how thoroughly the option of having a single .EXE appeals to me.  Although Go has a built-in templates library for loading HTML from external files, I opted instead to embed my HTML within the .EXE.  I further sought a fast, embedded data store – settling on LevelDB.
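A minimal sketch of what such an app can look like – this isn’t my original source, and the route names, file names, and the syndtr/goleveldb bindings are assumptions on my part (the //go:embed directive is also newer than this post; at the time the HTML would have been a string constant or produced by a generator, with the same single-binary result):

package main

import (
	_ "embed"
	"encoding/json"
	"log"
	"net/http"
	"time"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

// The page is compiled into the binary, so the deployment stays one file.
//go:embed index.html
var indexHTML []byte

func main() {
	// LevelDB is an embedded store: no external database process to babysit.
	db, err := leveldb.OpenFile("words.db", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		w.Write(indexHTML)
	})

	// POST /words?w=dada stores word -> date.
	http.HandleFunc("/words", func(w http.ResponseWriter, r *http.Request) {
		word := r.URL.Query().Get("w")
		if word == "" {
			http.Error(w, "missing w", http.StatusBadRequest)
			return
		}
		if err := db.Put([]byte(word), []byte(time.Now().Format("2006-01-02")), nil); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	})

	// GET /search?q=da returns a JSON list of stored words beginning with
	// "da" – the auto-dissuade prefix search.
	http.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
		iter := db.NewIterator(util.BytesPrefix([]byte(r.URL.Query().Get("q"))), nil)
		defer iter.Release()
		words := []string{}
		for iter.Next() {
			words = append(words, string(iter.Key()))
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(words)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}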

I feel that the critical distinction here is how easily deployed my solution is:  one .EXE.  A similar application using Microsoft technologies would require .NET, IIS, SQL Server, a dozen DLLs, and likely a WIX installer to bundle these items.  A naive Linux implementation might require Python, Apache, Redis, 3rd-party libraries, and hopefully a Docker package to bundle it all up.  It’s easy to trust that the 3rd-party database will be running when we need it to be, but there is significant overhead in guaranteeing such things.

With Go, it’s surprisingly easy to consider a web server or standalone database as unwanted complexity, another point of failure, or simply another thing to learn.  Pundits may focus on micro-services and such, but it’s easy to see how such solutions simply shift complexity from the hands of developers to the hands of the dev-ops team.


On Storage I/O, “Medium” Data, and Data Locality

I regularly copy and/or ship terabytes of raster data, but only recently revisited hardware I/O:

A 1 gigabit Ethernet card (NIC) costs $10, while a 10GbE NIC is around $200.  Idealized math fills a 6TB drive over 1GbE in ~13.3 hours (6 TB is roughly 48,000 gigabits; at 1 Gb/s that’s 48,000 seconds), but real-world usage is often significantly slower.  A quick survey of small-office NAS units shows they typically saturate their Ethernet connections.  For higher speeds, NAS vendors report the performance of multiple “teamed” 1GbE ports (AKA link aggregation).  Want those speeds from your server to your NAS?  Details suddenly matter – iSCSI multipath I/O might work for you, but good old Windows file sharing likely won’t.  Multichannel SMB is still in development for Samba, meaning your Linux-based NAS likely doesn’t support sending SMB traffic over “teamed” ports.  And if you’re using your NAS in iSCSI mode you’ve obviated your NAS’s built-in file server… you’re effectively using it as just another hard drive – aka Direct Attached Storage (DAS).

Direct Attached Storage – as it turns out – is harder to pin down.  The consumer market is occupied mostly by companies like LaCie, which cater largely to video editors on Mac platforms.  Commodity HDDs can write at 150 MB/s (1.2 Gb/s), hence USB 3’s 5 Gb/s should conceptually come close to saturating disk I/O on a 4-bay DAS.  Many DAS units, however, offer Thunderbolt (10 Gb/s) connectivity, and finding USB 3 performance details is often challenging.  External SATA (6 Gb/s) or SAS (12 Gb/s) is yet another (rather elegant) option, but requires specialized components outside of most consumers’ wheelhouse.

There are certainly a lot of details that affect how fast these types of file-copy workflows will go.  If your NAS has USB ports, you may be able to use your NAS’s UI or Offloaded Data Transfers (ODX) to facilitate faster copying.  Under these conditions, NAS might be preferable to DAS.  However, there are other reasons to consider direct attached storage.  Although it’s a bit of a sidebar, a strong reason is to accommodate Windows users – likely the most expensive part of an IT organization.

The average “big data” user has specialized, deep knowledge of distributed systems and the cloud.  The complexities of “big data” problems have warranted an operational overhaul.  “Medium data” users are likely specialists in something – but not data management.  These users typically run Windows locally, not Linux in the Amazon cloud.  They don’t want to drop ArcGIS for Desktop to learn MapReduce Geo, because they’re busy learning skydiving or how to be a better parent.  $100 a month to Comcast gets them 10 Mb/s uploads; at that rate it takes fifty-six days(!!) to back up that 6TB hard drive into the cloud.  Think of the potential productivity losses if the wrong user pursued this path…

This is not a criticism of “the cloud.”  Far from it; there are lessons to be learned in “data locality.”  Hadoop is a popular big-data buzzword of our time, but most people don’t realize that Hadoop is less about “distributed processing” than it is about “data locality.”  The heart of Hadoop is its Hadoop Distributed File System (HDFS), and the beauty of HDFS – what we can learn from the cloud – is that data is stored on the same machines that will process it.  The I/O limitations mentioned above with NAS vs DAS apply all the same, in the cloud or in the enterprise.

As GIS professionals, we cannot turn a blind eye to our hardware and operating system infrastructures.  Moore’s law has advanced processing to the point where computation is cheaper than the I/O infrastructure that feeds it.  Within our real-world “medium data” infrastructures, we must be careful to scale realistically and intelligently.  Data must be stored close to where it is used; distributed processing is often for naught without complementary distributed storage.  In short, the proverbial “IT Guy” acting alone might not be enough to optimize your enterprise GIS – get involved!

On Cartographic Symbology and the GIS Software Business

I began my GIS career using ArcIMS.  These were ancient times… in order to specify “this line should be red,” one had to know ArcIMS’s proprietary XML.  It was ghastly then, and it’s ghastly now.  Using Google Earth?  The cartography is represented by KML.  Using TileMill?  Your CartoCSS is translated to XML behind the scenes.  Using ESRI?  At best, your symbology is stored as JSON.  QGIS?  QML.  MapServer?  Arcane plain text.  GeoServer?  SLD.

This is not meant to be a comparison of data formats, but of the breadth of different cartographic styling languages.

A raster in any of these programs will look the same.  With rasters, the software doesn’t matter – this is why OGC WMS was so successful.  Things did fall apart for WMS around the edges – OGC Styled Layer Descriptors (SLDs) were seldom used, and that style specification never really gained traction.  Seldom did a client really need to supply alternative cartography to a WMS.  The idea of a WFS server passing an SLD to a client as rendering instructions would be great, but it’s something I’ve never seen implemented in the real world.

Hence vector styling has remained the wild west.  ESRI recently said they’d use a “format based on existing community specifications” for vector tile maps.  Presumably, that means CartoCSS or some variant.  The question looms: “can I use my ArcMap symbology for ESRI vector tiles?”  [It’s worth plugging Arc2Earth’s TileMill Connect feature here.]  The opposite direction is just as fraught.  It’s become simple to export OpenStreetMap data into ESRI-friendly formats.  Nevertheless, it will look terrible out of the box, you’ll have a devil of a time finding a cartographer to work on it, and it’s effectively impossible for ESRI’s rendering engine to match OSM’s / Mapnik’s rendering.

We are blessed in GIS – until the words “cartographer” and “style” trigger the cringe-worthy vector rasterization engine.  Nevertheless, this world is upon us.  Cloud resources such as OpenStreetMap and Twitter are defining the new worlds of cartography.  MapD’s Twitter demo exemplifies how big data requires new types of rasterization engines.  Recently MapBox has shifted from a server-side Mapnik rendering engine to a client-side MapBox GL, no doubt vastly reducing their storage and processing overhead.

Those of us building new GIS applications – even the mundane – should start by worrying about the cartography.  Data set support, application capabilities, application performance, cartographic costs, data storage sizes, and interoperability with other programs are just a few of the critical reasons why having good style is important.

Windows Search and Image GPS Metadata

Windows Search is a feature that is fantastic on paper but that completely fails in its default implementation.  I won’t wax poetic on what Windows Search claims to do, but it’s an amazing set of features given that nobody seems to be able to find anything with it.  However, if you fiddle around with your settings and have a concrete goal, things get better.

I can instantly find pictures from my iPhone by querying for “Apple *.jpg”.  This search utilizes the full-text index; a more precise search would have read “System.Photo.CameraManufacturer:Apple *.jpg”.  Herein lies the first challenge of Windows Search: for non-text searches, you usually need to know the name of the field you’re looking for.

A little digging reveals that image location data is stored as System.GPS.Latitude and System.GPS.Longitude.  Sweet!  Type “System.GPS.Latitude:>0” in your search box and prepare for disappointment.  There are a number of issues at hand here.  One is the format of the data, which is not the decimal value you’d expect.  It’s actually a “computed property,” and there’s a lot of detail there which I will skip over.

The bigger issue is that latitude and longitude simply aren’t being indexed.

If the property is indexed so that it is searchable: This means it is included in the Windows Search property store (denoted by isColumn = "true" in the property description schema) or available for full text searches (inInvertedIndex = "true")

Referring to the System.GPS.Latitude property description, isColumn and inInvertedIndex are both false.  I’m not yet aware of how one might change these settings, but I’ll post again if I have any luck.


On Windows 8.1 and Windows Server 2012 R2, there’s a System.GPS.LatitudeDecimal property, which appears to be searchable by default.  Unfortunately, it seems that only Panoramic (.pano) files are associated with this property.  Prop.exe is a great tool for further exploring the Windows Property System.

Kicking the tires of TileMill’s support for File Geodatabases

Back in April, MapBox announced that TileMill now supports ESRI File Geodatabases.  The support appears to come via GDAL’s integration of Even Rouault’s work to reverse engineer the FGDB format.

When I first looked at Even’s work, there was no support for reading the spatial-index files of an FGDB.  Of course, without spatial indexing, large data sets would perform quite poorly.  It’s worth noting that Even’s project now supports spatial indexing, but GDAL 1.11 ships the older version.  The current latest TileMill dev build to include an installer – TileMill-v0.10.1-291 – should similarly lack spatial indexing.

To make my test exciting, then, I decided to use a large dataset.  I fired up ogr2ogr and created an FGDB dump of the full OpenStreetMap globe (osm2pgsql schema).  I tested the data in ArcMap and OGR and everything was quite zippy.  Upon attempting to load the FGDB in TileMill, it crashed.  I can’t say I didn’t expect this.

It’s worth noting that ESRI’s File Geodatabase API is free as in beer.  I think Even’s work is fantastic for the community, but I’m not sure why MapBox didn’t use that other GDAL FGDB driver.  Nevertheless, OSS marches on, and I expect we’ll see these recent features bubble their way up.  I look forward to seeing FGDB spatial-indexing support hit TileMill, as I believe the idea has real legs.