Take your spatial data analysis to the next level with PostGIS at Azure PostgreSQL

Brass Contributor

Jan 23, 2020

Spatial data comes in many shapes (sic) and sizes. Location information can be something more abstract that doesn’t look like spatial data at first (an IP-address, a placename inside a string) or it can be something more concrete (an address, coordinates) that is quite straightforward to turn in to geographic objects.

To conduct spatial analysis with PostgreSQL you need to be able to handle geographic objects and to be able to handle geographic objects, you need the PostGIS extension.

If you set up a hosted PostgreSQL database on Azure and run the following query:

SELECT * FROM pg_available_extensions;

This will return a list of https://www.citusdata.com/blog/2018/11/27/postgres-more-than-a-relational-database/ and among those are multiple spatial extensions. Among those is PostGIS. To install PostGIS to your database, you only need the following query:

CREATE EXTENSION postgis;

After a few seconds this enables PostGIS and you might notice that a table spatial_ref_sys appears to your public schema. This table lists over 3000 known spatial reference systems and details needed to transform and reproject between them. Because the Earth is not flat (sorry to break it to you here), we have to deal with various coordinate systems to project things happening on this somewhat round planet to flat surfaces like your computer screen.

Working with spatial data in PostGIS

As a disclaimer, I have to note at this point that working with spatial data is not always very straightforward. That is why there are people (like me) specialized in that. Traditionally the field has been called GIS (see, PostGIS!). Nowadays GIS (Geographic Information Systems) is not just information systems, but it has evolved to things like location intelligence, spatial data analytics or spatial data science. Understanding the basics of spatial data is becoming more important in all subdomains of data analysis. However you might label it, we are still talking about the same key question: where something is happening and what does that mean?

Basic formats of spatial data in vector format are points, lines and polygons. You can read a very basic level explanation of data formats and GIS in general from https://mapschool.io/ or if you wish to dive deeper into the topic, I strongly recommend getting familiar with a basic GIS book. Points are formed from a pair of coordinates, lines are made from connected points and polygons are made from connected lines.

Points, lines and polygons. That is a good starting point when trying to understand spatial data. Credit: mapschool.io

In PostGIS points (POINT type), lines (LINESTRING) and polygons (POLYGON) can also be MULTI-type with multiple geometries in one, or a geometry collection that can contain different geometry types inside one cell. The coordinate system of the data is defined by spatial reference identifier (SRID). In addition to vector data, PostGIS https://info.crunchydata.com/blog/waiting-for-postgis-3-separate-raster-extensionalso store and analyze raster data.

The basis for the normal PostGIS geometry type is a plane and the shortest path between two points on the plane is a straight line. That means calculations on geometries (areas, distances, lengths, intersections, etc) can be calculated using cartesian mathematics and straight line vectors. PostGIS also has a geography data type, which provides native support for spatial features represented on "geographic" coordinates (sometimes called "geodetic" coordinates, or "lat/lon", or "lon/lat"). But you can also work with “lat/lon” data using the normal geometry data type.

The OpenGIS specification which PostGIS is based on, defines two standard ways of expressing spatial objects: the Well-Known Text (WKT) form and the Well-Known Binary (WKB) form. Both of these include information about the type of the object and the coordinates which form the object.

So in a PostGIS database a point geometry could be represented in multiple ways. If I run the following query in my PostGIS database that has a spatial table with cities:

SELECT geom FROM ne_50m_populated_places WHERE name =  'Seattle'

It returns the following binary representation for the point that is the location of Seattle:

0104000020E6100000010000000101000000EC1FEA31E2955EC0E0AAD59635C94740

But if I use the function ST_AsEWKT, the same point

SRID=4326;MULTIPOINT(-122.341930845868 47.5719479125307)

So the SRID 4326 https://epsg.io/4326that the data is in coordinate system called WGS84 - World Geodetic System 1984, used also in GPS. MULTIPOINT means that the data type is a multi-type although there is only one point and the numbers in parenthesis are latitude and longitude of Seattle. This particular point has way too many decimals in the coordinates when you think about the fact that it is showing where Seattle is. Understanding the concepts of scale and accuracy are key when working with spatial data.

You will find this funny at some point. I promise. Credit: https://xkcd.com/2170/

Spatial indexes are also one thing that works slightly differently with spatial data than other PostgreSQL data types. Without an index any search for a feature would require a sequential scan of every record in the database and this also applies for spatial data. Indexing speeds up searching by organizing the data into a search tree which can be quickly traversed to find a particular record. Whereas standard database indexes create a hierarchical tree based on the values of the column being indexed, spatial indexes are unable to index the geometric features themselves and instead index the bounding boxes of the features.

PostGIS uses R-Tree spatial index structure. R-Trees break up data into rectangles (bounding boxes), and sub-rectangles, and sub-sub rectangles, etc. It is a self-tuning index structure that automatically handles variable data density and object size.

These are just a few examples that are specific when working with spatial data. In the next blog post I will go through where you can find spatial data if you don’t already happen to have some in your database and how you can load it to PostGIS.

What can you do with PostGIS?

Whereas normal PostgreSQL offers you functions that work for example with math (sqrt(), round()), strings (char_length(), substring()) or data types, PostGIS supplements those with hundreds of https://postgis.net/docs/PostGIS_Special_Functions_Index.html. These functions make PostGIS a real data analysis powerhouse and can enable you to do everything from point data clustering to geometry manipulation. PostGIS uses https://en.wikipedia.org/wiki/DE-9IM model to describe the spatial relations of geographic objects. I will introduce some of those in later blog posts of this series.

One really powerful feature of PostGIS are spatial joins. From normal SQL you are probably familiar with the process of joining tables, but there are cases where the only common nominator of two tables is their location. As an example you might have a table with customer locations as points and another table with shop locations as areas. With a spatial join you can add new information to the customer data about their closest shop. https://carto.com/blog/lateral-joins/ of spatial joins can help you to answer very complex questions.

Routing is one really traditional GIS problem. With another PostgreSQL/PostGIS extension (also available on Azure PostgreSQL) pgRouting and network data you can find optimal routes and do different network analytics. Networks can be any kind of topologically connected vector data, but most typically you work with road geometries. You can find very good pgRouting https://workshop.pgrouting.org/

In one project I combined real-time train GPS location data in PostGIS ST_Voronoi analysis. Thus the areas sow the nearest train in each location around Finland.

In one project I combined real-time train GPS location data in PostGIS https://postgis.net/docs/ST_VoronoiPolygons.html function. Thus the areas sow the nearest train in each location around Finland.

Where to learn more and how to get started?

Currently Azure PostgreSQL is offering PostGIS version 2.5 but just recently PostGIS version 3.0 was released. I recommend trying it out and if you are familiar with PostgreSQL or SQL you will be up and running in no time. After a while you will also be able to distinguish your EPSG:3857 from EPSG:4326 and know the difference between ST_Contains and ST_Intersects.

I have found the PostGIS https://postgis.net/docs/to be a very good resource in general. You can check https://postgis.net/workshops/postgis-intro/extensive workshop material to get a good overview of PostGIS in action. If you want to improve your spatial SQL skills in practice, I recommend https://learn.crunchydata.com/postgisinteractive playground by Crunchy Data. I have written a longer blog post about PostGIS on https://medium.com/@tjukanov/why-should-you-care-about-postgis-a-gentle-introduction-to-spatial-databases-9eccd26bc42b directed more towards people coming from GIS background with no RDBMS experience.

I personally work with PostGIS and FOSS4G (= Free Open Source Software for Geospatial) at https://www.gispo.fi/en/home/ in Finland. I will write here a few blogs about PostGIS from different perspectives, so stay tuned!

Updated Jan 23, 2020

Version 2.0

Brass Contributor

Joined November 12, 2019

View Profile