XML Queries on PostgreSQL

XML, love it or hate it, there is a lot of data floating around in XML formats and that fact isn't changing rapidly due to a large investment in XML. Occasionally, pulling XML data into a RDBMS may yield a quick win when integrating with external data sources which provide data in XML. PostgreSQL has the ability to manipulate XML data with SQL, allowing a path to integrate XML data into SQL queries. This post highlights the functionality of PostgreSQL to query XML documents.

Table Creation

A quick example of creating a table to hold XML data introduces the xml datatype in PostgreSQL. The xml datatype is available in PostgreSQL since at least version 8.x, and the functionality to manipulate XML within PostgreSQL is stable and unchanging.

create table xml_example (
   id  bigserial primary key,
   doc xml
);

Inserting XML Data

As with creating a table, inserting data follows the expected SQL form. XML data is really a string when inserting into an RDBMS.

insert into xml_example (doc)
values ('<CD><TITLE>2112</TITLE><ARTIST>Rush</ARTIST><YEAR>1976</YEAR></CD>');

insert into xml_example (doc)
values ('<CD><TITLE>Moving Pictures</TITLE><ARTIST>Rush</ARTIST><YEAR>1980</YEAR></CD>');

insert into xml_example (doc)
values ('<CD><TITLE>Clockwork Angels</TITLE><ARTIST>Rush</ARTIST><YEAR>2012</YEAR></CD>');

Querying XML Data

PostgreSQL has functions to extract bit of XML which work within standard SQL syntax and may be used within much larger queries. For an example, the following query would extract the year from the blocks of XML inserted above where the title element equals the string '2112'.

select xpath('/CD/YEAR/text()', doc) as year
from xml_example
where xmlexists('//TITLE[text() = ''2112'']' PASSING BY REF doc)

Which, given the data inserted above, results in:

year xml[]
{1976}

This does pose a challenge for some clients to process the results, as "1976"" without the decoration of XML, in this case the curly braces, would be the expected form. A tiny bit of regular expression coupled with a cast to a text solves the issue for most clients.

select regexp_replace((xpath('/CD/YEAR/text()', doc)::text), '{|}', '') as year
from xml_example
where xmlexists('//TITLE[text() = ''2112'']' PASSING BY REF doc)

Executing the above query provides the expected result, without curly braces:

year xml[]
1976