Representing an XML qualified name as a string

May 31st, 2011

I am working on a project where we need to store qualified XML names (QNames, i.e. a namespace plus a local name) as strings outside of an XML document. This includes QNames from any third-party namespace that a user of our package wants to include. So I set out to find the standard way of doing this, one that would give other apps the best chance of parsing the string back into a QName, especially for QNames that already have a somewhat widely used string representation. We are storing metadata about “things” (documents, sensor recordings, you name it), so I paid particular attention to popular schemas in the semantic web space. Should we use ns:name, ns/name, ns#name, or something else? After spending way too much time on this, here is what I found:

  • There is no official standard. A qualified name is officially defined as two strings – the namespace and the local name. Oh, great.
  • One of the first papers on this, by James Clark, says {namespace}local is proper. This is what javax.xml.namespace.QName.toString produces, and the QName.valueOf method will parse that format. This form is also what the Groovy QName class uses but, interestingly, the equals method of that class will also accept a string that uses a colon delimiter.
  • http://docstore.mik.ua/orelly/xml/xmlnut/ch04_02.htm talks of both {namespace}local and namespace#local
  • http://www.rpbourret.com/xml/NamespacesFAQ.htm#names_15 has great detail on namespaces overall. It talks of {namespace}local and another form, namespace^local, which is what SAX filter uses, according to the page. I found no other examples or mention of this “caret” format.
  • javax.xml.soap.Name uses namespace:local. Apache Axis does the same thing, which is not surprising since I believe one came from the other.
  • ECMAScript for XML (and thus Adobe ActionScript) uses two colons: namespace::local. This is partly because it uses the two colons as an operator of sorts, and needed to distinguish it from other uses of a colon in the ECMAScript syntax.
  • Dublin Core (DC) explicitly defines the URIs of the terms in its schema. It uses the path divider ‘/’ as the delimiter between namespace and local name. Of note, if you try to put one of those URIs into a web browser as a URL, it will redirect to a page which uses ‘#’ to mark the fragment in an RDF schema. For example, http://purl.org/dc/terms/ will resolve to http://dublincore.org/2010/10/11/dcterms.rdf#name. I didn’t find any other schema/taxonomy that explicitly defines the URI for each element.
  • Regardless of the above behaviour, the Dublin Core XSD defines the namespace to include the ending ‘/’.
  • The namespaces of the RDF and OWL specifications include an ending ‘#’.
  • All namespaces included in the output from pingthesemanticweb, which lists the most popular semantic schemas, end in ‘/’ or ‘#’. Even the few that use urn format end in ‘#’ (e.g. urn:x-inspire:specification:gmlas:HydroPhysicalWaters:3.0#).
  • The Department of Defense Discovery Metadata Specification (DDMS) namespace, based heavily on Dublin Core, includes the ending ‘/’ just as DC does.
  • I could not find any namespaces that end in ‘}’, ‘^’, or ‘:’ (the first two of which are illegal, I think)
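
Of the forms listed above, {namespace}local is the only one that cannot be confused with the namespace itself. As a minimal sketch of how it is produced and parsed, here is a Python version (Python is used purely for illustration; the names to_clark and from_clark are made up for this example). Returning just the local name when the namespace is empty is meant to mirror the behaviour of the Java QName class, but treat that detail as an assumption.

```python
from typing import Tuple


def to_clark(namespace: str, local: str) -> str:
    """Format a QName as {namespace}local (hypothetical helper)."""
    # Assumption: with no namespace, emit only the local name.
    return f"{{{namespace}}}{local}" if namespace else local


def from_clark(text: str) -> Tuple[str, str]:
    """Parse {namespace}local back into (namespace, local)."""
    if text.startswith("{") and "}" in text:
        # Everything between the braces is the namespace.
        namespace, _, local = text[1:].partition("}")
        return namespace, local
    # No braces: treat the whole string as a local name.
    return "", text
```

Python's own xml.etree.ElementTree happens to use the same brace form for element tags, so ElementTree.QName(namespace, local).text yields the same string as to_clark for a non-empty namespace.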

  • So, you might be thinking that we could just concatenate the namespace and local name together to form the string. To parse it, we could then split the string at the last occurrence of the delimiter character, keeping the delimiter as part of the namespace if it is a ‘/’ or a ‘#’. But wait! There’s more…

  • Many non-semantic-web schemas, like the XML Schema itself, xlink, and the OGC standards like gml, do not include the ending delimiter in their namespaces.
  • National Information Exchange Model (NIEM) namespaces, arguably somewhat-semantic, also do not include a trailing delimiter.
  • Neither does the Intelligence Community Metadata Standard for Information Security Marking (IC-ISM) namespace (which is in urn format).
  • Nor does the DOD core metadata OWL schema, at least as far as I can tell. Sorry, I couldn’t find an exact reference to that one.
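
To see why plain concatenation breaks down for these namespaces, here is a small Python sketch of the "concatenate, then split at the last ‘/’ or ‘#’" idea. The function names naive_join and naive_split are hypothetical, and the local names (title, type, string) are just sample values.

```python
def naive_join(namespace, local):
    """Concatenate namespace and local name with no added delimiter."""
    return namespace + local


def naive_split(text):
    """Split after the last '/' or '#', keeping it in the namespace."""
    cut = max(text.rfind("/"), text.rfind("#")) + 1
    return text[:cut], text[cut:]


DC = "http://purl.org/dc/terms/"                    # ends in '/'
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # ends in '#'
XSD = "http://www.w3.org/2001/XMLSchema"            # no trailing delimiter

if __name__ == "__main__":
    for ns, local in ((DC, "title"), (RDF, "type"), (XSD, "string")):
        joined = naive_join(ns, local)
        round_trips = naive_split(joined) == (ns, local)
        print(joined, "round-trips" if round_trips else "does NOT round-trip")
```

The Dublin Core and RDF names survive the round trip, but the XML Schema name collapses into a single token (XMLSchemastring) and is split in the wrong place.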

Resolution Rules

So if you want to represent a qualified name as a string in the form others are most likely to recognize as the “accepted” one for that particular QName, and you want it to be reversible, at least within your own app, the best rules I could come up with are:

Creating the String

Call the path divider ‘/’ and fragment ‘#’ symbols sticky delimiters because they may be a part of (i.e. stick to) a namespace. Call the other possibilities (‘:’, ‘::’, ‘}’, ‘^’) formal delimiters because you know they only serve the purpose of being a delimiter.

  1. If the namespace ends in a delimiter of any form, simply append the local name directly to it.
  2. Else, use ‘:’, ‘^’ or, to be totally safe, surround the namespace string with ‘{}’ and then append the local name. I chose ‘:’ because I at least saw some uses of that form on various pages while I never saw any uses of the caret ‘^’ or the surrounding ‘{}’. If you have total control of your input and output, use the surrounding braces format since it is totally unambiguous.
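
The two creation rules above can be sketched roughly as follows. This is an illustrative Python version, not the code from my project: the name qname_to_string and the braces flag are made up for the example, and returning a bare local name for an empty namespace is an assumption the rules do not cover.

```python
STICKY = ("/", "#")             # may be part of the namespace itself
FORMAL = ("::", ":", "}", "^")  # only ever serve as delimiters


def qname_to_string(namespace, local, braces=False):
    """Build a string form of a QName (hypothetical helper)."""
    if not namespace:
        return local  # assumption: no namespace means local name only
    if braces:
        # Unambiguous form, for when you control both input and output.
        return "{" + namespace + "}" + local
    if namespace.endswith(STICKY + FORMAL):
        # Rule 1: the namespace already ends in a delimiter.
        return namespace + local
    # Rule 2: otherwise insert a colon.
    return namespace + ":" + local
```

For example, the Dublin Core namespace http://purl.org/dc/terms/ gets the local name appended directly, while http://www.w3.org/2001/XMLSchema gets a colon inserted first.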

Parsing the String

  1. If there is a ‘{}’ pair, you can assume the form is {namespace}local.
  2. Else, find the last possible delimiter in the string. If it is a “formal” delimiter, drop it and take the characters before it as the namespace and the characters after it as the local name.
  3. Else, if the last delimiter is “sticky”, you have to guess whether to keep it in the namespace. I put some basic logic in my code to recognize well known namespaces (like those above) that do not end in a delimiter, but then otherwise assume that a sticky delimiter should be included in the namespace.
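
Here is a rough Python sketch of the three parsing rules. It is an illustration under stated assumptions, not my actual code: parse_qname_string is a made-up name, the KNOWN_BARE_NAMESPACES set is a tiny sample of well-known namespaces without a trailing delimiter, and a string with no delimiter at all is assumed to be a bare local name.

```python
# Sample of well-known namespaces that do NOT end in a delimiter.
KNOWN_BARE_NAMESPACES = {
    "http://www.w3.org/2001/XMLSchema",
    "http://www.w3.org/1999/xlink",
    "http://www.opengis.net/gml",
}


def parse_qname_string(text):
    """Return (namespace, local) for a string form of a QName."""
    # Rule 1: {namespace}local.
    if text.startswith("{") and "}" in text:
        namespace, _, local = text[1:].partition("}")
        return namespace, local
    # Rules 2 and 3: locate the last delimiter of either kind.
    cut = max(text.rfind(ch) for ch in "/#:^")
    if cut < 0:
        return "", text  # assumption: no delimiter, local name only
    head, delim, local = text[:cut], text[cut], text[cut + 1:]
    if delim in ":^":
        # Formal delimiter: drop it (both colons of a '::' pair).
        if delim == ":" and head.endswith(":"):
            head = head[:-1]
        return head, local
    # Sticky delimiter: keep it unless the namespace is known to be bare.
    if head in KNOWN_BARE_NAMESPACES:
        return head, local
    return head + delim, local
```

With this sketch, http://www.w3.org/2001/XMLSchema#string comes back with the namespace http://www.w3.org/2001/XMLSchema because that namespace is in the known list, while http://purl.org/dc/terms/title keeps the trailing ‘/’ on its namespace.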

It’s not a perfect solution, but that’s what you get when there is no standard.

Categories: groovy, OGC, semantic web, XML