Archive for the ‘XML’ Category

Xerces and xml-api Dependency Hell

June 29th, 2011 3 comments

One of the projects I work on includes a whole mish-mash of XML-related libraries, including xerces, jdom, dom4j, jaxen, and xalan. Some are direct dependencies and some are pulled in by other third-party dependencies like hibernate, tika, gate, etc. Many of these libraries have transitive dependencies on xerces and/or on some form of xml-api artifact, though the exact artifact name, and even the group name, seem to vary randomly. What was xerces:xmlParserApis vs xml-apis:xml-apis vs xml-apis:xmlParserAPIs? Why were there versions of xml-api artifacts in the 2.0.x range that seemed older than version 1.0.b2, which so many libs depend on?

I recently tried to upgrade the included version of xerces from 2.6.2 to 2.9.1. This is the latest official release posted to Maven Central, though it is nearly 4 years old. (The latest official xerces release, 2.11.0, and the previous one, 2.10.0, are not in the primary maven repos. See XERCESJ-1454 if interested in more on why.) The upgrade caused some rather strange class loader errors that forced me to finally dig into this. What follows are my rough notes on the various xml-api related artifacts. They go in chronological order.

Group ID | Artifact ID | Version | Release Date | Notes
--- | --- | --- | --- | ---
xerces | xmlParserApis | 2.0.0 | 01/30/2002 |
xml-apis | xmlParserApis | 2.0.0 | 01/30/2002 |
xerces | xmlParserApis | 2.0.2 | 06/21/2002 |
xml-apis | xmlParserApis | 2.0.2 | 06/21/2002 |
xerces | xmlParserApis | 2.2.1 | 11/11/2002 | includes all classes in 2.0.2, plus some security support stuff and other mods
xml-apis | xml-apis | 1.0.b2, 2.0.0, 2.0.2 | 12/01/2002 | includes all but some security support and other util class in xerces:xmlParserApis:2.2.1, plus some additions
xerces | xmlParserApis | 2.6.0, 2.6.0, 2.6.2 | 11/18/2003 | all but 1 class from xml-apis:1.0.b2, plus the security support classes that were in xerces:xmlParserApis:2.2.1; 2.6.2 was the last of this artifact
xml-apis | xml-apis | 1.2.01 | | no jar, just a relocation tag to xerces:xmlParserApis:2.6.2; looks like this was added on 02/03/2010 (judging by date in http://repo1.maven.org/maven2/xml-apis/xml-apis/), about 3 years after other xml-apis:xml-apis entries like 1.3.04
xml-apis | xml-apis | 1.3.02 | 07/22/2005 | includes all but 1 class from v2.6.2 (dropped older security support stuff), plus many additions; included with xerces 2.7.1
xml-apis | xml-apis | 1.3.03 | 02/25/2006 | released with xerces 2.8.0; xercesImpl:2.8.0 was the first one where they included dependency info in the pom
xml-apis | xml-apis | 1.3.04 | 11/19/2006 | xerces:xercesImpl:2.9.1 (09/14/2007) depends on this; this is the last of this artifact in maven repos

One interesting note is that xml-apis:xml-apis:2.0.0 and 2.0.2 are newer than their equivalent versions of xerces:xmlParserApis and xml-apis:xmlParserAPIs.

While tedious, working out these relationships helped me track down the conflicting dependencies.  I added these entries to my root project’s dependencyManagement section:

<dependency>
    <groupId>xml-apis</groupId>
    <artifactId>xml-apis</artifactId>
    <version>1.3.04</version>
</dependency>
<dependency>
    <groupId>jaxen</groupId>
    <artifactId>jaxen</artifactId>
    <version>1.1.1</version>
    <exclusions>
        <exclusion>
            <groupId>xerces</groupId>
            <artifactId>xmlParserAPIs</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>jmimemagic</groupId>
    <artifactId>jmimemagic</artifactId>
    <version>0.1.2</version>
    <exclusions>
        <exclusion>
            <groupId>xml-apis</groupId>
            <artifactId>xmlParserAPIs</artifactId>
        </exclusion>
    </exclusions>
</dependency>

and all was good in the world again.
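
If you want to double-check which jar actually supplies an XML API class at runtime, something like the following can help. It is only a minimal sketch: the class name WhichJar and the list of classes to probe are made up for illustration, and on most JDKs these particular classes are served by the runtime itself, in which case there is no code source to report.

```java
import java.security.CodeSource;

public class WhichJar {
    /** Returns where a class was loaded from, or a note when the JDK runtime supplied it. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes loaded by the bootstrap loader typically have no code source.
            return "(bootstrap / JDK runtime)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Illustrative picks: classes that xml-apis style jars have historically bundled.
        String[] names = {
            "org.w3c.dom.Node",
            "org.xml.sax.XMLReader",
            "javax.xml.parsers.DocumentBuilderFactory"
        };
        for (String name : names) {
            System.out.println(name + " <- " + locate(Class.forName(name)));
        }
    }
}
```

Run it with the same classpath as the application. If a line points at an old xmlParserAPIs jar rather than the version you expect, that dependency is probably still sneaking in.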

Categories: maven, XML

Representing an XML qualified name as a string

May 31st, 2011 1 comment

I am working on a project where we need to store qualified XML names (QNames, i.e. namespace plus local name) as strings outside of an XML document. This includes QNames from any third-party namespace that a user of our package wants to include. So I set out to find the standard way of doing this, one that would give other apps the best chance of parsing the string back into a QName, especially for QNames which already had a somewhat widely used string representation. We are storing metadata about “things” (documents, sensor recordings, you name it), so I paid particular attention to popular schemas in the semantic web space. Should we use ns:name, ns/name, ns#name, or something else? After spending way too much time on this, here is what I found:

  • There is no official standard. A qualified name is officially defined as two strings – the namespace and the local name. Oh, great.
  • One of the first papers on this, by James Clark, says {namespace}local is proper. This is what javax.xml.namespace.QName.toString produces, and the QName.valueOf method will parse that format. This form is also what the Groovy QName class uses but, interestingly, the equals method of that class will accept a string that uses a colon delimiter.
  • http://docstore.mik.ua/orelly/xml/xmlnut/ch04_02.htm talks of both {namespace}local and namespace#local
  • http://www.rpbourret.com/xml/NamespacesFAQ.htm#names_15 has great detail on namespaces overall. It talks of {namespace}local and another form, namespace^local, which is what SAX filter uses, according to the page. I found no other examples or mention of this “caret” format.
  • javax.xml.soap.Name uses namespace:local. Apache Axis does the same thing, which is not surprising since I believe one came from the other.
  • ECMAScript for XML (and, thus, Adobe ActionScript) uses 2 colons – namespace::local. This is partly because it uses the two colons as an operator of sorts, and needed to separate it from other uses of a colon in the ECMAScript syntax.
  • Dublin Core (DC) explicitly defines the URIs of the terms in its schema. It uses the path divider ‘/’ as the delimiter between namespace and local name. Of note, if you try to put one of those URIs into a web browser as a URL, it will redirect to a page which uses ‘#’ to note the fragment in an RDF schema. For example, http://purl.org/dc/terms/name will resolve to http://dublincore.org/2010/10/11/dcterms.rdf#name. I didn’t find any other schema/taxonomy that explicitly defines the URI for each element.
  • Regardless of the above behaviour, the Dublin Core XSD defines the namespace to include the ending ‘/’.
  • The namespaces of the RDF and OWL specifications include an ending ‘#’.
  • All namespaces included in the output from pingthesemanticweb, which lists the most popular semantic schemas, end in ‘/’ or ‘#’. Even the few that use urn format end in ‘#’ (e.g. urn:x-inspire:specification:gmlas:HydroPhysicalWaters:3.0#).
  • The Department of Defense Discovery Metadata Specification (DDMS) namespace, based heavily on Dublin Core, includes the ending ‘/’ just as DC does.
  • I could not find any namespaces that end in ‘}’, ‘^’, or ‘:’ (the first two of which are illegal, I think)
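
For reference, here is a small sketch of the {namespace}local round trip using the JDK's javax.xml.namespace.QName. The class name QNameBraceForm and the Dublin Core term "title" are just illustrative picks.

```java
import javax.xml.namespace.QName;

public class QNameBraceForm {
    /** Formats a namespace and local name the way QName.toString() does. */
    static String format(String namespace, String local) {
        return new QName(namespace, local).toString();
    }

    /** Parses the {namespace}local form back into a QName. */
    static QName parse(String text) {
        return QName.valueOf(text);
    }

    public static void main(String[] args) {
        String text = format("http://purl.org/dc/terms/", "title");
        System.out.println(text);                  // {http://purl.org/dc/terms/}title

        QName parsed = parse(text);
        System.out.println(parsed.getNamespaceURI()); // http://purl.org/dc/terms/
        System.out.println(parsed.getLocalPart());    // title
    }
}
```

The braces make the split point unambiguous, which is why this form survives the round trip no matter how the namespace itself ends.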

  • So, you might be thinking that we could just concatenate the namespace and local name together to form the string. To parse it, we could then split the string at the last occurrence of the delimiter character, keeping the delimiter as part of the namespace if it is a ‘/’ or a ‘#’. But wait! There’s more…

  • Many non-semantic-web schemas, like the XML Schema itself, xlink, and the OGC standards like gml, do not include the ending delimiter in their namespaces.
  • National Information Exchange Model (NIEM) namespaces, arguably somewhat-semantic, also do not include a trailing delimiter.
  • Neither does the Intelligence Community Metadata Standard for Information Security Marking (IC-ISM) namespace (which is in urn format).
  • Nor does the DOD core metadata OWL schema, at least as far as I can tell. Sorry, I couldn’t find an exact reference to that one.

Resolution Rules

So if you want to represent a particular qualified name as a string in the way others are most likely to recognize as the “accepted” form for that QName, and you want it to be reversible, at least within your own app, the best rules I could come up with are:

Creating the String

Call the path divider ‘/’ and fragment ‘#’ symbols sticky delimiters because they may be a part of (i.e. stick to) a namespace. Call the other possibilities (‘:’, ‘::’, ‘}’, ‘^’) formal delimiters because you know they only serve the purpose of being a delimiter.

  1. If the namespace ends in a delimiter of any form, simply append the local name directly to it.
  2. Else, use ‘:’, ‘^’ or, to be totally safe, surround the namespace string with ‘{}’ and then append the local name. I chose ‘:’ because I at least saw some uses of that form on various pages while I never saw any uses of the caret ‘^’ or the surrounding ‘{}’. If you have total control of your input and output, use the surrounding braces format since it is totally unambiguous.
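
The two rules above could be sketched roughly like this. It is an illustrative sketch, not the code from the project: the class and method names are hypothetical, the useBraces flag is just one way to expose the choice in rule 2, and returning the bare local name when there is no namespace is an assumption the rules do not cover.

```java
public class QNameStringWriter {
    // Characters that count as a delimiter when they end a namespace:
    // the sticky ones ('/', '#') and the single-character formal ones (':', '^', '}').
    private static final String DELIMITERS = "/#:^}";

    static String toQNameString(String namespace, String local, boolean useBraces) {
        if (namespace == null || namespace.isEmpty()) {
            return local; // assumption: no namespace means just the local name
        }
        char last = namespace.charAt(namespace.length() - 1);
        if (DELIMITERS.indexOf(last) >= 0) {
            return namespace + local;               // rule 1: append directly
        }
        return useBraces
                ? "{" + namespace + "}" + local     // rule 2: unambiguous braces form
                : namespace + ":" + local;          // rule 2: colon form
    }

    public static void main(String[] args) {
        // Namespace ends in a sticky delimiter, so the local name is appended as-is.
        System.out.println(toQNameString("http://purl.org/dc/terms/", "title", false));
        // http://purl.org/dc/terms/title

        // Namespace has no trailing delimiter, so a formal one is added.
        System.out.println(toQNameString("http://www.w3.org/2001/XMLSchema", "string", false));
        // http://www.w3.org/2001/XMLSchema:string
        System.out.println(toQNameString("http://www.w3.org/2001/XMLSchema", "string", true));
        // {http://www.w3.org/2001/XMLSchema}string
    }
}
```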

Parsing the String

  1. If there is a ‘{}’ pair, you can assume the form is {namespace}local.
  2. Else, find the last possible delimiter in the string. If it is a “formal” delimiter, then drop the delimiter and make the namespace the chars before it and local name the chars after it.
  3. Else, if the last delimiter is “sticky”, you have to guess whether to keep it in the namespace. I put some basic logic in my code to recognize well known namespaces (like those above) that do not end in a delimiter, but then otherwise assume that a sticky delimiter should be included in the namespace.
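
Here is a rough sketch of those three parsing steps. Again, this is illustrative rather than the project's code: the names are hypothetical, the list of well-known namespaces without a trailing delimiter is a tiny sample (XML Schema, xlink, gml), and treating a string with no delimiter at all as a bare local name is an assumption. Set.of needs Java 9 or later.

```java
import java.util.Set;
import javax.xml.namespace.QName;

public class QNameStringParser {
    // Sample of well-known namespaces that do NOT end in a delimiter (illustrative, not complete).
    private static final Set<String> BARE_NAMESPACES = Set.of(
            "http://www.w3.org/2001/XMLSchema",
            "http://www.w3.org/1999/xlink",
            "http://www.opengis.net/gml");

    static QName parse(String text) {
        // Step 1: the {namespace}local form.
        if (text.startsWith("{")) {
            int close = text.indexOf('}');
            if (close > 0) {
                return new QName(text.substring(1, close), text.substring(close + 1));
            }
        }
        // Find the last character that could be a delimiter.
        int idx = -1;
        for (int i = text.length() - 1; i >= 0 && idx < 0; i--) {
            if ("/#:^".indexOf(text.charAt(i)) >= 0) {
                idx = i;
            }
        }
        if (idx < 0) {
            return new QName(text); // assumption: no delimiter means no namespace
        }
        String local = text.substring(idx + 1);
        char delimiter = text.charAt(idx);
        // Step 2: formal delimiters (':', '::', '^') are dropped.
        if (delimiter == ':' || delimiter == '^') {
            int end = idx;
            if (delimiter == ':' && idx > 0 && text.charAt(idx - 1) == ':') {
                end = idx - 1; // the two-colon form
            }
            return new QName(text.substring(0, end), local);
        }
        // Step 3: sticky delimiters ('/', '#') stay in the namespace unless the
        // part before them is a known namespace that has no trailing delimiter.
        String bare = text.substring(0, idx);
        if (BARE_NAMESPACES.contains(bare)) {
            return new QName(bare, local);
        }
        return new QName(text.substring(0, idx + 1), local);
    }

    public static void main(String[] args) {
        System.out.println(parse("http://purl.org/dc/terms/title").getNamespaceURI());
        // http://purl.org/dc/terms/
        System.out.println(parse("http://www.w3.org/2001/XMLSchema#string").getNamespaceURI());
        // http://www.w3.org/2001/XMLSchema
    }
}
```

The guess in step 3 is the weak spot: any namespace missing from the known list that does not end in ‘/’ or ‘#’ will come back with the delimiter glued on.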

It’s not a perfect solution, but that’s what you get when there is no standard.

Categories: groovy, OGC, semantic web, XML