Yesterday I was at the ALPSP Technology Update meeting "A Standard XML Document Format: the case for the adoption of NLM DTD". I gave a talk entitled "NLM DTD in Archiving - a case study". It is the story of how we used the NLM DTD to produce the IMechE Proceedings Archive. You can download my PowerPoint presentation by clicking this link:
Download NLM_DTD_archiving_IMechEProceedingsArchive2007.ppt
All the speaker presentations are also available for download from the ALPSP website.
It was an interesting meeting with a lively discussion, so I thought I would write up some notes.
By the way, I shall assume that readers have a reasonable idea what a DTD is, and only give a brief definition here (from the W3C tutorial on DTDs):
The purpose of a DTD (Document Type Definition) is to define the legal building blocks of an XML document. It defines the document structure with a list of legal elements and attributes.
Meanwhile, if you are a bit hazy on what XML is or how it differs from HTML or SGML, you are welcome to try my short paper, intended for publishers, called "What the Hell is XML?"
So now back to yesterday. I will try to pick out the themes that I found interesting. That will mean dodging about chronologically between what the speakers said, the panel discussion and discussions over lunch. Impediments such as lunch plates and my own pre- and post-speech hormone levels mean that my notes were not up to much - I welcome any corrections and additions!
Theme 1 - one reason why the NLM DTD is good
Bruce Rosenblum (Inera, and one of the authors of the DTD) kicked off with a description of how it came to be. The first of my themes soon emerged: one of the reasons the DTD is such a success is that it was designed by reviewing the publishers' DTDs that were available at the time. It therefore incorporated and built on what publishers were actually doing and wanted to do, rather than some theoretical model of what they should do. The authors examined a wide range of journals across many subjects and so were able to allow for many of the practical issues that journal publishers need to deal with. Also, while it is "the NLM DTD", it is suitable for subject matter very different from the National Library of Medicine's interests. Therefore, as Geoff Bilder (CrossRef) said in his Chair's remarks, there is a lot of publishing wisdom in the DTD. When Eamonn Neylon (British Standards Institute) came to speak, he commented that the NLM DTD had followed the ideal route to being a standard - it had built on best practice, then been adopted widely by its community, before going down the route to having a NISO number.
Theme 2 - whether and when to adapt it
Given the "publishing wisdom" already included, the second theme to emerge was whether and how publishers should modify the DTD. Geoff Bilder went so far as to say that if you needed to modify the DTD, you should look at your methods (as you might well be doing something wrong). Bruce partially disagreed: his talk included material on how the DTD had been made easy to modify or extend if necessary - and some examples of people producing extensions for it. My own contribution was that I've found it important to avoid uncontrolled tinkering with DTDs - sometimes people are very tempted just to "fix" the DTD as a workaround to whatever problem they have on their desk today. During the IMechE Archive project, we did sometimes encounter unexpected oddities in the content we were processing. Each time I found that I could go back to the NLM DTD and its excellent documentation and find a way (or several possible ways) in which the problem could be accommodated WITHOUT bending the DTD out of shape. The DTD is beautifully built to anticipate these kinds of problems (in my experience). One example is the <custom-meta> element, which enables you to include your own, self-defined metadata in your documents if you need to. Obviously you then need to document what you have done, or take other steps to ensure that you adopt the same approach the next time the problem is encountered. But look to see what the DTD can already do before wading in and editing it.
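For anyone who has not met <custom-meta>, here is a rough sketch in Python of what using it might look like. The name/value pair ("scan-source") is a made-up example, not something from our project, and the helper function is my own invention. I have assumed the wrapper element is called custom-meta-wrap; the wrapper name differs between versions of the DTD, so check the documentation for the version you use.

```python
import xml.etree.ElementTree as ET


def add_custom_meta(article_meta, name, value, wrapper_tag="custom-meta-wrap"):
    """Append a self-defined name/value pair to an <article-meta> element.

    wrapper_tag is an assumption: the wrapper element is named differently
    in different versions of the DTD.
    """
    wrapper = article_meta.find(wrapper_tag)
    if wrapper is None:
        # Create the wrapper once; later pairs are added to the same wrapper.
        wrapper = ET.SubElement(article_meta, wrapper_tag)
    item = ET.SubElement(wrapper, "custom-meta")
    ET.SubElement(item, "meta-name").text = name
    ET.SubElement(item, "meta-value").text = value
    return item


article_meta = ET.Element("article-meta")
# "scan-source" is a hypothetical piece of house metadata.
add_custom_meta(article_meta, "scan-source", "bound print volume")
xml_text = ET.tostring(article_meta, encoding="unicode")
print(xml_text)
```

The point is that the odd piece of information travels as data inside an element the DTD already provides, so the DTD itself stays untouched. The list of names you allow in meta-name is what you then need to write down.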
Theme 3 - QA
A third theme that emerged was that of QA. You should certainly check that your content will "parse" (that is, check that it obeys all the rules in the DTD), but it would be a misconception to believe that this in itself means that all is well. This is because parsing makes a very specific set of checks - it is possible to write a document that parses perfectly, but is full of errors that the DTD cannot catch. Geoff Bilder had a (rather shocking) anecdote about how some people behaved as if parsing the data was a game - he worked with a supplier whose data didn't parse. When one day it did, he was suspicious and inspected the XML. This revealed the reason - the supplier had faked things by "commenting out" the bits of the document that wouldn't parse.
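To make Geoff's anecdote concrete, here is a small sketch (my own, not anything shown at the meeting) of a check that looks for comments containing what appears to be markup. It uses Python's standard xml.dom.minidom, which only checks that the XML is well-formed and does not validate against a DTD; comments are invisible to a validating parser too, which is exactly why the trick works. The sample document and the "contains a < character" rule are assumptions made for illustration.

```python
from xml.dom import Node, minidom


def suspicious_comments(xml_text):
    """Return the text of every comment that appears to contain markup."""
    # parseString raises an ExpatError if the document is not well-formed.
    doc = minidom.parseString(xml_text)
    found = []
    stack = [doc]
    while stack:
        node = stack.pop()
        # Crude heuristic (an assumption): a "<" inside a comment suggests
        # that real content has been hidden from the parser.
        if node.nodeType == Node.COMMENT_NODE and "<" in node.data:
            found.append(node.data.strip())
        stack.extend(node.childNodes)
    return found


sample = """<article>
  <front><article-title>Gear wear</article-title></front>
  <!-- <body><p>Broken <b>markup</p></body> -->
  <!-- checked by QA -->
</article>"""

hidden = suspicious_comments(sample)
print(hidden)
```

The sample parses without complaint, yet the whole body of the article has been hidden. An ordinary comment such as "checked by QA" is left alone.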
The idea that the DTD is not a complete validation tool is somewhat counter-intuitive. During the panel discussion I was looking for an analogy that might be helpful, but it was not until later that I thought of one. Here it is. When considering contracting a company to do important work, you might undertake some due diligence checks. For example, you might find out whether the company has any legal cases pending, and you might do a Dun and Bradstreet credit check. These checks are of course useful - you'd think carefully before getting involved with a company that had big money or legal problems - but they are not likely to be all that you would want to know. For example, you'd surely want to know about other things, including the competence and capacity of the company to do the work.
Eamonn Neylon's talk covered some of the automated tools available to supplement a check for parsing - he suggested that publishers should use rule-based QA where possible, and employ tools such as these to do it. Of course some manual inspection would always have its place too (automated tools are good at fixing predictable errors - people are good at finding unexpected errors).
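As a very rough idea of what "rule-based QA" can mean in practice, here is a sketch of three checks that a DTD cannot make. This is not one of the tools Eamonn described; the rules, the function name and the sample record are all hypothetical, and a real project would agree its own list of rules.

```python
import re
import xml.etree.ElementTree as ET


def check_rules(xml_text):
    """Apply a few hypothetical house rules and return a list of problems."""
    root = ET.fromstring(xml_text)
    problems = []

    # Rule 1: the title must exist and contain some actual text.
    title = root.find(".//article-title")
    if title is None or not "".join(title.itertext()).strip():
        problems.append("missing or empty article-title")

    # Rule 2: the publication year must be a plausible four-digit year.
    year = root.findtext(".//pub-date/year", default="")
    if not re.fullmatch(r"(18|19|20)\d{2}", year.strip()):
        problems.append(f"implausible year: {year!r}")

    # Rule 3: numeric page ranges must not run backwards.
    fpage = root.findtext(".//fpage", default="")
    lpage = root.findtext(".//lpage", default="")
    if fpage.isdigit() and lpage.isdigit() and int(lpage) < int(fpage):
        problems.append("last page precedes first page")

    return problems


sample = """<article><front><article-meta>
  <title-group><article-title> </article-title></title-group>
  <pub-date><year>2907</year></pub-date>
  <fpage>120</fpage><lpage>112</lpage>
</article-meta></front></article>"""

for problem in check_rules(sample):
    print(problem)
```

Every element in the sample is in a legal place, so a parser has nothing to object to; it is the values that are wrong. Checks like these catch the predictable mistakes, which leaves people free to look for the unexpected ones.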
Theme 4 - how should publishers adopt it?
A fourth theme, emerging in the panel discussion, was how publishers should adopt the NLM DTD (and possibly XML too, if they have not done so already). Adopting the NLM DTD, rather than making your own, was of course already a big time and money saver. Both Bruce Rosenblum and Bill Kasdorf (Apex CoVantage) had experience that trying to get authors to provide tagged content (e.g. by giving them MS Word templates) was an uncertain venture - success in getting the authors to comply was by no means guaranteed, even in the case of a very prestigious journal or when working with very technophilic authors. The panel agreed that talking to your suppliers - especially the typesetters - was a good first step. Typesetters might be able to offer XML at little or no additional page cost and were likely to be willing to spend time with a publisher to discuss and explain (in the interests of winning business, of course). Both Geoff Bilder and Eamonn Neylon sounded a note of caution though - a lot of XML work these days is with databases, and therefore many an XML expert knows about databases and may be unfamiliar with the different issues of working with text content. So check the expert's expertise. What kind of workers did a publisher need? We felt that a publisher didn't necessarily need a mega-technical person. It could be very helpful to have an enthusiastic and interested project manager or project leader to make sure things happen [indeed, this is the role I took in the IMechE Archive project, while it was clear that Professional Engineering Publishing staff should deal with ongoing XML issues, so as not to be permanently reliant on me for expertise]. Publishers obviously needed to make sure they understood the new part of their business, however.
Mick Spencer (Professional Engineering Publishing), commented that he had been on exactly this journey - learning about XML largely by talking and working with suppliers - and had found it possible to learn perfectly adequately in this way.
Nick Evans (ALPSP) asked what publishers should do with their legacy PDFs. The panel discussion drifted away from part of this - later I spoke to Nick and suggested that you could certainly use the NLM DTD to capture XML metadata for these older papers - you don't necessarily have to convert the full text into XML. In a way this was a bit like the position in my project with the IMechE Proceedings Archive - except that we started with paper, not even with PDF!
Tunneling under Disney
In an interesting set of closing remarks, Geoff Bilder quoted Yuri Rubinsky (a well-known promoter of SGML, the predecessor to XML) as likening publishing to the "tunnels under Disneyland" (i.e. the support infrastructure, invisible to the user, which makes the whole enterprise work). I believe Geoff was quoting this passage:
"I saw a revealing photograph of Disneyland in a United Airlines magazine, a shot of Mickey Mouse -- who is enormous in real life -- talking to a street cleaning person in a very tall, very wide tunnel underneath Disney World. A complex network of tunnels is what lets the Peaceful Kingdom function as well as it does and why you never see Mickey or Minnie or Goofy or Donald ducking into a washroom or eating lunch. The analogy [with the systems needed by publishers and libraries] is pretty rich. The architecture of the tunnels is the same no matter what public facility they support. The services they provide are constant, and silent. They keep complications -- like transport vehicles and emergency personnel -- out of the visitors' way, while providing an underpinning to the whole operation.
On one level, publishing is like those tunnels, making available the attractions above ground with subterranean structures. But for me the most interesting aspects of the Disneyland tunnels are their dimensions and their materials and their layout. Why? Because they are completely consistent wherever they go. They're the same beneath a pirate ship and beneath a hotdog stand, providing the consistent system services below which support and enable the mad variety of extravaganza above."
Yuri Rubinsky, Electronic Texts The Day After Tomorrow
Geoff's point (I think) is that publishers (as opposed to XML fans) need to dive into the tunnels of XML DTDs, Near & Far diagrams, Schemas etc. only so far - far enough to keep their Magic Kingdoms running well.
Er Geoff - does that make me Mickey Mouse :-) ?
[Note added 12 December 07: Geoff is planning another foray into the tunnels under Disneyland in the ALPSP update series - the details that the ALPSP have circulated so far are:
Date: Early July
Venue: To be confirmed
Chair: Geoff Bilder, CrossRef
In this Technology Update you will hear how as publishers consider revamping their online publishing infrastructures, they are increasingly looking to new technologies like XML databases, RDF triple stores and XML-aware full text indexing engines.