XML Bad Practices — Introduction — Robin Berjon

Written by Chris Beer on Tuesday, December 22, 2009

Perhaps this series is of interest as PBCore goes forward.

"XML is now over ten years old and can euphemistically be dubbed a success. That being said, I don't believe I need convince readers that not all of its uses have been successful. Over time, many bright minds have attempted to describe how to best make use of it when designing vocabularies, but I believe it is safe to say that those efforts, no matter how excellent, have not been sufficient in ensuring that all applications of XML are produced in an entirely sane manner."

From http://berjon.com/blog/2009/12/xmlbp-intro.html (via http://delicious.com/anarchivist)

PBCore Recommendation: PBCoreCollection

Written by daniel_jacobson on Tuesday, December 08, 2009

Over the last few months, I have worked with Jack Brighton and Dave Rice to have the NPR API (http://www.npr.org/api) output PBCore as a supported format. In the early stages, we were able to put together a mapping of NPRML (our native XML format) to PBCore. From this mapping, my team and I started conceptualizing how this would work within the framework of the API. This exercise ultimately failed because of a philosophical issue between PBCore and the NPR API.

PBCore's implementation focuses on the individual conceptual asset (in NPRML terms, the story). So, if a station wants to receive the 20 stories from today's All Things Considered in PBCore, the station would receive 20 documents, one for each segment (and possibly another document for the program episode record). Meanwhile, the NPR API is a feed-oriented model, which means that a station that wants all 20 ATC segments would make a single request that delivers all 20 items, as well as the information about the program episode. The NPR API model matches many of the more popular feed types in the marketplace, including RSS, ATOM, Podcast, etc.

Because of this key difference, it is a big challenge to fit PBCore into the NPR API model. For that reason, I would like to recommend a change to the PBCore schema that lets it handle its current requests as well as feed-based requests. Here is a sample of the changes:


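    <PBCoreCollection>
        <PBCoreCollectionTitle>Title of this collection of stories</PBCoreCollectionTitle>
        <PBCoreCollectionDescription>This news feed contains stories that 
        meet all of the following criteria: (1) Stories aired on "All 
        Things Considered".  (2) Stories from the "Afghanistan"</PBCoreCollectionDescription>
        <PBCoreCollectionSource>NPR API</PBCoreCollectionSource>
        <PBCoreCollectionLink>Link back to the source for this feed</PBCoreCollectionLink>
        <PBCoreCollectionPubdate>Thu, 16 Oct 2008 06:00:00 -0400</PBCoreCollectionPubdate>
        <PBCoreDescriptionDocument>
            <!-- OTHER ELEMENTS GO HERE -->
        </PBCoreDescriptionDocument>
        <PBCoreDescriptionDocument>
            <!-- OTHER ELEMENTS GO HERE -->
        </PBCoreDescriptionDocument>
        <PBCoreDescriptionDocument>
            <!-- OTHER ELEMENTS GO HERE -->
        </PBCoreDescriptionDocument>
    </PBCoreCollection>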

The key to the recommendation is to wrap all of the current XML in a parent node called <PBCoreCollection>. The namespace attributes, which were previously attached to the <PBCoreDescriptionDocument> element, have been moved up to the collection element.

The <PBCoreCollection> node may then contain several child nodes that describe the collection itself, such as title, description, source, and date (similar to the <channel> node in an RSS feed). Additionally, the <PBCoreCollectionLink> element carries over a concept from RESTful models: it provides a link back to the source and identifies how the source produced this feed.

The <PBCoreCollection> node may contain any number of iterations of the <PBCoreDescriptionDocument> as sub-nodes.

After describing the collection, the PBCore document will then provide the list of actual documents. For each <PBCoreDescriptionDocument> element in this overall document, I have made no changes (other than lifting the namespace attributes to the collection element).
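As a rough illustration of the structure described above (not part of the proposal itself), a collection could be assembled like this with Python's standard xml.etree.ElementTree module. The function name build_collection and the placeholder documents are hypothetical, and the namespace attributes are left out because their values are not given here.

```python
import xml.etree.ElementTree as ET


def build_collection(title, description, source, link, pubdate, documents):
    """Wrap PBCoreDescriptionDocument elements in one PBCoreCollection.

    Hypothetical helper: the element names follow the sample in this post,
    and namespace attributes are omitted.
    """
    collection = ET.Element("PBCoreCollection")

    # Collection-level metadata comes first, like the header of a feed.
    for tag, text in [
        ("PBCoreCollectionTitle", title),
        ("PBCoreCollectionDescription", description),
        ("PBCoreCollectionSource", source),
        ("PBCoreCollectionLink", link),
        ("PBCoreCollectionPubdate", pubdate),
    ]:
        ET.SubElement(collection, tag).text = text

    # Each description document is appended unchanged.
    for document in documents:
        collection.append(document)
    return collection


# Placeholder documents standing in for real PBCore records.
placeholder_documents = []
for _ in range(3):
    document = ET.Element("PBCoreDescriptionDocument")
    document.append(ET.Comment(" OTHER ELEMENTS GO HERE "))
    placeholder_documents.append(document)

collection = build_collection(
    "Title of this collection of stories",
    "Stories aired on All Things Considered",
    "NPR API",
    "Link back to the source for this feed",
    "Thu, 16 Oct 2008 06:00:00 -0400",
    placeholder_documents,
)
xml_text = ET.tostring(collection, encoding="unicode")
```

Because the documents are appended as they are, an existing PBCore producer would only need to add the wrapper and the collection-level elements.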

The purpose of this approach, again, is to enable multiple PBCore documents to be delivered in one transaction as one document. To ensure backward compatibility, existing implementations can continue to process items one at a time in this model by sending a collection that contains only one <PBCoreDescriptionDocument>. That said, I believe that PBCore should recommend to implementers that they use the feed-based approach. The most expensive part of data transfer from one system to another is always going to be the transaction itself, not the parsing of the document. So, for those 20 ATC segments, performing 20 transactions and parsing each document individually is far less efficient than doing one transaction and parsing the larger document.
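On the receiving side, a consumer could treat a collection of one and a collection of twenty the same way. The sketch below is only an illustration under some assumptions: the function name description_documents is hypothetical, the elements are un-namespaced for brevity, and accepting a bare <PBCoreDescriptionDocument> as the root is an extra convenience rather than something the proposal requires.

```python
import xml.etree.ElementTree as ET


def description_documents(xml_text):
    """Return the PBCoreDescriptionDocument elements found in a response.

    Accepts either a PBCoreCollection wrapper or a single bare
    PBCoreDescriptionDocument, so the caller can loop over the result
    in both cases.
    """
    root = ET.fromstring(xml_text)
    if root.tag == "PBCoreCollection":
        # Direct children only: the documents sit right under the wrapper.
        return root.findall("PBCoreDescriptionDocument")
    if root.tag == "PBCoreDescriptionDocument":
        return [root]
    raise ValueError("unexpected root element: " + root.tag)


feed = (
    "<PBCoreCollection>"
    "<PBCoreCollectionTitle>Title of this collection</PBCoreCollectionTitle>"
    "<PBCoreDescriptionDocument/>"
    "<PBCoreDescriptionDocument/>"
    "</PBCoreCollection>"
)
for document in description_documents(feed):
    # A real consumer would map each document into its own data model here.
    pass
```

One request and one parse then cover every item in the feed, which is the efficiency argument made above.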

Jack, Dave and I have also discussed potential conflicts with some current implementations that handle the transferring of multiple documents differently. So, if one system currently zips up multiple documents and another sends them individually, these two systems may not be able to work together without custom development, even though both of them are currently complying with the PBCore standard. Extending the standard to provide a method for distribution of multiple documents would also standardize the development practices around distribution of PBCore documents.

The three of us have had many conversations about <PBCoreCollection> and see great merit in this schematic change for PBCore. Although NPR is the real-life scenario that has surfaced this proposal, we believe that the merit of this approach goes far beyond working within the NPR framework.

Although we have put together this proposed model, we know that there are other great minds that could help us refine the recommendation. We look forward to your feedback and to an engaging discussion!

