PBCore Recommendation : PBCoreCollection

Written by daniel_jacobson on Tuesday, December 08, 2009

Over the last few months, I have worked with Jack Brighton and Dave Rice to have the NPR API (http://www.npr.org/api) output PBCore as a supported format. In the early stages, we were able to put together a mapping of NPRML (our native XML format) to PBCore. From this mapping, my team and I started conceptualizing how this would work within the framework of the API. This exercise ultimately failed because of a philosophical issue between PBCore and the NPR API.

PBCore's implementation focuses on the individual conceptual asset (in NPRML terms, the story). So, if a station wants to receive the 20 stories from today's All Things Considered in PBCore, the station would receive 20 documents, one for each segment (and possibly another document for the program episode record). Meanwhile, the NPR API is a feed-oriented model, which means that a station that wants all 20 ATC segments would make a single request that delivers all 20 items, as well as the information about the program episode. The NPR API model matches many of the more popular feed types in the marketplace, including RSS, ATOM, Podcast, etc.

Because of this key difference, fitting PBCore into the NPR API model is a big challenge. I would therefore like to recommend an extension to the PBCore schema that lets it handle both its current single-document requests and feed-based requests. Here is a sample of the changes:

 
<PBCoreCollection 
xsi:schemaLocation="http://www.pbcore.org/PBCore/PBCoreNamespace.html 
http://www.pbcore.org/PBCore/PBCoreXSD_Ver_1-2-1.xsd" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">

    <PBCoreCollectionTitle>Title of this collection of stories
    </PBCoreCollectionTitle>
    <PBCoreCollectionDescription>This news feed contains stories that 
    meet all of the following criteria: (1) Stories aired on "All 
    Things Considered".  (2) Stories from the "Afghanistan" 
    topic.</PBCoreCollectionDescription>
    <PBCoreCollectionSource>NPR API
    </PBCoreCollectionSource>
    <PBCoreCollectionLink>Link back to the source for this feed
    </PBCoreCollectionLink>
    <PBCoreCollectionPubdate>Thu, 16 Oct 2008 06:00:00 -0400
    </PBCoreCollectionPubdate>
   
    <PBCoreDescriptionDocument>
            <!-- OTHER ELEMENTS GO HERE -->
    </PBCoreDescriptionDocument>
    <PBCoreDescriptionDocument>
            <!-- OTHER ELEMENTS GO HERE -->
    </PBCoreDescriptionDocument>
    <PBCoreDescriptionDocument>
            <!-- OTHER ELEMENTS GO HERE -->
    </PBCoreDescriptionDocument>
</PBCoreCollection>

The key to the recommendation is to wrap all of the current XML in a parent node called <PBCoreCollection>. The namespace attributes, which were previously attached to the <PBCoreDescriptionDocument> element, have been moved up to the collection element.
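
As a rough illustration of the wrapping step, here is a minimal Python sketch using only the standard library. It is not part of the proposal itself: the helper names `q` and `wrap_in_collection` are made up for this example, and it assumes each incoming document is a standalone `<PBCoreDescriptionDocument>` in the PBCore namespace shown in the sample above.

```python
import xml.etree.ElementTree as ET

PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
# Serialize PBCore elements without a prefix, declaring the namespace once on the root.
ET.register_namespace("", PBCORE_NS)


def q(tag):
    """Qualify a tag name with the PBCore namespace (hypothetical helper)."""
    return f"{{{PBCORE_NS}}}{tag}"


def wrap_in_collection(documents, title, description):
    """Wrap standalone description documents in one proposed PBCoreCollection."""
    collection = ET.Element(q("PBCoreCollection"))
    ET.SubElement(collection, q("PBCoreCollectionTitle")).text = title
    ET.SubElement(collection, q("PBCoreCollectionDescription")).text = description
    for xml_text in documents:
        doc = ET.fromstring(xml_text)
        if doc.tag != q("PBCoreDescriptionDocument"):
            raise ValueError(f"unexpected root element: {doc.tag}")
        # The document itself is appended unchanged.
        collection.append(doc)
    return collection


story = (
    f'<PBCoreDescriptionDocument xmlns="{PBCORE_NS}">'
    "<!-- OTHER ELEMENTS GO HERE -->"
    "</PBCoreDescriptionDocument>"
)
collection = wrap_in_collection([story, story], "ATC stories", "Sample feed")
xml_out = ET.tostring(collection, encoding="unicode")
```

In the serialized output the namespace declaration appears only on the collection element, which mirrors the idea of lifting the namespace attributes up from the individual documents.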

The <PBCoreCollection> node may then contain several child nodes that describe the collection itself, such as title, description, source, and date (similar to the <channel> node in an RSS feed). Additionally, the <PBCoreCollectionLink> element carries over a concept from REST-ful models: it provides a link back to the source that identifies how the source produced this feed.

The <PBCoreCollection> node may contain any number of iterations of the <PBCoreDescriptionDocument> as sub-nodes.

After describing the collection, the PBCore document will then provide the list of actual documents. For each <PBCoreDescriptionDocument> element in this overall document, I have made no changes (other than lifting the namespace attributes to the collection element).
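
On the receiving side, a consumer might read such a feed along these lines. This is only a sketch under stated assumptions: `read_collection` is a hypothetical name, the element names come from the sample above, and the publication date is assumed to follow the RFC 822 style used in that sample.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

NS = {"pb": "http://www.pbcore.org/PBCore/PBCoreNamespace.html"}

FEED = """\
<PBCoreCollection xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
    <PBCoreCollectionTitle>Title of this collection of stories
    </PBCoreCollectionTitle>
    <PBCoreCollectionPubdate>Thu, 16 Oct 2008 06:00:00 -0400
    </PBCoreCollectionPubdate>
    <PBCoreDescriptionDocument><!-- OTHER ELEMENTS GO HERE --></PBCoreDescriptionDocument>
    <PBCoreDescriptionDocument><!-- OTHER ELEMENTS GO HERE --></PBCoreDescriptionDocument>
    <PBCoreDescriptionDocument><!-- OTHER ELEMENTS GO HERE --></PBCoreDescriptionDocument>
</PBCoreCollection>
"""


def read_collection(xml_text):
    """Return (title, publication datetime, list of document elements)."""
    root = ET.fromstring(xml_text)
    # Collection-level metadata comes first; strip the whitespace around the text.
    title = root.findtext("pb:PBCoreCollectionTitle", default="", namespaces=NS).strip()
    pubdate = root.findtext("pb:PBCoreCollectionPubdate", default="", namespaces=NS).strip()
    published = parsedate_to_datetime(pubdate)
    # Each description document can then be handed to existing PBCore handling code.
    documents = root.findall("pb:PBCoreDescriptionDocument", NS)
    return title, published, documents


title, published, documents = read_collection(FEED)
```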

The purpose of this approach, again, is to enable multiple PBCore documents to be delivered in one transaction as one document. For backward compatibility, existing implementations can continue to process items one at a time under this model by using collections that contain only one <PBCoreDescriptionDocument>. That said, I believe that PBCore should recommend to implementers that they use the feed-based approach. The most expensive part of data transfer from one system to another is always going to be the transaction itself, not the parsing of the document. So, for those 20 ATC segments, performing those 20 transactions and parsing them individually is far less efficient than doing one transaction and parsing the larger document.
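
One way a receiving system could support both shapes during a transition is to accept either root element. The following Python sketch assumes exactly that; `iter_documents` is a hypothetical helper, not something defined by PBCore or the NPR API.

```python
import xml.etree.ElementTree as ET

PBCORE_NS = "http://www.pbcore.org/PBCore/PBCoreNamespace.html"
COLLECTION = f"{{{PBCORE_NS}}}PBCoreCollection"
DOCUMENT = f"{{{PBCORE_NS}}}PBCoreDescriptionDocument"


def iter_documents(xml_text):
    """Yield description documents from a single document or a collection."""
    root = ET.fromstring(xml_text)
    if root.tag == DOCUMENT:
        # Existing behavior: one document per transaction.
        yield root
    elif root.tag == COLLECTION:
        # Proposed behavior: many documents in one transaction.
        yield from root.findall(DOCUMENT)
    else:
        raise ValueError(f"unexpected root element: {root.tag}")


SINGLE = f'<PBCoreDescriptionDocument xmlns="{PBCORE_NS}"/>'
FEED = (
    f'<PBCoreCollection xmlns="{PBCORE_NS}">'
    "<PBCoreCollectionTitle>Sample feed</PBCoreCollectionTitle>"
    "<PBCoreDescriptionDocument/>"
    "<PBCoreDescriptionDocument/>"
    "</PBCoreCollection>"
)

single_count = len(list(iter_documents(SINGLE)))
feed_count = len(list(iter_documents(FEED)))
```

The per-document processing stays the same in both branches; only the number of transactions needed to obtain the documents changes.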

Jack, Dave and I have also discussed potential conflicts with some current implementations that handle the transferring of multiple documents differently. So, if one system currently zips up multiple documents and another sends them individually, these two systems may not be able to work together without custom development, even though both of them are currently complying with the PBCore standard. Extending the standard to provide a method for distribution of multiple documents would also standardize the development practices around distribution of PBCore documents.

The three of us have had many conversations about <PBCoreCollection> and see great merit in this schematic change for PBCore. Although NPR is the real-life scenario that has surfaced this proposal, we believe that the merit of this approach goes far beyond working within the NPR framework.

Although we have put together this proposed model, we know that there are other great minds that could help us refine the recommendation. We look forward to your feedback and to an engaging discussion!


Comments:

  • Chris Beer said on 12/08 at 05:55 PM

    Daniel,

    Before I comment on your example implementation, I’d love to hear more about why you chose to create PBCoreCollection, rather than using/defining:

    - a manifest standard (based on something like http://www.cdlib.org/inside/diglib/bagit/bagitspec.html, which would also get you file transfer), or

    - an aggregation standard like Atom or OAI-PMH (http://www.openarchives.org/pmh/).

    I guess I’m a little worried about introducing yet another aggregation standard, but this might address shortcomings in those standards..

    Chris

  • daniel_jacobson said on 12/09 at 10:33 AM

    Chris,
    Thanks for the comment.  It seems like we are at least agreeing that PBCore could benefit from a standard for distributing multiple documents, so that is good.
    One of the key drivers for me in making this suggestion is improving adoption (including by NPR - we are having trouble implementing PBCore with the current spec).  Heavy-weight, hard to implement, systems that require custom development or server-side integration, will result in lower adoption rates.  A slimmer solution is much more likely to gain adoption and generate traction in getting open source tools and wrappers built for it.
    If I understand you correctly, using other standards like Atom and OAI-PMH as wrappers for the PBCore standard raises concerns about making PBCore dependent on extended namespaces or some integration with those standards (a dependency that could snap PBCore if these other standards shift).  On the other hand, the addition of PBCoreCollection was meant to be a very light-weight addition to the existing PBCore document that enables it to bundle multiple documents without compromising the core elements of the documents themselves. 

    This approach also sets PBCore up well to become more REST-ful (http://en.wikipedia.org/wiki/Representational_State_Transfer).  Although the other integrations points that you mention (and perhaps others) are worthy of discussion, my biggest concerns with them would be about our ability to improve adoption of PBCore and the fact that package-based approaches seem to be losing to REST-ful API’s and more dynamic/accessible distribution methods.
    I would be interested in hearing more about how you think this will/won’t work and if this suggestion is/isn’t consistent with the spirit of PBCore.

  • Chris Beer said on 12/09 at 04:35 PM

    I agree that OAI-PMH is probably too heavy-weight, but it is trying to solve a very similar problem. That said, I am personally inclined towards manifest-based standards like BagIt, which is getting pretty good support from the Library of Congress, or an Atom serialization like OAI-ORE. By supporting these standards, we can better integrate with libraries, archives and other institutions (who, I imagine, after public broadcasters, are a likely audience for a PBCore feed).

    Do you have a full example of the PBCoreCollection document? I’d love to take it and create examples of BagIt or OAI-ORE for comparison.

  • Jack Brighton said on 12/10 at 11:04 AM

    Here’s an example of a PBCoreCollection document:

    http://will.illinois.edu/metadata/pbcoreCollection/

    Caveats: The namespace is messed up because of course there is no such thing as a PBCoreCollection document. Also, I’m using PBCore version 1.1 because that’s what the AAPP wants. I will create a 1.2.1 version as soon as I can, but other AAPP work must take precedence given the zooming deadline. Also, all those empty elements won’t be empty much longer, it’s just a bit messy this moment…

    See what you can do with this?

    Cheers,
    Jack

  • Chris Beer said on 12/12 at 01:52 PM

    Thanks Jack,

    I’ve put together an example Atom feed that puts the PBCoreDescriptionDocument into <atom:content>, which seems appropriate (this also means we could reference external documents using <atom:content src="" />). The data duplication in the <atom:entry> itself isn’t ideal, but actually might be oddly useful..

    http://cbeer.info/~chris/atom-pbcore-mash.xml

    The biggest advantage of this approach, I think, is that existing atom parsers and libraries can handle the aggregation component and PBCore the descriptive metadata. Even without using an Atom parser, the additional work to parse this XML isn’t that different from what you’d need to do with a PBCoreCollection, e.g.:

    <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:pbcore="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
    <!-- Suppress the built-in rule that copies text nodes to the output -->
    <xsl:template match="text()" />
    <xsl:template match="/">
      <xsl:apply-templates select="//pbcore:PBCoreDescriptionDocument" />
    </xsl:template>

    <!-- Emit the first identifier found inside each description document -->
    <xsl:template match="pbcore:PBCoreDescriptionDocument">
        <xsl:value-of select=".//pbcore:identifier" />
    </xsl:template>

    </xsl:stylesheet>

    Chris

  • Jack Brighton said on 12/14 at 11:28 AM

    This is great stuff Chris! We’ve been trying to move this forward in terms of aggregation for a long time, and it seems like the PBCoreCollection root element was the missing ingredient.

    The PBCoreCollection example I provided is based on PBCore Version 1.1. I’ll update it to 1.2.1 just for grins. I expect there will be a new version some time soon, and hopefully this work informs the need to add the collection root element.

  • Kara Van Malssen said on 12/17 at 01:44 PM

    Pardon my naivete on this subject, but what about using RDF to encapsulate multiple pbcore documents? Does it have the same capabilities to generate a feed-like output where the various PBCoreDescriptionDocuments could be contained in one RDF XML document?

  • Chris Beer said on 12/19 at 05:43 PM

    Kara—are you thinking of something like OAI-ORE and its RDF serialization? I’ll try to mock one of those too..

  • Chris Beer said on 02/13 at 12:06 PM

    For AAPP, I put together an OAI-ORE Atom serialization for OPB that went something like:

    <?xml version="1.0" encoding="utf-8"?>
    <entry xmlns="http://www.w3.org/2005/Atom">
    <id>aapp:wgbh</id>
    <source>
      <author>
          <name>WGBH Media Library and Archives</name>
          <uri>http://wgbh.org</uri>
      </author>
      <id>wgbh.org</id>
      <updated>2010-01-08T12:35:00Z</updated>
      <title>WGBH Media Library and Archives</title>
    </source>
    <published>2010-01-08T12:35:00Z</published>
    <updated>2010-01-08T12:35:00Z</updated>

    <title>WGBH PBCore metadata for the American Archive Pilot Project repository</title>
    <author>
      <name>WGBH Media Library and Archives</name>
    </author>
    <author>
      <name>Chris Beer</name>
      <email>chris_beer@wgbh.org</email>
    </author>

    <!-- hrefs are relative to the BagIt directory -->
    <link rel="http://www.openarchives.org/ore/terms/aggregates" href="data/007a2a8d38c9f5a39188a67eae93cba5b8c7ee5d.xml" title="Americas: Old World of Negro Americans, The: Willard T. Johnson. [Part 1 of 2, Reel 2 of 2]" type="text/xml" hreflang="en" />
    <link rel="http://www.openarchives.org/ore/terms/aggregates" href="data/011d441e950fb1f7993548237336a6dc08932cf0.xml" title="Evening Compass, The: September In Boston" type="text/xml" hreflang="en" />
    <link rel="http://www.openarchives.org/ore/terms/aggregates" href="data/02ac3da1ba4067993efba970e7cf55ba9e524484.xml" title="Ten O’Clock News, The:" type="text/xml" hreflang="en" />
    <link rel="http://www.openarchives.org/ore/terms/aggregates" href="data/0304441bc22f97b936352943acbe9bc329d47688.xml" title="March on Washington" type="text/xml" hreflang="en" />
    <link rel="http://www.openarchives.org/ore/terms/aggregates" href="data/0370b22ae37771d6abad63d8f17b674fe12b16f5.xml" title="James Baldwin at MIT: On Civil Rights" type="text/xml" hreflang="en" />
    <link rel="http://www.openarchives.org/ore/terms/aggregates" href="data/03aeca61ae1d87376a0f8b1c250f0fe0b8d772f1.xml" title="Evening Compass, The: [9/9/1975]" type="text/xml" hreflang="en" />
    <link rel="http://www.openarchives.org/ore/terms/aggregates" href="data/043bfaae1716ac4fca1d851322d2abc09134ae70.xml" title="We Shall Overcome" type="text/xml" hreflang="en" />
    [...]
    </entry>
