Article category: Change requests
Deadline for PBCore 2.0 change requests looms
July 25th is the official deadline for input on the next major revision of PBCore. The release of PBCore 2.0, expected in November 2010, should represent a major leap forward, based on many change requests and research among the user community, plus where projects like the American Archive need to go. This is our chance to make sure it’s going in the most useful direction and solving the right problems.
You can see the full list of submitted change requests, and add your own, here:
By July 25th please!
PBCore.org Alpha site launched!
Head on over to PBCore.org, check out the face lift, and leave your feedback!
WGBH is in the process of re-designing PBCore.org and we’ve just launched the “alpha” site which includes:
The interior pages remain the same for now, as the full re-launch is scheduled for November 2010 to coincide with the release of PBCore 2.0.
To that end, please contribute change requests for the schema, elements, etc., by June 30, 2010.
This summer, we’ll be conducting card sort and user testing exercises to re-organize the site’s new and legacy content and to make it as clear, concise and useful as possible for PBCore users. If you have suggestions or would like to participate please contact us or participate in the blog.
Many thanks to all who provided, and who continue to provide, input into the new site, including Jack Brighton, Paul Burrows, Nan Rubin, the Code4Lib community, and the WGBH PBCore team!
PBCore and the Semantic Web
Recently I had a chance to discuss with Dave Rice, Dan Jacobson, and Chris Beer the potential role of PBCore in enabling content to flow in the “semantic web.” The topic has also come up in discussions around the development of PBCore 2.0. It seems like a good idea to open the topic for broader consideration.
The Semantic Web would most likely be implemented by Linked Data. Using URIs and RDF, “Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.” Those other methods would include linking things together by hand, piece by piece, which webmasters of the world can tell you doesn’t exactly scale well.
So how does PBCore fit into linked data using URIs and RDF? I don’t have anything close to a complete answer, just ideas that came from the discussion. I would love some feedback to take this thing further.
Linked Data depends upon having Uniform Resource Identifiers for content items, their descriptive/administrative metadata, and their relationships to other items and descriptors. Every resource should have a URI, whether the resource is a file or a subject term. Relationships between resources also get URIs, such as “friend” or “author”. With URIs for everything, you can establish so-called Triples (subject -> predicate -> object) as machine-processable links, which then establish other machine-processable links to other resources, and the Linked Data universe scales up from there based on network effects. These Triples and their relationships with other resources are made possible by RDF statements. How do we create RDF statements using PBCore?
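To make the triple idea a bit more concrete, here is a minimal Python sketch. It is only an illustration: every URI below is made up under example.org, and the `Triple` class and `objects_of` helper are hypothetical names, not part of PBCore or of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One subject -> predicate -> object statement."""
    subject: str
    predicate: str
    obj: str


# All URIs here are invented for illustration (example.org).
RECORDING = "http://example.org/media/lecture-040"
HUGHES = "http://example.org/people/langston-hughes"
CONTRIBUTOR = "http://example.org/terms/contributor"
NAME = "http://example.org/terms/name"

store = [
    Triple(RECORDING, CONTRIBUTOR, HUGHES),
    Triple(HUGHES, NAME, "Hughes, Langston"),
]


def objects_of(subject: str, predicate: str, triples: list[Triple]) -> list[str]:
    """Return every object linked from `subject` by `predicate`."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


# Follow two links in a row: recording -> contributor -> name.
contributor_names = [
    name
    for person in objects_of(RECORDING, CONTRIBUTOR, store)
    for name in objects_of(person, NAME, store)
]
print(contributor_names)
```

Because the contributor is a URI rather than a string, a second data set that uses the same URI could be merged into `store` and traversed with the same helper.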
At the most basic level, a PBCore record may already contain many URIs. For a streaming media file, for example, a valid URI might be http://will.illinois.edu/media/mp3/illinoisbroadcastarchives_040T_96k.mp3. This file is described by elements like:
<pbcoreSubject>
  <subject>African American Civil Rights</subject>
  <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
</pbcoreSubject>
<pbcoreContributor>
  <contributor>Hughes, Langston</contributor>
  <contributorRole>Lecturer</contributorRole>
</pbcoreContributor>
Do we have URIs for “African American Civil Rights,” “American Archive Civil Rights Subject Categories,” “Hughes, Langston” and “Lecturer”? They aren’t currently in this PBCore document but perhaps they could be if we did something radical (for PBCore) like adding attributes. We could then tap into existing namespaces or create our own as needed, using and assigning URIs for each and every element in our PBCore records.
But this adds a great deal of additional work and complication. PBCore was invented to provide interoperability among public media systems, and if we all used plain old PBCore it would work for that purpose. If we raise the bar too high and require URIs, fewer people will be able to adopt PBCore in their local workflows and systems.
Yet it would be very useful to add URIs to key elements like subject and genre terms, contributors, creators, and publishers. We could even apply this to the currently almost meaningless pbcoreRightsSummary, for example using Creative Commons license URIs.
Such an approach could be incredibly powerful, but I believe it should be optional. URI attributes should be allowed but not required in PBCore 2.0. We might also create PBCore Profiles which use URIs in specific namespaces; for example in a future American Archive PBCore Profile there could be subject and genre namespaces facilitating a common taxonomy. This could solve one of the thorniest problems of interoperability in different content collections.
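As a rough sketch of what "allowed but not required" could mean for a consumer, the Python below reads subject terms whether or not they carry a URI. The attribute name `ref`, the example.org URI, and the second subject term ("Poetry") are assumptions made up for this illustration; nothing like them exists in the PBCore schema today.

```python
import xml.etree.ElementTree as ET

# "ref" is a hypothetical attribute and the URI is invented.
FRAGMENT = """
<PBCoreDescriptionDocument>
  <pbcoreSubject>
    <subject ref="http://example.org/subjects/civil-rights">African American Civil Rights</subject>
    <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
  </pbcoreSubject>
  <pbcoreSubject>
    <subject>Poetry</subject>
  </pbcoreSubject>
</PBCoreDescriptionDocument>
"""


def subjects(xml_text):
    """Return (term, uri_or_None) for each subject element."""
    root = ET.fromstring(xml_text.strip())
    # Element.get returns None when the optional attribute is absent.
    return [(el.text, el.get("ref")) for el in root.iter("subject")]


for term, uri in subjects(FRAGMENT):
    print(term, "->", uri or "(no URI supplied)")
```

A system that knows nothing about URIs can keep reading the element text as before, which is the point of making the attribute optional.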
But I believe content creators in pubmedia will never universally adopt the same taxonomies, nor should they. After all, we’re probably all dealing with a babel of descriptive metadata and controlled vocabularies generated by various legacy systems and catalog records, and we can’t make them conform to a new common standard. I also like the idea of harvesting user-generated keywords for my content and adding that to the metadata mix. Also, PBCore Profiles might work within the PBCore user community, but what about linking to other content domains? I want my data to interoperate not only inside pubmedia, but also with related data anywhere. A story on the new water plant drawing from the Central Illinois Mahomet Aquifer could link to historical information, climate records, corporate earnings reports, and huge data sets on the geography and hydrology of the region. Each of these domains has tons of public data but it’s all in different formats and none of it is PBCore. Almost none of it is RDF.
This leads to the notion that something else should create the RDF. It would harvest whatever it could from available data and build an explorable triple store. This would enable other applications to traverse, search, and build interesting interfaces to a growing universe of linked data.
Several software projects are working on tools to do this. Project Tupelo can harvest existing URI schemes, or in their absence create them for a given set of data. We’re attempting to test Tupelo with our content collections, and hope to have something to report later this year.
But I bet there are other tools and approaches, and maybe it’s simpler than I’m seeing it. If you know something about this please respond by commenting on this post, or starting a new one.
Use of PBCore in the American Archive Pilot Project
Illinois Public Media was one of the 20-some public TV and Radio stations in the CPB-funded American Archive Pilot Project. The AAPP required participating stations to use PBCore as a metadata format, at least in principle. I decided to push implementation of PBCore in my AAPP content collection as far as possible using the toolset I used on a previous video archive project (Prairiefire on WILL-TV).
This toolset is based on the website Content Management System called ExpressionEngine, which makes setting up a particular database structure rather easy. I set up the database structure based on PBCore elements, with controlled vocabularies reflecting the AAPP taxonomy and suggested PBCore picklists. I then created xml templates in ExpressionEngine to render my AAPP collection metadata as valid PBCore records. I then went one step further, following discussions with Dan Jacobson and David Rice, and created a PBCoreCollection wrapper containing all 235 of the PBCore item records (each as a PBCoreDescriptionDocument) in my collection. The national portal for the AAPP, being developed and hosted at Oregon Public Broadcasting, was able to simply ingest the PBCoreCollection, demonstrating the viability of this approach to aggregating a large collection from multiple content sources.
This article details the methods used to accomplish this in ExpressionEngine. Similar methods could be used in Drupal, which we’re working on now.
In ExpressionEngine, one can easily define a set of fields to input data. For example a blog would need fields for a Title, a Body, and maybe a separate Image upload field along with a label field for the image (so you could add a caption or an alt tag at least). When you create these fields, you also pick a field type: textarea, dropdown list, file upload, etc. EE has several pre-defined field types and there are dozens of addons from third-party developers to add more.
One of the really great EE addons is FieldFrame, developed by Brandon Kelly. FieldFrame is a framework for developing new EE fieldtypes, and there are a bunch of good ones. The most important for our EE PBCore tool is called FF Matrix, which allows you to bundle several fields in a “row” of related data.
Here’s the way you create an FF Matrix field in ExpressionEngine:
With an FF Matrix field, you can do things like enter a PBCore subject tied to a subjectAuthorityUsed, or title along with titleType. Since most PBCore elements are wrapped in pairs like this, it’s important to solve this in a straightforward way. With FF Matrix, you can enter as many linked pairs as needed. For example, when an item has many subject terms, each term is wrapped individually along with its corresponding subjectAuthorityUsed.
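The row-of-pairs idea maps naturally onto PBCore's wrapper elements. The Python sketch below is not ExpressionEngine or FF Matrix code; it only illustrates, under the assumption that each row arrives as a (subject, authority) tuple, how such rows could be rendered one wrapper per row. The function name `render_subjects` and the second sample row are hypothetical.

```python
import xml.etree.ElementTree as ET


def render_subjects(rows):
    """Turn (subject, authority) rows into one pbcoreSubject string per row."""
    rendered = []
    for subject, authority in rows:
        wrapper = ET.Element("pbcoreSubject")
        ET.SubElement(wrapper, "subject").text = subject
        if authority:  # leave the element out when no authority was entered
            ET.SubElement(wrapper, "subjectAuthorityUsed").text = authority
        rendered.append(ET.tostring(wrapper, encoding="unicode"))
    return rendered


rows = [
    ("African American Civil Rights", "American Archive Civil Rights Subject Categories"),
    ("Oral history", ""),  # made-up row with no authority entered
]
for line in render_subjects(rows):
    print(line)
```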
Here’s the PBCore Item entry form showing a number of such fields (but not the entire form which is a bit long):
We used this form to enter all the Intellectual Content and Intellectual Property metadata for each media item. Nothing in this Item form relates to the physical or digital Instantiation of that item. For that we used a different form with fields and fieldtypes defined specifically for Instantiation metadata. Here’s the fun part: One of the fieldtypes in the Instantiation form is a “relationship” field, which allows you to select an existing Item to which the Instantiation should be linked. So if you have several Instantiations, like a WAV file, an MP3, and an analog tape of the same Item, you create an Instantiation record for each and link them to the Item.
This proved to be a quick and effective way to link multiple Instantiations with a single Item.
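Stripped of the CMS, the relationship field amounts to a foreign key from each Instantiation to its Item. The sketch below is a generic Python illustration of that data model, not ExpressionEngine's actual storage; the class names, field names, and sample title are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: int
    title: str


@dataclass
class Instantiation:
    item_id: int   # plays the part of the "relationship" field
    format: str


items = {1: Item(1, "Sample lecture")}  # hypothetical record
instantiations = [
    Instantiation(1, "WAV file"),
    Instantiation(1, "MP3 file"),
    Instantiation(1, "Analog tape"),
]


def formats_for(item_id, records):
    """List the formats of every Instantiation linked to the given Item."""
    return [r.format for r in records if r.item_id == item_id]


print(items[1].title, "->", formats_for(1, instantiations))
```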
You might be able to see that some of the fields are blank, and their instructions say things like “formatDataRate - If MP3 file don’t enter anything.” Lots of the technical metadata, like formatFileSize, could be extracted automatically from the digital files by the system, so we don’t have to enter that data by hand. EE has a nice addon called MP3 Info + that does most of that work.
David Rice has developed better methods of reading file metadata into his PBCore Records Repository using a free tool called MediaInfo. We should get him to write more about that, as it’s work that could be leveraged and used in different systems I’m sure.
After entering all the metadata for our collection using the two forms above, the payoff is in rendering everything in usable form. Since it’s all in the CMS, it’s a simple matter to make a website displaying everything, and providing media players for the files. In fact we did this initially for the catalogers so they could work remotely and listen to and view the audio and video files.
This site was intended for that purpose: http://will.illinois.edu/metadata/aapp-inventory-all/.
As the catalogers added descriptive metadata, the site became much more interesting! We added as much descriptive stuff as possible, even full tape logs for some of the World War II oral history interviews. I chose not to display all that metadata on the web page, but it is rendered in the PBCore XML record for each item.
For example, here is a web page for one such interview: http://will.illinois.edu/metadata/aapp-inventory-all/WWII_oral_history_WesleyMatthews2008-02-21
And here is the PBCore record for the same interview: http://will.illinois.edu/metadata/pbcoreAAPP/wwii_oral_history_wesleymatthews2008-02-21
The way these are rendered is simple: an html template for the web page, and an xml template for the PBCore record, both drawing from the same database. In ExpressionEngine this is very simple to set up, and once it’s set up, you’re done.
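The two-templates, one-database pattern is not specific to ExpressionEngine. Here is a small Python sketch of the same idea using the standard library's `string.Template`; the template contents, the sample record, and the `render` helper are all assumptions for illustration, and a real PBCore record would contain many more elements.

```python
from html import escape
from string import Template

# One template per output format, both fed by the same record.
HTML_TEMPLATE = Template("<h1>$title</h1>\n<p>$description</p>")
PBCORE_TEMPLATE = Template(
    "<pbcoreTitle><title>$title</title></pbcoreTitle>\n"
    "<pbcoreDescription><description>$description</description></pbcoreDescription>"
)

# Made-up record standing in for a database row.
record = {
    "title": "Oral history: farm life & work",
    "description": "Sample interview record.",
}


def render(template, row):
    """Escape &, < and > in every value, then fill in the template."""
    safe = {key: escape(value, quote=False) for key, value in row.items()}
    return template.substitute(safe)


print(render(HTML_TEMPLATE, record))
print(render(PBCORE_TEMPLATE, record))
```

Escaping the values once, before substitution, keeps a stray ampersand in a title from breaking either output.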
Finally, as mentioned above I chose to try implementing the idea of a PBCoreCollection wrapper element, enclosing all 235 of the individual PBCoreDescriptionDocuments in my AAPP media collection. This is, of course, not a valid wrapper element in any PBCore version to date. This experience suggests that it should be. OPB was able to ingest my entire collection in a single gulp from this URL. Other stations in the AAPP were able to export using the same method (PBCoreCollection) even though they have different local systems. The ability to render a PBCoreCollection is all that matters, not the underlying system that rendered it.
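For systems that already produce individual records, wrapping them is a small step. The Python sketch below assumes each record is available as an XML string and simply nests them under one PBCoreCollection root; the function name `build_collection` and the two sample records are hypothetical, and namespaces are left out to keep the example short.

```python
import xml.etree.ElementTree as ET


def build_collection(document_strings):
    """Nest already-rendered description documents under one wrapper element."""
    collection = ET.Element("PBCoreCollection")
    for text in document_strings:
        collection.append(ET.fromstring(text))
    return collection


# Two made-up, heavily abbreviated records.
documents = [
    "<PBCoreDescriptionDocument><pbcoreTitle><title>Interview one</title></pbcoreTitle></PBCoreDescriptionDocument>",
    "<PBCoreDescriptionDocument><pbcoreTitle><title>Interview two</title></pbcoreTitle></PBCoreDescriptionDocument>",
]

collection = build_collection(documents)
print(ET.tostring(collection, encoding="unicode"))
```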
I hope this is useful to anyone who might be looking for systems for cataloging media assets and doing various things with them like creating websites and PBCore records or whatever metadata format. I used ExpressionEngine but the basic method would work with Drupal, Plone, and other CMSs and frameworks. Most importantly, regardless of the system used, I hope this demonstration of the power of PBCoreCollection informs the development of PBCore 2.0, which is now in progress.
PBCore Recommendation: PBCoreCollection
Over the last few months, I have worked with Jack Brighton and Dave Rice to have the NPR API (http://www.npr.org/api) output PBCore as a supported format. In the early stages, we were able to put together a mapping of NPRML (our native XML format) to PBCore. From this mapping, my team and I started conceptualizing how this would work within the framework of the API. This exercise ultimately failed because of a philosophical issue between PBCore and the NPR API.
PBCore's implementation focuses on the individual conceptual asset (in NPRML terms, the story). So, if a station wants to receive the 20 stories from today's All Things Considered in PBCore, the station would receive 20 documents, one for each segment (and possibly another document for the program episode record). Meanwhile, the NPR API is a feed-oriented model, which means that a station that wants all 20 ATC segments would make a single request that delivers all 20 items, as well as the information about the program episode. The NPR API model matches many of the more popular feed types in the marketplace, including RSS, ATOM, Podcast, etc.
Because of this key difference, it is a big challenge to fit PBCore into the NPR API model. And because of this difference, I would like to recommend a new implementation to the PBCore schema to allow it to handle its current requests as well as the feed-based requests. Here is a sample of the changes:
<PBCoreCollection
    xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.pbcore.org/PBCore/PBCoreNamespace.html http://www.pbcore.org/PBCore/PBCoreXSD_Ver_1-2-1.xsd">
  <PBCoreCollectionTitle>Title of this collection of stories</PBCoreCollectionTitle>
  <PBCoreCollectionDescription>This news feed contains stories that meet all of the following criteria: (1) Stories aired on "All Things Considered". (2) Stories from the "Afghanistan" topic.</PBCoreCollectionDescription>
  <PBCoreCollectionSource>NPR API</PBCoreCollectionSource>
  <PBCoreCollectionLink>Link back to the source for this feed</PBCoreCollectionLink>
  <PBCoreCollectionPubdate>Thu, 16 Oct 2008 06:00:00 -0400</PBCoreCollectionPubdate>
  <PBCoreDescriptionDocument>
    <!-- OTHER ELEMENTS GO HERE -->
  </PBCoreDescriptionDocument>
  <PBCoreDescriptionDocument>
    <!-- OTHER ELEMENTS GO HERE -->
  </PBCoreDescriptionDocument>
  <PBCoreDescriptionDocument>
    <!-- OTHER ELEMENTS GO HERE -->
  </PBCoreDescriptionDocument>
</PBCoreCollection>
The key to the recommendation is to wrap all of the current XML in a parent node called <PBCoreCollection>. The namespace attributes, which were previously attached to the <PBCoreDescriptionDocument> element, have been moved up to the collection element.
The <PBCoreCollection> node may then contain several child nodes which describe the collection itself, such as title, description, source, and date (similar to the
The <PBCoreCollection> node may contain any number of iterations of the <PBCoreDescriptionDocument> as sub-nodes.
After describing the collection, the PBCore document will then provide the list of actual documents. For each <PBCoreDescriptionDocument> element in this overall document, I have made no changes (other than lifting the namespace attributes to the collection element).
The purpose of this approach, again, is to enable multiple PBCore documents to be delivered in one transaction as one document. To ensure backward compatibility for existing implementations, they can continue to process each item one at a time in this model as well by having only one <PBCoreDescriptionDocument>. That said, I believe that PBCore should recommend to implementers that they use the feed-based approach. The most expensive part of data transfer from one system to another is always going to be the transaction itself, not the parsing of the document. So, for those 20 ATC segments, performing those 20 transactions and parsing them individually is far less efficient than doing one transaction and parsing the larger document.
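On the receiving side, a single collection document can be parsed in one pass. The Python sketch below assumes the proposed <PBCoreCollection> wrapper in the PBCore namespace and reads a title from each enclosed document; the sample feed, its titles, and the `segment_titles` helper are made up for illustration, and the feed is held in a string rather than fetched over the network.

```python
import xml.etree.ElementTree as ET

NS = {"pb": "http://www.pbcore.org/PBCore/PBCoreNamespace.html"}

# Made-up feed with two abbreviated documents.
FEED = """<PBCoreCollection xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
  <PBCoreCollectionTitle>Sample feed</PBCoreCollectionTitle>
  <PBCoreDescriptionDocument>
    <pbcoreTitle><title>Segment one</title></pbcoreTitle>
  </PBCoreDescriptionDocument>
  <PBCoreDescriptionDocument>
    <pbcoreTitle><title>Segment two</title></pbcoreTitle>
  </PBCoreDescriptionDocument>
</PBCoreCollection>"""


def segment_titles(feed_xml):
    """Return the title of every description document in the collection."""
    root = ET.fromstring(feed_xml)
    documents = root.findall("pb:PBCoreDescriptionDocument", NS)
    return [doc.findtext("pb:pbcoreTitle/pb:title", namespaces=NS) for doc in documents]


print(segment_titles(FEED))
```

A collection holding a single <PBCoreDescriptionDocument> goes through the same function unchanged, which is the backward-compatible case described above.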
Jack, Dave and I have also discussed potential conflicts with some current implementations that handle the transferring of multiple documents differently. So, if one system currently zips up multiple documents and another sends them individually, these two systems may not be able to work together without custom development, even though both of them are currently complying with the PBCore standard. Extending the standard to provide a method for distribution of multiple documents would also standardize the development practices around distribution of PBCore documents.
The three of us have had many conversations about <PBCoreCollection> and see great merit in this schematic change for PBCore. Although NPR is the real-life scenario that has surfaced this proposal, we believe that the merit of this approach goes far beyond working within the NPR framework.
Although we have put together this proposed model, we know that there are other great minds that could help us refine the recommendation. We look forward to your feedback and to an engaging discussion!
Some thoughts on future directions of PBCore
For the last couple of weeks, I've been thinking about some ways PBCore could change to be easier to use and friendlier. This was spurred partly by the PBCore 2.0 RFP, and partly by trying to figure out how to teach PBCore in a workshop introducing XML at AMIA '09. This allowed me to take a step back from my use of PBCore to power digital archives and think about some bigger picture issues. I've been jotting my thoughts down elsewhere, but it would be more helpful to the community to record them here as well.
The first two links are likely the most interesting and thought-provoking because they offer some future directions for PBCore. I've tried to keep my ideas in line with current practice, rather than offering radically new ideas (which are probably impossible to implement within PBCore). As I noted in the preface to the first post, my proposals focus primarily on the descriptive aspects of PBCore, which, in my opinion, are its most important part.
The third link offers something to think about, because it offers five different XML metadata schemas that could be used to describe video. I think it is interesting to examine other schemas to see how other communities approached (or avoided) similar problems. I wanted to keep my introductions brief, mainly to create some context for the structures that follow. As I suggest in that post, there is probably some fruitful work in doing detailed comparisons of the schema components.
- 15 ways to improve PBCore - General thoughts about possible directions for PBCore.
- Teaching PBCore, Questions and Notes - Reflections after teaching the PBCore workshop.
- A comparison of metadata standards for media - example instances galore.
It's probably in the interest of this community to leave general comments and feedback on pbcoreresources.org (or even make a post of your own), but I naturally welcome comments on the posts themselves.
To be clear, the views expressed are my own, and do not reflect the views of my employer.
pbcoreContributor element issue
Let's say I'm trying to record contributors to a movie and have a list of actors. In PBCore I can list each name, and record contributorRole as "actor." But there's no way to record what role they played as actor. Seems like a big gap! This could easily be solved in one of two ways:
Add an optional third element inside the pbcoreContributor container to record this. What to call it? Part maybe?
Add an optional attribute to contributorRole, so you can say something like:
<contributorRole part="Erin Brockovich">Actor</contributorRole>
Only problem with this is, PBCore doesn't use attributes. So maybe the first way is better:
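A rough sketch of the first option, written as a small Python helper that emits the container with an optional third element. The element name `part` is only the placeholder floated above, not anything in the PBCore schema, and the contributor name is an invented stand-in.

```python
import xml.etree.ElementTree as ET


def contributor(name, role, part=None):
    """Render a pbcoreContributor, adding the hypothetical <part> only when given."""
    el = ET.Element("pbcoreContributor")
    ET.SubElement(el, "contributor").text = name
    ET.SubElement(el, "contributorRole").text = role
    if part is not None:
        ET.SubElement(el, "part").text = part  # hypothetical element name
    return ET.tostring(el, encoding="unicode")


# "Doe, Jane" is a placeholder name.
print(contributor("Doe, Jane", "Actor", part="Erin Brockovich"))
print(contributor("Hughes, Langston", "Lecturer"))
```

Records that have nothing to say about a part simply omit the element, so existing contributor entries would stay valid.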
Fun new category added to PBCoreresources.org: Change Requests
If things appear quiet in pbcoreland, that would be only a surface-level view. A lively discussion is taking place, for example, on the American Archive Pilot Project Basecamp site. People are asking questions about why PBCore elements are as they are, how PBCore matches the requirements of their collections and projects, and poking it with sticks in various ways. The AMIA Conference arrives in St. Louis next week, following a year in which many AMIA folks have submerged PBCore in white fire and subzero liquids. Other open source software projects are leveraging PBCore to exchange data and media files between far-flung systems. Lots of us have been throwing tons of actual content against the PBCore wall, and seeing which pieces stick or bounce back.
PBCore 2.0 will be coming, and it needs our help.
So I'm putting out the call (with the encouragement of Paul Burrows) for something we'll categorize on this site as Change Requests. Many suggestions have already been made on various listservs and other online spaces, and we'll be looking through those to compile them here. If you have other suggestions for changes to PBCore, you can add them here yourself. (Anyone who is a Member of this site can post to it, and you can sign up on the home page.) Please make sure to assign your entry to the category Change Request. We can then more easily filter these into a pile, and sort them into subcategories as needed.
I'll be an active curator as needed, so just let me know what "as needed" means to you, and let 'er rip.