Article category: Change management
pbcoreAssetDate needs refinement…and this site needs revitalization
So…lots going on in PBCoreland that hasn’t been reflected on pbcoreresources.org. I’ll get to that in another post. For the moment I want to mark something that is currently bugging me about PBCore 2.0: pbcoreAssetDate needs something to say what date formatting is being used. This is true for all other dates in a PBCore record.
I’m in the middle of building a PBCore export feature for WILL’s main website. This will allow exchange of pretty complete metadata with systems that can ingest PBCore, like the American Archive project (if it ever gets truly rolling) and the Popup Archive (which is rolling nicely). As I dive into the specifics, I want to return to and highlight those things about the PBCore 2.0 schema that remain…unfinished.
My concern is machine readability of dates and times. The PBCore 2.0 schema suggests, but does not require, ISO 8601 or the Library of Congress Extended Date/Time Format (EDTF) (and BTW the link on pbcore.org to EDTF is broken). Two big problems here:
- The 2.0 schema doesn’t provide any way to specify date formatting at all
- Even if it did, there’s a huge range of possible date formats within either ISO 8601 or EDTF
What’s a good solution? I don’t build parsers for a living, so I’m not sure, and thus this post. I’m tempted to say we should add a source attribute to PBCore dates, and specify the source of the date format we’re using. But is this specific enough? For example:
<pbcoreAssetDate dateType="published" source="ISO 8601">2006-10-16T08:19:39-05:00</pbcoreAssetDate>
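To make the question concrete, here is a rough Python sketch of how a consuming system might use such an attribute. It is an illustration only, not part of any PBCore schema: the source attribute is just the proposal above, and the PARSERS table and parse_asset_date helper are hypothetical names. It assumes Python 3.7 or later, where datetime.fromisoformat accepts this particular ISO 8601 form.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical lookup: value of the proposed source attribute -> parser.
# Only one ISO 8601 form is handled here; a real table would need more.
PARSERS = {
    "ISO 8601": datetime.fromisoformat,
}

def parse_asset_date(xml_text):
    """Return (dateType, datetime) for a pbcoreAssetDate element."""
    element = ET.fromstring(xml_text)
    source = element.get("source")
    if source not in PARSERS:
        raise ValueError(f"No parser registered for date source {source!r}")
    return element.get("dateType"), PARSERS[source](element.text.strip())

record = (
    '<pbcoreAssetDate dateType="published" source="ISO 8601">'
    "2006-10-16T08:19:39-05:00</pbcoreAssetDate>"
)
date_type, published = parse_asset_date(record)
print(date_type, published.isoformat())
```

Even in this small sketch the second problem shows up: "ISO 8601" alone doesn’t tell the parser which of the many ISO 8601 forms to expect, so the attribute value might need to name something more specific.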
p.s. For a variety of reasons, I’m back on the job here as your editor/curator/muckraker of this site. It needs a rebuild, but first things first!
PBCore.org site refresh a welcome sight
If you haven’t been to http://pbcore.org lately you’re in for a good surprise. The site has been completely rebuilt, and contains up-to-date documentation, news, case studies, and most importantly, the complete PBCore 2.0 schema. There, I buried the lead: PBCore 2.0 has been officially released!
I might quibble with the color scheme of the new site, and I see some obvious CSS tweaks that could improve readability. But hey, my own websites need lots of work so who am I to talk? I have to give the folks at WGBH, who rebuilt the new PBCore site from the bones and ashes of its 2005 incarnation, lots of credit for getting things basically right.
Things I really like: In addition to the 2.0 Schema, there’s a How To section (contained in the Documentation menu item in the sidebar navigation) which offers really good guidance, and a Training section with clear instructions and code examples on things like “How to express collections in PBCore,” “How to sequence records within relationships,” and “How to express time segments within a video.” None of these was even possible prior to the 2.0 schema, and now we have clear documentation on how to do them.
Another area of the PBCore site that stands out is the Elements section. This section provides concise details on each of the PBCore 2.0 elements: its usage rules (e.g. minOccurs), where it appears in the schema, and its available attributes. I find the Elements section highly usable, but I find myself Command-clicking element names to open them in a new browser tab (I’m on a Mac…) so I don’t lose the Elements index page in the original tab. It might be more usable to navigate the Elements more like the old PBCore User Guide, where clicking on an Element doesn’t take you away from the navigation. But that’s a minor quibble, I’m a geek, and am never completely happy.
One thing I am happy about is the inclusion of a “related discussions” link on each Element page, which includes a link to the home page of pbcoreresources.org. This leads to an idea about how pbcoreresources could directly add value to pbcore.org. As we discuss various PBCore elements here, our posts get aggregated in categories like pbcoreTitle. So for example on the pbcoreCollection Element page on pbcore.org, the “related discussions” link could go directly to the pbcoreCollection category page on pbcoreresources.org. This assumes enough of us are contributing to pbcoreresources with questions, answers, examples, and other useful conversation about PBCore elements. We have done that to some extent, and I’m suggesting we do it more. PBCore.org can then mine those discussions to enhance the official documentation over time.
Or maybe pbcore.org will continue to grow and supplant some of the stuff we have been doing here, and I’d be OK with that. The new pbcore.org site is built on WordPress, and it does allow comments in many places including the How To pages and the Element pages. Wherever it happens, I expect the user community will continue to build a shared understanding of how to move forward with PBCore. And above all, to keep the keepers of the PBCore standard in tune with the needs and realities of real-world media producers, publishers, and archivists who use it every day.
Sneak preview of PBCore 2.0
If I’ve learned one thing about the PBCore user community, it’s that we’re not satisfied with the current state of PBCore. We’ve used it enough to discover its strengths in describing AV assets and creating shareable metadata, but we keep running into its gaps and flaws. We’ve been pushing for a change process, and have argued for specific changes. Common threads have emerged right here on this site:
- A need for PBCore to support multi-part instantiations, e.g. when you have one complete work comprised of several reels or tapes or files.
- A need to express rights information related to a specific Instantiation, instead of only the entire asset. For example, you might want to allow users to download an mpeg4 version of a film for personal use, but not grant the same kind of access to the actual film!
- Speaking of rights, formatting of the pbcoreRightsSummary element disallows inclusion of metadata from existing standards such as ODRL or Creative Commons, which seems odd to say the least. If you already have structured rights data, why not simply reuse it?
- A need to show relationships between Instantiations. When you digitize a film to 10-bit uncompressed digital video, then encode an mpeg4 file from the 10-bit uncompressed file, it seems important to show that in the PBCore record.
- With pbcoreContributor, you can say that Harrison Ford is an Actor, but you can’t say what role he plays in the film.
- There’s no way to uniquely identify a person, subject term, location, or other value that might have an actual URI.
- The lack of attributes of any kind! Everything is elements and sub-elements, which seems inefficient and makes parsing more difficult.
- The lack of a valid way to identify clip information within an asset, for example where in the timeline a particular subject is discussed or a specific person appears.
- The lack of any way to bundle multiple PBCore XML records together in a feed or collection, so you could export/import large groups of records between systems or use PBCore in RESTful web applications.
Well good news folks! PBCore 2.0 is on the way, and it solves all these issues.
Even better, it solves them in a way that doesn’t add complexity for those who want to keep PBCore simple. For example, PBCore 2.0 allows you to use attributes for a subject term to specify a URI, and a startTime and endTime for that term in the media asset timeline. So you could have something like this:
<pbcoreSubject ref="http://en.wikipedia.org/wiki/Hobbit" startTime="00:23:14" endTime="00:24:22">Hobbits</pbcoreSubject>
You can also do this:
<contributor affiliation="NPR" ref="http://en.wikipedia.org/wiki/Michele_Norris">Michele Norris</contributor>
<contributor ref="http://en.wikipedia.org/wiki/Sean_Connery">Sean Connery</contributor>
<contributorRole portrayal="James Bond">Actor</contributorRole>
But the use of the new 2.0 attributes is totally optional. You can keep it simple and use PBCore the same way as before.
Once you get used to the idea of adding attributes, however, you may find it opens up all kinds of new possibilities for your PBCore metadata. For example, the use of URIs to identify values like subject terms, people, and locations is the first step to enabling content to live and breathe in the emerging semantic web/linked data universe. The addition of the optional ref="URI" attribute in the 2.0 schema puts PBCore squarely onto that path.
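To show what "optional" might mean for code that reads these records, here is a small Python sketch. It assumes the HH:MM:SS timecode form used in the Hobbits example (the schema may allow other forms), and the helper names to_seconds and describe_subject are made up for illustration. The same function handles a subject with the 2.0 attributes and one without.

```python
import xml.etree.ElementTree as ET

def to_seconds(timecode):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def describe_subject(xml_text):
    """Summarize a pbcoreSubject element, with or without 2.0 attributes."""
    element = ET.fromstring(xml_text)
    summary = {"term": element.text, "uri": element.get("ref"), "duration": None}
    start, end = element.get("startTime"), element.get("endTime")
    if start and end:
        # Length of the segment in which the subject is discussed.
        summary["duration"] = to_seconds(end) - to_seconds(start)
    return summary

rich = describe_subject(
    '<pbcoreSubject ref="http://en.wikipedia.org/wiki/Hobbit" '
    'startTime="00:23:14" endTime="00:24:22">Hobbits</pbcoreSubject>'
)
plain = describe_subject("<pbcoreSubject>Hobbits</pbcoreSubject>")
print(rich)
print(plain)
```

A record that skips the attributes simply yields None for the URI and the duration, so simple records and enriched ones can travel through the same code.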
But I suspect many of the other improvements to PBCore 2.0 will make life easier for all concerned. From what I see, the changes solve the issues people have raised on this site. The folks at WGBH who managed the 2.0 project did additional extensive research and outreach to find out what people using PBCore need, and how best to evolve the schema. And I give a lot of credit to CPB for supporting an open and transparent process. We all contributed to the 2.0 version of PBCore, and our input was taken seriously. I understand the schema will be publicly released soon, and you’ll see.
PBCore has thus far only sort of worked as a metadata standard for AV assets and collections, but gaps in its earlier versions drove many of us to implement workarounds and hacks. The result was lack of clarity at best, which is not a good thing for a technical standard. The 2.0 PBCore schema probably isn’t perfect, and we’ll all find out more as we learn about it and begin our own implementations. But in my view it takes PBCore to a much higher state of functionality and flexibility, while retaining its simplicity and its humble origins as a child of Dublin Core.
Deadline for PBCore 2.0 change requests looms
July 25th is the official deadline for input on the next major revision of PBCore. A release of PBCore 2.0 in November, 2010 is expected to represent a major leap forward, based on lots of change requests and research among the user community, plus where projects like the American Archive need to go. This is our chance to make sure it’s going in the most useful direction, and solving the right problems.
You can see the full list of submitted change requests, and add your own, here:
By July 25th please!
PBCore.org Alpha site launched!
Head on over to PBCore.org, check out the face lift, and leave your feedback!
WGBH is in the process of re-designing PBCore.org and we’ve just launched the “alpha” site which includes:
The interior pages remain the same for now, as the full re-launch is scheduled for November 2010 to coincide with the release of PBCore 2.0.
To that end, please contribute change requests for the schema, elements, etc., by June 30, 2010.
This summer, we’ll be conducting card sort and user testing exercises to re-organize the site’s new and legacy content and to make it as clear, concise and useful as possible for PBCore users. If you have suggestions or would like to participate please contact us or participate in the blog.
Many thanks to all who provided, and who continue to provide input into the new site, including Jack Brighton, Paul Burrows, Nan Rubin, the Code4Lib community, and the WGBH PBCore team!
CPB today announces the launch of the PBCore 2.0 Development Project
(Washington, DC) - The Corporation for Public Broadcasting today announced the launch of the PBCore 2.0 Development Project.
The PBCore 2.0 Development Project will expand the existing PBCore metadata standard to increase the ability, on one hand, of content producers and distributors using digital media to classify and describe public media content (audio and video) and, on the other, of audiences to find public media content on a variety of digital media and mobile platforms.
The PBCore 2.0 Development Project will also work to enhance the PBCore standard to ensure that it will be able to satisfy the demands of multiplatform digital content as well as an evolving World Wide Web. Since PBCore's development in 2005, it has become not only one of the most widely-used metadata standards in the world, but also the basis of other metadata standards. At the same time, in the last five years, the number of digital media applications that would benefit from PBCore has grown significantly. An updated PBCore will benefit not only public broadcasters, but all users of metadata standards based on PBCore.
Use of PBCore in the American Archive Pilot Project
Illinois Public Media was one of the 20-some public TV and Radio stations in the CPB-funded American Archive Pilot Project. The AAPP required participating stations to use PBCore as a metadata format, at least in principle. I decided to push implementation of PBCore in my AAPP content collection as far as possible using the toolset I used on a previous video archive project (Prairiefire on WILL-TV).
This toolset is based on the website Content Management System called ExpressionEngine, which makes setting up a particular database structure rather easy. I set up the database structure based on PBCore elements, with controlled vocabularies reflecting the AAPP taxonomy and suggested PBCore picklists. I then created xml templates in ExpressionEngine to render my AAPP collection metadata as valid PBCore records. I then went one step further, following discussions with Dan Jacobson and David Rice, and created a PBCoreCollection wrapper containing all 235 of the PBCore item records (each as a PBCoreDescriptionDocument) in my collection. The national portal for the AAPP, being developed and hosted at Oregon Public Broadcasting, was able to simply ingest the PBCoreCollection, demonstrating the viability of this approach to aggregating a large collection from multiple content sources.
This article details the methods used to accomplish this in ExpressionEngine. Similar methods could be used in Drupal, which we’re working on now.
In ExpressionEngine, one can easily define a set of fields to input data. For example a blog would need fields for a Title, a Body, and maybe a separate Image upload field along with a label field for the image (so you could add a caption or an alt tag at least). When you create these fields, you also pick a field type: textarea, dropdown list, file upload, etc. EE has several pre-defined field types and there are dozens of addons from third-party developers to add more.
One of the really great EE addons is FieldFrame, developed by Brandon Kelly. FieldFrame is a framework for developing new EE fieldtypes, and there are a bunch of good ones. The most important for our EE PBCore tool is called FF Matrix, which allows you to bundle several fields in a “row” of related data.
Here’s the way you create an FF Matrix field in ExpressionEngine:
With an FF Matrix field, you can do things like enter a PBCore subject tied to a subjectAuthorityUsed, or a title along with its titleType. Since most PBCore elements are wrapped in pairs like this, it’s important to solve this in a straightforward way. With FF Matrix, you can enter as many linked pairs as needed. For example, when an item has many subject terms, each term can be wrapped individually along with its corresponding subjectAuthorityUsed.
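Outside of ExpressionEngine, the same pairing idea can be sketched in a few lines of Python. This is not how FF Matrix works internally; it only shows rows of linked values being rendered as individually wrapped elements. The sample rows and the render_subjects name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Each tuple stands in for one matrix row: (subject, subjectAuthorityUsed).
# The sample values are invented for illustration.
subject_rows = [
    ("World War, 1939-1945", "Library of Congress Subject Headings"),
    ("Oral history", "Local keyword list"),
]

def render_subjects(rows):
    """Wrap each row in its own pbcoreSubject element."""
    elements = []
    for subject, authority in rows:
        wrapper = ET.Element("pbcoreSubject")
        ET.SubElement(wrapper, "subject").text = subject
        ET.SubElement(wrapper, "subjectAuthorityUsed").text = authority
        elements.append(wrapper)
    return elements

for element in render_subjects(subject_rows):
    print(ET.tostring(element, encoding="unicode"))
```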
Here’s the PBCore Item entry form showing a number of such fields (but not the entire form which is a bit long):
We used this form to enter all the Intellectual Content and Intellectual Property metadata for each media item. Nothing in this Item form relates to the physical or digital Instantiation of that item. For that we used a different form with fields and fieldtypes defined specifically for Instantiation metadata. Here’s the fun part: One of the fieldtypes in the Instantiation form is a “relationship” field, which allows you to select an existing Item to which the Instantiation should be linked. So if you have several Instantiations, like a wav file, an mp3, and an analog tape of the same Item, you create Instantiation records for each and link them to the Item.
This proved to be a quick and effective way to link multiple Instantiations with a single Item.
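The shape of that relationship can be sketched as a tiny Python data model. The class and field names here are illustrative only and do not reflect ExpressionEngine’s actual storage; the point is that each Instantiation carries a reference to its Item, and the Item’s copies are found by following those references.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """Intellectual content record (title, description, and so on)."""
    title: str

@dataclass
class Instantiation:
    """One physical or digital copy, linked to an existing Item."""
    format_name: str
    item: Item  # plays the part of the "relationship" field

def instantiations_for(item, all_instantiations):
    """Collect every Instantiation that points at the given Item."""
    return [copy for copy in all_instantiations if copy.item is item]

interview = Item(title="Example oral history interview")
other = Item(title="Some other program")
copies = [
    Instantiation("WAV", interview),
    Instantiation("MP3", interview),
    Instantiation("Analog tape", interview),
    Instantiation("MP3", other),
]
formats = [copy.format_name for copy in instantiations_for(interview, copies)]
print(formats)
```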
You might be able to see that some of the fields are blank, and their instructions say things like “formatDataRate - If MP3 file don’t enter anything.” Lots of the technical metadata like formatFileSize etc could be extracted automatically from the digital files by the system, so we don’t have to enter that data by hand. EE has a nice addon called MP3 Info + that does most of that work.
David Rice has developed better methods of reading file metadata into his PBCore Records Repository using a free tool called MediaInfo. We should get him to write more about that, as it’s work that could be leveraged and used in different systems I’m sure.
After entering all the metadata for our collection using the two forms above, the payoff is in rendering everything in usable form. Since it’s all in the CMS, it’s a simple matter to make a website displaying everything, and providing media players for the files. In fact we did this initially for the catalogers so they could work remotely and listen to and view the audio and video files.
This site was intended for that purpose: http://will.illinois.edu/metadata/aapp-inventory-all/.
As the catalogers added descriptive metadata, the site became much more interesting! We added as much descriptive stuff as possible, even full tape logs for some of the World War II oral history interviews. I chose not to display all that metadata on the web page, but it is rendered in the PBCore XML record for each item.
For example, here is a web page for one such interview: http://will.illinois.edu/metadata/aapp-inventory-all/WWII_oral_history_WesleyMatthews2008-02-21
And here is the PBCore record for the same interview: http://will.illinois.edu/metadata/pbcoreAAPP/wwii_oral_history_wesleymatthews2008-02-21
The way these are rendered is simple: an html template for the web page, and an xml template for the PBCore record, both drawing from the same database. In ExpressionEngine this is very simple to set up, and once it’s set up, you’re done.
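For readers not using ExpressionEngine, the two-template idea looks roughly like this in Python. The record, both templates, and the render helper are invented for illustration, and the XML here is heavily simplified (a valid PBCore record needs more elements than this).

```python
from html import escape
from string import Template

# One stored record; keys and values are invented for illustration.
record = {
    "title": "Oral history interview",
    "description": "Recollections of service & homecoming.",
}

HTML_TEMPLATE = Template("<h1>$title</h1>\n<p>$description</p>")
PBCORE_TEMPLATE = Template(
    "<PBCoreDescriptionDocument>"
    "<pbcoreTitle><title>$title</title></pbcoreTitle>"
    "<pbcoreDescription><description>$description</description></pbcoreDescription>"
    "</PBCoreDescriptionDocument>"
)

def render(template, data):
    """Fill a template, escaping characters that are special in markup."""
    safe = {key: escape(value) for key, value in data.items()}
    return template.substitute(safe)

web_page = render(HTML_TEMPLATE, record)
pbcore_record = render(PBCORE_TEMPLATE, record)
print(web_page)
print(pbcore_record)
```

Both outputs come from the same stored values, so a correction made once in the database shows up in the web page and in the PBCore record alike.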
Finally, as mentioned above I chose to try implementing the idea of a PBCoreCollection wrapper element, enclosing all 235 of the individual PBCoreDescriptionDocuments in my AAPP media collection. This is, of course, not a valid wrapper element in any PBCore version to date. This experience suggests that it should be. OPB was able to ingest my entire collection in a single gulp from this URL. Other stations in the AAPP were able to export using the same method (PBCoreCollection) even though they have different local systems. The ability to render a PBCoreCollection is all that matters, not the underlying system that rendered it.
I hope this is useful to anyone who might be looking for systems for cataloging media assets and doing various things with them like creating websites and PBCore records or whatever metadata format. I used ExpressionEngine but the basic method would work with Drupal, Plone, and other CMSs and frameworks. Most importantly, regardless of the system used, I hope this demonstration of the power of PBCoreCollection informs the development of PBCore 2.0, which is now in progress.
PBCore Recommendation: PBCoreCollection
Over the last few months, I have worked with Jack Brighton and Dave Rice to have the NPR API (http://www.npr.org/api) output PBCore as a supported format. In the early stages, we were able to put together a mapping of NPRML (our native XML format) to PBCore. From this mapping, my team and I started conceptualizing how this would work within the framework of the API. This exercise ultimately failed because of a philosophical issue between PBCore and the NPR API.
PBCore's implementation focuses on the individual conceptual asset (in NPRML terms, the story). So, if a station wants to receive the 20 stories from today's All Things Considered in PBCore, the station would receive 20 documents, one for each segment (and possibly another document for the program episode record). Meanwhile, the NPR API is a feed-oriented model, which means that a station that wants all 20 ATC segments would make a single request that delivers all 20 items, as well as the information about the program episode. The NPR API model matches many of the more popular feed types in the marketplace, including RSS, ATOM, Podcast, etc.
Because of this key difference, it is a big challenge to fit PBCore into the NPR API model. And because of this difference, I would like to recommend a new implementation to the PBCore schema to allow it to handle its current requests as well as the feed-based requests. Here is a sample of the changes:
<PBCoreCollection
    xsi:schemaLocation="http://www.pbcore.org/PBCore/PBCoreNamespace.html http://www.pbcore.org/PBCore/PBCoreXSD_Ver_1-2-1.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
  <PBCoreCollectionTitle>Title of this collection of stories</PBCoreCollectionTitle>
  <PBCoreCollectionDescription>This news feed contains stories that meet all of the following criteria: (1) Stories aired on "All Things Considered". (2) Stories from the "Afghanistan" topic.</PBCoreCollectionDescription>
  <PBCoreCollectionSource>NPR API</PBCoreCollectionSource>
  <PBCoreCollectionLink>Link back to the source for this feed</PBCoreCollectionLink>
  <PBCoreCollectionPubdate>Thu, 16 Oct 2008 06:00:00 -0400</PBCoreCollectionPubdate>
  <PBCoreDescriptionDocument>
    <!-- OTHER ELEMENTS GO HERE -->
  </PBCoreDescriptionDocument>
  <PBCoreDescriptionDocument>
    <!-- OTHER ELEMENTS GO HERE -->
  </PBCoreDescriptionDocument>
  <PBCoreDescriptionDocument>
    <!-- OTHER ELEMENTS GO HERE -->
  </PBCoreDescriptionDocument>
</PBCoreCollection>
The key to the recommendation is to wrap all of the current XML in a parent node called <PBCoreCollection>. The namespace attributes, which were previously attached to the <PBCoreDescriptionDocument> element, have been moved up to the collection element.
The <PBCoreCollection> node may then contain several child nodes which describe the collection itself, such as title, description, source, and date (similar to the
The <PBCoreCollection> node may contain any number of iterations of the <PBCoreDescriptionDocument> as sub-nodes.
After describing the collection, the PBCore document will then provide the list of actual documents. For each <PBCoreDescriptionDocument> element in this overall document, I have made no changes (other than lifting the namespace attributes to the collection element).
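As a rough illustration of the wrapper in practice, the Python sketch below builds a small collection and then reads it back in a single parse. It is a simplified stand-in rather than a reference implementation: the namespace attributes are left out for brevity, each document holds only a title, and the helper names build_collection and make_document are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_collection(title, documents):
    """Wrap already-built PBCoreDescriptionDocument elements in one parent."""
    collection = ET.Element("PBCoreCollection")
    ET.SubElement(collection, "PBCoreCollectionTitle").text = title
    collection.extend(documents)
    return collection

def make_document(story_title):
    # Stand-in for a full record; real documents carry many more elements.
    document = ET.Element("PBCoreDescriptionDocument")
    wrapper = ET.SubElement(document, "pbcoreTitle")
    ET.SubElement(wrapper, "title").text = story_title
    return document

stories = [make_document(f"Story {number}") for number in range(1, 4)]
feed = ET.tostring(build_collection("Sample feed", stories), encoding="unicode")

# The receiving side parses once and walks every enclosed document.
received = ET.fromstring(feed)
titles = [
    document.findtext("pbcoreTitle/title")
    for document in received.findall("PBCoreDescriptionDocument")
]
print(titles)
```

A collection holding a single PBCoreDescriptionDocument goes through the same code path, which is the backward-compatibility point of the proposal.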
The purpose of this approach, again, is to enable multiple PBCore documents to be delivered in one transaction as one document. To ensure backward compatibility, existing implementations can continue to process each item one at a time in this model by sending only one <PBCoreDescriptionDocument>. That said, I believe that PBCore should recommend to implementers that they use the feed-based approach. The most expensive part of data transfer from one system to another is always going to be the transaction itself, not the parsing of the document. So, for those 20 ATC segments, performing those 20 transactions and parsing them individually is far less efficient than doing one transaction and parsing the larger document.
Jack, Dave and I have also discussed potential conflicts with some current implementations that handle the transferring of multiple documents differently. So, if one system currently zips up multiple documents and another sends them individually, these two systems may not be able to work together without custom development, even though both of them are currently complying with the PBCore standard. Extending the standard to provide a method for distribution of multiple documents would also standardize the development practices around distribution of PBCore documents.
The three of us have had many conversations about <PBCoreCollection> and see great merit in this schematic change for PBCore. Although NPR is the real-life scenario that has surfaced this proposal, we believe that the merit of this approach goes far beyond working within the NPR framework.
Although we have put together this proposed model, we know that there are other great minds that could help us refine the recommendation. We look forward to your feedback and to an engaging discussion!