Use of PBCore in the American Archive Pilot Project

Written by Jack Brighton on Wednesday, February 17, 2010

Illinois Public Media was one of the 20-some public TV and radio stations in the CPB-funded American Archive Pilot Project. The AAPP required participating stations to use PBCore as a metadata format, at least in principle. I decided to push implementation of PBCore in my AAPP content collection as far as possible, using the toolset from a previous video archive project (Prairiefire on WILL-TV).

This toolset is based on ExpressionEngine, a website content management system that makes setting up a particular database structure rather easy. I set up the database structure based on PBCore elements, with controlled vocabularies reflecting the AAPP taxonomy and suggested PBCore picklists. I then created XML templates in ExpressionEngine to render my AAPP collection metadata as valid PBCore records. Following discussions with Dan Jacobson and David Rice, I went one step further and created a PBCoreCollection wrapper containing all 235 of the PBCore item records (each as a PBCoreDescriptionDocument) in my collection. The national portal for the AAPP, being developed and hosted at Oregon Public Broadcasting, was able to simply ingest the PBCoreCollection, demonstrating the viability of this approach to aggregating a large collection from multiple content sources.

This article details the methods used to accomplish this in ExpressionEngine. Similar methods could be used in Drupal, which we’re working on now.

In ExpressionEngine, one can easily define a set of fields for data entry. For example, a blog would need fields for a Title, a Body, and maybe a separate Image upload field along with a label field for the image (so you could add a caption, or at least an alt tag). When you create these fields, you also pick a field type: textarea, dropdown list, file upload, etc. EE has several pre-defined field types, and there are dozens of addons from third-party developers that add more.

One of the really great EE addons is FieldFrame, developed by Brandon Kelly. FieldFrame is a framework for developing new EE fieldtypes, and there are a bunch of good ones. The most important for our EE PBCore tool is called FF Matrix, which allows you to bundle several fields in a “row” of related data.

Here’s the way you create an FF Matrix field in ExpressionEngine:

[Image: FF Matrix screenshot]

With an FF Matrix field, you can do things like enter a PBCore subject tied to a subjectAuthorityUsed, or a title along with its titleType. Since most PBCore elements are wrapped in pairs like this, it’s important to handle them in a straightforward way. With FF Matrix, you can enter as many linked pairs as needed. For example, if an item has many subject terms, each term is wrapped individually along with its corresponding subjectAuthorityUsed.
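To make the pairing concrete, here is a minimal Python sketch (not the ExpressionEngine template itself) that turns matrix-style rows into one wrapper element per row. The element names follow PBCore 1.x as I understand it; the function name and the sample subject terms are made up for illustration.

```python
import xml.etree.ElementTree as ET


def subject_elements(rows):
    """Build one pbcoreSubject wrapper per (term, authority) row."""
    elements = []
    for term, authority in rows:
        wrapper = ET.Element("pbcoreSubject")
        ET.SubElement(wrapper, "subject").text = term
        ET.SubElement(wrapper, "subjectAuthorityUsed").text = authority
        elements.append(wrapper)
    return elements


# Hypothetical rows, as a cataloger might enter them in a matrix field.
rows = [
    ("World War, 1939-1945", "LCSH"),
    ("Oral history", "LCSH"),
]
subject_xml = [ET.tostring(e, encoding="unicode") for e in subject_elements(rows)]
for line in subject_xml:
    print(line)
```

Each row stays self-contained, so a record with twenty subject terms simply yields twenty wrappers, each carrying its own authority.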

Here’s the PBCore Item entry form showing a number of such fields (not the entire form, which is a bit long):

[Image: PBCore Item entry form]

We used this form to enter all the Intellectual Content and Intellectual Property metadata for each media item. Nothing in this Item form relates to the physical or digital Instantiation of that item. For that we used a different form, with fields and fieldtypes defined specifically for Instantiation metadata. Here’s the fun part: one of the fieldtypes in the Instantiation form is a “relationship” field, which allows you to select an existing Item to which the Instantiation should be linked. So if you have several Instantiations of the same Item, like a wav file, an mp3, and an analog tape, you create an Instantiation record for each and link them all to the Item.

[Image: PBCore Instantiation entry form]

This proved to be a quick and effective way to link multiple Instantiations with a single Item.
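The Item/Instantiation relationship can be sketched outside of any CMS as two record types joined by an identifier. This is only an illustration in Python: the class names, field names, and sample values are assumptions, not the actual ExpressionEngine field definitions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    identifier: str
    title: str


@dataclass
class Instantiation:
    item_id: str              # plays the role of the "relationship" field
    format_id: str
    format_digital: str = ""
    format_physical: str = ""


def instantiations_for(item, instantiations):
    """Return every Instantiation linked to the given Item."""
    return [i for i in instantiations if i.item_id == item.identifier]


# Hypothetical data: one Item with three Instantiations, plus an unrelated record.
item = Item("wwii-oral-history-001", "Example oral history interview")
records = [
    Instantiation(item.identifier, "001-wav", format_digital="audio/x-wav"),
    Instantiation(item.identifier, "001-mp3", format_digital="audio/mpeg"),
    Instantiation(item.identifier, "001-tape", format_physical="Audio cassette"),
    Instantiation("some-other-item", "002-mp3", format_digital="audio/mpeg"),
]
linked = instantiations_for(item, records)
print([i.format_id for i in linked])
```

Keeping the link on the Instantiation side means new copies of an Item can be added later without touching the Item record.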

You might be able to see that some of the fields are blank, and their instructions say things like “formatDataRate - If MP3 file don’t enter anything.” Much of the technical metadata, such as formatFileSize, can be extracted automatically from the digital files by the system, so we don’t have to enter that data by hand. EE has a nice addon called MP3 Info + that does most of that work.

David Rice has developed better methods of reading file metadata into his PBCore Records Repository using a free tool called MediaInfo. We should get him to write more about that, as it’s work that could be leveraged and used in different systems I’m sure.

After entering all the metadata for our collection using the two forms above, the payoff is in rendering everything in usable form. Since it’s all in the CMS, it’s a simple matter to make a website displaying everything, and providing media players for the files. In fact we did this initially for the catalogers so they could work remotely and listen to and view the audio and video files.

This site was intended for that purpose: http://will.illinois.edu/metadata/aapp-inventory-all/.

As the catalogers added descriptive metadata, the site became much more interesting! We added as much descriptive stuff as possible, even full tape logs for some of the World War II oral history interviews. I chose not to display all that metadata on the web page, but it is rendered in the PBCore XML record for each item.

For example, here is a web page for one such interview: http://will.illinois.edu/metadata/aapp-inventory-all/WWII_oral_history_WesleyMatthews2008-02-21

And here is the PBCore record for the same interview: http://will.illinois.edu/metadata/pbcoreAAPP/wwii_oral_history_wesleymatthews2008-02-21

The way these are rendered is simple: an HTML template for the web page and an XML template for the PBCore record, both drawing from the same database. In ExpressionEngine this is easy to set up, and once it’s set up, you’re done.
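The "one database, two templates" idea can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not ExpressionEngine template syntax: the templates, the `render` helper, and the sample record are hypothetical, and the XML is a simplified fragment rather than a schema-valid PBCore record.

```python
from html import escape
from string import Template

# Two views of the same record: a web page fragment and a PBCore-style fragment.
HTML_TEMPLATE = Template("<h1>$title</h1>\n<p>$description</p>")
XML_TEMPLATE = Template(
    "<PBCoreDescriptionDocument>"
    "<pbcoreTitle><title>$title</title><titleType>$title_type</titleType></pbcoreTitle>"
    "<pbcoreDescription><description>$description</description>"
    "<descriptionType>$description_type</descriptionType></pbcoreDescription>"
    "</PBCoreDescriptionDocument>"
)


def render(template, record):
    """Escape &, < and > in every value, then fill in the template."""
    safe = {key: escape(value, quote=False) for key, value in record.items()}
    return template.substitute(safe)


# Hypothetical record, standing in for one row of the CMS database.
record = {
    "title": "Oral history: farming & the war",
    "title_type": "Program",
    "description": "An example description.",
    "description_type": "Abstract",
}
web_page = render(HTML_TEMPLATE, record)
pbcore_record = render(XML_TEMPLATE, record)
print(web_page)
print(pbcore_record)
```

The web page can show only a subset of the fields while the XML view carries everything, which matches the choice described above of keeping the full tape logs out of the page but in the record.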

Finally, as mentioned above, I chose to try implementing the idea of a PBCoreCollection wrapper element, enclosing all 235 of the individual PBCoreDescriptionDocuments in my AAPP media collection. This is, of course, not a valid wrapper element in any PBCore version to date. This experience suggests that it should be. OPB was able to ingest my entire collection in a single gulp from this URL. Other stations in the AAPP were able to export using the same method (PBCoreCollection) even though they have different local systems. The ability to render a PBCoreCollection is all that matters, not the underlying system that rendered it.
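Here is a rough Python sketch of both sides of that exchange: a station wrapping its records in a single PBCoreCollection element, and an aggregator reading the records back out. The helper names, identifiers, and titles are invented for the example, namespaces are omitted for brevity, and this is not the code OPB or any station actually ran.

```python
import xml.etree.ElementTree as ET


def make_document(identifier, title):
    """Build a minimal PBCoreDescriptionDocument (simplified, not schema-valid)."""
    doc = ET.Element("PBCoreDescriptionDocument")
    ident = ET.SubElement(doc, "pbcoreIdentifier")
    ET.SubElement(ident, "identifier").text = identifier
    ET.SubElement(ident, "identifierSource").text = "Example Station"
    title_el = ET.SubElement(doc, "pbcoreTitle")
    ET.SubElement(title_el, "title").text = title
    ET.SubElement(title_el, "titleType").text = "Program"
    return doc


def build_collection(documents):
    """Wrap any number of description documents in one PBCoreCollection."""
    collection = ET.Element("PBCoreCollection")
    for doc in documents:
        collection.append(doc)
    return collection


# Station side: render the whole collection as a single XML feed.
docs = [make_document(f"item-{n}", f"Example item {n}") for n in range(1, 4)]
feed = ET.tostring(build_collection(docs), encoding="unicode")

# Aggregator side: parse the feed and pull out each record.
ingested = ET.fromstring(feed).findall("PBCoreDescriptionDocument")
titles = [d.findtext("pbcoreTitle/title") for d in ingested]
print(len(ingested), titles)
```

The aggregator only needs to know the wrapper and record element names; it never sees the system that produced the feed.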

I hope this is useful to anyone who might be looking for systems for cataloging media assets and doing various things with them like creating websites and PBCore records or whatever metadata format. I used ExpressionEngine but the basic method would work with Drupal, Plone, and other CMSs and frameworks. Most importantly, regardless of the system used, I hope this demonstration of the power of PBCoreCollection informs the development of PBCore 2.0, which is now in progress.

Best regards,
Jack


Comments:

  • mlc said on 02/17 at 11:17 PM

    On the subject of automatic instantiation creation from media files, this is something that Dave Rice has been working on for a while and which we recently integrated into the PBCore metadata repository we’ve been building.

    The import code is relatively brief as it primarily relies on an XSL transform that Dave wrote and which should be reasonably easy to integrate into other PBCore-aware applications.

    You can see an example of the results on this asset. (Only the video and thumbnail instantiations were generated automatically; the Internet Archive link was generated manually. And actually I think that video may have been uploaded with a test version of the code, so the mapping is a bit better now.)

  • John Tynan said on 02/17 at 11:37 PM

    Impressive work! 

    Tell me, are you caching the PBCoreCollection wrapper feed containing the 235 PBCoreDescriptionDocuments?

    How long does it take to catalogue a single record?  Were these records all catalogued by a technical user (you)?  Do you anticipate that you could get library science majors to intern on this project and help finish cataloguing WILL’s collection?  If so, would you make any changes in the user interface to help accomplish this task?

    Is it possible to link all of the subjects for a particular record to the search results for the index for each of these subjects?  Like this for a search on the ACLU:

    http://will.illinois.edu/metadata/searchresults/553e282f269387314c5f762c189b1213/

  • Jack Brighton said on 02/18 at 12:18 AM

    Responding first to mlc: I think you and Dave are way ahead on automatic instantiation stuff. My next project involves leveraging Drupal and contributed modules like Media Mover to upload a source media file, transcode it to whatever formats, catalog and save everything to the maximum extent possible, and generate PBCore or whatever shareable metadata is needed. I hope to lean heavily on what you’ve already done.

    To John: yes, the wrapper feed is cached, which works around a weakness in ExpressionEngine, or maybe PHP, or just our server capacity. I can explain in more detail, but basically, beyond a certain number of assets we had to cache the feed to keep from running out of PHP memory. The time to catalog a single record depended upon the quality of the cataloger, what was known about the media object under observation, and the resources at hand. We had very good catalogers, knew very little about the Civil Rights content, and had a much appreciated grant from CPB for the project. I had them listen to and view the content in real time, all the way through, so count the hours of that plus something like 0.5 times real time for the full entry. I consider this very efficient work on their part.

    We should involve library science students wherever possible, and I’ve had a stream of them working at Illinois Public Media on various projects. I once even had funding for an actual Graduate Assistant!

    The user interface…it took almost zero time for the catalogers to get up to speed. It could always be prettier, but the point is to get things done in this case.

    Finally regarding search, I didn’t create this for the purpose of a public website with usability in mind, but as a way to feed the overall AAPP national portal. I think the national portal will have much better search capabilities than I have provided in the local project website. But suggestions about improvements to the local site would be most welcome!

  • Kevin Reynen said on 02/18 at 12:50 PM

    Excited to see what you do with Drupal!  Have you looked at how the new Creative Commons modules (http://drupal.org/project/creativecommons) or MERCI (http://drupal.org/project/merci) work?

    The set of fields approach you took in ExpressionEngine seems similar to Drupal’s Content Construction Kit (CCK). We leverage CCK to extend Drupal’s core node functionality in the Open Media Project, but there are limits to what can be done with that because it leverages Drupal’s core form processing and validation logic. 

    I think the approach we used with Creative Commons and MERCI would work much better.  The concept of extending a content type is the same, but the metadata fields you add can be defined per content type.  CC gives users the option of which licenses the site and/or content type allows, as well as which fields are available or required.

    I’d be really interested in contributing to this development.  It would dovetail nicely with the Open Media Project’s implementation for the Bay Area Video Coalition and the work we’re doing with Media Mover to push videos to Archive.org.

  • Daniel Jacobson said on 02/21 at 09:29 PM

    This is great work!  It really does demonstrate the need for collection elements in PBCore.  The URL that you provide to your collection of PBCore documents also shows how PBCore can be more than a manifest standard, allowing such documents to be distributed in a feed-based manner.
