PBCore and the Semantic Web
Recently I had a chance to discuss with Dave Rice, Dan Jacobson, and Chris Beer the potential role of PBCore in enabling content to flow in the “semantic web.” The topic has also come up in discussions around the development of PBCore 2.0. It seems like a good idea to open the topic for broader consideration.
The Semantic Web would most likely be implemented by Linked Data. Using URIs and RDF, “Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.” Those other methods would include linking things together by hand, piece by piece, which webmasters of the world can tell you doesn’t exactly scale well.
So how does PBCore fit into linked data using URIs and RDF? I don’t have anything close to a complete answer, just ideas that came from the discussion. I would love some feedback to take this thing further.
Linked Data depends upon having Uniform Resource Identifiers for content items, their descriptive/administrative metadata, and their relationships to other items and descriptors. Every resource should have a URI, whether the resource is a file or a subject term. Relationships between resources also get URIs, such as “friend” or “author”. With URIs for everything, you can establish so-called Triples (subject -> predicate -> object) as machine-processable links, which then establish other machine-processable links to other resources, and the Linked Data universe scales up from there based on network effects. These Triples and their relationships with other resources are made possible by RDF statements. How do we create RDF statements using PBCore?
At the most basic level, a PBCore record may already contain many URIs. For a streaming media file, for example, a valid URI might be http://will.illinois.edu/media/mp3/illinoisbroadcastarchives_040T_96k.mp3. This file is described by elements like:
<pbcoreSubject>
  <subject>African American Civil Rights</subject>
  <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
</pbcoreSubject>
<pbcoreContributor>
  <contributor>Hughes, Langston</contributor>
  <contributorRole>Lecturer</contributorRole>
</pbcoreContributor>
Do we have URIs for “African American Civil Rights,” “American Archive Civil Rights Subject Categories,” “Hughes, Langston” and “Lecturer”? They aren’t currently in this PBCore document but perhaps they could be if we did something radical (for PBCore) like adding attributes. We could then tap into existing namespaces or create our own as needed, using and assigning URIs for each and every element in our PBCore records.
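To make the idea concrete, here is a minimal Python sketch of how optional URI attributes could feed triples. Everything beyond the element names and the media URL is an assumption for illustration: the `ref` attribute, the example.org URIs, and the predicate names are made up and are not part of any PBCore schema. Where an element has no `ref`, the sketch falls back to the plain text value.

```python
import xml.etree.ElementTree as ET

# The resource being described: the streaming file from the example above.
MEDIA_URI = "http://will.illinois.edu/media/mp3/illinoisbroadcastarchives_040T_96k.mp3"

# ASSUMPTION: "ref" is a made-up attribute and the example.org URIs are
# placeholders. Neither is part of any PBCore schema.
RECORD = """
<PBCoreDescriptionDocument>
  <pbcoreSubject>
    <subject ref="http://example.org/subjects/african-american-civil-rights">African American Civil Rights</subject>
    <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
  </pbcoreSubject>
  <pbcoreContributor>
    <contributor>Hughes, Langston</contributor>
    <contributorRole>Lecturer</contributorRole>
  </pbcoreContributor>
</PBCoreDescriptionDocument>
"""

# ASSUMPTION: hypothetical predicate URIs, one per element we care about.
PREDICATES = {
    "subject": "http://example.org/pbcore/terms/subject",
    "contributor": "http://example.org/pbcore/terms/contributor",
}


def record_to_triples(xml_text, resource_uri):
    """Return (subject, predicate, object) tuples for the mapped elements."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for element in root.iter():
        predicate = PREDICATES.get(element.tag)
        if predicate is None:
            continue  # element is not mapped to a predicate
        # Use the URI when one is supplied, otherwise the literal text.
        obj = element.get("ref") or (element.text or "").strip()
        triples.append((resource_uri, predicate, obj))
    return triples


triples = record_to_triples(RECORD, MEDIA_URI)
for triple in triples:
    print(triple)
```

The subject term comes out as a URI and the contributor as a plain string, which is the "allowed but not required" behavior in miniature.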
But this adds a great deal of additional work and complication. PBCore was invented to provide interoperability among public media systems, and if we all used plain old PBCore it would work for that purpose. If we raise the bar too high and require URIs, fewer people will be able to adopt PBCore in their local workflows and systems.
Yet it would be very useful to add URIs to key elements like subject and genre terms, contributors, creators, and publishers. We could even apply this to the currently almost meaningless pbcoreRightsSummary, for example using Creative Commons license URIs.
Such an approach could be incredibly powerful, but I believe it should be optional. URI attributes should be allowed but not required in PBCore 2.0. We might also create PBCore Profiles which use URIs in specific namespaces; for example in a future American Archive PBCore Profile there could be subject and genre namespaces facilitating a common taxonomy. This could solve one of the thorniest problems of interoperability in different content collections.
But I believe content creators in pubmedia will never universally adopt the same taxonomies, nor should they. After all, we’re probably all dealing with a babel of descriptive metadata and controlled vocabularies generated by various legacy systems and catalog records, and we can’t make them conform to a new common standard. I also like the idea of harvesting user-generated keywords for my content and adding that to the metadata mix. Also, PBCore Profiles might work within the PBCore user community, but what about linking to other content domains? I want my data to interoperate not only inside pubmedia, but also with related data anywhere. A story on the new water plant drawing from the Central Illinois Mahomet Aquifer could link to historical information, climate records, corporate earnings reports, and huge data sets on the geography and hydrology of the region. Each of these domains has tons of public data but it’s all in different formats and none of it is PBCore. Almost none of it is RDF.
This leads to the notion that something else should create the RDF. It would harvest whatever it could from available data and build an explorable triple store. This would enable other applications to traverse, search, and build interesting interfaces to a growing universe of linked data.
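As a rough illustration of what "an explorable triple store" means, here is a toy in-memory version in Python. It is only a sketch under stated assumptions: the class, the example.org URIs, and the predicate names are all hypothetical, and a real store would be far more capable.

```python
class TripleStore:
    """A tiny in-memory set of (subject, predicate, object) tuples."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self._triples
            if all(want is None or want == got for want, got in zip(pattern, triple))
        )


# ASSUMPTION: every URI below is a placeholder invented for this example.
AQUIFER = "http://example.org/places/mahomet-aquifer"
ABOUT = "http://example.org/terms/about"

store = TripleStore()
store.add("http://example.org/stories/water-plant", ABOUT, AQUIFER)
store.add("http://example.org/datasets/hydrology", ABOUT, AQUIFER)
store.add("http://example.org/stories/water-plant", "http://example.org/terms/format", "audio")

# Traverse: which resources, from any domain, are about the aquifer?
related = store.match(predicate=ABOUT, obj=AQUIFER)
for subject, _, _ in related:
    print(subject)
```

The point is that the story and the hydrology data set never had to share a metadata format; they only had to end up pointing at the same URI.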
Several software projects are working on tools to do this. Project Tupelo can harvest existing URI schemes, or in their absence create them for a given set of data. We’re attempting to test Tupelo with our content collections, and hope to have something to report later this year.
But I bet there are other tools and approaches, and maybe it’s simpler than I’m seeing it. If you know something about this please respond by commenting on this post, or starting a new one.
PBCore in Drupal Supports Customizing Elements End Users See
Don’t look now, but the PBCore module I added to Drupal.org is now #3 in results for pbcore on Google, bumping pbcoreresources.org to #4. That module wasn’t intended to be the definitive PBCore module for Drupal, but a conversation starter to help locate users and developers with similar needs. This approach of releasing modules that aren’t ready for users and developing in public has worked really well for the MERCI (Manage Equipment Reservations, Checkout and Inventory) and Creative Commons modules. I’m hoping to find additional Drupal users and developers interested in PBCore so I’m not writing code that only serves our needs… which is basically all the module does so far.
The new PBCore module does 4 things:
1. Populates tables with the full list of items for the genre, rating, and language elements that adhere to the PBCore standard. This creates the shared vocabulary required to facilitate sharing content.
2. Allows each site to enable/disable items within an element to meet their local needs.
3. Allows each site to customize the description of items within an element to meet their local needs.
4. Provides functions that can be added to CCK fields to populate the select list with the enabled items. Users see the customized descriptions, but the value stored in the database is the standard PBCore value.
After talking to Andrew Feigenson and Ed Leonard about PBCore 2.0, I’m thinking of adding a 5th feature…
5. An option to report the status and customizations of PBCore items to a centralized location.
I’ve posted a detailed explanation about why we’re customizing PBCore terms, but it can be summed up by the Gay/lesbian genre item. We customize that to Gender/Sexuality everywhere a user would see it, but store the actual PBCore term in the database. This works well to appease our producers creating content in that genre, who find limiting the description of their work to Gay/lesbian insulting. I can’t imagine who would argue that Gay/lesbian shouldn’t be deprecated and Gender/Sexuality added, but in cases where the change is less obvious, having data from sites using the PBCore module would be helpful. If hundreds of sites are customizing Gay/lesbian and no one is enabling French, that would be a logical place to start revising genre.
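For anyone who wants to see the idea rather than read about it, here is a small sketch of the enable/customize/report logic. It is written in Python purely for illustration and is not the Drupal module's code; the data layout and function names are assumptions, and only the two genre items mentioned above are included.

```python
# Per-site settings keyed by the standard PBCore value.
# ASSUMPTION: this layout is invented for the sketch, not the module's schema.
GENRE_ITEMS = {
    "Gay/lesbian": {"enabled": True, "label": "Gender/Sexuality"},
    "French": {"enabled": False, "label": None},
}


def select_options(items):
    """Return (stored_value, displayed_label) pairs for enabled items only."""
    return [
        (value, settings["label"] or value)
        for value, settings in items.items()
        if settings["enabled"]
    ]


def customization_report(items):
    """Summarize local changes, the kind of data a 5th feature might send."""
    return {
        "disabled": sorted(v for v, s in items.items() if not s["enabled"]),
        "relabeled": {v: s["label"] for v, s in items.items() if s["label"]},
    }


options = select_options(GENRE_ITEMS)
report = customization_report(GENRE_ITEMS)
print(options)
print(report)
```

Users pick from the displayed labels, the standard value is what gets stored, and the report carries just enough to show which items sites are disabling or renaming.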
Any thoughts on the approach of allowing customizations? Sharing status and customizations of items?
Use of PBCore in the American Archive Pilot Project
Illinois Public Media was one of the 20-some public TV and Radio stations in the CPB-funded American Archive Pilot Project. The AAPP required participating stations to use PBCore as a metadata format, at least in principle. I decided to push implementation of PBCore in my AAPP content collection as far as possible using the toolset I used on a previous video archive project (Prairiefire on WILL-TV).
This toolset is based on the website Content Management System called ExpressionEngine, which makes setting up a particular database structure rather easy. I set up the database structure based on PBCore elements, with controlled vocabularies reflecting the AAPP taxonomy and suggested PBCore picklists. I then created xml templates in ExpressionEngine to render my AAPP collection metadata as valid PBCore records. I then went one step further, following discussions with Dan Jacobson and David Rice, and created a PBCoreCollection wrapper containing all 235 of the PBCore item records (each as a PBCoreDescriptionDocument) in my collection. The national portal for the AAPP, being developed and hosted at Oregon Public Broadcasting, was able to simply ingest the PBCoreCollection, demonstrating the viability of this approach to aggregating a large collection from multiple content sources.
This article details the methods used to accomplish this in ExpressionEngine. Similar methods could be used in Drupal, which we’re working on now.
In ExpressionEngine, one can easily define a set of fields to input data. For example a blog would need fields for a Title, a Body, and maybe a separate Image upload field along with a label field for the image (so you could add a caption or an alt tag at least). When you create these fields, you also pick a field type: textarea, dropdown list, file upload, etc. EE has several pre-defined field types and there are dozens of addons from third-party developers to add more.
One of the really great EE addons is FieldFrame, developed by Brandon Kelly. FieldFrame is a framework for developing new EE fieldtypes, and there are a bunch of good ones. The most important for our EE PBCore tool is called FF Matrix, which allows you to bundle several fields in a “row” of related data.
Here’s the way you create an FF Matrix field in ExpressionEngine:
With an FF Matrix field, you can do things like enter a PBCore subject tied to a subjectAuthorityUsed, or a title along with its titleType. Since most PBCore elements are wrapped in pairs like this, it’s important to solve this in a straightforward way. With FF Matrix, you can enter as many linked pairs as needed. For example, with many subject terms you want to have each term wrapped individually along with its corresponding subjectAuthorityUsed.
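The row-to-element mapping can be sketched outside of ExpressionEngine too. The Python below is an illustration only, not how FF Matrix stores or renders data: it treats each matrix row as a dictionary and wraps it in its own pbcoreSubject element. The second row is made up for the example.

```python
import xml.etree.ElementTree as ET

# Each dict stands in for one matrix "row" of related fields.
rows = [
    {
        "subject": "African American Civil Rights",
        "subjectAuthorityUsed": "American Archive Civil Rights Subject Categories",
    },
    # ASSUMPTION: an invented second row to show repeated pairs.
    {"subject": "Oral history", "subjectAuthorityUsed": "Local keywords"},
]


def rows_to_elements(matrix_rows, wrapper_tag):
    """Wrap every row in its own element so each pair stays together."""
    elements = []
    for row in matrix_rows:
        wrapper = ET.Element(wrapper_tag)
        for field_name, value in row.items():
            ET.SubElement(wrapper, field_name).text = value
        elements.append(wrapper)
    return elements


subject_elements = rows_to_elements(rows, "pbcoreSubject")
for element in subject_elements:
    print(ET.tostring(element, encoding="unicode"))
```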
Here’s the PBCore Item entry form showing a number of such fields (but not the entire form which is a bit long):
We used this form to enter all the Intellectual Content and Intellectual Property metadata for each media item. Nothing in this Item form relates to the physical or digital Instantiation of that item. For that we used a different form with fields and fieldtypes defined specifically for Instantiation metadata. Here’s the fun part: One of the fieldtypes in the Instantiation form is a “relationship” field, which allows you to select an existing Item to which the Instantiation should be linked. So if you have several Instantiations of the same Item, like a wav file, an mp3, and an analog tape, you create an Instantiation record for each and link them to the Item.
This proved to be a quick and effective way to link multiple Instantiations with a single Item.
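In data-structure terms this is a plain one-to-many relationship. Here is a minimal Python sketch of it, assuming simplified record shapes that are mine and not ExpressionEngine's: each Instantiation carries the id of the Item it belongs to.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: int
    title: str


@dataclass
class Instantiation:
    item_id: int  # the "relationship" field pointing at an existing Item
    carrier: str


def instantiations_for(item, all_instantiations):
    """Collect every Instantiation linked to the given Item."""
    return [inst for inst in all_instantiations if inst.item_id == item.item_id]


# ASSUMPTION: ids and titles are invented for the example.
interview = Item(item_id=1, title="Example interview")
other = Item(item_id=2, title="Another item")

instantiations = [
    Instantiation(item_id=1, carrier="wav file"),
    Instantiation(item_id=1, carrier="mp3"),
    Instantiation(item_id=1, carrier="analog tape"),
    Instantiation(item_id=2, carrier="mp3"),
]

linked = instantiations_for(interview, instantiations)
print([inst.carrier for inst in linked])
```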
You might be able to see that some of the fields are blank, and their instructions say things like “formatDataRate - If MP3 file don’t enter anything.” Lots of the technical metadata, like formatFileSize, could be extracted automatically from the digital files by the system, so we don’t have to enter that data by hand. EE has a nice addon called MP3 Info + that does most of that work.
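As a trivial example of the kind of thing a system can fill in by itself, here is a Python sketch that reads a file's size from disk. It is not how MP3 Info + works; reporting formatFileSize in bytes is my assumption, and the sample file is a throwaway created just for the demonstration.

```python
import os
import tempfile


def technical_metadata(path):
    """Read what the filesystem already knows about a media file."""
    # ASSUMPTION: the size is reported in bytes, as a string.
    return {"formatFileSize": str(os.path.getsize(path))}


# Create a stand-in "media file" of 2048 bytes for the demonstration.
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as handle:
    handle.write(b"\x00" * 2048)
    sample_path = handle.name

metadata = technical_metadata(sample_path)
os.remove(sample_path)  # clean up the throwaway file
print(metadata)
```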
David Rice has developed better methods of reading file metadata into his PBCore Records Repository using a free tool called MediaInfo. We should get him to write more about that, as it’s work that could be leveraged and used in different systems I’m sure.
After entering all the metadata for our collection using the two forms above, the payoff is in rendering everything in usable form. Since it’s all in the CMS, it’s a simple matter to make a website displaying everything, and providing media players for the files. In fact we did this initially for the catalogers so they could work remotely and listen to and view the audio and video files.
This site was intended for that purpose: http://will.illinois.edu/metadata/aapp-inventory-all/.
As the catalogers added descriptive metadata, the site became much more interesting! We added as much descriptive stuff as possible, even full tape logs for some of the World War II oral history interviews. I chose not to display all that metadata on the web page, but it is rendered in the PBCore XML record for each item.
For example, here is a web page for one such interview: http://will.illinois.edu/metadata/aapp-inventory-all/WWII_oral_history_WesleyMatthews2008-02-21
And here is the PBCore record for the same interview: http://will.illinois.edu/metadata/pbcoreAAPP/wwii_oral_history_wesleymatthews2008-02-21
The way these are rendered is simple: an html template for the web page, and an xml template for the PBCore record, both drawing from the same database. In ExpressionEngine this is very simple to set up, and once it’s set up, you’re done.
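The same pattern is easy to sketch outside of EE. In the Python below, one stored record feeds two "templates", one for the web page and one for a PBCore fragment. The record values and function names are made up for illustration, and the fragment only shows the title pair, not a full record.

```python
from html import escape as html_escape
from xml.sax.saxutils import escape as xml_escape

# One stored record. ASSUMPTION: values are invented for illustration.
record = {"title": "Farm & Home interview", "title_type": "Episode"}


def render_html(entry):
    """The web page 'template': a heading for human readers."""
    return f"<h1>{html_escape(entry['title'])}</h1>"


def render_pbcore(entry):
    """The XML 'template': the same data as a PBCore title pair."""
    return (
        "<pbcoreTitle>"
        f"<title>{xml_escape(entry['title'])}</title>"
        f"<titleType>{xml_escape(entry['title_type'])}</titleType>"
        "</pbcoreTitle>"
    )


html_page = render_html(record)
pbcore_xml = render_pbcore(record)
print(html_page)
print(pbcore_xml)
```

Both outputs escape the ampersand, which is the sort of detail a template layer should handle so catalogers never have to think about it.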
Finally, as mentioned above I chose to try implementing the idea of a PBCoreCollection wrapper element, enclosing all 235 of the individual PBCoreDescriptionDocuments in my AAPP media collection. This is, of course, not a valid wrapper element in any PBCore version to date. This experience suggests that it should be. OPB was able to ingest my entire collection in a single gulp from this URL. Other stations in the AAPP were able to export using the same method (PBCoreCollection) even though they have different local systems. The ability to render a PBCoreCollection is all that matters, not the underlying system that rendered it.
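Here is a minimal Python sketch of both sides of that exchange: one system wraps its records in a PBCoreCollection, and another parses the result and walks the documents inside. The item titles are placeholders, each document is cut down to a single title, and the real records obviously carry far more.

```python
import xml.etree.ElementTree as ET


def build_collection(titles):
    """Wrap one PBCoreDescriptionDocument per title in a single collection."""
    collection = ET.Element("PBCoreCollection")
    for title_text in titles:
        document = ET.SubElement(collection, "PBCoreDescriptionDocument")
        wrapper = ET.SubElement(document, "pbcoreTitle")
        ET.SubElement(wrapper, "title").text = title_text
    return collection


# Exporting side. ASSUMPTION: placeholder titles stand in for real items.
collection = build_collection(["Item one", "Item two", "Item three"])
xml_text = ET.tostring(collection, encoding="unicode")

# Ingesting side: parse the single document and iterate over the records.
parsed = ET.fromstring(xml_text)
documents = parsed.findall("PBCoreDescriptionDocument")
for document in documents:
    print(document.findtext("pbcoreTitle/title"))
```

The ingesting side never needs to know what produced the XML, which is the whole appeal of the wrapper.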
I hope this is useful to anyone who might be looking for systems for cataloging media assets and doing various things with them like creating websites and PBCore records or whatever metadata format. I used ExpressionEngine but the basic method would work with Drupal, Plone, and other CMSs and frameworks. Most importantly, regardless of the system used, I hope this demonstration of the power of PBCoreCollection informs the development of PBCore 2.0, which is now in progress.
Time to get funky with PBCore
Yesterday somebody asked me "Is anything really happening with PBCore? Or is it a nice idea that CPB funded and then left hanging out to dry?" The answer seems to be yes, and maybe.
I'm aware of several significant PBCore projects, mostly below the CPB radar:
- An open source media player that will ingest content and metadata via PBCore records
- A Drupal profile that will include PBCore among other methods for exchanging media
- A project to build PBCore modules for other CMSs including ExpressionEngine and Joomla
- The folks at NPR Online are adding PBCore as an output format for the NPR API
- A preservation repository for media using PBCore as its metadata foundation
I also just saw a CPB RFP for STEM projects relating to climate science, requiring the use of PBCore for all project media.
Meanwhile, OPB is tackling the next phase of the American Archive project, which could play a large role in shaping the future of PBCore. This is critical, because without a formal change-management process, active development, and support, A/V archivists and online media developers aren't likely to have confidence that PBCore will become a common standard for the long term.
I think it should be, because PBCore is simply a great standard for A/V metadata. It's simple enough for most people to understand, but detailed enough to be truly useful. But the PBCore project needs further work, including refining the controlled vocabularies for subjects, genres, and probably everything else. The PBCore Resource Group has been dormant, and I don't see evidence that anyone else has officially taken the reins. Correct me if I'm wrong please.
I suspect this is the year that PBCore either sinks or swims. There are lots of good reasons it should emerge as a common standard, and lots of "things" being developed around it. The question is, who will take responsibility for maintaining the PBCore standard?
PBCore Tool Quest
Back in 2005 I became somewhat obsessed with cataloging media on the job here at WILL Public Media. As the website manager, people would do things like hand me a videotape and say "can you put this on the web?" But what's on the tape? It became clear producers had no clue about the importance of recording actual information about their productions. Meanwhile I started learning about metadata and controlled vocabularies, and some of the cool things we could do with structured data on the web. This led me directly to PBCore as the theoretical metadata standard for public broadcasting and beyond. If producers and stations could create PBCore-compliant XML records for their content, we could develop tools for automated exchange of deep information about media, and the media essence as well.
But we can't do much if nobody catalogs their stuff. So the question became what tools to use? Working with a graduate assistant from Library Science here at Illinois (the great Jimi Jones!), we embarked on a PBCore tool quest.
As we began digitizing and cataloging 15 years' worth of the WILL-TV program Prairie Fire, we wanted a tool that would spit out PBCore XML. Jimi and I tried the free MIC database tool, and sorry MIC, but at least back then it was (ahem) a work in progress. We were somehow unaware of the free PBCore FileMaker Pro database, which I guess is pretty OK, and if you've used it feel free to comment. So we tried things like using the open-source Greenstone repository software, which I wouldn't recommend for video unless you like migraines. We even started hand-coding XML using Oxygen, which is enough to make your eyes bleed after a while.
Eventually, though, I started thinking like a web developer, which after all is my job at WILL. We're using a Content Management System to "catalog" media for web pages. We use the same CMS database to output RSS feeds, which are a flavor of XML. Why not add a PBCore flavor of XML? So I added a bunch of fields reflecting the PBCore elements, and asked our producers to provide...a little more detail when adding their content to the website. The result was the new Prairie Fire website, released in January 2007, featuring among other things PBCore records of every episode and segment for every program over the past 15 years.
Since then we've continued to refine our little home-grown CMS-based cataloging tool. Here are a couple of screen shots that kinda show how it works.
This is a form for entering descriptive and administrative metadata based on the PBCore elements. Mary Miller from the Peabody Archives worked with me on this, and she likes to call it the "Platonic Record" of the media object, as it is purely the intellectual record.
This is the input form for the PBCore Instantiation, which is the actual physical or digital object being cataloged. The Instantiation is then linked to the Platonic Record. The result is a complete PBCore-compliant record which looks exactly like this. (If you hit this with Safari, do a View Source to see the XML.) So our CMS can take metadata directly from the Producer to create web pages, RSS feeds/podcasts, and PBCore XML. Why not?
But here we are now in (almost) 2009, and we need more than my cute little CMS solution. Which is why I'm so excited to see the work of Dave Rice and Mike Castleman, who created a much better online system called PBCore Vermicelli. This is a Ruby app that does everything I had set up, plus a lot more. So I wanted to call more attention to it, and suggest that it's a good starting point for much more powerful things to come. What things? I have a few ideas, but I'll pause here for now. Let's talk about it.
PBCore links 2008-11-24
The Open Media Project is being designed to help automate the scheduling, exchanging, ingesting, and cataloging of public access programs, and it incorporates PBCore.
WNET PBCore Record Repository project