Article category: Semantic Web
Sneak preview of PBCore 2.0
If I’ve learned one thing about the PBCore user community, it’s that we’re not satisfied with the current state of PBCore. We’ve used it enough to discover its strengths in describing AV assets and creating shareable metadata, but we keep running into its gaps and flaws. We’ve been pushing for a change process, and have argued for specific changes. Common threads have emerged right here on this site:
- A need for PBCore to support multi-part instantiations, e.g. when you have one complete work made up of several reels or tapes or files.
- A need to express rights information related to a specific Instantiation, instead of only the entire asset. For example, you might want to allow users to download an mpeg4 version of a film for personal use, but not grant the same kind of access to the actual film!
- Speaking of rights, the formatting of the pbcoreRightsSummary element disallows inclusion of metadata from existing standards such as ODRL or Creative Commons, which seems odd to say the least. If you already have structured rights data, why not simply reuse it?
- A need to show relationships between Instantiations. For example, when you digitize a film to 10-bit uncompressed digital video and then encode an mpeg4 file from that 10-bit uncompressed file, it seems important to show that derivation in the PBCore record.
- With pbcoreContributor, you can say that Harrison Ford is an Actor, but you can’t say what role he plays in the film.
- There’s no way to uniquely identify a person, subject term, location, or other value that might have an actual URI.
- The lack of attributes of any kind! Everything is elements and sub-elements, which seems inefficient and makes parsing more difficult.
- The lack of a valid way to identify clip information within an asset, for example where in the timeline a particular subject is discussed or a specific person appears.
- The lack of any way to bundle multiple PBCore XML records together in a feed or collection, so you could export/import large groups of records between systems or use PBCore in RESTful web applications.
Well, good news, folks! PBCore 2.0 is on the way, and it solves all these issues.
Even better, it solves them in a way that doesn’t add complexity for those who want to keep PBCore simple. For example, PBCore 2.0 allows you to use attributes for a subject term to specify a URI, and a startTime and endTime for that term in the media asset timeline. So you could have something like this:
<pbcoreSubject ref="http://en.wikipedia.org/wiki/Hobbit" startTime="00:23:14" endTime="00:24:22">Hobbits</pbcoreSubject>
You can also do this:
<contributor affiliation="NPR" ref="http://en.wikipedia.org/wiki/Michele_Norris">Michele Norris</contributor>
<contributor ref="http://en.wikipedia.org/wiki/Sean_Connery">Sean Connery</contributor>
<contributorRole portrayal="James Bond">Actor</contributorRole>
But the use of the new 2.0 attributes is totally optional. You can keep it simple and use PBCore the same way as before.
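To make the "optional" point concrete, here is a minimal Python sketch of how a consumer might read subject terms whether or not they carry the new attributes. It is only an illustration under stated assumptions: the wrapper element named record and the second subject term ("Wizards") are made up for the example, and the attribute names are taken from the snippet above rather than from the released schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the example above. The <record> wrapper
# and the "Wizards" term are assumptions made for this sketch only.
SAMPLE = """
<record>
  <pbcoreSubject ref="http://en.wikipedia.org/wiki/Hobbit"
                 startTime="00:23:14" endTime="00:24:22">Hobbits</pbcoreSubject>
  <pbcoreSubject>Wizards</pbcoreSubject>
</record>
"""


def to_seconds(timecode):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def read_subjects(xml_text):
    """Return one dict per pbcoreSubject; absent attributes become None."""
    subjects = []
    for element in ET.fromstring(xml_text).iter("pbcoreSubject"):
        start = element.get("startTime")
        end = element.get("endTime")
        subjects.append({
            "term": (element.text or "").strip(),
            "ref": element.get("ref"),
            "start": to_seconds(start) if start else None,
            "end": to_seconds(end) if end else None,
        })
    return subjects


subjects = read_subjects(SAMPLE)
for subject in subjects:
    print(subject)
```

The plain element and the attribute-rich element go through the same code path, which is the sense in which the attributes add capability without forcing anyone to use them.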
Once you get used to the idea of adding attributes, however, you may find it opens up all kinds of new possibilities for your PBCore metadata. For example, the use of URIs to identify values like subject terms, people, and locations is the first step to enabling content to live and breathe in the emerging semantic web/linked data universe. The addition of the optional ref="URI" attribute in the 2.0 schema puts PBCore squarely onto that path.
But I suspect many of the other improvements to PBCore 2.0 will make life easier for all concerned. From what I see, the changes solve the issues people have raised on this site. The folks at WGBH who managed the 2.0 project did additional extensive research and outreach to find out what people using PBCore need, and how best to evolve the schema. And I give a lot of credit to CPB for supporting an open and transparent process. We all contributed to the 2.0 version of PBCore, and our input was taken seriously. I understand the schema will be publicly released soon, and you’ll see.
PBCore has thus far only partly worked as a metadata standard for AV assets and collections: gaps in its earlier versions drove many of us to implement workarounds and hacks. The result was lack of clarity at best, which is not a good thing for a technical standard. The 2.0 PBCore schema probably isn’t perfect, and we’ll all find out more as we learn about it and begin our own implementations. But in my view it takes PBCore to a much higher state of functionality and flexibility, while retaining its simplicity and its humble origins as a child of Dublin Core.
PBCore and the Semantic Web
Recently I had a chance to discuss with Dave Rice, Dan Jacobson, and Chris Beer the potential role of PBCore in enabling content to flow in the “semantic web.” The topic has also come up in discussions around the development of PBCore 2.0. It seems like a good idea to open the topic for broader consideration.
The Semantic Web would most likely be implemented through Linked Data. Using URIs and RDF, “Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.” Those other methods would include linking things together by hand, piece by piece, which webmasters of the world can tell you doesn’t exactly scale well.
So how does PBCore fit into linked data using URIs and RDF? I don’t have anything close to a complete answer, just ideas that came from the discussion. I would love some feedback to take this thing further.
Linked Data depends upon having Uniform Resource Identifiers for content items, their descriptive/administrative metadata, and their relationships to other items and descriptors. Every resource should have a URI, whether the resource is a file or a subject term. Relationships between resources also get URIs, such as “friend” or “author”. With URIs for everything, you can establish so-called Triples (subject -> predicate -> object) as machine-processable links, which then establish other machine-processable links to other resources, and the Linked Data universe scales up from there based on network effects. These Triples and their relationships with other resources are made possible by RDF statements. How do we create RDF statements using PBCore?
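As a rough illustration of the Triple idea, the Python sketch below models each statement as a (subject, predicate, object) tuple of URIs and follows links from one resource to the next. Every URI here is a placeholder under example.org that I made up for the example, and the helper names are hypothetical. The last function writes each statement in the N-Triples line format, one of the standard RDF serializations.

```python
from collections import namedtuple

# A triple is just three URIs: subject -> predicate -> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Placeholder URIs (assumptions for this sketch, not real identifiers).
PROGRAM = "http://example.org/program/1"
WRITER = "http://example.org/person/1"
COLLEAGUE = "http://example.org/person/2"
AUTHOR = "http://example.org/relation/author"
FRIEND = "http://example.org/relation/friend"

triples = [
    Triple(PROGRAM, AUTHOR, WRITER),
    Triple(WRITER, FRIEND, COLLEAGUE),
]


def objects_of(subject, predicate, statements):
    """Return every object linked from subject by predicate."""
    return [t.object for t in statements
            if t.subject == subject and t.predicate == predicate]


def to_ntriples(triple):
    """Serialize one all-URI triple as an N-Triples line."""
    return "<{}> <{}> <{}> .".format(*triple)


# Follow two links: from the program to its authors, then to their friends.
authors = objects_of(PROGRAM, AUTHOR, triples)
friends_of_authors = [friend for author in authors
                      for friend in objects_of(author, FRIEND, triples)]

for triple in triples:
    print(to_ntriples(triple))
print(friends_of_authors)
```

Because the object of one statement can be the subject of another, a machine can hop from resource to resource without anyone linking them by hand.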
At the most basic level, a PBCore record may already contain many URIs. For example, in the case of a streaming media file, a valid URI might be http://will.illinois.edu/media/mp3/illinoisbroadcastarchives_040T_96k.mp3. This file is described by elements like:
<pbcoreSubject>
  <subject>African American Civil Rights</subject>
  <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
</pbcoreSubject>
<pbcoreContributor>
  <contributor>Hughes, Langston</contributor>
  <contributorRole>Lecturer</contributorRole>
</pbcoreContributor>
Do we have URIs for “African American Civil Rights,” “American Archive Civil Rights Subject Categories,” “Hughes, Langston” and “Lecturer”? They aren’t currently in this PBCore document but perhaps they could be if we did something radical (for PBCore) like adding attributes. We could then tap into existing namespaces or create our own as needed, using and assigning URIs for each and every element in our PBCore records.
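Here is one way that could look in practice, as a hedged Python sketch rather than a proposal for the schema. It attaches a ref attribute to any value found in a lookup table and reports the values that have no URI yet. The record wrapper element, the placement of ref on the subject and contributor elements, the lookup table, and the example.org subject URI are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The elements from the example above, wrapped in a hypothetical <record>.
RECORD = """
<record>
  <pbcoreSubject>
    <subject>African American Civil Rights</subject>
    <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
  </pbcoreSubject>
  <pbcoreContributor>
    <contributor>Hughes, Langston</contributor>
    <contributorRole>Lecturer</contributorRole>
  </pbcoreContributor>
</record>
"""

# Illustrative lookup table; the subject URI is a made-up placeholder.
KNOWN_URIS = {
    "Hughes, Langston": "http://en.wikipedia.org/wiki/Langston_Hughes",
    "African American Civil Rights":
        "http://example.org/subjects/african-american-civil-rights",
}


def add_refs(xml_text, known_uris,
             tags=("subject", "contributor", "contributorRole")):
    """Set ref="URI" where a URI is known; collect the values without one."""
    root = ET.fromstring(xml_text)
    missing = []
    for element in root.iter():
        if element.tag not in tags:
            continue
        value = (element.text or "").strip()
        uri = known_uris.get(value)
        if uri:
            element.set("ref", uri)
        else:
            missing.append(value)
    return root, missing


root, missing = add_refs(RECORD, KNOWN_URIS)
print(ET.tostring(root, encoding="unicode"))
print("No URI found for:", missing)
```

Values without a known URI are left exactly as they were, so a record stays usable even when only some of its terms can be identified.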
But this adds a great deal of additional work and complication. PBCore was invented to provide interoperability among public media systems, and if we all used plain old PBCore it would work for that purpose. If we raise the bar too high and require URIs, fewer people will be able to adopt PBCore in their local workflows and systems.
Yet it would be very useful to add URIs to key elements like subject and genre terms, contributors, creators, and publishers. We could even apply this to the currently almost meaningless pbcoreRightsSummary, for example using Creative Commons license URIs.
Such an approach could be incredibly powerful, but I believe it should be optional. URI attributes should be allowed but not required in PBCore 2.0. We might also create PBCore Profiles which use URIs in specific namespaces; for example in a future American Archive PBCore Profile there could be subject and genre namespaces facilitating a common taxonomy. This could solve one of the thorniest problems of interoperability in different content collections.
But I believe content creators in pubmedia will never universally adopt the same taxonomies, nor should they. After all, we’re probably all dealing with a babel of descriptive metadata and controlled vocabularies generated by various legacy systems and catalog records, and we can’t make them conform to a new common standard. I also like the idea of harvesting user-generated keywords for my content and adding that to the metadata mix. Also, PBCore Profiles might work within the PBCore user community, but what about linking to other content domains? I want my data to interoperate not only inside pubmedia, but also with related data anywhere. A story on the new water plant drawing from the Central Illinois Mahomet Aquifer could link to historical information, climate records, corporate earnings reports, and huge data sets on the geography and hydrology of the region. Each of these domains has tons of public data but it’s all in different formats and none of it is PBCore. Almost none of it is RDF.
This leads to the notion that something else should create the RDF. It would harvest whatever it could from available data and build an explorable triple store. This would enable other applications to traverse, search, and build interesting interfaces to a growing universe of linked data.
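To sketch what such a harvester might do, the Python below reads the PBCore elements from the earlier example, reuses a ref URI where one exists, mints a placeholder URI where none does, and loads the resulting statements into a tiny in-memory store that can be queried by pattern. This is a toy under stated assumptions, not a description of any particular tool: the example.org base, the predicate URIs, the record wrapper, and the class and function names are all invented for illustration, and labels are stored as plain strings rather than proper RDF literals.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

MEDIA_URI = ("http://will.illinois.edu/media/mp3/"
             "illinoisbroadcastarchives_040T_96k.mp3")
BASE = "http://example.org/"  # placeholder namespace (assumption)

RECORD = """
<record>
  <pbcoreSubject><subject>African American Civil Rights</subject></pbcoreSubject>
  <pbcoreContributor><contributor>Hughes, Langston</contributor></pbcoreContributor>
</record>
"""


def mint_uri(kind, label):
    """Create a placeholder URI for a value that has none of its own."""
    slug = label.strip().lower().replace(" ", "-")
    return BASE + kind + "/" + quote(slug)


class TripleStore:
    """A minimal in-memory set of (subject, predicate, object) statements."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return the triples matching a pattern; None acts as a wildcard."""
        return sorted(t for t in self._triples
                      if (s is None or t[0] == s)
                      and (p is None or t[1] == p)
                      and (o is None or t[2] == o))


def harvest(xml_text, asset_uri, store):
    """Turn subject and contributor elements into statements about the asset."""
    root = ET.fromstring(xml_text)
    for tag in ("subject", "contributor"):
        for element in root.iter(tag):
            label = (element.text or "").strip()
            uri = element.get("ref") or mint_uri(tag, label)
            store.add(asset_uri, BASE + "predicate/" + tag, uri)
            store.add(uri, BASE + "predicate/label", label)


store = TripleStore()
harvest(RECORD, MEDIA_URI, store)
for triple in store.match(s=MEDIA_URI):
    print(triple)
```

The point of the design is that the source records stay as they are. The URI minting and linking happen downstream, so collections that never adopt URIs can still take part.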
Several software projects are working on tools to do this. Project Tupelo can harvest existing URI schemes, or in their absence create them for a given set of data. We’re attempting to test Tupelo with our content collections, and hope to have something to report later this year.
But I bet there are other tools and approaches, and maybe it’s simpler than I’m seeing it. If you know something about this please respond by commenting on this post, or starting a new one.