Article category: Taxonomy
PBCore and the Semantic Web
Recently I had a chance to discuss with Dave Rice, Dan Jacobson, and Chris Beer the potential role of PBCore in enabling content to flow in the “semantic web.” The topic has also come up in discussions around the development of PBCore 2.0. It seems like a good idea to open the topic for broader consideration.
The Semantic Web would most likely be implemented through Linked Data. Using URIs and RDF, “Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.” Those other methods include linking things together by hand, piece by piece, which, as the webmasters of the world can tell you, doesn’t exactly scale well.
So how does PBCore fit into linked data using URIs and RDF? I don’t have anything close to a complete answer, just ideas that came from the discussion. I would love some feedback to take this thing further.
Linked Data depends upon having Uniform Resource Identifiers for content items, their descriptive/administrative metadata, and their relationships to other items and descriptors. Every resource should have a URI, whether the resource is a file or a subject term. Relationships between resources also get URIs, such as “friend” or “author”. With URIs for everything, you can establish so-called Triples (subject -> predicate -> object) as machine-processable links, which then establish other machine-processable links to other resources, and the Linked Data universe scales up from there based on network effects. These Triples and their relationships with other resources are made possible by RDF statements. How do we create RDF statements using PBCore?
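To make the idea of a Triple concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a recommended modeling: the example.org URIs for the media item, the subject term, and the person are placeholders, and the choice of Dublin Core Terms predicates is just one plausible option. Each statement is printed in N-Triples syntax, one of the plain-text ways to write RDF.

```python
# Each triple is (subject, predicate, object); every part is a URI in this sketch.
# The example.org URIs are hypothetical placeholders, not real identifiers.
ITEM = "http://example.org/media/lecture.mp3"
CIVIL_RIGHTS = "http://example.org/subjects/african-american-civil-rights"
HUGHES = "http://example.org/people/hughes-langston"

# Predicates borrowed from Dublin Core Terms (an assumption; other vocabularies would work).
DC_SUBJECT = "http://purl.org/dc/terms/subject"
DC_CONTRIBUTOR = "http://purl.org/dc/terms/contributor"

triples = [
    (ITEM, DC_SUBJECT, CIVIL_RIGHTS),
    (ITEM, DC_CONTRIBUTOR, HUGHES),
]


def to_ntriples(triples):
    """Serialize URI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(triples))
```

Because the subject term and the person each have their own URI, any other record that points at the same URIs is automatically linked to this one.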
At the most basic level, a PBCore record may already contain many URIs. For a streaming media file, for example, a valid URI might be http://will.illinois.edu/media/mp3/illinoisbroadcastarchives_040T_96k.mp3. This file is described by elements like:
<pbcoreSubject>
  <subject>African American Civil Rights</subject>
  <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
</pbcoreSubject>
<pbcoreContributor>
  <contributor>Hughes, Langston</contributor>
  <contributorRole>Lecturer</contributorRole>
</pbcoreContributor>
Do we have URIs for “African American Civil Rights,” “American Archive Civil Rights Subject Categories,” “Hughes, Langston” and “Lecturer”? They aren’t currently in this PBCore document but perhaps they could be if we did something radical (for PBCore) like adding attributes. We could then tap into existing namespaces or create our own as needed, using and assigning URIs for each and every element in our PBCore records.
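As a rough sketch of what adding attributes could look like, the Python below attaches a URI to each element whose text matches a known term. Everything specific here is an assumption for illustration: the attribute name "ref", the example.org URIs, and the lookup table are hypothetical, and the record is simplified (no XML namespace).

```python
import xml.etree.ElementTree as ET

RECORD = """<pbcoreDescriptionDocument>
  <pbcoreSubject>
    <subject>African American Civil Rights</subject>
    <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
  </pbcoreSubject>
  <pbcoreContributor>
    <contributor>Hughes, Langston</contributor>
    <contributorRole>Lecturer</contributorRole>
  </pbcoreContributor>
</pbcoreDescriptionDocument>"""

# Hypothetical term-to-URI lookup; example.org stands in for a real namespace.
TERM_URIS = {
    "African American Civil Rights": "http://example.org/subjects/african-american-civil-rights",
    "Hughes, Langston": "http://example.org/people/hughes-langston",
    "Lecturer": "http://example.org/roles/lecturer",
}


def add_uri_attributes(xml_text, term_uris, attr="ref"):
    """Attach a URI attribute to every element whose text has a known URI."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        term = (element.text or "").strip()
        if term in term_uris:
            element.set(attr, term_uris[term])
    return root


root = add_uri_attributes(RECORD, TERM_URIS)
print(ET.tostring(root, encoding="unicode"))
```

Elements with no entry in the lookup, such as subjectAuthorityUsed here, are left untouched, so the human-readable text stays in place alongside any URIs.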
But this adds a great deal of additional work and complication. PBCore was invented to provide interoperability among public media systems, and if we all used plain old PBCore it would work for that purpose. If we raise the bar too high and require URIs, fewer people will be able to adopt PBCore in their local workflows and systems.
Yet it would be very useful to add URIs to key elements like subject and genre terms, contributors, creators, and publishers. We could even apply this to the currently almost meaningless pbcoreRightsSummary, for example using Creative Commons license URIs.
Such an approach could be incredibly powerful, but I believe it should be optional. URI attributes should be allowed but not required in PBCore 2.0. We might also create PBCore Profiles which use URIs in specific namespaces; for example in a future American Archive PBCore Profile there could be subject and genre namespaces facilitating a common taxonomy. This could solve one of the thorniest problems of interoperability in different content collections.
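One way to picture "allowed but not required" is from the consuming side. The sketch below, again in Python and again assuming a hypothetical "ref" attribute and example.org URIs, reads subjects from a record and uses the URI as the key when one is present, falling back to the plain (authority, term) pair when it is not.

```python
import xml.etree.ElementTree as ET

# Simplified record: the first subject carries a hypothetical URI attribute, the second does not.
MIXED_RECORD = """<pbcoreDescriptionDocument>
  <pbcoreSubject>
    <subject ref="http://example.org/subjects/african-american-civil-rights">African American Civil Rights</subject>
    <subjectAuthorityUsed>American Archive Civil Rights Subject Categories</subjectAuthorityUsed>
  </pbcoreSubject>
  <pbcoreSubject>
    <subject>Poetry</subject>
    <subjectAuthorityUsed>Local Keywords</subjectAuthorityUsed>
  </pbcoreSubject>
</pbcoreDescriptionDocument>"""


def subject_keys(xml_text, attr="ref"):
    """Return one key per subject: the URI when present, else (authority, term)."""
    root = ET.fromstring(xml_text)
    keys = []
    for block in root.iter("pbcoreSubject"):
        subject = block.find("subject")
        if subject is None:
            continue
        uri = subject.get(attr)
        if uri:
            keys.append(uri)
        else:
            authority = (block.findtext("subjectAuthorityUsed") or "").strip()
            keys.append((authority, (subject.text or "").strip()))
    return keys


print(subject_keys(MIXED_RECORD))
```

A plain PBCore record with no URIs at all still works with this reader; it simply produces text keys instead of links.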
But I believe content creators in pubmedia will never universally adopt the same taxonomies, nor should they. After all, we’re probably all dealing with a Babel of descriptive metadata and controlled vocabularies generated by various legacy systems and catalog records, and we can’t make them conform to a new common standard. I also like the idea of harvesting user-generated keywords for my content and adding them to the metadata mix. And while PBCore Profiles might work within the PBCore user community, what about linking to other content domains? I want my data to interoperate not only inside pubmedia, but also with related data anywhere. A story on the new water plant drawing from the Central Illinois Mahomet Aquifer could link to historical information, climate records, corporate earnings reports, and huge data sets on the geography and hydrology of the region. Each of these domains has tons of public data, but it’s all in different formats and none of it is PBCore. Almost none of it is RDF.
This leads to the notion that something else should create the RDF. It would harvest whatever it could from available data and build an explorable triple store. This would enable other applications to traverse, search, and build interesting interfaces to a growing universe of linked data.
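The harvesting idea can be sketched as a toy in Python. This is not how any particular tool works; it only shows the shape of the approach: read whatever records are available, mint a URI for each term that lacks one, collect the resulting triples in a set, and answer simple pattern queries. The example.org namespace, the predicate URIs, and the second record are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # hypothetical namespace for minted URIs


def mint_uri(kind, label):
    """Create a URI from a label when the source data supplies none."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return f"{BASE}{kind}/{slug}"


def harvest(item_uri, xml_text):
    """Turn the subject and contributor elements of a record into triples."""
    root = ET.fromstring(xml_text)
    triples = set()
    for tag, kind in (("subject", "subjects"), ("contributor", "people")):
        for element in root.iter(tag):
            label = (element.text or "").strip()
            if label:
                triples.add((item_uri, f"{BASE}terms/{tag}", mint_uri(kind, label)))
    return triples


def match(store, s=None, p=None, o=None):
    """Return sorted triples matching a pattern; None acts as a wildcard."""
    pattern = (s, p, o)
    return sorted(
        t for t in store
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


LECTURE = """<pbcoreDescriptionDocument>
  <pbcoreSubject><subject>African American Civil Rights</subject></pbcoreSubject>
  <pbcoreContributor><contributor>Hughes, Langston</contributor></pbcoreContributor>
</pbcoreDescriptionDocument>"""

# A made-up second record that shares a subject with the first.
OTHER = """<pbcoreDescriptionDocument>
  <pbcoreSubject><subject>African American Civil Rights</subject></pbcoreSubject>
</pbcoreDescriptionDocument>"""

store = harvest("http://will.illinois.edu/media/mp3/illinoisbroadcastarchives_040T_96k.mp3", LECTURE)
store |= harvest("http://example.org/media/other-item.mp3", OTHER)

# Which items share the civil rights subject?
related = match(store, o=mint_uri("subjects", "African American Civil Rights"))
for item, _, _ in related:
    print(item)
```

Because both records yield the same minted URI for the shared subject, the query finds both items without anyone linking them by hand.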
Several software projects are working on tools to do this. Project Tupelo can harvest existing URI schemes, or in their absence create them for a given set of data. We’re attempting to test Tupelo with our content collections, and hope to have something to report later this year.
But I bet there are other tools and approaches, and maybe it’s simpler than I’m seeing it. If you know something about this please respond by commenting on this post, or starting a new one.
PBCore subject and pbcoreSubjectAuthorityUsed: Adding subject authorities
In some ways it's great that PBCore is so agnostic about using specific subject terms and authorities, but it also makes exchanging records between systems too unpredictable. If I say the subject is Climate Change, and your system uses Global Warming, we have a problem communicating between systems. PBCore.org doesn't even suggest any subject taxonomies, leaving users to fish for one or invent their own.
Here's a proposal to address this: Let's pick a few subject authorities as a starting point. Certain applications of PBCore may need different subject authorities and that's fine, they can be added. The list of possible subject authorities doesn't have to be written into the standard, but a few suggestions might help form usage patterns, preferences, and perhaps eventually best practices for certain types of content.
For radio news stories, for example, we might use the NPR All Topics list: http://api.npr.org/list?id=3002. I want to use this to pull in related content from the NPR API, so if I tag my content using terms from the NPR All Topics list, I can build an automated query based on those topics. More on that soon....
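As a hedged sketch of what such an automated query might look like, the Python below only builds a URL and does not contact any server. The query endpoint and the parameter names (id, numResults, output, apiKey) are assumptions about the NPR API that should be checked against its documentation, and the topic IDs and API key shown are placeholders, not real values.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the NPR API documentation before use.
NPR_QUERY_ENDPOINT = "http://api.npr.org/query"


def build_npr_query(topic_ids, api_key, num_results=5):
    """Build a query URL for stories tagged with the given topic IDs."""
    params = {
        "id": ",".join(str(topic_id) for topic_id in topic_ids),
        "numResults": num_results,
        "output": "JSON",
        "apiKey": api_key,
    }
    return f"{NPR_QUERY_ENDPOINT}?{urlencode(params)}"


# Placeholder topic IDs and key, for illustration only.
url = build_npr_query([111, 222], api_key="YOUR_API_KEY")
print(url)
```

In practice the topic IDs would come from the subject terms stored in the PBCore record, which is why tagging with terms from the same list matters.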
PBCore Genre Picklist from Hell
Let's be honest: The controlled vocabularies for pbcoreGenre suggested at pbcore.org lack relevance in many cases. I mean, "boat"? The main genre list suggested, "PBCore + Tribune Media Services Genre Categories (TiVo)," is mostly very good as far as it goes. But it doesn't go far enough.
And here's the problem: Because it's on the official PBCore website, it looks to many people like the Official PBCore Genre List. I've spoken with several PBCore users (speak up if you wish) who wanted to use certain genre terms not on the list, but didn't think they would be valid. They are valid, as long as you also declare the genreAuthorityUsed to identify the genre list.
This really matters when you want to exchange stuff between systems that speak PBCore, and you want that stuff to show up in the right places. By using a controlled vocabulary that is common to the systems exchanging the stuff, things work as intended. If I call something "boat" and you're expecting "marine," things fall apart. If I use "Horse" as in the suggested picklist, and your system wants to call it "Equestrian," we have a problem.
So what would move this forward? I'll suggest something: People should compile a genre list for a given PBCore user community (yes these really exist), and document it clearly for that community. Code it into applications (in drop-down lists for example) so everyone selects terms from the same genre list. Name that list, and you've got a valid new PBCore genreAuthorityUsed.
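Coding the list into an application could be as simple as the Python sketch below. The list name and its terms are hypothetical stand-ins for whatever a community agrees on. The function refuses any term outside the shared list and stamps every genre with the list's name as genreAuthorityUsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical community genre list; its name becomes the genreAuthorityUsed value.
GENRE_AUTHORITY = "Example Station Group Genre List"
GENRE_TERMS = {"Documentary", "Equestrian", "Marine", "News"}


def make_genre(term):
    """Build a pbcoreGenre element, rejecting terms outside the shared list."""
    if term not in GENRE_TERMS:
        raise ValueError(f"{term!r} is not in {GENRE_AUTHORITY!r}")
    genre = ET.Element("pbcoreGenre")
    ET.SubElement(genre, "genre").text = term
    ET.SubElement(genre, "genreAuthorityUsed").text = GENRE_AUTHORITY
    return genre


print(ET.tostring(make_genre("Marine"), encoding="unicode"))
```

The same set of terms can feed a drop-down list in a cataloging form, so nobody has to type "boat" when the receiving system expects "Marine".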