Time to get funky with PBCore
Yesterday somebody asked me "Is anything really happening with PBCore? Or is it a nice idea that CPB funded and then left hanging out to dry?" The answer seems to be yes, and maybe.
I'm aware of several significant PBCore projects, mostly below the CPB radar:
- An open source media player that will ingest content and metadata via PBCore records
- A Drupal profile that will include PBCore among other methods for exchanging media
- A project to build PBCore modules for other CMSs, including ExpressionEngine and Joomla
- The folks at NPR Online are adding PBCore as an output format for the NPR API
- A preservation repository for media using PBCore as its metadata foundation
I also just saw a CPB RFP for STEM projects relating to climate science, requiring the use of PBCore for all project media.
Meanwhile, OPB is tackling the next phase of the American Archive project, which could play a large role in shaping the future of PBCore. This is critical, because without a formal change-management process, active development, and support, A/V archivists and online media developers aren't likely to have confidence that PBCore will become a common standard for the long term.
I think it should be, because PBCore is simply a great standard for A/V metadata. It's simple enough for most people to understand, but detailed enough to be truly useful. But the PBCore project needs further work, including refining the controlled vocabularies for subjects, genres, and probably everything else. The PBCore Resource Group has been dormant, and I don't see evidence that anyone else has officially taken the reins. Correct me if I'm wrong please.
I suspect this is the year that PBCore either sinks or swims. There are lots of good reasons it should emerge as a common standard, and lots of "things" being developed around it. The question is, who will take responsibility for maintaining the PBCore standard?
PBCore subject and pbcoreSubjectAuthorityUsed: Adding subject authorities
In some ways it's great that PBCore is so agnostic about using specific subject terms and authorities, but it also makes exchanging records between systems too unpredictable. If I say the subject is Climate Change, and your system uses Global Warming, we have a problem communicating between systems. PBCore.org doesn't even suggest any subject taxonomies, leaving users to fish for one or invent their own.
Here's a proposal to address this: Let's pick a few subject authorities as a starting point. Certain applications of PBCore may need different subject authorities and that's fine, they can be added. The list of possible subject authorities doesn't have to be written into the standard, but a few suggestions might help form usage patterns, preferences, and perhaps eventually best practices for certain types of content.
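Here is a minimal sketch of what declaring an authority alongside each subject could look like when building a record in Python. The element names pbcoreSubject, subject, and subjectAuthorityUsed follow my reading of the PBCore 1.x schema and should be checked against the version you target. The authority names and term lists are hypothetical, made up purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority lists (assumption); real ones would be loaded
# from whatever source the authority itself publishes.
AUTHORITIES = {
    "Example News Topics": {"Environment", "Science", "Politics"},
    "Local Station Subjects": {"Climate Change", "Transit"},
}


def make_subject(term, authority):
    """Build a pbcoreSubject element, refusing terms the authority lacks."""
    if authority not in AUTHORITIES:
        raise ValueError(f"unknown subject authority: {authority}")
    if term not in AUTHORITIES[authority]:
        raise ValueError(f"{term!r} is not in {authority}")
    node = ET.Element("pbcoreSubject")
    ET.SubElement(node, "subject").text = term
    ET.SubElement(node, "subjectAuthorityUsed").text = authority
    return node


xml_text = ET.tostring(
    make_subject("Climate Change", "Local Station Subjects"),
    encoding="unicode",
)
print(xml_text)
```

Because the authority travels with the term, a receiving system can at least tell which list "Climate Change" came from before deciding how to map it.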
For radio news stories, for example, we might use the NPR All Topics list: http://api.npr.org/list?id=3002. I want to use this to pull in related content from the NPR API, so if I tag my content using terms from the NPR All Topics list, I can build an automated query based on those topics. More on that soon....
WNET/Thirteen releases its implementation of a PBCore Cataloging Tool
WNET/Thirteen hereby releases the software of its PBCore Repository Project under the GPLv3 license (http://www.gnu.org/licenses/gpl-3.0.txt). The PBCore Record Repository is an online database tool built on Ruby on Rails, Sphinx search, and MySQL that was created at WNET/Thirteen to facilitate the import, export, search, creation and modification of PBCore records according to the PBCore 1.2.1 standard (http://www.pbcore.org/PBCore/PBCoreXMLSchema.html). For testing and evaluation, a publicly installed version of the application can be found at http://pbcore.vermicel.li (for administrative testing, log in as username=admin and password=secret).
This work employs PBCore. The PBCore (Public Broadcasting Metadata Dictionary) was created by the public broadcasting community in the United States of America for use by public broadcasters and others. Initial development funding for PBCore was provided by the Corporation for Public Broadcasting. The PBCore is built on the foundation of the Dublin Core (ISO 15836), an international standard for resource discovery (http://dublincore.org), and has been reviewed by the Dublin Core Metadata Initiative Usage Board. Copyright: 2005, Corporation for Public Broadcasting.
Further technical documentation can be found here: http://git.mlcastle.net/?p=pbcore.git;a=blob;f=doc/README_FOR_APP;hb=HEAD and a current snapshot of the source code here: http://git.mlcastle.net/?p=pbcore.git;a=snapshot;h=HEAD;sf=tgz. This tool is under development and feedback is appreciated.
David Rice, Digital Media Archivist, WNET/Thirteen
PBCore Genre Picklist from Hell
Let's be honest: The controlled vocabularies for pbcoreGenre suggested at pbcore.org lack relevance in many cases. I mean, "boat"? The main genre list suggested, "PBCore + Tribune Media Services Genre Categories (TiVo)," is mostly very good as far as it goes. But it doesn't go far enough.
And here's the problem: Because it's on the official PBCore website, it looks to many people like the Official PBCore Genre List. I've spoken with several PBCore users (speak up if you wish) who wanted to use certain genre terms not on the list, but didn't think they would be valid. They are valid, as long as you also declare the genreAuthorityUsed to identify the genre list.
This really matters when you want to exchange stuff between systems that speak PBCore, and you want that stuff to show up in the right places. By using a controlled vocabulary that is common to the systems exchanging the stuff, things work as intended. If I call something "boat" and you're expecting "marine," things fall apart. If I use "Horse" as in the suggested picklist, and your system wants to call it "Equestrian," we have a problem.
So what would move this forward? I'll suggest something: People should compile a genre list for a given PBCore user community (yes these really exist), and document it clearly for that community. Code it into applications (in drop-down lists for example) so everyone selects terms from the same genre list. Name that list, and you've got a valid new PBCore genreAuthorityUsed.
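To make that concrete, here is a rough sketch of a named genre list that an application could use both to fill a drop-down and to check incoming records. The list name, its terms, and the GenreList class are all hypothetical. The only idea taken from above is that the name of the list is what gets declared as genreAuthorityUsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenreList:
    name: str         # the value to declare as genreAuthorityUsed
    terms: frozenset  # the agreed genre terms for this community

    def options(self):
        """Sorted terms, e.g. for populating a drop-down list."""
        return sorted(self.terms)

    def accepts(self, genre, authority):
        """True only if the record names this list and uses one of its terms."""
        return authority == self.name and genre in self.terms


# Hypothetical community list (assumption, not a published vocabulary).
STATION_GENRES = GenreList(
    name="Example Station Genres v1",
    terms=frozenset({"Marine", "Equestrian", "Call-in"}),
)

print(STATION_GENRES.options())
print(STATION_GENRES.accepts("Marine", "Example Station Genres v1"))
print(STATION_GENRES.accepts("boat", "Example Station Genres v1"))
```

A record that arrives with a different genreAuthorityUsed is rejected here rather than guessed at, which is the point where a mapping between lists would have to be agreed on.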
Automating formatIdentifiers for digital assets
For the use of formatIdentifier, pbcore.org states "Best practice is to identify the media item (whether analog or digital) by means of a string or number corresponding to an established or formal identification system if one exists." This field holds such values as barcodes, unique tape labels, file paths, database generated unique identifiers, and other such data.
In addition to these types of data, I needed an identifier to track the use of digital assets in Final Cut workflows and to identify when the digital archive is receiving the exact same file more than once.
In a FileMaker-based PBCore staging database I added these scripts (assuming that formatLocation holds a usable file path):
Perform AppleScript ["tell current record¶set cell \"formatIdentifier\" to do shell script (\"md5 -q '"& PBCoreInstantiation::formatLocation &"'\")¶end tell"]
Set Field [formatIdentifierSource; "md5"]
and
Perform AppleScript ["tell current record¶set cell \"formatIdentifier\" to do shell script (\"echo `tail -c 50 '"& PBCoreInstantiation::formatLocation&"' | head -c 41`\")¶end tell"]
Set Field [formatIdentifierSource; "com.apple.finalcutstudio.media.uuid"]
Note: The resulting formatIdentifier here must be checked to see if it conforms to the Final Cut UUID format (e.g. 13EBEE4A-FB65-4AF4-97F3-8B02A04A0A71); otherwise you're just getting random bits from the end of the file. Using this method you could possibly also determine that Final Cut is the creatingApplicationUsed, if you're using PREMIS as well. Let me know if you have a more efficient or alternative way to extract Final Cut's internal UUIDs from media files.
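The check described in that note could be sketched in Python roughly as follows. This is not the FileMaker step itself. Instead of cutting at a fixed offset, it searches the last bytes of the file for anything shaped like the example UUID. The function names are my own, and the pattern assumes uppercase hex in the 8-4-4-4-12 layout of the example, which may not cover every file Final Cut writes.

```python
import re

# UUID shape taken from the example above: 8-4-4-4-12 uppercase hex digits.
UUID_PATTERN = re.compile(
    rb"[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}"
)


def read_tail(path, count=50):
    """Read the last `count` bytes of a file (fewer if the file is shorter)."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)  # move to the end to learn the file size
        size = handle.tell()
        handle.seek(max(0, size - count))
        return handle.read()


def find_uuid(tail_bytes):
    """Return the first UUID-shaped string in tail_bytes, or None."""
    match = UUID_PATTERN.search(tail_bytes)
    return match.group(0).decode("ascii") if match else None
```

Calling find_uuid(read_tail(path)) returns None for files whose tail holds no such string, so random bits never end up stored as a formatIdentifier.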
As described above, the steps must be placed in a FileMaker script context that generates a new formatIdentifier record related to an instantiation record before they run. The result is formatIdentifiers that contain an md5 checksum and, if appropriate, the Final Cut UUID. If the resulting PBCore records include formatIdentifiers with equal values where formatIdentifierSource equals "md5", then you have redundant copies of the same media. The Final Cut UUID is assigned to most files that Final Cut generates, either through capturing or exporting. This value is usually also stored in the Final Cut project binary (the .fcp file) and can be used to build relation records between source material, the Final Cut project itself, and (occasionally) the exported material.
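For anyone working outside FileMaker, the md5 step and the duplicate check could be sketched in Python like this. The tuple layout and function names are assumptions for illustration. The only things taken from the workflow above are the md5 checksum as formatIdentifier and "md5" as formatIdentifierSource.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Hex md5 of a file, read in chunks so large media files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(records):
    """Group file locations that share an md5 formatIdentifier.

    `records` is an iterable of hypothetical
    (formatLocation, formatIdentifier, formatIdentifierSource) tuples.
    """
    by_checksum = defaultdict(list)
    for location, identifier, source in records:
        if source == "md5":  # ignore UUIDs, barcodes, and other identifiers
            by_checksum[identifier].append(location)
    # Keep only checksums that appear on more than one location.
    return {k: v for k, v in by_checksum.items() if len(v) > 1}
```

Any checksum that comes back from find_duplicates points at two or more locations holding byte-identical files.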