Category Archives: Query

PDF/L?

Here’s a question for the gallery: Have any of you heard of PDF/L, and do you know what it is?
Continue reading

Possible book on digital preservation tools

Update: It’s clear from the small response that the necessary level of interest isn’t there. Oh, well, that’s what testing the waters is for.

I’m getting the urge to write another book, going the crowdfunding route which has worked twice for me and my readers. My earlier Files that Last got good responses, though the “digital preservation for everygeek” audience proved not to be huge. Tomorrow’s Songs Today, a non-tech book, got more recognition and additional confirmation that book crowdfunding works. This time I’m aiming squarely at the institutions that engage in preservation — libraries, archives, and academic institutions — and proposing a reference on the software tools for preservation. The series I’ve been running on file identification tools was an initial exploration of the idea.

In the book, I’ll significantly expand these articles as well as covering a broader scope. Areas to cover will include:

  • File identification
  • Metadata formats
  • Detection of problems in files
  • Provenance management
  • The OAIS reference model
  • Repository creation and management
  • Keeping obsolescent formats usable

Continue reading

Crowdsourcing song identification

Some friends of mine are pulling together a project for crowdsourcing identification of a large collection of music clips. At least a couple of us are professional software developers, but I’m the one with the most free time right now, and it fits with my library background, so I’ve become lead developer. In talking about it, we’ve realized it can be useful to librarians, archivists, and researchers, so we’re looking into making it a crowdfunded open source project.

A little background: “Filk music” is songs created and sung by science fiction and fantasy fans, mostly at conventions and in homes. I’ve offered a definition of filk on my website. There are some shoestring filk publishers; technically they’re in business, but it’s a labor of love rather than a source of income. Some of them have a large backlog of recordings from past conventions. Just identifying the songs and who’s singing them is a big task.

This project is, initially, for one of these filk publishers, who has the biggest backlog of anyone. The approach we’re looking at is making short clips available to registered crowdsource contributors, and letting them identify as much as they can of the song, the author, the performer(s), the original tune (many of these songs are parodies), etc. Reports would be delivered to editors for evaluation. There could be multiple reports on the same clip; editors would use their judgment on how to combine them. I’ve started on a prototype, using PHP and MySQL.

There’s a huge amount of enthusiasm among the people already involved, which makes me confident that at least the niche project will happen. The question is whether there may be broader interest. I can see this as a very useful tool for professionals dealing with archives of unidentified recordings: folk music, old jazz, transcribed wax cylinder collections, whatever. There’s very little in the current design that’s specific to one corner of the musical world.

The first question: Has anyone already done it? Please let me know if something like this already exists.

If not, how interesting does it sound? Would you like it to happen? What features would you like to see in it?

Update: On the Code4lib mailing list, Jodi Schneider pointed out that nichesourcing is a more precise word for what this project is about.

The state of JHOVE

As you may have noticed, I’ve been neglectful of JHOVE since last September, when 1.11 came out. Issues are continuing to arise, and people are still using it, and I’m not getting anything done about them.

The problem is that my current job has rather long hours, and when I come home from it, looking at more Java code isn’t at the top of my list of things to do. I’m very glad people are still using JHOVE, close to a decade after I started work on it as a contractor to the Harvard Library, but I’m not getting anything actually done.

It would help if there were more contributions from others, and its being on the moribund SourceForge isn’t helping. I think I could undertake the energy to move it to Github, where more contributors might be interested. There’s already a Mavenized version by Andy Jackson there, which doesn’t include the Java source code but provides some important scaffolding and pom.xml files. It probably makes sense to start by forking this. This migration should also make the horrible JHOVE build procedure easier.

If this is something you’d like to see, let me know. I’d like some reassurance that this will actually help before I start.

A PDF question

A while back, I posted a question on superuser.com about a PDF issue that’s causing problems in JHOVE. So far it hasn’t gotten any answers, so I’m signal-boosting my own question here. Here’s what I asked:

The JHOVE parser for PDF, which I maintain, will sometimes find a non-dictionary object in a PDF’s Annots array. According to section 8.4.1 of the PDF spec, the Annots array holds “an array of annotation dictionaries.” In the case that I’m looking at right now, there’s a keyword of “Annot” instead of a dictionary. Is this an invalid PDF file, or is there a subtlety in the spec which I’ve overlooked?

Answering on stackoverflow.com is best, so other people can see the answer, but if you prefer to answer here, I’ll post or summarize any useful response, with attribution, as an answer over there.

Who’s using FITS?

It would be helpful for me to have at least a partial list of institutions that are using Harvard’s FITS (File Information Tool Set). If you can help me build this list, could you reply here or contact me by other usual channels? Thanks.

Future paths for JHOVE

With the next SPRUCE Hackathon coming up, I’m thinking of possible ways to improve JHOVE that I might present there. The home page says, “This hackathon will therefore focus on unifying our community’s approach to characterisation by coordinating existing toolsets and improving their capabilities.” So aside from the general goal of improving JHOVE, coordination is a key point.

I’d posted earlier on some possible enhancements. These are all still possibilities. The focus on coordination brings up other things that could be done. In general, the API hasn’t been given as much thought as the command line interface, and it could be improved without a huge amount of effort. Here are a few thoughts:

  • The API currently requires creating an output stream, such as an XML or text file. It should be possible to call JHOVE and get back an in-memory object. The RepInfo object already serves this purpose; it’s mostly a matter of writing a new method that returns it instead of writing a stream.
  • The caller has the choice of running one module or all the modules in the configuration file and can’t change their order. It might improve efficiency if the caller could specify a list indicating the modules to try and the order in which they should be applied. For instance, a caller might use DROID to get the signature and use this information to pick the module that JHOVE should run first.
  • There’s currently no provision for selecting which output items to generate, except for a few ad hoc options. Would a way to do this, eliminating items that are unwanted, be helpful?
  • Would any additional output handlers, such as JSON, be useful?

I’d welcome any thoughts on which of these, or what other changes, would help JHOVE to coordinate with other applications.

JHOVE Tips for Developers: Call for proofreaders

As a practice run for publishing Files that Last on Smashwords, I’ve put together a small but hopefully useful e-booklet, JHOVE Tips for Developers, which I’m planning to put up there on a “choose your own price” basis. This will help me work out the process of creating the book on a small scale, and maybe it will buy me a Whopper and fries.

For a book of this sort I obviously can’t afford paid proofreading, but I’m hoping one or two people might give it a looking over before I submit the book. You can get the draft as a PDF here.

I’d offer you a free copy in return, but you can get that anyway. What I can do is offer people who give useful feedback credit in the book, as well as my personal thanks.

Preliminary JHOVE fix list

I’m working toward a Kickstarter proposal that will cover what I’ll try to make two weeks of work. That will let me set a relatively modest funding goal, which seems wise for a first project. As I look at things more closely, a PDF 1.7 upgrade as such looks like the wrong way to go; I’m not seeing PDF 1.7 features that break JHOVE. What I’m seeing, rather, is an assortment of problems in the PDF module that can break files for reasons not closely tied to their version, and some features that would be very nice to add.

Here’s a first go-around on issues with JHOVE 1.8 which I’m considering addressing. I’m open to other issues as well. Comments on which of these are most important would help me to set up my project proposal.

• PDF files that open with Acrobat and with OS X Preview are being declared not well-formed or not valid because JHOVE is encountering one kind of PDF object when it expects another. Figuring this out may require some digging into the specs and implementation notes. Bug reports which I’ve added myself on SourceForge include Casting exceptions in PDF module, Problem with PDF annotation dictionaries, and PDF module doesn’t recognize all encryption algorithms.

• The PDF module recognizes PDF/X through version 3 but not 4. Adding a profile for PDF/X-4 looks simple.

• The PDF module recognizes PDF/A-1 but not 2 and 3. This could be a significant amount of work, possibly less if I incorporate a third-party library.

• The PDF module doesn’t recognize encryption algorithm 5, although it’s been around since PDF 1.5. The fix for this is easy.

• In general, PDF module error messages are less useful than they should be. More specific information and better logging would help in diagnosing any issues.

• The optional document requirements dictionary is a new feature in 1.7. This gives information on what an application needs to do to process the document correctly, which sounds like a useful preservation feature. Reporting a requirements property would be good.

• “Portable collections” of embedded files are a new feature in PDF 1.7. This sounds like a property worth reporting.

• AIFF files created by iTunes, and perhaps by other software, use the “ID3” chunk for metadata. The AIFF module knows nothing about it. Parsing this chunk and reporting the metadata might be useful. I’ve already written code that does this for another project.

• The UTF-8 module is at Unicode 6.0, and Unicode is at 6.2. Updating this is straightforward grunt work.

• There’s been a request to check if TIFF files share storage between tags, which is not allowed by the spec. This most often happens by design when two values, such as XResolution and YResolution, are known to have the same value, but could also indicate a corrupted file.

• There has been some discussion of checking whether TIFF tile and strip content goes outside file boundaries. There are limits on how rigorously that can be done, since lengths depend on compression methods that are outside JHOVE’s scope, but some idiot-proofing is possible.

All of this together is far more than two weeks of work, of course. Which of these are most important to you? What else should be on the list?