With the next SPRUCE Hackathon coming up, I’m thinking of possible ways to improve JHOVE that I might present there. The home page says, “This hackathon will therefore focus on unifying our community’s approach to characterisation by coordinating existing toolsets and improving their capabilities.” So aside from the general goal of improving JHOVE, coordination is a key point.
I’d posted earlier on some possible enhancements. These are all still possibilities. The focus on coordination brings up other things that could be done. In general, the API hasn’t been given as much thought as the command line interface, and it could be improved without a huge amount of effort. Here are a few thoughts:
- The API currently requires creating an output stream, such as an XML or text file. It should be possible to call JHOVE and get back an in-memory object. The RepInfo object already serves this purpose; it’s mostly a matter of writing a new method that returns it instead of writing a stream.
- The caller has the choice of running one module or all the modules in the configuration file and can’t change their order. It might improve efficiency if the caller could specify a list indicating the modules to try and the order in which they should be applied. For instance, a caller might use DROID to get the signature and use this information to pick the module that JHOVE should run first.
- There’s currently no provision for selecting which output items to generate, except for a few ad hoc options. Would a way to do this, eliminating items that are unwanted, be helpful?
- Would any additional output handlers, such as JSON, be useful?
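To make the first two ideas concrete, here is a minimal sketch of an API that returns an in-memory result and lets the caller reorder modules using an outside hint, such as a format identified by DROID. None of these names exist in JHOVE: `FormatModule`, `Result`, `prioritize`, `characterize` and `signatureModule` are hypothetical, `Result` only stands in for the kind of information RepInfo carries, and the "modules" are toy prefix checks rather than real parsers.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ModuleOrderSketch {

    /** Hypothetical in-memory result; a stand-in for what RepInfo holds. */
    static final class Result {
        final String module;
        final boolean wellFormed;
        final Map<String, String> properties;

        Result(String module, boolean wellFormed, Map<String, String> properties) {
            this.module = module;
            this.wellFormed = wellFormed;
            this.properties = properties;
        }
    }

    /** Hypothetical module interface; real JHOVE modules look different. */
    interface FormatModule {
        String name();
        Optional<Result> tryParse(byte[] data);
    }

    /** Moves the hinted module to the front and keeps the rest in configured order. */
    static List<FormatModule> prioritize(List<FormatModule> configured, String hint) {
        List<FormatModule> ordered = new ArrayList<>();
        for (FormatModule m : configured) {
            if (m.name().equals(hint)) ordered.add(m);
        }
        for (FormatModule m : configured) {
            if (!m.name().equals(hint)) ordered.add(m);
        }
        return ordered;
    }

    /** Tries modules in the given order and returns the first result as an object. */
    static Optional<Result> characterize(byte[] data, List<FormatModule> modules) {
        for (FormatModule m : modules) {
            Optional<Result> result = m.tryParse(data);
            if (result.isPresent()) return result;
        }
        return Optional.empty();
    }

    /** Toy module: "parses" a file only by checking an ASCII prefix. */
    static FormatModule signatureModule(final String name, final String magic) {
        final byte[] prefix = magic.getBytes(StandardCharsets.US_ASCII);
        return new FormatModule() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public Optional<Result> tryParse(byte[] data) {
                if (data.length < prefix.length) return Optional.empty();
                for (int i = 0; i < prefix.length; i++) {
                    if (data[i] != prefix[i]) return Optional.empty();
                }
                Map<String, String> props = new LinkedHashMap<>();
                props.put("size", Integer.toString(data.length));
                // A prefix match is not real validation; "true" is a placeholder.
                return Optional.of(new Result(name, true, props));
            }
        };
    }

    public static void main(String[] args) {
        List<FormatModule> configured = Arrays.asList(
                signatureModule("GIF-toy", "GIF8"),
                signatureModule("PDF-toy", "%PDF-"));
        byte[] data = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);

        // Pretend an external identifier suggested PDF, so that module runs first.
        List<FormatModule> ordered = prioritize(configured, "PDF-toy");
        Optional<Result> result = characterize(data, ordered);
        System.out.println(result
                .map(r -> r.module + " wellFormed=" + r.wellFormed + " " + r.properties)
                .orElse("no module matched"));
    }
}
```

The caller gets a plain object back and never touches an output stream; a JSON or XML handler could then be a separate step that serializes the same object.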
I’d welcome any thoughts on which of these, or what other changes, would help JHOVE to coordinate with other applications.
I would definitely support everything on your list, since I believe that nowadays separate tools must be used together (sequentially or in collaboration) to get more information. For that, efficiency matters, both in terms of time (how long the computation takes) and memory (how much data is generated, read, …).
From my own experiments, I feel that attempts to unify several tools (like FITS) are a good idea, but the lack of an “embedded” solution for the subcomponents tends to limit how efficient they can actually be.
Therefore, returning an object as the result is a first step toward an embedded solution; being able to choose which module to run, depending on what other tools report, is another good one; and JSON output is probably a good option, since nowadays big data is pushing toward that kind of interface, leaving XML aside for high-volume computation (large numbers of files to analyze).
So yes, I agree with your points.
I work for an ISV in Sweden. We have developed an e-archive product based on the OAIS model. We are not really archivists; we are developers who found ourselves a niche with an OAIS-based product.
Our customers are mostly large and medium-sized Swedish government agencies. Incoming information packages have to be validated. That function is pluggable in our solution. Most of our customers (who are archivists) choose to use JHOVE for it. Our National Archives has a list of file formats for long-term storage; only TIFF, JPG, XML and PDF/A are permitted at the moment. All of which leads to this: PDF/A is very important for our customers.
That said, this specific use of JHOVE leads to some specific requirements and ideas for future directions regarding PDF/A:
1. More precise information about what causes a profile to fail. JHOVE is not perfect as a validator. Although it validates TIFF and PDF/A files perfectly well, it does not report problems very well. Especially with PDF/A, it’s often very hard to determine why a file isn’t PDF/A compliant. I know of one customer who has modified JHOVE by filling repInfo with information about violations specific to the PDF/A profile. I do not know if that is of general interest; it’s very PDF/A specific.
2. There are some parts of the PDF/A profile that JHOVE does not check, a fact that worries some of our customers. For example, “The data within content streams, and therefore the use of operators and the glyph descriptions of embedded fonts.” is not checked.
3. And PDF/A-2 is coming, of course.