Tag Archives: software

The coming of WebP (or not)

The WebP image format has been around for about five years, but till recently it’s been mostly a curiosity. I last blogged about it in 2013, when it didn’t have very wide support. Since then most browsers have adopted it, and now Google+ is making more use of it (no surprise, since Google is the format’s principal backer). It promises smarter lossy compression than JPEG and smaller file sizes for the same image quality.

Video: Introduction to JHOVE

A new video on my YouTube channel offers a seven-minute introduction to JHOVE. This is a teaser for my upcoming video course on file format identification tools, as well as a public test of the techniques I’ve been developing. It’s a screen capture video, and I cover the GUI version, even if it’s not as widely used, because it lets me focus on the concepts, and because it’s silly to teach a command line application in a video.

JHOVE 1.12 beta

JHOVE 1.12 will be the first release of JHOVE that I had no significant role in, but I’m still glad to see that the beta release is now available. I’ve downloaded it, run the installer (yes, there’s now an installer!), and then launched JHOVE without having to edit any configuration files by hand! That’s a huge advance by itself. Nice work by Carl Wilson and everyone else at the Open Preservation Foundation. It’s now built with Maven, and I’m sure that the building process is much better than the clunky old one.

Course planning: File identification tools

My current main project is creating a course to offer on Udemy on file format identification tools. As currently planned, I’ll cover file (the command line tool), DROID, ExifTool, JHOVE, and Apache Tika. Covering more than five tools in one course would make it too big, though I might consider changing the list. If I can keep my schedule, I’ll have it out in December for early feedback, giving me a chance to clean it up before MIT’s Independent Activity Period in January.

Right now I’m occupied with the mechanics. The course insists on 1280 x 720 pixel video, so I need a new camera; a friend is selling me a Canon Elph 520 HS cheap. Screen capture software is proving interesting; I’ve looked at three different Macintosh applications so far.

EpubCheck 4.0

EPUB is the favorite format for e-books (ignoring Amazon, which likes to be incompatible so it can lock users in). EpubCheck is the open-source industry standard for validating EPUB files. If you’re an author creating your own e-book files, you should run them against EpubCheck before releasing them. It’ll make hosting sites happier, since they’ll probably run it themselves and will like your book better if it passes. A book that passes EpubCheck will also give you fewer headaches with readers complaining that it doesn’t work on their device.
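If you'd rather script the check than run it by hand, here is a minimal sketch in Python. It assumes Java is on your path and that the EpubCheck jar sits in the current directory as epubcheck.jar; the function names, the jar location, and the reading of a zero exit status as "passed" are my assumptions for illustration, so check them against the version you install.

```python
import subprocess


def epubcheck_command(epub_path, jar_path="epubcheck.jar"):
    """Build the command line for checking one EPUB file.

    jar_path is a hypothetical location; point it at your own copy.
    """
    return ["java", "-jar", str(jar_path), str(epub_path)]


def passes_epubcheck(epub_path, jar_path="epubcheck.jar"):
    """Run EpubCheck and return (passed, report_text).

    Assumption: a zero exit status means no errors were reported.
    """
    result = subprocess.run(
        epubcheck_command(epub_path, jar_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr
```

Calling passes_epubcheck("mybook.epub") gives you a yes/no answer plus the full text of the report, which is what you'll want to read when a book fails.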

Update on JHOVE

Yesterday the Open Preservation Foundation held a webinar on JHOVE, presented by Carl Wilson. I was really impressed by the progress he’s made there, and any rumors of JHOVE’s death (including ones I may have contributed to) have been greatly exaggerated.

The big changes include reorganizing the code under Maven and making installation more straightforward. These are both badly needed changes. I never had the opportunity to do them at Harvard, and when I took the code over for a while after leaving there, I focused on fixing bugs rather than fixing the design.

In my comments during the webinar, I pointed out the importance of Stephen Abrams’ contribution, which a lot of people don’t remember. I didn’t create JHOVE; he did. The core application and design principles were already in place when I entered the project. OPF will, I’m sure, give him the credit he deserves.

Possible book on digital preservation tools

Update: It’s clear from the small response that the necessary level of interest isn’t there. Oh, well, that’s what testing the waters is for.

I’m getting the urge to write another book, going the crowdfunding route, which has worked twice for me and my readers. My earlier Files that Last got good responses, though the “digital preservation for everygeek” audience proved not to be huge. Tomorrow’s Songs Today, a non-tech book, got more recognition and additional confirmation that book crowdfunding works. This time I’m aiming squarely at the institutions that engage in preservation (libraries, archives, and academic institutions) and proposing a reference on the software tools for preservation. The series I’ve been running on file identification tools was an initial exploration of the idea.

In the book, I’ll significantly expand these articles and cover a broader scope. Areas to cover will include:

  • File identification
  • Metadata formats
  • Detection of problems in files
  • Provenance management
  • The OAIS reference model
  • Repository creation and management
  • Keeping obsolescent formats usable


veraPDF validator

The veraPDF Consortium has announced a public prototype of its PDF validation software.

It’s ultimately intended to be “the definitive open source, file-format validator for all parts and conformance levels of ISO 19005 (PDF/A)”; however, it’s “currently more a proof of concept than a usable file format validator.”

File identification tools, part 9: JHOVE2

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west. I was on the advisory board but didn’t really do much, and I had no involvement in the programming. I’m not saying I could have written JHOVE2 better, just explaining my relationship to the project.

The institutions that did work on it were CDL, Portico, and Stanford University. There were two problems with the project. The big one was insufficient funding; the money ran out before JHOVE2 could boast a set of modules comparable to JHOVE. A secondary problem was usability. It’s complex and difficult to work with. I think if I’d been working on the project, I could have helped to mitigate this. I did, after all, add a GUI to JHOVE when Stephen wasn’t looking.

JHOVE has some problems that needed fixing. It quits its analysis on the first error. It’s unforgiving on identification; a TIFF file with a validation error simply isn’t a TIFF file, as far as it’s concerned. Its architecture doesn’t readily accommodate multi-file documents. It deals with embedded formats only on a special-case basis (e.g., Exif metadata in non-TIFF files). Its profile identification is an afterthought. JHOVE2 provided better ways to deal with these issues. The developers wrote it from scratch, and it didn’t aim for any kind of compatibility with JHOVE.
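To make the first two complaints concrete, here is a toy sketch in Python of the alternative: identify the format from its signature first, then keep checking and collect every problem instead of stopping at the first one. This is not code from JHOVE or JHOVE2; the Report class, the examine function, and the two header checks are invented for illustration and cover only a sliver of what a real TIFF validator looks at.

```python
import struct
from dataclasses import dataclass, field
from typing import List, Optional

# TIFF signatures: byte-order mark followed by the number 42.
TIFF_MAGIC = {b"II*\x00": "<", b"MM\x00*": ">"}


@dataclass
class Report:
    format: Optional[str] = None          # what the file claims to be
    errors: List[str] = field(default_factory=list)

    @property
    def well_formed(self):
        return self.format is not None and not self.errors


def examine(data: bytes) -> Report:
    """Identify first, then collect all the problems we can find."""
    report = Report()
    byte_order = TIFF_MAGIC.get(data[:4])
    if byte_order is None:
        return report                     # signature doesn't match: not a TIFF
    report.format = "TIFF"                # identified, whatever validation says
    if len(data) < 8:
        report.errors.append("header is shorter than 8 bytes")
        return report                     # nothing more can be read
    (ifd_offset,) = struct.unpack(byte_order + "I", data[4:8])
    if ifd_offset % 2:
        report.errors.append("first IFD offset is odd")
    if ifd_offset + 2 > len(data):
        report.errors.append("first IFD offset points past end of file")
    return report
```

A file with a good signature and a bad header comes back as a TIFF with a list of errors, rather than as "not a TIFF" with a single complaint.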