Tag Archives: preservation

DOTS: Almost a datalith

A lot of people in digital preservation are convinced a “digital dark age” is nothing to worry about. I’ve consistently disagreed with this. The notion that archivists will replace outdated digital media every decade or two through the centuries is a pipe dream. Records have always gone through periods of neglect, and they will in the future. Periods of unrest will happen; authorities will try to suppress inconvenient history; groups like Daesh will set out to destroy everything that doesn’t match their worldview; natural disasters will disrupt archiving.

I’ve proposed the idea of a “datalith,” a data record made out of rock or equivalent material, optically readable and self-explanatory assuming a common language survives. DOTS, Digital Optical Technology System, is burned on tape rather than engraved in stone, but in every other respect it matches my vision of a datalith. It can store digital images in any format but also allows them to be recorded as a visual representation. The Long Now Foundation explains:
Continue reading

PDF forever?

Distant galaxiesThe PDF Association has an article on its site titled “What’s unique about PDF? and why PDF will live forever.” The article claims PDF is “a format of such flexibility and power that it will define the essential ‘electronic document’ concept forever.”

Forever is a long time. No one will think they mean that the last object left as the universe succumbs to entropy will be a disk with a PDF file, but what scale of “forever” gives sense to their claim? In a tweet responding to my skepticism, they offered a clarification:

Continue reading

TIFF/A by any other name

TIFF/A is in search of a new name.

Today’s online kickoff discussion for the TIFF/A Initiative was productive in a lot of ways, but the big news for the broader public is that it will have to change its name. Adobe owns the TIFF trademark, and it doesn’t want “TIFF/A” used for the proposed new standard for archival TIFF.
Continue reading

OPF seeking executive director

With Ed Fay’s departure, the Open Preservation Foundation is seeking a new executive director. The location is “negotiable,” but I’m sure major centers of digital preservation activity in Europe will get top consideration.

Update on JHOVE

JHOVE logoYesterday the Open Preservation Foundation held a webinar on JHOVE, presented by Carl Wilson. I was really impressed by the progress he’s made there, and any rumors of JHOVE’s death (including ones I may have contributed to) have been greatly exaggerated.

The big changes include reorganizing the code under Maven and making installation more straightforward. These are both badly needed changes. I never had the opportunity to do them at Harvard, and when I took the code over for a while after leaving there, I focused on fixing bugs rather than fixing the design.

In my comments during the webinar, I pointed out the importance of Stephen Abrams’ contribution, which a lot of people don’t remember. I didn’t create JHOVE; he did. The core application and design principles were already in place when I entered the project. OPF will, I’m sure, give him the credit he deserves.

Possible book on digital preservation tools

Update: It’s clear from the small response that the necessary level of interest isn’t there. Oh, well, that’s what testing the waters is for.

I’m getting the urge to write another book, going the crowdfunding route which has worked twice for me and my readers. My earlier Files that Last got good responses, though the “digital preservation for everygeek” audience proved not to be huge. Tomorrow’s Songs Today, a non-tech book, got more recognition and additional confirmation that book crowdfunding works. This time I’m aiming squarely at the institutions that engage in preservation — libraries, archives, and academic institutions — and proposing a reference on the software tools for preservation. The series I’ve been running on file identification tools was an initial exploration of the idea.

In the book, I’ll significantly expand these articles as well as covering a broader scope. Areas to cover will include:

  • File identification
  • Metadata formats
  • Detection of problems in files
  • Provenance management
  • The OAIS reference model
  • Repository creation and management
  • Keeping obsolescent formats usable

Continue reading

TIFF/A

TIFF has been around for a long time. Its latest official specification, TIFF 6.0, dates from 1992. The format hasn’t held still for 23 years, though. Adobe has issued several “technical notes” describing important changes and clarifications. Software developers, by general consensus, have ignored the requirement that value offsets have to be on a word boundary, since it’s a pointless restriction with modern computers. Private tags are allowed, and lots of different sources have defined new tags. Some of them have achieved wide acceptance, such as the TIFFTAG_ICCPROFILE tag (34675), which fills the need to associate ICC color profiles with images. Many applications use the EXIF tag set to specify metadata, but this isn’t part of the “standard” either.

In other words, TIFF today is the sum of a lot of unwritten rules.

It’s generally not too hard to deal with the chaos and produce files that all well-known modern applications can handle. On the other hand, it’s easy to produce a perfectly legal TIFF file that only your own custom application will handle as you intended. People putting files into archives need some confidence in their viability. Assumptions which are popular today might shift over a decade or two. Variations in metadata conventions might cause problems.
Continue reading

File identification tools, part 5: JHOVE

In 2004, the Harvard University Libraries engaged me as a contractor to write the code for JHOVE under Stephen Abrams’ direction. I stayed around as an employee for eight more years. I mention this because I might be biased about JHOVE: I know about its bugs, how hard it is to install, what design decisions could have been better, and how spotty my support for it has been. Still, people keep downloading it, using it, and saying good things about it, so I must have done something right. Do any programmers trust the code they wrote ten years ago?

The current home of JHOVE is on GitHub under the Open Preservation Foundation, which has taken over maintenance of it from me. Documentation is on the OPF website. I urge people not to download it from SourceForge; it’s out of date there, and there have been reports of questionable practices by SourceForge’s current management. The latest version as of this writing is 1.11.

JHOVE stands for “JSTOR/Harvard Object Validation Environment,” though neither JSTOR nor Harvard is directly involved with it any longer. It identifies and validates files in a small set of formats, so it’s not a general-purpose identification tool, but does a fairly thorough job. The formats it validates are AIFF, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, WAV, XML, ASCII, and UTF-8. If it doesn’t recognize a file as any of those formats, it will call it a “Bytestream.” You can use JHOVE as a GUI or command line application, or as a Java library. If you’re going to use the library or otherwise do complicated things, I recommend downloading my payment-optional e-book, JHOVE Tips for Developers. Installation and configuration are tricky, so follow the instructions carefully and take your time.

JHOVE shouldn’t be confused with JHOVE2, which has similar aims to JHOVE but has a completely different code base, API, and user interface. It didn’t get as much funding as its creators hoped, so it doesn’t cover all the formats that JHOVE does.

Key concepts in JHOVE are “well-formed” and “valid.” When allowed to run all modules, it will always report a file is a valid instance of something; it’s a valid bytestream if it’s not anything else. This has confused some people; a valid bytestream is nothing more than a sequence of zero or more bytes. Everything is a valid bytestream.

The concept of well-formed and valid files comes from XML. A well-formed XML file obeys the syntactic rules; a valid one conforms to a schema or DTD. JHOVE applies this concept to other formats, but it’s generally not as good a fit. Roughly, a file which is “well-formed but not valid” has errors, but not ones that should prevent rendering.

JHOVE doesn’t examine all aspects of a file. It doesn’t examine data streams within files or deal with encryption. It focuses on the semantics of a file rather than its content. However, it’s very aggressive in what it does examine, so that sometimes it will declare a file not valid when nearly all rendering software will process it correctly. If there’s a conflict between the format specification and generally accepted practice, it usually goes by the specification.

It checks for profiles within a format, such as PDF/A and TIFF/IT. It only reports full conformance to a profile, so if a file is intended to be TIFF/A but fails any tests for the profile, JHOVE will simply not list PDF/A as a profile. It won’t tell you why it fell short.

The PDF module has been the biggest adventure; PDF is really complicated, and its complexity has increased with each release. Bugs continue to turn up, and it covers PDF only through version 1.6. It needs to be updated for 1.7, which is equivalent to ISO 32000.

Sorry, I warned you that I’m JHOVE’s toughest critic. But I wouldn’t mind a chance to improve it a bit, through the funding mechanism I mentioned earlier in the blog.

Next: FITS. To read this series from the beginning, start here.

Funding for preservation software development

The Open Preservation Foundation (formerly the Open Planets Foundation) is launching a new model for funding the development of preservation-related software. Quoting from the announcement:

‘Over the last year the OPF has established a solid foundation for ensuring the sustainability of digital preservation technology and knowledge,’ explains Dr. Ross King, Chair of the OPF Board. ‘Our new strategic plan was introduced in November 2014 along with community surveys to establish the current state of the art. We developed our annual plan in consultation with our members and added JHOVE to our growing software portfolio. The new membership and software supporter models are the next steps towards realising our vision and mission.’ …

The software supporter model allows organisations to support individual digital preservation software products and ensure their ongoing sustainability and maintenance. We are launching support for JHOVE based on its broad adoption and need for active stewardship. It is also a component in several leading commercial digital preservation solutions. While it remains fully open source, supporters can steer our stewardship and maintenance activities and receive varying levels of technical support and training.

JHOVE logoI have a selfish personal interest in spreading the word. At the moment, I’m between contracts, and I wouldn’t mind getting some funding from OPF to resume development work on JHOVE. I know its code base better than anyone else, I worked on it without pay as a hobby for a year or so after leaving Harvard, and I’d enjoy working on it some more if I could just get some compensation. This is possible, but only if there’s support from outside.

US libraries have been rather insular in their approach to software development. They’ll use free software if it’s available, but they aren’t inclined to help fund it. If they could each set aside some money for this purpose, it would help assure the continued creation and maintenance of the open source software which is important to their mission.

How about it, Harvard?