For years I wrote most of the code for JHOVE. With each format, I wrote tests for whether a file is “well-formed” and “valid.” With most formats, I never knew exactly what these terms meant. They come from XML, where they have clear meanings. A well-formed XML file has correct syntax. Angle brackets and quote marks match. Closing tags match opening tags. A valid file is well-formed and follows its schema. A file can be well-formed but not valid, but it can’t be valid without being well-formed.
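To make the XML distinction concrete, here is a minimal sketch in Python. It is not JHOVE's code. Well-formedness is checked with the standard library's parser. The standard library has no schema validator, so validity is represented by a toy rule that I made up for illustration: the root must be `<record>` and must contain a `<title>`. The function names are hypothetical too.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML at all (syntax only)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_valid(text):
    """Return True if the text is well-formed AND satisfies a toy rule.

    The rule (root is <record> with a <title> child) is a made-up
    stand-in for a real schema, used only for illustration.
    """
    if not is_well_formed(text):
        return False  # valid implies well-formed
    root = ET.fromstring(text)
    return root.tag == "record" and root.find("title") is not None


good = "<record><title>Report</title></record>"
no_title = "<record><name>Report</name></record>"    # syntax fine, rule broken
bad_nesting = "<record><title>Report</record>"       # closing tag mismatch
```

With these inputs, `good` passes both checks, `no_title` is well-formed but not valid, and `bad_nesting` fails both, since a file that doesn't parse never reaches the validity rule.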
With most other formats, there’s no definition of these terms. JHOVE applies them anyway. (I wrote the code, but I didn’t design JHOVE’s architecture. Not my fault.) I approached them by treating “well-formed” as meaning syntactically correct, and “valid” as meaning semantically correct. Drawing the line wasn’t always easy. If a required date field is missing, is the file not well-formed or just not valid? What if the date is supposed to be in ISO 8601 format but isn’t? How much does it matter?
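The line-drawing problem shows up as soon as you try to write such a check. The sketch below uses an invented "key: value" line format, not any real format or JHOVE module. It arbitrarily treats a line without a colon as a syntax error, and a missing or non-ISO date as a validity error. Someone else could defend putting the missing date on the other side of the line. Note that `date.fromisoformat` accepts only part of ISO 8601, which is enough for illustration.

```python
from datetime import date


def check_record(text):
    """Return (well_formed, valid) for a made-up 'key: value' line format."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            return (False, False)  # no "key:" at all, so the syntax is broken
        fields[key.strip()] = value.strip()

    # Arbitrary choice for this sketch: a missing date is a validity
    # problem, not a well-formedness problem.
    if "date" not in fields:
        return (True, False)

    # A date that is present but not in ISO form is also "just" invalid.
    try:
        date.fromisoformat(fields["date"])
    except ValueError:
        return (True, False)
    return (True, True)
```

Moving the `"date" not in fields` check so it returns `(False, False)` would be an equally defensible reading of a spec that never defines the terms.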
PDF/A-4
It looks as if I’ll have a little input into the upcoming PDF/A-4 standardization process; earlier this month I got an email from the 3D PDF Consortium inviting me to participate, and I responded affirmatively. While waiting for whatever happens next, I should figure out what PDF/A-4 is all about.
ISO has a placeholder for it, where it’s also called “PDF/A-NEXT.” There’s some substantive information on PDFlib. What’s interesting right at the start is that it will build on PDF/A-2, not PDF/A-3. A lot of people in the library and archiving communities thought A-3 jumped the shark when it allowed attachments of any kind without limitation. It’s impossible to establish a document’s archival suitability if it has opaque content.
Posted in commentary
Tagged PDF, standards