There are currently 21 answered questions on ErpaAdvisory:
Questions 1 to 5 shown below.
Submitted by helen on 2 August 2004 at 8:56
I understand that PDF is not considered suitable as a long-term preservation standard format, and that the upcoming PDF/A is not necessarily a successful attempt to meet the requirements of such a format. Can you describe the shortcomings of PDF in comparison to e.g. XML, or do they have different advantages and drawbacks depending on document type?
Answered by dutched on 20 August 2004 at 8:55
PDF has been, and continues to be, widely used as a "preservation format" (which is highlighted in various erpaStudies). The main reasons for this are the consistent appearance of PDF documents across platforms, the openly available PDF specification, the wide availability of PDF viewers, and the ease of document conversion to PDF (e.g. from a text-processing format like Microsoft Word). In the pharmaceutical sector, for instance, PDF is a de facto industry standard for information exchange and is often used for preservation. Moreover, relevant regulatory bodies, foremost the US Food and Drug Administration (FDA), promote PDF as a qualified format for electronic submission and as a possible approach to preserving a record's content and meaning.
On closer inspection of PDF's suitability as an archival format, however, a number of caveats can be identified. First of all, while the specification of the current PDF version is openly available, the format is owned by Adobe, which may decide to keep the specification of future versions secret. A repeatedly cited problem in this context is the incompatibility between some PDF versions. Furthermore, PDF documents are not necessarily self-contained; they may rely on system fonts and other external components. Perhaps most troublesome is the variety of PDF features that may adversely affect preservation, including encryption, compression, digital rights management, and the embedding of objects that may be in a proprietary format. These properties of PDF may obstruct future migration to other formats, may partially change content (such as the problems with formulas highlighted by Thomas Fischer), may essentially make a PDF file unusable in the absence of auxiliary software, or may raise other concerns with regard to long-term preservation.
PDF and PDF/A are geared towards retaining the appearance of printable documents. Pages in PDF are meant to be immutable, as if a photo of the printed paper had been taken. XML, on the other hand, is not geared towards retaining object appearance. XML is a semi-structured language that aims to facilitate automatic processing at the syntax level. Information marked up in an XML-based archival format can be converted automatically and on the fly to any access format, including PDF. This of course facilitates migration to a future format as well. (Other advantages of using XML for defining archival formats have been highlighted repeatedly and are not discussed here.) However, when information is transferred from an XML container to PDF or any other presentation format, its appearance ultimately depends on the conversion routine and may vary. Generally speaking, neither PDF nor XML-based formats were created specifically for digital preservation purposes. Initiatives to establish criteria for durable file formats are currently underway, such as the one by the ISO working group involved in the creation of the ISO Records Management standard. Such criteria may be useful when assessing the suitability of PDF/A or other formats, or when designing a format dedicated to specific preservation purposes.
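To illustrate the point about on-the-fly conversion, here is a minimal Python sketch, not a recommended archival workflow. The `<report>` record and its element names are invented for this example. Two small conversion routines derive a plain-text view and an HTML view from the same source, and the differences between the two outputs show how appearance depends on the routine rather than on the archived content.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical archival record; the element names are invented for this sketch.
RECORD = """<report>
  <title>Annual Report 2003</title>
  <para>Holdings grew by 12 percent.</para>
  <para>Two formats were retired.</para>
</report>"""


def to_text(root):
    """Render the record as plain text: upper-case title, one paragraph per line."""
    lines = [root.findtext("title", default="").upper()]
    lines += [para.text or "" for para in root.findall("para")]
    return "\n".join(lines)


def to_html(root):
    """Render the same record as a small HTML page."""
    title = escape(root.findtext("title", default=""))
    paras = "".join(
        f"<p>{escape(para.text or '')}</p>" for para in root.findall("para")
    )
    return f"<html><body><h1>{title}</h1>{paras}</body></html>"


root = ET.fromstring(RECORD)
print(to_text(root))
print(to_html(root))
```

Adding a further access format would mean adding another function of the same shape, while the stored XML stays untouched.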
Clearly, then, PDF/A and XML have very different roles. At least in the short to medium term, PDF/A could be a suitable archival format for preserving the appearance of page-based documents. Ultimately this depends on how successful the standardisation process of PDF/A is, on whether it is widely used, and on whether it is stable in the long term. Those who use PDF as a "preservation format" at the moment are certainly advised to move their existing archival holdings to a more stable format. Since conversion from PDF to anything other than PDF/A could be difficult, the current prevalence of PDF might be a strong indicator for the future success of PDF/A.
* PDF - Portable Document Format, Adobe Inc. http://www.adobe.com/products/acrobat/adobepdf.html
* PDF/A - http://www.aiim.org/pdf_a/
* ERPANET Seminar: File Formats for Preservation. Vienna, 10-11 May 2004. Presentations and the seminar report are available at www.erpanet.org
* erpaStudy - Pharmaceutical Sector. http://www.erpanet.org/php/studies/docs/erpaStudies_Pharma_final2.pdf
* FDA Guidance for Industry: Part 11, Electronic Records; Electronic Signatures - Scope and Application. Chapter 5 - Record Retention. August 2003.
* FDA Guidance for Industry: Providing Regulatory Submissions in Electronic Format - ANDAs. June 2002.
-> available via the "Guidance Documents" homepage of the US Food and Drug Administration, Center for Drug Evaluation and Research, section "Electronic Submissions": http://www.fda.gov/cder/guidance/index.htm
* John Mark Ockerbloom: Archiving and Preserving PDF Files.
* Thomas Fischer: LaTeX as an Archiving Format: Benefits and Problems. Experiences from the MathDiss International Project and the EMANI project. 2003. http://edoc.hu-berlin.de/etd2003/fischer-thomas/PDF/index.pdf
* Digital Preservation Testbed: XML and Digital Preservation. White Paper, September 2002. http://www.digitaleduurzaamheid.nl/bibliotheek/docs/white-paper_xml-en.pdf
* Andreas Aschenbrenner: The Bits and Bites of Data Formats - Stainless Design for Digital Endurance. RLG DigiNews, February 2004. http://www.rlg.org/preserv/diginews/diginews8-1.html#feature3
Submitted by query on 27 February 2004 at 13:38
‘Interoperability’ has become a buzz word, but what does it actually refer to in technical terms? And how does it differ from ‘portability’?
Answered by dutched on 27 February 2004 at 13:44
Interoperable systems may not be portable and vice versa. To answer your question, the two terms ‘interoperability’ and ‘portability’ are explained below from a technical perspective.
Interoperability can be defined as the capability of systems to interact and exchange information. Here, a system can be any kind of information technology component, including computers, networks, and software. Interoperable systems may be heterogeneous, in that they are composed of diverse system components, use distinct operating systems or software, or are built by different vendors. Depending on its scalability, an interoperable system can be extended by adding more peers.
Interoperability requires an agreed interaction format, and this agreement can take place at different levels. At a lower level, common protocols ensure that signals are transmitted successfully, so that data can be exchanged via networks. At a higher level, software can employ services and processing tools of remote systems. The systems that are probably the most complex in this respect are processing grids, which use the processing power of a number of systems towards a common goal. At the highest level, interoperable software applications are capable of querying and manipulating information that is distributed over different systems. For example, a gateway to an interoperable system is an online service that allows the extraction of specific information from a number of databases with a single query.
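The gateway idea at the end of the previous paragraph can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two in-memory SQLite databases, the `items` table, and the names `archive_a` and `archive_b` are all hypothetical, and real gateways usually have to mediate between differing schemas and protocols as well.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory catalogue with a hypothetical 'items' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (title TEXT, year INTEGER)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    return conn


# Two independent databases standing in for separate remote systems.
sources = {
    "archive_a": make_db([("Domesday survey", 1986), ("Annual report", 2003)]),
    "archive_b": make_db([("Domesday video disc", 1986)]),
}


def gateway_search(term):
    """Run one query against every source and merge the results."""
    hits = []
    for name, conn in sources.items():
        cursor = conn.execute(
            "SELECT title, year FROM items WHERE title LIKE ? ORDER BY title",
            (f"%{term}%",),
        )
        hits.extend((name, title, year) for title, year in cursor)
    return hits


for hit in gateway_search("Domesday"):
    print(hit)
```

The agreed interaction format here is simply the shared table layout and SQL; the user issues one query and does not need to know how many databases sit behind the gateway.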
The Standard Computer Dictionary by the Institute of Electrical and Electronics Engineers (IEEE) defines ‘portability’ as “the ease with which a system or component can be transferred from one hardware or software environment to another”. Therefore, a portable software application, for example, can be ported to and run on different computers and operating systems.
However, the fact that a system component can be integrated in another system environment does not mean that it is interoperable. The component may lack the capability to interact with other system components. Conversely, an interoperable system component may not be portable. In other words, while it is capable of interacting with remote system components, it is tied to a specific system environment.
* JISC, Resource: Interoperability Focus; http://www.ukoln.ac.uk/interop-focus/
* DLESE (Digital Library for Earth Science Education) Interoperability Web Site;
Submitted by query on 27 February 2004 at 13:38
Emulation was introduced as a preservation method by Jeff Rothenberg as early as 1999. Since then, however, little seems to have happened. I am not aware of any archive that applies emulation as a preservation method. If emulation is not practicable, why is there still talk about it?
Answered by dutched on 27 February 2004 at 13:45
You suggest that the technique ‘emulation’ is not being applied in practice. This is not quite right; in fact the technique is used in various areas. There is an array of emulators that allow computer games from superseded platforms to be executed; even a quick web search yields numerous references to emulators for old Atari, C64, Amiga, and other games. Emulators are often used in research and development of industrial applications; for example, they offer a quick, flexible, and cheap way to prototype embedded microprocessor systems (cf. Patel). Crossover between different platforms is achieved through emulators such as VMware, which allows the execution of native Windows applications like MS Excel in a Linux environment (cf. Goodwin). This list of applications of emulators could surely be extended further.
You are, however, right that not many digital preservation solutions currently build on emulation. This may be due to the fact that emulation attempts to preserve the “original technology”: the original application and its look-and-feel. As a result, emulation is tied to the exact technical specifications of the material to be preserved, which makes the development of more general methodologies difficult. Despite this, emulation has already been applied in preservation initiatives, most prominently the BBC Domesday project (cf. CAMiLEON).
In conclusion, while there is not much practical experience with the application of emulation in digital preservation, it might be a viable method. Its advantages and disadvantages have been analysed comprehensively (cf. Thibodeau). More research and practical experience are needed before emulation can be incorporated in reliable preservation solutions. Cooperation with other areas that apply emulation, some of which were mentioned in the first paragraph, may help in this task.
* CAMiLEON: BBC Domesday (2002); http://www.si.umich.edu/CAMILEON/domesday/domesday.html
* Simon N. Goodwin: An Overview of Emulators for Linux. Linux Format Magazine (April 2001); http://simon.mooli.org.uk/LXF/Overview/Overview.html
* Alok Patel: Software Emulation of an embedded Microprocessor System. Thesis, University of Queensland (October 2001); http://innovexpo.itee.uq.edu.au/2001/projects/s341612/thesis.pdf
* G. Schneider: On longtime preservation of digital documents. http://www.exp-math.uni-essen.de/algebra/veranstaltungen/schneid1.ppt
* Jeff Rothenberg: Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation (June 1999); http://www.clir.org/pubs/reports/rothenberg/pub77.pdf
* Kenneth Thibodeau: Overview of Technological Approaches to Digital Preservation and Challenges in the Coming Years. In: CLIR Conference Proceedings: The State of Digital Preservation: An International Perspective. (July 2002); http://www.clir.org/pubs/reports/pub107/thibodeau.html
Submitted by query on 27 February 2004 at 13:38
What is the role of traditional identification schemes such as the ISBN (International Standard Book Number) in the unique identification of digital resources on the World Wide Web, and what impact do they have on digital preservation?
Answered by dutched on 27 February 2004 at 13:46
Unique and persistent identification of digital objects is a central component of any digital preservation strategy. Without it, the administration and retrieval of digital objects in preservation systems is encumbered, perhaps entirely impossible. Your question refers to a more global kind of unique identification that allows any user to retrieve digital resources from the World Wide Web.
Traditional identification schemes such as the ISBN assign a unique code to resources. With an ISBN code at hand you can go to any bookshop and request a specific publication. On the World Wide Web, a ‘resolution system’ is additionally needed that maps the identifier to the actual location of the digital resource. Unique identifiers for the digital world such as the DOI (Digital Object Identifier) or the PURL (Persistent URL) thus combine unique identification and resolution.
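The separation between identifier and location can be shown with a toy resolver in Python. This is a sketch only: the identifiers, the URLs, and the functions `resolve` and `relocate` are invented for illustration and do not reflect how DOI or PURL services are actually implemented.

```python
# Hypothetical resolution table: identifier -> current location.
# Both the identifiers and the URLs are invented for illustration.
RESOLUTION_TABLE = {
    "id:report-2003-01": "http://archive.example.org/store/0001.pdf",
    "id:report-2003-02": "http://archive.example.org/store/0002.pdf",
}


def resolve(identifier):
    """Return the current location registered for an identifier."""
    if identifier not in RESOLUTION_TABLE:
        raise LookupError(f"unknown identifier: {identifier}")
    return RESOLUTION_TABLE[identifier]


def relocate(identifier, new_location):
    """Record that a resource has moved; the identifier itself never changes."""
    resolve(identifier)  # raises LookupError for unknown identifiers
    RESOLUTION_TABLE[identifier] = new_location


print(resolve("id:report-2003-01"))
relocate("id:report-2003-01", "http://mirror.example.org/reports/0001.pdf")
print(resolve("id:report-2003-01"))
```

A citation that uses the identifier keeps working after the move, because only the table entry changes.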
The role of traditional identifiers in the digital world is still open. There have been attempts to incorporate traditional identifiers in existing resolution systems such as the URN (Uniform Resource Name). For example, the ISSN scheme (which assigns unique numbers to serial publications in paper as well as digital form) has implemented a resolution system for the digital world called the ‘ISSN URN Demonstrator’.
As John Kunze pointed out with his ARK identifier proposal, an important aspect of an identifier for a future information infrastructure is that users can trust it. Since traditional identifiers are already established in society, their adoption in digital resolution systems may be favourable. Whether this is viable, however, and which solution will be adopted by most organisations and will hence offer the most comprehensive service, remains unclear at this point in time.
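As a rough sketch of how a traditional identifier can be carried into a URN, the Python functions below validate an ISSN with its standard modulus-11 check digit and wrap it in the ISSN namespace described in RFC 3044. The function names and the sample ISSNs are choices made for this example, the lower-case `urn:` prefix is a stylistic choice (the scheme name and the namespace identifier are case-insensitive), and no claim is made about how the ISSN URN Demonstrator itself works.

```python
import re


def issn_check_digit(first_seven):
    """Compute the ISSN check character from the first seven digits."""
    # Weights run from 8 down to 2 across the seven digits.
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def issn_to_urn(issn):
    """Validate an ISSN of the form NNNN-NNNC and return it as a URN."""
    match = re.fullmatch(r"(\d{4})-(\d{3})([\dXx])", issn.strip())
    if match is None:
        raise ValueError(f"not in ISSN format: {issn!r}")
    head, tail, check = match.group(1), match.group(2), match.group(3).upper()
    if issn_check_digit(head + tail) != check:
        raise ValueError(f"check digit mismatch: {issn!r}")
    return f"urn:ISSN:{head}-{tail}{check}"


print(issn_to_urn("1082-9873"))
print(issn_to_urn("2434-561x"))
```

Validation of this kind only confirms that the string is well formed; mapping the URN to an actual location is still the job of a resolution service.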
* Diana Dack: Persistent Identification Systems. consultancy NLA, May 2001. http://www.nla.gov.au/initiatives/persistence/PIcontents.html
* Meredith Dickison: Persistent Locators for Federal Government Publications. Summary of a study conducted for the Depository Services Program and the National Library of Canada. November 19, 2002. http://www.nlc-bnc.ca/8/4/r4-500-e.html
* Giuseppe Vitiello: Identifiers and Identification Systems: An Informational Look at Policies and Roles from a Library Perspective. In: D-Lib 10(1); January 2004. http://www.dlib.org/dlib/january04/vitiello/01vitiello.html
* K.Sollins, L.Masinter (Network Working Group): IETF RFC 1737 (Internet Engineering Task Force - Request for Comment) - Functional Requirements for Uniform Resource Names. December 1994. http://www.ietf.org/rfc/rfc1737.txt
* S. Rozenfeld (ISSN International Centre): Using The ISSN (International Serial Standard Number) as URN (Uniform Resource Names) within an ISSN-URN Namespace. IETF RFC 3044 (Internet Engineering Task Force - Request for Comment); January 2001.
* ISSN URN Demonstrator; http://urn.issn.org/
* Andy Powell: Guidelines for encoding identifiers in Dublin Core metadata. 2nd Draft, July 2002. http://www.ukoln.ac.uk/metadata/dcmi/dc-identifiers/2002-07-24/
* John Kunze: The ARK Persistent Identifier Scheme. Internet Engineering Task Force (IETF) Internet Draft, July 2003. http://www.ietf.org/internet-drafts/draft-kunze-ark-06.txt
Submitted by query on 27 February 2004 at 13:37
I work at a mathematics institute. We plan to preserve all of our reports permanently. In which data format should we best store the mathematical formulas?
Answered by dutched on 27 February 2004 at 13:45
Your question addresses the format most appropriate for preserving reports that contain mathematical formulas over the long term. Mathematical formulas, with their fractions, radicals, and other special notations, are often difficult to represent in software applications. Some text-processing tools provide specific modules to embed formulas. However, these are mostly proprietary and unstable, and hence not suitable for long-term preservation.
A data format often used for preservation purposes in general is PDF (Portable Document Format). For retaining mathematical formulas, which are highly sensitive to errors, however, it is not considered appropriate. This is because the PDF format is not designed for representing formulas and is not stable enough for their long-term preservation. In one actual case, the signs in formulas changed when moving from one PDF Reader version to a newer one, rendering the formulas invalid.
A possible option is to store formulas in an image format. Storing formulas as images means, however, that the formulas cannot be reused at a later point. To embed formulas in a text document, either inline images have to be put in place of the formulas, or the whole report needs to be stored as an image. This again makes reuse of the document difficult. Thus, while storage in an image format is an option, it might not be the preferred choice.
Instead, digital archives currently often use the LaTeX format (1) to store scientific text containing mathematical formulas. LaTeX is a markup language that stores information together with control commands and is in some respects similar to XML (eXtensible Markup Language). The LaTeX specification is open and freely available. Conversion tools can automatically translate LaTeX sources to specific PDF versions, HTML, PostScript, or other formats. Furthermore, LaTeX is widely used for typesetting in the scientific community; in fact, it is the de facto standard for scientific documents, and hence various software tools and tutorials are available, mostly free of charge and open source. These features make LaTeX suitable for long-term preservation.
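As a small illustration of what is actually archived in this case, the following minimal LaTeX document (an example chosen here, not taken from any particular archive) contains a formula with a fraction and a radical. The formula is stored as plain, human-readable text, which is what allows it to be re-rendered or converted later.

```latex
\documentclass{article}
\begin{document}
The roots of $ax^2 + bx + c = 0$ with $a \neq 0$ are
\[
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.
\]
\end{document}
```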
In the future MathML, the Mathematical Markup Language, might become a viable alternative to LaTeX. MathML is currently being specified by the W3C, the World Wide Web Consortium. The markup language consists of a number of XML tags, and is closely related to HTML. Apart from the representation of mathematical formulas on the web, it is envisioned as a low-level format for describing mathematics as a basis for machine-to-machine communication, and as a means to facilitate use and re-use of scientific content. Once the definition of MathML is finalised, automatic conversion tools from LaTeX to MathML are expected to become available.
To sum up, LaTeX is the most viable data format for the preservation of mathematical formulas at this point in time. In the future, MathML may take this position. Changing from LaTeX to MathML can be expected to be relatively easy, facilitated by tools and other support mechanisms that will be freely available via the web.
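To give an impression of MathML as a set of XML tags, the Python sketch below builds presentation markup for the expression (a + 1) / √x with the standard library. The helper functions `el` and `fraction` are invented for this example; only the element names (`math`, `mfrac`, `mrow`, `msqrt`, `mi`, `mo`, `mn`) and the namespace URI come from MathML itself.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Serialise MathML elements without a prefix (default namespace).
ET.register_namespace("", MATHML_NS)


def el(tag, text=None, *children):
    """Create a MathML element with optional text and child elements."""
    node = ET.Element(f"{{{MATHML_NS}}}{tag}")
    node.text = text
    node.extend(children)
    return node


def fraction(numerator, denominator):
    """Wrap two sub-expressions in an mfrac element."""
    return el("mfrac", None, numerator, denominator)


# Presentation markup for (a + 1) / sqrt(x).
expr = el(
    "math",
    None,
    fraction(
        el("mrow", None, el("mi", "a"), el("mo", "+"), el("mn", "1")),
        el("msqrt", None, el("mi", "x")),
    ),
)

markup = ET.tostring(expr, encoding="unicode")
print(markup)
```

Because the result is ordinary XML, it can be processed with the same generic tools as any other XML-based archival format.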
(1) LaTeX is a TeX macro package developed by Leslie Lamport in the 1980s. The typesetting language TeX in turn was developed by Donald Knuth. TeX stands for ‘Tau Epsilon Chi’ and describes virtuosity and applied knowledge (cf. http://latex.yauh.de/latex.html).
For a more detailed explanation refer to “The TeXbook” by Donald Knuth, Addison-Wesley, ISBN 0201134489.
* arXiv.org: Why submit the TeX source?. Online documentation of the arXiv e-Print archive. http://arxiv.org/help/faq/whytex
* MathDiss International - Ergebnisse und Visionen. Workshop, 28.-29.November 2002, Staats- und Universitäts-Bibliothek (SUB) Göttingen. (in German) http://www.ub.uni-duisburg.de/mathdiss/work2002.html
* Thomas Fischer: LaTeX as an Archiving Format: Benefits and Problems. Experiences from the MathDiss International Project and the EMANI project. Presented at ETD 2003, Electronic Theses and Dissertations Worldwide. http://edoc.hu-berlin.de/etd2003/fischer-thomas/PDF/index.pdf
* TeX Users Group Home Page. http://www.tug.org/
* W3C Math Home. http://www.w3.org/Math/