http://downloads.bnl.lu/schemas/mets_profile_bnl_newspaper.xml
National Library of Luxembourg, METS Profile for Digitized Newspapers
This profile describes the XML output required for the digitization project of the National Library of Luxembourg (BnL).
The default XML output consists of a METS file that describes the physical and logical structure of a printed document.
The content files, typically image files, are described by ALTO XML files.
This profile explains the core elements of the METS file: the physical structure (image page linking), the logical structure of the document (articles, sections, chapters, paragraphs, illustrations, ...), the descriptive (dmdSec), administrative (amdSec) and technical metadata, and in particular the descriptive metadata of structural elements.
2017-04-10T12:00:00
Ralph Marschall
Bibliothèque nationale de Luxembourg (BnL)
31, bvd Konrad Adenauer. L-1115 Luxembourg
+352 26 09 59 219
ralph.marschall@bnl.etat.lu
There are no related profiles.
METS: Metadata Encoding and Transmission Standard
http://www.loc.gov/standards/mets/mets.xsd
MODS: Metadata Object Description Schema, version 3.0
http://www.bnl.lu/schemas/mods.xsd
This schema is used to describe metadata stored in dmdSec elements.
ALTO: Analyzed Layout and Text Object, version 3.1
https://www.loc.gov/standards/alto/v3/alto-3-1.xsd
NISO Data Dictionary: Technical Metadata for Digital Still Images, version 0.2
http://www.loc.gov/standards/mix/mix.xsd
This schema is used to describe metadata of archival (TIFF) images.
XML Linking Language (XLink), version 1.0
http://www.loc.gov/standards/mods/xlink.xsd
ODRL: Open Digital Rights Language
https://www.w3.org/ns/odrl/2/ODRL21.xsd
This schema is used to model rights and uses for documents subject to specific copyrights.
PREMIS Data Dictionary for Preservation Metadata
https://www.loc.gov/standards/premis/premis.xsd
This schema is used to model rights and uses for documents subject to specific copyrights.
dmdSec
These elements contain metadata encoded using MODS and, in some cases, MARC21.
amdSec
These elements contain metadata related to scanned images, encoded using MIX 2.0. Some amdSec elements might contain copyright information encoded using ODRL and PREMIS.
Each document contains a <metsHdr> element containing the creation and last modification date.
It contains at least one agent with attribute ROLE="CREATOR", which documents the software and version used to process and create the METS file.
The complete technical requirements can be found on downloads.bnl.lu.
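The header requirements above can be checked with a short script. The following is a minimal sketch, not BnL tooling: it assumes the two dates are carried in the standard METS attributes CREATEDATE and LASTMODDATE (the profile does not name them), and the sample header, including the software name, is invented for illustration.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical sample header; the agent name and dates are invented.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:metsHdr CREATEDATE="2017-04-10T12:00:00" LASTMODDATE="2017-04-10T12:00:00">
    <mets:agent ROLE="CREATOR" TYPE="OTHER" OTHERTYPE="SOFTWARE">
      <mets:name>ExampleTool 1.0</mets:name>
    </mets:agent>
  </mets:metsHdr>
</mets:mets>"""


def check_mets_header(xml_text):
    """Return a list of problems found in <metsHdr>; empty if none."""
    root = ET.fromstring(xml_text)
    hdr = root.find("mets:metsHdr", METS_NS)
    if hdr is None:
        return ["missing <metsHdr>"]
    problems = []
    # Assumption: the dates live in the standard METS header attributes.
    for attr in ("CREATEDATE", "LASTMODDATE"):
        if not hdr.get(attr):
            problems.append(f"missing {attr}")
    # At least one agent must have ROLE="CREATOR".
    creators = [
        agent for agent in hdr.findall("mets:agent", METS_NS)
        if agent.get("ROLE") == "CREATOR"
    ]
    if not creators:
        problems.append('no agent with ROLE="CREATOR"')
    return problems


print(check_mets_header(SAMPLE))
```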
Every METS file contains several <dmdSec> elements. These elements contain different metadata and are linked back to specific elements in the logical structure.
For all document types (Newspaper, Serial and Monograph), there is 1 mandatory <dmdSec> element:
<dmdSec> with ID="MODSMD_PRINT": Contains MODS metadata related to the printed version, including information such as the current title, sub-title, issue number, date of the issue, printer, publisher and more.
The ID of the <dmdSec> element is referenced in the DMDID attribute of the top-level <div> of the <structMap> with TYPE="PHYSICAL" (physical structure) and in the DMDID attribute of the <div> with TYPE="VOLUME" (logical structure).
For documents of type Newspaper and Serial, there is 1 additional mandatory <dmdSec> element:
<dmdSec> with ID="MODSMD_COLLECTION": Contains MODS metadata related to the collection, such as the id, main title and language of the paper.
The ID of the <dmdSec> element is referenced in the DMDID attribute of the top-level <div> of the <structMap> with TYPE="PHYSICAL" (physical structure) and in the DMDID attribute of the <div> with TYPE="VOLUME" (logical structure).
For documents of type Monograph, there is 1 additional mandatory <dmdSec> element:
<dmdSec> with ID="MARCMD_ALEPHSYNC": Contains MARC metadata related to the system number of the monograph (controlfield 001).
After these mandatory <dmdSec> elements, there is 1 additional <dmdSec> element for every <div> element in the logical structure that has one of the types listed below:
Article
Section
Illustration (as well as Map, Chart, Diagram)
Supplement
Chapter (or Contribution or Review)
Appendix
These <dmdSec> elements contain extra metadata, such as the title, author and language, for a particular <div> element in the logical structure.
The ID of the <dmdSec> element is referenced in the DMDID attribute of that <div> element.
The complete technical requirements can be found on downloads.bnl.lu.
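The mandatory <dmdSec> rules per document type can be expressed as a small lookup. This is a sketch under stated assumptions: the function name and the sample file are hypothetical, and only the three mandatory IDs named in this profile are checked, not the per-<div> <dmdSec> elements.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Mandatory <dmdSec> IDs per document type, as listed in this profile.
MANDATORY = {
    "Newspaper": {"MODSMD_PRINT", "MODSMD_COLLECTION"},
    "Serial": {"MODSMD_PRINT", "MODSMD_COLLECTION"},
    "Monograph": {"MODSMD_PRINT", "MARCMD_ALEPHSYNC"},
}

# Hypothetical sample; "MODSMD_ARTICLE1" is an invented ID for a per-<div> dmdSec.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:dmdSec ID="MODSMD_PRINT"/>
  <mets:dmdSec ID="MODSMD_ARTICLE1"/>
</mets:mets>"""


def missing_dmdsecs(xml_text, doc_type):
    """Return the sorted mandatory dmdSec IDs that are absent from the file."""
    root = ET.fromstring(xml_text)
    present = {d.get("ID") for d in root.findall("mets:dmdSec", METS_NS)}
    return sorted(MANDATORY[doc_type] - present)


print(missing_dmdsecs(SAMPLE, "Newspaper"))
print(missing_dmdsecs(SAMPLE, "Monograph"))
```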
The metadata of every scanned page is recorded in an <amdSec> element. Inside that element, the path "techMD/mdWrap/xmlData" leads to a <mix:mix> element, which describes the image metadata in MIX 2.0 format.
The complete technical requirements can be found on downloads.bnl.lu.
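Following the path "techMD/mdWrap/xmlData" to <mix:mix> can be sketched as below. The MIX namespace URI used here is the one commonly associated with MIX 2.0 and is an assumption, as are the amdSec IDs and the function name.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    # Assumption: MIX 2.0 namespace URI; adjust if the files declare another.
    "mix": "http://www.loc.gov/mix/v20",
}

# Hypothetical sample; the IDs are invented for illustration.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:mix="http://www.loc.gov/mix/v20">
  <mets:amdSec ID="IMGPARAM00001">
    <mets:techMD ID="IMGPARAM00001TECHMD">
      <mets:mdWrap MDTYPE="NISOIMG">
        <mets:xmlData><mix:mix/></mets:xmlData>
      </mets:mdWrap>
    </mets:techMD>
  </mets:amdSec>
  <mets:amdSec ID="IMGPARAM00002"/>
</mets:mets>"""


def amdsecs_without_mix(xml_text):
    """Return the IDs of <amdSec> elements that lack a <mix:mix> element."""
    root = ET.fromstring(xml_text)
    path = "mets:techMD/mets:mdWrap/mets:xmlData/mix:mix"
    return [
        amd.get("ID")
        for amd in root.findall("mets:amdSec", NS)
        if amd.find(path, NS) is None
    ]


print(amdsecs_without_mix(SAMPLE))
```

Note that an <amdSec> holding only rights metadata (ODRL or PREMIS) would also be reported by this sketch, so the result is a list of candidates to inspect rather than a list of errors.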
In the <fileSec> element, there is 1 <fileGrp> per type of linked files. Below is the list of <fileGrp> elements with their attributes and description:
ID="IMGGRP", USE="Images": Links all TIFF files, used as archival images.
ID="ALTOGRP", USE="Text": Links all ALTO files, containing the OCR.
ID="PDFGRP", USE="PDF": Links all PDF files. Each page is available as a 1-page PDF.
ID="BWGRP", USE="BlackWhiteImages": Links all high-contrast black and white image files.
ID="THUMBGRP", USE="Thumbnails": Links all thumbnail files. Thumbnails are smaller-sized images based on the TIFF and saved in JPEG format.
ID="COMPLETEOBJECTGRP", USE="CompleteObject": Links the all-pages PDF file.
Each <fileGrp> element contains several <file> elements. Each <file> element has attributes such as ID, CREATED, MIMETYPE, ADMID (link to the amdSec), SEQ (sequence), GROUPID, CHECKSUM, CHECKSUMTYPE and SIZE.
Inside <file> is a <FLocat> element, which points to the actual file resource using the attribute xlink:href.
The complete technical requirements can be found on downloads.bnl.lu.
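To illustrate how the <fileGrp>, <file> and <FLocat> elements fit together, the sketch below collects the xlink:href values per file group. The file paths and file IDs in the sample are hypothetical; only the group IDs come from the list above.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
# ElementTree addresses namespaced attributes as "{namespace-uri}localname".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample; file IDs and paths are invented for illustration.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp ID="IMGGRP" USE="Images">
      <mets:file ID="IMG00001" MIMETYPE="image/tiff" SEQ="1">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp ID="ALTOGRP" USE="Text">
      <mets:file ID="ALTO00001" MIMETYPE="text/xml" SEQ="1">
        <mets:FLocat LOCTYPE="URL" xlink:href="text/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def files_by_group(xml_text):
    """Map each fileGrp ID to the list of xlink:href values of its files."""
    root = ET.fromstring(xml_text)
    result = {}
    for grp in root.findall("mets:fileSec/mets:fileGrp", NS):
        hrefs = []
        for file_el in grp.findall("mets:file", NS):
            flocat = file_el.find("mets:FLocat", NS)
            if flocat is not None:
                hrefs.append(flocat.get(XLINK_HREF))
        result[grp.get("ID")] = hrefs
    return result


print(files_by_group(SAMPLE))
```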
Physical structMap
This structMap describes the physical sequence of the images for this document.
Under the structMap is a <div> element having attributes:
ID
LABEL (The title of the document)
DMDID (with references to the MODSMD_COLLECTION, if it is a newspaper or serial, and MODSMD_PRINT)
TYPE (e.g. Newspaper, Serial, Monograph)
Then, for each page, there is one <div> element containing the elements "fptr > par", with 1 <area> for each TIFF, PDF, ALTO and PNG file related to the same physical page.
That div must also have the following attributes:
ID: The ID of the element.
ORDER: An automatically incremented value starting at '1'. This reflects the physical sequence of images.
TYPE: This is set to the value "PAGE".
LABEL: The page number as it is printed on this particular page. If no page number is printed on the page, the value of "LABEL" should be the page number within the page sequence (i.e. the same as ORDERLABEL).
ORDERLABEL: The page number within the page sequence in roman numerals. It is filled automatically for pages without a printed page number.
The complete technical requirements can be found on downloads.bnl.lu.
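The ORDER rule for page <div> elements (values incremented from 1 in physical sequence) lends itself to an automated check. The sketch below assumes the page <div> elements are direct children of the top-level <div>, as described above; the IDs in the sample are invented, and the third page deliberately carries a wrong ORDER value.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}

# Hypothetical sample; IDs and labels are invented, ORDER="4" is a planted error.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="DIVP1" TYPE="Newspaper" LABEL="Example title">
      <mets:div ID="DIVP2" TYPE="PAGE" ORDER="1" ORDERLABEL="I" LABEL="1"/>
      <mets:div ID="DIVP3" TYPE="PAGE" ORDER="2" ORDERLABEL="II" LABEL="2"/>
      <mets:div ID="DIVP4" TYPE="PAGE" ORDER="4" ORDERLABEL="III" LABEL="3"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def page_order_problems(xml_text):
    """Return (ID, expected ORDER, found ORDER) for each out-of-sequence page."""
    root = ET.fromstring(xml_text)
    pages = root.findall(
        "mets:structMap[@TYPE='PHYSICAL']/mets:div/mets:div[@TYPE='PAGE']", NS
    )
    problems = []
    for position, page in enumerate(pages, start=1):
        expected = str(position)
        found = page.get("ORDER")
        if found != expected:
            problems.append((page.get("ID"), expected, found))
    return problems


print(page_order_problems(SAMPLE))
```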
Single File structMap
This structMap describes the all-pages PDF file that is the complete object of this document.
It contains a single <div> element containing one "fptr > par > area" element. The div has the following attributes:
ID
LABEL="Physical Structure"
TYPE="CompleteObject"
The complete technical requirements can be found on downloads.bnl.lu.
Logical structMap
This structMap describes the logical structure of the document.
It describes this logical structure as a hierarchical tree of <div> elements, each representing a whole issue or a part of it, such as an illustration, image caption, chapter, title, subtitle or paragraph.
The purpose of such a structured hierarchy is to allow full text search to be applied within particular structural elements (for example chapter headings or table captions) and also within certain zone types such as captions of illustrations.
Every <div> has an ID. The IDs of the <div> elements in the logical <structMap> element must have the format DIVL[d], where [d] is a counter that starts at 1 for every issue and is incremented for every new <div>: the top-level <div> has ID="DIVL1", the second <div> has ID="DIVL2", followed by "DIVL3", "DIVL4" and so on.
Each <div> element has a TYPE attribute that describes which part of an issue the <div> refers to, with values such as "ISSUE", "SECTION", "HEADLINE", "ARTICLE", "BODY_CONTENT", "PARAGRAPH", "CHAPTER", "TABLE", etc. To simplify the handling of the data (e.g. loading data into a presentation system), these values must be chosen from a controlled vocabulary, which is defined by the corresponding newspaper, serial or monograph schema.
The logical structure is validated using the BnL Newspaper schema or BnL Monograph schema.
The complete technical requirements can be found on downloads.bnl.lu.
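The DIVL[d] numbering rule can be verified by walking the logical structMap. The sketch below makes two assumptions that the profile does not state explicitly: that the logical structMap carries TYPE="LOGICAL", and that "every new <div>" means document order. The sample is invented and skips DIVL4 on purpose.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
NS = {"mets": METS}

# Hypothetical sample; the last ID is deliberately wrong (DIVL5 instead of DIVL4).
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="DIVL1" TYPE="ISSUE">
      <mets:div ID="DIVL2" TYPE="ARTICLE">
        <mets:div ID="DIVL3" TYPE="HEADLINE"/>
        <mets:div ID="DIVL5" TYPE="PARAGRAPH"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def logical_id_problems(xml_text):
    """Return messages for <div> IDs that break the DIVL1, DIVL2, ... sequence."""
    root = ET.fromstring(xml_text)
    # Assumption: the logical structMap is marked with TYPE="LOGICAL".
    struct_map = root.find("mets:structMap[@TYPE='LOGICAL']", NS)
    if struct_map is None:
        return ["no logical structMap found"]
    problems = []
    # iter() yields elements in document order, parents before children.
    for counter, div in enumerate(struct_map.iter(f"{{{METS}}}div"), start=1):
        expected = f"DIVL{counter}"
        found = div.get("ID")
        if found != expected:
            problems.append(f"expected {expected}, found {found}")
    return problems


print(logical_id_problems(SAMPLE))
```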
No <structLink> element is allowed in the METS file.
No <behaviorSec> element is allowed in the METS file.
<multiSection> element is allowed in the METS file.
Every digitized document consists of 1 METS file, 1 all-pages PDF file and, for each page, 1 TIFF, 1 ALTO file, 1 PDF, 1 black and white image and 1 thumbnail.
This is referred to as the METS/ALTO package.
The METS/ALTO package must follow the complete requirements of the BnL, which are available at downloads.bnl.lu.
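The composition of the package implies a simple file count: 2 files per document plus 5 files per page. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the count covers only the files listed in this profile.

```python
def expected_file_count(page_count):
    """Return the number of files in a METS/ALTO package with page_count pages."""
    if page_count < 1:
        raise ValueError("a document must have at least one page")
    per_document = 2  # 1 METS file + 1 all-pages PDF
    per_page = 5      # TIFF, ALTO, single-page PDF, black and white image, thumbnail
    return per_document + per_page * page_count


print(expected_file_count(4))  # 2 + 5 * 4 = 22
```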
No behavior files are allowed in the METS file.
No metadata files are allowed in the METS file.
There are no tools for the moment.
There are no examples for the moment.
There are no appendices for the moment.