!!!PDF

This is by far the most problematic format to convert to xml. It often requires extensive adjustment of the variables in the metadata document to get the wanted output in the converted document.

Portable Document Format (PDF) is a digital document format developed by Adobe Systems and introduced in 1993. Each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it. A loose definition of the format could be "digital paper".

Extracting text from a pdf document is comparable to extracting text using OCR: to retain the "story" of the document, we often need to skip pages, headers, footers, page numbers, footnotes, etc.

!!Converted document contains less (or no) text compared to the original document

Decrease the margins to 0, then compare the original document to the converted output. Then adjust the variables to taste.

!!Extracting individual articles from a document

Some documents contain many articles written by different authors. To attribute each text to its authors correctly, we need to extract each article from the document.

First download the document into a corpus, preferably using __add_files_to_corpus__. Remove the metadata document of the downloaded document; we will not need it.

Make one soft link to the document for each article, e.g.

{{{
ln -s original.pdf original-author1-author2.pdf
ln -s original.pdf original-author3-author4.pdf
}}}

Run convert2xml on both soft-linked documents to make basic metadata files for them.

{{{
convert2xml original-author1-author2.pdf
convert2xml original-author3-author4.pdf
}}}

Then use __skip_pages__ in the files {{original-author1-author2.pdf.xsl}} and {{original-author3-author4.pdf.xsl}} so that only the wanted pages are left in the converted documents.
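
When a document contains more than a couple of articles, the linking step above can be done in a loop. The following is a minimal sketch of that step only, not part of the corpus tools. It assumes the placeholder names {{original.pdf}}, {{author1-author2}} and {{author3-author4}}, and it works in a scratch directory with an empty stand-in file so that no real corpus files are touched.

```shell
#!/bin/sh
# Sketch only: file names and author groups are placeholders.
set -eu

# Work in a scratch directory so that no real corpus files are touched.
workdir=$(mktemp -d)
cd "$workdir"
: > original.pdf    # empty stand-in for the downloaded document

# One soft link per article, named after the authors of that article.
for authors in author1-author2 author3-author4; do
    ln -s original.pdf "original-$authors.pdf"
done

# Show each link and the file it points to.
for link in original-*.pdf; do
    printf '%s -> %s\n' "$link" "$(readlink "$link")"
done
```

In a real corpus, run the loop in the directory of the downloaded document instead, and then run convert2xml on each link as described above.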
!!Order in the converted document is not retained

Run the command:

{{{
pdftohtml -hidden -enc UTF-8 -stdout -nodrm -i -xml documentname.pdf | less
}}}

to see whether the order of the text is retained. This is the command the pdf converter uses for the first conversion from pdf to xml. It produces an xml format specific to the [poppler|https://poppler.freedesktop.org/] tools, which pdftohtml is a part of.

If the order of the text in the output of the above command differs from the order in the converted document, there is a bug in the pdf converter. File a bug on bugzilla. Use the __product__ "Corpus", __component__ "xml conversion".

!!Most of the text lines in the pdf documents are interpreted as paragraphs
Have a look at the documentation on linespacing below.
!!Variables specific to pdf documents
!!Skipping pages
Typical uses are to skip the front page, pages containing tables of contents, indexes, etc. In short, this removes pages that are not relevant to the "story" of the document.
{{{
This sentence
is divided
into many paragraphs,
although it clearly
to a human eye,
only is a single
paragraph.
}}}

Increasing the value for this variable improves this situation.

{{{