This page documents conventions for transcriptions and other notations applied in the [ELAN|ELAN.html] annotations created by the [Freiburg-Tromsø Speech Corpora|freiburg.html].
Since the audio/video remains linked to our onnotations in ELAN, we do not need very complicated transcription conventions, like marking overlapping speech, and other nuances (hesitations, different lenghts for pauses, etc.) Instead, we aim for a simple orthographic transcription and need only very few special tags in the transcription tier.
!!!Tier "ref" (reference)
The annotations in this tier provide unique references to single linguistic units (like sentences, utterances, etc.) in the ELAN-file. The ELAN program has a function for automatic labelling und numbering of annotation units.
A typical label looks like:
*sms1961lagercrantz318.003
where __sms1961lagercrantz318__ is the file name (including a hint to the language [[sms], the data of origin of the text [[1961] and additional textual explanation [[lagercrantz]. __003__ marks the third sentence in the text.
!!!Tier "orth" (orthography)
This tier is a child of __ref__. It includes either an orthographic transcription (of spoken texts), an orthographic transliteration (of texts written in other than the standard orthography) or the original orthography (of texts written in the standard orthography).
!!Capitalization and punctuation
In the case of transcribed spoken texts, is up to the taste of the annotator whether or not the transcription has to comply with the rules for standard orthography.
In the case of written texts, the original orthography, incl. capitalization and punctuation is preserved as in the original.
If punctuation marks are used always write a whitespace before them (in order to tokenize them)! Note the (syntactic) difference between PUNCT vs. CLB in the punctuation marks!
!!Other conventions
Non-comprehensible speech
* (…) = one unclear word
* (…) (…) (three) = two unclear words, the third is perhaps "three"
* Part(…) = unclear ending
* Part(is)anen = I am hearing "is", but I am not quite certain
False start and correction or incomplete articulated word
* hyphen following right after the word, like in Partisa- Partanen
Clear Pause
* … (after whitespace)
Some authers have symbols or additional explanationsin their transcriptions, e.g. Lagercrantz: (Ill.) (1234) etc. These notations needs to be kept, but marked as belonging to non-speech. Leave these signs in the text, but put them in parenthesis.
!!!Tiers "ft" (free translation)
Depending on the source language, we translate into different languages, normally English and then relevant lingua franca (e.g. Russian for Kildin Sámi texts, Swedish for Pite Sámi texts, etc.). The target languages are specified by the three-letter-codes used also by Giellatekno elsewhere (e.g. deu = German, fin = Finnish).
Non-spoken parts in the translations are written in square brackets
* and [[he] sailed away
!!!ELAN tag sets
Tag sets are called "controlled vocabularies" in ELAN. Here are the tag sets currently used by our projects (copied from the ELAN-xml).
!!Language codes
*deu
*nob
*eng
*rus
*swe
*sjd
*sjt
*sje
*smj
*sme
*kpv
!!PoS
*(??)
*(NO)
*A
*ABBR
*ACR
*Adv
*CLB
*Cc
*Cs
*Det
*Interj
*N
*Num
*PUNCT
*Pcle
*Po
*Pr
*Pro
*Prop
*V
!!!See also
*Giellatekno's [tag sets for Kildin Sámi|/lang/sjd/docu-grammartags.eng.html]