A corpus linguist is anyone who investigates something with the help of corpora (you, for example). But our main goal should be to create a _corpus-linguistic infrastructure_, so that more can be made of DOBES later on.

Multimedia corpora and language technology for endangered Saami languages
or
Computer-based analyses of Saami language corpora
or
Spoken corpora and computer linguistics for three small Saami languages

what are corpora?
-Giellatekno corpus largely derived by automated processes (morphological + syntactic automators)
what are multimedia corpora?
-DOBES/ELAR corpora largely derived by hand (ELAN, Toolbox)
what is language technology?
why Saami?
-because computer-linguistic infrastructure is already available for the closely related (linguistically and culturally) written Northern Saami
-because computer-linguistic know-how is already available for simple and parallel written corpora of Northern Saami
what Saami?
-Pite Saami: annotated corpus available at ELAR; linguist (for programming morphological (and syntactic) automators) available: Josh
-Skolt, Kildin, Ter Saami: annotated corpus available at DOBES; linguists (for programming morphological (and syntactic) automators) available: Micha + Lena

Corpus linguistics = study of language as expressed in "real world" text
Computer linguistics = machine processing of natural language
-development of analysis and generation methods for natural-language texts
-programs for collecting and statistically evaluating large amounts of language data (lemmatization, frequency word lists, concordances)
-practical applications: machine translation, computer-assisted language teaching
-Giellatekno already has all this for Northern Saami, partly also for Lule and South Saami, very little even for Kildin Saami

PROJECT AIM
Automatically annotated corpora for the spoken Pite Saami, Kildin Saami, Skolt Saami and Ter Saami languages

TEAM
documentary and computer linguists and programmers:
Micha (Saami documentary linguist, general linguist, Kildin and Skolt Saami)
-project leader
-??paid from the project as??
Josh (Saami documentary linguist, Pite Saami)
-paid from the project as Saami linguist
Lena (general linguist, Saami, syntax, semantics, computer linguistics)
-paid from the project as SHK (80h/week) for computer/corpus linguistics
-writes M.A./PhD
Trond (Saami linguist, general linguist, computer linguist, leader of Giellatekno)
-not paid from the project
?Ciprian (programmer at Giellatekno)
-not paid from the project
?Collaboration with Berlin

CONTENT
Documentation, Archiving, Automatic annotation

AIM
creation of workflows, tools useful for Saami and other DOBES languages
practical products:
-tagged (single and parallel) corpora of Saami languages = practical for linguists
-dictionaries = practical for linguists and revitalization
-teaching programs = practical for revitalization
methodological problems, questions of automatically annotated corpora

CORPUS OF WRITTEN LANGUAGE
we have many large corpora of written languages
-CORPUS OF SPOKEN LANGUAGE
--we have far fewer and smaller corpora of spoken language (e.g. dialects, sociolects)
--written representation of text
--different units than in text
--linguistic features characteristic of spoken language (intonation, hesitation, false start, etc.)
---CORPUS OF AN ENDANGERED SPOKEN LANGUAGE
---we have very few and only very small corpora of endangered languages
---written representation of text
---different units than in text
---linguistic features characteristic of spoken language (intonation, hesitation, false start, etc.)
---code switching to other languages
---language attrition

Giellatekno infrastructure (including tools for Northern languages)
DOBES infrastructure (Multimedia)
Knowledge of the Saami languages in question
Knowledge of corpus and computer linguistics
Knowledge of revitalization

Workflow
-manually annotated texts linked to audio and perhaps video (ELAN) already available
-create annotation in standard orthography (+ using conventions for features characteristic of spoken language)
-create automators
-parse
-create automatically annotated texts linked to audio and perhaps video (ELAN)

Practical applications
-oahppas
-dictionaries (GT+LEXUS)

DOBES requirements
-we have to use their tools (ELAN, LEXUS, etc.)
--[ELAN and LEXUS are used already]
-we have to use their archive (Nijmegen)
--[no problem]
-we have to involve the speech community
--[hire and train a native-speaker assistant for work with LEXUS (e.g. create a monolingual multimedia dictionary after automatic lemmatization)]
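
Sketch of the workflow steps listed above (orthographic tier, parse, automatically annotated tier), in Python. This is only an illustration under assumptions that do not come from the project: the tier names "orth" and "morph", the linguistic type names, the placeholder word forms and tags, and above all the dictionary lookup that stands in for a real morphological automator. Only the element names (TIER, ANNOTATION, ALIGNABLE_ANNOTATION, REF_ANNOTATION, ANNOTATION_VALUE) follow the ELAN file format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Minimal ELAN-style document. Tier name, type name and the "utterance"
# are invented placeholders, not project data.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="orth" LINGUISTIC_TYPE_REF="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>form1 form2 xyz</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

# Hypothetical toy lexicon: word form -> (lemma, tag string).
LEXICON = {
    "form1": ("lemma1", "N+Sg+Nom"),
    "form2": ("lemma1", "N+Pl+Nom"),
}


def analyse(token, lexicon):
    """Stand-in for a morphological automator: plain dictionary lookup.

    Unknown tokens keep their surface form as lemma and get the tag "?".
    """
    lemma, tags = lexicon.get(token.lower(), (token, "?"))
    return f"{lemma}+{tags}"


def annotate(root, source_tier, target_tier, lexicon):
    """Add a tier with one analysis string per annotation of source_tier.

    Returns a Counter of lemmas, i.e. a simple lemma frequency list.
    """
    src = root.find(f"./TIER[@TIER_ID='{source_tier}']")
    new_tier = ET.SubElement(
        root, "TIER",
        TIER_ID=target_tier,
        LINGUISTIC_TYPE_REF="analysis",  # assumed type name
        PARENT_REF=source_tier,
    )
    lemmas = Counter()
    for n, ann in enumerate(src.iter("ALIGNABLE_ANNOTATION"), start=1):
        tokens = (ann.findtext("ANNOTATION_VALUE") or "").split()
        analyses = [analyse(t, lexicon) for t in tokens]
        lemmas.update(a.split("+", 1)[0] for a in analyses)
        # The new annotation points back to the time-aligned utterance,
        # so it stays linked to the audio/video through its parent.
        wrapper = ET.SubElement(new_tier, "ANNOTATION")
        ref = ET.SubElement(
            wrapper, "REF_ANNOTATION",
            ANNOTATION_ID=f"auto{n}",
            ANNOTATION_REF=ann.get("ANNOTATION_ID"),
        )
        ET.SubElement(ref, "ANNOTATION_VALUE").text = " ".join(analyses)
    return lemmas


root = ET.fromstring(EAF)
lemma_counts = annotate(root, "orth", "morph", LEXICON)
print(lemma_counts.most_common())
print(ET.tostring(root, encoding="unicode"))
```

For a real .eaf file the linguistic type of the new tier would also have to be declared in the document, and the lookup would be replaced by a call to the actual automator. The lemma counts returned at the end are the kind of list from which a dictionary could be started after automatic lemmatization.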