Hi, Lars!

We have a plan: for 1a, 1b and 2 we have three related tasks. We will work
together, and decide later who will write what. Where it says "Sami" we mean
either North Sami/Lule Sami or North Sami/South Sami.

1a.1 RBMT: Read literature related to RBMT systems: Apertium, Gramtrans
1a.2 SMT:  Read literature related to SMT systems: Moses
1a.3 Word: Read literature related to GIZA++ and word alignment

1b.1 RBMT: Alpha version of a rule-based MT system for Sami
1b.2 SMT:  Alpha version of a statistical MT system for Sami
1b.3 Word: Alpha version of a word alignment system for Sami

2.1 RBMT: Work on / evaluate the Sami RBMT system
2.2 SMT:  Work on / evaluate the Sami SMT system
2.3 Word: Work on / evaluate the Sami system

&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&                                                                     &&&&&&&&&&
&&&&&&&&                    1a Reading list (preliminary)                    &&&&&&&&&&
&&&&&&&&                                                                     &&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

1a.1 RBMT: Read literature related to RBMT systems: Apertium, Gramtrans
=======================================================================

The RBMT reading list consists of the key papers listed on the home pages of
the rule-based systems Apertium and Gramtrans. We have chosen these two
systems since they are based on a robust parser component (CG), thereby
avoiding the problems faced by other rule-based systems.

(the list may be changed as we read the papers)

MTsummit07_final.pdf
Bick, Eckhard & Hansen: The Fyntour Multilingual Weather and Sea Dialogue
System

armentano05p.pdf
Carme Armentano-Oller, Antonio M. Corbí-Bellot, Mikel L.
Forcada, Mireia Ginestí-Rosell, Boyan Bonev, Sergio Ortiz-Rojas, Juan Antonio
Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez. An open-source
shallow-transfer machine translation toolbox: consequences of its release and
availability. In OSMaTran: Open-Source Machine Translation, A workshop at
Machine Translation Summit X, p. 23-30, September 12-16, 2005, Phuket,
Thailand

armentano06.pdf
Carme Armentano-Oller, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L.
Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz,
Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Miriam A. Scalco. Open-source
Portuguese-Spanish machine translation. In Lecture Notes in Computer Science
3960 (Computational Processing of the Portuguese Language, Proceedings of the
7th International Workshop on Computational Processing of Written and Spoken
Portuguese, PROPOR 2006), p. 50-59, May 13-17, 2006, ME - RJ / Itatiaia, Rio
de Janeiro, Brazil.

corbi05.pdf
Antonio M. Corbí-Bellot, Mikel L. Forcada, Sergio Ortiz-Rojas, Juan Antonio
Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Iñaki Alegria,
Aingeru Mayor, Kepa Sarasola. An open-source shallow-transfer machine
translation engine for the Romance languages of Spain. In Proceedings of the
Tenth Conference of the European Association for Machine Translation,
p. 79-86, May 30-31, 2005, Budapest, Hungary.

eamt2005.pdf
Carme Armentano-Oller, Antonio M. Corbí-Bellot, Mikel L.
Forcada, Mireia Ginestí-Rosell, Boyan Bonev, Sergio Ortiz-Rojas, Juan Antonio
Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez. An open-source
shallow-transfer machine translation toolbox: consequences of its release and
availability. In OSMaTran: Open-Source Machine Translation, A workshop at
Machine Translation Summit X, p. 23-30, September 12-16, 2005, Phuket,
Thailand

nodalida2007mt.pdf
Bick, Eckhard & Lars Nygård 2007: Using Danish as a CG Interlingua: A
Wide-Coverage Norwegian-English Machine Translation System

ramirez06.pdf
Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, Sergio Ortiz-Rojas, Juan
Antonio Pérez-Ortiz, Mikel L. Forcada. Opentrad Apertium open-source machine
translation system: an opportunity for business and research. In Proceedings
of Translating and the Computer 28 Conference, November 16-17, 2006, London,
United Kingdom

sanchez07a.pdf
Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, Mikel L. Forcada.
Integrating corpus-based and rule-based approaches in an open-source machine
translation system. In Proceedings of METIS-II Workshop: New Approaches to
Machine Translation, a workshop at CLIN 17 - Computational Linguistics in the
Netherlands, p. 73-82, January 11, 2007, Leuven, Belgium

sanchez07b.pdf
Felipe Sánchez-Martínez, Carme Armentano-Oller, Juan Antonio Pérez-Ortiz,
Mikel L. Forcada 2007: Training Part-of-Speech Taggers to build Machine
Translation Systems for Less-Resourced Language Pairs. In Procesamiento del
Lenguaje Natural, (XXIII Congreso de la Sociedad Española de Procesamiento
del Lenguaje Natural)

sanchez07c.pdf
Felipe Sánchez-Martínez, Mikel L. Forcada. Automatic induction of
shallow-transfer rules for open-source machine translation. In Proceedings of
TMI, The Eleventh Conference on Theoretical and Methodological Issues in
Machine Translation, p. ??-??, September 7-9, 2007, Skövde, Sweden.
tsd_paper.pdf
Dan2eng: Wide-Coverage Danish-English Machine Translation


1a.2 SMT: Read literature related to SMT systems: Moses
=======================================================

The SMT reading list contains one classical SMT paper (Brown et al 1990) (to
be supplemented by other introductory papers or textbook chapters), which lays
the foundation for the approach as such. Furthermore, we have chosen the
system Moses (as it is freely available) as an example of an SMT system.

(the list may be changed as we read the papers)

J90-2002.pdf
Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra,
Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin
1990: A statistical approach to machine translation. Computational
Linguistics Volume 16, Number 2, June 1990.

N07-1062.pdf
Richard Zens and Hermann Ney 2007: Efficient Phrase-table Representation for
Machine Translation with Applications to Online MT and Speech Translation.
Proceedings of NAACL HLT 2007, pages 492–499, Rochester, NY, April 2007.

P07-1040.pdf
Antti-Veikko I. Rosti and Spyros Matsoukas and Richard Schwartz 2007:
Improved Word-Level System Combination for Machine Translation. Proceedings
of the 45th Annual Meeting of the Association of Computational Linguistics,
pages 312–319, Prague, Czech Republic, June 2007.

P07-2045.pdf
Philipp Koehn et al 2007: Moses: Open Source Toolkit for Statistical Machine
Translation. Proceedings of the ACL 2007 Demo and Poster Sessions, pages
177–180, Prague, June 2007.

Shen_IWSLT_2006.pdf
Wade Shen, Richard Zens, Nicola Bertoldi, Marcello Federico 2006: The JHU
Workshop 2006 IWSLT System

W07-0725.pdf
Holger Schwenk 2007: Building a Statistical Machine Translation System for
French using the Europarl Corpus. Proceedings of the Second Workshop on
Statistical Machine Translation, pages 189–192, Prague, June 2007.

emnlp2007-factored.pdf
Philipp Koehn and Hieu Hoang 2007: Factored Translation Models.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, pp. 868–876,
Prague, June 2007.

sanchez06b.pdf
Felipe Sánchez-Martínez, Juan Antonio Pérez-Ortiz, and Mikel L. Forcada 2006:
Speeding up Target-Language Driven Part-of-Speech Tagger Training for Machine
Translation

icslp2002-srilm.ps.gz
Stolcke, Andreas: SRILM - an extensible language modeling toolkit. Speech
Technology and Research Laboratory. SRI International, Menlo Park, CA, U.S.A.
http://www.speech.sri.com/


1a.3 Word: Read literature related to GIZA++ and word alignment
===============================================================

The list contains the classical GIZA++ paper. We might also supplement it
with other word alignment publications.

(the list may be changed as we read the papers)

J03-1002.pdf
Franz Josef Och, Hermann Ney. "A Systematic Comparison of Various Statistical
Alignment Models", Computational Linguistics, volume 29, number 1, pp. 19-51
March 2003
This is the standard reference for GIZA++. It is pretty technical, and must be
supplemented by tutorial text.

caseli08p.pdf

&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&                                                                     &&&&&&&&&&
&&&&&&&&                      1a Commented reading list                      &&&&&&&&&&
&&&&&&&&                                                                     &&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

The idea is to write comments here, perhaps using svn.

1a.1 RBMT: Read literature related to RBMT systems: Apertium, Gramtrans
=======================================================================

Apertium
========

armentano06.pdf
-----------------------------------------------------------------------
This is the standard reference for Apertium.
Key quote: "[Apertium] is based on a simple rationale: to produce fast,
reasonably intelligible and easily correctable translations between related
languages, it suffices to use a MT strategy which uses shallow parsing
techniques to refine word-for-word MT."

sanchez07b.pdf
-----------------------------------------------------------------------
The goal of the paper is to show that small parallel corpora are enough to
train the HMM taggers used in MT.

sanchez07c.pdf
--------------------------------------------------------------------------
Sánchez-Martínez and Forcada 2007 report a mixed approach: the framework for
their MT is Apertium, a shallow-transfer RBMT system with morphological
analysis and disambiguation, transfer rules, and morphological generation.
The major result is that by having the machine learn the transfer rules from
a word-aligned parallel corpus, one obtains results that approach the quality
of hand-written rules.

Here is the word error rate (WER) for the different approaches. "AT count"
denotes the learned templates, and "AT log" gives more weight to long but
rarely matching templates. "Hand" denotes hand-coded transfer rules.

Trans. dir.  Eval. corpus  No rules  AT count  AT log   Hand
es-ca        post-edit     12.6 %     8.6 %     8.5 %    6.7 %
             parallel      26.6 %    20.4 %    20.4 %   20.8 %
ca-es        post-edit     11.6 %     8.1 %     8.1 %    6.5 %
             parallel      19.3 %    15.0 %    14.9 %   14.5 %


Gramtrans
=========

MTsummit07_final.pdf
Bick, Eckhard & Hansen: The Fyntour Multilingual Weather and Sea Dialogue System
--------------------------------------------------------------------------------
This paper reports on a practical application.

nodalida2007mt.pdf
Bick, Eckhard & Lars Nygård 2007: Using Danish as a CG Interlingua:
-------------------------------------------------------------------
This paper reports on a nob2eng system using Danish as an interlingua.
Relevant to us might be the nob2dan part. The article reports on chaining
nob2dan + dan2eng (and eng2dan + dan2nob).
The nob<>dan system was built like this:

Creating nob<>dan lexica: 1-1 word lists obtained like this:
(a) Create a large corpus of monolingual Norwegian text and lemmatize it
    automatically.
(b) Regard Norwegian as misspelled Danish, and run a Danish spell checker
    on (a).
(c) Produce phonetic transmutation rules for Norwegian and Danish spelling
    to generate hypothetical Danish words from Norwegian candidates, then
    check whether a word of the relevant word class is listed in the Danish
    lexicon.
(a-c) gave 226000 lemma pairs; these were refined and checked in various ways.

Translation procedure:
- Norwegian analysis
    den store bilen > den stor+Def bil+Def
- Norwegian-Danish lexical transfer
    den>den, stor>stor, bil>bil
- Norwegian-Danish grammatical transfer (via a CG grammar)
    SUBSTITUTE (DEF) (IDF) TARGET (N) IF (*1 ART BARRIER NONPREN/ADV) ;
    den stor+Def bil+Def -> den stor+Def bil+Indef
- Danish generation
    den store bil

tsd_paper.pdf
Dan2eng: Wide-Coverage Danish-English Machine Translation
---------------------------------------------------------
The paper presents the different components of the system, and reports on
the BLEU score of the results.

System architecture
(a) A Danish Constraint Grammar (DanGram)
    ~ 6000 rules
(b) Dependency rules establishing syntactic-semantic links between
    ~ 220 rules
(c) Lexical transfer rules, .. acc. to gramm. category, dependency...
    ~ 17.000 rules
(d) Generation rules for inflexion, verb chains, composita
    ~ 700 rules
(e) Syntactic transformation (movement) rules: word order, subclauses,
    questions etc.
    ~ 75 rules

Lexical transfer

  Local relations
    meget_ADV :a_lot;
      S=(>A) :very;
      D=(>A) :much
    If the word "meget" itself is @>A, choose "very"; if "meget" is
    dependent to @>A, choose "much"; otherwise, choose "a lot".

    slægt S=(S):family, S=(P):generation

  Non-local relations
    boligsøgende @>N:house-hunting
    boligsøgende (@SUBJ|@ACC):house-hunter

    regne_V1
      (a) D=(@S-SUBJ) :rain;
      (b) D=( @ACC) D=("for" PRP)_nil :consider;
      (c) D=("med" PRP)_on GD=() :count;
      (d) D=("med" PRP)_nil :expect;
      (e) D=(@ACC) D=("med" ADV)_nil :include;
      (f) D=( @SUBJ) D?=("på" PRP)_nil :calculate;

  MWE
    aflåst sideleje = recovery position
    male byen rød
      male_V :paint; .... D=("by" DEF @ACC)_nil; D=("rød" @OC)_nil :have some serious fun

Structural transfer

  Morphological structural transfer
    Private biler sælges ikke uden moms
    Private cars aren't sold without VAT
    s-passive > be + ...ed

  Transformations (movement rules)
    (@ADVL|@ACC|@FS-ADVL|@>>P),      I_dag
    w(@FMV|@FAUX|@FS-[^Q]+),         drikker
    w(@ICL-AUX<)?,
    w(@ADVL)?,
    (@SUBJ|@F-SUBJ|@S-SUBJ)          vi
    -> 1, 5, 2, 3, 4

Evaluation
  TER = 5-8
  BLEU = 0.55


1a.2 SMT: Read literature related to SMT systems: Moses
=======================================================

J90-2002.pdf
Peter F. Brown, ... 1990: A statistical approach to machine translation.
------------------------------------------------------------------------
This is the central paper.

N07-1062.pdf
Richard Zens and Hermann Ney 2007: Efficient Phrase-table Representation for Machine Translation
------------------------------------------------------------------------------------------------

P07-1040.pdf
Antti-Veikko I. Rosti ...2007: Improved Word-Level System Combination for Machine Translation.
------------------------------------------------------------------------------------------------

P07-2045.pdf
Philipp Koehn et al 2007: Moses: Open Source Toolkit for Statistical Machine Translation.
------------------------------------------------------------------------------------------------
This (+ the tutorials and howtos) is the key article.

Shen_IWSLT_2006.pdf
Wade Shen, Richard Zens, Nicola Bertoldi, Marcello Federico 2006: The JHU Workshop 2006 IWSLT System
------------------------------------------------------------------------------------------------
The article is threefold: it deals with ASR (speech recognition), with MT,
and it introduces the Moses system. Well, there are almost no references to
Moses.

W07-0725.pdf
Holger Schwenk 2007: Building a Statistical Machine Translation System for French using the Europarl
------------------------------------------------------------------------------------------------
The article reports on using Moses for a French<>English MT system. Giza++
was used to create a dictionary. The Europarl monolingual corpora (approx.
40M of English and French text) were used to train the language models. A
1.3M corpus of aligned sentences was used to train the statistical
translation model. It shows a BLEU score of 0.3189 for en2fr and 0.33 for
fr2en (compared with 0.55 in Bick 2007).

emnlp2007-factored.pdf
Philipp Koehn and Hieu Hoang 2007: Factored Translation Models.
------------------------------------------------------------------------------------------------

sanchez06b.pdf
Felipe Sánchez-Martínez, ... 2006: Speeding up Target-Language Driven Part-of-Speech Tagger Training for Machine Translation
------------------------------------------------------------------------------------------------
This is an Apertium article, but it deals with statistical methods.


1a.3 Word: Read literature related to GIZA++ and word alignment
===============================================================

The list contains the classical GIZA++ paper. We might also supplement it
with other word alignment publications.

(the list may be changed as we read the papers)

Ahrenberg, L, M. Andersson & M. Merkel (1998).
A simple hybrid aligner for generating lexical Correspondences in Parallel
Texts. In Proceedings of COLING-ACL-98, Montreal, pp 29-35.
---------------------------------------------------------------------------------------------------
This work is part of the project "Parallel corpora in Linköping, Uppsala and
Göteborg" (PLUG). The authors present an algorithm for bilingual word
alignment that extends previous work by treating multi-word candidates on a
par with single words, and by combining some simple assumptions about the
translation process to capture alignments for low-frequency words. Like most
other alignment algorithms it uses co-occurrence statistics as a basis, but
it differs in the assumptions it makes about the translation process. The
algorithm has been implemented in a modular system that allows the user to
experiment with different combinations and variants of these assumptions.
They give performance results from two evaluations.

The paper explains how the WA approach functions, and describes some modules
that can be combined freely:

- a morphological module that groups expressions that are identical
  according to suffix sets for regular paradigms of the SL and TL. This
  strategy makes it possible to link low-frequency source expressions
  belonging to the same suffix paradigm.
- a weight module that distributes weights over the target expressions
  depending on their position relative to the given source expression. The
  weights must be provided by the user in the form of lists of numbers
  (greater than or equal to 0). This way it is possible to specify the
  maximal distance between a source and target expression, measured as their
  relative position in the sentences.
- a phrase module that includes multi-word expressions generated in the
  pre-processing stage as candidate expressions for alignment.
  There are cases where, of two competing candidates, one is a multi-word
  expression and the other is a single word that is part of that multi-word
  expression.
  The algorithm prefers the almost identical target multi-word expression
  over a single-word candidate if it has a t-value over the threshold and is
  one of the top six target candidates. When a multi-word expression is found
  to be an element of a translation pair, the expressions that overlap with
  it, whether multi-word or single-word expressions, are removed from the
  current agenda and not considered until the next iteration.

Ahrenberg, L. et al. (2000) A knowledge-lite approach to word alignment; in
Veronis, J., "Parallel Text Processing: Alignment and use of translation
corpora." Kluwer Academic.
---------------------------------------------------------------------------------------------------

Bojar, O & M. Prokopová (2006). Czech-English Word Alignment. Lecture Notes
in Computer Science 4139/2006, p. 214-224.
---------------------------------------------------------------------------------------------------

Caseli, H., Maria das Graças V. Nunes & Mikel L. Forcada. (2008) From free
shallow monolingual resources to machine translation systems: easing the
task. Mixing Approaches To Machine Translation, MATMT2008. 41-48.
---------------------------------------------------------------------------------------------------
The paper describes a methodology for automatically building both bilingual
dictionaries and shallow-transfer rules. These resources are built by
extracting knowledge from automatically word-aligned (or lexically aligned)
parallel corpora which have been processed with shallow monolingual resources
(morphological analysers and part-of-speech taggers). They use LIHLA and
GIZA++. This approach has an advantage compared with SMT in that it also
generates dictionaries and rules which can be edited by humans. The
dictionary induction process is described in more detail in Caseli and Nunes
2007.
The induction process has these steps:
(1) the compilation of two bilingual dictionaries, one for each translation
    direction (one source–target and another target–source);
(2) the merging of these two dictionaries;
(3) the generalization of morphological attribute values in the bilingual
    entries; and
(4) the treatment of morphosyntactic differences related to entries in which
    the value of the target gender/number attribute has to be determined
    from information that goes beyond the scope of the entry itself.

There are 3 types of alignment blocks:
type 0 = omissions
type 1 = alignments preserving item order in sentence
type 2 = reorderings

Chapter 3.2 describes alignment techniques.

Dan Tufiş, Ana Maria Barbu, Radu Ion. (2004). Extracting Multilingual
Lexicons from Parallel Corpora, Computers and the Humanities, Volume 38,
Issue 2, 163-189
---------------------------------------------------------------------------------------------------

Helgegren, Sofia (2005) Tracing Translation Universals and Translator
Development by Word Aligning a Harry Potter Corpus. Magisteruppsats i
kognitionsvetenskap, Institutionen för datavetenskap, Linköpings
universitet. 83 pages.
---------------------------------------------------------------------------------------------------
A descriptive translation study. A translation corpus was built from roughly
the first 20,000 words of each of the first four Harry Potter books and their
respective translations into Swedish. I*Link was used to align the samples on
a word level and to investigate and analyse the aligned corpus. The purpose
of the study was threefold: to investigate manifestations of translation
universals, to search for evidence of translator development, and to study
the efficiency of different strategies for using the alignment tools. The
results show that all three translation universals were manifested in the
corpus, both on a general pattern level and on a more specific lexical level.
Additionally, a clear pattern of translator development was discovered,
showing that there are differences between the four samples. The tendency is
that the translations become further removed from the original texts, and
this difference occurs homogeneously and sequentially. In the word alignment,
four different ways of using the tools were tested, and one strategy was
found to be more efficient than the others. This strategy uses dynamic
resources from previous alignment sessions as input to I*Trix, an automatic
alignment tool, and the output file is manually post-edited in I*Link.

Hiemstra, D. (1996). Using statistical methods to create a bilingual
dictionary. Master Thesis. University of Twente. 66 pages.
---------------------------------------------------------------------------------------------------
This master's thesis covers a method for compiling a probabilistic bilingual
dictionary (or bilingual lexicon) from a parallel corpus (i.e. large
documents that are each other's translations). Two research questions are
answered in the thesis: In which way can statistical methods applied to
bilingual corpora be used to create the bilingual dictionary? And what can be
said about the performance of the created bilingual dictionary in a
multilingual document retrieval system? To build the dictionary, the author
used a statistical algorithm called the EM algorithm. The EM algorithm was
first used to analyse parallel corpora at IBM in 1990. The thesis develops an
EM algorithm that compiles a bi-directional dictionary.

Jurafsky, D. & J.H. Martin (2008) Speech and Language Processing. An
Introduction to Natural Language Processing, Computational Linguistics, and
Speech Recognition. Second Edition. Prentice Hall, New Jersey.
---------------------------------------------------------------------------------------------------

Kashioka, H. (2005): Word Alignment Viewer for Long Sentences. Department of
Natural Language Processing.
Proceedings of MT Summit X, pp. 427-431.
---------------------------------------------------------------------------------------------------

Merkel, Magnus and Ahrenberg, Lars ((1999) 2000). Evaluation of word
alignment systems. In Proceedings of the Second International Conference on
Language Resources and Evaluation (LREC). 6 pages.
---------------------------------------------------------------------------------------------------
When evaluating WA systems, it is important to decide on the purpose and
usage of such a system. If it is to be adopted for creating full-text
alignments used for bilingual searches (bilingual concordancing) or for
creating bilingual dictionaries, the evaluation must be tailored towards that
particular usage. Secondly, the appropriate segmentation of the source text,
in particular, is fundamental for comparing scores between different systems.
The paper describes several approaches to the evaluation of alignment systems
with regard to the purpose of the system, text segmentation, metrics and
scoring methods, gold standards, error analysis and performance data. It also
gives some warnings, and explains why an extracted dictionary can give a
different result from an existing bilingual dictionary.

Och, F. (1995). "Maximum-Likelihood-Schätzung von Wortkategorien mit
Verfahren der kombinatorischen Optimierung". Studienarbeit im Fach
Informatik.
---------------------------------------------------------------------------------------------------

Och, F. & H. Ney (2000). "Improved Statistical Alignment Models". In
Proceedings of the 38th Annual Meeting of the Association for Computational
Linguistics.
---------------------------------------------------------------------------------------------------

Och, F. & H. Ney (2003). "A Systematic Comparison of Various Statistical
Alignment Models", Computational Linguistics, volume 29, number 1, pp.
19-51 March 2003
---------------------------------------------------------------------------------------------------
This is the standard reference for GIZA++. It is pretty technical, and must be
supplemented by tutorial text.

Och, F. & H. Ney (2004). "The alignment template approach to statistical
machine translation." Computational Linguistics, volume 30, number 4, pages
417-449, 2004, MIT Press.
---------------------------------------------------------------------------------------------------

Piperidis, S. et al. (2000) From sentences to words and clauses; in Veronis,
J., "Parallel Text Processing: Alignment and use of translation corpora."
Kluwer Academic.
---------------------------------------------------------------------------------------------------

Sánchez-Martínez, F. & M. L. Forcada (2007) Automatic induction of
shallow-transfer rules for open-source machine translation. In Proceedings of
the 11th Conference on Theoretical and Methodological Issues in Machine
Translation (TMI 2007), p. 181-190, September 7-9, 2007, Skövde, Sweden.
---------------------------------------------------------------------------------------------------

Tiedemann, J. (2003). Recycling Translations - Extraction of Lexical Data
from Parallel Corpora and their Application in Natural Language Processing.
Doctoral Thesis. Uppsala University.
---------------------------------------------------------------------------------------------------
The thesis presents two automatic WA systems: Uppsala WA (UWA) and the Clue
Aligner. UWA implements an iterative 'knowledge-poor' word alignment approach
using association measures and alignment heuristics. The Clue Aligner
provides an innovative framework for the combination of statistical and
linguistic resources in aligning single words and multi-word units. Both
aligners have been applied to several corpora. A corpus processing toolbox,
Uplug, has been developed. It includes the implementation of UWA and is
freely available for research purposes.
A new version, Uplug II, includes the Clue Aligner. It can be used via an
experimental web interface (UplugWeb). Lexical data extracted by the word
aligners have been applied to different tasks in computational lexicography
and machine translation. The use of word alignment in monolingual
lexicography has been investigated in two studies. In a third study, the
feasibility of using the extracted data in interactive machine translation
has been demonstrated. Finally, extracted lexical data have been used for
enhancing the lexical components of two machine translation systems.

Tiedemann, J. (2003). Combining Clues for Word Alignment. In Proceedings of
the 10th Conference of the European Chapter of the Association for
Computational Linguistics (EACL). 8 pages.
---------------------------------------------------------------------------------------------------

Trushkina, J (2007). Development of a Multilingual Parallel Corpus and a
Part-of-Speech Tagger for Afrikaans. IFIP International Federation for
Information Processing 228/2007, p. 453-462.
---------------------------------------------------------------------------------------------------

Tufiş, D., Radu Ion, Alexandru Ceauşu & Dan Ştefănescu: Combined word
alignments. Romanian Academy Institute for Artificial Intelligence. 13, “13
Septembrie”, 74311, Bucharest 5, Romania
---------------------------------------------------------------------------------------------------

Wang, X. (2004) Evaluation of Two Word Alignment Systems. Final Thesis.
Department for Computer and Information Science, Linköping.
---------------------------------------------------------------------------------------------------
Wang has evaluated two different systems that generate word alignments on
English-Swedish data: Giza++ and I*Trix. She has evaluated them with
parameters such as corpus size, characteristics of the corpus, the effect of
linguistic knowledge etc.
She has compared the alignment results of each system separately with human
judgement, to see which of them is closer to the gold standard. In addition,
the running time and the costs of the two systems can be compared. Which
system is easier to run, and what corpus size is easiest for the systems to
handle, are also interesting aspects to evaluate.

Her conclusion is that in general Giza++ is better suited to big corpora,
while I*Trix is better for small corpora, especially those with a high
statistical ratio or special resources. Giza++ is better for big corpora
because it is much faster. For example, for the corpus Access XP 5000, the
result from Giza++ is almost the same as the result from I*Trix, but Giza++
is 10 times faster. Big corpora with word classes can lead to better results.
For a small corpus with a high statistical ratio, or a corpus with specific
resources, I*Trix is a better choice.

Giza++ is a complete word alignment system. It follows the statistical
machine translation model in processing the data. Giza++ runs very fast, and
the more training, the better the results. Although it is run from the
command line, it is still easy to learn. The weakness of Giza++ is that
"word classes" is the only optional step, and very few parameters can be
changed by the user. And because Giza++ is implemented in C, it can only be
used on Unix.

I*Trix, on the other hand, is not a complete piece of software for
independent use. It was developed only as an intermediate step in other
programs. Although it has a good interface, setting parameters such as POS,
function... and understanding all the functions is quite complicated. But
I*Trix can handle different corpora with different parameters very carefully.
In particular, I*Trix can include specific resources for every corpus, which
might improve the result.
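
To make the comparison against a gold standard concrete, here is a minimal
sketch of how two aligners could be scored. It is not Wang's scoring
procedure, which these notes do not describe: it assumes that an alignment
is a set of (source word index, target word index) links, and it uses plain
precision, recall and F1. The function name alignment_scores and the toy
data are hypothetical.

```python
from typing import Set, Tuple

# Assumption: one alignment link = (source word index, target word index).
Link = Tuple[int, int]


def alignment_scores(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: a gold standard and the output of two aligners.
gold = {(0, 0), (1, 1), (2, 3), (3, 2)}
system_a = {(0, 0), (1, 1), (2, 2), (3, 2)}  # one wrong link
system_b = {(0, 0), (1, 1)}                  # cautious: few links, all correct

for name, links in (("system A", system_a), ("system B", system_b)):
    p, r, f = alignment_scores(links, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Reporting precision and recall separately, rather than a single number, keeps
visible the difference between an aligner that proposes few but safe links
and one that proposes many links with more errors, which matters when the
purpose is dictionary extraction rather than full-text alignment.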
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&                                                                     &&&&&&&&&&
&&&&&&&&                           Old discussion                            &&&&&&&&&&
&&&&&&&&                                                                     &&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

1a Reading task
"a reading assignment on a selected course topic where findings
 are to be presented both in a short paper (4-5 pages) and orally"
topic with description by tomorrow/next week; finished by October
--------------------------------------------------------

Possible tasks:

--> the Bible as a parallel corpus: analysed + lemma alignment (see Helena
    in Apertium)
    GIZA++ is a word alignment system that we can test
--> Moses on the New Testament, alpha
--> MT alpha??

2 Term paper (longer task)
"Carry out a project and present its result in a term paper"
-------------------------------------------------------------

--> Rule-based MT with working rules
    (=> 1a must be the Bick or Apertium articles)
--> Word alignment -- results: which systems have been used for the same
    purpose, pro/con
--> SMT Moses -- evaluate results

Quote from the course plan

1. Two individual assignments
   a. a reading assignment on a selected course topic where findings are to
      be presented both in a short paper (5 pages) and orally
        A specific approach, such as Example-based MT, Statistical MT,
        Constraint-based MT, ...
        A comparison of two approaches,
        Treatment of some significant problem in different approaches,
        Multi-engine systems,
        Some approach to evaluation,
        Design and evaluation of translation memories,
        Alignment methods,
        Spoken language translation, ...
   b. The other is a practical assignment which could be an implementation
      or an evaluation task.
2. Carry out a project and present its result in a term paper
   (and about this L writes nothing)