Literature:
____________

Ahrenberg, L., M. Andersson & M. Merkel (1998). A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts. In Proceedings of COLING-ACL-98, Montreal, pp. 29-35.
---------------------------------------------------------------------------------------------------
This work is part of the project "Parallel corpora in Linköping, Uppsala and Göteborg" (PLUG). They present an algorithm for bilingual word alignment that extends previous work by treating multi-word candidates on a par with single words, and by combining some simple assumptions about the translation process to capture alignments for low-frequency words. Like most other alignment algorithms it uses co-occurrence statistics as a basis, but it differs in the assumptions it makes about the translation process. The algorithm has been implemented in a modular system that allows the user to experiment with different combinations and variants of these assumptions. They give performance results from two evaluations.

The paper explains how the WA approach works and describes some modules that can be combined freely:
- a morphological module that groups expressions that are identical according to suffix sets for regular paradigms of the SL and TL. This strategy makes it possible to link low-frequency source expressions belonging to the same suffix paradigm.
- a weight module that distributes weights over the target expressions depending on their position relative to the given source expression. The weights must be provided by the user as lists of numbers (greater than or equal to 0). This makes it possible to specify the maximal distance between a source and a target expression, measured as their relative position in the sentences. (A small sketch of this idea follows after this entry.)
- a phrase module that includes the multi-word expressions generated in the pre-processing stage (and stored in the phrase module) as candidate expressions for alignment. Scores are also computed for candidate pairs where one side is a multi-word expression and the other is a single word that is part of that multi-word expression. An almost identical target multi-word expression is preferred over a single-word candidate if it has a t-value above the threshold and is one of the top six target candidates. When a multi-word expression is found to be an element of a translation pair, the expressions that overlap with it, whether multi-word or single-word expressions, are removed from the current agenda and not considered until the next iteration.
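To make the weight module concrete, here is a minimal Python sketch of how a user-supplied weight list indexed by relative position might scale a co-occurrence score. The function names and the exact scoring scheme are my own assumptions for illustration, not the PLUG implementation.

```python
# Hypothetical illustration of a positional weight module (not the PLUG code).
# weights[d] is the user-supplied weight for a target word whose position differs
# from the source word's position by d; positions outside the list get weight 0,
# which effectively caps the allowed distance between source and target expression.

def positional_weight(src_pos: int, tgt_pos: int, weights: list[float]) -> float:
    d = abs(tgt_pos - src_pos)
    return weights[d] if d < len(weights) else 0.0

def weighted_scores(src_word, src_pos, tgt_sentence, cooc_score, weights):
    """Scale a plain co-occurrence score by the positional weight."""
    scored = []
    for tgt_pos, tgt_word in enumerate(tgt_sentence):
        w = positional_weight(src_pos, tgt_pos, weights)
        if w > 0.0:
            scored.append((tgt_word, w * cooc_score(src_word, tgt_word)))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Example: only targets within 3 positions of the source are considered,
# with closer positions weighted higher.
print(weighted_scores("hund", 2, ["the", "dog", "barked"],
                      cooc_score=lambda s, t: 0.5,
                      weights=[1.0, 0.8, 0.5, 0.2]))
```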
Ahrenberg, L. et al. (2000). A knowledge-lite approach to word alignment. In Veronis, J. (ed.), Parallel Text Processing: Alignment and Use of Translation Corpora. Kluwer Academic.
---------------------------------------------------------------------------------------------------
Bojar, O. & M. Prokopová (2006). Czech-English Word Alignment. Lecture Notes in Computer Science 4139/2006, pp. 214-224.
---------------------------------------------------------------------------------------------------
Half a thousand sentences were manually annotated by two annotators in parallel, and the most frequent reasons for disagreement are described. They evaluate the accuracy of the GIZA++ alignment toolkit on the data and find that lemmatization of the Czech part can cut the alignment error rate roughly in half. Furthermore, they document that about 38% of the tokens that were difficult for GIZA++ were already difficult for the human annotators.

Two independent manual word alignments of 515 sentences were made, distinguishing among cases where individual words match (SURE alignment), cases where whole phrases correspond but not the words by themselves (PHRASAL alignment), and cases where the connection is possible though doubtful (POSSIBLE alignment). The inter-annotator mismatch was 9% when the type of connection was not taken into account. The alignment types that were problematic:
- articles (Czech does not have them)
- words that change their POS during the translation process, for instance English idiomatic expressions
- verbs and the words that accompany them. Verb tenses, and the use of auxiliaries to express them, differ between Czech and English. This concerns especially the verb "be", the preposition "to" and pronouns (because of pro-drop in Czech), as well as the Czech reflexive pronoun, which has no real equivalent in English.
- punctuation and the $ sign, which are used differently in Czech and English
- other tokens: prepositions, years accompanied by additional words in Czech, conjunctions

They used GIZA++, which is capable of guessing 1-n alignments (several target words assigned to one source word). GIZA++ is run twice to obtain alignments in both directions. There are two common ways to obtain a joined alignment: the two directions are combined using either intersection or union. Intersection alignments in general have higher precision and lower recall than union alignments.

Ways of improving the accuracy of GIZA++ (AER is a combination of precision and recall; a sketch of AER and of the intersection/union combination follows at the end of this entry):
- Czech is morphologically rich. The baseline accuracy level (AER of 27%) is achieved using input text that has only been tokenized. As documented in Table 4, lemmatization of the input text reduces the Czech vocabulary size to a half, so that the vocabulary sizes of Czech and English become comparable. The effect of lemmatizing the English side is not as great. The alignment task is thus greatly simplified and AER drops to about 15%. This matches observations on a Serbian-English machine translation task.
- Another large saving in vocabulary size can be achieved by replacing all words occurring only once (singletons) with a special symbol representing their part of speech. This gives some improvement in alignment quality.
- Using a common symbol for all numbers (provided that there is an equal number of numbers in the Czech sentence and the corresponding English sentence). This technique again brings a small improvement in AER.

They tried to tackle the problem with articles by removing them completely before running GIZA++. The AER from this experiment was unfortunately worse than without circumventing articles (evaluated both against gold annotations with removed articles and against full gold annotations with articles aligned to the governing Czech noun using an independent rule). (Popović et al., 2005) report that removing articles helped in English-to-Serbian machine translation on a corpus of limited size; the positive effect vanished with a corpus of about 2,500 sentences.
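To make the bidirectional combination and the AER figures above concrete, here is a minimal Python sketch, assuming alignments are represented as sets of (source index, target index) pairs. The function names are mine; the AER formula follows the standard definition from Och & Ney (2003).

```python
# Minimal sketch: symmetrizing two directed alignments and scoring with AER.
# Alignments are sets of (source_index, target_index) pairs.

def symmetrize(src2tgt: set, tgt2src: set, method: str = "intersection") -> set:
    """Combine the two GIZA++ directions; tgt2src is also given as (src, tgt) pairs."""
    if method == "intersection":   # higher precision, lower recall
        return src2tgt & tgt2src
    if method == "union":          # lower precision, higher recall
        return src2tgt | tgt2src
    raise ValueError(method)

def aer(predicted: set, sure: set, possible: set) -> float:
    """Alignment Error Rate (Och & Ney 2003); sure links count as possible links too."""
    p = possible | sure
    return 1.0 - (len(predicted & sure) + len(predicted & p)) / (len(predicted) + len(sure))

# Tiny usage example with made-up links:
forward  = {(0, 0), (1, 2), (2, 1)}
backward = {(0, 0), (1, 2), (3, 3)}
pred = symmetrize(forward, backward, "intersection")        # {(0, 0), (1, 2)}
print(aer(pred, sure={(0, 0), (1, 2)}, possible={(2, 1)}))  # 0.0
```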
Caseli, H. M. & M. G. V. Nunes (2007). Automatic induction of bilingual lexicons for machine translation. International Journal of Translation, 19:29-43. 15 pages.
---------------------------------------------------------------------------------------------------
The lexicons, with bilingual word and multiword entries enriched by morphological and translation-direction information, are built by extracting knowledge from PoS-tagged and lexically aligned parallel corpora.

Preliminary experiments were carried out on Brazilian Portuguese, Spanish and English parallel texts. A manual analysis showed that 85% of the pt-es and 89% of the pt-en entries are plausible correspondences. These results were obtained taking into consideration only the classes of entries which achieved the best results. Target sentences were generated using all induced entries and compared with target sentences generated by commercial systems. This comparison emphasizes the relevance of translation lexicons in machine translation, mainly for Portuguese-Spanish. Part of the ReTraTos project.

The sets of translation examples were PoS-tagged using two tools available in Apertium: a morphological analyzer and a PoS tagger. The morphological analyzer provides one or more analyses (lemma, lexical category and morphological inflection information) for each surface form, based on a monolingual morphological dictionary. After PoS-tagging, the translation examples were lexically aligned using two different tools. The pt-es examples were word-aligned using LIHLA with 94.25% precision and 94.97% recall; the pt-en examples were word-aligned using GIZA++ with 90.47% precision and 92.34% recall. These results were obtained by comparing the automatic alignments of a small set of sentences (about 500) with manually produced (reference) alignments as described in (Caseli 2007).

First step: the method looks for all possible translations in the target (source) sentence for each source (target) word (its lemma, PoS tags and attributes), in each translation example. This search is performed based on the lexical alignments. If more than one word is found on one or both sides, the PoS information of these words is joined by the character "+", forming a multiword unit. At the end of this step, the method stores all possible translations for each source (target) word or multiword unit and their occurrence frequencies.

Second step: the ambiguity is resolved by merging the two translation lexicons built in the first step. The lexicons are merged by: (1) choosing the translation with the highest occurrence frequency; (2) setting the valid translation direction (source-target or target-source), if necessary; and (3) applying a frequency threshold to constrain the creation of multiword unit entries. An entry involving more than one word on one or both sides is created only if it occurs at least n times (n = 50 in the experiments presented in this paper). This constraint reduces the effect of wrong multiword unit alignments since, for this alignment category, the error rate is fairly high (11% in the pt-es and 16% in the pt-en parallel corpora) (Caseli 2007). (A sketch of this merging step follows after the step descriptions.)

Third step: generalize the attribute values in bilingual entries with the same translation direction by merging the different values, for example the values of the number attribute (pl and sg).

Fourth step: deal with entries whose values for the gender or number attribute cannot be determined from the information in the entry. This happens when the same word is valid for both values of a gender or number attribute in one language but renders two different translations in the other language, one for each attribute value. In this step, for each word, the system looks for an entry which has the general value for either gender (mf) or number (sp) on one side while, on the other side, there is the merged value for either gender (f|m) or number (pl|sg). If such an entry is found, the system replaces it with three entries according to the translation directions: one for each attribute value and another replacing the merged value with the value of gender (GD) or number (ND) to be determined.
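Here is a minimal Python sketch of the second-step merging as I read it: keep the most frequent translation per entry and suppress multiword entries below the frequency threshold. The data structures and names are assumptions for illustration, not the ReTraTos code, and the translation-direction handling of point (2) is left out.

```python
# Hypothetical sketch of the second-step lexicon merging (not the ReTraTos code).
# candidates: dict mapping a source item to {target item: occurrence frequency};
# multiword units are marked here simply by containing "+" in the PoS-joined key.

def merge_lexicon(candidates: dict, multiword_threshold: int = 50) -> dict:
    lexicon = {}
    for source, translations in candidates.items():
        target, freq = max(translations.items(), key=lambda kv: kv[1])
        multiword = "+" in source or "+" in target
        # (3) multiword unit entries must occur at least n times (n = 50 in the paper)
        if multiword and freq < multiword_threshold:
            continue
        lexicon[source] = target   # (1) keep the most frequent translation
    return lexicon

# Usage with toy counts: the rare multiword entry is filtered out.
cands = {"casa<n>": {"house<n>": 120, "home<n>": 4},
         "de<pr>+acordo<n>": {"according<adv>+to<pr>": 7}}
print(merge_lexicon(cands))
```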
The evaluation results point to one problematic class for single-word entries and two for multiword unit entries. Many word entries classified as "new translation direction" are plausible correspondences in the context of lexical alignment, but implausible in the context of a translation lexicon. The main results were that 85% of the pt-es and 89% of the pt-en entries were plausible correspondences, taking into consideration only the classes of entries which achieved the best results. Another evaluation was carried out on the target sentences generated using the induced lexicons. This evaluation emphasized the relevance of the translation lexicons' size in machine translation, mainly for Portuguese-Spanish.

Helgegren, Sofia (2005). Tracing Translation Universals and Translator Development by Word Aligning a Harry Potter Corpus. Master's thesis in cognitive science, Institutionen för datavetenskap, Linköpings universitet. 83 pages.
---------------------------------------------------------------------------------------------------
A descriptive translation study. A translation corpus was built from roughly the first 20,000 words of each of the first four Harry Potter books and their respective translations into Swedish. I*Link was used to align the samples on the word level and to investigate and analyse the aligned corpus. The purpose of the study was threefold: to investigate manifestations of translation universals, to search for evidence of translator development, and to study the efficiency of different strategies for using the alignment tools. The results show that all three translation universals were manifested in the corpus, both on a general pattern level and on a more specific lexical level. Additionally, a clear pattern of translator development was discovered, showing that there are differences between the four samples. The tendency is that the translations become further removed from the original texts, and this difference occurs homogeneously and sequentially. In the word alignment, four different ways of using the tools were tested, and one strategy was found to be more efficient than the others. This strategy uses dynamic resources from previous alignment sessions as input to I*Trix, an automatic alignment tool, and the output file is manually post-edited in I*Link.

Hiemstra, D. (1996). Using statistical methods to create a bilingual dictionary. Master's thesis. University of Twente. 66 pages.
---------------------------------------------------------------------------------------------------
This master's thesis covers a method to compile a probabilistic bilingual dictionary (or bilingual lexicon) from a parallel corpus (i.e. large documents that are each other's translations). Two research questions are answered: In which way can statistical methods applied to bilingual corpora be used to create the bilingual dictionary? And what can be said about the performance of the created bilingual dictionary in a multilingual document retrieval system? To build the dictionary, they used the EM algorithm, which was first used to analyse parallel corpora at IBM in 1990. In this thesis an EM algorithm is developed that compiles a bi-directional dictionary.
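As a reminder of what such an EM procedure looks like, here is a short Python sketch of EM for lexical translation probabilities in the style of IBM Model 1. This is a generic textbook version, not Hiemstra's bi-directional algorithm, and the variable names are mine.

```python
# Generic EM for lexical translation probabilities (IBM Model 1 style).
# corpus: list of (source_sentence, target_sentence) pairs, each a list of words.
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    t = defaultdict(lambda: 1e-3)   # near-uniform initialization of t(f|e)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for e_sent, f_sent in corpus:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)   # E-step: expected counts
                for e in e_sent:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():                  # M-step: re-estimate t(f|e)
            t[(f, e)] = c / total[e]
    return t

corpus = [(["the", "house"], ["das", "haus"]),
          (["the", "book"], ["das", "buch"])]
t = train_ibm1(corpus)
print(round(t[("das", "the")], 2))   # should be high after a few iterations
```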
Jurafsky, D. & J. H. Martin (2008). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second Edition. Prentice Hall, New Jersey.
---------------------------------------------------------------------------------------------------
Read the last chapter, about MT. There is very little here about WA; Tiedemann (2003) is better as a basic book.

Kashioka, H. (2005). Word Alignment Viewer for Long Sentences. Department of Natural Language Processing. Proceedings of MT Summit X, pp. 427-431.
---------------------------------------------------------------------------------------------------
Long sentences are especially difficult for word alignment because the sentences can become very complicated, and each source (target) word has more candidate target (source) words it could correspond to. This paper introduces an alignment viewer that a developer can use to correct alignment information. They discuss using the viewer on a patent parallel corpus, because sentences in patents are often long and complicated. They use the output of GIZA++.

Merkel, Magnus & Ahrenberg, Lars ((1999) 2000). Evaluation of word alignment systems. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC). 6 pages.
---------------------------------------------------------------------------------------------------
When evaluating WA systems it is important to decide the purpose and usage of such a system. If it is to be adopted for creating full-text alignments used for bilingual searches (bilingual concordancing) or for creating bilingual dictionaries, the evaluation must be tailored towards that particular usage. Secondly, the appropriate segmentation of the source text, in particular, is fundamental for comparing scores between different systems. The paper describes several approaches to the evaluation of alignment systems with regard to the purpose of the system, text segmentation, metrics and scoring methods, gold standards, error analysis and performance data. It also contains some warnings and explanations of why an extracted dictionary can give different results than an existing bilingual dictionary.

Och, F. (1995). Maximum-Likelihood-Schätzung von Wortkategorien mit Verfahren der kombinatorischen Optimierung [Maximum-likelihood estimation of word categories with combinatorial optimization methods]. Studienarbeit im Fach Informatik.
---------------------------------------------------------------------------------------------------
Och, F. & H. Ney (2000). Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
---------------------------------------------------------------------------------------------------
Och, F. & H. Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, volume 29, number 1, pp. 19-51, March 2003.
---------------------------------------------------------------------------------------------------
This is the standard reference for GIZA++. It is pretty technical and must be supplemented by tutorial texts. They present and compare various methods for computing word alignments using statistical or heuristic models. They consider the five alignment models presented in Brown, Della Pietra, Della Pietra, and Mercer (1993), the hidden Markov alignment model, smoothing techniques, and refinements. These statistical models are compared with two heuristic models based on the Dice coefficient. They present different methods for combining word alignments to perform a symmetrization of directed statistical alignment models. An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.
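For reference, the Dice coefficient behind those heuristic baselines can be computed from sentence-level co-occurrence counts. Below is a minimal Python sketch of the standard definition; the counting helper and corpus layout are my own.

```python
# Dice coefficient as a word-association heuristic:
# dice(e, f) = 2 * C(e, f) / (C(e) + C(f)),
# where C(e, f) counts sentence pairs in which e and f co-occur.
from collections import Counter

def dice_scores(corpus):
    """corpus: list of (source_sentence, target_sentence) pairs (lists of words)."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in corpus:
        for e in set(e_sent):
            c_e[e] += 1
        for f in set(f_sent):
            c_f[f] += 1
        for e in set(e_sent):
            for f in set(f_sent):
                c_ef[(e, f)] += 1
    return {(e, f): 2 * n / (c_e[e] + c_f[f]) for (e, f), n in c_ef.items()}

scores = dice_scores([(["the", "house"], ["das", "haus"]),
                      (["the", "book"], ["das", "buch"])])
print(scores[("the", "das")])    # 1.0: the two words always co-occur
print(scores[("house", "das")])  # 0.67: co-occur once, but "das" occurs twice
```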
Och, F. & H. Ney (2004). The alignment template approach to statistical machine translation. Computational Linguistics, volume 30, number 4, pp. 417-449, MIT Press.
---------------------------------------------------------------------------------------------------
Piperidis, S. et al. (2000). From sentences to words and clauses. In Veronis, J. (ed.), Parallel Text Processing: Alignment and Use of Translation Corpora. Kluwer Academic.
---------------------------------------------------------------------------------------------------
Tiedemann, J. (2003). Recycling Translations - Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing. Doctoral thesis. Uppsala University.
---------------------------------------------------------------------------------------------------
This one is good as a basic book, with basic MT, WA, SA, evaluation and definitions explained in chapter 2. There is a very nice illustration on p. 7 with an overview of the compilation and use of parallel corpora within NLP. Two automatic WA systems are presented: the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for combining statistical and linguistic resources when aligning single words and multi-word units. Both aligners have been applied to several corpora. A corpus processing toolbox, Uplug, has been developed; it includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner and can be used via an experimental web interface (UplugWeb). Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies. In a third study, the feasibility of using the extracted data in interactive machine translation has been demonstrated. Finally, extracted lexical data have been used for enhancing the lexical components of two machine translation systems.

Tiedemann, J. (2003). Combining Clues for Word Alignment. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL). 8 pages.
---------------------------------------------------------------------------------------------------
A word alignment approach is presented which is based on a combination of clues. Word alignment clues indicate associations between words and phrases. They can be based on features such as frequency, part-of-speech, phrase type, and the actual wordform strings. Clues can be found by calculating similarity measures or learned from word-aligned data. This clue alignment approach makes it possible to combine association clues that take different kinds of linguistic information into account - POS, phrase and position - with different weights. It allows a dynamic tokenization into token units of varying size. The approach has been applied to an English/Swedish parallel text with promising results. Possible correspondence clues can be, e.g.:
- string similarity between SL and TL words
- frequency counts
- co-occurrence
- POS
- syntactic analysis
- word pairs from a dictionary, which can be alignment clues for the corresponding word pairs

The clues can refer to sets of words which overlap with other sets of words to which another clue refers; therefore Tiedemann defines a clue so that it indicates an association between all its member token pairs. This makes it possible to combine alignment clues by distributing the clue indication from complex structures to single word pairs. In this way, dynamic tokenization can be used for both source and target language sentences, and combined association scores (the total clue value) can be calculated for each pair of single tokens. Sentence pairs can be represented in a two-dimensional matrix with one source language word per row and one target language word per column. The cells inside the matrix are filled with the combined clue values for the corresponding word pairs - a clue matrix (see the sketch at the end of this entry). It is important for this purpose to allow multiple links from each word (source and target) to corresponding words in the other language, in order to obtain phrasal links. A word-to-word link overlaps with another one if both of them refer to either the same source or the same target language word; sets of overlapping links form link clusters.

He defines a word alignment clue as a probability which indicates an association between two lexical items in parallel texts - generally defined as a weighted association. A lexical item is a set of words with associated features attached to it (word position may be one such feature). Clues can be static or dynamic, and they can be declarative (pre-defined) or estimated from training data.

Tiedemann describes three main difficulties in WA:
- One of the main difficulties in all alignment strategies is the identification of appropriate units in the source and the target language to be aligned. Splitting source and target language texts into appropriate units for alignment (henceforth: tokenization) is often not possible without considering the translation relations. In other words, initial tokenization borders may change when the translation relations are investigated. Previous approaches use either iterative procedures to re-estimate alignment parameters or preprocessing steps for the identification of token N-grams.
- The second problem of traditional word alignment approaches is that parameter estimation is usually based on plain text items only. Linguistic data, which could be used to identify associations between lexical items, are often ignored. Linguistic tools such as part-of-speech taggers, (shallow) parsers and named-entity recognizers are becoming more robust and available for more languages, and linguistic information including contextual features could be used to improve alignment strategies.
- The third problem, alignment recovery, is a search problem. A search strategy becomes very complex if we allow dynamic tokenization borders (overlapping N-grams, inclusions), which leads not only to a larger number of possible combinations but also to the problem of comparing alignments with a variable number of links.
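Here is a minimal Python sketch of the clue-matrix idea: each clue contributes a score for certain word pairs, and the weighted scores are combined per cell, here with a simple probabilistic disjunction as one plausible choice. The clue functions, weights and combination are assumptions for illustration, not Tiedemann's implementation.

```python
# Hypothetical sketch of building a clue matrix for one sentence pair
# (illustration only, not the Clue Aligner code).
import difflib

def string_sim_clue(src_word, tgt_word):
    """Clue based on string similarity (useful for cognates and names)."""
    return difflib.SequenceMatcher(None, src_word.lower(), tgt_word.lower()).ratio()

def position_clue(i, j, src_len, tgt_len):
    """Clue favouring similar relative positions in the two sentences."""
    return 1.0 - abs(i / src_len - j / tgt_len)

def clue_matrix(src_sent, tgt_sent, weights=(0.7, 0.3)):
    """Combine weighted clues per cell with a probabilistic disjunction."""
    matrix = [[0.0] * len(tgt_sent) for _ in src_sent]
    for i, s in enumerate(src_sent):
        for j, t in enumerate(tgt_sent):
            clues = (weights[0] * string_sim_clue(s, t),
                     weights[1] * position_clue(i, j, len(src_sent), len(tgt_sent)))
            combined = 1.0
            for c in clues:
                combined *= (1.0 - c)
            matrix[i][j] = 1.0 - combined
    return matrix

m = clue_matrix(["Uppsala", "universitet"], ["Uppsala", "University"])
print(round(m[0][0], 2))   # high score: identical strings and same position
```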
Trushkina, J. (2007). Development of a Multilingual Parallel Corpus and a Part-of-Speech Tagger for Afrikaans. IFIP International Federation for Information Processing 228/2007, pp. 453-462.
---------------------------------------------------------------------------------------------------
This paper describes the design and creation of a multilingual parallel corpus for South African languages (South Africa has 11 official languages) and presents the induction of a part-of-speech tagger for Afrikaans. They use three languages in the work: English, Afrikaans and Dutch, because Dutch is the closest relative of Afrikaans. The Bible has been chosen as the basis for the multilingual corpus - 820,000-840,000 tokens for the different languages. SA is done with the Vanilla aligner, and the result is corrected manually. WA is done with GIZA++, using only the "reliable" alignments. Semi-automatic heuristics have been implemented to increase the number of reliable alignments, such as a transitivity heuristic (comparing the three languages), an interspan heuristic (looking at the words before and after) and a correction heuristic (some systematic errors have been corrected manually). The share of reliable alignments was 52-57%, and 97-98% of these were correct.

The development of the Afrikaans part-of-speech tagger is based on a modified version of the method for induction of linguistic tools from parallel corpora originally proposed by Yarowsky and Ngai (2001). The original model provides a high-quality annotation of a resource-poor language given a bilingual parallel corpus aligned on the word level, with annotation of one language part of the corpus. The method is based on the observation that linguistic analyses of translations of the same sentence in different languages often coincide. They follow the main principles of the described model: first, the part-of-speech tags are projected from the English data onto the Afrikaans tokens, and then an n-gram language model is trained on the POS tag projections (see the sketch at the end of this entry). They modify the original model in several ways:
- Projection is done only on reliable alignments, with safe alignments identified by the heuristics.
- A trigram model is used instead of the originally proposed bigram model.
- The Afrikaans language model uses the full Penn Treebank set of 46 POS tags, unlike the originally described model, which employs reduced tagsets of 14 and 9 core tags.
- No aggressive re-estimation of lexical probabilities in line with the original experiments is performed.

The Trigram'n'Tags (TnT) tagger, an HMM trigram tagger, has been used in the tagging experiments. However, the corpus is only partially annotated, since unreliable tag projections are not included. Second, a small part of the corpus is assigned multiple tags. These multiple tags are a result of one-to-many projections, such as projections produced when a single Afrikaans token is aligned with an English phrase. The evaluation demonstrated an accuracy of 84%, but with some adaptations due to differences in language structure (using a single tag for punctuation, reducing the number of different tags for verbs, and collapsing the tags for "to" and "in") they got the accuracy up to 92%. The project on the development of the corpus continues. Further development includes expansion of the corpus to other South African languages, deeper annotation of the Afrikaans part of the corpus, and alignment and linguistic analysis of the isiXhosa and isiZulu parts of the corpus.
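A minimal Python sketch of the tag-projection step described above: copy the English POS tag across each reliable alignment link onto the aligned Afrikaans token, leaving unaligned tokens unannotated and collecting multiple tags for one-to-many links. Names and data layout are mine, for illustration only.

```python
# Hypothetical sketch of POS tag projection across reliable word alignments.
# english_tags: list of (word, tag); afrikaans: list of words;
# links: set of (en_index, af_index) pairs restricted to "reliable" alignments.

def project_tags(english_tags, afrikaans, links):
    projected = [set() for _ in afrikaans]
    for en_i, af_i in links:
        projected[af_i].add(english_tags[en_i][1])
    # one-to-many projections yield multiple tags; unaligned tokens stay untagged
    return [(word, sorted(tags) if tags else None)
            for word, tags in zip(afrikaans, projected)]

en = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]
af = ["die", "hond", "blaf"]
print(project_tags(en, af, {(0, 0), (1, 1), (2, 2)}))
# [('die', ['DT']), ('hond', ['NN']), ('blaf', ['VBZ'])]
```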
Tufiş, D., Ana Maria Barbu & Radu Ion (2004). Extracting Multilingual Lexicons from Parallel Corpora. Computers and the Humanities, Volume 38, Issue 2, pp. 163-189.
---------------------------------------------------------------------------------------------------
Tufiş, D., Radu Ion, Alexandru Ceauşu & Dan Ştefănescu: Combined word alignments. Romanian Academy Institute for Artificial Intelligence, 13, "13 Septembrie", 74311, Bucharest 5, Romania.
---------------------------------------------------------------------------------------------------
They describe a word alignment system that combines two different methods of bitext correspondence identification - using GIZA++ and TREQ, and the aligners TREQ-AL and MEBA. They show that by combining the two aligners the results are significantly improved compared to each individual aligner. They evaluate the individual aligners both without a startup bilingual lexicon and with an initial mid-sized bilingual lexicon, and find that while the performance of TREQ-AL increases a little bit, MEBA does better without an additional lexicon.

Wang, X. (2004). Evaluation of Two Word Alignment Systems. Final thesis. Department of Computer and Information Science, Linköping University.
---------------------------------------------------------------------------------------------------
Wang has evaluated two systems that generate word alignments on English-Swedish data - GIZA++ and I*Trix - with respect to parameters such as corpus size, characteristics of the corpus, the effect of linguistic knowledge, etc. She compared the alignment results of each system against a human-produced gold standard to see which of them comes closer to it. In addition, the running time and costs of the two systems are compared, as well as which system is easier to run and what corpus size each system handles best.

Her conclusion is that, in general, GIZA++ is better suited to big corpora, while I*Trix is better for small corpora - especially those with a high statistical ratio or special resources. GIZA++ is better for big corpora because it is much faster: for example, for the Access XP 5000 corpus, the result from GIZA++ is almost the same as the result from I*Trix, but GIZA++ is 10 times faster. Big corpora with word classes can lead to better results. For small corpora, corpora with a high statistical ratio, or corpora with specific resources, I*Trix is the better choice.

GIZA++ is a complete word alignment system. It follows the statistical machine translation models to process the data. GIZA++ runs very fast, and the more training data, the better the results. Although it is run from the command line, it is still easy to learn. Its weaknesses are that only "word classes" is an optional step, very few parameters can be changed by the user, and because GIZA++ is implemented in C++ it can only be used under Unix. I*Trix, on the other hand, is not a complete piece of software for independent use; it was developed as an intermediate step in other programs. Although it has a good interface, setting parameters like POS, function, etc. and understanding all the functions is quite complicated. But I*Trix can handle different corpora with different parameters very carefully. In particular, I*Trix can include specific resources for every corpus, which may improve the result.