# Steps to update parallel corpora.

This pipeline has been tested for nob2sme and nob2sma.

Some scripts contain (for the moment, see TODO) lines with paths that must be changed, commented, or uncommented:
- align_files.sh
- encode_gt_corpus.sh
- extract_sent_pairs.py

0. Compile XFSTs for both languages

1. Make a copy of the 2-langs folder here
`cp ~/freecorpus/stable/tmx/XXX2YYY .`

2. Remove xml:lang from all tmx files, since it causes an error when encoding with cwb
`cd XXX2YYY`
`find . -type f| xargs perl -i -p -e 's/xml:lang/lang/g;'`

3. Analyse one genre at a time and extract sentence pairs

3.0. Remove output folder (not needed the first time)
`rm -rf out_*`

3.1. Analyse 1-lang, specifying genre
`python3 analyse_xxx_tmx.py YYY XXX2YYY/GENRE GENRE`
The output folder is out_YYY_XXX2YYY

3.2. Analyse 2-lang, specifying output folder
`python3 analyse_xxx_tmx.py XXX out_YYY_XXX2YYY`

3.3. Extract sentence pairs for 1-lang, specifying lang and genre
`python3 extract_sent_pairs.py XXX GENRE`

3.4. Extract sentence pairs for 2-lang, specifying lang and genre
`python3 extract_sent_pairs.py YYY GENRE`

4. Repeat step 3 (with all sub-steps) for all genres

5. Convert to vrt files
`sh run_para_corpus_encoding.sh`

# Example (nob2sme).

1. Make copy
`cp ~/freecorpus/stable/tmx/nob2sme .`

2. Remove xml:lang
`cd nob2sme`
`find . -type f| xargs perl -i -p -e 's/xml:lang/lang/g;'`

3. Analyse

3.0. Remove out folder
`rm -rf out_*`

3.1. Analyse 1
`python3 analyse_xxx_tmx.py sme nob2sme/bible bible`

3.2. Analyse 2
`python3 analyse_xxx_tmx.py nob out_sme_nob2sme`

3.3. Extract 1
`python3 extract_sent_pairs.py nob bible`

3.4. Extract 2
`python3 extract_sent_pairs.py sme bible`

4. Repeat

4.1. Facta
`rm -rf out_*`
`python3 analyse_xxx_tmx.py sme nob2sme/facta facta`
`python3 analyse_xxx_tmx.py nob out_sme_nob2sme`
`python3 extract_sent_pairs.py nob facta`
`python3 extract_sent_pairs.py sme facta`

4.2. Laws
`rm -rf out_*`
`python3 analyse_xxx_tmx.py sme nob2sme/laws laws`
`python3 analyse_xxx_tmx.py nob out_sme_nob2sme`
`python3 extract_sent_pairs.py nob laws`
`python3 extract_sent_pairs.py sme laws`

4.3. Science
`rm -rf out_*`
`python3 analyse_xxx_tmx.py sme nob2sme/science science`
`python3 analyse_xxx_tmx.py nob out_sme_nob2sme`
`python3 extract_sent_pairs.py nob science`
`python3 extract_sent_pairs.py sme science`

4.4. Admin
`rm -rf out_*`
`python3 analyse_xxx_tmx.py sme nob2sme/admin administration`
`python3 analyse_xxx_tmx.py nob out_sme_nob2sme`
`python3 extract_sent_pairs.py nob admin`
`python3 extract_sent_pairs.py sme admin`

4.5. Mixed
`rm -rf out_*`
In the tmx file, replace all HTML entities:
`vim nob2sme/data.nob2sme.20121112.tmx`
`%s/\n