1. Split data into dep, dis and xml files
     _six extract_data.xsl inDir=out_cd-corrected

2. Generate data in cqp format
     PERL_UNICODE=SAD perl dep2meta-xml.pl

3. filter_dep-info.xsl (dep2dis for fkv ==> only for lang without dep-analysis)
     _six filter_pseudo-dep-info.xsl

4. filter_pseudo-words.xsl (inDir no-dep_outDir)
     _six filter_pseudo-words.xsl inDir=xml_DATE_CTYPE_LANG

5.1 filter_pseudo-sentences.xsl
     _six filter_pseudo-sentences.xsl inDir=no-pwords_outDir
     ==> __UNDEF__-check here!
     _six check__UNDEF__.xsl

5.2  _six filter_pseudo-sentences_regex.xsl

6.1 xml2vrt.xsl
     _six xml2vrt.xsl inDir=no-psent_outDir

6.2  _six compile_cwb_format.xsl
     - delete newlines stemming from empty texts or adapt corpus encoding parameters?

7. cat
     cat corpus4cwb/* > sme_corpus_20140120.vrt

     3.try>find corpus4cwb/2014-02-21/bc/sma -type f|xargs -J {} cat {} > 2014-02-21_bc_sma.vrt
     3.try>find corpus4cwb/2014-02-21/bc/sme -type f|xargs -J {} cat {} > 2014-02-21_bc_sme.vrt
     3.try>find corpus4cwb/2014-02-21/bc/smj -type f|xargs -J {} cat {} > 2014-02-21_bc_smj.vrt
     3.try>find corpus4cwb/2014-02-21/fc/sma -type f|xargs -J {} cat {} > 2014-02-21_fc_sma.vrt
     3.try>find corpus4cwb/2014-02-21/fc/sme -type f|xargs -J {} cat {} > 2014-02-21_fc_sme.vrt
     3.try>find corpus4cwb/2014-02-21/fc/smj -type f|xargs -J {} cat {} > 2014-02-21_fc_smj.vrt

8. Add root node

9. CDATA correction
     %s/>/>/g
     %s/</ smj_corpus_20140318.vrt

Test: backend
     http://gtweb.uit.no/cgi-bin/korp/korp.cgi?command=query&cqp=[word=%22go%22]&corpus=SME_CORPUS_20131218&start=0&end=0&defaultcontext=1%20sentence&indent=2

Delete intermediate dirs before compiling the grep-corpus:
     cd data4korp/
     ls
     cd out_cd-corrected/
     ls
     cd input/
     ls
     d
     d
     mv out_cd-corrected/input/2014-02-22 .
     ls
     rmdir out_cd-corrected/input/
     rmdir out_cd-corrected/
     ls
     ls
     cd 2014-02-22/bc/
     ls
     mv a/sm* .
     rmdir a
     d
     cd fc/
     mv a/sm* .
     rmdir a
     ls
     d
     d

     corpuscle_q_test_korp_data_20141117>g -v 'g '
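
The six find/xargs commands in step 7 differ only in corpus type (bc, fc) and language (sma, sme, smj), so they could be folded into one loop. The following is a minimal sketch, not part of the original scripts: the function name concat_vrt is hypothetical, the directory layout corpus4cwb/DATE/CTYPE/LANG is taken from the commands above, and it assumes file names contain no newlines. It uses a plain find piped into a read loop instead of xargs -J, which is a BSD-only option.

```shell
#!/bin/sh
# Sketch of step 7: build one .vrt file per corpus type and language.
# concat_vrt is a hypothetical helper name (assumption, not in the original notes).
# Usage: concat_vrt ROOT DATE    e.g. concat_vrt corpus4cwb 2014-02-21
concat_vrt() {
    root=$1    # top-level directory holding the per-text vrt files
    date=$2    # date-stamped subdirectory below ROOT
    for ctype in bc fc; do
        for lang in sma sme smj; do
            dir="$root/$date/$ctype/$lang"
            # Skip combinations that were not generated in this run.
            [ -d "$dir" ] || continue
            # Sort the file list so the output order is reproducible.
            find "$dir" -type f | sort | while IFS= read -r f; do
                cat "$f"
            done > "${date}_${ctype}_${lang}.vrt"
        done
    done
}
```

Calling concat_vrt corpus4cwb 2014-02-21 from the directory above corpus4cwb would write files such as 2014-02-21_bc_sme.vrt into the current directory, matching the names used in step 7. Steps 8 and 9 (root node, CDATA correction) would still have to be applied to each resulting file.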