1. split data into dep, dis and xml files
_six extract_data.xsl inDir=out_cd-corrected
2. generate data in cqp format
PERL_UNICODE=SAD perl dep2meta-xml.pl
3. filter_dep-info.xsl (dep2dis for fkv ==> only for lang without dep-analysis)
_six filter_pseudo-dep-info.xsl
4. filter_pseudo-words.xsl (inDir no-dep_outDir)
_six filter_pseudo-words.xsl inDir=xml_DATE_CTYPE_LANG
5.1 filter_pseudo-sentences.xsl
_six filter_pseudo-sentences.xsl inDir=no-pwords_outDir
==> __UNDEF__-check here!
_six check__UNDEF__.xsl
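A quick plain-text cross-check for leftover __UNDEF__ values could look like the sketch below. It is only a grep over the files and makes no claim about what check__UNDEF__.xsl does internally; the function name and the directory in the usage line are assumptions.

```shell
# Hypothetical helper: count the files under a directory that still
# contain the literal string __UNDEF__.
count_undef_files() {
    dir=$1
    # -r: recurse, -l: print only file names, -F: fixed string (no regex)
    grep -rlF '__UNDEF__' "$dir" | wc -l | tr -d ' '
}

# Usage (directory name is an assumption):
#   count_undef_files no-psent_outDir
```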
5.2 _six filter_pseudo-sentences_regex.xsl
6.1 xml2vrt.xsl
_six xml2vrt.xsl inDir=no-psent_outDir
6.2 compile_cwb_format.xsl
_six compile_cwb_format.xsl
- delete newlines stemming from empty texts or adapt corpus encoding parameters?
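If deleting the empty lines turns out to be the way to go, a minimal sketch with sed could be the following. The function name is hypothetical, and whether this is preferable to adapting the encoding parameters is still the open question noted above.

```shell
# Hypothetical helper: print a .vrt file without lines that are empty
# or contain only whitespace.
strip_empty_lines() {
    # [[:space:]] is a POSIX character class, so any POSIX sed will do
    sed '/^[[:space:]]*$/d' "$1"
}

# Usage (file names are placeholders):
#   strip_empty_lines some_file.vrt > some_file.clean.vrt
```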
7. cat
cat corpus4cwb/* > sme_corpus_20140120.vrt
3.try>find corpus4cwb/2014-02-21/bc/sma -type f|xargs -J {} cat {} > 2014-02-21_bc_sma.vrt
3.try>find corpus4cwb/2014-02-21/bc/sme -type f|xargs -J {} cat {} > 2014-02-21_bc_sme.vrt
3.try>find corpus4cwb/2014-02-21/bc/smj -type f|xargs -J {} cat {} > 2014-02-21_bc_smj.vrt
3.try>find corpus4cwb/2014-02-21/fc/sma -type f|xargs -J {} cat {} > 2014-02-21_fc_sma.vrt
3.try>find corpus4cwb/2014-02-21/fc/sme -type f|xargs -J {} cat {} > 2014-02-21_fc_sme.vrt
3.try>find corpus4cwb/2014-02-21/fc/smj -type f|xargs -J {} cat {} > 2014-02-21_fc_smj.vrt
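The six find commands above follow one pattern, so they could be folded into a loop. A sketch, assuming the corpus4cwb/DATE/CTYPE/LANG layout seen above; the function name and the output-directory argument are made up for illustration. It uses find -exec instead of xargs -J, since -J is specific to BSD xargs.

```shell
# Sketch: build one .vrt file per corpus type and language.
# $1: base directory, e.g. corpus4cwb/2014-02-21
# $2: directory that receives the concatenated .vrt files
concat_vrt() {
    base=$1
    out=$2
    date=$(basename "$base")
    for ctype in bc fc; do
        for lang in sma sme smj; do
            src=$base/$ctype/$lang
            [ -d "$src" ] || continue   # skip combinations that do not exist
            # -exec cat {} + batches the file names like xargs would
            find "$src" -type f -exec cat {} + > "$out/${date}_${ctype}_${lang}.vrt"
        done
    done
}

# Usage:
#   concat_vrt corpus4cwb/2014-02-21 .
```

The order in which files are concatenated is whatever find returns, as in the original commands.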
8. add root node
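Adding the root node can be scripted instead of done by hand. A minimal sketch; the element name corpus, its id attribute and the function name are assumptions, since the notes do not say what the root node looks like.

```shell
# Sketch: wrap an existing .vrt file in a single root element.
# The element name "corpus" and the id attribute are assumptions.
add_root_node() {
    infile=$1
    id=$2
    printf '<corpus id="%s">\n' "$id"
    cat "$infile"
    printf '</corpus>\n'
}

# Usage (file names are placeholders):
#   add_root_node 2014-02-21_bc_sme.vrt sme_bc > sme_bc_rooted.vrt
```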
9. CDATA correction
%s/&gt;/>/g
%s/&lt;/</g
smj_corpus_20140318.vrt
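The same correction could be run non-interactively with sed. This sketch assumes the editor substitutions are meant to turn the entities &gt; and &lt; back into literal characters; that direction is an assumption, and the function name is hypothetical.

```shell
# Sketch: replace &gt; and &lt; with literal > and <.
# Assumes this is the intended direction of the correction.
unescape_angle_entities() {
    sed -e 's/&gt;/>/g' -e 's/&lt;/</g' "$1"
}

# Usage (file names are placeholders):
#   unescape_angle_entities in.vrt > out.vrt
```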
test: backend
http://gtweb.uit.no/cgi-bin/korp/korp.cgi?command=query&cqp=[word=%22go%22]&corpus=SME_CORPUS_20131218&start=0&end=0&defaultcontext=1%20sentence&indent=2
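To repeat the backend test for other corpora, the URL above can be assembled by a small function. A sketch: the function name is made up, only corpus and word are parameterised, and the word is not URL-encoded, so it only suits plain ASCII test words.

```shell
# Sketch: build the Korp backend test URL for a corpus and a word form.
korp_query_url() {
    corpus=$1
    word=$2
    base='http://gtweb.uit.no/cgi-bin/korp/korp.cgi'
    # %% prints a literal percent sign, so %%22 becomes %22 (a double quote)
    printf '%s?command=query&cqp=[word=%%22%s%%22]&corpus=%s&start=0&end=0&defaultcontext=1%%20sentence&indent=2\n' \
        "$base" "$word" "$corpus"
}

# Usage (needs network access; -g stops curl from treating [ ] as a range):
#   curl -sg "$(korp_query_url SME_CORPUS_20131218 go)"
```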
Delete intermediate dirs before compiling the grep-corpus:
cd data4korp/
ls
cd out_cd-corrected/
ls
cd input/
ls
d
d
mv out_cd-corrected/input/2014-02-22 .
ls
rmdir out_cd-corrected/input/
rmdir out_cd-corrected/
ls
ls
cd 2014-02-22/bc/
ls
mv a/sm* .
rmdir a
d
cd fc/
mv a/sm* .
rmdir a
ls
d
d
corpuscle_q_test_korp_data_20141117>g -v 'g '