!!!A collection of examples

This is a short collection of examples serving as a starting point for how to use [XMLSH|http://www.xmlsh.org/]. XMLSH is a shell-friendly interface to XML files, and allows fast and easy access to structured data, as long as you know your XPath! :D

!!Count the number of sme words in parallel files

First run the parallel corpus info XSL script using Saxon (Saxon must be on your CLASSPATH - the saxonXSL alias assumes that it is found in {{~/lib/saxon9.jar}}):

{{{
$ saxonXSL -it main $GTHOME/gt/script/corpus/parallel_corpus_info.xsl lang1=nob lang2=sme inDir=$GTFREE/converted
}}}

Then start xmlsh and extract some statistics from the XML files produced above:

{{{
$ xmlsh
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/nob2sme_parallel-corpus_summary.xml
2307
xmlsh$ xquery 'count(//file[@parallelity="true"])' < corpus_report/sme2nob_parallel-corpus_summary.xml
2288
}}}

Then on to some slightly more advanced XQuery: get all {{file}} elements for which a parallel file has been found (as above), extract the path to that file, and print it. We do this with both of the report files, and deduplicate with {{sort -u}} later:

{{{
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/t_loc/text()' \
    < corpus_report/nob2sme_parallel-corpus_summary.xml > sme-files.txt
xmlsh$ xquery 'for $i in //file[@parallelity="true"] return $i/location/h_loc/text()' \
    < corpus_report/sme2nob_parallel-corpus_summary.xml >> sme-files.txt
xmlsh$ exit
}}}

Finally, some traditional processing to extract and count the words. The most conservative (and probably most reliable) method is to count whitespace-separated words with {{wc -w}}; running the text through {{preprocess}} first gives token counts instead:

{{{
$ sort -u sme-files.txt > sme-files.sorted.txt
$ cat sme-files.sorted.txt | xargs ccat -l sme | wc -w
  849855
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -l
  964529
$ cat sme-files.sorted.txt | xargs ccat -l sme | preprocess | wc -w
  977348
}}}
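As a rough sanity check on the counts above, a similar number can often be recovered without xmlsh at all, using plain {{grep}} on a report file. This is only a crude stand-in for the XQuery count: it assumes each {{parallelity}} attribute occurs at most once per line in the report, which the XQuery approach does not require:

```shell
# Count file entries marked parallel in a report, without xmlsh.
# CAVEAT: grep -c counts matching LINES, so this only agrees with the
# XQuery count(...) above when each attribute sits on its own line.
grep -c 'parallelity="true"' corpus_report/nob2sme_parallel-corpus_summary.xml
```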
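The plumbing of the final counting step can be checked on toy data before running it over the whole corpus. A minimal sketch, with two throwaway files standing in for converted corpus files and plain {{cat}} standing in for {{ccat -l sme}}; note how {{sort -u}} keeps the duplicated entry from being counted twice:

```shell
# Two toy "corpus" files (hypothetical paths and content).
mkdir -p /tmp/smedemo
printf 'Buorre beaivi\n' > /tmp/smedemo/a.txt
printf 'mo manná odne\n' > /tmp/smedemo/b.txt

# A file list with a duplicate, as the two reports above may overlap.
printf '%s\n' /tmp/smedemo/a.txt /tmp/smedemo/b.txt /tmp/smedemo/a.txt \
    > /tmp/smedemo/files.txt

# Deduplicate, then count words across all listed files
# (cat stands in for ccat -l sme here).
sort -u /tmp/smedemo/files.txt | xargs cat | wc -w    # → 5, not 7
```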