Last update: "2005/09/29 04:03:21"



Julian - grammar based continuous speech recognition parser


julian [-C jconffile] [options ...]


Julian is a high-performance, multi-purpose, free speech recognition parser based on finite state grammar. It can perform real-time recognition of continuous speech with a vocabulary of several thousand words. Julian is derived from Julius, and almost all components are the same except for the language model part. To run recognition, it needs an acoustic model and a finite state grammar that describes the sentence patterns to be recognized. The grammar format is an original one, and tools to create a recognition grammar are included in the distribution. For the acoustic model, the standard HTK format is supported, with any word/phone units and sizes. Users can therefore build a recognition system customized for a specific task using their own task grammar and acoustic models. For details about models and how to write a grammar, please see the documents contained in this package. Julian can perform recognition on audio files, live microphone input, network input and feature parameter files. The maximum vocabulary size is 65,535 words.


Julian supports the following models.

Acoustic Models
    Same as Julius: sub-word HMMs (Hidden Markov Models) in HTK format are supported. Phoneme models (monophone), context-dependent phoneme models (triphone), and tied-mixture and phonetic tied-mixture models of any unit can be used. When using context-dependent models, inter-word context is also handled. You can also use the tool mkbinhmm to convert an ASCII HMM definition file to a binary format that speeds up startup (this format is incompatible with that of HTK).

Language model
    The grammar format is an original one, and tools to create a recognition grammar are included in the distribution. A grammar consists of two files: a 'grammar' file that describes sentence structures in a BNF style, using word 'categories' as terminal symbols, and a 'voca' file that defines the words of each category together with their pronunciations (i.e. phoneme sequences). They should be converted by mkdfa.pl(1) to a deterministic finite automaton file (.dfa) and a dictionary file (.dict), respectively.


Same as Julius: both live speech input and recorded speech file input are supported. Live input from a microphone device, from a DatLink (NetAudio) device, and over a tcpip network using adintool is supported. Speech waveform files are accepted in 16bit WAV (no compression) and RAW format; many other formats are accepted if Julian is compiled with the libsndfile library. Feature parameter files in HTK format are also supported. Note that Julian itself can only extract MFCC_E_D_N_Z features from speech data. If you use an acoustic HMM trained with another feature type, only HTK parameter files of that same feature type can be used.
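
As a rough illustration of the waveform constraints above, the following Python sketch checks whether a WAV file is uncompressed 16bit audio before it is handed to Julian. The function name check_wav_for_julian and the mono expectation are assumptions made for this example; 16000 Hz is used because it is the documented default sampling rate. It relies only on Python's standard wave module, which itself reads only uncompressed PCM files.

```python
import wave


def check_wav_for_julian(path, expected_rate=16000):
    """Return a list of problems found in a WAV file (empty list = looks usable).

    Hypothetical helper: the checks mirror the "16bit WAV (no compression)"
    requirement; the mono check is an assumption of this sketch.
    """
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getsampwidth() != 2:
                problems.append("sample width is %d byte(s), expected 2 (16bit)"
                                % wav.getsampwidth())
            if wav.getnchannels() != 1:
                problems.append("%d channels, expected 1 (mono)"
                                % wav.getnchannels())
            if wav.getframerate() != expected_rate:
                problems.append("sampling rate is %d Hz, expected %d Hz"
                                % (wav.getframerate(), expected_rate))
    except (wave.Error, EOFError) as err:
        # The wave module rejects compressed or malformed files.
        problems.append("not an uncompressed PCM WAV file: %s" % err)
    return problems
```

An empty list means the file matches the assumed conditions; otherwise each entry names one mismatch, for example a sampling rate that differs from the one the acoustic model was trained with.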


The recognition algorithm of Julian is based on a two-pass strategy. In the first pass, a high-speed approximate search is performed using weaker constraints than the given grammar: an LR beam search that uses only the inter-category constraints extracted from the grammar. The second pass re-searches the input, using the original grammar rules and the intermediate results from the first pass, to obtain a high-precision result quickly. In the second pass the optimal solution is theoretically guaranteed by the A* search. When using context-dependent phones (triphones), inter-word contexts are taken into consideration. For tied-mixture and phonetic tied-mixture models, high-speed acoustic likelihood calculation is possible using Gaussian pruning. For more details, see the related documents or the web page below.
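
The guarantee in the second pass rests on a general property of A* search: if the estimate of the remaining score is never worse than what can actually be achieved (as is the case when the estimate comes from a search with weaker constraints), the first complete hypothesis taken from the stack is the best one. The Python sketch below shows that property on a toy word graph. It is not Julian's implementation; the graph, the scores, the word-pair constraint, and the names best_remaining and a_star are all made up for illustration.

```python
import heapq


def best_remaining(graph, goal):
    """Relaxed pass: best score from each node to the goal, ignoring word-pair rules."""
    memo = {goal: 0.0}

    def visit(node):
        if node not in memo:
            scores = [s + visit(nxt) for _, nxt, s in graph.get(node, [])]
            memo[node] = max(scores) if scores else float("-inf")
        return memo[node]

    for node in graph:
        visit(node)
    return memo


def a_star(graph, start, goal, allowed_pairs, h):
    """Constrained pass: best-first search guided by the relaxed estimates h."""
    # Heap entries: (-(g + h), tie_breaker, g, node, words); heapq pops the smallest,
    # so negating makes the most promising hypothesis come out first.
    heap = [(-h[start], 0, 0.0, start, ())]
    counter = 1
    while heap:
        _, _, g, node, words = heapq.heappop(heap)
        if node == goal:
            return list(words), g
        prev = words[-1] if words else None
        for word, nxt, score in graph.get(node, []):
            if prev is not None and (prev, word) not in allowed_pairs:
                continue  # rejected by the full (toy) grammar
            if h[nxt] == float("-inf"):
                continue  # the goal cannot be reached from here
            new_g = g + score
            heapq.heappush(heap, (-(new_g + h[nxt]), counter, new_g, nxt, words + (word,)))
            counter += 1
    return None, float("-inf")


# Toy graph: node -> [(word, next_node, log_score)]
graph = {
    0: [("call", 1, -1.0), ("mail", 1, -2.0)],
    1: [("bob", 2, -1.0), ("home", 2, -3.0)],
}
# The toy grammar does not allow "call bob", although it scores best acoustically.
allowed_pairs = {("call", "home"), ("mail", "bob"), ("mail", "home")}

h = best_remaining(graph, goal=2)
print(a_star(graph, 0, 2, allowed_pairs, h))  # (['mail', 'bob'], -3.0)
```

The relaxed estimate prefers "call bob", but the constrained search discards it and still returns the best hypothesis that the full constraints permit.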


The options below specify the models, system behaviors and various search parameters. These options can all be set on the command line, but it is recommended that you write them in a text file (a "jconf file") and specify that file with the "-C" option.

Most options are the same as in Julius.

Options only in Julian:
    -gram, -gramlist, -dfa, -penalty1, -penalty2, -looktrellis

Options only in Julius:
    -nlr, -nrl, -d, -lmp, -lmp2, -transp, -silhead, -siltail, -spdur, -sepnum, -separatescore

Speech Input

-input {rawfile|mfcfile|mic|adinnet|netaudio|stdin}
    Select the speech data input source. 'rawfile' is a waveform file and 'mfcfile' is a feature parameter file in HTK format (file names are specified after startup from stdin). 'mic' means the microphone device, and 'adinnet' means receiving waveform data over a tcpip network from an adinnet client. 'netaudio' is DatLink/NetAudio input, and 'stdin' means data input from standard input.
    WAV (no compression) and RAW (no header, 16bit, BigEndian) are supported for waveform file input. Other formats can be supported using an external library. To see which formats are actually supported, see the help message shown by the "-help" option. For stdin input, only WAV and RAW are supported. (default: mfcfile)

-filelist file
    (With -input rawfile|mfcfile) Perform recognition on all files listed in the file.

-adport portnum
    (With -input adinnet) adinnet port number. (default: 5530)

-NA server:unit
    (With -input netaudio) Set the server name and unit ID of the DatLink unit.

-zmean
-nozmean
    Enable/disable removal of the DC offset from speech input using the zero mean of the source. (default: disabled (-nozmean))

-nostrip
    By default Julian removes zero samples from the input speech data, because in some cases such invalid data is recorded at the start or end of a recording. This option inhibits this automatic removal.

-record directory
    Auto-save input speech data successively under the directory. Each segmented input is recorded to its own file. The file name of the recorded data is generated from the system time when the input starts, in the style "YYYY.MMDD.HHMMSS.wav". The file format is 16bit monaural WAV. Invalid for mfcfile input. When input rejection by "-rejectshort" is enabled, rejected input is also recorded.

-rejectshort msec
    Reject input shorter than the specified number of milliseconds. The search is terminated and no result is output. In module mode, a '<REJECTED REASON="..."/>' message is sent to the client. With "-record", rejected input is also recorded. (default: 0 = off)

Speech Detection

Options in this section are invalid for mfcfile input.

-cutsilence
-nocutsilence
    Force silence cutting (= speech segment detection) ON/OFF. (default: ON for mic/adinnet, OFF for files)

-lv threslevel
    Level threshold (0 - 32767) for speech triggering. If the audio input amplitude goes over this threshold for a period, Julian begins the 1st pass recognition. If the level goes below this threshold after triggering, that is the end of the speech segment. (default: 2000)

-zc zerocrossnum
    Zero crossing threshold per second. (default: 60)

-headmargin msec
    Margin at the start of the speech segment in milliseconds. (default: 300)

-tailmargin msec
    Margin at the end of the speech segment in milliseconds. (default: 400)

Acoustic Analysis

-smpFreq frequency
    Set the sampling frequency of input speech in Hz. The sampling rate can also be specified using "-smpPeriod". Note that this frequency should be the same as the training condition of the acoustic model you use. It should be specified for microphone input and RAW file input when using a rate other than the default. Also see "-fsize", "-fshift", "-delwin". (default: 16000 Hz, which corresponds to a period value of 625)

-smpPeriod period
    Set the sampling frequency of input speech by its sampling period, in units of 100 nanoseconds. The sampling rate can also be specified using "-smpFreq". Note that the input frequency should be the same as the training condition of the acoustic model you use. It should be specified for microphone input and RAW file input when using a rate other than the default. Also see "-fsize", "-fshift", "-delwin". (default: 625, i.e. 625 x 100 ns = 16000 Hz)

-fsize sample
    Analysis window size in number of samples. (default: 400)

-fshift sample
    Frame shift in number of samples. (default: 160)

-delwin frame
    Delta window size in number of frames. (default: 2)

-lofreq frequency
    Enable band-limiting for MFCC filterbank computation: set the lower frequency cut-off. (default: -1 = disabled)

-hifreq frequency
    Enable band-limiting for MFCC filterbank computation: set the upper frequency cut-off. (default: -1 = disabled)

-sscalc
    Perform spectral subtraction using the head part of each file. With this option, Julian assumes that there is a certain length of silence at the beginning of each input file. Valid only for rawfile input. Conflicts with "-ssload".

-sscalclen
    With "-sscalc", specify the length of the head silence in milliseconds. (default: 300)

-ssload filename
    Perform spectral subtraction for speech input using a pre-estimated noise spectrum loaded from a file. The noise spectrum data should be computed beforehand by mkss. Valid for all speech input. Conflicts with "-sscalc".

-ssalpha value
    Alpha coefficient of spectral subtraction. Noise is subtracted more strongly as this value gets larger, but distortion of the resulting signal also becomes more noticeable. (default: 2.0)

-ssfloor value
    Flooring coefficient of spectral subtraction. Spectral parameters that fall below zero after subtraction are replaced by the source signal multiplied by this coefficient. (default: 0.5)

GMM-based Input Verification and Rejection

-gmm filename
    GMM definition file in HTK format. If specified, GMM-based input verification is performed concurrently with the 1st pass, and you can reject the input according to the result as specified by "-gmmreject". Note that the GMMs should be defined as one-state HMMs, and their training parameters should be the same as those of the acoustic model you want to use them with.

-gmmnum N
    Number of Gaussian components to be computed per frame in GMM calculation. Only the N-best Gaussians are computed, for rapid calculation. The default is 10; specifying a smaller value speeds up GMM calculation, but too small a value (1 or 2) may degrade identification performance.

-gmmreject string
    Comma-separated list of GMM names to be rejected as invalid input. During recognition, the log likelihoods of the GMMs accumulated over the entire input are computed concurrently with the 1st pass. If the name of the GMM with the maximum score is in this list, the 2nd pass is not executed and the input is rejected.

Language Model (Finite State Grammar)

The recognition grammar can be specified in three ways: "-gram", "-gramlist", or the combination of "-dfa" and "-v". Multiple grammars can be specified by using "-gram" and "-gramlist". When you use these options several times, all of the grammars are read at startup. Note that this differs from other options, where the last occurrence overrides previous ones. You can use "-nogram" to reset the grammars already specified at that point.

-gram gramprefix1[,gramprefix2[,gramprefix3,...]]
    Comma-separated list of grammars to be used. Each argument should be the prefix of a grammar, i.e. if you have "foo.dfa" and "foo.dict", you can specify them with the single argument "foo".

-gramlist listfile
    Specify a grammar list file that contains the list of grammars to be used. The list file should contain the prefixes of the grammars, one per line. A relative path in the list file is treated as relative to the list file, not to the current path or the configuration file.

-dfa dfa_filename
    Finite state automaton grammar file.

-v dictionary_file
    Word dictionary file. (required)

-nogram
    Remove the current list of grammars already specified by the options above.

-penalty1 float
    Word insertion penalty for the first pass. (default: 0.0)

-penalty2 float
    Word insertion penalty for the second pass. (default: 0.0)

-spmodel {WORD|WORD[OUTSYM]|#num}
    Name of the short pause model as defined in the hmmdefs. In Julian, a word whose pronunciation consists only of this short pause model is called a 'short pause word' and is handled specially in recognition: even if its appearance in a sentence is explicitly specified in the grammar, it can be skipped while parsing. This behavior deals with insertions and deletions of short pauses, which often appear unintentionally in user utterances. It can be specified in one of the styles shown below. (default: "sp")

                                  Example
    Word_name                     <s>
    Word_name[output_symbol]      <s>[silB]
    #Word_ID                      #14

    (Word_ID is the word position in the dictionary file, starting from 0.)

-forcedict
    Ignore dictionary errors and force running. Words with errors are dropped from the dictionary at startup.

Acoustic Model (HMM)

-h hmmfilename
    HMM definition file to use. The format (ascii/binary) is detected automatically. (required)

-hlist HMMlistfilename
    HMMList file to use. Required when using triphone-based HMMs. This file provides a mapping between the logical triphone names generated from the phonetic representation in the dictionary and the HMM definition names.

-iwcd1 {best N|max|avg}
    When using a triphone model, select the method for handling inter-word triphone context on the first and last phone of a word in the first pass.

    best N   use the average likelihood of the N-best scores from the same-context triphones
    max      use the maximum likelihood of the same-context triphones
    avg      use the average likelihood of the same-context triphones (default)

-force_ccd / -no_ccd
    Normally Julian determines whether the specified acoustic model is context-dependent from the model names, i.e. whether the model names contain the characters '+' and '-'. You can specify this explicitly with these options to avoid mis-detection. They override the automatic detection result.

-notypecheck
    Disable checking of the input parameter type. (default: checking enabled)

Acoustic Computation

Gaussian pruning is enabled automatically when using a tied-mixture based acoustic model. It is disabled by default for non tied-mixture models, but you can activate pruning for those models by explicitly specifying "-gprune". Gaussian Selection needs a monophone model converted by mkgshmm.

-gprune {safe|heuristic|beam|none}
    Set the Gaussian pruning technique to use. (default: 'safe' (setup=standard) or 'beam' (setup=fast) for tied-mixture models, 'none' for non tied-mixture models)

-tmix K
    With Gaussian pruning, specify the number of Gaussians to compute per mixture codebook. A small value speeds up computation, but the likelihood error grows larger. (default: 2)

-gshmm hmmdefs
    Specify the monophone hmmdefs to use for Gaussian Mixture Selection (GMS). The monophone model for GMS is generated from an ordinary monophone HMM model using mkgshmm. This option is disabled by default (no GMS applied).

-gsnum N
    When using GMS, specify the number of monophone states to select from all monophone states. (default: 24)

Inter-word Short Pause Handling

-iwsp
    (Multi-path version only) Enable inter-word context-free short pause handling. This option appends a skippable short pause model to every word end. The added model is skipped in inter-word context handling. The HMM model to be appended can be specified by the "-spmodel" option.
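
To make the three "-iwcd1" choices concrete, the Python sketch below applies each of them to a list of log-likelihood scores from triphones that share the same context. It only illustrates the arithmetic as described for that option and is not code from Julian; the function name iwcd1_score and the sample scores are invented.

```python
def iwcd1_score(scores, method="avg", n=3):
    """Combine same-context triphone scores the way each -iwcd1 method is described.

    scores: log likelihoods of the candidate triphones (hypothetical values).
    method: "max", "avg", or "best" (the latter averages the n best scores).
    """
    if not scores:
        raise ValueError("at least one score is required")
    if method == "max":
        return max(scores)
    if method == "avg":
        return sum(scores) / len(scores)
    if method == "best":
        top = sorted(scores, reverse=True)[:n]
        return sum(top) / len(top)
    raise ValueError("unknown method: %r" % method)


scores = [-42.0, -40.0, -47.0, -55.0]
print(iwcd1_score(scores, "max"))        # -40.0
print(iwcd1_score(scores, "avg"))        # -46.0
print(iwcd1_score(scores, "best", n=2))  # -41.0
```

With these numbers "max" is the most optimistic estimate, "avg" the most pessimistic, and "best N" sits between the two depending on N.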

Search Parameters (First Pass)

-b beamwidth
    Beam width (number of HMM nodes) on the first pass. This value defines the search width of the 1st pass and has a great effect on the total processing time. A smaller width speeds up decoding, but too small a value results in a substantial increase in recognition errors due to search failure. A larger value makes the search more stable and can lead to failure-free search, but processing time and memory usage grow in proportion to the width.
    The default value depends on the acoustic model:
        400   (monophone)
        800   (triphone, PTM)
        1000  (triphone, PTM, setup=v2.1)

-1pass
    Only perform the first pass search.

-realtime
-norealtime
    Explicitly specify whether real-time (pipeline) processing is done in the first pass. For file input the default is OFF (-norealtime); for microphone, adinnet and NetAudio input the default is ON (-realtime). This option relates to the way CMN is performed: when OFF, CMN is calculated for each input independently; when ON, the previous 5 seconds of input are always used. Also refer to -progout.

-cmnsave filename
    Save the last CMN parameters computed during recognition to the specified file. The parameters are saved each time an input is recognized, so the output file always holds the latest CMN parameters. If the output file already exists, it is overwritten.

-cmnload filename
    Load initial CMN parameters previously saved to a file by "-cmnsave". This option enables Julian to recognize the first utterance of live microphone input or adinnet input with CMN.

Search Parameters (Second Pass)

-b2 hyponum
    Beam width (number of hypotheses) in the second pass. If the count of word expansions at a certain hypothesis length reaches this limit during the search, shorter hypotheses are not expanded further. This prevents the search from falling into a breadth-first-like state, stacking up at the same position, and reduces search failures. (default: 30)

-n candidatenum
    The search continues until 'candidatenum' sentence hypotheses have been found. The sentence hypotheses obtained are sorted by score, and the final result is displayed in that order (see also the "-output" option). The chance that the optimum hypothesis is found increases as this value gets larger, but the processing time also becomes longer.
    The default value depends on the engine setup at compilation time:
        10  (standard)
        1   (fast, v2.1)

-output N
    The top N sentence hypotheses are output at the end of the search. Use with the "-n" option. (default: 1)

-cmalpha float
    Smoothing coefficient for the word confidence measure. (default: 0.05)

-sb score
    Score envelope width for enveloped scoring. When calculating the score of each generated hypothesis, its trellis expansion and Viterbi operation are pruned in the middle of the speech if the score on a frame goes under [current maximum score of the frame - width]. A small value makes the second pass faster, but computation errors may occur. (default: 80.0)

-s stack_size
    The maximum number of hypotheses that can be stored on the stack during the search. A larger value may give more stable results, but increases the amount of memory required. (default: 500)

-m overflow_pop_times
    Number of expanded hypotheses required to discontinue the search. If the number of expanded hypotheses exceeds this threshold, the search is discontinued at that point. The larger this value is, the longer Julian keeps searching before giving up. (default: 2000)

-lookuprange nframe
    When performing word expansion on the second pass, this option sets the number of frames before and after in which to look up next-word hypotheses in the word trellis. This prevents the omission of short words, but with a large value the number of expanded hypotheses increases and the system becomes slow. (default: 5)

-looktrellis
    Expand only the words that survived the first pass instead of expanding all the words predicted by the grammar. This option makes second pass decoding slightly faster, especially with a large vocabulary, but may increase deletion errors of short words. (default: disabled)

-graphrange nframe
    When graph output is enabled (--enable-graphout), merge identical words at neighboring positions. If the positions of identical words differ by less than this value, they are merged. The default is 0 (no merging); specifying a larger value results in smaller graph output.

Forced Alignment

-walign
    Do Viterbi alignment per word unit from the recognition result. The word boundary frames and the average acoustic scores per frame are calculated.

-palign
    Do Viterbi alignment per phoneme (model) unit from the recognition result. The phoneme boundary frames and the average acoustic scores per frame are calculated.

-salign
    Do Viterbi alignment per HMM state from the recognition result. The state boundary frames and the average acoustic scores per frame are calculated.

Server Module Mode

-module [port]
    Run Julian in "Server Module Mode". After startup, Julian waits for a tcp/ip connection from a client. Once the connection is established, Julian starts communicating with the client to process incoming commands from the client, or to output recognition results, input trigger information and other system status to the client. The multi-grammar mode is supported only in Server Module Mode. The default port number is 10500. jcontrol is a sample client contained in this package.

-outcode [W][L][P][S][C][w][l][p][s]
    (Only for Server Module Mode) Select which symbols of recognized words are sent to the client. Specify 'W' for output symbol, 'L' for grammar entry, 'P' for phoneme sequence, 'S' for score, and 'C' for confidence score. Capital letters are for the second pass (final result), and small letters are for the results of the first pass. For example, to send only the output symbols and phone sequences as the recognition result to a client, specify "-outcode WP".

Message Output

-multigramout
    Enable multiple grammar output. Usually Julian searches for the single best hypothesis among all the grammars. This option changes the search to find the best result for each grammar, one by one.

-quiet
    Omit the phoneme sequence and score; output only the best word sequence hypothesis.

-progout
    Enable progressive output of partial results on the first pass.

-proginterval msec
    Set the output time interval of "-progout" in milliseconds.

-demo
    Equivalent to "-progout -quiet".

-charconv from to
    Enable output character set conversion. "from" is the source character set used in the language model, and "to" is the target character set you want to get.
    On Linux, the arguments should be code names. You can obtain the list of available code names by invoking the command "iconv --list".
    On Windows, the arguments should be a code name or a codepage number. The code name should be one of "ansi", "mac", "oem", "utf-7", "utf-8", "sjis", "euc". Alternatively, you can specify any codepage number supported in your environment.

OTHERS

-debug
    (For debug) Output an enormous amount of internal status and debug information.

-C jconffile
    Load the jconf file. The options written in the file are included and expanded at that point. This option can also be used within another jconf file.

-check wchmm
    (For debug) Turn on interactive check mode of the tree lexicon structure at startup.

-check triphone
    (For debug) Turn on interactive check mode of the model mapping between acoustic model, HMMList and dictionary at startup.

-setting
    Display the compile-time engine configuration and exit.

-help
    Display a brief description of all options.
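
The default values listed under "Acoustic Analysis" are tied together by simple arithmetic, which the Python sketch below works through. It assumes that the "-smpPeriod" value counts units of 100 ns, as in HTK, since that is the reading under which 625 corresponds to 16000 Hz. The helper names are made up for this example.

```python
def freq_to_period(freq_hz):
    """Sampling frequency in Hz -> sampling period in 100 ns units (assumed unit)."""
    return round(1e7 / freq_hz)


def period_to_freq(period_100ns):
    """Sampling period in 100 ns units -> sampling frequency in Hz."""
    return round(1e7 / period_100ns)


def samples_to_ms(samples, freq_hz):
    """Length of a run of samples in milliseconds at the given sampling frequency."""
    return 1000.0 * samples / freq_hz


print(freq_to_period(16000))      # 625   (the -smpPeriod default)
print(period_to_freq(625))        # 16000 (the -smpFreq default)
print(samples_to_ms(400, 16000))  # 25.0  (-fsize default: 25 ms window)
print(samples_to_ms(160, 16000))  # 10.0  (-fshift default: 10 ms shift)
```

When the sampling rate is changed, "-fsize" and "-fshift" keep their meaning in samples, so their duration in milliseconds changes with it unless they are adjusted as well.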


For examples of system usage, refer to the tutorial section in the Julian documents.
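
As a small starting point, the Python sketch below assembles a Julian command line from option names described on this page. The file names "hmmdefs" and "fruit" are placeholders and the helper build_args is hypothetical; only the option names come from this page, and the sketch prints the command instead of running it.

```python
import shlex


def build_args(options):
    """Turn an ordered list of (option, value) pairs into an argument list.

    A value of None marks an option that takes no argument.
    """
    args = []
    for name, value in options:
        args.append(name)
        if value is not None:
            args.append(str(value))
    return args


options = [
    ("-h", "hmmdefs"),      # acoustic model (placeholder file name)
    ("-gram", "fruit"),     # grammar prefix: expects fruit.dfa and fruit.dict
    ("-input", "rawfile"),  # recognize waveform files
    ("-quiet", None),       # output only the best word sequence
]
command = ["julian"] + build_args(options)
print(" ".join(shlex.quote(arg) for arg in command))
# julian -h hmmdefs -gram fruit -input rawfile -quiet
```

The same list could be passed to subprocess.run once the model and grammar files exist; keeping the options in one ordered list also makes it easy to move them into a jconf file later.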


Note about jconf files: relative paths in a jconf file are interpreted as relative to the jconf file itself, not to the current directory.
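
The rule above can be sketched in Python as follows. The function resolve_jconf_path and the example paths are hypothetical and only mimic the described behavior; they are not part of Julian.

```python
import os


def resolve_jconf_path(jconf_file, path_in_jconf):
    """Resolve a path written inside a jconf file the way the note describes."""
    if os.path.isabs(path_in_jconf):
        return path_in_jconf  # absolute paths are used as they are
    base = os.path.dirname(os.path.abspath(jconf_file))
    return os.path.normpath(os.path.join(base, path_in_jconf))


# A model path written as "../model/hmmdefs" inside /opt/julian/conf/sample.jconf
print(resolve_jconf_path("/opt/julian/conf/sample.jconf", "../model/hmmdefs"))
```

On a POSIX system this resolves to /opt/julian/model/hmmdefs no matter which directory Julian is started from, which is why a jconf file and its models can be moved together without editing the paths.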


julius(1), jcontrol(1), adinrec(1), adintool(1), mkdfa(1), mkbinhmm(1), mkgshmm(1), wav2mfcc(1), mkss(1)
http://julius.sourceforge.jp/en/


Julian normally returns exit status 0. If an error occurs, Julian exits abnormally with exit status 1. If an input file cannot be found or cannot be loaded for some reason, Julian skips processing for that file.


There are some restrictions on the type and size of the models Julian can use. For a detailed explanation, refer to the Julius documentation. For bug reports, inquiries and comments, please contact julius@kuis.kyoto-u.ac.jp or julius@is.aist-nara.ac.jp.


Copyright (c) 1991-2005 Kyoto University, Japan
Copyright (c) 2000-2005 Nara Institute of Science and Technology, Japan
Copyright (c) 2005 Nagoya Institute of Technology, Japan


Rev.1.0 (1998/07/20)
    Designed by Tatsuya KAWAHARA and Akinobu LEE (Kyoto University)

Rev.2.0 (1999/02/20)
Rev.2.1 (1999/04/20)
Rev.2.2 (1999/10/04)
Rev.3.1 (2000/05/11)
    Development of above versions by Akinobu LEE (Kyoto University)

Rev.3.2 (2001/08/15)
Rev.3.3 (2002/09/11)
Rev.3.4 (2003/10/01)
Rev.3.4.1 (2004/02/25)
Rev.3.4.2 (2004/04/30)
    Development of above versions by Akinobu LEE (Nara Institute of Science and Technology)

Rev.3.5 (2005/09/30)
    Development of above versions by Akinobu LEE (Nagoya Institute of Technology)


From rev.3.2, Julian has been released to the members of the "Information Processing Society, Continuous Speech Consortium". From rev.3.4, Julian has been an open-source product incorporated into Julius. The Windows Microsoft Speech API compatible version was developed by Takashi SUMIYOSHI (Kyoto University).

$Id: julian.html.en,v 2007/01/10 08:01:57 kudravka_ Exp $