[ Japanese | English | Back to top ]
Last update: "2005/09/29 04:03:21"



Julius - open source multi-purpose LVCSR engine


julius [-C jconffile] [options ...]


Julius is a high-performance, multi-purpose, free speech recognition engine for researchers and developers. It is capable of performing almost real-time recognition of con- tinuous speech with over 60k-word vocabulary on most cur- rent PCs. Julius needs an N-gram language model, word dictionary and an acoustic model to execute a recognition. Standard model formats (i.e. ARPA and HTK) with any word/phone units and sizes are supported, so users can build a recog- nition system for various target using their own language model and acoustic models. For details about basic models and their availability, please see the documents contained in this package. Julius can perform recognition on audio files, live micro- phone input, network input and feature parameter files. The maximum size of vocabulary is 65,535 words.


Julius supports the following models. Acoustic Models Sub-word HMM (Hidden Markov Model) in HTK ascii format are supported. Phoneme models (mono- phone), context dependent phoneme models (tri- phone), tied-mixture and phonetic tied-mixture models of any unit can be used. When using con- text dependent models, interword context is also handled. You can further use a tool mkbinhmm to convert the ascii HMM definition file to binary format, for speeding up the startup (this format is incompatible with that of HTK). Language model 2-gram and reverse 3-gram language models are used. The Standard ARPA format is supported. In addition, a binary format N-gram is also sup- ported for efficiency. The tool mkbingram. can convert binary N-gram from the ARPA language models.


Both live speech input and recorded speech file input are supported. Live input stream from microphone device, DatLink (NetAudio) device and tcpip network input using adintool is supported. Speech waveform files (16bit WAV (no compression), RAW format, and many other formats will be acceptable if compiled with libsndfile library). Fea- ture parameter files in HTK format are also supported. Note that Julius itself can only extract MFCC_E_D_N_Z fea- tures from speech data. If you use an acoustic HMM trained by other feature type, only the HTK parameter file of the same feature type can be used.


Recognition algorithm of Julius is based on a two-pass strategy. Word 2-gram and reverse word 3-gram is used on the respective passes. The entire input is processed on the first pass, and again the final searching process is performed again for the input, using the result of the first pass to narrow the search space. Specifically, the recognition algorithm is based on a tree-trellis heuristic search combined with left-to-right frame-synchronous beam search and right-to-left stack decoding search. When using context dependent phones (triphones), interword contexts are taken into consideration. For tied-mixture and phonetic tied-mixture models, high-speed acoustic likelihood calculation is possible using gaussian pruning. For more details, see the related document or web page below.


The options below specify the models, system behaviors and various search parameters. These option can be set all at once at the command line, but it is recommended that you write them in a text file as a "jconf file", and specify the file with "-C" option. Speech Input -input {rawfile|mfcfile|mic|adinnet|netaudio|stdin} Select speech data input source. 'rawfile' is waveform file, and specified after startup from stdin). 'mic' means microphone device, and 'adin- net' means receiving waveform data via tcpip net- work from an adinnet client. 'netaudio' is from DatLink/NetAudio input, and 'stdin' means data input from standard input. WAV (no compression) and RAW (noheader, 16bit, BigEndian) are supported for waveform file input. Other format can be supported using external library. To see what format is actually supported, see the help message using option "-help". For stdin input, only WAV and RAW is supported. (default: mfcfile) -filelist file (With -input rawfile|mfcfile) perform recognition on all files listed in the file. -adport portnum (With -input adinnet) adinnet port number (default: 5530) -NA server:unit (With -input netaudio) set the server name and unit ID of the Datlink unit. -zmean -nozmean With speech input, this options enable/disable whether to remove DC offset using zero mean source. (default: disabled (-nozmean)) -nostrip Julius by default removes zero samples in input speech data. In some cases, such invalid data may be recorded at the start or end of recording. This option inhibit this automatic removal. -record directory Auto-save input speech data successively under the directory. Each segmented inputs are recorded to a file each by one. The file name of the recorded data is generated from system time when the input starts, in a style of "YYYY.MMDD.HHMMSS.wav". File format is 16bit monoral WAV. Invalid for mfcfile input. With input rejection by "-rejectshort", the rejected input will also be recorded even if they are rejected. -rejectshort msec Reject input shorter than specified milliseconds. Search will be terminated and no result will be output. In module mode, '<REJECTED REASON="..."/>' message will be sent to client. With "-record", the rejected input will also be recorded even if they are rejected. (default: 0 = off) Speech Detection Options in this section is invalid for mfcfile input. -cutsilence -nocutsilence Force silence cutting (=speech segment detection) to ON/OFF. (default: ON for mic/adinnet, OFF for files) -lv threslevel Level threshold (0 - 32767) for speech triggering. If audio input amplitude goes over this threshold for a period, Julius begin the 1st pass recogni- tion. If the level goes below this level after triggering, it is the end of the speech segment. (default: 2000) -zc zerocrossnum Zero crossing threshold per a second (default: 60) -headmargin msec Margin at the start of speech segment in millisec- onds. (default: 300) -tailmargin msec Margin at the end of speech segment in millisec- onds. (default: 400) Acoustic Analysis -smpFreq frequency Set sampling frequency of input speech in Hz. Sam- pling rate can also be specified using "-smpPe- riod". Be careful that this frequency should be the same as the trained conditions of acoustic model you use. This should be specified for micro- phone input and RAW file input when using other than default rate. Also see "-fsize", "-fshift", "-delwin". (default: 16000 (Hz) = 625ns). -smpPeriod period Set sampling frequency of input speech by its sam- pling period (nanoseconds). The sampling rate can also be specified using "-smpFreq". Be careful that the input frequency should be the same as the trained conditions of acoustic model you use. This should be specified for microphone input and RAW file input when using other than default rate. Also see "-fsize", "-fshift", "-delwin". (default: 625 (ns) = 16000Hz). -fsize sample Analysis window size in number of samples. (default: 400). -fshift sample Frame shift in number of samples (default: 160). -delwin frame Delta window size in number of samples (default: 2). -lofreq frequency Enable band-limiting for MFCC filterbank computa- tion: set lower frequency cut-off. Also see "-hifreq". (default: -1 = disabled) -hifreq frequency Enable band-limiting for MFCC filterbank computa- tion: set upper frequency cut-off. Also see "-lofreq". (default: -1 = disabled) -sscalc Perform spectral subtraction using head part of each file. With this option, Julius assume there are certain length of silence at each input file. Valid only for rawfile input. Conflict with "-ssload". -sscalclen With "-sscalc", specify the length of head part silence in milliseconds (default: 300) -ssload filename Perform spectral subtraction for speech input using pre-estimated noise spectrum from file. The noise spectrum data should be computed beforehand by mkss. Valid for all speech input. Conflict with "-sscalc". -ssalpha value Alpha coefficient of spectral subtraction for "-sscals" and "-ssload". Noise will be subtracted stronger as this value gets larger, but distortion of the resulting signal also becomes remarkable. (default: 2.0) -ssfloor value Flooring coefficient of spectral subtraction. The spectral parameters that go under zero after sub- traction will be substituted by the source signal with this coefficient multiplied. (default: 0.5) GMM-based Input Verification and Rejection -gmm filename GMM definition file in HTK format. If specified, GMM-based input verification will be performed con- currently with the 1st pass, and you can reject the input according to the result as specified by "-gmmreject". Note that the GMM should be defined as one-state HMMs, and their training parameter should be the same as the acoustic model you want to use with. -gmmnum N Number of Gaussian components to be computed per frame on GMM calculation. Only the N-best Gaus- sians will be computed for rapid calculation. The default is 10 and specifying smaller value will speed up GMM calculation, but too small value (1 or 2) may cause degradation of identification perfor- mance. -gmmreject string Comma-separated list of GMM names to be rejected as invalid input. When recognition, the log likeli- hoods of GMMs accumulated for the entire input will be computed concurrently with the 1st pass. If the GMM name of the maximum score is within this string, the 2nd pass will not be executed and the input will be rejected. Language Model (word N-gram) -nlr 2gram_filename 2-gram language model file in standard ARPA format. -nrl rev_3gram_filename Reverse 3-gram language model file. This is required for the second search pass. If this is not defined then only the first pass will take place. -d bingram_filename Use binary format language model instead of ARPA formats. The 2-gram and 3-gram model can be com- bined and converted to this binary format using mkbingram. Julius can read this format much faster than ARPA format. -lmp lm_weight lm_penalty -lmp2 lm_weight2 lm_penalty2 Language model score weights and word insertion penalties for the first and second passes respec- tively. The hypothesis language scores are scaled as shown below: lm_score1 = lm_weight * 2-gram_score + lm_penalty lm_score2 = lm_weight2 * 3-gram_score + lm_penalty2 The defaults are dependent on acoustic model: First-Pass | Second-Pass -------------------------- 5.0 -1.0 | 6.0 0.0 (monophone) 8.0 -2.0 | 8.0 -2.0 (triphone,PTM) 9.0 8.0 | 11.0 -2.0 (triphone,PTM, setup=v2.1) -transp float Additional insertion penalty for transparent words. (default: 0.0) Word Dictionary -v dictionary_file Word dictionary file (required). -silhead {WORD|WORD[OUTSYM]|#num} -siltail {WORD|WORD[OUTSYM]|#num} Sentence start and end silence word as defined in the dictionary. (default: "<s>" / "</s>") Julius deal these words as fixed start-word and end-word of recognition. They can be defined in several formats as shown below. Example Word_name <s> Word_name[output_symbol] <s>[silB] #Word_ID #14 (Word_ID is the word position in the dictionary file starting from 0) -forcedict Ignore dictionary errors and force running. Words with errors will be dropped from dictionary at startup. Acoustic Model (HMM) -h hmmfilename HMM definition file to use. Format (ascii/binary) will be automatically detected. (required) -hlist HMMlistfilename HMMList file to use. Required when using triphone based HMMs. This file provides a mapping between the logical triphones names genertated from phone sequence in the dictionary and the HMM definition names. -iwcd1 {best N|max|avg} When using a triphone model, select method to han- dle inter-word triphone context on the first and last phone of a word in the first pass. best N: use average likelihood of N-best scores from the same context triphones (default, N=3) max: use maximum likelihood of the same context triphones avg: use average likelihood of the same context triphones -force_ccd / -no_ccd Normally Julius determines whether the specified acoustic model is a context-dependent model from the model names, i.e., whether the model names con- tain character '+' and '-'. You can explicitly specify by these options to avoid mis-detection. These will override the automatic detection result. -notypecheck Disable checking of input parameter type. (default: enabled) Acoustic Computation Gaussian Pruning will be automatically enabled when using tied-mixture based acoutic model. It is disabled by default for non tied-mixture models, but you can activate pruning to those models by explicitly specifying "-gprune". Gaussian Selection needs a monophone model converted by mkgshmm. -gprune {safe|heuristic|beam|none} Set the Gaussian pruning technique to use. (default: 'safe' (setup=standard), 'beam' (setup=fast) for tied mixture model, 'none' for non tied-mixture model) -tmix K With Gaussian Pruning, specify the number of Gaus- sians to compute per mixture codebook. Small value will speed up computation, but likelihood error will grow larger. (default: 2) -gshmm hmmdefs Specify monophone hmmdefs to use for Gaussian Mix- ture Selectio. Monophone model for GMS is gener- ated from an ordinary monophone HMM model using mkgshmm. This option is disabled by default. (no GMS applied) -gsnum N When using GMS, specify number of monophone state to select from whole monophone states. (default: 24) Inter-word Short Pause Handling -iwspword Add a word entry to the dictionary that should cor- respond to inter-word short pauses that may occur in input speech. This may improve recognition accuracy in some language model that has no inter- word pause modeling. The word entry can be speci- fied by "-iwspentry". -iwspentry Specify the word entry that will be added by "-iwspword". (default: "<UNK> [sp] sp sp") -iwsp (Multi-path version only) Enable inter-word con- text-free short pause handling. This option appends a skippable short pause model for every word end. The added model will be skipped on inter-word context handling. The HMM model to be appended can be specified by "-spmodel" option. -spmodel Specify short-pause model name that will be used in "-iwsp". (default: "sp") Short-pause Segmentation The short pause segmentation can be used for sucessive decoding of a long utterance. Enabled when compiled with '--enable-sp-segment'. -spdur Set the short-pause duration threshold in number of frames. If a short-pause word has the maximum likelihood in successive frames longer than this value, then interrupt the first pass and start the second pass. (default: 10) Search Parameters (First Pass) -b beamwidth Beam width (number of HMM nodes) on the first pass. This value defines search width on the 1st pass, and has great effect on the total processing time. Smaller width will speed up the decoding, but too small value will result in a substantial increase of recognition errors due to search failure. Larger value will make the search stable and will lead to failure-free search, but processing time and memory usage will grow in proportion to the width. Default value is acoustic model dependent: 400 (monophone) 800 (triphone,PTM) 1000 (triphone,PTM, setup=v2.1) -sepnum N Number of high frequency words to be separated from the lexicon tree. (default: 150) -1pass Only perform the first pass search. This mode is automatically set when no 3-gram language model has been specified (-nlr). -realtime -norealtime Explicitly specify whether real-time (pipeline) processing will be done in the first pass or not. For file input, the default is OFF (-norealtime), for microphone, adinnet and NetAudio input, the default is ON (-realtime). This option relates to the way CMN is performed: when OFF CMN is calcu- lated for each input independently, when the real- time option is ON the previous 5 second of input is always used. Also refer to "-progout". -cmnsave filename Save last CMN parameters computed while recognition to the specified file. The parameters will be saved to the file in each time a input is recog- nized, so the output file always keeps the last CMN parameters. If output file already exist, it will be overridden. -cmnload filename Load initial CMN parameters previously saved in a file by "-cmnsave". This option enables Julius to recognize the first utterance of a live microphone input or adinnet input with CMN. Search Parameters (Second Pass) -b2 hyponum Beam width (number of hypothesis) in second pass. If the count of word expantion at a certain length of hypothesis reaches this limit while search, shorter hypotheses are not expanded further. This prevents search to fall in breadth-first-like sta- tus stacking on the same position, and improve search failure. (default: 30) -n candidatenum The search continues till 'candidate_num' sentence hypotheses have been found. The obtained sentence hypotheses are sorted by score, and final result is displayed in the order (see also the "-output" option). The possibility that the optimum hypothesis is cor- rectly found increases as this value gets increased, but the processing time also becomes longer. Default value depends on the engine setup on com- pilation time: 10 (standard) 1 (fast, v2.1) -output N The top N sentence hypothesis will be Output at the end of search. Use with "-n" option. (default: 1) -cmalpha float This parameter decides smoothing effect of word confidence measure. (default: 0.05) -sb score Score envelope width for enveloped scoring. When calculating hypothesis score for each generated hypothesis, its trellis expansion and viterbi oper- ation will be pruned in the middle of the speech if score on a frame goes under [current maximum score of the frame- width]. Giving small value makes the second pass faster, but computation error may occur. (default: 80.0) -s stack_size The maximum number of hypothesis that can be stored on the stack during the search. A larger value may give more stable results, but increases the amount of memory required. (default: 500) -m overflow_pop_times Number of expanded hypotheses required to discon- tinue the search. If the number of expanded hypotheses is greater then this threshold then, the search is discontinued at that point. The larger this value is, The longer Julius gets to give up search (default: 2000) -lookuprange nframe When performing word expansion on the second pass, this option sets the number of frames before and after to look up next word hypotheses in the word trellis. This prevents the omission of short words, but with a large value, the number of expanded hypotheses increases and system becomes slow. (default: 5) -graphrange nframe When graph output is enabled (--enable-graphout), merge same words at neighbor position. If the position of same words differs smaller than this value, they will be merged. The default is 0 (no merging) and specifying larger value will result in smaller graph output. Forced Alignment -walign Do viterbi alignment per word units from the recog- nition result. The word boundary frames and the average acoustic scores per frame are calculated. -palign Do viterbi alignment per phoneme (model) units from the recognition result. The phoneme boundary frames and the average acoustic scores per frame are calculated. -salign Do viterbi alignment per HMM state from the recog- nition result. The state boundary frames and the average acoustic scores per frame are calculated. Server Module Mode -module [port] Run Julius on "Server Module Mode". After startup, Julius waits for tcp/ip connection from client. Once connection is established, Julius start commu- nication with the client to process incoming com- mands from the client, or to output recognition results, input trigger information and other system status to the client. The multi-grammar mode is only supported at this Server Module Mode. The default port number is 10500. jcontrol is sample client contained in this package. -outcode [W][L][P][S][C][w][l][p][s] (Only for Server Module Mode) Switch which symbols of recognized words to be sent to client. Specify 'W' for output symbol, 'L' for N-gram entry, 'P' for phoneme sequence, 'S' for score, and 'C' for confidence score, respectively. Capital letters are for the second pass (final result), and small letters are for results of the first pass. For example, if you want to send only the output sym- bols and phone sequences as a recognition result to a client, specify "-outcode WP". Message Output -separatescore Output the language and acoustic scores separately. -quiet Omit phoneme sequence and score, only output the best word sequence hypothesis. -progout Enable progressive output of the partial results on the first pass. -proginterval msec set the output time interval of "-progout" in mil- liseconds. -demo Equivalent to "-progout -quiet" -charconv from to Enable output character set conversion. "from" is the source character set used in the language model, and "to" is the target character set you want to get. On Linux, the arguments should be a code name. You can obtain the list of available code names by invoking the command "iconv --list". On Windows, the arguments should be a code name or codepage number. Code name should be one of "ansi", "mac", "oem", "utf-7", "utf-8", "sjis", "euc". Or you can specify any codepage number supported at your envi- ronment. OTHERS -debug (For debug) output enoumous internal status and debug information. -C jconffile Load the jconf file. The options written in the file are included and expanded at the point. This option can also be used within other jconf file for recursive expansion. -check wchmm (For debug) turn on interactive check mode of tree lexicon structure at startup. -check triphone (For debug) turn on interactive check mode of model mapping between Acoustic model, HMMList and dictio- nary at startup. -setting Display compile-time engine configuration and exit. -help Display a brief description of all options.


For examples of system usage, refer to the tutorial sec- tion in the Julius documents.


Note about jconf files: relative paths in a jconf file are interpreted as relative to the jconf file itself, not to the current directory.


julian(1), jcontrol(1), adinrec(1), adintool(1), mkbin- gram(1), mkbinhmm(1), mkgsmm(1), wav2mfcc(1), mkss(1) http://julius.sourceforge.jp/en/


Julius normally will return the exit status 0. If an error occurs, Julius exits abnormally with exit status 1. If an input file cannot be found or cannot be loaded for some reason then Julius will skip processing for that file.


There are some restrictions to the type and size of the models Julius can use. For a detailed explanation refer to the Julius documentation. For bug-reports, inquires and comments please contact julius@kuis.kyoto-u.ac.jp or julius@is.aist-nara.ac.jp.


Copyright (c) 1991-2005 Kyoto University, Japan Copyright (c) 1997-2000 Information-technology Promotion Agency, Japan Copyright (c) 2000-2005 Nara Institute of Science and Technology, Japan Copyright (c) 2005 Nagoya Institute of Technology, Japan


Rev.1.0 (1998/02/20) Designed by Tatsuya KAWAHARA and Akinobu LEE (Kyoto University) Development by Akinobu LEE (Kyoto University) Rev.1.1 (1998/04/14) Rev.1.2 (1998/10/31) Rev.2.0 (1999/02/20) Rev.2.1 (1999/04/20) Rev.2.2 (1999/10/04) Rev.3.0 (2000/02/14) Rev.3.1 (2000/05/11) Development of above versions by Akinobu LEE (Kyoto University) Rev.3.2 (2001/08/15) Rev.3.3 (2002/09/11) Rev.3.4 (2003/10/01) Rev.3.4.1 (2004/02/25) Rev.3.4.2 (2004/04/30) Development of above versions by Akinobu LEE (Nara Institute of Science and Technology) Rev.3.5 (2005/09/30) Development of above versions by Akinobu LEE (Nagoya Institute of Technology)


From rev.3.2, Julius is released by the "Information Pro- cessing Society, Continuous Speech Consortium". The Windows DLL version was developed and released by Hideki BANNO (Nagoya University). The Windows Microsoft Speech API compatible version was developed by Takashi SUMIYOSHI (Kyoto University). LOCAL JULIUS(1)

$Id: julius.html.en,v 2007/01/10 08:01:57 kudravka_ Exp $