Chapter 3. Audio Input

Table of Contents

Audio Format
Number of bits
Number of channels
Sampling Rates
File input
Supported format
Live microphone input
Preparing microphone input
Notes for supported OS / devices
About Input Delay
Network and Socket inputs
original
esd
standard
DATLINK/NetAudio
Feature vector file input
Audio I/O Extension by Plugin

Julius accepts waveform input and extracted feature vector input. Waveform data can be given either as an audio file of recorded speech or as a live audio stream from a capture device. You can also give feature vector input in HTK format.

This chapter describes the specification of audio input in Julius and related tools. For more details about the runtime options related to audio input, see the "Audio input" section of the reference manual.

Audio Format

Number of bits

The input speech must be quantized at 16 bits per sample. 8-bit and 24-bit input are not currently supported.

Number of channels

The recorded data should have exactly one channel. For live recognition via microphone, the device should support 1-channel recording. The exception is the OSS interface on Linux (-input oss): if only 2-channel (stereo) recording is available, Julius records both channels and uses only the left channel data.

Sampling Rates

The sampling rate of the input should be given explicitly. If no option is given, the default sampling rate is 16,000 Hz. Use option -smpFreq to specify the sampling rate in Hz, or -smpPeriod to specify the sampling period in units of 100 ns. Another way is to use the -htkconf option to give Julius the HTK Config file you used for AM training, in which case the value of SOURCERATE in the Config file will be used.
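The relation between the two units is simple arithmetic. The following Python sketch converts between a rate in Hz and a period in 100 ns units; the function names are illustrative and are not part of Julius.

```python
# One second expressed in units of 100 ns.
UNITS_PER_SECOND = 10_000_000


def freq_to_period(freq_hz):
    """Convert a sampling rate in Hz to a sampling period in 100 ns units."""
    return round(UNITS_PER_SECOND / freq_hz)


def period_to_freq(period_100ns):
    """Convert a sampling period in 100 ns units to a sampling rate in Hz."""
    return round(UNITS_PER_SECOND / period_100ns)


print(freq_to_period(16000))  # 625
print(period_to_freq(625))    # 16000
```

So the default rate of 16,000 Hz corresponds to a period of 625 units of 100 ns.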

You should give the correct sampling rate based on the acoustic model you are going to use for recognition. The sampling rate of the input should be the same as the training condition of the acoustic model.

If you are going to use multiple acoustic models with different acoustic conditions, their sampling rates should all be the same. Give the same sampling rate parameters for all the acoustic models and for the GMMs, if you have any. For more details, see the next chapter about feature extraction.

The given sampling rate acts as a requirement on the input. If you use live input such as microphone capture, the given sampling rate is set on the device and capturing begins at that rate. Julius reports an error when the device does not support the sampling rate. On the other hand, if you are recognizing an audio file, the sampling frequency of the input file is checked against the given sampling rate, and the file is rejected if they do not match. [1]

File input

Option -input rawfile tells Julius to read audio input from a file. Give the name of the file to be processed on the standard input of Julius. Multiple files can be processed one by one by listing the file names in a text file and specifying it with -filelist.
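A file list is plain text with one file name per line. As a small illustration, the Python sketch below collects the .wav files in a directory and writes such a list. The function name, the directory, and the list file name are all hypothetical.

```python
from pathlib import Path


def write_filelist(audio_dir, list_path, pattern="*.wav"):
    """Write the matching file names to list_path, one per line.

    Returns the list of names that were written, in sorted order.
    """
    files = sorted(str(p) for p in Path(audio_dir).glob(pattern))
    Path(list_path).write_text("".join(name + "\n" for name in files))
    return files


# Example (hypothetical paths):
#   write_filelist("recordings", "files.list")
```

The resulting text file can then be given with -filelist together with -input rawfile.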

By default, Julius assumes that one file contains one sentence utterance, with silence at the beginning and end of the file. However, you can apply voice activity detection, silence cutting and other functions normally used for microphone input by specifying the corresponding options. You can also use a Julius function called "short-pause segmentation" to do successive recognition of a long audio stream. See the corresponding chapter of this book for details.

Supported format

Julius can read the following audio file formats by default:

  • Microsoft WAVE format WAV file (16bit, PCM (no compression), monaural)

  • RAW file: no header, signed short (16bit), Big Endian, monaural
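The WAV requirements above can be verified before a file is passed to Julius. The following is a minimal sketch using Python's standard wave module; the function name and the expected_rate parameter are illustrative, and the check is independent of Julius itself.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of problems found in the WAV file (empty if none)."""
    problems = []
    # wave.open raises wave.Error for WAV data it cannot read as PCM.
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2:
            problems.append("not 16-bit: %d bytes per sample" % w.getsampwidth())
        if w.getnchannels() != 1:
            problems.append("not monaural: %d channels" % w.getnchannels())
        if w.getframerate() != expected_rate:
            problems.append(
                "sampling rate is %d Hz, expected %d Hz"
                % (w.getframerate(), expected_rate)
            )
    return problems
```

An empty result means the file is 16-bit, monaural, uncompressed PCM at the expected sampling rate.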

If you use libsndfile with Julius, you can use additional formats such as AU, NIST, ADPCM and so on. libsndfile will be used by Julius if the libsndfile development files (headers and libraries) are present on your system when you compile Julius from source.

Pay attention to the RAW file format. Julius accepts only Big Endian format. If you give it a Little Endian RAW file, Julius cannot detect this and outputs wrong results with no warning. You can convert the endianness using sox like this:

% sox -t .raw -s -w -c 1 infile -t .raw -s -w -c 1 -x outfile
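If sox is not available, the same byte swap can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the whole file is read into memory, which is fine for short utterances.

```python
from array import array


def swap_endianness(raw_bytes):
    """Swap the byte order of every 16-bit sample in a headerless buffer."""
    samples = array("h")          # signed 16-bit integers
    samples.frombytes(raw_bytes)  # raises ValueError if the length is odd
    samples.byteswap()            # reverse the two bytes of each sample
    return samples.tobytes()


def convert_file(infile, outfile):
    """Read a RAW file, swap its endianness, and write the result."""
    with open(infile, "rb") as f:
        data = f.read()
    with open(outfile, "wb") as f:
        f.write(swap_endianness(data))
```

Applying the swap twice restores the original data, so the same function converts in either direction.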

Also make sure that the RAW file has the correct properties (sampling rate etc.) for the acoustic model you use, since a RAW file carries no header information and Julius cannot check them automatically.

Live microphone input

Option -input mic tells Julius to get the audio input from a raw audio device such as a microphone or line input. This feature is OS dependent, and is supported on Linux, Windows, Mac OS X, FreeBSD and Solaris. [2]

Detection of the spoken regions in the continuous input is performed prior to the main recognition task. By default, sound input is detected by a simple level-based detection (level and zero-cross thresholds), and real-time recognition is then performed for each detected region. See the chapter on voice activity detection to tune the detection or to use advanced features such as GMM-based detection. Also see the notes on real-time recognition.
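To give an idea of what a level and zero-cross test looks like, here is a minimal Python sketch. It is not Julius's actual implementation: the function names, the way crossings are counted, and both threshold parameters are assumptions made for illustration.

```python
def count_level_crossings(samples, level):
    """Count swings between a sample >= +level and a sample <= -level.

    Samples whose magnitude stays below `level` are ignored, so
    low-amplitude background noise does not contribute to the count.
    """
    count = 0
    last_sign = 0  # +1 after a loud positive sample, -1 after a loud negative one
    for s in samples:
        if s >= level:
            sign = 1
        elif s <= -level:
            sign = -1
        else:
            continue
        if last_sign != 0 and sign != last_sign:
            count += 1
        last_sign = sign
    return count


def looks_like_speech(samples, level, min_crossings):
    """Decide whether a window of samples passes both thresholds."""
    return count_level_crossings(samples, level) >= min_crossings
```

A window of samples is treated as sound input only when the signal is both loud enough (level) and oscillating often enough (crossing count).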

Preparing microphone input

Julius does not handle any mixer settings of the machine. You should set the mixer properly yourself, including the recording volume and the capture device (microphone / line).

The recording quality GREATLY affects the recognition performance. Less distortion and less noise will improve the accuracy. You should also set a proper volume to avoid clipping on loud speech.

You can check how Julius hears the input audio. If you have a running Julius, the best way is to specify the option -record dir to save the processed audio data into files, one per sentence. Another way is to use the tools in the Julius distribution, adinrec and adintool, to record audio. They use the same functions as Julius, so what they record is what Julius will hear.
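Once a recording has been saved, its level can be inspected for clipping. The Python sketch below assumes the recording is available as a 16-bit monaural WAV file; the function name is hypothetical and the check is not part of the Julius tools.

```python
import struct
import wave

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def peak_and_clipping(path):
    """Return (peak amplitude, number of samples at full scale)."""
    with wave.open(path, "rb") as w:
        if w.getsampwidth() != 2 or w.getnchannels() != 1:
            raise ValueError("expected a 16-bit monaural WAV file")
        frames = w.readframes(w.getnframes())
    # WAV sample data is little-endian signed 16-bit.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    peak = max((abs(s) for s in samples), default=0)
    clipped = sum(1 for s in samples if s >= FULL_SCALE or s <= -FULL_SCALE - 1)
    return peak, clipped
```

A non-zero count of full-scale samples suggests that the recording volume is too high.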

Notes for supported OS / devices

Linux

Julius has two sound API interfaces for Linux:

  • ALSA

  • OSS

When -input mic is specified, Julius uses the ALSA interface to capture audio. You can also explicitly specify which API to use with option -input alsa or -input oss.

The sound card should support 16-bit recording. Julius uses monaural (1-channel) recording by default, but if you are using the OSS interface and only stereo recording is available, Julius will recognize the left channel. You can also use USB audio devices.

Other devices can be selected by defining environment variables. When using the ALSA interface (the default), the default device name string is "default". The device name can be changed with the environment variable ALSADEV. For example, if you have multiple audio devices and set ALSADEV="plughw:1,0", Julius will listen to the second sound card. When using the OSS interface, the default device name is /dev/dsp, and it can be changed with the environment variable AUDIODEV.

Windows

On Windows, Julius uses the DirectSound API via the PortAudio library.

When using PortAudio V19, devices are searched in the order ASIO, DirectSound and MME. The recording device can be specified with the environment variable PORTAUDIO_DEV. When using PortAudio V19, the instructions are output to the log at audio initialization.

Mac OS

On Mac OS X, Julius uses the CoreAudio API. It is confirmed to run on Mac OS X v10.3.9 and v10.4.1.

FreeBSD

On FreeBSD, Julius uses the standard snd driver. If compilation fails, try --with-mictype=oss.

Sun Solaris

On Sun Solaris, the default device name is /dev/audio. It can be changed by setting the environment variable AUDIODEV. Unlike on other operating systems, Julius on Solaris automatically switches the recording device to the microphone. (This is an old feature from the early development of Julius.)

About Input Delay

You may encounter a time delay on audio input and may want to minimize it. This section describes the reason for the delay and shows a method to reduce it.

Since most operating systems are not real-time systems, audio input is often buffered in small chunks (or fragments) on the kernel side. When a chunk has been filled by the capture device, it is passed to the user process, so the input is delayed by the length of one chunk. You can set the size of a chunk with the environment variable LATENCY_MSEC (the value is in milliseconds, not bytes!). The default value depends on the OS and is output to the tty at startup. A smaller value decreases the delay, but the CPU load gets higher and may slow down the whole system.
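The difference between a duration in milliseconds and a size in bytes can be made concrete with a small calculation. The Python sketch below only shows the arithmetic for 16-bit monaural audio. How each OS backend actually maps LATENCY_MSEC to its buffers is not described here, and the function name is hypothetical.

```python
def chunk_bytes(latency_msec, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Size in bytes of a chunk that holds latency_msec of audio."""
    return sample_rate * bytes_per_sample * channels * latency_msec // 1000


print(chunk_bytes(50))  # 1600
```

At 16,000 Hz, a chunk of 50 milliseconds therefore holds 800 samples, or 1,600 bytes.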

Network and Socket inputs

original

-input adinnet makes Julius receive an audio stream via a network socket. The protocol is a specific one that just sends a sequence of audio samples in small packets. There is no detailed documentation for the protocol, but it is basic and very simple, since it has no encryption or encode/decode features.

adintool implements the protocol. You can test it with adintool like this:

Run Julius with network input (it will stop for waiting connection)
% julius .... -input adinnet -freq srate
Run adintool to send audio to the Julius
(server_hostname should be the host where
the above Julius is running)
% adintool -in mic -out adinnet -server server_hostname -freq srate

esd

-input esd tells Julius to get audio input via the EsounD daemon (esd). This is supported on Linux. esd is an audio daemon used to share audio I/O among multiple applications. For more details, see the esd manual.

standard

Option -input stdin makes Julius read input from standard input. Only the RAW file format is supported with this option.

DATLINK/NetAudio

Julius supports reading direct input from a DATLINK server. To use this feature, compile Julius with the DATLINK/NetAudio libraries and headers and specify -input netaudio. See the Installation chapter for how to specify the libraries.

Feature vector file input

Julius can read a feature vector file already extracted from speech data by other applications such as HTK. You can use this feature to recognize with acoustic features that Julius cannot extract itself. The supported file format is the HTK feature file format.

-input htkparam or -input mfcfile tells Julius to read the input file as a feature vector file. As with audio file input, multiple files can be given by listing the file names, one per line, in a text file and specifying it with -filelist.

When the input is given as a feature vector file, its feature type is checked against the acoustic model you are going to use. When they do not match, Julius first examines the difference. If their base forms (basic types) are the same and only the qualifiers below differ, Julius modifies the input vectors to match the acoustic model and uses them; otherwise Julius outputs an error and ignores the input.

  • addition / removal of delta coef. (_D)

  • addition / removal of acceleration coef. (_A)

  • suppression of energy (_N)

Please note that this checking can be disabled by the option -notypecheck.
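The kind of comparison described above can be sketched in Python. The header layout and the parameter kind bit values below follow the HTK parameter file format (12-byte big-endian header, base type in the low 6 bits). The function names are hypothetical, and the code only illustrates the idea; it is not how Julius implements the check.

```python
import struct

BASE_MASK = 0o77  # low 6 bits of parmKind: base type (for example, 6 = MFCC)
QUALIFIERS = {"_E": 0o100, "_N": 0o200, "_D": 0o400, "_A": 0o1000}
# Qualifiers that the text above lists as adjustable.
ADJUSTABLE = QUALIFIERS["_D"] | QUALIFIERS["_A"] | QUALIFIERS["_N"]


def read_htk_header(data):
    """Decode (nSamples, sampPeriod, sampSize, parmKind) from 12 header bytes."""
    return struct.unpack(">iihH", data[:12])


def differs_only_in_adjustable(input_kind, model_kind):
    """True if the base types match and only _D, _A or _N differ."""
    if (input_kind & BASE_MASK) != (model_kind & BASE_MASK):
        return False
    diff = input_kind ^ model_kind
    return (diff & ~ADJUSTABLE) == 0
```

For example, an MFCC_E_D_A input compared with an MFCC_E_D model differs only in _A, so the function returns True, while MFCC_E compared with plain MFCC differs in _E and returns False.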

Audio I/O Extension by Plugin

Julius ver.4.1 and later can extend its audio interface through external plugins. When your target OS is not supported by Julius, or you want to add some network-based input to Julius, you can develop a plugin to enable it. See the chapter describing plugin development for more details.



[1] Please note that this sampling rate check does not work for RAW file input, since a RAW file has no header information.