PVOCEX File Format

PVOC-EX

File format for Phase Vocoder data, based on
WAVE_FORMAT_EXTENSIBLE.

Preliminary specification.

Rationale.

The PVOC-EX file format seeks to provide a cross-platform and robust format for standard
fixed-overlap phase vocoder analysis files. Many implementations of the phase vocoder exist, notably in Csound,
the CDP system (based on the CARL implementation (Moore/Dolson)), Soundhack (Tom Erbe), and the Princeton-hosted
PVC package (Paul Koonce). The differences between these formats are minor, and consist of different headers, and varying scale factors for amplitude. More importantly, the Csound format is not defined fully, and uses the byte-order of the
host platform. The Soundhack format is based closely on the Csound format, and similarly does not define word-order, though by being hosted on the Macintosh platform, it will invariably be written in big-endian format.

Uniquely, the PVC implementation supports multi-channel sources (stereo and beyond), and Soundhack supports at least stereo files. Csound opcodes currently only support mono analysis files.

While it would be a simple matter to converge the header elements of these existing formats, and define a byte order,
I have felt that the introduction by Microsoft of WAVE_FORMAT_EXTENSIBLE (WAVE-EX), which by definition supports custom extensions, offered a good opportunity to define a format based on an existing standard. One reason for choosing this route is that it enables rendering information to be fully incorporated into the format, by inheriting the WAVEFORMATEX component of WAVE-EX. With the power of modern PCs, it is not only possible, but easy, to stream more than one channel of analysis data in real time. The proposed format is intended to support use of analysis data in a real-time streaming environment.

For big-endian platforms, the SDIF initiative based at CNMAT is likely to prove of lasting significance, though a format for phase vocoder data has not yet been defined. SDIF would appear to offer the natural format for those platforms. The SDIF format is extremely flexible, supporting frame-based data with arbitrary time-stamps, and multiple types of data within a single file. The likely drawback of this flexibility is that file conversion can in many cases be only one-way: while a simple phase vocoder format can be converted into an SDIF file, conversion in the other direction cannot be guaranteed. Also, SDIF does not currently define a format for multi-channel analysis files. Because, as outlined above, many programs already share all the important aspects of a single format, PVOC-EX is designed to ensure two-way conversion for at least CDP, Csound, Soundhack and PVC. It is my hope that the format, once consolidated, can be incorporated into all these applications. Since SDIF is designed to support advanced and research-oriented applications, I feel that PVOC-EX is best kept to a minimum specification compatible with effective use.

The Format.

This document presumes a basic knowledge of WAVE-EX.

To extend WAVE-EX, a unique identifier, or GUID, is required. Applications which do not recognise, or cannot handle, files with this GUID will reject the file. The GUID defined for PVOC-EX is:

{8312B9C2-2E6E-11d4-A824-DE5B96C3AB21}
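As a rough illustration of how a reader might test for this GUID, the sketch below serialises it into the 16 bytes that would appear in the SubFormat field of the WAVE-EX header. It assumes the usual Windows on-disk layout for a GUID (first three fields little-endian, last eight bytes in order). The names pvx_guid, PVX_GUID, pvx_guid_to_bytes and pvx_is_pvocex_guid are hypothetical and are not part of the specification or of the example programs.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for the Windows GUID type. */
typedef struct {
    uint32_t data1;
    uint16_t data2;
    uint16_t data3;
    uint8_t  data4[8];
} pvx_guid;

/* {8312B9C2-2E6E-11d4-A824-DE5B96C3AB21} */
static const pvx_guid PVX_GUID = {
    0x8312B9C2u, 0x2E6Eu, 0x11D4u,
    { 0xA8, 0x24, 0xDE, 0x5B, 0x96, 0xC3, 0xAB, 0x21 }
};

/* Serialise a GUID in its assumed on-disk layout: the first three
   fields little-endian, the final eight bytes copied unchanged. */
void pvx_guid_to_bytes(const pvx_guid *g, uint8_t out[16])
{
    out[0] = (uint8_t)(g->data1);
    out[1] = (uint8_t)(g->data1 >> 8);
    out[2] = (uint8_t)(g->data1 >> 16);
    out[3] = (uint8_t)(g->data1 >> 24);
    out[4] = (uint8_t)(g->data2);
    out[5] = (uint8_t)(g->data2 >> 8);
    out[6] = (uint8_t)(g->data3);
    out[7] = (uint8_t)(g->data3 >> 8);
    memcpy(out + 8, g->data4, 8);
}

/* Returns 1 if the 16 bytes read from a SubFormat field match PVOC-EX. */
int pvx_is_pvocex_guid(const uint8_t bytes[16])
{
    uint8_t expect[16];
    pvx_guid_to_bytes(&PVX_GUID, expect);
    return memcmp(bytes, expect, 16) == 0;
}
```

Comparing raw bytes in this way avoids any dependence on the byte order of the host, which is the main reason for not reading the field directly into a native structure.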

The complete format chunk for PVOC-EX is:

typedef struct {
    WAVEFORMATEXTENSIBLE wxFormat;
    DWORD dwVersion;      /* initial version is 1 */
    DWORD dwDataSize;     /* sizeof PVOCDATA data block */
    PVOCDATA data;        /* 32 byte block */
} WAVEFORMATPVOCEX;

The total size of WAVEFORMATPVOCEX is 80 bytes, thus respecting the requirements of WAVE-EX that the format chunk support alignment to 8-byte boundaries.

wxFormat:
contains the information required to synthesize the file as originally analysed. The full scope of WAVE-EX is available, including the definition of speaker positions. This information can be ignored by a renderer, though certain fields are important.

wxFormat.dwChannelMask:
if this is zero, signifying no assigned speakers, an application has the choice whether to create a plain soundfile (WAVE, AIFF, etc.) or a WAVE-EX file. However, where speaker positions are defined, it is recommended that a WAVE-EX file be created. Although AIFF does define a few surround sound formats, they are ambiguous for four channels, and the six-channel format is archaic, so AIFF is not recommended in this case.

wxFormat.Format.nChannels
Number of channels in the file (mono, stereo, etc)

wxFormat.Format.nSamplesPerSec
Sample Rate of the source. Informs applications of the Nyquist frequency for the analysis data.

In circumstances where the analysis data has been synthesized directly, these are the two essential pieces of information. It is then a matter of choice what output sample format is specified, though for synthetic data, use of the floating-point format is recommended. The data for the full WAVEFORMATEX block should be set correctly as for any WAVE file.

All information specific to the phase vocoder is contained within the PVOCDATA block. This is defined by the structure:

typedef struct pvoc_data {
    WORD  wWordFormat;    /* IEEE_FLOAT or IEEE_DOUBLE */
    WORD  wAnalFormat;    /* PVOC_AMP_FREQ, PVOC_AMP_PHASE, PVOC_COMPLEX */
    WORD  wSourceFormat;  /* WAVE_FORMAT_PCM or WAVE_FORMAT_IEEE_FLOAT */
    WORD  wWindowType;    /* defines the standard analysis window used, or a custom window */
    DWORD nAnalysisBins;  /* number of analysis channels. The FFT window size is derived from this */
    DWORD dwWinlen;       /* analysis window length, in samples */
    DWORD dwOverlap;      /* window overlap length in samples (decimation) */
    DWORD dwFrameAlign;   /* usually nAnalysisBins * 2 * sizeof(float) */
    float fAnalysisRate;  /* sample rate / dwOverlap */
    float fWindowParam;   /* parameter associated with some window types: default 0.0f unless needed */
} PVOCDATA;
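Several of these fields follow directly from the analysis settings, as the comments indicate. A minimal sketch of that arithmetic is given here, assuming IEEE_FLOAT data and no frame padding; the type pvx_derived and the function pvx_derive are hypothetical helpers, not part of the example code.

```c
#include <stdint.h>

/* Hypothetical holder for the PVOCDATA fields that can be derived
   from the FFT size, the overlap (decimation) and the sample rate. */
typedef struct {
    uint32_t nAnalysisBins;
    uint32_t dwOverlap;
    uint32_t dwFrameAlign;
    float    fAnalysisRate;
} pvx_derived;

/* Returns 0 on success, -1 if the settings are unusable.
   The FFT size must be even, though not necessarily a power of two. */
int pvx_derive(uint32_t fft_size, uint32_t overlap, uint32_t sample_rate,
               pvx_derived *out)
{
    if (fft_size == 0 || (fft_size & 1u) != 0 || overlap == 0)
        return -1;
    out->nAnalysisBins = fft_size / 2 + 1;
    out->dwOverlap     = overlap;
    /* Two 32-bit words per bin; no padding is assumed here. */
    out->dwFrameAlign  = out->nAnalysisBins * 2u * (uint32_t)sizeof(float);
    out->fAnalysisRate = (float)sample_rate / (float)overlap;
    return 0;
}
```

For an FFT size of 1024, an overlap of 128 samples and a 44100 Hz source, this gives 513 bins, a frame size of 4104 bytes and an analysis rate of 344.53125 frames per second.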

Notes on some PVOCDATA fields.

wWordFormat:
I expect that IEEE_FLOAT will be used almost always. I recognize that some advanced applications may wish to be able to use doubles; the issue is that more than one f/p format exists for doubles, and it will be important to eliminate all possibility of ambiguity here.

wAnalFormat:
Csound, CDP/CARL, and PVC all write analysis channels as amplitude and frequency. Soundhack writes a format as amplitude and phase (listed within Csound but not implemented). Another relevant format is pure complex (real-imaginary). This is the format of the SDIF 1STF file, for example. The inclusion of this mode is arguable, as it is relatively remote from the usual application of the phase vocoder (for most non-trivial tasks, the data will need to be converted to the Amp-Freq format). Other representations are possible, but I feel that specifying too many alternative formats adds complexity to a receiving application. The following frame formats are currently defined for PVOCEX:

PVOC_AMP_FREQ (the most usual)
PVOC_AMP_PHASE
PVOC_COMPLEX

PVOC_AMP_PHASE is defined for compatibility with Soundhack. Accordingly, the data represents raw or 'wrapped' phase values derived directly from the FFT analysis. A rendering application will need to unwrap the phases, to convert to PVOC_AMP_FREQ. This is a necessary step for time-scaling procedures. This is demonstrated in the example application pvconv.
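The conversion from wrapped phase to frequency can be sketched as follows. This is the generic textbook phase-difference method and is not taken from pvconv. It assumes that each frame's phases are referenced to the start of that frame, so that the expected phase advance of each bin centre over one hop is subtracted before wrapping; an analysis that rotates its input by absolute time would omit that subtraction. The function names are hypothetical.

```c
#define PVX_PI 3.14159265358979323846

/* Wrap a phase value into the range [-pi, pi). */
double pvx_wrap(double x)
{
    while (x >= PVX_PI) x -= 2.0 * PVX_PI;
    while (x < -PVX_PI) x += 2.0 * PVX_PI;
    return x;
}

/* Convert one frame of wrapped phases (radians) to frequencies (Hz).
   last_phase holds the previous frame's phases and is updated in place;
   it should be zeroed before the first frame. */
void pvx_phase_to_freq(const float *phase, float *last_phase, float *freq,
                       unsigned nbins, unsigned fft_size, unsigned overlap,
                       double sample_rate)
{
    /* Spacing of bin centre frequencies. */
    const double bin_hz = sample_rate / (double)fft_size;
    /* Phase advance of bin 1's centre frequency over one hop. */
    const double advance = 2.0 * PVX_PI * (double)overlap / (double)fft_size;
    /* Converts a phase deviation per hop into Hz. */
    const double hz_per_rad = sample_rate / (2.0 * PVX_PI * (double)overlap);
    unsigned k;

    for (k = 0; k < nbins; k++) {
        double delta = (double)phase[k] - (double)last_phase[k];
        last_phase[k] = phase[k];
        /* Deviation from the bin centre, wrapped to the principal range. */
        delta = pvx_wrap(delta - (double)k * advance);
        freq[k] = (float)((double)k * bin_hz + delta * hz_per_rad);
    }
}
```

The result is unambiguous only while the true frequency lies within sample_rate / (2 * overlap) of the bin centre, which is one reason small overlaps (decimation values) are preferred.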

Similarly, the PVOC_COMPLEX frame contains raw 'wrapped' real-imaginary numbers.

wSourceFormat
This is required to disambiguate a 32bit source sample size as defined in WAVEFORMATEX. Since wFormatTag is WAVE_FORMAT_EXTENSIBLE, and a custom GUID is used, the distinction between integer and floating-point samples is lost.

wWindowType:
One of the arguable aspects of the specification. It is possible to identify a large number of analysis windows. However, in current phase vocoder implementations, one of a small set of standard windows is used. The following window types have been defined for PVOC-EX so far:

PVOC_HAMMING
PVOC_HANNING
PVOC_KAISER
PVOC_RECT
PVOC_CUSTOM

The Kaiser window has an associated parameter, 'beta', which can be given in the fWindowParam field. If this is zero, the default value of 6.8 will be assumed.
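As an illustration of the default rule, the sketch below selects beta from fWindowParam and generates a Kaiser window using the textbook definition, w[n] = I0(beta * sqrt(1 - r*r)) / I0(beta) with r running from -1 to 1 across the window. Individual implementations may compute the window differently, so this should not be read as the formula used by any of the programs named in this document. All function names are hypothetical.

```c
#include <stddef.h>

/* Effective Kaiser beta: a zero parameter selects the default of 6.8. */
double pvx_kaiser_beta(float fWindowParam)
{
    return (fWindowParam == 0.0f) ? 6.8 : (double)fWindowParam;
}

/* I0(x), the zeroth-order modified Bessel function of the first kind,
   evaluated from its power series. The series depends only on x*x,
   so the squared argument is passed and no square root is needed. */
double pvx_bessel_i0_sq(double x_squared)
{
    double sum = 1.0, term = 1.0;
    const double q = 0.25 * x_squared;      /* (x/2)^2 */
    int k;
    for (k = 1; k < 500; k++) {
        term *= q / ((double)k * (double)k);
        sum += term;
        if (term < 1e-15 * sum)
            break;
    }
    return sum;
}

/* Fill w[0..len-1] with a Kaiser window whose centre value is 1.0. */
void pvx_kaiser_window(double *w, size_t len, float fWindowParam)
{
    const double beta  = pvx_kaiser_beta(fWindowParam);
    const double denom = pvx_bessel_i0_sq(beta * beta);
    size_t n;

    if (len == 1) {
        w[0] = 1.0;
        return;
    }
    for (n = 0; n < len; n++) {
        /* r runs from -1 at the first sample to +1 at the last. */
        double r = 2.0 * (double)n / (double)(len - 1) - 1.0;
        double arg = 1.0 - r * r;
        if (arg < 0.0)
            arg = 0.0;
        w[n] = pvx_bessel_i0_sq(beta * beta * arg) / denom;
    }
}
```

With an odd window length the centre sample is exactly 1.0, which matches the normalisation expected of window data elsewhere in this format.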

The provision of PVOC_CUSTOM is possibly contentious. If this is specified, the format chunk must be followed, before the 'data' chunk, by a special chunk containing the window data, of length dwWinlen. The samples must be of the same type as the analysis data itself, as given by wWordFormat. The data must be normalised so that the peak sample (centre of the window) is 1.0.

nAnalysisBins
Number of analysis channels. This relates directly to the fft size used in the analysis:
nAnalysisBins = (fft_size / 2) + 1.

One unusual aspect of the canonical CARL implementation is that FFT sizes need not be powers-of-two, but they must be even. The use of the -F flag in pvoc and pvocex, to tune the analysis to a known fundamental frequency, will generally lead to a non power-of-two FFT size. This is currently allowed in PVOC-EX, so that the full scope of the phase vocoder is available. A non power-of-two size must nevertheless be regarded as exceptional (and may not be supported by some implementations), and is best avoided for analysis files intended for general distribution. However, it is worth noting that such a format can still be rendered using a conventional oscillator bank.

Note that the format supports the use of window sizes, given by dwWinlen, greater than the fft size. The canonical implementation of this is CARL pvoc, as demonstrated in the example implementations indicated below.

dwOverlap

This is defined as 'Decimation' in most implementations. It is very commonly a power-of-two fraction of the window size (e.g. the standard overlap in CARL pvoc is 1/8th of the window). However, dwOverlap is not limited to such sizes. Both Csound and PVC typically use overlaps that are not power-of-two fractions.

dwFrameAlign

This is a slightly arguable inclusion. Since all analysis data consists of a pair of 32-bit words, all frames are automatically assured alignment to 8-byte boundaries (this is also the required alignment for SDIF files). However, for some platforms even this may not be ideal (for example, where alignment to 16-byte boundaries is preferred), in which case the usual odd number of analysis bins will require padding after the data. The dwFrameAlign field will store the full size of the analysis frame, plus any padding. Pad words should be written as zeroes. The example implementations do not currently test this value, or calculate padding. Developers using relevant platforms are invited to test and comment on this feature.
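The padding calculation amounts to rounding the raw frame size up to the next multiple of the desired alignment. A minimal sketch follows; the function name pvx_frame_align and its parameters are hypothetical, and the example implementations do not perform this step.

```c
#include <stdint.h>

/* Size in bytes of one analysis frame, padded up to a multiple of
   'alignment' bytes. word_bytes is 4 for IEEE_FLOAT data.
   'alignment' must be non-zero. */
uint32_t pvx_frame_align(uint32_t nbins, uint32_t word_bytes,
                         uint32_t alignment)
{
    uint32_t raw = nbins * 2u * word_bytes;   /* two words per bin */
    uint32_t rem = raw % alignment;
    return rem ? raw + (alignment - rem) : raw;
}
```

For 513 bins of 32-bit floats the raw frame is 4104 bytes, which is already a multiple of 8; a 16-byte alignment would give a dwFrameAlign of 4112, that is, 8 bytes (one zeroed bin's worth) of padding.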

Custom Window Chunk.

Currently this is the most speculative element of the format, and comments are invited.
    It is very simple:
    <PVXW>
    <chunk-size in bytes, excluding tag and size field>
    < window data, dwWinlen samples>

This may well not be adequate. Possible additions include a 4-byte ident (or a longer ASCII title), and a float field specifying amplitude, though normative practice would be to provide a normalized window (peak value = 1.0). Note that the WAVE-EX spec encourages all chunks to support 8-byte alignment.
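Since this chunk is not yet implemented in the example programs, the following is only a sketch of how it might be assembled in memory. It assumes IEEE_FLOAT window samples, the little-endian byte order of RIFF files, and a window whose largest sample is positive. The window is normalised to a peak of 1.0 before writing. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value in little-endian byte order. */
static void pvx_put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v);
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Build a PVXW chunk in buf: tag, size field, then winlen normalised
   float samples. Returns the number of bytes written, or 0 if the
   buffer is too small or the window has no positive peak. */
size_t pvx_write_window_chunk(const float *window, uint32_t winlen,
                              uint8_t *buf, size_t buflen)
{
    size_t need = 8 + (size_t)winlen * 4;
    float peak = 0.0f;
    uint32_t i;

    if (winlen == 0 || buflen < need)
        return 0;
    for (i = 0; i < winlen; i++)
        if (window[i] > peak)
            peak = window[i];
    if (peak <= 0.0f)
        return 0;

    memcpy(buf, "PVXW", 4);
    pvx_put_le32(buf + 4, winlen * 4u);   /* excludes tag and size field */
    for (i = 0; i < winlen; i++) {
        float s = window[i] / peak;       /* peak sample becomes 1.0 */
        uint32_t bits;
        memcpy(&bits, &s, sizeof bits);   /* reinterpret the float's bits */
        pvx_put_le32(buf + 8 + (size_t)i * 4, bits);
    }
    return need;
}
```

With an even dwWinlen the whole chunk (8 bytes of header plus 4 bytes per sample) is a multiple of 8 bytes long, which fits the alignment encouraged by WAVE-EX.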

No other chunks, apart from the data chunk itself, are required for PVOC-EX. Where the renderer specifies floating-point samples, the PEAK chunk can be used in the usual way. This will be especially relevant where a custom window is used, as amplitude levels cannot be presumed. The Custom Window Chunk is not yet implemented in the example programs.

The 'data' chunk.

Analysis frames are interleaved according to nChannels, i.e.:

for a stereo file:

<frame 0 Ch 0>
<frame 0 Ch 1>
<frame 1 Ch 0>
<frame 1 Ch 1>
etc...
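Given this interleaving, the position of any frame within the 'data' chunk follows from the frame number, the channel number and dwFrameAlign. A minimal sketch, with a hypothetical function name:

```c
#include <stdint.h>

/* Byte offset, from the start of the 'data' chunk contents, of the
   analysis frame for the given frame number and channel.
   frame_align is the dwFrameAlign value (frame size plus any padding). */
uint64_t pvx_frame_offset(uint32_t frame, uint16_t channel,
                          uint16_t nchannels, uint32_t frame_align)
{
    return ((uint64_t)frame * nchannels + channel) * frame_align;
}
```

For a stereo file with a dwFrameAlign of 4104, frame 1 of channel 0 begins at byte 8208 and frame 1 of channel 1 at byte 12312.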

Frame amplitudes are expected to be normalized close to 1.0. Thus, where the source is a full-amplitude sinewave, the peak amplitude in the nearest bin will be close to 1.0. Later versions of this document will develop this aspect further. Suffice it to say here that both the CARL and Soundhack formats provide analysis windows in this form, while Csound and PVC require scale factors. The example implementations accompanying this release are based on the CARL distribution, and a further program, pvconv, demonstrates conversion from the current Csound format to the PVOC-EX format. It is important that this aspect of the format is tested thoroughly, as the format is required to support independent analysis and synthesis implementations.

File Extension

Although, strictly speaking, it cannot be illegal to give a file based on WAVE_FORMAT_EXTENSIBLE the .wav extension, there is little practical advantage in doing so. Most applications check the file format by reading the header, but some do not.

The recommended file extension for the PVOC-EX format is .pvx. This is enforced by the example application pvocex2.

Example Implementations (command-line programs): pvocex, pvocex2, pvconv

These now both use the fast FFTW libraries, for a significant increase in speed (typ > 25%), though the programs are much larger. Distributions for Linux as well as Windows are now available.

pvocex

This is a direct adaptation of the CARL 'pvoc' program, and uses C code entirely. Only mono files are supported. A majority of the CARL command flags is supported. The full range of time and pitch scaling can be applied.

Preliminary benchmarks: source is 30-second mono floatsam file at SR=44100. Full analysis-resynthesis.
pvocex -N1024 infile.wav outfile.wav
(default overlap 128 samples )
Pentium II 333MHz, 128MB RAM:
Windows2000 (VC++ 5.0): 12.24 secs
Linux (Redhat 6.0, pgcc -O6 etc) 13.67 secs

Pentium III 500MHz, 128MB RAM
Linux (Redhat 6.1, gcc -O6 etc) 9.06 secs

pvocex2

This uses crude C++ wrappers to encapsulate the pvoc analysis and synthesis stages, partly to emphasize that, in principle, the two stages could be implemented independently. This version supports stereo files (it will eventually be extended to support generic multi-channel files), and a main subset of the CARL flags. Time and pitch scaling is supported. A flag is also available to report the format information of a PVOC-EX file.

Further development of PVOC-EX facilities will be concentrated in this program, and derivatives, so that pvocex itself can remain close to the CARL implementation. It follows that none of the code for pvocex2 is frozen, and will almost certainly change. In particular, the functions in the file pvpp.cpp should not be considered as library or API functions.

Preliminary benchmarks: source is 30-second mono floatsam file at SR=44100. Full analysis-resynthesis.
pvocex2 -N1024 infile.wav outfile.wav
(default overlap 128 samples )
Pentium II 333MHz, 128MB RAM:
Windows2000 (VC++ 5.0) 14.1 secs
Linux (Redhat 6.0 pgcc -O6 etc) 15.68 secs

Pentium III 500MHz, 128MB RAM
Linux (Redhat 6.1, gcc -O6, etc) 10.19 secs

Source for pvocex mono analysis-resynthesis (accepts WAVE, AIFF and AIFF-C files):
pvocex_src012.zip (49KB). Includes project files for VC++5.0
pvocex_src-0.12.tar.gz (44KB)

Source for pvocex2 stereo analysis-resynthesis (accepts WAVE, AIFF and AIFF-C files):
pvocex2_src011.zip (45KB). Includes project files for VC++ 5.0
pvocex2_src-0.11.tar.gz (44KB)

Source for pvconv application (convert Csound and Soundhack analysis files to pvoc-ex):
pvconv_src01.zip (5KB) NB requires pvocex_src012.zip

Other conversion programs will probably become available in time.

Executables for Pentium Pro or better systems (pvocex, pvocex2 and pvconv):
Windows:
pvocex_bin012.zip (172KB)
pvocex2_bin011.zip (171KB)
pvconv_exe.zip (27KB)

Linux:
pvocex_bin-0.12.tar.gz (214KB)
pvocex2_bin-0.11.tar.gz (200KB)
[pvconv to come]

Richard Dobson 5th August 2000