File format for Phase Vocoder data, based on PVOC-EX
Preliminary specification.
Rationale.
The PVOC-EX file format seeks to provide a cross-platform and robust
format for standard
fixed-overlap phase vocoder analysis files. Many implementations of
the phase vocoder exist, notably in Csound,
the CDP system
(based on the CARL implementation (Moore/Dolson)), Soundhack (Tom
Erbe), and the Princeton-hosted
PVC package (Paul Koonce). The differences between these formats are
minor, and consist of different headers, and varying scale factors for
amplitude. More importantly, the Csound format is not defined fully,
and uses the byte-order of the
host platform. The Soundhack format is based closely on the Csound
format, and similarly does not define word-order, though by being hosted
on the Macintosh platform, it will invariably be written in big-endian
format.
Uniquely, the PVC implementation supports multi-channel sources (stereo and beyond), and Soundhack supports at least stereo files. Csound opcodes currently only support mono analysis files.
While it would be a simple matter to converge the header elements of
these existing formats, and define a byte order,
I have felt that the introduction by Microsoft of WAVE_FORMAT_EXTENSIBLE
(WAVE-EX), which by definition supports custom extensions, offered a good
opportunity to define a format based on an existing standard. One reason
for choosing this route is that it enables rendering information to be
fully incorporated into the format, by inheriting the WAVEFORMATEX component
of WAVE-EX. With the power of modern PCs, it is not only possible,
but easy, to stream more than one channel of analysis data in real time.
The proposed format is intended to support use of analysis data in a real-time
streaming environment.
For big-endian platforms, the SDIF
initiative based at CNMAT is likely to prove of lasting significance, though
a format for phase vocoder data has not yet been defined. SDIF would appear
to offer the natural format for big-endian platforms. The SDIF format is
extremely flexible, supporting frame-based data with arbitrary time-stamps,
and multiple types of data within a single file. A probable problem with
this is that in many cases file conversion can be only one-way, as while
a simple phase vocoder format can be converted into an SDIF file, this
cannot be guaranteed in the other direction. Also, SDIF does not currently
define a format for multi-channel analysis files. Because, as outlined
above, many programs already share all the important aspects of a
single format, PVOC-EX is designed to ensure two-way conversion for at
least CDP, Csound, Soundhack and PVC. It is my hope that the format, once
consolidated, will be able to be incorporated into all these applications.
Since SDIF is designed to support advanced and research-oriented applications,
I feel that PVOC-EX is best kept to a minimum specification compatible
with effective use.
The Format.
This document presumes a basic knowledge of WAVE-EX.
To extend WAVE-EX, a unique identifier, or GUID, is required. Applications which do not recognise, or cannot handle, files with this GUID will reject the file. The GUID defined for PVOC-EX is:
{8312B9C2-2E6E-11d4-A824-DE5B96C3AB21}
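As a rough illustration of how a reader might test for this GUID, the sketch below compares the 16 SubFormat bytes read from a file against the PVOC-EX value. It assumes the usual WAVE-EX on-disk GUID layout, in which the first three fields are little-endian and the last eight bytes are stored as written. The names PVOCEX_GUID_BYTES and is_pvocex_guid are hypothetical and are not taken from the example programs.

```c
#include <assert.h>
#include <string.h>

/* {8312B9C2-2E6E-11d4-A824-DE5B96C3AB21} as stored in the file:
   Data1, Data2 and Data3 are little-endian, Data4 is a plain byte array. */
static const unsigned char PVOCEX_GUID_BYTES[16] = {
    0xC2, 0xB9, 0x12, 0x83,                         /* Data1 = 0x8312B9C2 */
    0x6E, 0x2E,                                     /* Data2 = 0x2E6E     */
    0xD4, 0x11,                                     /* Data3 = 0x11D4     */
    0xA8, 0x24, 0xDE, 0x5B, 0x96, 0xC3, 0xAB, 0x21  /* Data4              */
};

/* Returns 1 if the 16 SubFormat bytes match the PVOC-EX GUID, else 0.
   (Hypothetical helper, for illustration only.) */
int is_pvocex_guid(const unsigned char *subformat)
{
    assert(subformat != NULL);
    return memcmp(subformat, PVOCEX_GUID_BYTES, 16) == 0;
}
```

Comparing raw bytes in this way avoids any dependence on the byte order of the host.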
The complete format chunk for PVOC-EX is:
typedef struct {
    WAVEFORMATEXTENSIBLE wxFormat;
    DWORD    dwVersion;     /* initial version is 1 */
    DWORD    dwDataSize;    /* sizeof PVOCDATA data block */
    PVOCDATA data;          /* 32 byte block */
} WAVEFORMATPVOCEX;
The total size of WAVEFORMATPVOCEX is 80 bytes, thus respecting the requirements of WAVE-EX that the format chunk support alignment to 8-byte boundaries.
wxFormat:
contains the information required to synthesize
the file as originally analysed. The full scope of WAVE-EX is available,
including the definition of speaker positions. This information can be ignored
by a renderer, though certain fields are important.
wxFormat.dwChannelMask:
if this is zero, signifying no assigned speakers,
an application has the choice whether to create a plain soundfile (WAVE, AIFF, etc.),
or a WAVE-EX file. However, where speaker positions are defined, it is
recommended that a WAVE-EX file be created. Although AIFF does define a
few surround sound formats, they are ambiguous for four channels, and the
six-channel format is archaic, so it is not recommended in this case.
wxFormat.Format.nChannels
Number of channels in the file (mono,
stereo, etc)
wxFormat.Format.nSamplesPerSec
Sample Rate of the source. Informs applications
of the Nyquist frequency for the analysis data.
In circumstances where the analysis data has been synthesized directly,
these are the two essential pieces of information. It is then a matter
of choice what output sample format is specified, though for synthetic
data, use of the floating-point format is recommended. The data for
the full WAVEFORMATEX block should be set correctly as for any WAVE file.
All information specific to the phase vocoder is contained within the PVOCDATA block. This is defined by the structure:
typedef struct pvoc_data {
    WORD  wWordFormat;    /* IEEE_FLOAT or IEEE_DOUBLE */
    WORD  wAnalFormat;    /* PVOC_AMP_FREQ, PVOC_AMP_PHASE, PVOC_COMPLEX */
    WORD  wSourceFormat;  /* WAVE_FORMAT_PCM or WAVE_FORMAT_IEEE_FLOAT */
    WORD  wWindowType;    /* defines the standard analysis window used,
                             or a custom window */
    DWORD nAnalysisBins;  /* number of analysis channels. The FFT window
                             size is derived from this */
    DWORD dwWinlen;       /* analysis window length, in samples */
    DWORD dwOverlap;      /* window overlap length in samples (decimation) */
    DWORD dwFrameAlign;   /* usually nAnalysisBins * 2 * sizeof(float) */
    float fAnalysisRate;  /* sample rate / Overlap */
    float fWindowParam;   /* parameter associated with some window types:
                             default 0.0f unless needed */
} PVOCDATA;
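The sizes quoted in this document (32 bytes for PVOCDATA, 80 bytes for the complete format chunk) can be checked with a small sketch such as the following. It uses fixed-width stand-ins for the Windows types so that it can be compiled on any platform; the type names (PvxWaveFormatEx and so on) and the function pvx_layout_ok are my own and hypothetical. It assumes a 4-byte float and a compiler that accepts #pragma pack, which is not part of standard C.

```c
#include <assert.h>
#include <stdint.h>

#pragma pack(push, 1)           /* on-disk layout: no compiler padding */

typedef struct {                /* stand-in for WAVEFORMATEX, 18 bytes */
    uint16_t wFormatTag;
    uint16_t nChannels;
    uint32_t nSamplesPerSec;
    uint32_t nAvgBytesPerSec;
    uint16_t nBlockAlign;
    uint16_t wBitsPerSample;
    uint16_t cbSize;
} PvxWaveFormatEx;

typedef struct {                /* stand-in for WAVEFORMATEXTENSIBLE, 40 bytes */
    PvxWaveFormatEx Format;
    uint16_t wValidBitsPerSample;   /* the 'Samples' union in the SDK header */
    uint32_t dwChannelMask;
    uint8_t  SubFormat[16];         /* the GUID */
} PvxWaveFormatExtensible;

typedef struct {                /* stand-in for PVOCDATA, 32 bytes */
    uint16_t wWordFormat;
    uint16_t wAnalFormat;
    uint16_t wSourceFormat;
    uint16_t wWindowType;
    uint32_t nAnalysisBins;
    uint32_t dwWinlen;
    uint32_t dwOverlap;
    uint32_t dwFrameAlign;
    float    fAnalysisRate;
    float    fWindowParam;
} PvxData;

typedef struct {                /* stand-in for WAVEFORMATPVOCEX, 80 bytes */
    PvxWaveFormatExtensible wxFormat;
    uint32_t dwVersion;
    uint32_t dwDataSize;
    PvxData  data;
} PvxFormat;

#pragma pack(pop)

/* Returns 1 if the compiler lays the structures out at the sizes
   given in the specification, 0 otherwise. */
int pvx_layout_ok(void)
{
    return sizeof(PvxWaveFormatExtensible) == 40
        && sizeof(PvxData) == 32
        && sizeof(PvxFormat) == 80;
}
```

A portable reader would probably still read each field individually in little-endian order rather than reading the structure in one block; the sketch only confirms the arithmetic.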
Notes on some PVOCDATA fields.
wWordFormat:
I expect that IEEE_FLOAT will be used almost always.
I recognize that some advanced applications may wish to be able to use
doubles; the issue is that more than one f/p format exists for doubles,
and it will be important to eliminate all possibility of ambiguity here.
wAnalFormat:
Csound, CDP/CARL, and PVC all write analysis channels
as amplitude and frequency. Soundhack writes a format as amplitude and
phase (listed within Csound but not implemented). Another relevant format
is pure complex (real-imaginary). This is the format of the SDIF 1STF file,
for example. The inclusion of this mode is arguable, as it is relatively
remote from the usual application of the phase vocoder (for most non-trivial
tasks, the data will need to be converted to the Amp-Freq format). Other
representations are possible, but I feel that specifying too many alternative
formats adds complexity to a receiving application. The following frame
formats are currently defined for PVOC-EX:
PVOC_AMP_FREQ (the most usual)
PVOC_AMP_PHASE
PVOC_COMPLEX
PVOC_AMP_PHASE is defined for compatibility with Soundhack. Accordingly, the data represents raw or 'wrapped' phase values derived directly from the FFT analysis. A rendering application will need to unwrap the phases, to convert to PVOC_AMP_FREQ. This is a necessary step for time-scaling procedures. This is demonstrated in the example application pvconv.
Similarly, the PVOC_COMPLEX frame contains raw 'wrapped' real-imaginary numbers.
wSourceFormat
This is required to disambiguate a 32-bit
source sample size as defined in WAVEFORMATEX. Since wFormatTag is WAVE_FORMAT_EXTENSIBLE,
and a custom GUID is used, the distinction between integer and floating-point
samples is lost.
wWindowType:
One of the arguable aspects of the specification.
It is possible to identify a large number of analysis windows. However,
in current phase vocoder implementations, one of a small set of standard
windows is used. The following window types have been defined for PVOC-EX
so far:
PVOC_HAMMING
PVOC_HANNING
PVOC_KAISER
PVOC_RECT
PVOC_CUSTOM
The Kaiser window has an associated parameter, 'beta', which can be given in the fWindowParam field. If this is zero, the default value of 6.8 will be assumed.
The provision of PVOC_CUSTOM is possibly contentious. If this is specified, the format chunk must be followed, before the 'data' chunk, by a special chunk containing the window data, of length dwWinlen. The samples must be of the same type as the analysis data itself, as given by wWordFormat. The data must be normalised so that the peak sample (centre of the window) is 1.0.
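The two rules above (the default Kaiser beta, and the normalisation of a custom window) might be handled along these lines. This is only a sketch: the function names are hypothetical, single-precision window data is assumed, and the window is scaled by its largest absolute sample, which for a conventional symmetric window is the centre sample.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Returns the Kaiser beta to use: the stored fWindowParam, or the
   default of 6.8 when the field is zero. */
float kaiser_beta(float fWindowParam)
{
    return (fWindowParam == 0.0f) ? 6.8f : fWindowParam;
}

/* Scales a window in place so that its largest absolute sample is 1.0.
   Returns the original peak; an all-zero window is left unchanged
   and 0.0 is returned. */
float normalise_window(float *win, size_t winlen)
{
    size_t i;
    float peak = 0.0f;
    assert(win != NULL);
    for (i = 0; i < winlen; i++) {
        float a = (float)fabs(win[i]);
        if (a > peak)
            peak = a;
    }
    if (peak > 0.0f) {
        for (i = 0; i < winlen; i++)
            win[i] /= peak;
    }
    return peak;
}
```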
nAnalysisBins
Number of analysis channels. This relates directly
to the fft size used in the analysis:
nAnalysisBins = (fft_size / 2) + 1.
One unusual aspect of the canonical CARL implementation is that FFT sizes need not be powers-of-two, but they must be even. The use of the -F flag in pvoc and pvocex, to tune the analysis to a known fundamental frequency, will generally lead to a non power-of-two FFT size. This is currently allowed in PVOC-EX, so that the full scope of the phase vocoder is available. A non power-of-two size must nevertheless be regarded as exceptional (and may not be supported by some implementations), and is best avoided for analysis files intended for general distribution. However, it is worth noting that such a format can still be rendered using a conventional oscillator bank.
Note that the format supports the use of window sizes, given by dwWinlen, greater than the fft size. The canonical implementation of this is CARL pvoc, as demonstrated in the example implementations indicated below.
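The relation between nAnalysisBins and the FFT size can be inverted by a reader, as in the sketch below, which also shows one way to detect the exceptional non power-of-two case. The function names are hypothetical.

```c
#include <assert.h>

/* Inverts nAnalysisBins = (fft_size / 2) + 1. */
unsigned fft_size_from_bins(unsigned nbins)
{
    assert(nbins >= 2);
    return 2u * (nbins - 1u);
}

/* Returns 1 if n is a power of two, 0 otherwise. */
int is_power_of_two(unsigned n)
{
    return n != 0 && (n & (n - 1u)) == 0;
}

/* Centre frequency in Hz of analysis bin k, for a source at srate Hz. */
double bin_centre_hz(unsigned k, unsigned nbins, double srate)
{
    return (double)k * srate / (double)fft_size_from_bins(nbins);
}
```

For example, 513 bins imply an FFT size of 1024, and at a sample rate of 44100 the top bin (k = 512) lies at the Nyquist frequency of 22050 Hz. A file with 501 bins implies an FFT size of 1000, which a receiving application may choose to reject.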
dwOverlap
This is defined as 'Decimation' in most implementations. It is very
commonly a power-of-two fraction of the window size (e.g. the standard overlap
in CARL pvoc is 1/8th of the window). However, dwOverlap is not
limited to such sizes. Both Csound and PVC typically use overlaps that
are not power-of-two fractions.
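The fAnalysisRate field follows directly from the sample rate and dwOverlap. A minimal sketch, with hypothetical function names, is given below; the duration estimate simply divides the number of frames per channel by the analysis rate and takes no account of the window length.

```c
#include <assert.h>

/* fAnalysisRate: analysis frames per second, per channel. */
double analysis_rate(double srate, unsigned overlap)
{
    assert(overlap > 0);
    return srate / (double)overlap;
}

/* Approximate duration in seconds represented by nframes frames
   (per channel) of analysis data. */
double duration_seconds(unsigned long nframes, double srate, unsigned overlap)
{
    return (double)nframes / analysis_rate(srate, overlap);
}
```

With a sample rate of 44100 and an overlap of 128 samples, the analysis rate is 344.53125 frames per second.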
dwFrameAlign
This is a slightly arguable inclusion. Since all analysis data consists of a pair of 32-bit words, all frames are automatically assured alignment to 8-byte boundaries (this is also the required alignment for SDIF files). However, for some platforms even this may not be ideal (for example, where alignment to 16-byte boundaries is preferred), in which case the usual odd number of analysis bins will require padding after the data. The dwFrameAlign field will store the full size of the analysis frame, plus any padding. Pad words should be written as zeroes. The example implementations do not currently test this value, or calculate padding. Developers using relevant platforms are invited to test and comment on this feature.
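As the example implementations do not yet calculate padding, the following is only a sketch of how a writer might derive dwFrameAlign for a chosen alignment. The function name and the align parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one analysis frame (nbins pairs of words of
   word_size bytes), rounded up to a multiple of align bytes. */
uint32_t frame_align_bytes(uint32_t nbins, uint32_t word_size, uint32_t align)
{
    uint32_t raw;
    assert(align > 0);
    raw = nbins * 2u * word_size;
    return (raw + align - 1u) / align * align;
}
```

For 513 bins of 32-bit floats the raw frame is 4104 bytes, which is already a multiple of 8. With 16-byte alignment the result is 4112, so 8 zero bytes would follow each frame.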
Custom Window Chunk.
Currently this is the most speculative element of the format, and comments
are invited.
It is very simple:
<PVXW>
<chunk-size in bytes, excluding tag and size field>
< window data, dwWinlen samples>
This may well not be adequate. Possible additions include a 4-byte ident (or a longer ASCII title), and a float field specifying amplitude, though normative practice would be to provide a normalized window (peak value = 1.0). Note that the WAVE-EX spec encourages all chunks to support 8-byte alignment.
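Since the Custom Window Chunk is not yet implemented in the example programs, the following is a speculative sketch of how its 8-byte header might be built. It assumes the usual RIFF convention of a little-endian 32-bit size field; the function name pvxw_header is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fills out[0..7] with the 'PVXW' tag and the little-endian chunk size
   (window data only, excluding tag and size field).
   Returns the total number of bytes the chunk occupies in the file. */
size_t pvxw_header(unsigned char out[8], uint32_t winlen, uint32_t word_size)
{
    uint32_t size;
    assert(out != NULL);
    size = winlen * word_size;
    memcpy(out, "PVXW", 4);
    out[4] = (unsigned char)(size & 0xFFu);
    out[5] = (unsigned char)((size >> 8) & 0xFFu);
    out[6] = (unsigned char)((size >> 16) & 0xFFu);
    out[7] = (unsigned char)((size >> 24) & 0xFFu);
    return 8u + (size_t)size;
}
```

A window of 1024 single-precision samples gives a chunk size of 4096 bytes and a total of 4104 bytes, which happens to preserve 8-byte alignment for the chunk that follows.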
No other chunks, apart from the data chunk itself, are required for
PVOC-EX. Where the renderer specifies floating-point samples, the PEAK
chunk can be used in the usual way. This will be especially relevant where
a custom window is used, as amplitude levels cannot be presumed. The Custom
Window Chunk is not yet implemented in the example programs.
The 'data' chunk.
Analysis frames are interleaved according to nChannels, i.e.:
for a stereo file:
<frame 0 Ch 0>
<frame 0 Ch 1>
<frame 1 Ch 0>
<frame 1 Ch 1>
etc...
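The interleaving shown above implies a simple calculation for locating a given frame within the 'data' chunk. The sketch below assumes that every frame occupies dwFrameAlign bytes, including any padding; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset, from the start of the 'data' chunk contents, of the
   analysis frame with index 'frame' for channel 'channel'. */
uint64_t frame_offset(uint64_t frame, uint32_t channel,
                      uint32_t nchannels, uint32_t frame_align)
{
    assert(nchannels > 0 && channel < nchannels);
    return (frame * nchannels + channel) * (uint64_t)frame_align;
}
```

In a stereo file with 4104-byte frames, frame 1 of channel 0 therefore starts 8208 bytes into the data.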
Frame amplitudes are expected to be normalized close to 1.0. Thus, where the source is a full-amplitude sinewave, the peak amplitude in the nearest bin will be close to 1.0. Later versions of this document will develop this aspect further. Suffice it to say here that both the CARL and Soundhack formats provide analysis windows in this form, while Csound and PVC require scale factors. The example implementations accompanying this release are based on the CARL distribution, and a further program, pvconv, demonstrates conversion from the current Csound format to the PVOC-EX format. It is important that this aspect of the format is tested thoroughly, as the format is required to support independent analysis and synthesis implementations.
File Extension
Although, strictly speaking, it cannot be illegal to give a file based on WAVE_FORMAT_EXTENSIBLE the .wav extension, there is little practical advantage in doing so. Most applications check the file format by reading the header, but some do not.
The recommended file extension for the PVOC-EX format is .pvx.
This is enforced by the example application pvocex2.
Example Implementations (command-line programs): pvocex, pvocex2, pvconv
These now both use the fast FFTW libraries, for a significant increase in speed (typically more than 25%), though the programs are much larger. Distributions for Linux as well as Windows are now available.
pvocex
This is a direct adaptation of the CARL 'pvoc' program, and uses C code entirely. Only mono files are supported. A majority of the CARL command flags is supported. The full range of time and pitch scaling can be applied.
Preliminary benchmarks: source is a 30-second mono floatsam file at SR=44100. Full analysis-resynthesis.
pvocex -N1024 infile.wav outfile.wav (default overlap 128 samples)
Pentium II 333MHz, 128MB RAM:
Windows2000 (VC++ 5.0): 12.24 secs
Linux (Redhat 6.0, pgcc -O6 etc): 13.67 secs
Pentium III 500MHz, 128MB RAM:
Linux (Redhat 6.1, gcc -O6 etc): 9.06 secs
pvocex2
This uses crude C++ wrappers to encapsulate the pvoc analysis and synthesis stages, partly to emphasize that, in principle, the two stages could be implemented independently. This version supports stereo files (it will eventually be extended to support generic multi-channel files), and a main subset of the CARL flags. Time and pitch scaling is supported. A flag is also available to report the format information of a PVOC-EX file.
Further development of PVOC-EX facilities will be concentrated in this program, and derivatives, so that pvocex itself can remain close to the CARL implementation. It follows that none of the code for pvocex2 is frozen, and will almost certainly change. In particular, the functions in the file pvpp.cpp should not be considered as library or API functions.
Preliminary benchmarks: source is a 30-second mono floatsam file at SR=44100. Full analysis-resynthesis.
pvocex2 -N1024 infile.wav outfile.wav (default overlap 128 samples)
Pentium II 333MHz, 128MB RAM:
Windows2000 (VC++ 5.0): 14.1 secs
Linux (Redhat 6.0, pgcc -O6 etc): 15.68 secs
Pentium III 500MHz, 128MB RAM:
Linux (Redhat 6.1, gcc -O6 etc): 10.19 secs
Source for pvocex mono analysis-resynthesis (accepts WAVE, AIFF and AIFF-C files):
pvocex_src012.zip (49KB). Includes project files for VC++ 5.0
pvocex_src-0.12.tar.gz (44KB)
Source for pvocex2 stereo analysis-resynthesis (accepts WAVE, AIFF and AIFF-C files):
pvocex2_src011.zip (45KB). Includes project files for VC++ 5.0
pvocex2_src-0.11.tar.gz (44KB)
Source for pvconv application (converts Csound and Soundhack analysis files to PVOC-EX):
pvconv_src01.zip (5KB). NB requires pvocex_src012.zip
Other conversion programs will probably become available in time.
Executables for Pentium Pro or better systems (pvocex, pvocex2 and pvconv):
Windows:
pvocex_bin012.zip (172KB)
pvocex2_bin011.zip (171KB)
pvconv_exe.zip (27KB)
Linux:
pvocex_bin-0.12.tar.gz (214KB)
pvocex2_bin-0.11.tar.gz (200KB)
[pvconv to come]
Richard Dobson 5th August 2000