Audio Post-Processing

When preparing audio for AI voice cloning, whether as training data or for inference, the goal is to preserve the natural character and clarity of the target voice while keeping the audio consistent and clean. Here’s a guide to subtle yet effective post-processing:

Achieve Consistent Volume: Use a volume rider or volume automation to even out levels across your entire dataset. The aim is a consistent overall level from one section to the next while preserving the natural dynamics within each section.
Smooth Out Peaks: Use a transparent compressor or limiter with a fast attack to smooth out the peaks within sections. Try to limit the dynamic range to around 5 dB.
Phase Balance Your Audio: Using a tool like iZotope RX, you can phase balance your audio to maximize headroom before normalization.
Normalize to -3 dB: After smoothing out peaks and phase balancing, normalize your dataset to -3 dB.
EQ for Cleanliness: EQ primarily to remove unwanted frequencies. Cut out low-end rumble, mid-range muddiness, or high-end hiss to prevent your model from learning those unwanted qualities. Be cautious not to overdo it; subtle cuts are often enough.
EQ to Match Vocal Style: If needed, make EQ adjustments to enhance the vocal character. Add a slight boost for ‘air’ in the high frequencies for clarity or a small boost in the lower frequencies for ‘weight.’ Remember, the goal is to retain the vocal’s natural tone, not to alter it significantly.
Prevent Clipping: Ensure the vocal track does not clip at any point. Clipping introduces digital distortion. Keep an eye on the meters to ensure levels stay below 0 dBFS.
No Time-Based Effects: Refrain from adding reverb, delay, or other time-based effects. These effects can obscure the clarity of the vocals and confuse the AI model’s pitch detection.
Avoid Hard Cuts: When editing, avoid hard cuts that can create abrupt starts or ends. Such cuts can introduce clicks or pops. Use smooth fades at the beginning and end of the vocal track for a more natural transition.
Don’t Layer Vocals: Multi-layered vocals can complicate the AI’s analysis. Stick to a single vocal track to ensure the AI can accurately process and learn from the recording.
No Copying and Pasting Sections: Avoid duplicating or artificially extending sections of the vocal track. The AI model benefits from the natural variation and imperfections of a continuous, unaltered performance.
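
The normalization, clipping, and fade steps above can also be checked in code. The following is a minimal sketch, not a replacement for a proper audio editor. It assumes samples are floats in the range -1.0 to 1.0, reads "normalize to -3 dB" as peak normalization (the guide does not say whether peak or loudness is meant), and uses a 16 kHz sample rate and 10 ms fades purely for the demo. All function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000  # assumed sample rate for the demo, not taken from the guide


def peak_dbfs(samples):
    """Peak level in dB relative to full scale (|sample| == 1.0)."""
    peak = max((abs(s) for s in samples), default=0.0)
    return -math.inf if peak == 0 else 20 * math.log10(peak)


def normalize_peak(samples, target_dbfs=-3.0):
    """Scale the audio so its loudest sample sits at target_dbfs."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / peak
    return [s * gain for s in samples]


def apply_fades(samples, fade_len):
    """Apply a linear fade-in and fade-out of fade_len samples each."""
    out = list(samples)
    n = min(fade_len, len(out) // 2)  # never let the two fades overlap
    for i in range(n):
        gain = i / n  # ramps from 0.0 toward 1.0
        out[i] *= gain
        out[-1 - i] *= gain
    return out


def count_clipped(samples, ceiling=1.0):
    """Count samples at or above full scale, a sign of clipping."""
    return sum(1 for s in samples if abs(s) >= ceiling)


# Demo: a one-second 440 Hz test tone standing in for a vocal take.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE)]
normalized = normalize_peak(tone, target_dbfs=-3.0)
faded = apply_fades(normalized, fade_len=SAMPLE_RATE // 100)  # 10 ms fades

print(f"peak before normalization: {peak_dbfs(tone):.2f} dBFS")
print(f"peak after normalization:  {peak_dbfs(normalized):.2f} dBFS")
print(f"clipped samples:           {count_clipped(faded)}")
```

Plain Python lists keep the sketch dependency-free. For real datasets, an array library and a dedicated audio I/O package would be the more practical choice.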

An example of what to avoid: note the tall peak on the left and the noisy silence on the right.

Your audio should look more like this: consistent levels and noiseless silence.
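
The two qualities described in these captions, even levels and quiet pauses, can also be estimated numerically rather than by eye. The sketch below is a rough illustration under stated assumptions: samples are floats in the range -1.0 to 1.0, the window length of 800 samples (50 ms at an assumed 16 kHz) is arbitrary, and the -60 dBFS threshold for "noiseless" is a hypothetical value, not one given by this guide. The function names are likewise made up for the example.

```python
import math
import random

NOISE_FLOOR_LIMIT_DBFS = -60.0  # hypothetical threshold, not from the guide


def rms_dbfs(window):
    """RMS level of a block of samples in dB relative to full scale."""
    if not window:
        return -math.inf
    mean_sq = sum(s * s for s in window) / len(window)
    return -math.inf if mean_sq == 0 else 10 * math.log10(mean_sq)


def window_levels(samples, window_len):
    """RMS level of each consecutive, non-overlapping window."""
    last_start = len(samples) - window_len
    return [rms_dbfs(samples[i:i + window_len])
            for i in range(0, last_start + 1, window_len)]


def noise_floor_dbfs(samples, window_len):
    """Level of the quietest window, a rough estimate of noise in pauses."""
    levels = window_levels(samples, window_len)
    return min(levels) if levels else -math.inf


# Demo: half a second of tone followed by half a second of "silence".
rate, window = 16_000, 800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
rng = random.Random(0)  # seeded so the demo is repeatable
noisy = tone + [rng.uniform(-0.01, 0.01) for _ in range(rate // 2)]
clean = tone + [0.0] * (rate // 2)

for name, audio in (("noisy", noisy), ("clean", clean)):
    floor = noise_floor_dbfs(audio, window)
    verdict = "ok" if floor < NOISE_FLOOR_LIMIT_DBFS else "noisy silence"
    print(f"{name}: noise floor {floor:.1f} dBFS ({verdict})")
```

The list returned by `window_levels` can be scanned the same way for a single window that is far louder than its neighbors, which is how a tall peak would show up.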
