30-60 minutes of clean, varied audio produces the highest-quality voice models.
Audio for the highest-quality voice models is recorded:
with a quality microphone into an audio interface
And processed:
with consistent dynamics across the whole dataset
with light EQ to remove any muddiness, hiss, etc.
with compression/limiting to smooth out peaks
with no reverb, delay, or doubling
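
As a rough way to check a dataset against the length and consistency points above, here is a minimal Python sketch. It assumes the recordings are 16-bit PCM WAV files. The function names (rms_dbfs, wav_stats, summarize), the 6 dB spread threshold, and the -120 dB floor for silence are illustrative assumptions, not part of the guidance above. It measures only total duration and per-file RMS level, so it cannot detect reverb, hiss, or muddiness.

```python
import math
import struct
import wave


def rms_dbfs(samples):
    """RMS level of 16-bit samples in dB relative to full scale."""
    if not samples:
        return -120.0
    mean_square = sum(s * s for s in samples) / len(samples)
    rms = math.sqrt(mean_square) / 32768.0
    # Digital silence has no defined dB value; use an assumed floor instead.
    return 20.0 * math.log10(rms) if rms > 0 else -120.0


def wav_stats(path):
    """Return (duration_seconds, rms_dbfs) for one 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wf.getframerate()
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)
    # Unpack little-endian signed 16-bit samples (all channels interleaved).
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return n_frames / rate, rms_dbfs(samples)


def summarize(stats, min_minutes=30.0, max_minutes=60.0, max_spread_db=6.0):
    """Check total length and level consistency across (duration, level) pairs."""
    total_minutes = sum(duration for duration, _ in stats) / 60.0
    levels = [level for _, level in stats]
    spread = max(levels) - min(levels) if levels else 0.0
    return {
        "total_minutes": total_minutes,
        "duration_ok": min_minutes <= total_minutes <= max_minutes,
        "level_spread_db": spread,
        # max_spread_db is an assumed threshold; tune it to taste.
        "levels_consistent": spread <= max_spread_db,
    }
```

Given a list of file paths, the report could be produced with summarize([wav_stats(p) for p in paths]). A large level spread suggests the files need gain matching or compression before training.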