Target speaker extraction (TSE) is essential for various speech processing applications, particularly in complex acoustic environments. However, current TSE systems lack robustness under real-world conditions due to limited diversity in training data and unrealistic noise conditions. To address these challenges, we first construct Libri2Vox, a new dataset that combines clean target speech from LibriTTS with interference speech from VoxCeleb2, which contains real acoustic variations, channel effects, and ambient conditions. To further increase speaker variability, we augment Libri2Vox with synthetic speakers generated by two speaker anonymization methods: SynVox2 and SALT (speaker anonymization through latent space transformation). We then propose a three-stage curriculum learning approach: after training a seed TSE model on real data, synthetic speakers are progressively introduced at varying levels of speaker similarity. Experiments with four different neural TSE models show that Libri2Vox's rich acoustic variations, together with the curriculum-based integration of synthetic speakers, consistently improve performance across common evaluation metrics. We also confirm that choosing a proper ratio of synthetic to real speakers is important for maximizing these gains.