Hi, I read the paper and noticed that the architecture uses a composable but fixed HiFi-GAN vocoder for the final speech synthesis. Is there any possibility of incorporating this component into the final training objective so it can also learn the speaker's timbre and intonation?
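
To make the question concrete, here is a rough PyTorch-style sketch of what I have in mind: unfreezing the vocoder and backpropagating a waveform-level loss through it alongside the acoustic model. The module and variable names (`AcousticModel`, `HiFiGANGenerator`, etc.) are placeholders I made up, not the actual code from this repo, and the loss is a simplified stand-in.

```python
# Hypothetical sketch (not the paper's code): jointly fine-tuning the vocoder
# with the acoustic model by backpropagating a waveform-level loss through it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticModel(nn.Module):
    """Stand-in for the text-to-mel model (outputs an 80-bin mel spectrogram)."""
    def __init__(self, in_dim=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(in_dim, n_mels)

    def forward(self, x):                      # x: (B, T, in_dim)
        return self.proj(x).transpose(1, 2)    # (B, n_mels, T)

class HiFiGANGenerator(nn.Module):
    """Stand-in for the HiFi-GAN generator (mel -> waveform, 256x upsampling)."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.up = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop, stride=hop)

    def forward(self, mel):                    # mel: (B, n_mels, T)
        return self.up(mel).squeeze(1)         # (B, T * hop)

acoustic = AcousticModel()
vocoder = HiFiGANGenerator()

# Instead of freezing the vocoder, give it a (smaller) learning rate so the
# waveform-level loss also adapts its weights to the target speaker.
optim = torch.optim.Adam([
    {"params": acoustic.parameters(), "lr": 1e-4},
    {"params": vocoder.parameters(), "lr": 1e-5},   # unfrozen, fine-tuned
])

def training_step(text_feats, target_wav):
    mel_pred = acoustic(text_feats)
    wav_pred = vocoder(mel_pred)               # gradients flow through the vocoder
    # Plain L1 waveform loss as a placeholder; in practice a multi-resolution
    # STFT / mel-reconstruction loss (plus the GAN losses) would go here.
    n = min(wav_pred.size(-1), target_wav.size(-1))
    loss = F.l1_loss(wav_pred[..., :n], target_wav[..., :n])
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Dummy batch just to show the call shape.
loss = training_step(torch.randn(2, 50, 256), torch.randn(2, 50 * 256))
print(f"joint loss: {loss:.4f}")
```

Would something along these lines be compatible with the current training setup, or does keeping HiFi-GAN fixed matter for stability / quality here?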