Expensive in terms of compute? I figured I could do it in short batches, still using Colab. I’d imagine I could reconfigure some of the VQGAN+CLIP code and plug in code someone else has written for converting speech to text, but the text would have to be fed into the GAN in real time (or close enough to it).
Although actually it wouldn’t need to be real time at all. It would just need to be dynamic, i.e. the text input parameter would cycle forward with the ongoing speech, even if with a time delay.
That is, we may need each iteration to run on a different sequence of words, in a continuous fashion.
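
Roughly what I have in mind, as a minimal Python sketch. It assumes the speech has already been transcribed into a plain string, and it only works out which words each iteration would see. The names `prompt_schedule`, `window`, `iters_per_step` and `stride` are made up for illustration and are not from the VQGAN+CLIP notebook; the actual optimisation step would go where the comment is.

```python
def prompt_schedule(transcript, window=6, iters_per_step=25, stride=1):
    """Yield (iteration, prompt) pairs by sliding a word window over a transcript.

    window         -- number of words in each prompt (hypothetical parameter)
    iters_per_step -- how many iterations reuse a prompt before the window moves
    stride         -- how many words the window advances each time
    """
    words = transcript.split()
    if not words:
        return
    # Last valid start index; 0 if the transcript is shorter than the window.
    last_start = max(len(words) - window, 0)
    iteration = 0
    for start in range(0, last_start + 1, stride):
        prompt = " ".join(words[start:start + window])
        for _ in range(iters_per_step):
            yield iteration, prompt
            iteration += 1


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    for i, prompt in prompt_schedule(text, window=4, iters_per_step=2):
        # Placeholder: here the current prompt would be encoded and
        # one optimisation step of the image would run against it.
        print(i, prompt)
```

With a small `stride`, consecutive prompts overlap heavily, so the target should drift gradually rather than jump. If the transcript arrives with a delay, the same loop could simply wait for more words before moving the window forward.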