Hello,
Is it possible to obtain metadata about the generated speech? I would like to get the start and end timestamps of each word as it is pronounced in the audio. I cannot find anything about this in the documentation.
If not, are there any plans to add support for it?
Thanks!