June 29, 2026•3 min read•from Machine Learning

I'm trying to implement CALM paper, and I have some questions. [P]

Our take

Implementing advanced text-to-speech models like Pocket TTS presents substantial challenges, as this post shows. Early struggles on small datasets like LJSpeech, where the training loss falls but the generated speech stays largely unintelligible, are common. Scaling to larger datasets like LibriSpeech introduces a different set of trade-offs, notably between voice cloning fidelity and how faithfully the speech follows the input text. Consider exploring the insights from "I built a demo agricultural planning system with an AI advisor," which highlights the power of leveraging diverse data sources.

The challenges faced by u/No-Motor-6274 in implementing the Pocket TTS model highlight a persistent hurdle in the rapidly evolving field of generative AI: replicating state-of-the-art results often requires resources and datasets far beyond the reach of individual researchers or smaller teams. Their struggle to achieve meaningful speech generation, even with relatively modest parameter sizes and datasets like LJSpeech and LibriSpeech, underscores the importance of scale in modern text-to-speech (TTS) models. The observed tradeoffs between text quality and voice cloning fidelity, alongside the unstable training dynamics (spiky loss and exploding gradients), are common indicators of a system struggling to converge, often a consequence of insufficient data or architectural nuances not fully understood. It’s encouraging to see the community grappling with these complexities, as evidenced by discussions around recursive self-improvement [What do you think of Recursive Self Improvement ? [D]], demonstrating a growing interest in tackling the challenges inherent in creating truly advanced AI systems.

The core issue likely lies in the vast disparity between the data used to train the original Pocket TTS model (88,000 hours of publicly available data) and the datasets u/No-Motor-6274 experimented with. While LJSpeech and LibriSpeech are valuable resources, their scale is simply not comparable. This echoes a broader trend in AI development, where performance frequently correlates directly with the size of the training dataset. The user's experimentation with different techniques, such as scheduled sampling and noise addition, further illustrates the iterative and often frustrating process of fine-tuning these complex architectures. Their assertion that increasing the dataset is the next logical step is likely correct, but the concern about GPU costs is a valid and common one for many practitioners. The exploration of agricultural planning systems using NASA data [I built a demo agricultural planning system with an AI advisor for small-scale farmers in Nicaragua using NASA data [p]] shows how even seemingly unrelated datasets can be leveraged for AI development, suggesting creative solutions for data augmentation might be worth investigating. Furthermore, the focus on historical swordfighting and dataset creation [I do historical swordfighting and noticed AI struggles to track it. I’m building an open dataset to help fix this. Does my schema make sense? [P]] highlights the need for specialized datasets to improve AI performance in niche areas.
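To make the scale gap concrete, here is a rough back-of-the-envelope calculation. The 88,000-hour figure is the one quoted in the post; the per-dataset hours are approximate, commonly cited sizes and are assumptions here, as is the choice of LibriSpeech subsets, since the post does not say which clean subset was used.

```python
# Reported training-set size quoted in the post.
PAPER_HOURS = 88_000

# Approximate dataset sizes in hours (assumed, rounded values).
approx_hours = {
    "LJSpeech": 24,
    "LibriSpeech train-clean-100": 100,
    "LibriSpeech train-clean-360": 360,
}

# How many times more audio the reported training set contains.
ratios = {name: PAPER_HOURS / hours for name, hours in approx_hours.items()}

for name, ratio in ratios.items():
    print(f"{name}: reported training set is roughly {ratio:.0f}x larger")
```

Even the larger clean subset is more than two orders of magnitude smaller than the reported training set, which is consistent with the poster's suspicion that data scale is a limiting factor.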

The fact that u/No-Motor-6274 successfully extracted the Mimi Audio Encoder from the original model is a testament to their technical skill and determination. However, replicating the entire system's performance is proving difficult, indicating that the success of Pocket TTS likely stems from more than just the encoder architecture. It's probable that specific training strategies, data preprocessing techniques, or architectural details not fully documented in the paper are contributing significantly to the model's capabilities. The exploding gradient norm and spiky loss curves, especially when combined with the observed tradeoffs, suggest a potential instability in the overall training process, perhaps related to the interplay between the text, audio, and latent representations. Thoroughly reviewing the paper again, paying close attention to implementation details and potential regularization techniques, remains a worthwhile endeavor.

Ultimately, u/No-Motor-6274’s experience serves as a cautionary tale and a valuable learning opportunity for the AI community. It emphasizes the importance of understanding the full scope of resources required to replicate state-of-the-art results and the iterative, often unpredictable nature of generative AI development. The question now becomes: as these powerful models become increasingly reliant on massive datasets and specialized hardware, how can we democratize access to the tools and resources needed to innovate and contribute to this rapidly evolving field? Will we see the rise of federated learning approaches or more efficient training techniques that allow researchers with limited resources to effectively participate in the development of next-generation TTS models, or will the barrier to entry continue to rise?

Hello, I'm trying to implement Pocket TTS by kyutai-labs, which is described in this paper. Since they haven't released the training/fine-tuning code, I'm implementing it on my own as a learning exercise. I have read the paper and tried to implement it with far fewer parameters and a smaller amount of data. I trained this text-to-speech model on the single-speaker LJSpeech dataset (1) and on the LibriSpeech clean subset (2), but it is failing badly.

For (1), since it's a single-speaker dataset, I didn't add voice cloning, just text and target latents. The flow matching loss went down to roughly 0.20 (MSE), and the EOS loss became very low, at (x)e-(y) levels. But when I run inference with the checkpoint saved at epoch 2800, it barely generates intelligible speech, even for text from its training set. I tried different techniques, such as scheduled sampling to reduce exposure bias (the model sometimes hallucinated and repeated the same phrase twice), but that didn't work. Adding standard Gaussian noise to the ground-truth latents didn't work either. After struggling with many implementation variants, I decided to move on to the considerably larger LibriSpeech dataset, because I suspected the scale of the data was too small.
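For readers unfamiliar with the terms, here is a minimal sketch of a flow matching loss and of noising ground-truth latents, in plain Python. It assumes a generic straight-line (rectified-flow style) path between noise and the target frame, which may differ from the paper's exact objective; `predict_velocity` is a hypothetical stand-in for the model head, and `noisy_context` is a hypothetical helper.

```python
import random

def noisy_context(latents, std, rng):
    """Add zero-mean Gaussian noise to each ground-truth latent frame."""
    return [[v + rng.gauss(0.0, std) for v in frame] for frame in latents]

def flow_matching_loss(x1, predict_velocity, rng):
    """MSE between predicted and target velocity for one target frame x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                # noise endpoint
    t = rng.random()                                      # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the straight path
    target = [b - a for a, b in zip(x0, x1)]              # constant velocity x1 - x0
    pred = predict_velocity(xt, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)

rng = random.Random(0)
frame = [0.5, -1.0, 2.0]
# Untrained stand-in that always predicts zero velocity.
loss = flow_matching_loss(frame, lambda xt, t: [0.0] * len(xt), rng)
print(f"loss with a zero predictor: {loss:.3f}")
```

Note that this loss is computed with teacher forcing on ground-truth context, so a low value does not by itself guarantee stable autoregressive generation at inference time.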

For (2), I read the paper again. This time I used no scheduled sampling, added the head multiplication, etc., and trained the paper's setup on the LibriSpeech dataset. I tried the layout audio condition + text tokens + BOS + target latents, and I also swapped the audio prompt with the text tokens. I observed a tradeoff in this setup: if I put the text tokens next to the target latents, the model follows the text better but the voice is not even close to the audio prompt; if I put the audio condition tokens next to the target latents, I get better voice cloning but gibberish speech. I also found that the loss is very spiky and the grad norm is exploding, as you can see in the images below.
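The two orderings described above can be written out explicitly. This is only an illustrative sketch with placeholder string tokens; `build_sequence` and the token names are hypothetical and are not taken from the paper or from the Pocket TTS code.

```python
def build_sequence(audio_prompt, text_tokens, target_latents, text_last=True):
    """Concatenate conditioning segments, BOS, and targets in one of two orders."""
    if text_last:
        # audio condition + text tokens: text sits next to the targets
        prefix = list(audio_prompt) + list(text_tokens)
    else:
        # swapped: text tokens + audio condition: audio sits next to the targets
        prefix = list(text_tokens) + list(audio_prompt)
    return prefix + ["<BOS>"] + list(target_latents)

audio = ["a0", "a1"]
text = ["t0", "t1", "t2"]
latents = ["z0", "z1"]

layout_a = build_sequence(audio, text, latents, text_last=True)
layout_b = build_sequence(audio, text, latents, text_last=False)
print(layout_a)
print(layout_b)
```

In both layouts the model sees the same information; only the distance between each conditioning segment and the target latents changes, which is the variable behind the reported tradeoff.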

[Image: loss and lr values for setup 1 (LJSpeech)]

[Image: values for setup 2 (LibriSpeech)]
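Since the grad norm is reported as exploding, it may help to show what measuring and clipping the global gradient norm looks like. The sketch below is plain Python over nested lists and is only an illustration of the general technique; whether the original training recipe uses clipping is not stated in the post. In PyTorch, `torch.nn.utils.clip_grad_norm_` performs the same kind of global-norm clipping and returns the pre-clipping norm.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, given one list of floats per parameter."""
    return math.sqrt(sum(g * g for param in grads for g in param))

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their global L2 norm is at most max_norm.

    Returns the (possibly scaled) gradients and the norm before clipping,
    which is the value worth logging to spot spikes.
    """
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [[g * scale for g in param] for param in grads], norm

grads = [[3.0], [4.0]]  # toy gradients with global norm 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(f"norm before: {norm_before:.2f}, after: {global_grad_norm(clipped):.2f}")
```

Logging the pre-clipping norm per step alongside the loss makes it easier to see whether loss spikes follow gradient spikes or the other way around.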

I used Pocket TTS's original Mimi Audio Encoder, which I extracted from the original model.

What are your suggestions? Should I read the paper over and over again? Should I increase the amount of data by collecting it from different sources (the authors say they used 88,000 hours of publicly available data)? Is there a system design problem? All training runs were performed on an RTX 5080 desktop GPU.

I want to move on to a bigger dataset, but I can't burn GPU credits on a run that may not deliver the expected results. When should I increase the dataset and start training on bigger clusters that could give me satisfactory results?

submitted by /u/No-Motor-6274