Conversational AI foundational to CX in 2026


SOURCE: NOJITTER.COM
JAN 30, 2026

Matt Vartabedian, Senior Editor

January 28, 2026

11 Min Read

Conversational AI has been around for decades, going back to the first instances of automatic speech recognition (ASR) and natural language understanding and processing (NLU/NLP). These technologies remain the plumbing of today’s voice AI, which increasingly relies on generative AI to interact with people in a more human-like fashion.

“Conversational AI is any system designed to engage users in natural, multi-turn dialogue whether through text or voice,” wrote Mila D’Antonio, a principal analyst with Omdia in an emailed statement. “These systems can have a back-and-forth conversation with a human.”

The problem with voice systems in customer experience

The first interactive voice response (IVR) systems used dual-tone multi-frequency (DTMF) input, where callers could only press buttons – one for sales, two for service, etc. In the 1990s, subsequent generations of IVRs used ASR and NLU/NLP to allow callers to state why they were calling in a more natural manner. Today, these systems coexist (press or say one, etc.). Speech-enabled IVRs became the norm for providing 24x7 coverage and deflecting calls to self-service.


Over time, these systems got better at recognizing words and were able to provide some verbal responses back to callers. Despite being conversational in nature – responsive to turn-by-turn dialog – these systems were built on rules and scripts. This involved programming intents (why someone’s calling) and entities (names, locations, persons, products) and identifying simple, common service issues, figuring out different ways customers would ask questions about those issues, training the machine learning (ML) engine to recognize them and then supply the caller with the correct response.

“I could build you one question and answer pair in six weeks. Then we’d have to test and tune it, and I could give you another pair in six more weeks,” said Max Ball, a principal analyst with Forrester. But then, if a customer wanted to climb back up the dialog tree, “you’d either spend months building error handling to do that – or the whole thing would melt down because it was so brittle.”

If customers used words the system wasn’t programmed to deal with, or it simply didn’t understand the caller, the automated system would transfer the call to a human agent.

“Rules-based dialog design systems required lots of resources, people, money and time,” said Derek Top, Principal Analyst and Research Director for Opus Research. “And there was the question of it being worth it, since you’re only sending some customers into the automated system and you’re not even getting all the questions because the long-tail questions are never going to be answered.”


Long-tail questions are those inquiries that are rarely asked and thus never programmed.

How LLMs change the IVR experience

Large language models (LLMs) are changing that game because they are “smarter” than previous generation ML models, despite relying on the same ASR, NLU/NLP plumbing. “Modern conversational AI is powered by LLMs which enable reasoning, context and open dialogue,” wrote D’Antonio. “When conversational AI is delivered through speech, a combination of speech recognition, speech to text and NLP is involved – that’s voice AI. Basically, voice AI is conversational AI with a speech interface.”

Conversational AI describes the multi-turn experience, which could just as easily happen in a text chat or email as with voice. These are different channels or modes. Multi-turn was possible with pre-LLM technologies; it was just laborious to program. Since LLMs are better at understanding and processing language, some of the human labor needed to build conversation trees is essentially automated, and the customer-facing interaction with the IVR is more human-like.


There is a difference between generative AI-based systems, which basically provide a more human-like interaction, and agentic AI systems, which replicate, to an extent, what a human contact center agent can do.

“I can tell an agentic system: you’re a friendly claims adjuster. Get these 10 pieces of information from the customer, then put that into our claim system. Your goal: Set up a claim and give the customer a confirmation number,” Ball said. “I did this in 1993, but it was all screen scraping – going to the green screen and specifying certain fields – and I had to detail every sub-step in the entire process.”

With agentic AI, the user provides a high-level goal along with a series of instructions on how to get there, then the agentic AI system “just goes and does it. If it finds that it doesn’t have something, it can go back to the [agent or customer] and say that it needs an additional piece of information. You don’t need to explicitly program it to do that,” Ball said, with the caveat that there is very little current deployment of agentic AI in contact centers.

How the voice AI model is becoming faster and more accurate

Regardless of how agentic the AI system is, the voice AI model converts speech from analog to digital, breaks the audio into phonemes and runs those through a model that compares them to common sentences, words and phrases, which are then turned into text for the LLM to analyze. The process runs in reverse when the system responds to the caller, with an added step where a voice is synthesized.

While this describes the traditional (cascading) approach to voice AI, newer speech-to-speech models skip the intermediate text transcription step. Additionally, streaming voice AI models can process speech as it arrives, allowing them to listen continuously to the speaker and thus avoid the traditional turn-taking (batch) approach.
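The cascading approach described above can be sketched in a few lines of Python. Every function here is an illustrative stub standing in for a real ASR, LLM or TTS service, not a real vendor API; the point is only the shape of the pipeline: speech in, text through the LLM, speech back out.

```python
# Hypothetical sketch of the cascading voice AI pipeline.
# All three stages are stubs; real systems call ASR, LLM and TTS services.

def transcribe(audio_frames):
    """ASR stage: digitized audio -> phonemes -> text (stubbed here)."""
    return "my computer screen is broken"

def generate_reply(transcript):
    """LLM stage: analyze the transcript, produce a text response (stubbed)."""
    return f"Sorry to hear that. Let's troubleshoot: {transcript!r}."

def synthesize(text):
    """TTS stage: turn the reply text back into audio (stubbed as bytes)."""
    return text.encode("utf-8")

def handle_turn(audio_frames):
    # The cascade: speech -> text -> LLM -> text -> synthesized speech.
    transcript = transcribe(audio_frames)
    reply_text = generate_reply(transcript)
    return synthesize(reply_text)
```

A speech-to-speech model, by contrast, would collapse the three stages into a single model call with no intermediate transcript.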

Regardless, accuracy is critical. Speech-to-text faces many challenges: audio line and microphone quality; background noise; accents, dialects, languages and code-switching (alternating between languages); speaker age, gender and ethnicity; understanding context; detecting emotion and nuance; the caller interrupting or revising what they said; and the caller adding or revising intents. Gladia’s explainer provides a good overview of common accuracy metrics like word error rate and word accuracy rate.
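Word error rate, the metric mentioned above, is typically computed as the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and the recognizer’s hypothesis, divided by the number of reference words. A minimal implementation looks like this:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# If the caller said "my computer screen is broken" and the system heard
# "my computer is broke", that is one deletion plus one substitution
# against five reference words: WER = 2/5 = 0.4.
```

Word accuracy rate is simply 1 minus WER, which is why vendors quoting "95% accuracy" are usually describing a 5% word error rate.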

“There always was, and there still is, a competitive battle to say who’s the most accurate because you still run into the problem of the systems not understanding every word,” Top said. “That’s a big frustration with IVRs where people ‘agent out,’ because it couldn’t understand or do what they wanted.”

Latency – how long it takes for the model to understand and generate a reply – is another issue. The rule of thumb is 300 milliseconds; that’s the average, natural pause length in human conversation. If the pause stretches too long, the caller grows frustrated.

Traditional voice AI processes speech in batches – i.e., it waits for the caller to finish speaking. Streaming models process speech as the caller talks, which offers lower latency relative to the batch approach. Model size and where the model resides can also affect latency: a small model hosted close to the caller will likely respond faster than a large model hosted far away. Concurrency – the number of simultaneous audio streams – is also a key production metric that can impact overall reliability and latency.
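The latency arithmetic behind that batch-versus-streaming difference can be sketched with a toy model. The chunk size, inference time and function names below are illustrative assumptions, not measurements from any real system; the point is that streaming keeps only the final chunk plus one inference step on the critical path.

```python
# Toy latency model: how long after the caller stops talking until the
# system can start replying, for batch vs. streaming processing.
CHUNK_MS = 200  # assumed audio chunk duration

def batch_first_response_ms(chunks, inference_ms=250):
    # Batch: capture the whole utterance first, then run inference once.
    return len(chunks) * CHUNK_MS + inference_ms

def streaming_first_response_ms(chunks, inference_ms=250):
    # Streaming: inference overlaps with capture, so only the last
    # chunk plus one inference step remains on the critical path.
    return CHUNK_MS + inference_ms

utterance = ["chunk"] * 10  # a two-second utterance
print(batch_first_response_ms(utterance))      # 2250 ms
print(streaming_first_response_ms(utterance))  # 450 ms
```

Under these assumed numbers, streaming brings the response delay near the 300-millisecond conversational pause cited above, while the batch approach overshoots it badly on longer utterances.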

“When it comes to voice, there's a cadence and timing, as well as expectations around when to answer and when to stop,” Top said. “The more accurate the model, the longer the latency. That's a simple way to think about it.”

You can build it, but will enterprises use it?

Klearcom built a SaaS platform that allows its customers to test and validate speech in more than 100 countries, both before deployment and after. “AI can be dazzling in a demo, but [consider] real customers with accents, spikes in volume, edge cases and connectivity issues and your CCaaS making API calls into other systems – all of these things have so many links and applications that if there are issues will affect your customers’ experience,” said Liam Dunne, Co-founder and CEO of Klearcom.

Moreover, enterprises may be oblivious to the poor experiences customers may be having with a self-service system. “Enterprises often don’t even know about customer experience [CX] issues because they can only see what’s happening on their own corporate network,” Dunne said. “The CCaaS provider is the exact same, as is the carrier. It’s like a Russian doll.”

Aside from the voice portion, businesses also need to make sure the LLM “brain” can do whatever the caller is asking about – schedule an appointment, authorize the return, book the weekend stay. This involves well-maintained integrations to back-end systems and flawless execution. If not, the business has just reinvented the “agent, agent” refrain.

Voice AI models can also be fine-tuned to fit industry use cases, which can improve accuracy. A medical practice, a sporting goods retailer and a hospitality chain probably have some overlapping questions, and certainly the same need to understand what customers are saying, but specific inquiries diverge. A customer is more likely to schedule lab tests with a clinic than to return sneakers or book a weekend stay.

Voice AI models come from various players. The foundation AI companies – OpenAI, Google, Amazon, Anthropic – all have their own voice models. Pure play vendors specializing in voice include Amelia (bought by SoundHound AI), Cartesia, Deepgram, ElevenLabs, Hume, Omilia, PlayAI (bought by Meta), PolyAI, Parloa and Voximplant.

Contact center and CX vendors sometimes have their own voice models, or they partner with foundation model or pure play vendors. Regardless, conversational voice AI models or automation do not have to be a part of the contact center platform.

Klearcom’s Dunne described two main stages of conversational AI, the first being the decades-old ML-based systems. The industry is at a point now where it’s talking about ripping out those existing systems and replacing them with generative AI-based solutions.

“Conversational AI is what a traditional IVR was built to do – steward a call in the right direction in the right way. But the technology limited that, and customers had to listen to messaging before being able to press a button to move on,” Dunne said. “Now, a customer can just say, ‘hey, my computer screen is broken,’ and that cuts through them having to listen to three different messages. They’re not sitting on the call, and they feel like they’re progressing sooner.”

Hard benefits can accrue from replacing a traditional DTMF-based IVR. One of Klearcom’s customers has saved more than $20 million a year in telephony costs alone, Dunne said.

Enterprises can roll out AI voice, but will customers embrace it?

Top anticipates that consumer resistance to using conversational AI systems will continue declining. “The way people always wanted to get out of these systems is decreasing based on repetition and age. Younger people tend to be okay with it,” Top said. Additionally, he expects voice AI and agentic capabilities will continue improving.

Ball, however, disagreed. “I think we'll start seeing more truly generative bot interactions, voice or text. And I think consumers will be reluctant, because how often do you relish the idea of talking to a chat bot?”

Another analyst offered an optimistic prediction for the coming year. “One of the things that's likely to happen is somebody will build the killer voice to voice model application – an AI agent customer service app that is so good it will blow people's minds,” Coshow said. “It will be as good as an AI agent that is using Gemini 3; it’ll bring in just the right data and always give you the right answer, but it's going to sound like Miles.”

Miles/Maya is a conversational voice research project from Sesame with the goal of achieving “voice presence – the magical quality that makes spoken interactions feel real, understood and valued.” Sesame’s Miles/Maya demo is persistently offline, but this YouTube video compares Miles with OpenAI’s ChatGPT Advanced Voice Mode.

“Miles is very human in the way [Sesame] makes you feel like you can hear him thinking – and I just called it a ‘him,’ by the way,” Coshow said. “I know it’s just a text output from an LLM, but they put these embeddings in to tell the model how to produce the sound: ‘pause, say hmm, and then start with well.’ Then you hear Miles do that.”

Expect to see voice AI as the center of CX

Whether or not AI voices enchant customers is perhaps beside the point. “Conversational AI and voice AI are becoming foundational to how contact centers are modernizing their digital and voice channels,” D’Antonio wrote. “Together, these systems are helping to transform contact centers from transactional engagement to experiential.”

D’Antonio expects to see conversational AI and voice AI eventually become the operational backbone of customer engagement. “We will start to see customer engagement platforms, CCaaS and conversational AI effectively converging into one category centered on intelligent orchestration,” she wrote, such that the same AI that resolves a support issue will be “expected to personalize marketing, enable conversational commerce and drive revenue-generating interactions.”

Disclosure: Informa TechTarget, the publisher behind No Jitter, also owns Omdia. Informa TechTarget has no influence over No Jitter’s coverage.


About the Author

Matt Vartabedian

Senior Editor

As the Senior Editor for No Jitter, Matt covers AI (predictive, generative and agentic AI) as it pertains to the enterprise communications space.