Evaluating AI Support Vendors: 15 Questions Every Head of Support Should Ask
The definitive buyer's checklist for AI support — 15 questions on pricing, quality, escalation, and security with good/bad benchmarks.
You have budget approval. You have a shortlist. Now you are sitting across from vendor sales teams, and every one of them has a polished demo showing their AI resolving tickets flawlessly. The demo is not the product. The questions you ask — and the answers you demand — are what separate a good vendor decision from a costly mistake.
This is a checklist of 15 questions that every Head of Support should ask during an AI support vendor evaluation. For each question, we define what a good answer looks like and what should raise a red flag. These are not theoretical — they are drawn from real evaluation processes at mid-market and enterprise support organizations.
Print this out. Bring it to your next demo. Score every vendor against it.
The Evaluation Table
| # | Question | Good Answer | Red Flag |
|---|---|---|---|
| 1 | How do you measure the quality of AI responses? | Per-response quality scoring across multiple dimensions (accuracy, completeness, tone, policy compliance). Automated self-evaluation on every response. | "We use CSAT scores" or "Our accuracy is 95%" without explaining how accuracy is measured. |
| 2 | What is your pricing model, and what is the total cost at my volume? | Clear, published pricing. Willingness to model cost at your specific ticket volume. Transparent per-ticket or per-resolution cost. | "Contact sales for pricing" with no ballpark. Annual contract with unclear overage charges. |
| 3 | How long does it take to go live with real tickets? | Specific timeline: "30 minutes to first response" or "2 weeks to full deployment." Backed by customer references. | "It depends" without specifics. Any timeline over 8 weeks for a standard deployment. |
| 4 | What happens when the AI cannot answer a question? | Defined escalation triggers. Configurable confidence thresholds. Full context passed to the human agent. Measurable escalation rate. | "The AI always tries to answer." No configurable escalation logic. Context is lost in handoff. |
| 5 | How do you handle AI hallucinations? | RAG-grounded responses with citation to source documents. Automated hallucination detection. Response withheld when confidence is low. | "Our model doesn't hallucinate" or no specific hallucination mitigation strategy. |
| 6 | What integrations do you support, and how deep are they? | 30+ integrations with specific documentation on data flow for each. Bi-directional sync with major help desks. | "We integrate with Zendesk" without specifics. One-way data push only. Fewer than 10 integrations. |
| 7 | Who manages the AI after deployment? | Clear ownership model — either vendor-managed or defined customer responsibilities with tooling to support them. | Ambiguous ownership. "Your team manages it" without adequate tooling or documentation. |
| 8 | What security certifications do you hold? | SOC 2 Type II at minimum. Clear data residency policies. No training on customer data without explicit opt-in. PII detection and redaction. | No SOC 2. Vague data handling policies. Model trained on customer data by default. |
| 9 | Can I see real performance data from a similar customer? | Named case studies with specific metrics: resolution rate, handle time reduction, quality scores, escalation rate. | Only aggregate statistics. No named references. "We can't share due to NDAs" for every customer. |
| 10 | What is your escalation false-negative rate? | A specific number (e.g., "Less than 3% of tickets that should have been escalated were not"). Methodology for measuring it. | "I don't know" or "We don't track that." |
| 11 | How do you handle knowledge base updates? | Automatic re-ingestion on a defined schedule (hourly, daily). Manual refresh option. Change detection and versioning. | Manual re-upload required. No version control. Updates take days to propagate. |
| 12 | What is the contract term, and what are the exit terms? | Month-to-month or per-ticket with no minimum term. Or annual with a 30-day termination clause. Data export included. | 24-month minimum. Auto-renewal with 90+ day notice requirement. No data portability. |
| 13 | How do you handle multi-language support? | Specific list of supported languages. Quality metrics broken down by language. Native language processing, not just translation. | "We support all languages" without specifics. Translation-only approach for non-English. |
| 14 | What happens if your company is acquired? | Change-of-control clause in the contract. Commitment to 12+ months of service continuity. Data portability guarantee. | No change-of-control clause. No continuity commitment. (See what happened with Forethought's acquisition by Zendesk.) |
| 15 | How do you evaluate and improve over time? | Continuous learning from resolved tickets. Regular quality reports. Proactive recommendations for knowledge base gaps. Defined improvement cadence. | "The model improves automatically" without specifics. No structured improvement process. No reporting. |
Deep Dive: Each Question Explained
Question 1: How Do You Measure Quality?
This is the single most important question, and it is the one most vendors fumble.
The industry standard for years has been CSAT — a survey sent after ticket resolution. The problems with CSAT for AI evaluation are well documented: response rates are 5–15%, respondents skew toward extremes, and the score arrives days after the interaction. By the time you know the AI gave a bad answer, 200 more customers have received the same bad answer.
What you need is per-response quality evaluation — an automated system that scores every AI response on multiple dimensions the moment it is generated. The dimensions that matter most:
- Accuracy: Is the information factually correct?
- Completeness: Does the response fully address the customer's question?
- Tone: Is the response professional, empathetic, and brand-appropriate?
- Policy compliance: Does the response adhere to your company's policies?
- Source grounding: Is the response supported by your knowledge base?
- Safety: Does the response avoid harmful, misleading, or legally risky content?
- Actionability: Does the response give the customer a clear next step?
Twig's 7-dimension quality scoring is one implementation of this approach. Other vendors may have their own frameworks. The specific dimensions matter less than the principle: every response should be evaluated, automatically, before you rely on CSAT to tell you something went wrong.
Question 2: What Is the Real Cost?
Vendor pricing in the AI support market is notoriously opaque. Here is a reference framework:
| Vendor | Pricing Model | Annual Cost Range | What Is Included |
|---|---|---|---|
| Decagon | Annual contract | $95K–$590K | Platform, implementation, Agent Engineers, custom workflows |
| Sierra AI | Annual contract | $150K–$350K+ | Platform, implementation, CSM, multi-model architecture |
| Twig | Per-ticket ($5/ticket, free tier) | Scales with volume | Managed AI Specialists, 30+ integrations, 7-dimension quality scoring, SOC 2 Type II |
| Ada | Annual contract | $100K–$400K | Platform, multilingual support, proactive messaging |
| Zendesk AI | Add-on to Zendesk plans | Varies | Native deflection, triage, agent assist |
| Intercom Fin | Per-resolution | ~$0.99/resolution | Native to Intercom, basic deflection |
When evaluating, calculate the total cost at three volume levels: your current volume, 2x volume, and 0.5x volume. This reveals how pricing scales and whether you are exposed to cost spikes during busy periods.
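This volume sensitivity check can be sketched in a few lines. The prices below (a $5 per-ticket rate, a $150K flat contract with a 3,000-ticket monthly allotment and an $8 overage rate) are illustrative assumptions for the sake of the comparison, not quotes from any vendor.

```python
# Sketch: compare the annual cost of a flat contract vs per-ticket pricing
# at three volume levels. All figures are illustrative assumptions.

def annual_cost_per_ticket(monthly_tickets: int, price_per_ticket: float) -> float:
    """Per-ticket model: cost scales linearly with volume."""
    return monthly_tickets * 12 * price_per_ticket

def annual_cost_flat(contract_price: float, monthly_tickets: int,
                     included_monthly: int, overage_per_ticket: float) -> float:
    """Flat contract: fixed fee plus overage beyond the included allotment."""
    overage = max(0, monthly_tickets - included_monthly) * 12 * overage_per_ticket
    return contract_price + overage

baseline = 3_000  # your current monthly ticket volume (assumption)
for label, volume in [("0.5x", baseline // 2), ("1x", baseline), ("2x", baseline * 2)]:
    per_ticket = annual_cost_per_ticket(volume, price_per_ticket=5.00)
    flat = annual_cost_flat(150_000, volume, included_monthly=3_000,
                            overage_per_ticket=8.00)
    print(f"{label}: per-ticket ${per_ticket:,.0f}  vs  flat contract ${flat:,.0f}")
```

Running the numbers this way makes the asymmetry visible: per-ticket cost falls when volume halves, while a flat contract does not, and overage charges can make a "fixed" contract spike at 2x volume.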
See Twig's pricing page for a transparent per-ticket model you can model against your own data.
Question 3: How Long to Go Live?
Time to value is not just a convenience metric. Every week your AI is not live is a week of tickets handled manually. If your team handles 3,000 tickets per month and AI could resolve 40% of them, a 4-week delay costs you roughly 1,200 tickets handled manually that could have been deflected.
The range across vendors is enormous:
| Vendor Type | Typical Time to First AI Response | Typical Time to Full Deployment |
|---|---|---|
| Managed service (e.g., Twig) | 30 minutes to 24 hours | 1–5 days |
| Self-serve platform (e.g., Decagon, Ada) | 1–3 weeks | 4–8 weeks |
| Enterprise platform (e.g., Sierra) | 2–4 weeks | 6–12 weeks |
| Custom build | 8–16 weeks | 16–26 weeks |
Ask for the median deployment time, not the best case. And ask for a reference customer who went live recently, not one from 18 months ago.
Question 4: What Happens When the AI Cannot Answer?
Ninety percent of support teams report struggling with AI-to-human handoffs. The handoff is where customer experience breaks down. A customer explains their problem to an AI, the AI fails, and the customer is transferred to a human agent who has no context and asks the customer to start over.
A good vendor provides:
- Configurable confidence thresholds — you define at what confidence level the AI should escalate vs attempt a response.
- Full context transfer — the human agent receives the full conversation, the AI's attempted response, the relevant knowledge base articles, and the reason for escalation.
- Escalation categorization — you can see why tickets are being escalated (knowledge gap, policy question, emotional customer, multi-step request) and address root causes.
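The three requirements above can be sketched as a single routing gate. This is a hypothetical illustration of the logic, not any vendor's actual API: the class names, the 0.75 default threshold, and the escalation reasons are all assumptions.

```python
# Sketch of a configurable escalation gate — illustrative, not a vendor API.
from dataclasses import dataclass, field

@dataclass
class DraftResponse:
    text: str
    confidence: float                             # model's self-score, 0.0-1.0
    sources: list = field(default_factory=list)   # KB articles the draft cites

@dataclass
class Escalation:
    reason: str                 # categorized so root causes can be analyzed
    conversation: list          # full transcript, handed to the human agent
    attempted_response: DraftResponse

def route(draft: DraftResponse, conversation: list,
          min_confidence: float = 0.75):
    """Escalate with full context instead of sending a weak answer."""
    if not draft.sources:
        return Escalation("knowledge_gap", conversation, draft)
    if draft.confidence < min_confidence:
        return Escalation("low_confidence", conversation, draft)
    return draft  # confident and grounded: send to the customer

result = route(DraftResponse("Reset it via Settings > Security.", 0.91,
                             sources=["kb-142"]),
               conversation=["How do I reset my password?"])
```

The key property to verify in a demo is the `Escalation` payload: the human agent should receive the conversation, the AI's attempted answer, and the categorized reason, never just a bare transfer.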
Question 5: How Do You Handle Hallucinations?
Hallucination — the AI generating plausible-sounding but factually incorrect information — is the existential risk of AI support. One confidently wrong answer about a billing policy or product safety issue can create legal liability, customer churn, and brand damage.
The best mitigation is RAG grounding: every AI response must be traceable to a specific source document. If the AI cannot find a relevant source, it should say "I don't have information on that" rather than guess.
Ask the vendor to demonstrate what happens when you ask a question that is not covered in your knowledge base. If the AI generates an answer anyway, that is a red flag.
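The grounding behavior you should demand can be shown with a toy sketch. Real systems use vector search over embeddings; the naive word-overlap scoring, the threshold, and the sample knowledge base below are stand-in assumptions, but the decision rule — decline when retrieval finds nothing — is the point.

```python
# Minimal sketch of RAG grounding: answer only when retrieval finds a
# sufficiently similar source document; otherwise decline rather than guess.

def retrieve(question: str, kb: dict, min_overlap: float = 0.3):
    """Naive word-overlap retrieval standing in for vector search."""
    q_words = set(question.lower().split())
    best_id, best_score = None, 0.0
    for doc_id, text in kb.items():
        score = len(q_words & set(text.lower().split())) / max(len(q_words), 1)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id if best_score >= min_overlap else None

def grounded_answer(question: str, kb: dict) -> str:
    doc_id = retrieve(question, kb)
    if doc_id is None:
        return "I don't have information on that."  # decline, don't hallucinate
    return f"Based on {doc_id}: {kb[doc_id]}"       # every answer cites a source

kb = {"kb-billing": "Refunds are issued within 5 business days of cancellation."}
```

In a live demo, the equivalent test is asking about a product you do not sell: a grounded system declines with a citation-free "I don't know," while an ungrounded one invents an answer.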
Question 6: How Deep Are the Integrations?
"We integrate with Zendesk" can mean anything from "we read tickets via API" to "we have bi-directional sync with custom fields, triggers, automations, macros, and SLA policies." The difference matters enormously.
Questions to ask about each integration:
- Is it read-only or bi-directional?
- Does it sync custom fields?
- Does it respect your existing routing rules and automations?
- How frequently does it sync (real-time, hourly, daily)?
- Is it a native integration or does it require a middleware like Zapier?
Twig offers 30+ integrations across help desks, CRMs, knowledge bases, and internal tools. Other vendors may have similar breadth. What matters is depth.
Question 7: Who Manages the AI After Deployment?
This question reveals the vendor's operating model and your hidden costs. There are three common models:
| Model | Vendor Responsibility | Your Responsibility | Hidden Cost |
|---|---|---|---|
| Fully managed | Training, tuning, monitoring, quality, updates | Review reports, approve changes, update knowledge base | Low |
| Shared responsibility | Infrastructure, platform, basic monitoring | Configuration, workflow management, quality review, prompt tuning | 0.5–1 FTE |
| Self-serve platform | Infrastructure, documentation | Everything else | 1–2 FTE |
If the vendor says "your team manages it," calculate the fully-loaded cost of the internal resource needed. That $150K annual contract might actually cost $225K once you add a half-time operations person at roughly $75K of fully-loaded cost.
Question 8: What About Security?
SOC 2 Type II is the baseline. If a vendor does not have it, they are either too early-stage or not treating security seriously. Either way, it is a risk.
Beyond SOC 2, ask:
- Where is data stored? Can you specify region?
- Is your data used to train the vendor's models? Is it opt-in or opt-out?
- How is PII detected, redacted, and handled?
- What happens to your data if you cancel the contract?
- Do you support SSO and role-based access control?
Twig's security posture includes SOC 2 Type II certification, but verify any vendor's claims independently. Ask for the audit report, not just the badge on the website.
Question 9: Show Me Real Performance Data
Every vendor will tell you their AI resolves 40–70% of tickets. The question is whether those numbers hold up for customers with your ticket complexity, your knowledge base quality, and your customer expectations.
Ask for:
- A case study from a company in your industry and size range.
- Specific metrics: resolution rate, average handle time, escalation rate, quality scores.
- The timeline from deployment to those metrics (week 1 performance is very different from month 6 performance).
- Permission to speak with the reference customer directly.
Question 10: What Is the Escalation False-Negative Rate?
A false negative in escalation is when the AI should have handed off to a human but did not. It continued trying to resolve a ticket it could not handle, frustrating the customer and delaying resolution.
This metric is rarely discussed in sales calls but it is one of the most important operational metrics for AI support. A good vendor tracks it, reports on it, and can tell you their benchmark. If they cannot, they probably are not measuring it, which means they cannot improve it.
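If a vendor cannot describe their methodology, here is one reasonable way to compute the metric yourself: periodically audit a sample of AI-handled tickets, label whether each one should have gone to a human, and count the misses. The field names and sample data below are hypothetical.

```python
# Sketch: computing the escalation false-negative rate from an audited
# sample of tickets. Labels come from a periodic human review (assumption).

def escalation_false_negative_rate(tickets: list) -> float:
    """Share of should-have-escalated tickets the AI kept trying to resolve."""
    should_escalate = [t for t in tickets if t["needed_human"]]
    if not should_escalate:
        return 0.0
    missed = [t for t in should_escalate if not t["escalated"]]
    return len(missed) / len(should_escalate)

audited = [
    {"needed_human": True,  "escalated": True},   # correctly handed off
    {"needed_human": True,  "escalated": False},  # false negative
    {"needed_human": False, "escalated": False},  # correctly resolved by AI
    {"needed_human": True,  "escalated": True},   # correctly handed off
]
rate = escalation_false_negative_rate(audited)    # 1 missed out of 3 that needed a human
```

A vendor with a real measurement program should be able to describe something equivalent: a labeled audit sample, a defined denominator, and a tracked trend.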
Question 11: Knowledge Base Updates
Your products change. Your policies change. Your pricing changes. When they do, how quickly does the AI learn?
The best case is automatic re-ingestion — the vendor monitors your knowledge base and help center for changes and updates the AI's knowledge within hours. The worst case is manual re-upload, where you have to export documents, format them, and push them to the vendor's system.
Ask specifically what happens when you update a help article in Zendesk or Intercom. How long until the AI gives the updated answer?
Question 12: Contract Terms and Exit
AI vendor contracts are getting more aggressive. Some vendors require 24-month minimums with auto-renewal clauses that kick in 90+ days before expiration. If you forget to send a cancellation notice, you are locked in for another two years.
Good contract terms include:
- Monthly or annual billing with 30-day cancellation
- Data export in standard formats (CSV, JSON) upon termination
- No penalty for volume decrease
- Clear SLAs with financial remedies for downtime
Per-ticket pricing models, like Twig's, inherently offer more flexibility — you pay for what you use and can scale down without renegotiating. See Twig's pricing for details.
Question 13: Multi-Language Support
If you support customers in multiple languages, this question is critical. There is a meaningful difference between:
- Native language processing: The AI understands and responds in the target language natively, with cultural nuance and idiomatic accuracy.
- Translation layer: The AI processes everything in English and translates input/output. This works for simple queries but fails on nuance, idioms, and technical terminology.
Ask for quality metrics broken down by language. A vendor that reports 85% resolution rate overall might be at 90% in English and 60% in Japanese. The aggregate hides the gap.
Question 14: What If You Get Acquired?
This question felt theoretical until Zendesk acquired Forethought on March 11, 2026. Now it is practical. If your AI support vendor gets acquired by a platform you do not use, your investment is at risk.
The protection is contractual: a change-of-control clause that gives you the right to exit without penalty if the vendor is acquired, plus a commitment to service continuity for a defined period (12–24 months minimum).
We analyzed the implications of the Forethought acquisition in detail: What Zendesk's Acquisition of Forethought Means for the AI Support Market.
Question 15: Continuous Improvement
AI support is not a set-it-and-forget-it deployment. The AI should get better over time as it learns from resolved tickets, identifies knowledge gaps, and adapts to new question patterns.
Ask the vendor:
- How often are models updated or fine-tuned?
- Do you proactively identify knowledge base gaps?
- Is there a regular review cadence (weekly, monthly) with performance reports?
- Can you show me a sample improvement report from an existing customer?
The difference between a good AI support deployment and a great one is the improvement loop. Initial deployment gets you to 40% resolution. Continuous improvement gets you to 70%.
How to Score Vendors
Use this scoring framework during your evaluation:
| Category | Weight | Questions |
|---|---|---|
| Quality and safety | 30% | Q1, Q5, Q10 |
| Pricing and contracts | 20% | Q2, Q12 |
| Implementation and operations | 20% | Q3, Q7, Q11, Q15 |
| Integration and flexibility | 15% | Q6, Q13 |
| Security and risk | 15% | Q8, Q9, Q14 |
Score each question 1–5 (1 = red flag, 5 = excellent). Weight the category scores and compare vendors on a single composite number. This will not make the decision for you, but it will make the decision defensible.
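The composite calculation is simple enough to run in a spreadsheet or a few lines of code. This sketch implements the weights from the table above; question scores are 1-5 as described.

```python
# Sketch of the weighted scoring framework: average the 1-5 scores within
# each category, then combine category means using the weights above.

WEIGHTS = {
    "quality_and_safety":            (0.30, ["q1", "q5", "q10"]),
    "pricing_and_contracts":         (0.20, ["q2", "q12"]),
    "implementation_and_operations": (0.20, ["q3", "q7", "q11", "q15"]),
    "integration_and_flexibility":   (0.15, ["q6", "q13"]),
    "security_and_risk":             (0.15, ["q8", "q9", "q14"]),
}

def composite_score(scores: dict) -> float:
    """Weighted average of per-category means; stays on the 1-5 scale."""
    total = 0.0
    for weight, questions in WEIGHTS.values():
        category_mean = sum(scores[q] for q in questions) / len(questions)
        total += weight * category_mean
    return round(total, 2)

# Example: a vendor scoring 5 on every question earns a perfect 5.0.
perfect = composite_score({f"q{i}": 5 for i in range(1, 16)})
```

Because the weights sum to 1.0, the composite stays on the same 1-5 scale as the individual question scores, which makes vendor-to-vendor comparison direct.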
Recommended Next Steps
- Download or bookmark this checklist. Bring it to every vendor demo.
- Send these questions in advance. A good vendor will welcome them. A bad vendor will stall.
- Compare at least 3 vendors. Include Decagon, Sierra AI, Twig, and at least one platform-native option. Comparing across pricing models (annual contract vs per-ticket) will sharpen your understanding of total cost.
- Run a paid pilot, not a free trial. Free trials get deprioritized internally. A paid pilot with defined success criteria forces both you and the vendor to take it seriously.
- Check the Agents Playbook for tactical guidance on deploying and managing AI support agents after you have chosen a vendor.
The AI support market is large, growing fast, and full of vendors who can demo beautifully. The questions you ask — and how rigorously you evaluate the answers — are what determine whether your investment delivers real results or becomes another line item you regret at renewal.
This evaluation framework reflects best practices as of March 2026 and incorporates input from CX leaders at companies ranging from 10 to 5,000+ agents. Vendor capabilities change frequently — always verify claims directly.