
Evaluating AI Support Vendors: 15 Questions Every Head of Support Should Ask

The definitive buyer's checklist for AI support — 15 questions on pricing, quality, escalation, and security with good/bad benchmarks.

Twig Team · March 29, 2026 · 15 min read

You have budget approval. You have a shortlist. Now you are sitting across from vendor sales teams, and every one of them has a polished demo showing their AI resolving tickets flawlessly. The demo is not the product. The questions you ask — and the answers you demand — are what separate a good vendor decision from a costly mistake.

This is a checklist of 15 questions that every Head of Support should ask during an AI support vendor evaluation. For each question, we define what a good answer looks like and what should raise a red flag. These are not theoretical — they are drawn from real evaluation processes at mid-market and enterprise support organizations.

Print this out. Bring it to your next demo. Score every vendor against it.

The Evaluation Table

| # | Question | Good Answer | Red Flag |
|---|----------|-------------|----------|
| 1 | How do you measure the quality of AI responses? | Per-response quality scoring across multiple dimensions (accuracy, completeness, tone, policy compliance). Automated self-evaluation on every response. | "We use CSAT scores" or "Our accuracy is 95%" without explaining how accuracy is measured. |
| 2 | What is your pricing model, and what is the total cost at my volume? | Clear, published pricing. Willingness to model cost at your specific ticket volume. Transparent per-ticket or per-resolution cost. | "Contact sales for pricing" with no ballpark. Annual contract with unclear overage charges. |
| 3 | How long does it take to go live with real tickets? | Specific timeline: "30 minutes to first response" or "2 weeks to full deployment." Backed by customer references. | "It depends" without specifics. Any timeline over 8 weeks for a standard deployment. |
| 4 | What happens when the AI cannot answer a question? | Defined escalation triggers. Configurable confidence thresholds. Full context passed to the human agent. Measurable escalation rate. | "The AI always tries to answer." No configurable escalation logic. Context is lost in handoff. |
| 5 | How do you handle AI hallucinations? | RAG-grounded responses with citation to source documents. Automated hallucination detection. Response withheld when confidence is low. | "Our model doesn't hallucinate" or no specific hallucination mitigation strategy. |
| 6 | What integrations do you support, and how deep are they? | 30+ integrations with specific documentation on data flow for each. Bi-directional sync with major help desks. | "We integrate with Zendesk" without specifics. One-way data push only. Fewer than 10 integrations. |
| 7 | Who manages the AI after deployment? | Clear ownership model — either vendor-managed or defined customer responsibilities with tooling to support them. | Ambiguous ownership. "Your team manages it" without adequate tooling or documentation. |
| 8 | What security certifications do you hold? | SOC 2 Type II at minimum. Clear data residency policies. No training on customer data without explicit opt-in. PII detection and redaction. | No SOC 2. Vague data handling policies. Model trained on customer data by default. |
| 9 | Can I see real performance data from a similar customer? | Named case studies with specific metrics: resolution rate, handle time reduction, quality scores, escalation rate. | Only aggregate statistics. No named references. "We can't share due to NDAs" for every customer. |
| 10 | What is your escalation false-negative rate? | A specific number (e.g., "Less than 3% of tickets that should have been escalated were not"). Methodology for measuring it. | "I don't know" or "We don't track that." |
| 11 | How do you handle knowledge base updates? | Automatic re-ingestion on a defined schedule (hourly, daily). Manual refresh option. Change detection and versioning. | Manual re-upload required. No version control. Updates take days to propagate. |
| 12 | What is the contract term, and what are the exit terms? | Month-to-month or per-ticket with no minimum term. Or annual with a 30-day termination clause. Data export included. | 24-month minimum. Auto-renewal with 90+ day notice requirement. No data portability. |
| 13 | How do you handle multi-language support? | Specific list of supported languages. Quality metrics broken down by language. Native language processing, not just translation. | "We support all languages" without specifics. Translation-only approach for non-English. |
| 14 | What happens if your company is acquired? | Change-of-control clause in the contract. Commitment to 12+ months of service continuity. Data portability guarantee. | No change-of-control clause. No continuity commitment. (See what happened with Forethought's acquisition by Zendesk.) |
| 15 | How do you evaluate and improve over time? | Continuous learning from resolved tickets. Regular quality reports. Proactive recommendations for knowledge base gaps. Defined improvement cadence. | "The model improves automatically" without specifics. No structured improvement process. No reporting. |

Deep Dive: Each Question Explained

Question 1: How Do You Measure Quality?

This is the single most important question, and it is the one most vendors fumble.

The industry standard for years has been CSAT — a survey sent after ticket resolution. The problems with CSAT for AI evaluation are well documented: response rates are 5–15%, respondents skew toward extremes, and the score arrives days after the interaction. By the time you know the AI gave a bad answer, 200 more customers have received the same bad answer.

What you need is per-response quality evaluation — an automated system that scores every AI response on multiple dimensions the moment it is generated. The dimensions that matter most:

  • Accuracy: Is the information factually correct?
  • Completeness: Does the response fully address the customer's question?
  • Tone: Is the response professional, empathetic, and brand-appropriate?
  • Policy compliance: Does the response adhere to your company's policies?
  • Source grounding: Is the response supported by your knowledge base?
  • Safety: Does the response avoid harmful, misleading, or legally risky content?
  • Actionability: Does the response give the customer a clear next step?

Twig's 7-dimension quality scoring is one implementation of this approach. Other vendors may have their own frameworks. The specific dimensions matter less than the principle: every response should be evaluated, automatically, before you rely on CSAT to tell you something went wrong.
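To make the principle concrete, here is a minimal sketch of what per-response scoring could look like. The dimension names mirror the list above; the 0.0–1.0 scale, the 0.7 threshold, and the flag-if-any-dimension-fails rule are illustrative assumptions, not any vendor's actual rubric.

```python
# Illustrative sketch: score every AI response on several quality
# dimensions the moment it is generated, instead of waiting for CSAT.
# The 0.0-1.0 scale and 0.7 threshold are assumptions for this example.

DIMENSIONS = [
    "accuracy", "completeness", "tone", "policy_compliance",
    "source_grounding", "safety", "actionability",
]

def score_response(scores: dict[str, float], threshold: float = 0.7) -> dict:
    """Aggregate per-dimension scores and flag responses for review."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    # Flag if any single dimension falls below the threshold, even when
    # the average looks fine: one bad dimension can sink a response.
    flagged = [d for d in DIMENSIONS if scores[d] < threshold]
    return {"overall": round(overall, 3), "flagged": flagged, "ship": not flagged}

example = score_response({
    "accuracy": 0.95, "completeness": 0.9, "tone": 0.85,
    "policy_compliance": 1.0, "source_grounding": 0.6,  # weak grounding
    "safety": 1.0, "actionability": 0.8,
})
# Averages well overall, but is held back on source_grounding alone.
```

Note the design choice: a strong average does not override a single failing dimension, which is exactly the behavior a CSAT-only vendor cannot offer.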

Question 2: What Is the Real Cost?

Vendor pricing in the AI support market is notoriously opaque. Here is a reference framework:

| Vendor | Pricing Model | Annual Cost Range | What Is Included |
|--------|---------------|-------------------|------------------|
| Decagon | Annual contract | $95K–$590K | Platform, implementation, Agent Engineers, custom workflows |
| Sierra AI | Annual contract | $150K–$350K+ | Platform, implementation, CSM, multi-model architecture |
| Twig | Per-ticket ($5/ticket, free tier) | Scales with volume | Managed AI Specialists, 30+ integrations, 7-dimension quality scoring, SOC 2 Type II |
| Ada | Annual contract | $100K–$400K | Platform, multilingual support, proactive messaging |
| Zendesk AI | Add-on to Zendesk plans | Varies | Native deflection, triage, agent assist |
| Intercom Fin | Per-resolution | ~$0.99/resolution | Native to Intercom, basic deflection |

When evaluating, calculate the total cost at three volume levels: your current volume, 2x volume, and 0.5x volume. This reveals how pricing scales and whether you are exposed to cost spikes during busy periods.
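A quick sketch of that three-volume calculation, comparing a flat annual contract against per-ticket pricing. All numbers here (the $150K contract, $5/ticket rate, 3,000 tickets/month, and the share of tickets the AI actually handles) are assumptions to plug your own data into:

```python
# Hypothetical numbers: compare a flat annual contract against
# per-ticket pricing at 0.5x, 1x, and 2x of current ticket volume.

ANNUAL_CONTRACT = 150_000        # assumed flat annual fee
PER_TICKET = 5.00                # assumed per-ticket rate
CURRENT_MONTHLY_VOLUME = 3_000   # assumed tickets per month
AI_HANDLED_SHARE = 0.5           # assumed fraction the AI actually touches

def annual_cost_per_ticket_model(monthly_volume: int) -> float:
    """Annual spend under per-ticket pricing at a given volume."""
    return monthly_volume * AI_HANDLED_SHARE * PER_TICKET * 12

for multiplier in (0.5, 1.0, 2.0):
    volume = int(CURRENT_MONTHLY_VOLUME * multiplier)
    usage = annual_cost_per_ticket_model(volume)
    print(f"{multiplier}x volume: per-ticket ${usage:,.0f} "
          f"vs contract ${ANNUAL_CONTRACT:,.0f}")
```

The flat contract wins only above a crossover volume; below it, you are paying for capacity you do not use. Finding that crossover for your own numbers is the point of the exercise.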

See Twig's pricing page for a transparent per-ticket model you can run against your own ticket data.

Question 3: How Long to Go Live?

Time to value is not just a convenience metric. Every week your AI is not live is a week of tickets handled manually. If your team handles 3,000 tickets per month and AI could resolve 40% of them, a 4-week delay means roughly 1,200 tickets handled manually that could have been deflected.
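As a quick sanity check on that deflection math (same assumed inputs: 3,000 tickets/month, 40% AI-resolvable, a 4-week delay treated as one month):

```python
# Deflection-delay arithmetic with the assumed inputs from the text.
monthly_tickets = 3_000
ai_resolvable_share = 0.40
delay_months = 1  # a 4-week delay, treated as one month

tickets_lost_to_delay = int(monthly_tickets * ai_resolvable_share * delay_months)
# -> 1,200 tickets handled manually that the AI could have deflected
```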

The range across vendors is enormous:

| Vendor Type | Typical Time to First AI Response | Typical Time to Full Deployment |
|-------------|-----------------------------------|---------------------------------|
| Managed service (e.g., Twig) | 30 minutes to 24 hours | 1–5 days |
| Self-serve platform (e.g., Decagon, Ada) | 1–3 weeks | 4–8 weeks |
| Enterprise platform (e.g., Sierra) | 2–4 weeks | 6–12 weeks |
| Custom build | 8–16 weeks | 16–26 weeks |

Ask for the median deployment time, not the best case. And ask for a reference customer who went live recently, not one from 18 months ago.

Question 4: What Happens When the AI Cannot Answer?

Ninety percent of support teams report struggling with AI-to-human handoffs. The handoff is where customer experience breaks down. A customer explains their problem to an AI, the AI fails, and the customer is transferred to a human agent who has no context and asks the customer to start over.

A good vendor provides:

  • Configurable confidence thresholds — you define at what confidence level the AI should escalate vs attempt a response.
  • Full context transfer — the human agent receives the full conversation, the AI's attempted response, the relevant knowledge base articles, and the reason for escalation.
  • Escalation categorization — you can see why tickets are being escalated (knowledge gap, policy question, emotional customer, multi-step request) and address root causes.

Question 5: How Do You Handle Hallucinations?

Hallucination — the AI generating plausible-sounding but factually incorrect information — is the existential risk of AI support. One confidently wrong answer about a billing policy or product safety issue can create legal liability, customer churn, and brand damage.

The best mitigation is RAG grounding: every AI response must be traceable to a specific source document. If the AI cannot find a relevant source, it should say "I don't have information on that" rather than guess.

Ask the vendor to demonstrate what happens when you ask a question that is not covered in your knowledge base. If the AI generates an answer anyway, that is a red flag.
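The grounding behavior described above reduces to a simple rule: answer only when a sufficiently relevant source was retrieved, otherwise decline. The retriever output format and the 0.75 relevance cutoff below are stand-ins, not a real vendor API:

```python
# Sketch of RAG-style refusal: no relevant source, no answer.
# Retrieval scores and the 0.75 cutoff are illustrative assumptions.

def answer(question: str, retrieved: list[tuple[str, float]],
           min_relevance: float = 0.75) -> dict:
    """Return an answer with citations, or an explicit refusal."""
    sources = [doc for doc, score in retrieved if score >= min_relevance]
    if not sources:
        # Decline rather than guess when nothing relevant was found.
        return {"text": "I don't have information on that.", "citations": []}
    return {"text": f"(answer grounded in {len(sources)} source(s))",
            "citations": sources}

# No sufficiently relevant source -> the AI declines instead of guessing.
refusal = answer("What is the refund policy for beta features?",
                 retrieved=[("pricing-faq.md", 0.41)])
```

This is exactly the demo test from the paragraph above: ask about something outside the knowledge base and watch whether the system refuses or improvises.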

Question 6: How Deep Are the Integrations?

"We integrate with Zendesk" can mean anything from "we read tickets via API" to "we have bi-directional sync with custom fields, triggers, automations, macros, and SLA policies." The difference matters enormously.

Questions to ask about each integration:

  • Is it read-only or bi-directional?
  • Does it sync custom fields?
  • Does it respect your existing routing rules and automations?
  • How frequently does it sync (real-time, hourly, daily)?
  • Is it a native integration or does it require a middleware like Zapier?

Twig offers 30+ integrations across help desks, CRMs, knowledge bases, and internal tools. Other vendors may have similar breadth. What matters is depth.

Question 7: Who Manages the AI After Deployment?

This question reveals the vendor's operating model and your hidden costs. There are three common models:

| Model | Vendor Responsibility | Your Responsibility | Hidden Cost |
|-------|-----------------------|---------------------|-------------|
| Fully managed | Training, tuning, monitoring, quality, updates | Review reports, approve changes, update knowledge base | Low |
| Shared responsibility | Infrastructure, platform, basic monitoring | Configuration, workflow management, quality review, prompt tuning | 0.5–1 FTE |
| Self-serve platform | Infrastructure, documentation | Everything else | 1–2 FTE |

If the vendor says "your team manages it," calculate the fully-loaded cost of the internal resource needed. That $150K annual contract might actually cost $225K when you add a half-time operations person.
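The hidden-cost math from that example, spelled out. The fully-loaded FTE cost of $150K is an assumption; substitute your own:

```python
# True annual cost of a "your team manages it" contract.
contract = 150_000           # sticker price of the annual contract
fully_loaded_fte = 150_000   # assumed fully-loaded annual cost of 1 FTE
hidden_fte = 0.5             # shared-responsibility model, half-time person

true_annual_cost = contract + hidden_fte * fully_loaded_fte
# -> $225,000, i.e. 50% above the sticker price
```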

Question 8: What About Security?

SOC 2 Type II is the baseline. If a vendor does not have it, they are either too early-stage or not treating security seriously. Either way, it is a risk.

Beyond SOC 2, ask:

  • Where is data stored? Can you specify region?
  • Is your data used to train the vendor's models? Is it opt-in or opt-out?
  • How is PII detected, redacted, and handled?
  • What happens to your data if you cancel the contract?
  • Do you support SSO and role-based access control?

Twig's security posture includes SOC 2 Type II certification, but verify any vendor's claims independently. Ask for the audit report, not just the badge on the website.

Question 9: Show Me Real Performance Data

Every vendor will tell you their AI resolves 40–70% of tickets. The question is whether those numbers hold up for customers with your ticket complexity, your knowledge base quality, and your customer expectations.

Ask for:

  • A case study from a company in your industry and size range.
  • Specific metrics: resolution rate, average handle time, escalation rate, quality scores.
  • The timeline from deployment to those metrics (week 1 performance is very different from month 6 performance).
  • Permission to speak with the reference customer directly.

Question 10: What Is the Escalation False-Negative Rate?

A false negative in escalation is when the AI should have handed off to a human but did not: it kept trying to resolve a ticket it could not handle, frustrating the customer and delaying resolution.

This metric is rarely discussed in sales calls, but it is one of the most important operational metrics for AI support. A good vendor tracks it, reports on it, and can tell you their benchmark. If they cannot, they probably are not measuring it, which means they cannot improve it.
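One plausible methodology a vendor could describe: sample resolved tickets, have humans label which ones should have been escalated, and count how many of those the AI kept handling itself. The ticket structure below is hypothetical:

```python
# Measuring the escalation false-negative rate on a human-labeled sample.
# The dict shape ("human_label", "ai_escalated") is an assumption.

def escalation_false_negative_rate(tickets: list[dict]) -> float:
    """Share of should-have-escalated tickets the AI did not escalate."""
    should_escalate = [t for t in tickets
                       if t["human_label"] == "should_escalate"]
    if not should_escalate:
        return 0.0
    missed = [t for t in should_escalate if not t["ai_escalated"]]
    return len(missed) / len(should_escalate)

sample = [
    {"human_label": "should_escalate", "ai_escalated": True},
    {"human_label": "should_escalate", "ai_escalated": False},  # false negative
    {"human_label": "ai_ok",           "ai_escalated": False},
    {"human_label": "should_escalate", "ai_escalated": True},
]
rate = escalation_false_negative_rate(sample)  # 1 missed out of 3
```

A vendor who can walk you through something like this, with their own real numbers, is measuring the metric. A vendor who cannot is not.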

Question 11: Knowledge Base Updates

Your products change. Your policies change. Your pricing changes. When they do, how quickly does the AI learn?

The best case is automatic re-ingestion — the vendor monitors your knowledge base and help center for changes and updates the AI's knowledge within hours. The worst case is manual re-upload, where you have to export documents, format them, and push them to the vendor's system.

Ask specifically what happens when you update a help article in Zendesk or Intercom. How long until the AI gives the updated answer?
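Under the hood, "change detection" can be as simple as hashing article content and re-ingesting only what changed. This sketch shows the core idea; a real pipeline would also handle deletions, versioning, and embedding updates:

```python
# Hash-based change detection for knowledge base re-ingestion:
# re-ingest an article only when its content hash changes.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def articles_to_reingest(current: dict[str, str],
                         previous_hashes: dict[str, str]) -> list[str]:
    """Return IDs of new or changed articles since the last sync."""
    return [article_id for article_id, body in current.items()
            if previous_hashes.get(article_id) != content_hash(body)]

previous = {"refunds": content_hash("Refunds within 30 days.")}
current = {"refunds": "Refunds within 14 days.",   # changed policy
           "shipping": "Ships in 2-3 days."}        # new article
changed = articles_to_reingest(current, previous)   # both need re-ingestion
```

If a vendor's answer to the update question amounts to "you re-upload everything," they are skipping even this basic step.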

Question 12: Contract Terms and Exit

AI vendor contracts are getting more aggressive. Some vendors require 24-month minimums with auto-renewal clauses that kick in 90+ days before expiration. If you forget to send a cancellation notice, you are locked in for another two years.

Good contract terms include:

  • Monthly or annual billing with 30-day cancellation
  • Data export in standard formats (CSV, JSON) upon termination
  • No penalty for volume decrease
  • Clear SLAs with financial remedies for downtime

Per-ticket pricing models, like Twig's, inherently offer more flexibility — you pay for what you use and can scale down without renegotiating. See Twig's pricing for details.

Question 13: Multi-Language Support

If you support customers in multiple languages, this question is critical. There is a meaningful difference between:

  • Native language processing: The AI understands and responds in the target language natively, with cultural nuance and idiomatic accuracy.
  • Translation layer: The AI processes everything in English and translates input/output. This works for simple queries but fails on nuance, idioms, and technical terminology.

Ask for quality metrics broken down by language. A vendor that reports 85% resolution rate overall might be at 90% in English and 60% in Japanese. The aggregate hides the gap.
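The aggregate-versus-per-language gap from that example, in numbers. Volumes and rates here are illustrative, but the structure is what you should ask the vendor to report:

```python
# A blended 85% resolution rate can hide a much weaker non-English
# experience. Ticket volumes and resolution counts are illustrative.

by_language = {
    "en": {"tickets": 5_000, "resolved": 4_500},   # 90% resolution
    "ja": {"tickets": 1_000, "resolved": 600},     # 60% resolution
}

total_tickets = sum(v["tickets"] for v in by_language.values())
overall = sum(v["resolved"] for v in by_language.values()) / total_tickets
per_language = {lang: v["resolved"] / v["tickets"]
                for lang, v in by_language.items()}
# overall is ~85% even though Japanese customers see only 60%.
```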

Question 14: What If You Get Acquired?

This question felt theoretical until Zendesk acquired Forethought on March 11, 2026. Now it is practical. If your AI support vendor gets acquired by a platform you do not use, your investment is at risk.

The protection is contractual: a change-of-control clause that gives you the right to exit without penalty if the vendor is acquired, plus a commitment to service continuity for a defined period (12–24 months minimum).

We analyzed the implications of the Forethought acquisition in detail: What Zendesk's Acquisition of Forethought Means for the AI Support Market.

Question 15: Continuous Improvement

AI support is not a set-it-and-forget-it deployment. The AI should get better over time as it learns from resolved tickets, identifies knowledge gaps, and adapts to new question patterns.

Ask the vendor:

  • How often are models updated or fine-tuned?
  • Do you proactively identify knowledge base gaps?
  • Is there a regular review cadence (weekly, monthly) with performance reports?
  • Can you show me a sample improvement report from an existing customer?

The difference between a good AI support deployment and a great one is the improvement loop. Initial deployment gets you to 40% resolution. Continuous improvement gets you to 70%.

How to Score Vendors

Use this scoring framework during your evaluation:

| Category | Weight | Questions |
|----------|--------|-----------|
| Quality and safety | 30% | Q1, Q5, Q10 |
| Pricing and contracts | 20% | Q2, Q12 |
| Implementation and operations | 20% | Q3, Q7, Q11, Q15 |
| Integration and flexibility | 15% | Q6, Q13 |
| Security and risk | 15% | Q8, Q9, Q14 |

Score each question 1–5 (1 = red flag, 5 = excellent). Weight the category scores and compare vendors on a single composite number. This will not make the decision for you, but it will make the decision defensible.
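The framework above, as a small scoring script. The weights and question groupings come straight from the table; the per-question scores for the sample vendor are made up:

```python
# Weighted composite vendor score: 1-5 per question, averaged within
# each category, then weighted per the table above.

WEIGHTS = {
    "quality_and_safety":            (0.30, ["q1", "q5", "q10"]),
    "pricing_and_contracts":         (0.20, ["q2", "q12"]),
    "implementation_and_operations": (0.20, ["q3", "q7", "q11", "q15"]),
    "integration_and_flexibility":   (0.15, ["q6", "q13"]),
    "security_and_risk":             (0.15, ["q8", "q9", "q14"]),
}

def composite_score(scores: dict[str, int]) -> float:
    """Weighted composite on the 1-5 scale described in the article."""
    total = 0.0
    for weight, questions in WEIGHTS.values():
        total += weight * (sum(scores[q] for q in questions) / len(questions))
    return round(total, 2)

# Illustrative scores for one vendor, not a real evaluation.
vendor_a = composite_score({
    "q1": 5, "q5": 4, "q10": 3, "q2": 4, "q12": 5, "q3": 5, "q7": 4,
    "q11": 4, "q15": 3, "q6": 5, "q13": 2, "q8": 5, "q9": 3, "q14": 2,
})
```

Run the same dictionary for every vendor on your shortlist and compare the composites; the per-category subtotals will also show you exactly where each vendor is weak.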

  1. Download or bookmark this checklist. Bring it to every vendor demo.
  2. Send these questions in advance. A good vendor will welcome them. A bad vendor will stall.
  3. Compare at least 3 vendors. Include Decagon, Sierra AI, Twig, and at least one platform-native option. Comparing across pricing models (annual contract vs per-ticket) will sharpen your understanding of total cost.
  4. Run a paid pilot, not a free trial. Free trials get deprioritized internally. A paid pilot with defined success criteria forces both you and the vendor to take it seriously.
  5. Check the Agents Playbook for tactical guidance on deploying and managing AI support agents after you have chosen a vendor.

The AI support market is large, growing fast, and full of vendors who can demo beautifully. The questions you ask — and how rigorously you evaluate the answers — are what determine whether your investment delivers real results or becomes another line item you regret at renewal.


This evaluation framework reflects best practices as of March 2026 and incorporates input from CX leaders at companies ranging from 10 to 5,000+ agents. Vendor capabilities change frequently — always verify claims directly.

See how Twig resolves tickets automatically

30-minute setup · Free tier available · No credit card required
