Evaluate a Shopify agency. Seven questions.
Seven questions, two red flags, and a reference-call script that surfaces what the proposal hides. The honest framework for picking the right Shopify partner.
Seven questions, two red flags.
Evaluating a Shopify agency well is a 4-to-6 week process that turns on seven specific questions and a disciplined reference-call script. The non-negotiable three: name your lead engineer and designer (not the salesperson), show me three case studies in my revenue band with verifiable outcomes, walk me through the deliverables on a comparable past engagement with the actual artifacts. The other four cover ongoing engagement mix, scope-change handling, code review and QA process, and what month 12 looks like for a typical client. Red flags to walk away from: vague scope, fixed-price without discovery, generalist we-do-everything positioning, no named team, cheapest-in-market framing, and references the agency mediates rather than ones you call directly. Run reference calls in 30 minutes with five questions: what they did not tell you upfront, how they handled a missed deadline, who showed up to meetings six months in, what changed by month six, would you hire them again. The last question's hesitation, even half a second, is your buy-or-no-buy signal. Score finalists on a 7-criterion weighted matrix: quality of work 25 percent, team continuity 20 percent, scope discipline 15 percent, references 15 percent, timeline 10 percent, pricing transparency 10 percent, cultural fit 5 percent. After the engagement, score the actual experience on the same criteria to refine your framework for next time. Switching agencies mid-engagement costs 60 to 90 days and 10 to 20 percent of budget, but it is cheaper than finishing badly with the wrong partner.
Three non-negotiable, four diagnostic.
Question one: show me three case studies in my revenue band. An agency that has only shipped 500K-revenue brands cannot reliably scope a 15M brand, and an agency built for 50M enterprises will under-deliver on the operational subtlety a 3M founder needs. Ask for case studies that match your revenue, your category (fashion, beauty, B2B, F&B, home), and your platform tier (Shopify Plus vs standard). Verifiable means a named merchant, a public URL, and an outcome metric you can sanity-check against tools like the public Shopify Partner directory. If the case studies are anonymized "we worked with a leading beauty brand," that is marketing fiction, not evidence.
Question two: who is the actual lead engineer and designer on my account. The senior sales person who pitches you is rarely the person who writes the code or designs the experience. Ask for names, LinkedIn URLs, and a 10-minute conversation with each before signing; check their Shopify dev contributions and public work where possible. If the agency declines this request or substitutes an "account director" for the actual builder, the engagement will run with senior pitch theater and junior delivery. The pattern is so consistent that it is the single best filter in this list.
Question three: walk me through the deliverables on a comparable past engagement, with the actual artifacts. Real artifacts means the actual discovery document, the technical specification, the QA checklist, the launch runbook, and the post-launch retro from a past project. Names redacted, content visible. An agency that ships these consistently can produce them on request; one that does not will deflect with "we follow a custom process for each client." That deflection means they do not document, and your engagement will inherit that fog.
Question four: what is your ongoing-engagement to project-only mix. A 90 percent project-only agency will treat your retainer as a side gig and your team will get whichever engineers are not busy on the big project. A 60 percent retainer-led agency has built the operational rhythm for ongoing work and will treat you accordingly. Most quality Shopify agencies sit at 50-50 or 60-40 retainer-heavy; pure project shops are usually optimizing for hit-and-run cash rather than client lifetime value.
Question five: how do you handle scope changes mid-project. The honest answer is "we have a change-order process with a one-week turnaround on estimates and a single sign-off threshold above which the client approves before we proceed." Any answer that is vaguer than that is a future scope dispute. Ask to see a sample change order from a past engagement; the best agencies will share an anonymized example tracked in Linear or a comparable system, with the actual estimate-to-actual delta. The clarity of the document tells you everything about how the agency handles the inevitable mid-project complexity.
Question six: show me your code-review and QA process. For Shopify work this means: how do you handle code reviews on GitHub pull requests, what is your automated testing approach on theme code (most agencies do little here, which is honest if they admit it), how do you handle staging and preview environments, and what does QA sign-off look like before deploy. A serious agency has documented answers; a less serious one will improvise on the call. Cross-reference what they say with the actual GitHub workflow you will inherit.
Question seven: what does month 12 look like for your typical client. This question filters out agencies that are great at launches and weak at the year that follows. Good answers describe a steady monthly cadence of feature work, quarterly performance audits referencing Core Web Vitals field data, ongoing optimization on conversion-relevant pages, and a strategic review every six months. Bad answers describe "we are always available when you need us," which means you will be initiating every conversation and getting reactive output.
Six patterns that predict regret.
Red flag one: vague scope. The proposal says "we will redesign your Shopify storefront and improve conversion" without listing the pages in scope, the user flows being changed, the integrations being touched, the data being migrated, or the success metrics being measured. Vague scope is the single most expensive defect in agency engagements because every gap becomes a change-order conversation later. A good proposal reads like a project plan, not a sales letter; if it reads like a sales letter, it is a sales letter.
Red flag two: fixed-price quote without a discovery phase. No honest agency can price a complex Shopify project without first reviewing the existing codebase, talking to the stakeholders, and understanding the operational constraints (inventory systems, ERP integrations to platforms like Stripe Connect or NetSuite, returns, customer-service workflow). A fixed-price quote with no discovery either means the agency will pad heavily and pocket the surplus, or means they will quote low and recover via change orders. Either pattern punishes the client. Insist on a paid discovery (typically 5K to 15K, 2 to 4 weeks) before any fixed-price proposal on engagements above 50K.
Red flag three: we-do-everything generalist positioning. An agency claiming deep expertise across Shopify Plus, custom SaaS, mobile, AI, paid acquisition, brand strategy, video production, and SEO is not telling you the truth. Real expertise has a sharp focus; the breadth-claim usually means the senior partners are skilled in two areas and they staff the rest with junior or freelance labor. Ask which two practices the founders personally built and led; that is where the institutional depth actually lives.
Red flag four: no named team members on the proposal. Big agencies sometimes do this on purpose to preserve flexibility on staffing, but for an engagement above 50K you should know the names. If the agency refuses, ask why; the answer is usually telling. If you cannot name your lead engineer and your lead designer at proposal stage, you will not name them at delivery either, and you will work with whoever is on the bench that week.
Red flag five: cheapest-in-the-market framing. Shopify Plus engineering at quality is not a price-led market. An agency pitching primarily on being 30 to 40 percent below market is either using less-experienced engineers, taking shortcuts on QA and documentation, or planning to recover margin via change orders. The total cost of a cheap-but-flawed engagement is usually higher than the premium agency by month 12. The right framing for cost discussion is value: what does the agency deliver per dollar, not what does it cost.
Red flag six: references the agency hand-picks and mediates rather than letting you call them directly. A confident agency provides a list of past clients and invites you to pick three from anywhere on the list; they will introduce you and then step out of the call. An agency that insists on being on the reference call, or that picks the references for you, is curating the message. Insist on direct conversations without agency presence, and pick references from outside their suggested shortlist (often easy to find via Shopify public merchant lists or LinkedIn).
Five questions, thirty minutes.
Reference call question one: what did the agency not tell you upfront that you wished you knew on day one. This is the single best question on any reference call. The answer surfaces hidden patterns the agency hides from new prospects: how change orders actually work, whether the senior team rotates off after kickoff, whether the timeline estimates were honest or optimistic. The reference is on your side here; they have nothing to gain from sugar-coating, and the question gives them permission to be honest.
Question two: how did they handle the first missed deadline or major scope change. Every multi-month project has at least one of these. The interesting data is not whether it happened but how the agency responded. Good agencies surface the slip early, propose a recovery plan, and absorb some of the cost when their estimating was off. Weak agencies blame the client, raise prices, or quietly extend the timeline without disclosure. Ask the reference to walk you through the specific incident; the texture of the answer is the data.
Question three: who actually showed up to meetings six months in. Agency rosters rotate, and the senior team that pitched is often not the team that delivers in month six. A clean answer is "the same lead engineer and designer the whole way through." A worse answer is "we ended up with different people in month four than the kickoff team." This single question correlates more strongly with engagement quality than almost any other reference data point because team continuity drives velocity, knowledge retention, and trust.
Question four: what changed in month six versus month one. Communication frequency, response time, engineering attention, willingness to push back on bad ideas, willingness to suggest changes the client did not ask for. The pattern to watch for: agencies that start strong and decay by month four are common; agencies that maintain or improve by month six are rare and valuable. References notice the decay before they consciously articulate it; the question makes them put words to the feeling.
Question five: would you hire them again with the same scope and budget. The answer matters less than the hesitation. A confident yes inside one second is a strong signal. A confident yes after two seconds of thought is a moderate signal. Any hesitation past three seconds, any qualifying language like "with some changes," any pause where the reference looks for the right words, is your warning. The agency has clients who would re-hire and clients who would not; you want to work with an agency where most clients answer the question in under a second.
Seven criteria, weighted.
The seven criteria and their weights, used at every agency selection we run at Digital Heroes for our Shopify Plus engagements. Quality of work, weighted 25 percent: the highest single weight because everything else compounds off this. Score 1 to 5 based on the case studies you reviewed, the code quality you can see in their public repos on GitHub, and the technical depth of conversations with the lead engineer.
Team continuity, 20 percent. The risk that the senior team you meet at pitch rotates off in month four is real. Score on the agency's track record of keeping the same lead engineer and designer across multi-quarter engagements, their tenure data (the senior team's average years at the agency), and what references told you about who showed up in month six. Agencies above 4 here have built operational discipline around retention; agencies below 3 will deliver junior labor under senior labels.
Scope discipline, 15 percent. How well does the agency define, document, and change scope. Score on the specificity of the initial proposal, the clarity of the change-order process, the existence of a documented sign-off threshold, and whether references reported scope disputes. Soft scope is the most common source of agency-client friction; agencies that ship discipline here are dramatically easier to work with.
References quality, 15 percent. Not just whether the references were positive but whether they were direct, unmediated, and from accounts matching your stage and category. Three direct references from comparable accounts is worth ten anonymous testimonials. Score on the number, specificity, recency, and the buy-again signal from the reference calls.
Timeline reliability, 10 percent. How often does the agency hit their own timelines. Score based on what references told you, and the agency's own willingness to share timeline-vs-actual data from past engagements. Most agencies hit timelines roughly 60 to 70 percent of the time; ones above 80 percent are exceptional and worth a premium.
Pricing transparency, 10 percent. Are line items defined, are assumptions named, is the change-order math clear. Score the proposal on a 1-to-5 basis: 1 is "single number with no breakdown," 5 is "every line item priced with effort estimates and named assumptions." Most agencies sit at 3; finding a 5 is worth the search.
Cultural fit, 5 percent. The lowest weight on purpose: operators over-weight this during the pitch because the senior team is charming and the chemistry feels good. Cultural fit matters but it cannot compensate for weak execution. A one-point gap on quality of work is worth more than a five-point gap on cultural fit. Score it, weight it, but do not let it drive the decision.
Predicted vs actual.
After the engagement closes (or after the first 12 months on a long retainer), score the actual experience on the same seven criteria. Use the same 1-to-5 scale and the same weights. The gap between your predicted score and your actual score is the data that makes your next evaluation sharper. If you scored an agency 4.5 weighted at pitch and they delivered 3.2, the difference is concentrated in one or two criteria. Identify which ones, and adjust how you weight or test those criteria for the next agency selection.
Common gaps in practice. Quality of work usually scores lower in actual than predicted because the case studies you reviewed were curated by the agency and the real codebase had quirks the case studies hid. Team continuity scores lower in actual than predicted because the senior team you met at pitch attended kickoff and then rotated off; the only way to test this in advance is to require named-engineer commitment in the contract with a change-of-personnel clause. Scope discipline scores higher in actual than predicted because the change-order process turned out to be more functional than the proposal suggested, or lower because soft scope blew up at the first complexity surprise.
Use the gap analysis to update your weighting for the next agency search. If quality of work gapped down by more than one point, weight it 30 percent next time instead of 25. If team continuity gapped down, weight it 25 percent and add a contractual requirement for named-engineer commitment. The point of the scorecard is not to grade the agency; it is to train your own evaluation reflexes so the next decision is sharper. Related reading: how to choose the perfect Shopify Plus agency for the strategy-side framing of the same decision, including how revenue stage interacts with the evaluation criteria here.
Six answers.
What are the most important questions to ask a Shopify agency?
Three are non-negotiable. One, who is the named lead engineer and designer on my account, not the salesperson who is pitching me. Two, show me three case studies in my revenue band with verifiable outcomes and named merchants we can contact. Three, walk me through the deliverables on a comparable past engagement with the actual artifacts (the discovery doc, the technical spec, the QA checklist, the launch plan). Each of these reveals execution capability that pitch decks hide. The other four good questions cover ongoing engagement mix, scope-change handling, code review and QA process, and what month 12 looks like for a typical client. Skip these at your peril; agency-pitch theater is designed to deflect them.
What red flags should I look for in agency proposals?
Six common ones. Vague scope statements that refuse to commit to specific deliverables. Fixed-price quotes without a discovery phase; no honest agency can price a complex project without first understanding the codebase and the stakeholders. We-do-everything generalist positioning where the agency claims expertise across Shopify, custom SaaS, mobile, AI, paid acquisition, and brand strategy without depth in any. No named team members on the proposal cover sheet. Cheapest-in-the-market positioning, because Shopify Plus engineering at quality is not a price-led market. References the agency hand-picks and mediates rather than letting you call them directly. Any one is a yellow flag; two or more should end the conversation.
How do I run a useful reference call with an agency's past client?
Five questions in 30 minutes. First, what did the agency not tell you upfront that you wished you knew on day one. Second, how did they handle the first missed deadline or major scope change; this surfaces operational maturity better than any pitch. Third, who actually showed up to meetings six months in; agency rosters rotate and the senior team that pitched often is not the team that ships. Fourth, what changed in month six versus month one in terms of communication, velocity, and engineer attention. Fifth, would you hire them again with the same scope and budget. Any hesitation on the last question, even a half-second pause, is your buy-or-no-buy signal. Listen for it.
How long should an agency evaluation take?
Four to six weeks for a meaningful engagement; longer if the scope is over 200K. Week one is brief writing and shortlist. Weeks two to three are proposals back, technical conversations with each shortlisted agency, and architecture discussions. Weeks three to four are reference calls (three per agency minimum) and any homework deliverables you have asked for, like a sample sprint plan or a teardown of your current site. Weeks five to six are contract negotiation, statement of work, and kickoff scheduling. Faster usually means cutting reference calls or skipping technical discovery; both increase post-signing regret. The four-to-six-week investment pays for itself in the first quarter of a multi-quarter engagement.
What's the right way to compare two agency proposals?
Use a 7-criterion weighted scorecard. Quality of work 25 percent, team continuity 20 percent, scope discipline 15 percent, references quality 15 percent, timeline reliability 10 percent, pricing transparency 10 percent, cultural fit 5 percent. Score each agency 1 to 5 on each criterion, multiply by the weight, total the weighted scores. Write the rationale next to each score so you can defend it later. The discipline of the scorecard forces structured comparison and surfaces tradeoffs you would otherwise wave away. Two warnings: cultural fit gets the lowest weight on purpose because operators over-weight it during the pitch; and a one-point gap on quality of work is worth more than a five-point gap on cultural fit.
When should I walk away from an agency I'm already engaged with?
Three triggers. One, the named lead engineer rotates off your account without proactive notice; this signals the agency is reassigning their best people to a higher-priority client or losing key staff. Two, missed deadlines on two consecutive milestones with the same explanation; one slip is bad luck or scope drift, two is process failure. Three, scope-change conversations turn adversarial, where the agency frames every change as a budget threat rather than a tradeoff conversation. Switching agencies mid-project costs roughly 60 to 90 days of velocity and 10 to 20 percent of the project budget. That cost is real, but smaller than finishing a project badly with a vendor that no longer fits.
Pick the partner that ships.
Our Shopify Plus engagements include named-engineer commitments, documented scope-change processes, and transparent line-item pricing. Run the seven questions against us; we will not flinch. Scoped quote in 48 hours.

More from Shubham.
Senior Full Stack Developer · Lucknow · 2 posts on the site
Published .