    How Morgan Stanley deploys AI that actually works

    Sep 16, 2025

    17,130 characters

    11-minute read

    SUMMARY

    Kaitlin Elliott, Morgan Stanley's Executive Director of Firmwide GenAI, discusses the firm's early adoption of generative AI in 2022, focusing on knowledge management solutions, rigorous evaluation frameworks, and overcoming change management hurdles for enterprise deployment.

    STATEMENTS

    • Morgan Stanley has long invested in technology as part of its core DNA, emphasizing tools that enhance employee productivity and client services.
    • In early 2022, executives were captivated by generative AI's demo capabilities, such as writing poems, prompting early investment before ChatGPT's public release.
    • The firm faced a persistent knowledge management challenge, investing in data curation and tagging for conversational AI virtual assistants serving employees and clients.
    • Traditional virtual assistants, built over years with 10,000 FAQs, only answered 10-20% of employee queries effectively.
    • Morgan Stanley's first GenAI use case targeted wealth management procedures, process documents, and research reports to address knowledge gaps.
    • Initial experiments with GPT-3 involved feeding documents directly, inadvertently creating a Retrieval-Augmented Generation (RAG) system before the term was common (a minimal sketch of the pattern follows this list).
    • Early tests revealed significant query coverage improvements, leading to a firm-wide commitment to GenAI investment.
    • Pre-existing data infrastructure for knowledge management enabled rapid prototyping and experimentation, combining strategic appetite with practical problem-solving.
    • Demos of GenAI were magical but inconsistent in production, highlighting issues like hallucinations that required iterative improvements.
    • To validate usefulness, the team created a lab environment where financial advisors tested the tool, rating responses for accuracy, completeness, and hallucinations.
    • Early evaluations included thematic bucketing of issues, such as search problems, and adding source citations to reduce hallucinations.
    • A head-to-head test showed the AI answering 25 procedure questions in under an hour with higher quality than human experts, building leadership confidence.
    • As a financial institution, Morgan Stanley prioritizes accuracy with little room for error, unlike startups that can deploy with disclaimers.
    • The first pilot lasted nine months, allowing time to develop a foundational evaluation framework, taxonomy, and regression suite of 500 questions.
    • Traditional ML evaluation metrics such as cosine similarity proved inadequate, so the team relied on human-reviewed regression testing whenever prompts or search rules changed.
    • The evaluation framework established baselines for accuracy (e.g., 80%), enabling faster governance reviews for subsequent use cases.
    • Post-deployment, a team of annotators reviews a daily subset of interactions using rubrics to grade accuracy, completeness, and source relevance.
    • Human annotation identifies core issues, such as retrieval of outdated articles, which inform new business rules; AI-assisted reviews scale this to thousands of interactions.
    • For agentic AI, evaluations must evolve to supervisory models, treating agents like employees with registries for purpose, credentials, and human sign-offs.
    • Change management involves updating governance for GenAI, distinct from traditional ML, with control teams adapting review processes to avoid endless loops.
    • Employee adoption strategies include self-service resources, advanced videos from prompt engineering to agentic solutions, and hands-on desk-side guidance.
    • Building habits, like daily market summaries via the assistant, unlocks advanced uses and boosts tool familiarity among diverse global employees.
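
    The RAG pattern the team stumbled into can be pictured with a minimal sketch: retrieve the internal documents most relevant to a query, then assemble a prompt that quotes them with numbered source citations so the model's answer stays grounded. The corpus, the naive overlap scoring, and the prompt wording below are illustrative assumptions rather than Morgan Stanley's implementation, and the actual model call is left out.

        # Minimal retrieval-augmented generation (RAG) sketch.
        # Hypothetical corpus and scoring; not Morgan Stanley's actual system.

        def tokenize(text: str) -> set[str]:
            return {w.strip(".,?").lower() for w in text.split()}

        def retrieve(query: str, corpus: list[dict], k: int = 2) -> list[dict]:
            """Rank documents by naive keyword overlap with the query."""
            q = tokenize(query)
            ranked = sorted(corpus, key=lambda d: len(q & tokenize(d["text"])), reverse=True)
            return ranked[:k]

        def build_prompt(query: str, passages: list[dict]) -> str:
            """Assemble a grounded prompt that asks the model to cite sources as [n]."""
            sources = "\n".join(f"[{i + 1}] ({p['title']}) {p['text']}" for i, p in enumerate(passages))
            return ("Answer using ONLY the sources below and cite them as [n].\n"
                    f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:")

        if __name__ == "__main__":
            corpus = [  # hypothetical procedure snippets
                {"title": "Account transfer procedure",
                 "text": "To transfer an account, submit form T-1 and obtain client consent."},
                {"title": "Research report access",
                 "text": "Research reports are published daily and tagged by sector."},
            ]
            question = "How do I transfer a client account?"
            print(build_prompt(question, retrieve(question, corpus)))  # sent to the LLM in production

    The citation markers give reviewers something concrete to verify, which is the same idea behind the source citations the team added during the pilot to curb hallucinations.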

    IDEAS

    • Generative AI's initial "magic" in demos often masks production inconsistencies, requiring real-user testing to uncover value despite flaws.
    • Pre-existing data investments can accelerate AI adoption by turning knowledge management challenges into quick-win prototypes.
    • Hallucinations in early AI were mitigated by inventing RAG-like techniques, such as source citations, before they became standard.
    • Comparing AI outputs directly to human performance reveals that AI can surpass humans in speed and coverage without needing perfection.
    • Financial firms must balance innovation with zero-tolerance accuracy, using pilots to build trust rather than rushing flawed deployments.
    • A taxonomy of errors (e.g., inaccuracy vs. incompleteness) standardizes evaluations, preventing vague metrics like "80% accuracy" from misleading teams.
    • Regression suites of benchmark questions ensure consistent quality during iterations, avoiding breakage from updates like prompt engineering.
    • Post-launch monitoring via sampled annotations detects drifts, enabling rapid triage of issues like failed search filters.
    • Agentic AI demands employee-like oversight, including registries for actions and hybrid human-AI guardians to manage complexity.
    • Change management succeeds through cultural shifts, framing AI as a co-pilot akin to Google, where users verify outputs like research sources.
    • Hands-on habit-building, such as daily AI queries for market insights, transforms passive tools into workflow staples.
    • Employees, not central teams, drive novel AI applications, so adoption barriers stem from people, not technology.
    • Governance evolution for GenAI prevents paralysis, with control partners rethinking reviews to match AI's unique risks.
    • Pilots thrive on an "innovation mindset," where bad responses are opportunities for refinement, fostering experimentation.
    • Global enterprises face adoption variances by region due to regulatory differences, requiring tailored education phases.

    INSIGHTS

    • Early AI enthusiasm must be tempered by structured evaluations to bridge the gap between dazzling demos and reliable enterprise tools.
    • Investing in data hygiene beforehand creates a flywheel for AI success, turning legacy problems into competitive advantages.
    • Human-AI comparisons demystify perfectionism, showing that even imperfect AI outperforms baselines, much like cloud storage beat local drives despite risks.
    • Standardized taxonomies in testing abstract away complexity, allowing scalable governance across use cases without reinventing frameworks.
    • Annotation blends human critical thinking with AI scaling, ensuring post-deployment vigilance catches subtle drifts before they escalate.
    • Treating agents as supervised employees highlights the need for credentialed registries, preventing unchecked autonomy in high-stakes environments.
    • Change management reframes AI as an extension of familiar tools like search engines, easing adoption by emphasizing verification habits.
    • Desk-side guidance builds intuitive habits that reveal AI's latent potential, unlocking employee-driven innovations beyond initial designs.
    • Prioritizing accuracy in finance isn't caution—it's a non-negotiable principle that safeguards against inevitable industry missteps elsewhere.
    • Barriers to AI lie in human inertia, not tech hurdles; empowering users through education catalyzes organic, creative applications.
    • Pilots with explicit failure tolerance cultivate an experimental culture, accelerating from novelty to productivity without fear of errors.
    • Global scaling demands nuanced approaches, recognizing that readiness varies by tenure, region, and role to sustain momentum.

    QUOTES

    • "When we first saw GPT I think we were using like GPT3 we thought we could like quite literally just give it the documents and it would like get it right right we were like totally in in the dark."
    • "We had actually built what is now rag but we really didn't know what rag was right um and I would say that almost immediately we were able to discover that the coverage that we had was significant."
    • "The AI was actually able to answer all questions in 25 minutes, I mean in 25 questions in an hour and the humans weren't."
    • "We're trying to get perfection and we don't expect that of humans, right?"
    • "When you're piloting, if you get bad responses that's great. like we're trying to really create this like experimental like innovation type of mindset."
    • "You have to think of it the same way you did when you first learned how to use Google for a research report, right? Like we were taught you can't believe everything that you see or read on the internet."
    • "I know you probably have a particular client that calls you every single day asking you what is Morgan Stanley's view... Why don't you come in every day and type into the assistant tell me the things that I should know today."
    • "The biggest barrier to enterprise AI isn't technical, it's people."
    • "For us, it's just a core principle that we are not budging on. Our bosses might be very different."

    HABITS

    • Daily use of AI assistants for market summaries and firm views, replacing manual Google searches to build routine integration into workflows.
    • Hands-on desk-side sessions with advisors to simulate real tasks, fostering immediate tool familiarity and habit formation.
    • Self-service engagement with internal videos progressing from basic prompt engineering to advanced agentic solutions.
    • Verification of AI outputs like citing sources, mirroring critical habits from internet research to ensure reliability.
    • Experimental piloting mindset, where users report errors daily to encourage iterative feedback and continuous refinement.
    • Thematic bucketing of issues in evaluations to habitually triage problems, such as search failures, for quick fixes.

    FACTS

    • Morgan Stanley piloted its first GenAI wealth management assistant in September 2022, months before ChatGPT's November launch.
    • Traditional virtual assistants with 10,000 FAQs covered only 10-20% of employee queries despite years of development.
    • A nine-month pilot established an 80% accuracy baseline through human-reviewed regression testing on 500 questions.
    • AI outperformed humans in answering 25 procedure questions, completing them in under an hour with superior quality.
    • The firm employs a global workforce of over 80,000, complicating adoption across regions with varying regulations.
    • Early GenAI experiments inadvertently pioneered RAG techniques, improving query coverage significantly over prior systems.
    • Daily post-deployment reviews sample hundreds of interactions, blending human and AI annotation to maintain performance.

    REFERENCES

    • GPT-3: Early large language model used in initial experiments for document-based query responses.
    • ChatGPT: Public release in late 2022, post-dating Morgan Stanley's pilot by months.
    • Retrieval-Augmented Generation (RAG): Technique discovered organically in experiments before formal naming.
    • Wealth Management Assistant: Internal GenAI tool for procedures, processes, and research reports.
    • Virtual Assistants: Pre-GenAI conversational AI for employees and clients, limited by FAQ coverage.
    • Scale.com/enterprise/agentic-solutions: Resource for building practical enterprise AI systems.
    • Human in the Loop Podcast: Series discussing real-world AI deployment, hosted by Clement Finkel and Sam Denton.

    HOW TO APPLY

    • Prepare Data Foundations: Curate and tag internal content like procedures and reports to address knowledge gaps, enabling quick AI prototyping without starting from scratch.
    • Build an Experimental Lab: Recruit end-users like financial advisors to test AI tools, gathering ratings on accuracy, completeness, and hallucinations to identify early issues.
    • Implement Regression Testing: Create a suite of 500 benchmark questions to evaluate changes to prompts, search rules, or models, ensuring updates don't degrade performance (a sketch of such a gate follows this list).
    • Conduct Head-to-Head Comparisons: Have AI and humans answer the same set of real queries simultaneously, measuring speed, coverage, and quality to demonstrate value to stakeholders.
    • Develop a Taxonomy of Errors: Categorize failures thematically (e.g., inaccuracy, incompleteness, search issues) to standardize evaluations and streamline triage during iterations.
    • Roll Out with Pilots and Monitoring: Launch internal pilots with disclaimers, sample daily interactions for human-AI annotation, and use drifts to trigger fixes like business rule adjustments.
    • Foster Habits Through Guidance: Provide desk-side training to integrate AI into daily routines, such as querying market insights, and create self-service videos for progressive learning.
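
    As a concrete illustration of the regression-testing and taxonomy steps above, the sketch below aggregates human reviewers' grades over a benchmark set, buckets failures by a shared error taxonomy, and gates a release on the accuracy baseline. The 80% threshold, the 500-question suite, and the taxonomy idea come from the talk; the record structure, labels, and function names are assumptions.

        # Sketch of a human-reviewed regression gate with an error taxonomy.
        # Field names and labels are hypothetical; grades come from human annotators.
        from collections import Counter

        TAXONOMY = {"accurate", "inaccurate", "incomplete", "search_miss", "hallucination"}
        BASELINE_ACCURACY = 0.80  # accuracy baseline cited in the talk

        def regression_gate(graded_runs: list[dict]) -> bool:
            """Return True if the candidate change keeps accuracy at or above baseline."""
            assert all(r["grade"] in TAXONOMY for r in graded_runs), "unknown taxonomy label"
            accuracy = sum(r["grade"] == "accurate" for r in graded_runs) / len(graded_runs)
            buckets = Counter(r["grade"] for r in graded_runs if r["grade"] != "accurate")
            print(f"accuracy={accuracy:.0%}, failure buckets={dict(buckets)}")
            return accuracy >= BASELINE_ACCURACY

        if __name__ == "__main__":
            # In practice this would be the full 500-question suite; five toy records shown.
            graded_runs = [
                {"question": "How do I open a trust account?", "grade": "accurate"},
                {"question": "What is the wire cutoff time?", "grade": "accurate"},
                {"question": "Which form closes a retirement plan?", "grade": "accurate"},
                {"question": "How are research reports tagged?", "grade": "accurate"},
                {"question": "Who approves fee waivers?", "grade": "search_miss"},
            ]
            if not regression_gate(graded_runs):
                raise SystemExit("Below baseline: block the prompt or search-rule change.")

    Running the same suite after every prompt or search-rule tweak is what keeps updates from silently breaking earlier behavior.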

    ONE-SENTENCE TAKEAWAY

    Morgan Stanley's GenAI success stems from rigorous evaluation, hybrid human-AI review, and deliberate habit-building to deploy reliable tools enterprise-wide.

    RECOMMENDATIONS

    • Invest in data curation early to transform knowledge challenges into AI advantages, avoiding common enterprise hesitations.
    • Prioritize user labs over demos for honest feedback, focusing on real queries to validate utility despite imperfections.
    • Establish accuracy taxonomies and regression suites as baselines, expediting governance for all future use cases.
    • Treat AI as a co-pilot requiring verification, educating users like Google habits to build trust and adoption.
    • Monitor deployments with sampled annotations, combining human insight and AI scaling to catch issues proactively.
    • For agents, design supervisory frameworks with registries and sign-offs, evolving evaluations toward employee-style oversight (a sketch of a registry check follows this list).
    • Drive change via hands-on sessions and advanced resources, empowering employees to innovate beyond central teams.
    • Embrace pilot failures as innovation fuel, setting 80% accuracy thresholds to balance speed and reliability in finance.
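
    As a minimal sketch of the registry-and-sign-off recommendation above: each agent is registered with a purpose and a set of permitted actions, and any action outside that set, or flagged as high impact, is blocked until a human approves it. The names, fields, and actions here are hypothetical, not a description of Morgan Stanley's framework.

        # Sketch of an agent registry with credential checks and human sign-off.
        from dataclasses import dataclass, field

        @dataclass
        class AgentRecord:
            name: str
            purpose: str
            permitted_actions: set[str]
            high_impact_actions: set[str] = field(default_factory=set)

        REGISTRY = {
            "report-summarizer": AgentRecord(
                name="report-summarizer",
                purpose="Summarize internal research reports for advisors",
                permitted_actions={"read_report", "draft_summary", "send_draft"},
                high_impact_actions={"send_draft"},  # client-facing output needs sign-off
            )
        }

        def authorize(agent: str, action: str, human_approved: bool = False) -> bool:
            """Allow an action only if registered; high-impact actions require approval."""
            record = REGISTRY.get(agent)
            if record is None or action not in record.permitted_actions:
                return False
            if action in record.high_impact_actions and not human_approved:
                return False
            return True

        if __name__ == "__main__":
            print(authorize("report-summarizer", "draft_summary"))                    # True
            print(authorize("report-summarizer", "send_draft"))                       # False: needs sign-off
            print(authorize("report-summarizer", "send_draft", human_approved=True))  # True
            print(authorize("report-summarizer", "execute_trade"))                    # False: not registered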

    MEMO

    Morgan Stanley's embrace of generative AI began in early 2022, well before ChatGPT captivated the world. Kaitlin Elliott, the firm's Executive Director of Firmwide GenAI, recalls executives mesmerized by demos, like AI composing poems, that hinted at transformative potential. Yet the real driver was a nagging knowledge management crisis: years of effort had yielded virtual assistants powered by 10,000 FAQs that still answered just 10-20% of employee queries. With curated data already in place for wealth management procedures and research reports, the firm saw an opportunity to leapfrog those limitations using early models like GPT-3.

    Initial experiments were naive—simply feeding documents to the AI—but quickly evolved into what Elliott now recognizes as Retrieval-Augmented Generation (RAG), a technique that boosted query coverage dramatically. Skepticism arose fast; demos dazzled, but production revealed hallucinations and inconsistencies. To counter this, the team built a lab where financial advisors tested the wealth management assistant, rating responses for accuracy and completeness. Thematic bucketing of errors, plus adding source citations, refined the system iteratively. A pivotal head-to-head trial sealed buy-in: AI handled 25 procedure questions in under an hour, outperforming humans in quality and speed.

    As a financial giant, Morgan Stanley couldn't afford errors, so a nine-month pilot forged a robust evaluation framework. Traditional metrics like cosine similarity flopped; instead, a 500-question regression suite became the backbone, tested after every tweak to prompts or search rules. This established an 80% accuracy baseline and taxonomy, streamlining governance with risk teams. "We're trying to get perfection and we don't expect that of humans," Elliott notes, echoing comparisons that revealed AI's edge over imperfect human baselines, much like cloud storage's reliability surpassing local drives despite initial fears.

    Deployment demanded ongoing vigilance. A team of annotators now reviews sampled daily interactions, grading for relevance and spotting drifts—like outdated articles from failed filters. Human judgment informs AI-assisted scaling, handling thousands of checks efficiently. For emerging agentic AI, where models act autonomously, Elliott envisions employee-like supervision: registries for agent credentials, human sign-offs at key decisions, and guardian AIs to triage failures. This cautious scaling ensures the firm's 80,000 global employees wield tools that enhance, not endanger, workflows.
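
    The post-deployment loop described above can be pictured as a small daily sampling pass: a random subset of logged interactions goes to human annotators with a fixed rubric, while the remainder is routed to an AI-assisted first pass for scale. The rubric dimensions (accuracy, completeness, source relevance) follow the talk; the log format, sample size, and routing are assumptions.

        # Sketch of daily sampling of assistant interactions for rubric-based annotation.
        # Log format, sample size, and routing are hypothetical.
        import random

        RUBRIC = ("accuracy", "completeness", "source_relevance")  # dimensions from the talk

        def daily_sample(log: list[dict], n_human: int = 3, seed: int = 0) -> tuple[list[dict], list[dict]]:
            """Split today's interactions into a human-review sample and an AI-assisted remainder."""
            rng = random.Random(seed)
            shuffled = log[:]
            rng.shuffle(shuffled)
            return shuffled[:n_human], shuffled[n_human:]

        def annotation_task(interaction: dict) -> dict:
            """Create an empty rubric sheet for a human annotator to fill in."""
            return {"question": interaction["question"], "answer": interaction["answer"],
                    "grades": {dim: None for dim in RUBRIC}}

        if __name__ == "__main__":
            log = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]  # toy log
            human_batch, ai_assisted_batch = daily_sample(log)
            print([annotation_task(x)["question"] for x in human_batch])   # goes to annotators
            print(len(ai_assisted_batch), "interactions routed to AI-assisted review")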

    Change management proved as crucial as the tech. Governance, built for traditional machine learning, required overhaul to avoid endless reviews. On the employee front, strategies blend self-service videos—from prompt basics to agentic builds—with hands-on guidance. Elliott describes sitting at advisors' desks, urging daily AI queries for market views over Google hunts. This habit-building unlocked creativity: users discovered advanced uses, proving employees, not central teams, spark innovation. Yet barriers persist—long-tenured staff resist, regions vary by regulation—underscoring that people, not tech, remain the hurdle.

    Ultimately, Morgan Stanley's journey reveals AI's enterprise promise lies in disciplined evaluation and cultural shifts. By framing tools as verifiable co-pilots, the firm fosters an experimental mindset where errors fuel progress. As agentic systems loom, their foundational principles—accuracy without budging, humans in the loop—position them as leaders, turning early gambles into scalable reality.