3. Five lessons from Lilli’s development

The team—and McKinsey as a whole—learned a lot over the course of Lilli’s development and continues to do so as it expands Lilli’s capabilities. Below are five of the many lessons.

1. Prompts matter—a lot

Prompt engineering is a new skill, and even with training, software engineers alone can’t do it effectively; domain experts must be involved in the process. That process is more art than science: development teams need to iterate continually to incorporate feedback and emerging best practices, and to define metrics that measure the impact of version changes and experiments.
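
One way to make those metrics concrete is to keep a small evaluation set and score every prompt version against it before shipping. The Python sketch below is a hypothetical illustration, not Lilli’s actual tooling: `ask_model` stands in for a real model client, and the keyword-overlap score stands in for whatever quality metric a team actually uses.

```python
from typing import Callable

# Hypothetical harness for comparing prompt versions against a fixed
# evaluation set. `ask_model` stands in for a real model client; the
# keyword-overlap score stands in for a real quality metric.

EVAL_SET = [
    {"question": "Summarize our pricing study for retail clients.",
     "expected_keywords": ["pricing", "retail"]},
    {"question": "Which experts have published on supply chains?",
     "expected_keywords": ["expert", "supply chain"]},
]

def score_prompt_version(prompt_template: str,
                         ask_model: Callable[[str], str]) -> float:
    """Return the fraction of expected keywords found in the model's answers."""
    hits = total = 0
    for case in EVAL_SET:
        answer = ask_model(prompt_template.format(question=case["question"]))
        for keyword in case["expected_keywords"]:
            total += 1
            hits += keyword.lower() in answer.lower()
    return hits / total

# Compare two versions before promoting one; a stub model keeps this runnable.
stub_model = lambda p: "Pricing study for retail; our supply chain expert notes..."
v1 = "Answer briefly: {question}"
v2 = "You are a research assistant. Cite sources. {question}"
print(score_prompt_version(v1, stub_model), score_prompt_version(v2, stub_model))
```

Even a crude harness like this turns “the new prompt feels better” into a number that can be tracked across versions and experiments.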

Users, too, must learn the art of prompting. “Prompt anxiety,” or not knowing what to ask Lilli, was initially a major barrier to adoption. Just one hour of prompt training boosted our colleagues’ usage substantially.

2. Be vigilant about data curation

Data privacy and intellectual property issues are rightly top of mind for organizations and must be addressed thoroughly. Lilli’s data strategy team, which includes a product manager, a data life cycle director, and legal and risk professionals, among others, plays a central role in ensuring Lilli’s compliance and security.

3. Invest in an orchestration layer

As the Lilli team experimented with off-the-shelf LLMs, it found that no single one delivered the level of specialization needed to accommodate McKinsey-specific content. For example, the word “impact” means something entirely different to a consultant than it does to, say, a worker at an auto manufacturer. And employing a large model for some simpler tasks wasn’t cost-efficient.

To address these issues, our engineers developed a patented orchestration layer that routes requests to different LLMs or other types of AI to better recognize user intent, optimize cost, and deliver high-quality responses. The layer provides the added benefit of enabling experimentation with different LLMs that easily “plug” into the system. Many organizations already face similar issues and would benefit from investing in this critical element of their gen AI systems.
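
Lilli’s orchestration layer is patented and its details aren’t public, so the Python sketch below only illustrates the general pattern the paragraph describes. Every name here (`ModelBackend`, `register`, `route`) and the routing heuristic are assumptions for illustration, not Lilli’s actual design.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of an orchestration layer: model backends "plug" into
# a registry, and a router picks one per request. All names and heuristics
# here are illustrative assumptions, not Lilli's actual (patented) design.

@dataclass
class ModelBackend:
    name: str
    cost_per_1k_tokens: float           # lets the router prefer cheaper models
    call: Callable[[str], str]          # the model's completion function

REGISTRY: dict[str, ModelBackend] = {}

def register(backend: ModelBackend) -> None:
    """Plug a new model into the system without touching routing logic."""
    REGISTRY[backend.name] = backend

def classify_intent(prompt: str) -> str:
    """Toy stand-in for intent recognition; a real system might use a small LLM."""
    return "simple" if len(prompt) < 200 else "complex"

def route(prompt: str) -> str:
    """Send simple requests to the cheapest backend and complex ones to the best."""
    by_cost = sorted(REGISTRY.values(), key=lambda b: b.cost_per_1k_tokens)
    chosen = by_cost[0] if classify_intent(prompt) == "simple" else by_cost[-1]
    return chosen.call(prompt)

# Stub backends keep the example runnable; real ones would wrap API clients.
register(ModelBackend("small-llm", 0.0005, lambda p: f"[small-llm] {p[:30]}..."))
register(ModelBackend("large-llm", 0.0300, lambda p: f"[large-llm] {p[:30]}..."))

print(route("What does 'impact' mean in a consulting context?"))
```

The design point is that routing logic and model backends stay decoupled: swapping in a new LLM is a one-line `register` call, which is what makes experimentation cheap.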

4. Test and test again—and again

Our experience building Lilli taught us to prioritize testing over development. Given the nascency of LLMs, we built active learning loops into our development process to enable swift adjustments. There were certainly bumps along the way: at one point early in the rollout, for example, we changed our chunking strategy (breaking data sets into smaller pieces to improve processing), and the model began to hallucinate. We quickly paused deployment to course-correct.
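
The article doesn’t specify what the chunking change was, but the mechanics are easy to illustrate. Below is a minimal, hypothetical Python sketch of sliding-window chunking with overlap; the sizes are arbitrary assumptions, included only to show why such parameters are worth regression-testing. Shrinking chunks or dropping overlap can strip away context and contribute to hallucinated answers.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows for retrieval.

    Hypothetical sketch with arbitrary sizes. Overlap keeps sentences that
    straddle a boundary present in both neighboring chunks, so retrieved
    passages carry enough surrounding context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Changing chunk_size or overlap silently changes what retrieval returns,
# which is why such a change deserves regression tests before rollout.
print(len(chunk_text("lorem ipsum " * 200)))
```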

5. It’s never just tech

As documented time and again, using gen AI–based applications demands an entirely new way of working. We prioritized user-adoption programming from Lilli’s inception and have embedded it everywhere possible, beginning with a platform design built around an intuitive interface and self-learning.