Gearing Up for the AI Race: Preparing Data to Go the Distance
Learn how human oversight and strategic tagging are setting the pace, and how Morningstar built a global classification system for thematic funds.

When we talk about artificial intelligence (AI) in investing, the conversation often jumps straight to models, predictions, and machine-learning breakthroughs. But behind every reliable AI application is something less flashy, yet far more important: well-prepared data. Without structured, explainable, high-quality data, even the most advanced AI systems can be prone to hallucination—delivering flawed or misleading results.
I’m currently training for multiple ultra-endurance solo cycling world records, and it has me reflecting on the similarities between that kind of preparation and the work we do preparing data to power AI. In both cases, success hinges on meticulous planning, the ability to adapt mid-course, and some serious legwork.
As thematic investing accelerates—with assets growing from USD 269 billion to USD 562 billion over the past five years—and asset managers demand sharper insights from increasingly vast datasets, Morningstar has developed a global classification system that addresses the core challenge of making investment data truly AI-ready.
In this article, I’ll share how we created a thematic fund dataset from the ground up; how we blended human expertise with machine learning to do it; and why high-quality, explainable data isn’t just good hygiene—it’s a competitive edge.
Data Is the Fuel, but It's Often Unfit for the Ride
Just as a cyclist can’t perform without the right fuel for their body, AI systems can’t perform without quality data. The reality is that many financial datasets are messy, unstructured, or inconsistent. They lack the explainability and traceability required for investment-grade analysis, and they’re often modeled retroactively to fit analysis tools that assume a high degree of standardization.
This is especially true in thematic investing, one of the fastest-growing corners of the fund landscape. Thematic funds aim to capture structural shifts in areas like AI, clean energy, aging demographics, or cybersecurity.
But unlike traditional sector or style funds, thematic funds don’t always fit neatly into standard categories—and that makes them harder to analyze, benchmark, or compare. At Morningstar, we’ve worked to bring more clarity and structure to that space.
Mapping the Mess: How We Built a Global Thematic Fund Classification System
Imagine plotting a 5,000-kilometer bike race through unfamiliar terrain. That’s what it felt like when we began categorizing over 50,000 thematic funds from scratch. The first challenge? Mapping the terrain by defining what actually qualifies as thematic.
We started by identifying funds based on intentionality—grouping together funds that say they invest in similar themes. We reviewed investment objectives, legal names, and prospectuses to build a universe of funds explicitly focused on themes like AI, blockchain, or water. Then we built a three-tiered classification system:
- Broad Theme (e.g., Social)
- Theme (e.g., Demographics)
- Sub Theme (e.g., Aging Populations)

This structure allows for flexibility; a fund tagged with Blockchain can also be grouped under broader categories like Fintech or Technology, depending on use case. Importantly, this wasn’t a purely machine-driven effort. Human analysts led the research, curating themes and validating categorizations. Only after we established a rigorous framework did we layer in natural language processing (NLP) to help scale the process.
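The three-tier hierarchy and its roll-up flexibility can be sketched as a simple data structure. This is a minimal illustration, not Morningstar's actual implementation; the `TAXONOMY` mapping and class names are hypothetical, seeded only with the examples mentioned above.

```python
from dataclasses import dataclass

# Hypothetical mapping: each sub-theme rolls up to a theme and a broad theme.
# Entries are illustrative, taken from the examples in this article.
TAXONOMY = {
    # sub_theme: (theme, broad_theme)
    "Aging Populations": ("Demographics", "Social"),
    "Blockchain": ("Fintech", "Technology"),
}

@dataclass(frozen=True)
class ThematicTag:
    sub_theme: str

    @property
    def theme(self) -> str:
        return TAXONOMY[self.sub_theme][0]

    @property
    def broad_theme(self) -> str:
        return TAXONOMY[self.sub_theme][1]

def group_by(tags, level):
    """Group tags at the requested tier: 'sub_theme', 'theme', or 'broad_theme'."""
    groups = {}
    for tag in tags:
        groups.setdefault(getattr(tag, level), []).append(tag.sub_theme)
    return groups

# The same fund tag can be analyzed at any tier, depending on use case.
tags = [ThematicTag("Blockchain"), ThematicTag("Aging Populations")]
print(group_by(tags, "theme"))
print(group_by(tags, "broad_theme"))
```

Because the hierarchy lives in one lookup table, a fund tagged once at the sub-theme level can be regrouped under its theme or broad theme without re-tagging.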
The NLP–Human Loop: Navigating with Guardrails
Our NLP system scans every new fund launch, using a curated library of terms drawn from thousands of fund documents to flag potential thematic funds and suggest tags.
But NLP isn’t enough on its own. Like a GPS, it can recommend a route, but it can’t decide which roads are washed out or whether you should detour because of weather. That’s where our analysts come in. They review NLP-suggested categorizations and apply real-world expertise. For example, a fund’s name may strongly signal “Artificial Intelligence + Big Data,” yet the strategy may simply use AI techniques to select stocks—making it a traditional active stockpicker rather than a thematic fund seeking to profit as AI technology ripples through the global economy.
This constant feedback loop—machine recommendations, human validation—ensures that the model gets smarter over time. And because the entire process is explainable, we can show clients exactly how and why each tag was applied.
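The flag-then-validate loop described above can be sketched in a few lines. This is a hedged, simplified illustration under assumed names (`TERM_LIBRARY`, `suggest_tags`, `review` are hypothetical, not Morningstar's production system): a curated term library flags candidates, and an analyst confirms or rejects each suggestion.

```python
# Hypothetical curated term library: sub-theme -> signal terms.
TERM_LIBRARY = {
    "Artificial Intelligence + Big Data": ["artificial intelligence", "machine learning", "big data"],
    "Clean Energy": ["clean energy", "solar", "wind power"],
}

def suggest_tags(fund_text):
    """Machine step: flag every sub-theme whose terms appear in the document."""
    text = fund_text.lower()
    return [tag for tag, terms in TERM_LIBRARY.items()
            if any(term in text for term in terms)]

def review(suggestions, analyst_decision):
    """Human step: an analyst confirms or rejects each machine suggestion."""
    return [tag for tag in suggestions if analyst_decision(tag)]

# Example mirroring the article: the text signals AI, but the analyst
# knows the fund merely uses AI to pick stocks, so the tag is rejected.
prospectus = "The fund uses machine learning models to select undervalued equities."
suggestions = suggest_tags(prospectus)
confirmed = review(suggestions, analyst_decision=lambda tag: False)
print(suggestions, confirmed)
```

In a real pipeline, each confirmation or rejection would also be logged and fed back to refine the term library, which is what lets the model get smarter over time.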
Investment-Grade Data: What It Really Takes
I sit in a unique position: I help create this data, and I use it every day. As a power user, I need reliability, and I insist on that trustworthiness being the standard. Morningstar uses these thematic tags across products and teams, from our index team building investable indices, to our analysts writing research, to customers using our platform to screen for opportunities.
Having analyzed thematic products, I know firsthand how inconsistent classifications can muddy comparisons and mislead investors. That’s why I place so much emphasis on structure, reliability, and transparency.
So, what makes this data truly investment-grade? I’ve distilled it into five core checkpoints:
1. Human context
The thematic fund classification system was created and is maintained by analysts. This human touch allows the framework to stay relevant to the needs of a range of clients globally as the investment world evolves.
2. Scalable automation
The system leverages Morningstar’s vast datasets to process relevant inputs at scale, highlighting thematic intentionality and recommending appropriate tags. NLP does the heavy lifting, making quality achievable at scale.
3. Human oversight and validation
Every classification is reviewed by analysts, ensuring alignment with investor expectations.
4. Global consistency and local context
The classification system works globally, but accounts for regional naming nuances and disclosure differences. Local analysts validate where needed.
5. Traceability and explainability
Every tag and classification is fully auditable, with a record of sources and logic. This is essential not only for compliance, but also for trust.
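Checkpoint 5 above, traceability, amounts to keeping a structured record behind every tag. A minimal sketch of what such an audit record might contain, with hypothetical field names and values (`make_audit_record` is illustrative, not an actual Morningstar API):

```python
import json
from datetime import datetime, timezone

def make_audit_record(fund_id, tag, matched_terms, sources, analyst):
    """Build an auditable record of how and why a tag was applied."""
    return {
        "fund_id": fund_id,
        "tag": tag,
        "matched_terms": matched_terms,   # terms that triggered the suggestion
        "sources": sources,               # documents the terms were found in
        "validated_by": analyst,          # human sign-off
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Illustrative example with made-up identifiers.
record = make_audit_record(
    fund_id="F000012345",
    tag="Aging Populations",
    matched_terms=["longevity", "aging population"],
    sources=["prospectus_2024.pdf"],
    analyst="analyst_42",
)
print(json.dumps(record, indent=2))
```

With a record like this attached to each classification, a client or regulator can trace a tag back to the exact terms, documents, and analyst decision that produced it.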
Avoiding Blind Spots: How Explainability Beats Black Boxes
Many AI-driven data models function like black boxes—making opaque decisions that can’t easily be audited. That might work for casual applications, but not when institutional money is on the line.
We’ve taken steps to ensure our thematic tagging approach is fully transparent. Clients can see which terms were flagged, which inputs were used, and how the final decisions were made. In a world where regulators are increasingly scrutinizing how data feeds AI models, this kind of explainability is also a necessity.
When I’m validating a world record, the judges can’t just take my word for it. I must supply a range of supporting material, including GPS, heart rate, and power data, alongside photographs and witness statements. We believe investors should demand a similar level of transparency when it comes to data.
The Road Ahead: Continuous Optimization
Thematic investing isn’t going anywhere, and neither is the demand for AI-powered insights.
Our next steps include enhancing our global classification system to ensure it remains the gold standard in a rapidly changing world, building our local language capabilities to ensure the deepest local market coverage possible, and expanding our coverage to new fund structures.
But with more data comes more noise, so maintaining explainability and standards will only become more important. At Morningstar, our goal is to help investors navigate complexity with greater clarity. That means building structured data that’s not just big, but brilliant—explainable, and built to go the distance.
Because in both disciplines, data and endurance training alike, the key to a successful race is preparation. And if you get that right, everything else follows.