

Inside Morningstar’s Managed Investment Data Factory

Discover our Comparison Engine in Real Time.

Key Takeaways

  • Morningstar ingests and normalizes 200 million documents a month to enable downstream analytical operations.

  • Morningstar appends strategy data (a description of a fund manager’s investment decision-making process and portfolio construction approach) to almost every managed investment in our dataset.

  • Automation and AI have allowed Morningstar to expand our data factory’s throughput.

The growth of the managed-investments industry has left asset managers with an extraordinary abundance of choice. Paradoxically, more choice has also made portfolio research and construction more difficult. From a data perspective, one challenge is coverage of a managed-investments universe that has expanded at 8% a year since 2000. A more complex challenge is data integration. Asset managers need to compare one managed investment to another. This requires data that is described, tagged, enriched, stored, and retrieved in consistent ways. What this takes is a data factory.

At one end of the Morningstar data factory line is the raw data we obtain from regulatory filings, asset managers, websites, data aggregators, and other sources. At the other end, we distribute comparable data, analytics, and ratings on over 400,000 closed-end and open-end funds, ETFs, separately managed accounts, and collective investment trusts. In between, we ingest, normalize, and store the data, and enrich it by calculating and appending thousands of derived analytics, as well as by assigning funds to categories and strategies, and ratings to funds.

Data Ingestion and Normalization

The first task is to add some light metadata to each of the 200 million documents we ingest into our data factory every month, tagging the document with the fund it describes (see illustration). Next, we assign the document to an operational queue for data extraction and standardization. The structure of the raw data we extract is highly variable: we have over 16,000 unique data profiles for receiving or extracting information from different document types like word-processing files, PDFs, website data, spreadsheets, and APIs.

We have queues for pricing data, portfolio holdings data and operational data (such as fees, fund managers, parent companies and legal structures). Collectively, these queues employ 700 people. (Across Morningstar’s operations, we employ 2,000 people in data extraction and normalization.)

Each queue presents a different level of complexity and uses different methods for data extraction. The pricing queue is the most automated, with 99.1% of the work done by machines: people handle exceptions. Operational data, on the other hand, is more complex. We might receive a single 300-page document containing data on over 100 funds.
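The tag, queue, and exception flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` and `WorkQueue` types, the `is_machine_readable` rule, and the field names are hypothetical and do not describe Morningstar's actual systems.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    fund_id: str   # light metadata: the fund the document describes
    kind: str      # "pricing", "holdings", or "operational"
    payload: dict


@dataclass
class WorkQueue:
    name: str
    automated: deque = field(default_factory=deque)   # handled by machines
    exceptions: deque = field(default_factory=deque)  # handled by people


def is_machine_readable(doc: Document) -> bool:
    # Hypothetical rule: in this sketch only pricing documents carrying a
    # numeric "nav" value are processed without human review.
    if doc.kind != "pricing":
        return False
    return isinstance(doc.payload.get("nav"), (int, float))


def route(doc: Document, queues: dict) -> WorkQueue:
    """Place a tagged document in the automated or exception lane of its queue."""
    if doc.kind not in queues:
        raise ValueError(f"no queue for document kind {doc.kind!r}")
    queue = queues[doc.kind]
    lane = queue.automated if is_machine_readable(doc) else queue.exceptions
    lane.append(doc)
    return queue


queues = {name: WorkQueue(name) for name in ("pricing", "holdings", "operational")}
route(Document("d1", "F001", "pricing", {"nav": 12.34}), queues)
route(Document("d2", "F001", "pricing", {"nav": "n/a"}), queues)   # malformed price
route(Document("d3", "F002", "operational", {"pages": 300}), queues)
```

In this sketch the malformed price and the long operational document both land in an exception lane, mirroring the idea that machines do the routine work and people handle the rest.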

Categories and Strategies for Comparison

Normalization preps our data for analytical operations downstream. First, however, we must assign each fund to a peer-group category. Asset managers, financial advisors and investors all share a need to compare one fund to others. Asset managers need comparative analytics to market their products. Financial advisors need to be able to make recommendations. Investors need to pick the fund that’s right for them.

In the U.S. we update fund categories twice a year. We have tools that help to give shape and structure to each category. But humans run the process. As providers of independent data, we act like an industry referee. Not all asset managers agree with our categorization of their funds, and we have an appeals process to handle disputes. We must represent both the supply and demand sides of the market and come to a decision that makes sense overall: human judgment is essential.

For almost every managed investment in our dataset, we also append strategy data: a description of a fund manager’s investment decision-making process and portfolio construction approach that is agnostic of vehicle or domicile. Strategy data sits above investment-level data, identifying and linking together managed investments that employ the same strategy, an increasingly common scenario as the types and variety of investment vehicles have proliferated.

We create strategy data using both qualitative and quantitative methods. Qualitatively, we analyze the text that describes the fund’s objective, comparing the language used to other funds from the same manager, and flagging funds for further inspection that look like they have a different objective.
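As a rough illustration of the qualitative check, the sketch below compares objective statements from one manager and flags any fund whose wording overlaps little with every sibling. The similarity measure (Jaccard overlap of word sets), the 0.3 threshold, and the sample objectives are all assumptions made for this example, not the method Morningstar uses.

```python
import re


def tokens(text: str) -> set:
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def flag_outliers(objectives: dict, threshold: float = 0.3) -> list:
    """Return funds whose objective resembles no sibling fund's objective."""
    flagged = []
    for name, text in objectives.items():
        scores = [jaccard(text, other)
                  for other_name, other in objectives.items()
                  if other_name != name]
        # Flag for human inspection when even the closest sibling is dissimilar.
        if scores and max(scores) < threshold:
            flagged.append(name)
    return flagged


# Made-up objective statements for three funds from the same manager.
objectives = {
    "Growth A": "seeks long term capital growth through US equities",
    "Growth B": "seeks long term capital growth through global equities",
    "Income C": "aims to provide monthly income from investment grade bonds",
}
flagged = flag_outliers(objectives)
```

Here the two growth funds share most of their wording, so only the income fund is flagged for a closer look.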

Quantitatively, we build a view of the fund’s strategy from its portfolio holdings and the analytics we derive from them, such as where the fund falls in Morningstar’s equity style box, or its country exposures. Again, we compare the results to other funds from the same manager, asking whether the strategy is unique.
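One simple way to picture the quantitative comparison is to treat each fund's country exposures as a vector and measure how closely it matches sibling funds from the same manager. The sketch below uses cosine similarity with an arbitrary 0.95 cutoff; both choices, along with the function names and sample weights, are illustrative assumptions rather than Morningstar's methodology.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity between two exposure maps (country -> weight)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_sibling(fund: dict, siblings: dict, threshold: float = 0.95):
    """Return (sibling_name, similarity), or (None, similarity) if none is close.

    A result of None suggests the fund's strategy may be unique within
    the manager's lineup and deserves further inspection.
    """
    best_name, best_sim = None, 0.0
    for name, exposures in siblings.items():
        sim = cosine_similarity(fund, exposures)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= threshold:
        return best_name, best_sim
    return None, best_sim


# Made-up exposures for sibling funds from one manager.
siblings = {
    "Fund A": {"US": 0.6, "JP": 0.4},
    "Fund B": {"DE": 1.0},
}
match_name, match_sim = closest_sibling({"US": 0.6, "JP": 0.4}, siblings)
unique_name, unique_sim = closest_sibling({"BR": 1.0}, siblings)
```

The first fund matches "Fund A" because its exposures are identical, while the Brazil-only fund matches nothing and would be treated as a candidate for its own strategy.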

Better Coverage Through Data Automation

Once we assign categories and strategies, we can start calculating holdings-based metrics (such as the weighted average quality score for stocks in a fund) and returns-based metrics (such as the fund’s Sharpe ratio, or volatility).
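To make the two metric families concrete, here is a minimal sketch of one holdings-based and two returns-based calculations. It assumes monthly returns, a constant per-period risk-free rate, and the sample standard deviation; the function names and sample figures are hypothetical, and Morningstar's published methodologies may differ in detail.

```python
import math
from statistics import mean, stdev


def weighted_average(weights, scores):
    """Holdings-based metric: portfolio-weighted average of a per-stock score."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum(w * s for w, s in zip(weights, scores)) / total


def annualized_volatility(returns, periods_per_year=12):
    """Returns-based metric: sample standard deviation scaled to one year."""
    return stdev(returns) * math.sqrt(periods_per_year)


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=12):
    """Returns-based metric: annualized mean excess return per unit of risk."""
    excess = [r - risk_free for r in returns]
    sd = stdev(excess)
    if sd == 0:
        raise ValueError("returns have zero variance")
    return mean(excess) / sd * math.sqrt(periods_per_year)


# Made-up inputs: three holdings with quality scores, three monthly returns.
quality = weighted_average([0.5, 0.3, 0.2], [80, 60, 40])
monthly_returns = [0.01, 0.02, 0.03]
vol = annualized_volatility(monthly_returns)
sharpe = sharpe_ratio(monthly_returns)
```

Holdings-based metrics need only a portfolio snapshot, whereas returns-based metrics need a return history, which is why the two families are computed from different inputs.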

The global universe of managed investments continues to grow. We strive for completeness in our data coverage and employ automation and AI judiciously to expand our data factory’s throughput. For example, we have used supervised learning to train a large language model to assign a forward-looking Morningstar Medalist Rating to managed investment funds. With humans alone, we were able to rate 40,000 funds, once per year. With our AI-assisted “human-in-the-loop” process, we can now rate 440,000 managed investments, every month.
