



Relevant news for the oil market can be quantified by applying natural language processing (NLP) to news articles from Thomson Reuters
NLP allows one to identify the importance of news topics, and differentiate news impact based on the “newsiness” of the subject, and the positive or negative tone of the text
NLP measures of news have substantial forecasting power for oil futures returns, oil companies’ stock returns, oil price volatility, and other oil market outcomes
NLP measures provide reliable forecasting information not captured by traditional forecasting variables
NLP measures perform well in forecasting outcomes out of sample
Forecasting models based on NLP can be used to construct profitable rules for trading oil futures and oil company stocks
The NLP forecasting models can also be used to capture the implications of oil market news for the macroeconomy and for other sectors’ growth and other firms’ stock returns.
Last week Charles Calomiris and his coauthors, Nida Çakir Melek of the Federal Reserve Bank of Kansas City, and Harry Mamaysky of Columbia Business School, published “Big Data Meets the Turbulent Oil Market” in the Financial Analysts Journal. Fabio Natalucci sat down to interview Charles about the article.
Fabio: Charles, you are generally known for your work on banking, monetary economics, and corporate finance, so I was surprised to see you publish an article about the oil market. Is this your first study of the oil market?
Charles: Yes, but we have been working on it for about seven years, so I’m feeling like an oil market veteran.
Fabio: How did it come about? Why are you interested in the oil market?
Charles: Harry Mamaysky and I had published an article in the Journal of Financial Economics showing that a quantification of news flow—using Thomson Reuters as the corpus and applying recent methods in natural language processing to measure news—could predict equity returns over the medium term surprisingly well. At the time, I was an occasional visitor to the Kansas City Fed. I got to know Nida Çakir Melek, who had been working on energy projects there for some time. Nida and I came up with the idea of applying similar techniques to the energy market, and we were able to get Harry interested in joining us.
Fabio: You say it took seven years to write the article. What took so long?
Charles: We faced several thorny econometric challenges, which took some time to resolve. But most importantly, it was a highly data-intensive project, to put it mildly. We constructed a database that combined every known predictor of the oil market; a complete list of oil market-related outcomes, including oil futures and spot returns, stock returns for the major oil producers, oil price volatility, and changes in oil production and inventories; and our newly constructed text-based measures of energy news flow, derived from millions of newspaper articles published by Thomson Reuters. Along the way we also created an energy lexicon to identify words that tend to be associated with energy news. That lexicon was instrumental in dividing energy news into topical categories based on word co-occurrence, categories that capture what the news is about, such as oil production, refining, or exploration. The data we collected are high-frequency (weekly) observations over many years. We didn't confine ourselves to forecasting oil futures returns; we also looked at spot returns, oil volatility, oil production and inventories, and the stock returns of the oil majors. I can say with confidence that no one has tried to forecast all these variables before in an integrated analysis, and no one has amassed data combining such a comprehensive list of text and non-text explanatory variables.
Fabio: Was that your intention from the outset?
Charles: Not exactly. From the beginning we wanted a comprehensive list of oil-related outcome variables to forecast, and the intention to collect comprehensive textual news measures that allowed for topical differences was there from the outset, given that Harry and I had found that topical approach useful in our equity returns paper. But the comparisons we ended up making to every prior forecasting study were something that came out of the refereeing process. The referees and the editor wanted us to show that our new text-based forecasters added predictive power after one controls for the predictors used in prior studies, and the only way to do that was to collect all those prior measures. We were gratified to find that our news measures were not spanned by prior studies' measures; in other words, we were able to capture important news that mattered for forecasting in ways that other studies' measures could not. In a horse race to identify useful forecasting variables, we find that text-based news measures add a lot of forecasting power.
Fabio: What sort of text-based news measures did you construct? Can you provide examples?
Charles: Based on prior work in textual analysis—including studies that made use of topic modeling, such as the study that Mamaysky and I did of global equity markets—we knew that certain aspects of news were likely to be important, and those priors were confirmed by our analysis of the oil market. The dimensions we explored included topical weight (how much this week's news articles focused on a particular topic) and topical sentiment (whether the news about the topic tended to have a positive or negative tone). We also included measures of the "newsiness of the news," which we call entropy or unusualness. Not all news is equally new or impactful, and capturing the newsiness of news is important if you are measuring market responses to it. Our measure of newsiness looks at all the phrases that appear in one week's news articles, compares them with the phrases that appeared in prior weeks' articles to see how frequently they occurred, and then aggregates across all the phrases' unusualness to construct a composite unusualness measure for the week. We also tracked the overall amount of energy news appearing per week (the energy news article count). Additionally, we calculated the first principal component of our 14 topical measures. The first principal component of a set of variables captures their most important single common factor; we included it to cover the possibility that a parsimonious measure with less topic-specific differentiation might outperform the individual measures. We also included separate principal components for the seven topical frequency and seven topical sentiment series. In all, we derived 19 text-based measures that were candidates for inclusion in our forecasting models.
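The unusualness idea Charles describes can be sketched in a few lines. This is a simplified stand-in, not the paper's actual entropy construction: the phrase tokenization, add-one smoothing, and averaging here are all assumptions of the illustration. It scores a week's articles by how surprising their phrases are relative to the phrase distribution of prior weeks:

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """All contiguous n-word phrases in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def weekly_unusualness(this_week, prior_weeks, n=2):
    """Toy 'newsiness' score: average surprise (negative log frequency)
    of this week's phrases under the prior weeks' phrase distribution,
    with add-one smoothing so never-seen phrases score highest."""
    prior = Counter()
    for week in prior_weeks:
        prior.update(ngrams(week, n))
    total = sum(prior.values()) + len(prior) + 1  # add-one smoothing mass
    phrases = ngrams(this_week, n)
    if not phrases:
        return 0.0
    return sum(-math.log((prior[p] + 1) / total) for p in phrases) / len(phrases)
```

A week full of phrases never seen before scores higher than a week that repeats familiar phrasing, which is the qualitative behavior the entropy measure is after.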
Fabio: How did you construct that horse race, given that there were so many variables to choose from?
Charles: We used a regression method called stepwise forward selection, which selects variables based on their explanatory power as measured by R-squared (the incremental share of the outcome variable's variance that a candidate forecasting variable explains). The selection proceeds sequentially. You begin by comparing the R-squared that each candidate variable would achieve in isolation and pick the one that delivers the most explanatory power. Then, given that first variable, you select a second one the same way, by finding the candidate that adds the most explanatory power as the second variable in the forecasting regression. We found that a parsimonious forward-selection model, with only a few predicting variables, worked well.
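The selection loop described above is straightforward to sketch. This is a minimal illustration under simplifying assumptions (plain OLS, fixed number of steps, no significance threshold), not the authors' estimation code; it greedily adds whichever candidate column raises in-sample R-squared the most:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS regression of y on the columns of X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

def forward_select(candidates, y, k=3):
    """Greedy stepwise forward selection: at each step add the candidate
    (a name -> column mapping) that most increases in-sample R^2,
    stopping after k variables for parsimony."""
    chosen, remaining = [], list(candidates)
    for _ in range(k):
        best = max(remaining,
                   key=lambda name: r_squared(
                       np.column_stack([candidates[c] for c in chosen + [name]]),
                       y))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

On simulated data where only two of three candidates actually drive the outcome, the loop picks the strong predictor first and the weaker true predictor second, leaving the noise column out.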
Fabio: And you found that the text variables were frequently selected?
Charles: Yes. Non-text variables make up 54 percent of the candidate pool but account for only 50 percent of the selected predictors, so text measures are chosen more than proportionally. The text measures that are selected also tend to be more statistically significant and more useful out of sample than the non-text measures that are selected.
Fabio: You mentioned that your approach measures news within different topical categories. What are those topic areas and how did you arrive at them?
Charles: We started by constructing a list of words and short two- or three-word phrases that we expected would generally appear in energy news, drawing on relevant glossaries and our own knowledge of the industry. That gave us a start. Then we used the corpus of Thomson Reuters articles to flesh out the list with other words that frequently co-occur in articles with the words on the initial list. That produced our complete energy word list. Next we applied the Louvain method, an algorithm developed to divide a list of words (in this case, our energy lexicon) into topical categories based on which groups of words tend to cluster together within news articles. In Figure 2 from our paper, reproduced below, you can see each topic as a separate word cloud, with the size of each word indicating its relative frequency of appearance. The groupings and the total number of topics are the result of co-occurrence within articles; the algorithm, not the authors, determines both. We end up with seven topical categories, which are readily interpretable. The labels we use for them are arbitrary, but if you look at the word clouds, you will immediately see that the words within each cluster are related to one another, forming a common topic.
Based on an analysis of the entire sample as a whole, energy discussions divide into what we call: company news (Co), global oil market news (Gom), distribution network news (Dist), generation & environmental news (Ge), crude oil physical news (Bbl), refining and petrochemicals news (Rpc), and exploration and production news (Ep).
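As a hedged sketch of this pipeline (the toy corpus and word lists below are invented for illustration, and the real paper works with millions of articles), one can build a word co-occurrence graph and hand it to an off-the-shelf Louvain implementation such as the one in networkx:

```python
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import louvain_communities

def cooccurrence_graph(articles):
    """Build a word graph whose edge weights count how often two
    lexicon words appear together in the same article."""
    G = nx.Graph()
    for words in articles:
        for a, b in combinations(sorted(set(words)), 2):
            w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
            G.add_edge(a, b, weight=w)
    return G

# Invented toy corpus: "exploration"-flavored and "refining"-flavored articles.
articles = (
    [["drill", "rig", "shale"], ["drill", "shale", "basin"],
     ["rig", "basin", "drill"]] * 3
    + [["refinery", "crack", "spread"], ["refinery", "spread", "diesel"],
       ["crack", "diesel", "refinery"]] * 3
)
G = cooccurrence_graph(articles)
# Louvain groups words that co-occur heavily into topic clusters.
topics = louvain_communities(G, weight="weight", seed=1)
```

On this toy corpus the algorithm recovers two clusters, one per article flavor; in the paper the same idea, run on the full energy lexicon and corpus, yields the seven topics shown in the word clouds.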
Fabio: Did some topical categories prove to be more important for some of the variables you are forecasting than for others?
Charles: Yes, and the patterns we find make sense. For example, physical market news, in terms of both its sentiment and frequency (sBbl, fBbl) helps forecast inventories and production. Oil company sentiment (sCo) is important for predicting oil futures and spot returns. News related to refining, petrochemicals, and distribution (fRpc, sDist) predicts the stock returns of ExxonMobil, BP, and Shell—all of which have operations in those lines of business. Those patterns highlight the interpretability of our topical framework and its value for practitioners and policymakers who want to forecast energy market outcomes.
Fabio: Did you find that the selected forecasting variables were the same throughout time, in both the early and late parts of the sample?
Charles: No, things change over time, and that is not surprising. If one allows the model to be estimated on a limited period, and then re-estimates it every week on a forward rolling basis, one finds big changes in which variables are selected over time. That makes sense when you think of how the importance of different types of news changes. Sometimes geopolitical events, a war in Iraq for example, dominate the news; sometimes technological change, such as fracking, dominates. And over the long term, the oil market's production technologies are subject to big changes. In light of these changes, the rolling estimation window one should use is short, a matter of only a few years. Unfortunately, that can limit the statistical power of estimation and create the potential for over-fitting, where one appears to get a great in-sample statistical fit that doesn't hold up in out-of-sample forecasts. The failure arises not just because of structural change out of sample, but because the low statistical power imposed by the short sample can make the estimates imprecise and therefore less reliable out of sample.
Fabio: Do you think this is a common problem in econometric studies?
Charles: I'd say it is a central challenge in time series analysis—the econometric approach one uses when modeling something like the globally integrated oil market—in contrast to the more powerful panel data analysis that is possible in other contexts. Although there is more than one kind of oil traded, the movements of, say, WTI and Brent, while not identical, are closely related, and the same news series are important for both. In time series analysis, one measures a single outcome (like oil returns) at each point in time. That outcome changes over time, but one doesn't measure independent variation in it across locations at a point in time. With panel data—for example, the study of global equity markets that Harry Mamaysky and I did before, where we had data on many countries in each month—there are many different observations at each point in time, and consequently enough statistical power to derive reliable estimates despite the short sample periods required by changes in behavior over the years. Given this inherent challenge of low statistical power in time series analysis, what appear to be significant forecasters might instead be the result of random variation. With few observations it is easier to confuse the two.
Fabio: How do you address the problem?
Charles: First, when testing for out-of-sample robustness of our forecasts, we keep the models parsimonious, allowing only a small number of variables to enter the forecasting model. Second, we apply a high threshold of statistical significance for the inclusion of any variable. Third, we validate our model with out-of-sample testing, applying several criteria for judging the out-of-sample performance of the forecasts. One criterion measures the improvement in fit in terms of variance explained; another measures the profitability (or economic oomph) of a simple trading strategy based on the forecasting model. The model holds up well under both criteria. We also experimented with an additional out-of-sample approach, a novel way of further improving the robustness of forecasts, which we find makes the out-of-sample performance even better. That approach lets the forecaster vary something you might call "forecasting ambition" over time—that is, we allow the forecasting model to purposefully choose a model with lower in-sample explanatory power if its recent out-of-sample performance is better.
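The fit-improvement criterion can be illustrated with a rolling one-step-ahead loop. This is a simplified stand-in for the paper's procedure: the single predictor, the fixed 52-week window, and the trailing-mean benchmark are all assumptions of the sketch, not the authors' specification.

```python
import numpy as np

def rolling_oos_r2(signal, ret, window=52):
    """Rolling one-step-ahead evaluation: re-fit OLS of ret on an
    (already lagged) signal over the trailing `window` weeks, forecast
    the next week, and compare cumulative squared errors with a
    trailing-mean benchmark. A positive out-of-sample R^2 means the
    signal beats the no-predictability benchmark."""
    se_model, se_bench = 0.0, 0.0
    for t in range(window, len(ret)):
        xs, ys = signal[t - window:t], ret[t - window:t]
        slope, intercept = np.polyfit(xs, ys, 1)       # re-estimate each week
        se_model += (ret[t] - (intercept + slope * signal[t])) ** 2
        se_bench += (ret[t] - ys.mean()) ** 2          # benchmark: trailing mean
    return 1 - se_model / se_bench
```

A genuinely predictive signal produces a clearly positive out-of-sample R-squared under this loop, while an unrelated noise series hovers around zero, which is the distinction the out-of-sample tests are designed to draw.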
[Figure 2: word clouds of the seven energy news topics]
Note: This figure shows the word clouds of the topics extracted from the energy corpus using the Louvain clustering algorithm applied to the full sample. A larger font indicates words that occur more frequently in a given topic.
Fabio: This quantification of news using natural language processing sounds incredibly interesting, but your measures might strike some people as implausible. How do you validate the news measures?
Charles: We think the best way to validate them is to show that the energy news events people already believe to be important align with the moments our measures identify as times when news mattered. To do so, we identify the moments that our model flags as very "newsy" stories (ones with high entropy scores because they contain many unusual phrases that do not appear in prior news stories) and that also place a high topical weight on one of our topic areas. When we list those moments in the data, they align very well with major episodes of large price changes, and those episodes were recognizable to us as news stories we remembered reading about. The topical characterizations our model applies to them also made sense to us. Those examples are listed in Table 3 of our paper, and they appear below, so readers can reach their own conclusions about whether the model is validated by their knowledge.
Fabio: How do you envision the world making use of your analysis? In which other sectors do you think your analysis would work well?
Charles: I see three main categories of uses. First, the model can guide a trading strategy in the oil market. We show that a simple (non-optimized) trading rule that takes either a long or short position in oil futures at all times (so you are always in the market, either long or short) delivers out-of-sample annualized Sharpe ratios in excess of 0.5 for trading oil futures, whether using only text data as forecasters or a mix of text and non-text data. In the trading outcomes for oil futures and oil majors' stocks, the highest Sharpe ratios tend to occur for models based solely on text measures. These trading strategies entail only a few trades a year (going from long to short or vice versa), so execution costs are small. Optimizing these strategies to be more selective (trading only when the forecasting model sends a sufficiently strong signal) would give much better performance, so this simple trading rule is conservative, and very encouraging.
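The always-in-the-market rule reduces to trading the sign of the forecast. As a minimal sketch (weekly data and zero transaction costs are assumptions of the illustration, and the inputs are hypothetical):

```python
import numpy as np

def sharpe_of_sign_rule(forecast, realized, periods_per_year=52):
    """Always-in-the-market rule: hold one unit long when the forecast
    return is positive, one unit short otherwise. Returns the
    annualized Sharpe ratio of the resulting strategy returns,
    ignoring trading costs."""
    position = np.where(forecast > 0, 1.0, -1.0)  # long or short, never flat
    strat = position * realized
    return strat.mean() / strat.std() * np.sqrt(periods_per_year)
```

Feeding the rule a forecast that is positively correlated with realized returns yields a positive Sharpe ratio, and inverting the forecast flips its sign, so the statistic behaves as a sanity check on whether the model's directional signal carries economic value.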
Second, people who track the oil industry can use the forecasting model as a way of quantifying the importance of news stories they read about the industry. Call it smart reading of the newspaper.
Third, one can use our model to incorporate its forecasts into a macroeconomic forecasting model, or one applied to particular sectors or firms. Given how important the oil market is for the economy, forecasts of oil market returns or volatility are likely to be very useful for predicting overall growth, and sector- or firm-specific performance. One way to think about oil’s importance is to note how much a stock market factor based on oil returns contributes to understanding differences in firms’ stock market returns. We show that, compared to commonly used stock market factors (e.g., the Fama-French factors, or momentum) oil is a more important factor than any other except the average market factor itself.
Fabio: Do you plan to use this methodology at the Institute to monitor oil market developments or apply it to other markets?
Charles: Both. One thing we could do is create a weekly or monthly report based on the forecasting model, combining the quantitative forecasts with qualitative interpretations of the news that underlies them. I also think the forecasts are likely to have important further implications for the stock returns of different firms and sectors, which could be a separate monthly report we could publish. Furthermore, the methodological innovations of the paper, especially with respect to out-of-sample testing, can be applied to other macroeconomic forecasting tools.
Fabio: One final question: What did you learn from this and other research you have done about the future directions of empirical research using LLMs, NLP, and AI?
Charles: What I find most interesting is that in the five studies I have done using these new methods, I have never exactly followed the method of any prior paper, including my own. There is nothing mechanical about applying these new tools if you want to do it right. It requires thinking creatively about how the question one is asking is best addressed with the tools of NLP and AI, and there is always a need to innovate methodologically. I am almost done with two studies: one looks at how to measure the effects of regulation on firms, and the other explores how to understand market reactions to Federal Open Market Committee releases. Almost nothing about the tools we used in the oil paper shows up in the natural language processing approaches we take in those two papers. I find that reassuring, because it means that human researchers like us will not be easy to replace. There is a lot of judgment involved, and a lot of creativity required to make the most of the new tools.
©2025 Andersen Institute for Finance & Economics. All Rights Reserved. This material is confidential intellectual property of the Andersen Institute for Finance & Economics. The views expressed in this note are those of the authors and do not represent an official position of The Andersen Institute for Finance and Economics or affiliated organizations. By viewing this Andersen Institute Note, you agree that you will not directly or indirectly copy, modify, record, publish, or redistribute this material and the information therein, in whole or in part. No warranty or representation, express or implied, is made by the Andersen Institute or any of its affiliates, nor does Andersen accept any liability with respect to the information and data set forth herein. Distribution hereof does not constitute legal, tax, accounting, investment or other professional advice. The information provided herein is not intended to provide a sufficient basis on which to make an investment decision. Recipients should consult their own advisors, including tax advisors, before making any investment.