• Analytics Blog
Feb 4th, 2019
We've scraped thousands of transcripts from interviews with professional golfers dating back to 1997. All the data is available at this incredible resource. We show three short analyses here: first, for each player with a substantial number of interviews in the data, we've found which words they use the most relative to their peers. Second, we classified the sentiment of every interview on a scale from 0 (most negative) to 1 (most positive), and looked at differences in average sentiment across players and also at trends in sentiment over time for specific players. Third, we combined our interview data with PGA Tour scoring data from 2004-present to determine which players get interviewed more, or less, than their performances would predict.
Part 1: You are what you... speak?
In the word clouds below, the words that a golfer uses most often relative to their peers are bigger in size and lighter in colour.
Notes: The size and coloring of words are determined by comparing how often a player uses the word to how often everyone else uses it. For example, Spieth uses "Michael" in 0.25% of all his words, while his peers use "Michael" in just 0.01% of all their words, meaning Spieth uses the word 25 times as often as his peers. For this analysis a set of common "stop words" (e.g. "is", "at", "don't", etc.) are removed because they are not interesting. We also "stem" words, meaning we reduce them to their root form (e.g. "blessed" becomes "bless"); unfortunately this means some words got chopped that shouldn't have (e.g. Chambers from Chambers Bay became "chamber"). Finally, there are minimum thresholds on how often the word has to be used by the golfer to be included (~0.01% of all words). This visualization follows this example closely, and more generally uses this library.
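The relative-frequency calculation in the note above can be sketched in a few lines. This is our own minimal illustration (the function name, the zero-division guard, and the threshold handling are ours, not from the post; the real analysis also stems words and removes stop words first):

```python
from collections import Counter

def relative_frequencies(player_words, peer_words, min_share=1e-4):
    """Ratio of a player's word-usage share to the rest of the field's.

    Ratios > 1 mean the player uses the word more often than peers.
    min_share mirrors the ~0.01% inclusion threshold mentioned above.
    """
    player_counts, peer_counts = Counter(player_words), Counter(peer_words)
    n_player, n_peer = len(player_words), len(peer_words)
    ratios = {}
    for word, count in player_counts.items():
        share = count / n_player
        if share < min_share:
            continue  # word too rare for this player to be included
        # floor the peer share to avoid dividing by zero
        peer_share = max(peer_counts[word] / n_peer, 1 / n_peer)
        ratios[word] = share / peer_share
    return ratios

# Toy numbers echoing the Spieth/"Michael" example: 0.25% vs 0.01%
player = ["michael"] * 25 + ["golf"] * 9975
peers = ["michael"] * 1 + ["golf"] * 9999
print(relative_frequencies(player, peers)["michael"])  # about 25
```

The word-cloud sizing and coloring are then driven by these ratios.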
Part 2: The story of player sentiment, as told by an algorithm
We classified the "sentiment" of every interview in our dataset on a scale from 0 to 1 (0 being very negative, 1 being very positive). There are a few different ways you could do this that don't require actually reading any of the interviews. The method we chose required first finding a public dataset containing segments of text that had been labelled as positive or negative. This has been done for a set of 50,000 movie reviews from IMDB (data is available in a nice format here). Humans have actually gone through and labelled each of these movie reviews as either positive or negative (sounds like a fun exercise). We use this data to fit a model that takes a block of text as its input, and outputs a number from 0 to 1 representing how likely it is that the text is positive in sentiment.

The way the algorithm works is pretty simple and transparent. From the movie review data, we have determined which words are associated with positive reviews and which with negative reviews. For example, here are the 5 words most strongly associated with movie reviews labelled as positive: excellent, perfect, favorite, superb, wonderful; and here are the top 5 negative words: worst, waste, awful, disappointment, poorly. When you feed in a new block of text, such as a golfer's interview transcript, the algorithm goes through it word by word and, depending on the incidence of "positive" and "negative" words, assigns a number from 0 to 1 reflecting the likelihood that the text is positive in sentiment.
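A word-based scorer of this kind can be sketched as follows. This is not the model we fit; it is a minimal stand-in for illustration (all names are ours), which weights each word by its smoothed log-odds of appearing in positive versus negative training text and squashes the total to a 0-1 score:

```python
import math
from collections import Counter

def train_word_scores(reviews):
    """Learn per-word weights from labelled (text, label) pairs.

    Weight = log-odds of the word appearing in positive vs negative
    reviews, with add-one smoothing. A crude stand-in for whatever
    classifier is actually fit to the IMDB data.
    """
    pos, neg = Counter(), Counter()
    for text, label in reviews:
        (pos if label == 1 else neg).update(text.lower().split())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    vocab = set(pos) | set(neg)
    return {w: math.log((pos[w] + 1) / (n_pos + len(vocab)))
             - math.log((neg[w] + 1) / (n_neg + len(vocab)))
            for w in vocab}

def sentiment_score(text, word_scores):
    """Map a block of text to (0, 1); above 0.5 leans positive."""
    total = sum(word_scores.get(w, 0.0) for w in text.lower().split())
    return 1 / (1 + math.exp(-total))  # squash log-odds to a probability

# Tiny training set built from the example words in the text
reviews = [("excellent superb wonderful", 1),
           ("worst awful waste", 0),
           ("perfect favorite", 1),
           ("disappointment poorly", 0)]
scores = train_word_scores(reviews)
print(sentiment_score("a superb and wonderful round", scores))
```

Unknown words contribute nothing, so a block of text with no sentiment-bearing vocabulary scores exactly 0.5.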

A potential problem here is that we've used data on movie reviews to fit our sentiment model, but what we actually want to predict is the sentiment of golfers' interviews. There are some words that have a positive connotation in movie reviews, but don't carry the same meaning in an interview with a golfer (e.g. "must-see"). It's hard for us to gauge the extent of this issue, but our sense is it's not a huge problem. Looking through the top 100 "positive" words in our model we find nearly all of their meanings are not specific to a movie-reviewing context.

The result of this analysis is a dataset from 1997-2019 containing the date of an interview and the name of the golfer interviewed, along with a sentiment score between 0 and 1.

The overall mean sentiment score in our set of interviews is 0.89; this can be roughly interpreted as the fraction of interviews that are labelled as "positive sentiment". It's hard to take this number too seriously, as sentiment is a relative concept, and we don't know how the people who labelled the movie reviews decided where the cutoff was between positive and negative. Therefore what is more interesting is looking at relative comparisons: average sentiment across different players, and also changes in sentiment for the same player over time.

Here are the 10 most positive and 10 most negative players in our dataset, along with the number of interviews and their average sentiment scores:
Notes: Only players with at least 70 interviews in our dataset were included. Values can be interpreted as the fraction of interviews the algorithm labelled as "positive".
Next we plot the 50-interview moving average for a select few players in our dataset, mainly golfers for whom we have a large sample of interviews. It's interesting to look at these plots alongside each golfer's performance trends. In Rory's plot, the periods of more negative interview sentiment clearly coincided with periods of poor play. However this is not the case for Luke Donald: his sentiment has been hitting lifetime highs of late despite his performance tailing off.
Notes: Only players with at least 400 interviews were eligible. Plotted is the 50-interview moving average over time - that is, each data point represents the fraction of a player's 50 most recent interviews (as of that date) that we labelled "positive".
Part 3: Who does the media like to talk to?
Brooks Koepka has at times felt like a forgotten man over the last 2 seasons, despite accumulating 3 majors and reaching #1 in the Official World Golf Rankings. By matching our data on interviews with our adjusted strokes-gained data, we should be able to shed some light on whether Koepka has reason to be feeling left out.

We construct a simple model to predict the likelihood of a golfer being interviewed following their round. In this model we only include measures of performance; the goal here is to predict the likelihood of a golfer being interviewed on a given day taking into account only how they performed on that day, along with any relevant historical performance measures. From this we can calculate the expected number of interviews for each golfer based only on how they have performed. By comparing this expected value to the actual number of times a player was interviewed, we can get a sense of which golfers are over- or under-interviewed relative to their performance level. The inputs to the model include measures of long-term strokes-gained performance, number of recent wins by the golfer, the number of recent major wins, and of course how the golfer performed on the day under consideration (for more precise details see graph footnote).

Consider the graph below: plotted is the probability of a golfer being interviewed as a function of their strokes-gained that day. We plot separate lines for 4 different player types: these types are defined by a golfer's average strokes-gained over their last 60 rounds. (We hold constant the other model inputs, such as whether the golfer has won recently, at their average values). For example, the plot indicates that a golfer who has averaged +2 strokes-gained over their last 60 rounds (approximately the skill level of a top 5 player), and has a strokes-gained of +4 in their current round, has a 50% chance of being interviewed following that round. Similarly, a golfer who has historically averaged -1 strokes-gained, but gains 10 strokes in their current round, has nearly a 100% chance of being interviewed following that round.
Notes: "Skill level" indicates the average strokes-gained over the last 60 rounds played. Plotted are the predicted values from a logistic regression as a function of strokes-gained on a given day for each of the skill types. The inputs to the regression were a golfer's strokes-gained performance in the current round, their average strokes-gained over the previous 60 rounds, their number of wins over the last 15 tournaments, their number of wins over their last 5 tournaments, and their number of major wins over their last 15 tournaments. For this plot, the omitted inputs (i.e. the various win frequency measures) are held constant at their average values.
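A logistic model with these inputs can be sketched as follows. The coefficients below are invented purely for illustration - the post does not report the fitted values - and were chosen only to roughly reproduce the two worked examples in the text (about 50% for a +2-skill player gaining +4, near-certainty for a -1-skill player gaining +10):

```python
import math

# Illustrative coefficients only; the actual fitted values are not reported.
COEF = {
    "intercept": -6.8,
    "sg_today": 1.2,           # strokes-gained in the current round
    "sg_last_60": 1.0,         # average strokes-gained, previous 60 rounds
    "wins_last_15": 0.5,       # wins over the last 15 tournaments
    "wins_last_5": 0.5,        # wins over the last 5 tournaments
    "major_wins_last_15": 1.0, # major wins over the last 15 tournaments
}

def interview_probability(features):
    """Predicted probability of a post-round interview (logistic model)."""
    z = COEF["intercept"] + sum(COEF[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))

# A top-5-calibre player (+2 skill) who gains 4 strokes today
p = interview_probability({"sg_today": 4, "sg_last_60": 2,
                           "wins_last_15": 0, "wins_last_5": 0,
                           "major_wins_last_15": 0})
print(round(p, 2))  # about 0.5
```

Holding the win-frequency inputs at zero here stands in for holding them at their average values in the plot.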
With these predictions in hand, we can calculate the expected number of interviews for each golfer taking into account only measures of their performance. We do this exercise using data covering PGA Tour events from 2004 onwards; we focus on PGA Tour events because at lower-quality tournaments it takes lower-quality performances to get interviewed (if those events are included, many of the most "over-interviewed" players turn out to be European Tour regulars). We report the ratio of actual interviews to expected interviews ("over-under ratio") as our measure of which players are interviewed more, or less, than their performances warranted. The table below reports the 10 largest and 10 smallest over-under ratios among golfers who recorded at least 50 interviews in our data, as well as some notables:
Notes: Only players with at least 50 recorded interviews were eligible. This analysis does not include interviews that occurred on days where golf was not played (i.e. pre-tournament). The predicted number of interviews is simply the sum of a golfer's (predicted) probability of being interviewed after each round they played in our data. Note also that some players had multiple interviews on the same day - for example, Tiger had about 60 days in the data with multiple interviews. Data includes all stroke-play PGA Tour events from 2004-present.
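The over-under ratio is then just actual interviews divided by the sum of per-round predicted probabilities. A toy sketch, with made-up per-round probabilities standing in for real model output:

```python
def over_under_ratio(actual_interviews, round_probabilities):
    """Actual interviews divided by performance-expected interviews.

    The expected count is the sum of the per-round interview
    probabilities, as described in the notes above. A ratio above 1
    means over-interviewed; below 1 means under-interviewed.
    """
    expected = sum(round_probabilities)
    return actual_interviews / expected

# Numbers from the Koepka discussion below: 85 actual vs 93 expected,
# spread here over a hypothetical 400 rounds of model output
probs = [93 / 400] * 400
print(round(over_under_ratio(85, probs), 2))  # 0.91
```

Note that only the sum of the probabilities matters for the ratio, not how they are distributed across rounds.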
Notably absent from the "most under-interviewed" table is Brooks Koepka. Given Koepka's performance data, we expected him to be interviewed 93 times; in actuality, he has been interviewed 85 times, yielding a ratio of 0.91. Therefore it seems that Koepka has a point: given the quality of his performances he has been under-interviewed, especially compared to most of the top players in today's game. However, another notable star who has a ratio similar to Koepka is Justin Thomas, coming in at 0.92. Dustin Johnson, who seems similar to Koepka personality-wise, has a ratio of 0.99.