Analytics Blog
Feb 4th, 2019
We've scraped thousands of transcripts from interviews with professional golfers dating back to
1997. All of the data is available at this incredible resource.
We show three short analyses here. First, for each player with a substantial number of interviews in the data,
we've found which words they use the most relative to their peers. Second, we classified the sentiment of every
interview on a scale from 0 (most negative) to 1 (most positive), and looked at differences in average sentiment across players
and also at trends in sentiment over time for specific players. Third, we combined our interview data with PGA Tour scoring data
from 2004 to the present to determine which players get interviewed more, or less, than their performances would predict.
Part 1: You are what you... speak?
In the word clouds below, the words that a golfer uses most often relative to their peers
are bigger in size and lighter in colour.
Part 2: The story of player sentiment, as told by an algorithm
We classified the "sentiment" of every interview in our dataset on a scale from 0 to 1 (0 being very negative, 1 being very positive).
There are a few different ways you could do this that don't require
actually reading any of the interviews. The method we chose required first finding a
public dataset containing segments of text that had been labelled as positive or negative. This has been
done for a set of 50,000 movie reviews from IMDB (the data is available in a nice format online).
Humans have actually gone through and labelled each of these movie reviews as either positive or
negative (sounds like a fun exercise). We use this data to fit a model that takes as its input a block of text, and gives as its output
a number from 0 to 1 which represents how likely it is that the block of text is "positive"
in sentiment. The way the algorithm works is fairly simple and transparent. You feed in a new block of text, such as a
golfer's interview transcript, and each word will be analyzed separately. From the movie review data, we have determined which words are associated
with positive reviews and which are associated with negative reviews. For example, here are the 5 words most strongly associated with movie reviews
labelled as positive: excellent, perfect, favorite, superb, wonderful; and here are the top 5 negative words:
worst, waste, awful, disappointment, poorly. The algorithm then goes through this new block of text and, depending on
the incidence of "positive" and "negative" words, assigns a number ranging from 0 to 1 that reflects the likelihood the block of text was "positive" in its sentiment.
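The post doesn't name the exact classifier it used, so the sketch below is an assumption: a Naive-Bayes-style model in the same spirit as the description above, where each word gets a log-odds score learned from labelled reviews and a document's summed score is squashed to a 0–1 sentiment. The function names and the toy training data are hypothetical stand-ins for the real 50,000-review IMDB set.

```python
import math
from collections import Counter

def train_word_scores(labelled_docs):
    """Learn a log-odds score for each word from (text, label) pairs,
    where label is 1 for positive and 0 for negative.
    Laplace smoothing avoids infinite scores for one-sided words."""
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in labelled_docs:
        counts = pos_counts if label == 1 else neg_counts
        counts.update(text.lower().split())
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    return {w: math.log((pos_counts[w] + 1) / pos_total)
             - math.log((neg_counts[w] + 1) / neg_total)
            for w in vocab}

def sentiment(text, scores):
    """Sum the per-word scores and squash to (0, 1) with a sigmoid."""
    total = sum(scores.get(w, 0.0) for w in text.lower().split())
    return 1 / (1 + math.exp(-total))

# Toy training data standing in for the 50,000 labelled IMDB reviews.
reviews = [
    ("an excellent perfect superb film", 1),
    ("a wonderful favorite of mine", 1),
    ("the worst awful waste of time", 0),
    ("a poorly made disappointment", 0),
]
scores = train_word_scores(reviews)
print(sentiment("what a wonderful superb round", scores) > 0.5)  # positive words dominate
print(sentiment("an awful waste of a day", scores) < 0.5)        # negative words dominate
```

Because each word is scored independently, the model is exactly as transparent as described: any prediction can be explained by listing the highest- and lowest-scoring words in the text.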
A potential problem here is that we've used data on movie reviews to fit our sentiment model, but what
we actually want to predict is the sentiment of golfers' interviews. There are some words
that have a positive connotation in movie reviews, but don't carry the same meaning
in an interview with a golfer (e.g. "must-see"). It's hard for us to gauge the extent of this issue,
but our sense is that it's not a huge problem. Looking through the top 100 "positive" words in our
model, we find that nearly all of them have meanings that are not specific to a movie-reviewing context.
The result of this analysis is a dataset from 1997-2019 containing the date of an interview and the name of the golfer interviewed,
along with a sentiment score between 0 and 1.
The overall mean sentiment score in our set of interviews is 0.89; this can be roughly interpreted as
the fraction of interviews that are labelled as "positive sentiment". It's hard to take this
number too seriously, as sentiment is a relative concept, and we don't
know how the people who labelled the movie reviews decided where the cutoff was between positive and negative.
Therefore what is more interesting is looking at relative comparisons: average sentiment across different
players, and also changes in sentiment for the same player over time.
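As a sketch of the first comparison, here is how per-player average sentiment might be computed from the interview-level dataset just described. The rows and helper names are hypothetical; the real analysis would use the full 1997–2019 data and a minimum-interview cutoff.

```python
from collections import defaultdict

# Hypothetical rows: (player, interview_date, sentiment_score).
interviews = [
    ("Rory McIlroy", "2014-08-10", 0.95),
    ("Rory McIlroy", "2016-05-01", 0.70),
    ("Luke Donald", "2018-03-12", 0.99),
    ("Luke Donald", "2018-09-30", 0.93),
]

def average_sentiment(rows, min_interviews=1):
    """Mean sentiment per player, restricted to players with at least
    `min_interviews` interviews, sorted most positive first."""
    by_player = defaultdict(list)
    for player, _date, score in rows:
        by_player[player].append(score)
    return sorted(
        ((player, sum(s) / len(s)) for player, s in by_player.items()
         if len(s) >= min_interviews),
        key=lambda pair: -pair[1],
    )

print(average_sentiment(interviews))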
Here are the 10 most positive and 10 most negative players in our dataset, along with the number of interviews
and their average sentiment scores:
Next we plot the 50-interview moving average for a select few players in our dataset, mainly golfers for whom
we have a large sample of interviews. It's interesting to look at these plots and think about each golfer's performance
trends. It's clear in Rory's plot that the periods of more negative interview sentiment coincided with
periods of poor play. However, this is not the case for Luke Donald: his sentiment has been hitting lifetime highs of late
despite his performance tailing off.
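The 50-interview moving average used in these plots can be sketched as a simple trailing window over date-sorted scores. The helper below is hypothetical (the real plots would carry dates alongside the scores), shown with a window of 3 for illustration.

```python
from collections import deque

def moving_average(scores, window=50):
    """Trailing moving average of sentiment scores, emitted once the
    window is full (scores assumed sorted by interview date)."""
    buf, out = deque(maxlen=window), []
    for s in scores:
        buf.append(s)
        if len(buf) == window:
            out.append(sum(buf) / window)
    return out

# With window=3 for illustration: three averages from five interviews.
print(moving_average([0.8, 0.9, 1.0, 0.7, 0.5], window=3))
```

A trailing window means each plotted point reflects only interviews up to that date, so dips and peaks line up with what was happening in the player's career at the time.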
Part 3: Who does the media like to talk to?
Brooks Koepka has at times felt like a forgotten man over the last 2 seasons, despite accumulating 3 majors and reaching #1 in the Official World
Golf Rankings. By matching our data on interviews with our adjusted strokes-gained data, we should be able to shed some light on whether
Koepka has reason to be feeling left out.
We construct a simple model to predict the likelihood of a golfer being interviewed following their round.
In this model we only include measures of performance; the goal here is to predict the likelihood of a golfer being interviewed on a given day
only taking into account how they performed on that day, and any relevant historical performance measures. From this we can calculate the expected
number of interviews for each golfer based only off how they have performed. By comparing this expected value to the actual number of times a player was interviewed,
we can get a sense of which golfers are over- or under-interviewed relative to their performance level. The inputs to the model include
measures of long-term strokes-gained performance, the number of recent wins by the golfer, the number of recent major wins, and of course how the golfer performed on the day
under consideration (for more precise details, see the graph footnote).
Consider the graph below: plotted is the probability of a golfer being interviewed as a function of their strokes-gained that day. We plot
separate lines for 4 different player types: these types are defined by a golfer's average strokes-gained over their last 60 rounds. (We hold constant the other model
inputs, such as whether the golfer has won recently, at their average values). For example, the plot indicates that a golfer who has averaged
+2 strokes-gained over their last 60 rounds (approximately the skill level of a top 5 player), and has a strokes-gained of +4 in their current round, has a 50% chance of being
interviewed following that round. Similarly, a golfer who has historically averaged -1 strokes-gained, but gains 10 strokes in their current
round, has nearly a 100% chance of being interviewed following that round.
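The post doesn't publish the fitted model, but the two worked examples above are consistent with a logistic form like the sketch below. To be clear, the coefficients here are invented purely for illustration, and tuned only so that those two examples come out roughly right; the real model's inputs and weights would differ.

```python
import math

def interview_probability(sg_today, sg_avg_60, recent_wins=0, recent_majors=0):
    """Logistic model of P(interviewed after a round). The functional
    form mirrors the post's description; the coefficients below are
    made up for illustration only."""
    z = (-6.4                   # baseline: most rounds draw no interview
         + 1.2 * sg_today       # performance on the day
         + 0.8 * sg_avg_60      # long-term form (avg SG, last 60 rounds)
         + 0.5 * recent_wins    # held at 0 here, i.e. no recent wins
         + 0.9 * recent_majors)
    return 1 / (1 + math.exp(-z))

# A top-5-calibre player (+2 avg SG) gaining 4 strokes today:
print(round(interview_probability(sg_today=4, sg_avg_60=2), 2))    # → 0.5
# A -1 avg SG player gaining 10 strokes today:
print(round(interview_probability(sg_today=10, sg_avg_60=-1), 2))  # → 0.99
```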
With these predictions in hand, we can calculate the expected number of interviews for each golfer only taking into account measures of their
performance. We do this exercise using data covering PGA Tour events from 2004 onwards; we focus on PGA Tour events because at lower-quality
tournaments, lower-quality performances are enough to earn an interview. (Without this restriction, many of the most "over-interviewed" players
turn out to be on the European Tour.) We report the ratio of actual interviews to expected interviews (the "over-under ratio") as our measure
of which players are interviewed more, or less, than their performances warranted. The table below reports the 10 largest and 10 smallest
over-under ratios among golfers who recorded at least 50 interviews in our data, as well as some notables:
Notably absent from the "most under-interviewed" table is Brooks Koepka. Given Koepka's performance data,
we expected him to be interviewed 93 times; in actuality, he has been interviewed 85 times, yielding a ratio of 0.91. Therefore it seems
that Koepka has a point: given the quality of his performances he has
been under-interviewed, especially compared to most of the top players in today's game. However, another notable star
who has a ratio similar to Koepka is Justin Thomas, coming in at 0.92. Dustin Johnson, who seems similar to Koepka personality-wise,
has a ratio of 0.99.
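The over-under ratio itself is straightforward once per-round probabilities are in hand: expected interviews are the sum of the model's probabilities across a player's rounds, and the ratio divides actual interviews by that expected count. A minimal sketch, using Koepka's totals from the text (the per-round probabilities below are hypothetical placeholders):

```python
def over_under_ratio(actual_interviews, per_round_probs):
    """Ratio of actual to expected interviews, where the expected count
    is the sum of the model's per-round interview probabilities."""
    expected = sum(per_round_probs)
    return actual_interviews / expected

# Hypothetical per-round probabilities chosen to sum to ~93, matching
# the expected-interview count the text reports for Koepka.
probs = [0.31] * 300
print(round(over_under_ratio(85, probs), 2))  # → 0.91
```

A ratio below 1 means a player was interviewed less often than their performances predicted; above 1, more often.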