Raw Data Archive Notes & Updates
API Changelog
Changed course_name variable for European Tour events involving Randpark GC (event_id = 2020103, 2020006, 2019004, 2017098) to correct an inconsistency between the events involving a single course and events with multiple courses.
Added 2021 Masters strokes-gained category data. These SG numbers are generated using our own baseline functions (but are designed to be similar to what the PGA Tour uses).
Updated the filtering method for WDs and DQs. Affects all events before the tour-specific date cutoffs listed in the general notes below.
Corrected course_name at 2021 ISPS HANDA World Invitational (event_id=2021124) to account for multiple courses. This was a bug on our end.
Notes / Comments
General Notes
Only completed rounds are included in the data, with a few exceptions:
Round 1 of the 2020 PLAYERS Championship. Four players failed to finish their final hole of the first round before the event was cancelled; we assigned these players the most likely score given their position on the hole.
Rounds 1 and 2 of the 2022 Joburg Open (event_id=2022100). Due to the sudden travel restrictions imposed on South Africa, 6 golfers withdrew after playing at least 15 holes in their last round. We assigned these players the most likely score on their remaining holes.
The only tricky cases for determining round completion from our primary data sources are withdrawals (WD) and disqualifications (DQ). It is not easy to identify, in an automated fashion, whether a round that resulted in a WD/DQ was completed. Currently we manually vet the weekly data update from the PGA, EUR, CHA, CAN, SAM, and CHAMP tours to ensure that all completed rounds are included and all incomplete rounds are dropped. Listed below are the time periods (inclusive of the listed date) for which we did not rely on an algorithm to filter WD/DQ rounds:
PGA: full time period
EURO: 2020-10-11 — present
KFT: 2020-10-11 — present
CHA: 2020-11-14 — present
CAN: 2021-06-26 — present
SAM: 2020-10-09 — present
CHAMP: 2020-08-21 — present
For rounds outside these date ranges, or from a tour not listed above, we apply a simple algorithm to label each round as complete or incomplete. If a player's tournament ends with a WD or DQ, only their last round played is considered potentially incomplete. We then look at their strokes-gained for that round (more accurately, the strokes-gained implied by their listed round score) and apply two basic filters: if they withdraw after rounds 2-4 and gain more than 2 strokes on the field in their final round, we drop the round; if they withdraw after round 1 and gain positive strokes on the field, we drop the round. This is a fairly conservative filter, as we feel the cost of including an incomplete round is higher than the cost of omitting a complete one.
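The date cutoffs and the two filters described above can be sketched roughly as follows. This is only an illustration, not our production code: the function names are hypothetical, the tour labels are copied as written in the list, and it assumes that "withdraw after round N" means round N is the last round with a listed score, and that the date compared against the cutoff is the event's date.

```python
from datetime import date

# Start dates (inclusive) of manual WD/DQ vetting, copied from the list above.
# None means the tour's full time period was vetted manually.
MANUAL_VETTING_START = {
    "PGA": None,
    "EURO": date(2020, 10, 11),
    "KFT": date(2020, 10, 11),
    "CHA": date(2020, 11, 14),
    "CAN": date(2021, 6, 26),
    "SAM": date(2020, 10, 9),
    "CHAMP": date(2020, 8, 21),
}


def uses_algorithm(tour, event_date):
    """Return True if WD/DQ rounds for this tour and date were filtered by the algorithm."""
    if tour not in MANUAL_VETTING_START:
        return True  # tours not listed are always filtered by the algorithm
    start = MANUAL_VETTING_START[tour]
    # The listed date itself is inside the manually vetted period.
    return start is not None and event_date < start


def drop_final_round(last_round_num, implied_sg):
    """Decide whether to drop the last round played by a player who WD'd or was DQ'd.

    last_round_num: number (1-4) of the last round with a listed score (assumption).
    implied_sg: strokes gained on the field implied by the listed round score.
    """
    if last_round_num == 1:
        return implied_sg > 0  # any positive strokes-gained is treated as suspicious
    return implied_sg > 2      # rounds 2-4: more than 2 strokes gained on the field
```

A suspiciously good score from a player who then withdrew suggests that holes are missing from the round, which is why higher strokes-gained triggers the drop.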
All stroke-play tournaments are included (or the stroke-play portion of events with a Match Play component, e.g. 2019 ISPS Handa World Super 6 Perth). For the PGA Tour, a few tournaments are included only in select years: Reno-Tahoe (event_id=472) 2019 and later, Zurich Classic (event_id=18) before it became a team event (2016 and earlier).
Data Dictionary
round_score: Total score in a specific round.
sg_app, sg_arg, sg_ott, sg_putt: Strokes-gained categories. Only available at Shotlink-equipped PGA Tour events. The values reported here are pulled directly from the PGA Tour website. In theory, each SG category should have a mean of zero by tournament-round-course (i.e. the PGA Tour subtracts off the mean in each category). This is almost always true, but there are a few exceptions:
If a player completes their round and then withdraws or is DQ'd, sometimes (very rarely) their data from that round is not included in the PGA Tour's SG calculation. The most consequential example of this is Round 2 of the 2021 Arnold Palmer Invitational: Robert Gamez fired a 92 but was then DQ'd, and as a result he is not included when the SG categories are demeaned by the PGA Tour. Therefore, the strokes-gained categories have a mean of zero when Gamez is excluded, but not when he is included (as in our data).
2021 Olympic Golf Competition. Shot-level tracking was not administered by the PGA Tour, and the strokes-gained category data was not demeaned.
Ignoring the above cases, the 4 SG categories should add up to strokes-gained total (defined as the difference between a player's score and the average score for that round and course). However, there is a small number of what appear to be mistakes on the part of the PGA Tour. One somewhat common mistake is that when a player doesn't hit a shot around-the-green, they are given a value of 0 for SG:ARG, meaning that the mean SG:ARG for the field was not subtracted off. To fix this, you can calculate sg_arg as the difference between sg_t2g and sg_ott+sg_app. We don't make this correction in the API data, as it's meant to be raw data. The remaining errors (20-30 of them, all since the 2020 season) appear to be fairly idiosyncratic, and the source of the discrepancy is not always clear. To see these for yourself, calculate the difference between sg_total and the sum of sg_ott, sg_app, sg_arg, and sg_putt.
There is a single instance (as of September 2021) where SG category data was not reported for a golfer in a round that should have had Shotlink data:
Viktor Hovland at 2021 U.S. Open #2 (event_id=536) Round 1.
sg_t2g: Strokes-gained from tee to green. Defined as the sum of sg_ott, sg_app, and sg_arg. The values here are directly pulled from the PGA Tour website (i.e. we do not perform the calculation ourselves), which means we can't guarantee the individual components always add up.
sg_total: Total strokes-gained. Calculated (by us) as the difference between a player's score and the average score for that round (and for that course, at events with multiple courses). When strokes-gained categories are available, this should be equal to the sum of sg_ott, sg_app, sg_arg, and sg_putt, but for the reasons detailed above this will not quite be true in all cases.
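As a rough illustration of the two checks described in the strokes-gained entries above, the sketch below computes the gap between sg_total and the four categories, and recomputes sg_arg from sg_t2g. The function names, the 0.01 tolerance, and the example row are assumptions made for illustration only (the numbers are invented, not real data). Note that the sg_arg repair is only as reliable as the sg_t2g value, which is also pulled from the PGA Tour.

```python
SG_CATEGORIES = ("sg_ott", "sg_app", "sg_arg", "sg_putt")


def sg_discrepancy(row):
    """Return sg_total minus the sum of the four categories, or None if any value is missing."""
    values = [row.get(name) for name in SG_CATEGORIES]
    if row.get("sg_total") is None or any(v is None for v in values):
        return None
    return row["sg_total"] - sum(values)


def repaired_sg_arg(row):
    """Recompute sg_arg as sg_t2g - (sg_ott + sg_app)."""
    return row["sg_t2g"] - (row["sg_ott"] + row["sg_app"])


def flag_discrepancies(rows, tol=0.01):
    """Return the rows whose categories do not add up to sg_total within tol (assumed tolerance)."""
    flagged = []
    for row in rows:
        gap = sg_discrepancy(row)
        if gap is not None and abs(gap) > tol:
            flagged.append(row)
    return flagged


# Invented example: sg_arg was left at 0 instead of being demeaned.
row = {"sg_ott": 0.5, "sg_app": 1.0, "sg_arg": 0.0, "sg_putt": -0.25,
       "sg_t2g": 1.25, "sg_total": 1.0}
fixed = dict(row, sg_arg=repaired_sg_arg(row))
```

Here the original row is flagged, while the copy with the recomputed sg_arg adds up to sg_total.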
dg_id: Player ID. There is a single dg_id for each player. Please notify us if you find a player that has multiple dg_ids. Use dg_id when performing operations by player. Any changes we make retroactively to a player's dg_id will be posted in the changelog.
player_name: Player's name. Will not necessarily be the same for all data points for a given player, although it should be. Use dg_id instead of player_name when performing operations by player.
event_id: Tournament ID. For the PGA, KFT, SAM, and CAN tours, event_id is constant across years. (However, note that in seasons where an event was played twice, e.g. the Masters in 2021, a new tournament number is used for the second playing of the event.) For all other tours, event_id changes by year. Within a tour and season, or within a tour and calendar year, event_id uniquely identifies a tournament.
event_name: Event name. May change by year for any tour.
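Because event_id can repeat across years on some tours, it may help to group rounds by a composite key rather than by event_id alone. The following is a minimal sketch; the `tour` field is a hypothetical label for the tour each row came from (it is not part of the data dictionary), and the example rows, including the dg_id values and scores, are invented.

```python
from collections import defaultdict


def tournament_key(row, by="season"):
    """Return a key identifying one playing of a tournament.

    `by` is either "season" or "year"; `tour` is a hypothetical field.
    """
    return (row["tour"], row[by], row["event_id"])


def rounds_by_tournament(rows, by="season"):
    """Group round-level rows by (tour, season or year, event_id)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[tournament_key(row, by)].append(row)
    return dict(grouped)


# Invented rows: the same PGA Tour event_id appearing in two seasons.
rows = [
    {"tour": "PGA", "season": 2015, "year": 2015, "event_id": 18, "dg_id": 1, "round_score": 68},
    {"tour": "PGA", "season": 2016, "year": 2016, "event_id": 18, "dg_id": 1, "round_score": 71},
    {"tour": "PGA", "season": 2016, "year": 2016, "event_id": 18, "dg_id": 2, "round_score": 70},
]
groups = rounds_by_tournament(rows)
```

Grouping on event_id alone would merge the 2015 and 2016 rows into one tournament; the composite key keeps them apart.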
course_num: Course ID. For PGA, KFT, SAM, and CAN, course_num is constant for a given course. However, a course number may "change" if the course undergoes substantial changes. For example, Pebble Beach Golf Links at the 2019 U.S. Open is assigned a different number than its typical assignment for the AT&T Pro-Am. Here is the full list of PGA Tour courses with multiple IDs:
Muirfield Village (23, 893), Pebble Beach (5, 666), Quail Hollow (872, 241, 698), Ridgewood (745, 873), Hamilton (694, 874), Liberty National (762, 886), Bethpage Black (689, 880), Sea Island Plantation (231, 889), TPC Four Seasons (19, 882), Chambers Bay (818, 100), Torrey Pines South Course (4, 744), Keene Trace (823, 884), Winged Foot (502, 891).
If you find a PGA Tour course that has multiple course numbers and is not listed above, please notify us.
When performing operations by course on the above-listed tours, use course_num. For all other tours, course_num varies by course-year. For tours other than PGA, EUR, KFT, CHA, CAN, SAM, and CHAMP, course_num is not meaningful, and therefore multi-course events are not distinguishable from single-course events.
course_name: Course name. For the European Tour (EUR), we have made course_name constant for a given course (i.e. spelling, naming convention is identical across years). Use course_name when performing operations by course on the European Tour. For all other tours, the course_name may not be constant for a given course. If you find a course that has different values for course_name, please notify us.
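The guidance in the course_num and course_name entries can be summarized in a small helper that picks the grouping key for each tour. This is only a sketch: the function name and the `merge_pga_ids` option are hypothetical, and whether to merge the multiple PGA Tour IDs listed above depends on whether you want to treat a substantially changed course as the same course.

```python
# Tours for which course_num is constant for a given course.
COURSE_NUM_TOURS = {"PGA", "KFT", "SAM", "CAN"}

# PGA Tour courses listed above as having more than one course_num.
PGA_MULTI_ID_COURSES = [
    (23, 893), (5, 666), (872, 241, 698), (745, 873), (694, 874),
    (762, 886), (689, 880), (231, 889), (19, 882), (818, 100),
    (4, 744), (823, 884), (502, 891),
]
# Map every listed ID to the first ID in its group (an arbitrary choice).
PGA_CANONICAL_NUM = {
    num: group[0] for group in PGA_MULTI_ID_COURSES for num in group
}


def course_key(tour, course_num, course_name, merge_pga_ids=False):
    """Return a key for grouping rounds by course, or None if no stable key exists."""
    if tour in COURSE_NUM_TOURS:
        if merge_pga_ids and tour == "PGA":
            course_num = PGA_CANONICAL_NUM.get(course_num, course_num)
        return (tour, course_num)
    if tour == "EUR":
        return (tour, course_name)  # course_name is constant on the European Tour
    return None  # neither field is guaranteed to be stable for other tours
```

For example, with merging enabled, Pebble Beach rounds recorded under 666 and under 5 would share the key ("PGA", 5).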
fin_text: Official finishing position.
season: Official season as defined by each tour.
year: Calendar year.
event_completed: Official date of the final round of the tournament (e.g. if the event is delayed 1 day to a Monday, this date will still be that of the Sunday).