Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Lately I wanted to play around with nflfastR. That’s a great package giving you access to NFL’s play-by-play data since 1999. It let’s you download all the data and store it in several different databases.
Unfortunately I ran into trouble when I tried to import the data to Google’s BigQuery.
Two problems
I’ve identified to problems.
- The first one occured when the destination table doesn’t exist in BigQuery.
- The second one occured when the 2001 season was imported.
My solutions
The open source way
The solution for the first bug was relative simple: Check if the table exists. If not don’t set the append
parameter to TRUE
when calling DBI::dbWriteTable()
. (See here).
The other one was a little bit more tricky. As it turned out the new table is created by using data consisting of
- the first season loaded (usually 1999) and
- a
default_play
. Both are binded together (see here) and then given toDBI::dbWriteTable()
.
The default_play
is somehow created by the package author.
So DBI
(or the database) guesses what column type should be created for each column by this data. And that’s where the trouble starts:
The column weather
is empty for seasons 1999 and 2000. So the only value provided for this column is in default_play
.
The value is a very long string of repeating “NA”.
As it turned out DBI::dbWriteTable()
or BigQuery uses column type BYTES
as the best fitting type for this value.
(Why this happens is not quite clear. It depends on the length of the string. See my question here.)
Unfortunately this column types can’t handle Unicode values which are used in the weather-column in seasons 2001 and younger.
So I provided two ways for fixing this issue:
- Change the value of
default_play$weather
, see commit - Load the data in reverse order so you provide at creation time correct data, see commit
Both solutions were rejected by the package maintainers. They argue that the DBI-implementation for BigQuery is the reason for the problems.
This might be true. But it doesn’t help someone who wants to use nflfastR
with BigQuery.
What to do else?
So if you want to use nflfastR
with BigQuery you can either use my fork of this package. I’ll try to update it when the upstream package is updated.
Pre-Import Script
But I looked for another solution tackling the main reason why the original packages fails.
The main reason the import of the data fails is that the database guesses what the best column types are for our data. I think that’s bad habit. I think it’s much better to be precise. So tell the database the correct column types for the new table. If this would be done in the original package all problems would be gone.
So here’s my script you can run before importing the data:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 |
library(tidyverse) library(DBI) project <- "bigquery-project-name" dataset <- "bigquery-dataset-name" billing <- "bigquery-project-name" table_name <- "bigquery-table-name" # Connection for DBI con <- DBI::dbConnect( bigrquery::bigquery(), project = project, dataset = dataset, billing = project ) field_types <- structure(list(name = c("play_id", "game_id", "old_game_id", "home_team", "away_team", "season_type", "week", "posteam", "posteam_type", "defteam", "side_of_field", "yardline_100", "game_date", "quarter_seconds_remaining", "half_seconds_remaining", "game_seconds_remaining", "game_half", "quarter_end", "drive", "sp", "qtr", "down", "goal_to_go", "time", "yrdln", "ydstogo", "ydsnet", "desc", "play_type", "yards_gained", "shotgun", "no_huddle", "qb_dropback", "qb_kneel", "qb_spike", "qb_scramble", "pass_length", "pass_location", "air_yards", "yards_after_catch", "run_location", "run_gap", "field_goal_result", "kick_distance", "extra_point_result", "two_point_conv_result", "home_timeouts_remaining", "away_timeouts_remaining", "timeout", "timeout_team", "td_team", "td_player_name", "td_player_id", "posteam_timeouts_remaining", "defteam_timeouts_remaining", "total_home_score", "total_away_score", "posteam_score", "defteam_score", "score_differential", "posteam_score_post", "defteam_score_post", "score_differential_post", "no_score_prob", "opp_fg_prob", "opp_safety_prob", "opp_td_prob", "fg_prob", "safety_prob", "td_prob", "extra_point_prob", "two_point_conversion_prob", "ep", "epa", "total_home_epa", "total_away_epa", "total_home_rush_epa", "total_away_rush_epa", "total_home_pass_epa", "total_away_pass_epa", "air_epa", "yac_epa", "comp_air_epa", "comp_yac_epa", "total_home_comp_air_epa", "total_away_comp_air_epa", "total_home_comp_yac_epa", "total_away_comp_yac_epa", "total_home_raw_air_epa", "total_away_raw_air_epa", "total_home_raw_yac_epa", "total_away_raw_yac_epa", "wp", "def_wp", "home_wp", "away_wp", "wpa", "vegas_wpa", "vegas_home_wpa", "home_wp_post", "away_wp_post", "vegas_wp", "vegas_home_wp", "total_home_rush_wpa", "total_away_rush_wpa", "total_home_pass_wpa", "total_away_pass_wpa", "air_wpa", "yac_wpa", "comp_air_wpa", "comp_yac_wpa", "total_home_comp_air_wpa", "total_away_comp_air_wpa", "total_home_comp_yac_wpa", "total_away_comp_yac_wpa", "total_home_raw_air_wpa", "total_away_raw_air_wpa", "total_home_raw_yac_wpa", "total_away_raw_yac_wpa", "punt_blocked", "first_down_rush", "first_down_pass", "first_down_penalty", "third_down_converted", "third_down_failed", "fourth_down_converted", "fourth_down_failed", "incomplete_pass", "touchback", "interception", "punt_inside_twenty", "punt_in_endzone", "punt_out_of_bounds", "punt_downed", "punt_fair_catch", "kickoff_inside_twenty", "kickoff_in_endzone", "kickoff_out_of_bounds", "kickoff_downed", "kickoff_fair_catch", "fumble_forced", "fumble_not_forced", "fumble_out_of_bounds", "solo_tackle", "safety", "penalty", "tackled_for_loss", "fumble_lost", "own_kickoff_recovery", "own_kickoff_recovery_td", "qb_hit", "rush_attempt", "pass_attempt", "sack", "touchdown", "pass_touchdown", "rush_touchdown", "return_touchdown", "extra_point_attempt", "two_point_attempt", "field_goal_attempt", "kickoff_attempt", "punt_attempt", "fumble", "complete_pass", "assist_tackle", "lateral_reception", "lateral_rush", "lateral_return", "lateral_recovery", "passer_player_id", "passer_player_name", "passing_yards", "receiver_player_id", "receiver_player_name", "receiving_yards", "rusher_player_id", "rusher_player_name", "rushing_yards", "lateral_receiver_player_id", "lateral_receiver_player_name", "lateral_receiving_yards", "lateral_rusher_player_id", "lateral_rusher_player_name", "lateral_rushing_yards", "lateral_sack_player_id", "lateral_sack_player_name", "interception_player_id", "interception_player_name", "lateral_interception_player_id", "lateral_interception_player_name", "punt_returner_player_id", "punt_returner_player_name", "lateral_punt_returner_player_id", "lateral_punt_returner_player_name", "kickoff_returner_player_name", "kickoff_returner_player_id", "lateral_kickoff_returner_player_id", "lateral_kickoff_returner_player_name", "punter_player_id", "punter_player_name", "kicker_player_name", "kicker_player_id", "own_kickoff_recovery_player_id", "own_kickoff_recovery_player_name", "blocked_player_id", "blocked_player_name", "tackle_for_loss_1_player_id", "tackle_for_loss_1_player_name", "tackle_for_loss_2_player_id", "tackle_for_loss_2_player_name", "qb_hit_1_player_id", "qb_hit_1_player_name", "qb_hit_2_player_id", "qb_hit_2_player_name", "forced_fumble_player_1_team", "forced_fumble_player_1_player_id", "forced_fumble_player_1_player_name", "forced_fumble_player_2_team", "forced_fumble_player_2_player_id", "forced_fumble_player_2_player_name", "solo_tackle_1_team", "solo_tackle_2_team", "solo_tackle_1_player_id", "solo_tackle_2_player_id", "solo_tackle_1_player_name", "solo_tackle_2_player_name", "assist_tackle_1_player_id", "assist_tackle_1_player_name", "assist_tackle_1_team", "assist_tackle_2_player_id", "assist_tackle_2_player_name", "assist_tackle_2_team", "assist_tackle_3_player_id", "assist_tackle_3_player_name", "assist_tackle_3_team", "assist_tackle_4_player_id", "assist_tackle_4_player_name", "assist_tackle_4_team", "tackle_with_assist", "tackle_with_assist_1_player_id", "tackle_with_assist_1_player_name", "tackle_with_assist_1_team", "tackle_with_assist_2_player_id", "tackle_with_assist_2_player_name", "tackle_with_assist_2_team", "pass_defense_1_player_id", "pass_defense_1_player_name", "pass_defense_2_player_id", "pass_defense_2_player_name", "fumbled_1_team", "fumbled_1_player_id", "fumbled_1_player_name", "fumbled_2_player_id", "fumbled_2_player_name", "fumbled_2_team", "fumble_recovery_1_team", "fumble_recovery_1_yards", "fumble_recovery_1_player_id", "fumble_recovery_1_player_name", "fumble_recovery_2_team", "fumble_recovery_2_yards", "fumble_recovery_2_player_id", "fumble_recovery_2_player_name", "sack_player_id", "sack_player_name", "half_sack_1_player_id", "half_sack_1_player_name", "half_sack_2_player_id", "half_sack_2_player_name", "return_team", "return_yards", "penalty_team", "penalty_player_id", "penalty_player_name", "penalty_yards", "replay_or_challenge", "replay_or_challenge_result", "penalty_type", "defensive_two_point_attempt", "defensive_two_point_conv", "defensive_extra_point_attempt", "defensive_extra_point_conv", "safety_player_name", "safety_player_id", "season", "cp", "cpoe", "series", "series_success", "series_result", "order_sequence", "start_time", "time_of_day", "stadium", "weather", "nfl_api_id", "play_clock", "play_deleted", "play_type_nfl", "special_teams_play", "st_play_type", "end_clock_time", "end_yard_line", "fixed_drive", "fixed_drive_result", "drive_real_start_time", "drive_play_count", "drive_time_of_possession", "drive_first_downs", "drive_inside20", "drive_ended_with_score", "drive_quarter_start", "drive_quarter_end", "drive_yards_penalized", "drive_start_transition", "drive_end_transition", "drive_game_clock_start", "drive_game_clock_end", "drive_start_yard_line", "drive_end_yard_line", "drive_play_id_started", "drive_play_id_ended", "away_score", "home_score", "location", "result", "total", "spread_line", "total_line", "div_game", "roof", "surface", "temp", "wind", "home_coach", "away_coach", "stadium_id", "game_stadium", "aborted_play", "success", "passer", "passer_jersey_number", "rusher", "rusher_jersey_number", "receiver", "receiver_jersey_number", "pass", "rush", "first_down", "special", "play", "passer_id", "rusher_id", "receiver_id", "name", "jersey_number", "id", "fantasy_player_name", "fantasy_player_id", "fantasy", "fantasy_id", "out_of_bounds", "home_opening_kickoff", "qb_epa", "xyac_epa", "xyac_mean_yardage", "xyac_median_yardage", "xyac_success", "xyac_fd", "xpass", "pass_oe" ), type = c("FLOAT", "STRING", "STRING", "STRING", "STRING", "STRING", "INTEGER", "STRING", "STRING", "STRING", "STRING", "FLOAT", "STRING", "FLOAT", "FLOAT", "FLOAT", "STRING", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "FLOAT", "FLOAT", "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "FLOAT", "FLOAT", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", "STRING", "FLOAT", "FLOAT", "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "INTEGER", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "FLOAT", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "FLOAT", "STRING", "FLOAT", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", "STRING", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "FLOAT", "FLOAT", "INTEGER", "INTEGER", "STRING", "INTEGER", "INTEGER", "FLOAT", "FLOAT", "INTEGER", "STRING", "STRING", "INTEGER", "INTEGER", "STRING", "STRING", "STRING", "STRING", "FLOAT", "FLOAT", "STRING", "INTEGER", "STRING", "INTEGER", "STRING", "INTEGER", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "STRING", "STRING", "INTEGER", "STRING", "STRING", "STRING", "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT" )), class = "data.frame", row.names = c(NA, -372L)) data_source <- bigrquery::bq_table(project, dataset, table_name) # create table in BigQuery bigrquery::bq_table_create(data_source, fields = bigrquery::as_bq_fields( field_types %>% purrr::pmap(list) ) ) |
This might be a little bit tedious but it’s precise. The table is created in the way we want it to be.
Now we can use the original package and import all the data.
1 2 3 4 5 |
# Load data in BigQuery using nflfastR nflfastR::update_db( tblname = table_name, db_connection = con ) |
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.