Scrape
The scrapenhl2.scrape module contains methods useful for scraping.
Useful examples
Updating data:
from scrapenhl2.scrape import autoupdate
autoupdate.autoupdate()
Get the season schedule:
from scrapenhl2.scrape import schedules
schedules.get_season_schedule(2017)
Convert between player ID and player name:
from scrapenhl2.scrape import players
pname = 'Alex Ovechkin'
players.player_as_id(pname)
pid = 8471214
players.player_as_str(pid)
There’s much more, and feel free to submit pull requests with whatever you find useful.
Methods
The functions in these modules are organized pretty logically under the module names.
Autoupdate
This module contains methods for automatically scraping and parsing games.
-
scrapenhl2.scrape.autoupdate.
autoupdate
(season=None)¶ Run this method to update local data. It reads the schedule file for given season and scrapes and parses previously unscraped games that have gone final or are in progress. Use this for 2010 or later.
Parameters: season – int, the season. If None (default), will do current season Returns: nothing
-
scrapenhl2.scrape.autoupdate.
delete_game_html
(season, game)¶ Deletes html files. HTML files are used for live game charts, but deleted in favor of JSONs when games go final.
Parameters: - season – int, the season
- game – int, the game
Returns: nothing
-
scrapenhl2.scrape.autoupdate.
read_final_games
(games, season)¶ Parameters: - games –
- season –
Returns:
-
scrapenhl2.scrape.autoupdate.
read_inprogress_games
(inprogressgames, season)¶ Saves these games to file via html (for toi) and json (for pbp)
Parameters: inprogressgames – list of int Returns:
Events
This module contains methods related to PBP events.
-
scrapenhl2.scrape.events.
convert_event
(event)¶ Converts to a more convenient, standardized name (see get_event_dictionary)
Parameters: event – str, the event name Returns: str, shortened event name
-
scrapenhl2.scrape.events.
event_setup
()¶ Loads event dictionary into memory
Returns: nothing
-
scrapenhl2.scrape.events.
get_event_dictionary
()¶ Returns the abbreviation: long name event mapping (in lowercase)
Returns: dict of str:str
-
scrapenhl2.scrape.events.
get_event_longname
¶ A method for translating event abbreviations to full names (for pbp matching)
Parameters: eventname – str, the event name Returns: the non-abbreviated event name
Games
This module contains methods related to scraping games.
-
scrapenhl2.scrape.games.
find_recent_games
(team1, team2=None, limit=1)¶ A convenience function that lists the most recent in progress or final games for specified team(s)
Parameters: - team1 – str, a team
- team2 – str, a team (optional)
- limit – How many games to return
Returns: df with relevant rows
-
scrapenhl2.scrape.games.
get_player_5v5_log_filename
(season)¶ Gets the filename for the season’s player log file. Includes 5v5 CF, CA, TOI, and more.
Parameters: season – int, the season Returns: str, /scrape/data/other/[season]_player_log.feather
-
scrapenhl2.scrape.games.
most_recent_game_id
(team1, team2)¶ A convenience function to get the most recent game (this season) between two teams.
Parameters: - team1 – str, a team
- team2 – str, a team
Returns: int, a game number
General helpers
This module contains general helper methods. None of these methods have dependencies on other scrapenhl2 modules.
-
scrapenhl2.scrape.general_helpers.
add_sim_scores
(df, name)¶ Adds fuzzywuzzy’s token set similarity scores to the provided dataframe
Parameters: - df – pandas dataframe with column Name
- name – str, name to compare to
Returns: df with an additional column SimScore
-
scrapenhl2.scrape.general_helpers.
anti_join
(df1, df2, **kwargs)¶ Anti-joins two dataframes.
Parameters: - df1 – dataframe
- df2 – dataframe
- kwargs – keyword arguments as passed to pd.DataFrame.merge (except for ‘how’). Specifically, need join keys.
Returns: dataframe
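An anti-join keeps the rows of df1 that have no match in df2. The following is a minimal sketch of that idea, not the package's implementation: the name anti_join_sketch and the explicit on argument are assumptions made for illustration, and it relies on the indicator option of pd.DataFrame.merge.

```python
import pandas as pd


def anti_join_sketch(df1, df2, on):
    """Hypothetical stand-in: rows of df1 whose join keys do not appear in df2."""
    keys = [on] if isinstance(on, str) else list(on)
    # Only the key columns of df2 matter; dropping duplicates avoids row blow-up.
    right_keys = df2[keys].drop_duplicates()
    merged = df1.merge(right_keys, how="left", on=keys, indicator=True)
    # indicator=True adds a _merge column; 'left_only' marks unmatched rows.
    result = merged.loc[merged["_merge"] == "left_only", df1.columns]
    return result.reset_index(drop=True)


scheduled = pd.DataFrame({"Game": [20001, 20002, 20003], "Team": ["WSH", "PIT", "TOR"]})
scraped = pd.DataFrame({"Game": [20002]})
print(anti_join_sketch(scheduled, scraped, on="Game"))
```

The usage above lists the games in scheduled that are absent from scraped.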
-
scrapenhl2.scrape.general_helpers.
check_number
(obj)¶ A helper method to check if obj is int, float, np.int64, etc. This is frequently needed, so is helpful.
Parameters: obj – the object to check the type Returns: bool
-
scrapenhl2.scrape.general_helpers.
check_number_last_first_format
(name)¶ Checks if specified name looks like “8 Ovechkin, Alex”
Parameters: name – str Returns: bool
-
scrapenhl2.scrape.general_helpers.
check_types
(obj)¶ A helper method to check if obj is int, float, np.int64, or str. This is frequently needed, so is helpful.
Parameters: obj – the object to check the type Returns: bool
-
scrapenhl2.scrape.general_helpers.
fill_join
(df1, df2, **kwargs)¶ Uses data from df2 to fill in missing values from df1. Helpful when you have to join using multiple data sources. Preserves data order. Won’t work when joining introduces duplicates.
Parameters: - df1 – dataframe
- df2 – dataframe
- kwargs – keyword arguments as passed to pd.DataFrame.merge (except for ‘how’ and ‘suffixes’)
Returns: dataframe
-
scrapenhl2.scrape.general_helpers.
flip_first_last
(name)¶ Changes Ovechkin, Alex to Alex Ovechkin. Also changes to title case.
Parameters: name – str Returns: str, flipped if applicable
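The flip described above can be sketched in a few lines. This is an illustration only: flip_first_last_sketch is a hypothetical name, and applying title case to names without a comma is an assumption.

```python
def flip_first_last_sketch(name):
    """Hypothetical stand-in: turn 'Last, First' into 'First Last' in title case."""
    if "," not in name:
        # Assumption: names without a comma are only title-cased.
        return name.title()
    last, first = name.split(",", 1)
    return "{} {}".format(first.strip(), last.strip()).title()


print(flip_first_last_sketch("Ovechkin, Alex"))  # Alex Ovechkin
```

Note that str.title() lowercases interior capitals, so a name such as McDavid would come out as Mcdavid in this sketch.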
-
scrapenhl2.scrape.general_helpers.
fuzzy_match_player
(name_provided, names, minimum_similarity=50)¶ This method checks similarity between each entry in names and the name_provided using token set matching and returns the entry that matches best. Returns None if no similarity is greater than minimum_similarity. (See e.g. http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)
Parameters: - name_provided – str, name to look for
- names – list (or ndarray, or similar) of
- minimum_similarity – int from 0 to 100, minimum similarity. If all are below this, returns None.
Returns: str, string in names that best matches name_provided
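The best-match-above-a-threshold logic can be illustrated without fuzzywuzzy. The sketch below swaps in difflib.SequenceMatcher from the standard library, which is a different similarity measure from token set matching, so scores will not agree with the package's. The name best_match_sketch is hypothetical.

```python
from difflib import SequenceMatcher


def best_match_sketch(name_provided, names, minimum_similarity=50):
    """Hypothetical stand-in: best-scoring entry in names, or None if below threshold."""
    best_name, best_score = None, -1.0
    for candidate in names:
        # ratio() is in [0, 1]; scale to 0-100 to mirror the documented range.
        score = 100 * SequenceMatcher(None, name_provided.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score < minimum_similarity:
        return None
    return best_name


print(best_match_sketch("Alex Ovechkn", ["Alex Ovechkin", "Sidney Crosby"]))
```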
-
scrapenhl2.scrape.general_helpers.
get_initials
(pname)¶ Splits name on spaces and returns first letter from each part.
Parameters: pname – str, player name Returns: str, player initials
-
scrapenhl2.scrape.general_helpers.
get_lastname
(pname)¶ Splits name on first space and returns second part.
Parameters: pname – str, player name Returns: str, player last name
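The two splitting rules described for get_initials and get_lastname can be sketched as follows. The function names are hypothetical, and returning the whole name when there is no space is an assumption.

```python
def get_initials_sketch(pname):
    """Hypothetical stand-in: first letter of each space-separated part."""
    return "".join(part[0] for part in pname.split())


def get_lastname_sketch(pname):
    """Hypothetical stand-in: everything after the first space."""
    parts = pname.split(" ", 1)
    # Assumption: a single-word name is returned unchanged.
    return parts[1] if len(parts) > 1 else pname


print(get_initials_sketch("Alex Ovechkin"))       # AO
print(get_lastname_sketch("James van Riemsdyk"))  # van Riemsdyk
```

Splitting on the first space only is what keeps multi-word last names intact.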
-
scrapenhl2.scrape.general_helpers.
infer_season_from_date
¶ Looks at a date and infers the season based on that: Year-1 if month is Aug or before; returns year otherwise.
Parameters: date – str, YYYY-MM-DD Returns: int, the season. 2007-08 would be 2007.
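The rule above (January through August map to the previous year) can be sketched with the standard library. infer_season_sketch is a hypothetical name used for illustration.

```python
from datetime import datetime


def infer_season_sketch(date):
    """Hypothetical stand-in: map a YYYY-MM-DD string to the season's starting year."""
    parsed = datetime.strptime(date, "%Y-%m-%d")
    # August or earlier belongs to the season that started the previous year.
    if parsed.month <= 8:
        return parsed.year - 1
    return parsed.year


print(infer_season_sketch("2008-02-14"))  # 2007
print(infer_season_sketch("2007-10-03"))  # 2007
```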
-
scrapenhl2.scrape.general_helpers.
intervals
(lst, interval_pct=10)¶ A method that divides list into intervals and returns tuples indicating each interval mark. Useful for giving updates when cycling through games.
Parameters: - lst – lst to divide
- interval_pct – int, pct for each interval to represent. e.g. 10 means it will mark every 10%.
Returns: a list of tuples of (index, value)
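One way to produce such interval marks is sketched below. The exact indices the package chooses are not documented here, so the rounding and the skipping of repeated indices are assumptions, and intervals_sketch is a hypothetical name.

```python
def intervals_sketch(lst, interval_pct=10):
    """Hypothetical stand-in: (index, value) pairs at every interval_pct percent of lst."""
    marks = []
    n = len(lst)
    for pct in range(interval_pct, 100, interval_pct):
        index = n * pct // 100
        # Skip out-of-range and repeated indices (the latter occur for short lists).
        if index >= n or (marks and marks[-1][0] == index):
            continue
        marks.append((index, lst[index]))
    return marks


games = list(range(100, 120))
print(intervals_sketch(games, interval_pct=25))  # [(5, 105), (10, 110), (15, 115)]
```

When looping over games, a caller could print a progress update whenever the current index matches one of the marks.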
-
scrapenhl2.scrape.general_helpers.
log_exceptions
(fn)¶ A decorator that wraps the passed in function and logs exceptions should one occur
Parameters: fn – the function Returns: nothing
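A decorator of this kind is commonly written as below. This is a sketch under assumptions: log_exceptions_sketch is a hypothetical name, and re-raising the exception after logging is a design choice made here, not something the description above confirms.

```python
import functools
import logging


def log_exceptions_sketch(fn):
    """Hypothetical stand-in: log any exception raised by fn, then re-raise it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # logging.exception records the message together with the traceback.
            logging.exception("Exception in %s", fn.__name__)
            raise
    return wrapper


@log_exceptions_sketch
def add_one(x):
    return x + 1


print(add_one(1))  # 2
```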
-
scrapenhl2.scrape.general_helpers.
melt_helper
(df, **kwargs)¶ Earlier versions of pandas do not support pd.DataFrame.melt. This helps to bridge the gap. It first tries df.melt, and if that doesn’t work, it uses pd.melt.
Parameters: - df – dataframe
- kwargs – arguments to pd.melt or pd.DataFrame.melt.
Returns: melted dataframe
-
scrapenhl2.scrape.general_helpers.
mmss_to_secs
(strtime)¶ Converts time from mm:ss to seconds
Parameters: strtime – str, mm:ss Returns: int
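The conversion is simple arithmetic, sketched here with a hypothetical function name.

```python
def mmss_to_secs_sketch(strtime):
    """Hypothetical stand-in: convert an 'mm:ss' string to whole seconds."""
    minutes, seconds = strtime.split(":")
    return int(minutes) * 60 + int(seconds)


print(mmss_to_secs_sketch("12:34"))  # 754
```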
-
scrapenhl2.scrape.general_helpers.
once_per_second
(fn, calls_per_second=1)¶ A decorator that sleeps for one second after executing the function. Used when scraping NHL site. This also means all functions that access the internet sleep for a second.
Parameters: fn – the function Returns: nothing
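The sleep-after-call pattern can be sketched as a small wrapper. The names sleep_after_sketch and delay_secs are hypothetical; the real decorator's handling of calls_per_second is not shown here.

```python
import functools
import time


def sleep_after_sketch(fn, delay_secs=1.0):
    """Hypothetical stand-in: call fn, then pause before returning its result."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        # Pausing after every call throttles repeated requests to the same site.
        time.sleep(delay_secs)
        return result
    return wrapper


def fake_download(url):
    return "page for " + url


throttled = sleep_after_sketch(fake_download, delay_secs=0.1)
print(throttled("example"))
```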
-
scrapenhl2.scrape.general_helpers.
period_contribution
(x)¶ Turns period (1, 2, 3, OT, etc.) into the number of seconds elapsed in the game until that period starts.
Parameters: x – str or int, 1, 2, 3, etc Returns: int, number of seconds elapsed until start of specified period
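A minimal sketch of this calculation follows, assuming 20-minute periods and treating OT as the fourth period. Both are assumptions for illustration, and period_start_secs_sketch is a hypothetical name.

```python
PERIOD_SECS = 20 * 60  # assumed length of a regulation period, in seconds


def period_start_secs_sketch(x):
    """Hypothetical stand-in: seconds elapsed in the game when period x starts."""
    if isinstance(x, str) and x.upper() == "OT":
        # Assumption: overtime is treated as the fourth period.
        period = 4
    else:
        period = int(x)
    return (period - 1) * PERIOD_SECS


print(period_start_secs_sketch(3))     # 2400
print(period_start_secs_sketch("OT"))  # 3600
```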
-
scrapenhl2.scrape.general_helpers.
print_and_log
(message, level='info', print_and_log=True)¶ A helper method that prints message to console and also writes to log with specified level.
Parameters: - message – str, the message
- level – str, the level of log: info, warn, error, critical
- print_and_log – bool. If False, logs only.
Returns: nothing
-
scrapenhl2.scrape.general_helpers.
remove_leading_number
(string)¶ Will convert 8 Alex Ovechkin to Alex Ovechkin, or Alex Ovechkin to Alex Ovechkin
Parameters: string – a string Returns: string without leading numbers
-
scrapenhl2.scrape.general_helpers.
start_logging
()¶ Clears out logging folder, and starts the log in this folder
-
scrapenhl2.scrape.general_helpers.
try_to_access_dict
(base_dct, *keys, **kwargs)¶ A helper method that accesses base_dct using keys, one-by-one. Returns None if a key does not exist.
Parameters: - base_dct – dict, a dictionary
- keys – str, int, or other valid dict keys
- kwargs – can specify default using kwarg default_return=0, for example.
Returns: obj, base_dct[key1][key2][key3]… or None if a key is not in the dictionary
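The key-by-key access can be sketched as below. The name try_to_access_dict_sketch is hypothetical, and the sketch takes default_return as a keyword-only argument rather than through **kwargs.

```python
def try_to_access_dict_sketch(base_dct, *keys, default_return=None):
    """Hypothetical stand-in: follow keys into nested dicts, with a fallback value."""
    current = base_dct
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # A missing key, or a non-subscriptable value, ends the lookup.
            return default_return
    return current


pbp = {"gameData": {"teams": {"home": {"id": 15}}}}
print(try_to_access_dict_sketch(pbp, "gameData", "teams", "home", "id"))     # 15
print(try_to_access_dict_sketch(pbp, "gameData", "venue", "name"))           # None
```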
-
scrapenhl2.scrape.general_helpers.
try_url_n_times
(url, timeout=5, n=5)¶ A helper method that tries to access given url up to n times (five by default), returning the page.
Parameters: - url – str, the url to access
- timeout – int, number of secs to wait before timeout. Default 5.
- n – int, the max number of tries. Default 5.
Returns: bytes
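A retry loop of this shape can be sketched with urllib from the standard library. This is not the package's code: fetch_with_retries_sketch and its opener parameter are hypothetical (the latter exists so the loop can be exercised without network access), and returning None after the last failed attempt is an assumption.

```python
import urllib.request


def fetch_with_retries_sketch(url, timeout=5, n=5, opener=urllib.request.urlopen):
    """Hypothetical stand-in: try to read url up to n times, returning bytes or None."""
    for _attempt in range(n):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError:
            # urllib.error.URLError and socket timeouts are subclasses of OSError.
            continue
    # Assumption: give up quietly once every attempt has failed.
    return None
```

A caller would pass only the url in normal use, for example fetch_with_retries_sketch(url, timeout=5, n=3).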
Organization
This module contains paths to folders.
-
scrapenhl2.scrape.organization.
check_create_folder
(*args)¶ A helper method to create a folder if it doesn’t exist already
Parameters: args – list of str, the parts of the filepath. These are joined together with the base directory Returns: nothing
-
scrapenhl2.scrape.organization.
get_base_dir
()¶ Returns the base directory of this package (one directory up from this file)
Returns: str, the base directory
-
scrapenhl2.scrape.organization.
get_other_data_folder
()¶ Returns the folder containing other data
Returns: str, /scrape/data/other/
-
scrapenhl2.scrape.organization.
get_parsed_data_folder
()¶ Returns the folder containing parsed data
Returns: str, /scrape/data/parsed/
-
scrapenhl2.scrape.organization.
get_raw_data_folder
()¶ Returns the folder containing raw data
Returns: str, /scrape/data/raw/
-
scrapenhl2.scrape.organization.
get_season_parsed_pbp_folder
(season)¶ Returns the folder containing parsed pbp for given season
Parameters: season – int, current season Returns: str, /scrape/data/parsed/pbp/[season]/
-
scrapenhl2.scrape.organization.
get_season_parsed_toi_folder
(season)¶ Returns the folder containing parsed toi for given season
Parameters: season – int, current season Returns: str, /scrape/data/parsed/toi/[season]/
-
scrapenhl2.scrape.organization.
get_season_raw_pbp_folder
(season)¶ Returns the folder containing raw pbp for given season
Parameters: season – int, current season Returns: str, /scrape/data/raw/pbp/[season]/
-
scrapenhl2.scrape.organization.
get_season_raw_toi_folder
(season)¶ Returns the folder containing raw toi for given season
Parameters: season – int, current season Returns: str, /scrape/data/raw/toi/[season]/
-
scrapenhl2.scrape.organization.
get_season_team_pbp_folder
(season)¶ Returns the folder containing team pbp logs for given season
Parameters: season – int, current season Returns: str, /scrape/data/teams/pbp/[season]/
-
scrapenhl2.scrape.organization.
get_season_team_toi_folder
(season)¶ Returns the folder containing team toi logs for given season
Parameters: season – int, current season Returns: str, /scrape/data/teams/toi/[season]/
-
scrapenhl2.scrape.organization.
get_team_data_folder
()¶ Returns the folder containing team log data
Returns: str, /scrape/data/teams/
-
scrapenhl2.scrape.organization.
organization_setup
()¶ Creates other folder if need be
Returns: nothing
Players
This module contains methods related to individual player info.
-
scrapenhl2.scrape.players.
check_default_player_id
(playername)¶ E.g. For Mike Green, I should automatically assume we mean 8471242 (WSH/DET), not 8468436. Returns None if not in dict. Ideally improve code so this isn’t needed.
Parameters: playername – str Returns: int, or None
-
scrapenhl2.scrape.players.
generate_player_ids_file
()¶ Creates a dataframe with these columns:
- ID: int, player ID
- Name: str, player name
- DOB: str, date of birth
- Hand: char, R or L
- Pos: char, one of C/R/L/D/G
It will be populated with Alex Ovechkin to start.
Returns: nothing
-
scrapenhl2.scrape.players.
generate_player_log_file
()¶ Run this when no player log file exists already. This is for getting the datatypes right. Adds Alex Ovechkin in Game 1 vs Pittsburgh in 2016-2017.
Returns: nothing
-
scrapenhl2.scrape.players.
get_player_handedness
¶ Retrieves handedness of player
Parameters: player – str or int, the player name or ID Returns: str, player hand (L or R)
-
scrapenhl2.scrape.players.
get_player_ids_file
()¶ Returns the player information file. This is stored as a feather file for fast read/write.
Returns: /scrape/data/other/PLAYER_INFO.feather
-
scrapenhl2.scrape.players.
get_player_info_from_url
(playerid)¶ Gets ID, Name, Hand, Pos, DOB, Height, Weight, and Nationality from the NHL API.
Parameters: playerid – int, the player id Returns: dict with player ID, name, handedness, position, etc
-
scrapenhl2.scrape.players.
get_player_log_file
()¶ Returns the player log file from memory.
Returns: dataframe, the log
-
scrapenhl2.scrape.players.
get_player_log_filename
()¶ Returns the player log filename.
Returns: str, /scrape/data/other/PLAYER_LOG.feather
-
scrapenhl2.scrape.players.
get_player_position
¶ Retrieves position of player
Parameters: player – str or int, the player name or ID Returns: str, player position (e.g. C, D, R, L, G)
-
scrapenhl2.scrape.players.
get_player_url
(playerid)¶ Gets the url for a page containing information for specified player from NHL API.
Parameters: playerid – int, the player ID Returns: str, https://statsapi.web.nhl.com/api/v1/people/[playerid]
-
scrapenhl2.scrape.players.
player_as_id
¶ A helper method. If player entered is int, returns that. If player is str, returns integer id of that player.
Parameters: - playername – int, or str, the player whose ID you want to retrieve
- filterids – a tuple of players to choose from. Needs to be tuple else caching won’t work.
- dob – yyyy-mm-dd, use to help when multiple players have the same name
Returns: int, the player ID
-
scrapenhl2.scrape.players.
player_as_str
¶ A helper method. If player is int, returns string name of that player. Else returns standardized name.
Parameters: - playerid – int, or str, player whose name you want to retrieve
- filterids – a tuple of players to choose from. Needs to be tuple else caching won’t work. Probably not needed but you can use this method to go from part of the name to full name, in which case it may be helpful.
Returns: str, the player name
-
scrapenhl2.scrape.players.
player_setup
()¶ Loads player info file into memory.
Returns: nothing
-
scrapenhl2.scrape.players.
playerlst_as_id
(playerlst, exact=False, filterdf=None)¶ Similar to player_as_id, but less robust against errors, and works on a list of players.
Parameters: - playerlst – a list of int, or str, players whose IDs you want to retrieve.
- exact – bool. If True, looks for exact matches. If False, does not, using player_as_id (but will be slower)
- filterdf – df, a dataframe of players to choose from. Defaults to all.
Returns: a list of int/float
-
scrapenhl2.scrape.players.
playerlst_as_str
(players, filterdf=None)¶ Similar to player_as_str, but less robust against errors, and works on a list of players
Parameters: - players – a list of int, or str, players whose names you want to retrieve
- filterdf – df, a dataframe of players to choose from. Defaults to all.
Returns: a list of str
-
scrapenhl2.scrape.players.
rescrape_player
(playerid)¶ If you notice that a player name, position, etc, is outdated, call this method on their ID. It will re-scrape their data from the NHL API.
Parameters: playerid – int, their ID. Also accepts str, their name. Returns: nothing
-
scrapenhl2.scrape.players.
update_player_ids_file
(playerids, force_overwrite=False)¶ Adds these entries to player IDs file if need be.
Parameters: - playerids – a list of IDs
- force_overwrite – bool. If True, will re-scrape data for all player ids. If False, only new ones.
Returns: nothing
-
scrapenhl2.scrape.players.
update_player_ids_from_page
(pbp)¶ Reads the list of players listed in the game file and adds to the player IDs file if they are not there already.
Parameters: pbp – json, the raw pbp Returns: nothing
-
scrapenhl2.scrape.players.
update_player_log_file
(playerids, seasons, games, teams, statuses)¶ Updates the player log file with given players. The player log file notes which players played in which games and whether they were scratched or played.
Parameters: - playerids – int or str or list of int
- seasons – int, the season, or list of int the same length as playerids
- games – int, the game, or list of int the same length as playerids
- teams – str or int, the team, or list of int the same length as playerids
- statuses – str, or list of str the same length as playerids
Returns: nothing
-
scrapenhl2.scrape.players.
update_player_logs_from_page
(pbp, season, game)¶ Takes the game play by play and adds players to the master player log file, noting that they were on the roster for this game, which team they played for, and their status (P for played, S for scratch).
Parameters: - season – int, the season
- game – int, the game
- pbp – json, the pbp of the game
Returns: nothing
-
scrapenhl2.scrape.players.
write_player_ids_file
(df)¶ Writes the given dataframe to disk as the player ids mapping.
Parameters: df – pandas dataframe, player ids file Returns: nothing
-
scrapenhl2.scrape.players.
write_player_log_file
(df)¶ Writes the given dataframe to file as the player log filename
Parameters: df – pandas dataframe Returns: nothing
Schedules
This module contains methods related to season schedules.
-
scrapenhl2.scrape.schedules.
attach_game_dates_to_dateframe
(df)¶ Takes dataframe with Season and Game columns and adds a Date column (for that game)
Parameters: df – dataframe Returns: dataframe with one more column
-
scrapenhl2.scrape.schedules.
generate_season_schedule_file
(season, force_overwrite=True)¶ Reads season schedule from NHL API and writes to file.
The output contains the following columns:
- Season: int, the season
- Date: str, the dates
- Game: int, the game id
- Type: str, the game type (for preseason vs regular season, etc)
- Status: str, e.g. Final
- Road: int, the road team ID
- RoadScore: int, number of road team goals
- RoadCoach: str, ‘N/A’ when this function is run (edited later with road coach name)
- Home: int, the home team ID
- HomeScore: int, number of home team goals
- HomeCoach: str, ‘N/A’ when this function is run (edited later with home coach name)
- Venue: str, the name of the arena
- Result: str, ‘N/A’ when this function is run (edited accordingly later from PoV of home team: W, OTW, SOL, etc)
- PBPStatus: str, ‘Not scraped’ when this function is run (edited accordingly later)
- TOIStatus: str, ‘Not scraped’ when this function is run (edited accordingly later)
Parameters: - season – int, the season
- force_overwrite – bool. If True, generates entire file from scratch. If False, only redoes when not Final previously.
Returns: Nothing
-
scrapenhl2.scrape.schedules.
get_current_season
()¶ Returns the current season.
Returns: The current season variable (generated at import from _get_current_season)
-
scrapenhl2.scrape.schedules.
get_game_data_from_schedule
¶ This is a helper method that uses the schedule file to isolate information for current game (e.g. teams involved, coaches, venue, score, etc.)
Parameters: - season – int, the season
- game – int, the game
Returns: dict of game data
-
scrapenhl2.scrape.schedules.
get_game_date
(season, game)¶ Returns the date of this game
Parameters: - season – int, the season
- game – int, the game
Returns: str
-
scrapenhl2.scrape.schedules.
get_game_result
(season, game)¶ Returns the result of this game for home team (e.g. W, SOL)
Parameters: - season – int, the season
- game – int, the game
Returns: str, the result
-
scrapenhl2.scrape.schedules.
get_game_status
(season, game)¶ Returns the status of this game (e.g. Final, In Progress)
Parameters: - season – int, the season
- game – int, the game
Returns: str, the status
-
scrapenhl2.scrape.schedules.
get_home_score
(season, game)¶ Returns the home score from this game
Parameters: - season – int, the season
- game – int, the game
Returns: int, the score
-
scrapenhl2.scrape.schedules.
get_home_team
(season, game, returntype='id')¶ Returns the home team from this game
Parameters: - season – int, the season
- game – int, the game
- returntype – str, ‘id’ or ‘name’
Returns: float or str, depending on returntype
-
scrapenhl2.scrape.schedules.
get_road_score
(season, game)¶ Returns the road score from this game
Parameters: - season – int, the season
- game – int, the game
Returns: int, the score
-
scrapenhl2.scrape.schedules.
get_road_team
(season, game, returntype='id')¶ Returns the road team from this game
Parameters: - season – int, the season
- game – int, the game
- returntype – str, ‘id’ or ‘name’
Returns: float or str, depending on returntype
-
scrapenhl2.scrape.schedules.
get_season_schedule
(season)¶ Gets the season’s schedule file from memory.
Parameters: season – int, the season Returns: dataframe (originally from /scrape/data/other/[season]_schedule.feather)
-
scrapenhl2.scrape.schedules.
get_season_schedule_filename
(season)¶ Gets the filename for the season’s schedule file
Parameters: season – int, the season Returns: str, /scrape/data/other/[season]_schedule.feather
-
scrapenhl2.scrape.schedules.
get_season_schedule_url
(season)¶ Gets the url for a page containing all of this season’s games (Sep 1 to Jun 25) from NHL API.
Parameters: season – int, the season Returns: str, https://statsapi.web.nhl.com/api/v1/schedule?startDate=[season]-09-01&endDate=[season+1]-06-25
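The URL pattern in the Returns line can be filled in with plain string formatting, as in this sketch. season_schedule_url_sketch is a hypothetical name; the template is copied from the Returns line above.

```python
def season_schedule_url_sketch(season):
    """Hypothetical stand-in: fill the documented schedule URL template for a season."""
    template = ("https://statsapi.web.nhl.com/api/v1/schedule"
                "?startDate={0:d}-09-01&endDate={1:d}-06-25")
    # The end date falls in the calendar year after the season's starting year.
    return template.format(season, season + 1)


print(season_schedule_url_sketch(2017))
```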
-
scrapenhl2.scrape.schedules.
get_team_games
(season=None, team=None, startdate=None, enddate=None)¶ Returns list of games played by team in season.
Just calls get_team_schedule with the provided arguments, returning the series of games from that dataframe.
Parameters: - season – int, the season
- team – int or str, the team
- startdate – str or None
- enddate – str or None
Returns: series of games
-
scrapenhl2.scrape.schedules.
get_team_schedule
(season=None, team=None, startdate=None, enddate=None)¶ Gets the schedule for given team in given season. Or if startdate and enddate are specified, searches between those dates. If season and startdate (and/or enddate) are specified, searches that season between those dates.
Parameters: - season – int, the season
- team – int or str, the team
- startdate – str, YYYY-MM-DD
- enddate – str, YYYY-MM-DD
Returns: dataframe
-
scrapenhl2.scrape.schedules.
get_teams_in_season
(season)¶ Returns all teams that have a game in the schedule for this season
Parameters: season – int, the season Returns: set of team IDs
-
scrapenhl2.scrape.schedules.
schedule_setup
()¶ Reads current season and schedules into memory.
Returns: nothing
-
scrapenhl2.scrape.schedules.
write_season_schedule
(df, season, force_overwrite)¶ A helper method that writes the season schedule file to disk (in feather format for fast read/write)
Parameters: - df – the season schedule dataframe
- season – the season
- force_overwrite – bool. If True, overwrites entire file. If False, only redoes when not Final previously.
Returns: Nothing
Manipulate schedules
This module contains methods related to generating and manipulating schedules.
-
scrapenhl2.scrape.manipulate_schedules.
update_schedule_with_coaches
(pbp, season, game)¶ Uses the PbP to update coach info for this game.
Parameters: - pbp – json, the pbp for this game
- season – int, the season
- game – int, the game
Returns: nothing
-
scrapenhl2.scrape.manipulate_schedules.
update_schedule_with_pbp_scrape
(season, game)¶ Updates the schedule file saying that specified game’s pbp has been scraped.
Parameters: - season – int, the season
- game – int, the game, or list of ints
Returns: updated schedule
-
scrapenhl2.scrape.manipulate_schedules.
update_schedule_with_result
(season, game, result)¶ Updates the season schedule file with game result (which are listed ‘N/A’ at schedule generation)
Parameters: - season – int, the season
- game – int, the game
- result – str, the result from home team perspective
Returns:
-
scrapenhl2.scrape.manipulate_schedules.
update_schedule_with_result_using_pbp
(pbp, season, game)¶ Uses the PbP to update results for this game.
Parameters: - pbp – json, the pbp for this game
- season – int, the season
- game – int, the game
Returns: nothing
-
scrapenhl2.scrape.manipulate_schedules.
update_schedule_with_toi_scrape
(season, game)¶ Updates the schedule file saying that specified game’s toi has been scraped.
Parameters: - season – int, the season
- game – int, the game, or list of int
Returns: nothing
Scrape play by play
This module contains methods for scraping pbp.
-
scrapenhl2.scrape.scrape_pbp.
get_game_from_url
(season, game)¶ Gets the page containing information for specified game from NHL API.
Parameters: - season – int, the season
- game – int, the game
Returns: str, the page at the url
-
scrapenhl2.scrape.scrape_pbp.
get_game_pbplog_filename
(season, game)¶ Returns the filename of the html pbp file for this game
Parameters: - season – int, current season
- game – int, game
Returns: str, /scrape/data/raw/pbp/[season]/[game].html
-
scrapenhl2.scrape.scrape_pbp.
get_game_pbplog_url
(season, game)¶ Gets the url for a page containing pbp information for specified game from HTML tables.
Parameters: - season – int, the season
- game – int, the game
Returns: str, e.g. http://www.nhl.com/scores/htmlreports/20072008/PL020001.HTM
-
scrapenhl2.scrape.scrape_pbp.
get_game_raw_pbp_filename
(season, game)¶ Returns the filename of the raw pbp file
Parameters: - season – int, current season
- game – int, game
Returns: str, /scrape/data/raw/pbp/[season]/[game].zlib
-
scrapenhl2.scrape.scrape_pbp.
get_game_url
(season, game)¶ Gets the url for a page containing information for specified game from NHL API.
Parameters: - season – int, the season
- game – int, the game
Returns: str, https://statsapi.web.nhl.com/api/v1/game/[season]0[game]/feed/live
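The [season]0[game] pattern in the Returns line concatenates the season, a literal 0, and the game number. The sketch below illustrates this with a hypothetical function name; the template is copied from the Returns line above.

```python
def game_url_sketch(season, game):
    """Hypothetical stand-in: build the live-feed URL from season and game numbers."""
    # Season 2017 and game 20001 combine into the ID 2017020001.
    return "https://statsapi.web.nhl.com/api/v1/game/{0:d}0{1:d}/feed/live".format(season, game)


print(game_url_sketch(2017, 20001))
```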
-
scrapenhl2.scrape.scrape_pbp.
get_raw_html_pbp
(season, game)¶ Loads the html file containing this game’s play by play from disk.
Parameters: - season – int, the season
- game – int, the game
Returns: str, the html pbp
-
scrapenhl2.scrape.scrape_pbp.
get_raw_pbp
(season, game)¶ Loads the compressed json file containing this game’s play by play from disk.
Parameters: - season – int, the season
- game – int, the game
Returns: json, the json pbp
-
scrapenhl2.scrape.scrape_pbp.
save_raw_html_pbp
(page, season, game)¶ Takes the bytes page containing html pbp information and saves as such
Parameters: - page – bytes
- season – int, the season
- game – int, the game
Returns: nothing
-
scrapenhl2.scrape.scrape_pbp.
save_raw_pbp
(page, season, game)¶ Takes the bytes page containing pbp information and saves to disk as a compressed zlib.
Parameters: - page – bytes. str(page) would yield a string version of the json pbp
- season – int, the season
- game – int, the game
Returns: nothing
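Saving bytes as a zlib-compressed file and reading them back as json can be sketched with the standard library. The function names and the explicit filename argument are hypothetical; the package derives the path from season and game instead.

```python
import json
import os
import tempfile
import zlib


def save_compressed_sketch(page, filename):
    """Hypothetical stand-in: write bytes to filename, compressed with zlib."""
    with open(filename, "wb") as handle:
        handle.write(zlib.compress(page))


def load_compressed_json_sketch(filename):
    """Hypothetical stand-in: read a zlib-compressed file and parse it as json."""
    with open(filename, "rb") as handle:
        raw = zlib.decompress(handle.read())
    return json.loads(raw.decode("utf-8"))


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "20001.zlib")
    save_compressed_sketch(b'{"gamePk": 2017020001}', path)
    print(load_compressed_json_sketch(path))
```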
-
scrapenhl2.scrape.scrape_pbp.
scrape_game_pbp
(season, game, force_overwrite=False)¶ This method scrapes the pbp for the given game.
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If False and the file exists already, won’t scrape again
Returns: bool, False if not scraped, else True
-
scrapenhl2.scrape.scrape_pbp.
scrape_game_pbp_from_html
(season, game, force_overwrite=True)¶ This method scrapes the html pbp for the given game. Use for live games.
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If False and the file exists already, won’t scrape again
Returns: bool, False if not scraped, else True
-
scrapenhl2.scrape.scrape_pbp.
scrape_pbp_setup
()¶ Creates raw pbp folders if need be
Returns:
-
scrapenhl2.scrape.scrape_pbp.
scrape_season_pbp
(season, force_overwrite=False)¶ Scrapes and parses pbp from the given season.
Parameters: - season – int, the season
- force_overwrite – bool. If true, rescrapes all games. If false, only previously unscraped ones
Returns: nothing
Parse play by play
This module contains methods for parsing PBP.
-
scrapenhl2.scrape.parse_pbp.
get_5v5_corsi_pm
(season, game, cfca=None)¶ Returns a dataframe from home team perspective. Each row is a Corsi event, with time and note of whether it’s positive or negative for home team.
Parameters: - season – int, the season
- game – int, the game
- cfca – str, or None. If you specify ‘cf’, returns CF only. For CA, use ‘ca’. None returns CF - CA.
Returns: dataframe with columns Time and HomeCorsi
-
scrapenhl2.scrape.parse_pbp.
get_game_parsed_pbp_filename
(season, game)¶ Returns the filename of the parsed pbp file
Parameters: - season – int, current season
- game – int, game
Returns: str, /scrape/data/parsed/pbp/[season]/[game].zlib
-
scrapenhl2.scrape.parse_pbp.
get_parsed_pbp
(season, game)¶ Loads the compressed json file containing this game’s play by play from disk.
Parameters: - season – int, the season
- game – int, the game
Returns: json, the json pbp
-
scrapenhl2.scrape.parse_pbp.
parse_game_pbp
(season, game, force_overwrite=False)¶ Reads the raw pbp from file, updates player IDs, updates player logs, and parses the JSON to a pandas DF and writes to file. Also updates team logs accordingly.
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns: True if parsed, False if not
-
scrapenhl2.scrape.parse_pbp.
parse_game_pbp_from_html
(season, game, force_overwrite=False)¶ Reads the raw pbp from file, updates player IDs, updates player logs, and parses the JSON to a pandas DF and writes to file. Also updates team logs accordingly.
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns: True if parsed, False if not
-
scrapenhl2.scrape.parse_pbp.
parse_pbp_setup
()¶ Creates parsed pbp folders if need be
Returns: nothing
-
scrapenhl2.scrape.parse_pbp.
parse_season_pbp
(season, force_overwrite=False)¶ Parses pbp from the given season.
Parameters: - season – int, the season
- force_overwrite – bool. If true, parses all games. If false, only previously unparsed ones
Returns: nothing
-
scrapenhl2.scrape.parse_pbp.
read_events_from_page
(rawpbp, season, game)¶ This method takes the json pbp and returns a pandas dataframe with the following columns:
- Index: int, index of event
- Period: str, period of event. In regular season, could be 1, 2, 3, OT, or SO. In playoffs, 1, 2, 3, 4, 5…
- MinSec: str, m:ss, time elapsed in period
- Time: int, time elapsed in game
- Event: str, the event name
- Team: int, the team id. Note that this is switched to blocked team for blocked shots to ease Corsi calculations.
- Actor: int, the acting player id. Switched with recipient for blocks (see above)
- ActorRole: str, e.g. for faceoffs there is a “Winner” and “Loser”. Switched with recipient for blocks (see above)
- Recipient: int, the receiving player id. Switched with actor for blocks (see above)
- RecipientRole: str, e.g. for faceoffs there is a “Winner” and “Loser”. Switched with actor for blocks (see above)
- X: int, the x coordinate of event (or NaN)
- Y: int, the y coordinate of event (or NaN)
- Note: str, additional notes, which may include penalty duration, assists on a goal, etc.
Parameters: - rawpbp – json, the raw json pbp
- season – int, the season
- game – int, the game
Returns: pandas dataframe, the pbp in a nicer format
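Two of the columns above lend themselves to a short illustration: Time (derived from Period and MinSec) and the actor/recipient switch applied to blocked shots. The sketch below uses plain dicts rather than a dataframe, and it rests on assumptions not stated in the docs: Time is measured in seconds, every period lasts 20 minutes, and only numeric periods are handled (not OT or SO). The function names and the role labels in the usage lines are hypothetical.

```python
def elapsed_game_seconds(period, minsec, period_length=1200):
    """Convert a numeric period and an m:ss string into seconds elapsed in the game.

    Assumes 20-minute (1200-second) periods; OT and SO are not handled here.
    """
    minutes, seconds = minsec.split(":")
    return (int(period) - 1) * period_length + int(minutes) * 60 + int(seconds)


def swap_for_block(event):
    """Return a copy of the event with actor and recipient (and their roles) exchanged."""
    swapped = dict(event)
    swapped["Actor"], swapped["Recipient"] = event["Recipient"], event["Actor"]
    swapped["ActorRole"], swapped["RecipientRole"] = event["RecipientRole"], event["ActorRole"]
    return swapped


# Usage with placeholder player IDs and hypothetical role names
block = {"Actor": 1, "ActorRole": "Shooter", "Recipient": 2, "RecipientRole": "Blocker"}
swapped = swap_for_block(block)
game_seconds = elapsed_game_seconds(2, "5:30")  # 1200 + 330 seconds
```

The Team column is switched in the same way for blocks, but that requires knowing both team IDs for the game, so it is left out of this sketch.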
-
scrapenhl2.scrape.parse_pbp.
save_parsed_pbp
(pbp, season, game)¶ Saves the pandas dataframe containing pbp information to disk as an HDF5 file.
Parameters: - pbp – df, a pandas dataframe with the pbp of the game
- season – int, the season
- game – int, the game
Returns: nothing
Scrape TOI¶
This module contains methods for scraping TOI.
-
scrapenhl2.scrape.scrape_toi.
get_game_raw_toi_filename
(season, game)¶ Returns the filename of the raw toi folder
Parameters: - season – int, current season
- game – int, game
Returns: str, /scrape/data/raw/toi/[season]/[game].zlib
-
scrapenhl2.scrape.scrape_toi.
get_home_shiftlog_filename
(season, game)¶ Returns the filename of the parsed toi html home shifts
Parameters: - season – int, the season
- game – int, the game
Returns: str, /scrape/data/raw/pbp/[season]/[game]H.html
-
scrapenhl2.scrape.scrape_toi.
get_home_shiftlog_url
(season, game)¶ Gets the url for a page containing shift information for specified game from HTML tables for home team.
Parameters: - season – int, the season
- game – int, the game
Returns: str, e.g. http://www.nhl.com/scores/htmlreports/20072008/TH020001.HTM
-
scrapenhl2.scrape.scrape_toi.
get_raw_html_toi
(season, game, homeroad)¶ Loads the html file containing this game’s toi from disk.
Parameters: - season – int, the season
- game – int, the game
- homeroad – str, ‘H’ for home or ‘R’ for road
Returns: str, the html toi
-
scrapenhl2.scrape.scrape_toi.
get_raw_toi
(season, game)¶ Loads the compressed json file containing this game’s shifts from disk.
Parameters: - season – int, the season
- game – int, the game
Returns: dict, the json shifts
-
scrapenhl2.scrape.scrape_toi.
get_road_shiftlog_filename
(season, game)¶ Returns the filename of the parsed toi html road shifts
Parameters: - season – int, current season
- game – int, game
Returns: str, /scrape/data/raw/pbp/[season]/[game]R.html
-
scrapenhl2.scrape.scrape_toi.
get_road_shiftlog_url
(season, game)¶ Gets the url for a page containing shift information for specified game from HTML tables for road team.
Parameters: - season – int, the season
- game – int, the game
Returns: str, e.g. http://www.nhl.com/scores/htmlreports/20072008/TV020001.HTM
-
scrapenhl2.scrape.scrape_toi.
get_shift_url
(season, game)¶ Gets the url for a page containing shift information for specified game from NHL API.
Parameters: - season – int, the season
- game – int, the game
Returns: str, http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId=[season]0[game]
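The three URL patterns documented above can be reproduced with simple string formatting. The sketch below is inferred from the example URLs only. It assumes season is the starting year (2007 for 2007-08) and game is a five-digit number such as 20001; the function names are hypothetical and are not the library's own.

```python
def home_shiftlog_url(season, game):
    """HTML shift report for the home team (TH prefix)."""
    return "http://www.nhl.com/scores/htmlreports/{0:d}{1:d}/TH0{2:d}.HTM".format(
        season, season + 1, game)


def road_shiftlog_url(season, game):
    """HTML shift report for the road team (TV prefix)."""
    return "http://www.nhl.com/scores/htmlreports/{0:d}{1:d}/TV0{2:d}.HTM".format(
        season, season + 1, game)


def shift_api_url(season, game):
    """Shift chart endpoint, with game ID written as [season]0[game]."""
    return "http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId={0:d}0{1:d}".format(
        season, game)


print(home_shiftlog_url(2007, 20001))
# http://www.nhl.com/scores/htmlreports/20072008/TH020001.HTM
```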
-
scrapenhl2.scrape.scrape_toi.
save_raw_toi
(page, season, game)¶ Takes the bytes page containing shift information and saves it to disk as a zlib-compressed file.
Parameters: - page – bytes. str(page) would yield a string version of the json shifts
- season – int, the season
- game – int, the game
Returns: nothing
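Saving a bytes page as a zlib-compressed file and loading it back as a dict (as get_raw_toi is described as doing) might look like the sketch below. It uses only the standard library and hypothetical function names. The base_folder argument stands in for the raw toi folder, and the UTF-8 decoding is an assumption about the page encoding.

```python
import json
import os
import zlib


def raw_toi_filename(base_folder, season, game):
    """Build [base_folder]/[season]/[game].zlib."""
    return os.path.join(base_folder, str(season), "{0:d}.zlib".format(game))


def save_compressed(page, filename):
    """Compress the bytes page with zlib and write it to disk."""
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, "wb") as writer:
        writer.write(zlib.compress(page))


def load_compressed_json(filename):
    """Read a zlib-compressed file and parse its contents as JSON."""
    with open(filename, "rb") as reader:
        raw = zlib.decompress(reader.read())
    return json.loads(raw.decode("utf-8"))
```

Note that the sketch calls decode() on the decompressed bytes. In Python 3, calling str() on a bytes object returns its repr (with a leading b'), which would not parse as JSON.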
-
scrapenhl2.scrape.scrape_toi.
save_raw_toi_from_html
(page, season, game, homeroad)¶ Takes the bytes page containing shift information and saves to disk as html.
Parameters: - page – bytes. str(page) would yield a string version of the html shifts
- season – int, the season
- game – int, the game
- homeroad – str, ‘H’ or ‘R’
Returns: nothing
-
scrapenhl2.scrape.scrape_toi.
scrape_game_toi
(season, game, force_overwrite=False)¶ This method scrapes the toi for the given game.
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If False and the file already exists, won’t scrape again
Returns: nothing
-
scrapenhl2.scrape.scrape_toi.
scrape_game_toi_from_html
(season, game, force_overwrite=True)¶ This method scrapes the toi html logs for the given game.
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If False and the file already exists, won’t scrape again
Returns: nothing
-
scrapenhl2.scrape.scrape_toi.
scrape_season_toi
(season, force_overwrite=False)¶ Scrapes and parses toi from the given season.
Parameters: - season – int, the season
- force_overwrite – bool. If true, rescrapes all games. If false, only previously unscraped ones
Returns: nothing
-
scrapenhl2.scrape.scrape_toi.
scrape_toi_setup
()¶ Creates raw toi folders if need be
Returns: nothing
Parse TOI¶
This module contains methods for parsing TOI.
-
scrapenhl2.scrape.parse_toi.
get_game_parsed_toi_filename
(season, game)¶ Returns the filename of the parsed toi folder
Parameters: - season – int, current season
- game – int, game
Returns: str, /scrape/data/parsed/toi/[season]/[game].zlib
-
scrapenhl2.scrape.parse_toi.
get_melted_home_road_5v5_toi
(season, game)¶ Reads parsed TOI for this game, filters for 5v5 TOI, and melts from wide to long on player
Parameters: - season – int, the season
- game – int, the game
Returns: (home_df, road_df), each with columns Time, PlayerID, and Team (which will be H or R)
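The "filter for 5v5, then melt from wide to long on player" step can be illustrated without pandas, using a list of dicts as a stand-in for the TOI dataframe. Everything about the input layout here is an assumption made for illustration: the HomeSkaters and RoadSkaters keys, the player column names, and the function names are hypothetical. Only the output columns (Time, PlayerID, Team) come from the description above.

```python
def is_5v5(row):
    """True when both teams have five skaters (hypothetical strength keys)."""
    return row["HomeSkaters"] == 5 and row["RoadSkaters"] == 5


def melt_players(rows, player_columns, team_label):
    """Turn one row per second (wide) into one row per second and player (long)."""
    return [
        {"Time": row["Time"], "PlayerID": row[col], "Team": team_label}
        for row in rows
        for col in player_columns
    ]


def melted_home_road_5v5(rows, home_columns, road_columns):
    """Filter to 5v5 seconds, then melt home and road player columns separately."""
    rows_5v5 = [row for row in rows if is_5v5(row)]
    home = melt_players(rows_5v5, home_columns, "H")
    road = melt_players(rows_5v5, road_columns, "R")
    return home, road
```

With pandas, DataFrame.melt performs the same wide-to-long reshaping on a real dataframe.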
-
scrapenhl2.scrape.parse_toi.
get_parsed_toi
(season, game)¶ Loads the compressed json file containing this game’s shifts from disk.
Parameters: - season – int, the season
- game – int, the game
Returns: json, the json shifts
-
scrapenhl2.scrape.parse_toi.
parse_game_toi
(season, game, force_overwrite=False)¶ Parses TOI from json for this game
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns: nothing
-
scrapenhl2.scrape.parse_toi.
parse_game_toi_from_html
(season, game, force_overwrite=False)¶ Parses TOI from the html shift log from this game.
Parameters: - season – int, the season
- game – int, the game
- force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns: nothing
-
scrapenhl2.scrape.parse_toi.
parse_season_toi
(season, force_overwrite=False)¶ Parses toi from the given season. Final games covered only.
Parameters: - season – int, the season
- force_overwrite – bool. If true, parses all games. If false, only previously unparsed ones
Returns: nothing
-
scrapenhl2.scrape.parse_toi.
parse_toi_setup
()¶ Creates parsed toi folders if need be
Returns: nothing
-
scrapenhl2.scrape.parse_toi.
read_shifts_from_html_pages
(rawtoi1, rawtoi2, teamid1, teamid2, season, game)¶ Aggregates information from two html pages given into a dataframe with one row per second and one col per player.
Parameters: - rawtoi1 – str, html page of shift log for teamid1
- rawtoi2 – str, html page of shift log for teamid2
- teamid1 – int, team id corresponding to rawtoi1
- teamid2 – int, team id corresponding to rawtoi2
- season – int, the season
- game – int, the game
Returns: dataframe
-
scrapenhl2.scrape.parse_toi.
read_shifts_from_page
(rawtoi, season, game)¶ Turns JSON shift start-ends into TOI matrix with one row per second and one col per player
Parameters: - rawtoi – dict, json from NHL API
- season – int, the season
- game – int, the game
Returns: dataframe
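Turning shift start-ends into a matrix with one row per second and one column per player can be sketched in plain Python as below. This is not the library's code. The input format (tuples of player ID, start second, end second) is hypothetical, and so is the convention that a shift from start to end covers seconds start+1 through end.

```python
def shifts_to_matrix(shifts, game_length):
    """Build one row per second with a 0/1 flag per player.

    shifts: iterable of (player_id, start_second, end_second) tuples.
    A shift is assumed to cover seconds start+1 .. end inclusive.
    """
    players = sorted({player for player, _, _ in shifts})
    rows = []
    for second in range(1, game_length + 1):
        row = {"Time": second}
        row.update({player: 0 for player in players})
        rows.append(row)
    for player, start, end in shifts:
        for second in range(start + 1, min(end, game_length) + 1):
            rows[second - 1][player] = 1
    return rows


# Usage with placeholder player IDs 1 and 2 over a 5-second "game"
matrix = shifts_to_matrix([(1, 0, 3), (2, 2, 5), (1, 4, 5)], game_length=5)
```

Summing a player's column gives that player's time on ice in seconds, which is the main reason a per-second layout is convenient.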
-
scrapenhl2.scrape.parse_toi.
save_parsed_toi
(toi, season, game)¶ Saves the pandas dataframe containing shift information to disk as an HDF5 file.
Parameters: - toi – df, a pandas dataframe with the shifts of the game
- season – int, the season
- game – int, the game
Returns: nothing
Team information¶
This module contains methods related to the team info files.
-
scrapenhl2.scrape.team_info.
add_team_to_info_file
(teamid)¶ In case we come across teams that are not in the default list (1-110), use this method to add them to the file.
Parameters: teamid – int, the team ID
Returns: (tid, tabbrev, tname)
-
scrapenhl2.scrape.team_info.
generate_team_ids_file
(teamids=None)¶ Reads all team id URLs and stores information to disk. Has the following information:
- ID: int
- Abbreviation: str (three letters)
- Name: str (full name)
Parameters: teamids – iterable of int, the team IDs to look up. If None (default), tries IDs 1 through 110.
Returns: nothing
-
scrapenhl2.scrape.team_info.
get_team_colordict
()¶ Get the team color dictionary
Returns: a dictionary of IDs to tuples of hex colors
-
scrapenhl2.scrape.team_info.
get_team_colors
(team)¶ Returns primary and secondary color for this team.
Parameters: team – str or int, the team
Returns: tuple of hex colors
-
scrapenhl2.scrape.team_info.
get_team_info_file
()¶ Returns the team information dataframe from memory. This is stored as a feather file for fast read/write.
Returns: dataframe from /scrape/data/other/TEAM_INFO.feather
-
scrapenhl2.scrape.team_info.
get_team_info_filename
()¶ Returns the team information filename
Returns: str, /scrape/data/other/TEAM_INFO.feather
-
scrapenhl2.scrape.team_info.
get_team_info_from_url
(teamid)¶ Pulls ID, abbreviation, and name from the NHL API.
Parameters: teamid – int, the team ID
Returns: (id, abbrev, name)
-
scrapenhl2.scrape.team_info.
get_team_info_url
(teamid)¶ Gets the team url from the NHL API.
Parameters: teamid – int, the team ID
Returns: str, http://statsapi.web.nhl.com/api/v1/teams/[teamid]
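As a small illustration of the URL pattern above and the default ID range mentioned for generate_team_ids_file, the sketch below builds the team URLs without making any network request. The function names are hypothetical, and reading "1-110" as inclusive of both endpoints is an assumption.

```python
def team_info_url(teamid):
    """Team endpoint for a single team ID."""
    return "http://statsapi.web.nhl.com/api/v1/teams/{0:d}".format(teamid)


def default_team_ids():
    """IDs 1 through 110, assuming the documented range is inclusive."""
    return range(1, 111)


urls = [team_info_url(teamid) for teamid in default_team_ids()]
```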
-
scrapenhl2.scrape.team_info.
team_as_id
¶ A helper method. If team entered is int, returns that. If team is str, returns integer id of that team.
Parameters: team – int or str
Returns: int, the team ID
-
scrapenhl2.scrape.team_info.
team_as_str
¶ A helper method. If team entered is str, returns that. If team is int, returns string name of that team.
Parameters: - team – int or str
- abbreviation – bool, whether to return 3-letter abbreviation or full name
Returns: str, the team name
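The pass-through behavior of these two helpers (return the input unchanged if it is already the requested type, otherwise look it up) can be sketched as follows. The small TEAMS table is illustrative sample data standing in for the team info file, the function names carry a _sketch suffix to mark them as hypothetical, and the default of abbreviation=True is an assumption, since the signature is not shown above.

```python
# Illustrative sample rows: (ID, Abbreviation, Name)
TEAMS = [
    (5, "PIT", "Pittsburgh Penguins"),
    (15, "WSH", "Washington Capitals"),
]


def team_as_id_sketch(team):
    """If team is an int, return it; if str, look up its integer ID."""
    if isinstance(team, int):
        return team
    for teamid, abbrev, name in TEAMS:
        if team in (abbrev, name):
            return teamid
    raise ValueError("Unknown team: {0!r}".format(team))


def team_as_str_sketch(team, abbreviation=True):
    """If team is a str, return it; if int, return its abbreviation or full name."""
    if isinstance(team, str):
        return team
    for teamid, abbrev, name in TEAMS:
        if team == teamid:
            return abbrev if abbreviation else name
    raise ValueError("Unknown team: {0!r}".format(team))
```

Helpers like these let other functions accept either form of team identifier and normalize it once at the top.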
-
scrapenhl2.scrape.team_info.
team_setup
()¶ This method loads the team info df into memory
Returns: nothing
-
scrapenhl2.scrape.team_info.
write_team_info_file
(df)¶ Writes the team information file. This is stored as a feather file for fast read/write.
Parameters: df – the (team information) dataframe to write to file
Returns: nothing
Teams¶
This module contains methods related to team logs.
-
scrapenhl2.scrape.teams.
get_team_pbp
(season, team)¶ Returns the pbp of given team in given season across all games.
Parameters: - season – int, the season
- team – int or str, the team abbreviation.
Returns: df, the pbp of given team in given season
-
scrapenhl2.scrape.teams.
get_team_pbp_filename
(season, team)¶ Returns filename of the PBP log for this team and season
Parameters: - season – int, the season
- team – int or str, the team abbreviation.
Returns: str, the filename
-
scrapenhl2.scrape.teams.
get_team_toi
(season, team)¶ Returns the toi of given team in given season across all games.
Parameters: - season – int, the season
- team – int or str, the team abbreviation.
Returns: df, the toi of given team in given season
-
scrapenhl2.scrape.teams.
get_team_toi_filename
(season, team)¶ Returns filename of the TOI log for this team and season
Parameters: - season – int, the season
- team – int or str, the team abbreviation.
Returns: str, the filename
-
scrapenhl2.scrape.teams.
team_setup
()¶ Creates team log-related folders.
Returns: nothing
-
scrapenhl2.scrape.teams.
update_team_logs
(season, force_overwrite=False, force_games=None)¶ This method looks at the schedule for the given season and writes pbp for scraped games to file. It also adds the strength at each pbp event to the log. It only includes games that have both PBP and TOI.
Parameters: - season – int, the season
- force_overwrite – bool, whether to generate from scratch
- force_games – None or iterable of games to force_overwrite specifically
Returns: nothing
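Two ideas in the description above, keeping only games that have both PBP and TOI and attaching a strength to each PBP event, can be sketched with plain Python structures. The data layout is an assumption made for illustration: events are dicts with a Time key, and skater counts are looked up from a dict keyed by time. The "5v4"-style strength string and the function names are hypothetical.

```python
def games_with_both(pbp_games, toi_games):
    """Games present in both collections, in ascending order."""
    return sorted(set(pbp_games) & set(toi_games))


def add_strength(events, skaters_by_time):
    """Return copies of events with a Strength entry such as '5v4'.

    skaters_by_time maps Time -> (team_skaters, opponent_skaters).
    Events with no matching time get Strength None.
    """
    result = []
    for event in events:
        counts = skaters_by_time.get(event["Time"])
        strength = None if counts is None else "{0:d}v{1:d}".format(*counts)
        enriched = dict(event)
        enriched["Strength"] = strength
        result.append(enriched)
    return result
```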
-
scrapenhl2.scrape.teams.
write_team_pbp
(pbp, season, team)¶ Writes the given pbp dataframe to file.
Parameters: - pbp – df, the pbp of given team in given season
- season – int, the season
- team – int or str, the team abbreviation.
Returns: nothing
-
scrapenhl2.scrape.teams.
write_team_toi
(toi, season, team)¶ Writes team TOI log to file
Parameters: - toi – df, team toi for this season
- season – int, the season
- team – int or str, the team abbreviation.
Returns: nothing