Scrape

The scrapenhl2.scrape module contains methods useful for scraping.

Useful examples

Updating data:

from scrapenhl2.scrape import autoupdate
autoupdate.autoupdate()

Get the season schedule:

from scrapenhl2.scrape import schedules
schedules.get_season_schedule(2017)

Convert between player ID and player name:

from scrapenhl2.scrape import players
pname = 'Alex Ovechkin'
players.player_as_id(pname)

pid = 8471214
players.player_as_str(pid)

There’s much more, and feel free to submit pull requests with whatever you find useful.

Methods

The functions in these modules are organized pretty logically under the module names.

Autoupdate

This module contains methods for automatically scraping and parsing games.

scrapenhl2.scrape.autoupdate.autoupdate(season=None)

Run this method to update local data. It reads the schedule file for given season and scrapes and parses previously unscraped games that have gone final or are in progress. Use this for 2010 or later.

Parameters:season – int, the season. If None (default), will do current season
Returns:nothing
scrapenhl2.scrape.autoupdate.delete_game_html(season, game)

Deletes html files. HTML files are used for live game charts, but deleted in favor of JSONs when games go final.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.autoupdate.read_final_games(games, season)
Parameters:
  • games
  • season
Returns:

scrapenhl2.scrape.autoupdate.read_inprogress_games(inprogressgames, season)

Saves these games to file via html (for toi) and json (for pbp)

Parameters:inprogressgames – list of int
Returns:

Events

This module contains methods related to PBP events.

scrapenhl2.scrape.events.convert_event(event)

Converts to a more convenient, standardized name (see get_event_dictionary)

Parameters:event – str, the event name
Returns:str, shortened event name
scrapenhl2.scrape.events.event_setup()

Loads event dictionary into memory

Returns:nothing
scrapenhl2.scrape.events.get_event_dictionary()

Returns the abbreviation: long name event mapping (in lowercase)

Returns:dict of str:str
scrapenhl2.scrape.events.get_event_longname

A method for translating event abbreviations to full names (for pbp matching)

Parameters:eventname – str, the event name
Returns:the non-abbreviated event name

Games

This module contains methods related to scraping games.

scrapenhl2.scrape.games.find_recent_games(team1, team2=None, limit=1)

A convenience function that lists the most recent in progress or final games for specified team(s)

Parameters:
  • team1 – str, a team
  • team2 – str, a team (optional)
  • limit – How many games to return
Returns:

df with relevant rows

scrapenhl2.scrape.games.get_player_5v5_log_filename(season)

Gets the filename for the season’s player log file. Includes 5v5 CF, CA, TOI, and more.

Parameters:season – int, the season
Returns:str, /scrape/data/other/[season]_player_log.feather
scrapenhl2.scrape.games.most_recent_game_id(team1, team2)

A convenience function to get the most recent game (this season) between two teams.

Parameters:
  • team1 – str, a team
  • team2 – str, a team
Returns:

int, a game number

General helpers

This module contains general helper methods. None of these methods have dependencies on other scrapenhl2 modules.

scrapenhl2.scrape.general_helpers.add_sim_scores(df, name)

Adds fuzzywuzzy’s token set similarity scores to provided dataframe

Parameters:
  • df – pandas dataframe with column Name
  • name – str, name to compare to
Returns:

df with an additional column SimScore

scrapenhl2.scrape.general_helpers.anti_join(df1, df2, **kwargs)

Anti-joins two dataframes.

Parameters:
  • df1 – dataframe
  • df2 – dataframe
  • kwargs – keyword arguments as passed to pd.DataFrame.merge (except for ‘how’). Specifically, need join keys.
Returns:

dataframe

scrapenhl2.scrape.general_helpers.check_number(obj)

A helper method to check if obj is int, float, np.int64, etc. This is frequently needed, so is helpful.

Parameters:obj – the object to check the type
Returns:bool
scrapenhl2.scrape.general_helpers.check_number_last_first_format(name)

Checks if specified name looks like “8 Ovechkin, Alex”

Parameters:name – str
Returns:bool
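
A rough sketch of how such a format check could look, using a regular expression. The pattern and the function name below are assumptions made for illustration; this is not the library's implementation.

```python
import re

# Hypothetical pattern: leading digits, whitespace, then "Last, First".
_NUMBER_LAST_FIRST = re.compile(r"^\d+\s+[^,]+,\s*\S.*$")


def looks_like_number_last_first(name):
    """Return True for strings shaped like '8 Ovechkin, Alex'."""
    return bool(_NUMBER_LAST_FIRST.match(name))


print(looks_like_number_last_first("8 Ovechkin, Alex"))
print(looks_like_number_last_first("Alex Ovechkin"))
```
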
scrapenhl2.scrape.general_helpers.check_types(obj)

A helper method to check if obj is int, float, np.int64, or str. This is frequently needed, so is helpful.

Parameters:obj – the object to check the type
Returns:bool
scrapenhl2.scrape.general_helpers.fill_join(df1, df2, **kwargs)

Uses data from df2 to fill in missing values from df1. Helpful when you have to join using multiple data sources. Preserves data order. Won’t work when joining introduces duplicates.

Parameters:
  • df1 – dataframe
  • df2 – dataframe
  • kwargs – keyword arguments as passed to pd.DataFrame.merge (except for ‘how’ and ‘suffixes’)
Returns:

dataframe

scrapenhl2.scrape.general_helpers.flip_first_last(name)

Changes Ovechkin, Alex to Alex Ovechkin. Also changes to title case.

Parameters:name – str
Returns:str, flipped if applicable
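
The flip described above can be sketched in a few lines. This is an illustrative stand-in, not the library's code; it assumes that names without a comma are returned unchanged and that title case is only applied when a flip happens.

```python
def flip_first_last_sketch(name):
    """Turn 'Last, First' into 'First Last' in title case (assumed behavior)."""
    if "," not in name:
        # Assumption: nothing to flip, so return the input as-is.
        return name
    last, first = name.split(",", 1)
    return f"{first.strip()} {last.strip()}".title()


print(flip_first_last_sketch("OVECHKIN, ALEX"))
```
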
scrapenhl2.scrape.general_helpers.fuzzy_match_player(name_provided, names, minimum_similarity=50)

This method checks similarity between each entry in names and the name_provided using token set matching and returns the entry that matches best. Returns None if no similarity is greater than minimum_similarity. (See e.g. http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)

Parameters:
  • name_provided – str, name to look for
  • names – list (or ndarray, or similar) of str, the candidate names to search
  • minimum_similarity – int from 0 to 100, minimum similarity. If all are below this, returns None.
Returns:

str, string in names that best matches name_provided
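
To illustrate the best-match-above-a-threshold idea without third-party packages, here is a sketch that swaps fuzzywuzzy's token set scoring for the standard library's difflib.SequenceMatcher. The scores will differ from fuzzywuzzy's, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def best_match_sketch(name_provided, names, minimum_similarity=50):
    """Return the entry in names most similar to name_provided, or None."""
    best_name, best_score = None, -1.0
    for candidate in names:
        # ratio() is in [0, 1]; scale to 0-100 to mirror the documented range.
        score = 100 * SequenceMatcher(
            None, name_provided.lower(), candidate.lower()
        ).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score > minimum_similarity else None


print(best_match_sketch("Alex Ovechkin", ["Alexander Ovechkin", "Sidney Crosby"]))
```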

scrapenhl2.scrape.general_helpers.get_initials(pname)

Splits name on spaces and returns first letter from each part.

Parameters:pname – str, player name
Returns:str, player initials
scrapenhl2.scrape.general_helpers.get_lastname(pname)

Splits name on first space and returns second part.

Parameters:pname – str, player name
Returns:str, player last name
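
Both name helpers reduce to simple string splitting. The sketch below follows the descriptions above; how single-word names are handled is an assumption, and the function names are illustrative.

```python
def get_initials_sketch(pname):
    """First letter of each space-separated part of the name."""
    return "".join(part[0] for part in pname.split())


def get_lastname_sketch(pname):
    """Everything after the first space."""
    parts = pname.split(" ", 1)
    # Assumption: a name with no space is returned whole.
    return parts[1] if len(parts) > 1 else pname


print(get_initials_sketch("Alex Ovechkin"), get_lastname_sketch("James van Riemsdyk"))
```
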
scrapenhl2.scrape.general_helpers.infer_season_from_date

Looks at a date and infers the season based on that: Year-1 if month is Aug or before; returns year otherwise.

Parameters:date – str, YYYY-MM-DD
Returns:int, the season. 2007-08 would be 2007.
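
The rule above (August or earlier belongs to the season that started the previous year) can be written as a minimal sketch. The function name is illustrative and the input is assumed to be a well-formed YYYY-MM-DD string.

```python
def infer_season_sketch(date):
    """Return the season start year for a YYYY-MM-DD date string."""
    year, month, _day = (int(part) for part in date.split("-"))
    return year - 1 if month <= 8 else year


print(infer_season_sketch("2008-01-15"))
```
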
scrapenhl2.scrape.general_helpers.intervals(lst, interval_pct=10)

A method that divides list into intervals and returns tuples indicating each interval mark. Useful for giving updates when cycling through games.

Parameters:
  • lst – lst to divide
  • interval_pct – int, pct for each interval to represent. e.g. 10 means it will mark every 10%.
Returns:

a list of tuples of (index, value)
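
One possible reading of "interval marks" is sketched below: for every multiple of interval_pct, record the index and value of the last element covered by that share of the list. The exact indices the library picks are not documented here, so treat this as an assumption.

```python
def interval_marks_sketch(lst, interval_pct=10):
    """Return (index, value) tuples at each interval_pct step through lst."""
    n = len(lst)
    marks = []
    last_idx = -1
    for pct in range(interval_pct, 101, interval_pct):
        idx = (n * pct) // 100 - 1
        # Skip steps that do not advance (short lists) or fall before index 0.
        if idx > last_idx:
            marks.append((idx, lst[idx]))
            last_idx = idx
    return marks


print(interval_marks_sketch(list(range(100, 120)), 25))
```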

scrapenhl2.scrape.general_helpers.log_exceptions(fn)

A decorator that wraps the passed-in function and logs any exception it raises

Parameters:fn – the function
Returns:nothing
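
A generic exception-logging decorator might look like the sketch below. It uses the standard logging module and re-raises after logging; whether the library re-raises or swallows the exception is not stated above, so that part is an assumption.

```python
import functools
import logging


def log_exceptions_sketch(fn):
    """Wrap fn so that any exception is logged with its traceback."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            logging.exception("Exception in %s", fn.__name__)
            raise  # Assumption: propagate after logging.
    return wrapper


@log_exceptions_sketch
def safe_divide(a, b):
    return a / b


print(safe_divide(6, 3))
```
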
scrapenhl2.scrape.general_helpers.melt_helper(df, **kwargs)

Earlier versions of pandas do not support pd.DataFrame.melt. This helps to bridge the gap. It first tries df.melt, and if that doesn’t work, it uses pd.melt.

Parameters:
  • df – dataframe
  • kwargs – arguments to pd.melt or pd.DataFrame.melt.
Returns:

melted dataframe

scrapenhl2.scrape.general_helpers.mmss_to_secs(strtime)

Converts time from mm:ss to seconds

Parameters:strtime – str, mm:ss
Returns:int
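
The conversion is plain arithmetic: minutes times 60 plus seconds. A minimal sketch, with an illustrative function name:

```python
def mmss_to_secs_sketch(strtime):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = strtime.split(":")
    return int(minutes) * 60 + int(seconds)


print(mmss_to_secs_sketch("1:30"))
```
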
scrapenhl2.scrape.general_helpers.once_per_second(fn, calls_per_second=1)

A decorator that sleeps for one second after executing the function. Used when scraping NHL site. This also means all functions that access the internet sleep for a second.

Parameters:fn – the function
Returns:nothing
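
A throttling decorator of this kind can be sketched as follows. How calls_per_second is used is not documented above; this sketch assumes the pause is 1 / calls_per_second seconds, which gives one second with the default.

```python
import functools
import time


def once_per_second_sketch(fn, calls_per_second=1):
    """Wrap fn so that each call is followed by a pause."""
    delay = 1.0 / calls_per_second  # Assumed meaning of calls_per_second.

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        time.sleep(delay)
        return result
    return wrapper
```

Applied to a function that fetches a page, every call then takes at least the configured pause, which keeps request rates polite.
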
scrapenhl2.scrape.general_helpers.period_contribution(x)

Turns period (1, 2, 3, OT, etc.) into the number of seconds elapsed in the game until the start of that period.

Parameters:x – str or int, 1, 2, 3, etc
Returns:int, number of seconds elapsed until start of specified period
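
As a worked example of the arithmetic, the sketch below assumes 20-minute (1200-second) periods and treats "OT" as the fourth period. Shootouts and other labels are left out because their handling is not described here.

```python
def period_start_secs_sketch(x):
    """Seconds elapsed in the game at the start of period x."""
    period = 4 if x == "OT" else int(x)  # Assumption: OT follows period 3.
    return (period - 1) * 1200


print(period_start_secs_sketch(2))
```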

scrapenhl2.scrape.general_helpers.print_and_log(message, level='info', print_and_log=True)

A helper method that prints message to console and also writes to log with specified level.

Parameters:
  • message – str, the message
  • level – str, the level of log: info, warn, error, critical
  • print_and_log – bool. If False, logs only.
Returns:

nothing

scrapenhl2.scrape.general_helpers.remove_leading_number(string)

Will convert 8 Alex Ovechkin to Alex Ovechkin, or Alex Ovechkin to Alex Ovechkin

Parameters:string – a string
Returns:string without leading numbers
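
Stripping a leading jersey number is a one-line regular expression. A minimal sketch (illustrative name, not the library's code):

```python
import re


def remove_leading_number_sketch(string):
    """Drop leading digits and the whitespace that follows them."""
    return re.sub(r"^\d+\s*", "", string)


print(remove_leading_number_sketch("8 Alex Ovechkin"))
```
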
scrapenhl2.scrape.general_helpers.start_logging()

Clears out logging folder, and starts the log in this folder

scrapenhl2.scrape.general_helpers.try_to_access_dict(base_dct, *keys, **kwargs)

A helper method that accesses base_dct using keys, one-by-one. Returns None if a key does not exist.

Parameters:
  • base_dct – dict, a dictionary
  • keys – str, int, or other valid dict keys
  • kwargs – can specify default using kwarg default_return=0, for example.
Returns:

obj, base_dct[key1][key2][key3]… or None if a key is not in the dictionary
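
The nested lookup with a fallback can be sketched like this. The default_return keyword comes from the description above; the set of exceptions caught is an assumption.

```python
def try_to_access_dict_sketch(base_dct, *keys, **kwargs):
    """Follow keys into nested containers, returning a default on a miss."""
    default = kwargs.get("default_return", None)
    current = base_dct
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


game = {"teams": {"home": {"score": 3}}}
print(try_to_access_dict_sketch(game, "teams", "home", "score"))
print(try_to_access_dict_sketch(game, "teams", "away", "score", default_return=0))
```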

scrapenhl2.scrape.general_helpers.try_url_n_times(url, timeout=5, n=5)

A helper method that tries to access given url up to n times (five by default), returning the page.

Parameters:
  • url – str, the url to access
  • timeout – int, number of secs to wait before timeout. Default 5.
  • n – int, the max number of tries. Default 5.
Returns:

bytes
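
A retry loop of this shape can be sketched with the standard library's urllib. The opener parameter is hypothetical, added so the loop can be exercised without network access, and returning None after the last failed attempt is an assumption.

```python
import urllib.request


def try_url_n_times_sketch(url, timeout=5, n=5, opener=urllib.request.urlopen):
    """Try to read url up to n times; return bytes, or None if all fail."""
    for _attempt in range(n):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError:
            # URLError and socket timeouts are OSError subclasses; retry.
            continue
    return None
```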

Organization

This module contains paths to folders.

scrapenhl2.scrape.organization.check_create_folder(*args)

A helper method to create a folder if it doesn’t exist already

Parameters:args – list of str, the parts of the filepath. These are joined together with the base directory
Returns:nothing
scrapenhl2.scrape.organization.get_base_dir()

Returns the base directory of this package (one directory up from this file)

Returns:str, the base directory
scrapenhl2.scrape.organization.get_other_data_folder()

Returns the folder containing other data

Returns:str, /scrape/data/other/
scrapenhl2.scrape.organization.get_parsed_data_folder()

Returns the folder containing parsed data

Returns:str, /scrape/data/parsed/
scrapenhl2.scrape.organization.get_raw_data_folder()

Returns the folder containing raw data

Returns:str, /scrape/data/raw/
scrapenhl2.scrape.organization.get_season_parsed_pbp_folder(season)

Returns the folder containing parsed pbp for given season

Parameters:season – int, current season
Returns:str, /scrape/data/parsed/pbp/[season]/
scrapenhl2.scrape.organization.get_season_parsed_toi_folder(season)

Returns the folder containing parsed toi for given season

Parameters:season – int, current season
Returns:str, /scrape/data/parsed/toi/[season]/
scrapenhl2.scrape.organization.get_season_raw_pbp_folder(season)

Returns the folder containing raw pbp for given season

Parameters:season – int, current season
Returns:str, /scrape/data/raw/pbp/[season]/
scrapenhl2.scrape.organization.get_season_raw_toi_folder(season)

Returns the folder containing raw toi for given season

Parameters:season – int, current season
Returns:str, /scrape/data/raw/toi/[season]/
scrapenhl2.scrape.organization.get_season_team_pbp_folder(season)

Returns the folder containing team pbp logs for given season

Parameters:season – int, current season
Returns:str, /scrape/data/teams/pbp/[season]/
scrapenhl2.scrape.organization.get_season_team_toi_folder(season)

Returns the folder containing team toi logs for given season

Parameters:season – int, current season
Returns:str, /scrape/data/teams/toi/[season]/
scrapenhl2.scrape.organization.get_team_data_folder()

Returns the folder containing team log data

Returns:str, /scrape/data/teams/
scrapenhl2.scrape.organization.organization_setup()

Creates other folder if need be

Returns:nothing

Players

This module contains methods related to individual player info.

scrapenhl2.scrape.players.check_default_player_id(playername)

E.g. For Mike Green, I should automatically assume we mean 8471242 (WSH/DET), not 8468436. Returns None if not in dict. Ideally improve code so this isn’t needed.

Parameters:playername – str
Returns:int, or None
scrapenhl2.scrape.players.generate_player_ids_file()

Creates a dataframe with these columns:

  • ID: int, player ID
  • Name: str, player name
  • DOB: str, date of birth
  • Hand: char, R or L
  • Pos: char, one of C/R/L/D/G

It will be populated with Alex Ovechkin to start.

Returns:nothing

scrapenhl2.scrape.players.generate_player_log_file()

Run this when no player log file exists already. This is for getting the datatypes right. Adds Alex Ovechkin in Game 1 vs Pittsburgh in 2016-2017.

Returns:nothing
scrapenhl2.scrape.players.get_player_handedness

Retrieves handedness of player

Parameters:player – str or int, the player name or ID
Returns:str, player hand (L or R)
scrapenhl2.scrape.players.get_player_ids_file()

Returns the player information file. This is stored as a feather file for fast read/write.

Returns:/scrape/data/other/PLAYER_INFO.feather
scrapenhl2.scrape.players.get_player_info_from_url(playerid)

Gets ID, Name, Hand, Pos, DOB, Height, Weight, and Nationality from the NHL API.

Parameters:playerid – int, the player id
Returns:dict with player ID, name, handedness, position, etc
scrapenhl2.scrape.players.get_player_log_file()

Returns the player log file from memory.

Returns:dataframe, the log
scrapenhl2.scrape.players.get_player_log_filename()

Returns the player log filename.

Returns:str, /scrape/data/other/PLAYER_LOG.feather
scrapenhl2.scrape.players.get_player_position

Retrieves position of player

Parameters:player – str or int, the player name or ID
Returns:str, player position (e.g. C, D, R, L, G)
scrapenhl2.scrape.players.get_player_url(playerid)

Gets the url for a page containing information for specified player from NHL API.

Parameters:playerid – int, the player ID
Returns:str, https://statsapi.web.nhl.com/api/v1/people/[playerid]
scrapenhl2.scrape.players.player_as_id

A helper method. If player entered is int, returns that. If player is str, returns integer id of that player.

Parameters:
  • playername – int, or str, the player whose ID you want to retrieve
  • filterids – a tuple of players to choose from. Needs to be tuple else caching won’t work.
  • dob – yyyy-mm-dd, use to help when multiple players have the same name
Returns:

int, the player ID

scrapenhl2.scrape.players.player_as_str

A helper method. If player is int, returns string name of that player. Else returns standardized name.

Parameters:
  • playerid – int, or str, player whose name you want to retrieve
  • filterids – a tuple of players to choose from. Needs to be tuple else caching won’t work. Probably not needed but you can use this method to go from part of the name to full name, in which case it may be helpful.
Returns:

str, the player name

scrapenhl2.scrape.players.player_setup()

Loads player info file into memory.

Returns:nothing
scrapenhl2.scrape.players.playerlst_as_id(playerlst, exact=False, filterdf=None)

Similar to player_as_id, but less robust against errors, and works on a list of players.

Parameters:
  • playerlst – a list of int, or str, players whose IDs you want to retrieve.
  • exact – bool. If True, looks for exact matches. If False, does not, using player_as_id (but will be slower)
  • filterdf – df, a dataframe of players to choose from. Defaults to all.
Returns:

a list of int/float

scrapenhl2.scrape.players.playerlst_as_str(players, filterdf=None)

Similar to player_as_str, but less robust against errors, and works on a list of players

Parameters:
  • players – a list of int, or str, players whose names you want to retrieve
  • filterdf – df, a dataframe of players to choose from. Defaults to all.
Returns:

a list of str

scrapenhl2.scrape.players.rescrape_player(playerid)

If you notice that a player name, position, etc, is outdated, call this method on their ID. It will re-scrape their data from the NHL API.

Parameters:playerid – int, their ID. Also accepts str, their name.
Returns:nothing
scrapenhl2.scrape.players.update_player_ids_file(playerids, force_overwrite=False)

Adds these entries to player IDs file if need be.

Parameters:
  • playerids – a list of IDs
  • force_overwrite – bool. If True, will re-scrape data for all player ids. If False, only new ones.
Returns:

nothing

scrapenhl2.scrape.players.update_player_ids_from_page(pbp)

Reads the list of players listed in the game file and adds to the player IDs file if they are not there already.

Parameters:pbp – json, the raw pbp
Returns:nothing
scrapenhl2.scrape.players.update_player_log_file(playerids, seasons, games, teams, statuses)

Updates the player log file with given players. The player log file notes which players played in which games and whether they were scratched or played.

Parameters:
  • playerids – int or str or list of int
  • seasons – int, the season, or list of int the same length as playerids
  • games – int, the game, or list of int the same length as playerids
  • teams – str or int, the team, or list of int the same length as playerids
  • statuses – str, or list of str the same length as playerids
Returns:

nothing

scrapenhl2.scrape.players.update_player_logs_from_page(pbp, season, game)

Takes the game play by play and adds players to the master player log file, noting that they were on the roster for this game, which team they played for, and their status (P for played, S for scratch).

Parameters:
  • season – int, the season
  • game – int, the game
  • pbp – json, the pbp of the game
Returns:

nothing

scrapenhl2.scrape.players.write_player_ids_file(df)

Writes the given dataframe to disk as the player ids mapping.

Parameters:df – pandas dataframe, player ids file
Returns:nothing
scrapenhl2.scrape.players.write_player_log_file(df)

Writes the given dataframe to disk as the player log file

Parameters:df – pandas dataframe
Returns:nothing

Schedules

This module contains methods related to season schedules.

scrapenhl2.scrape.schedules.attach_game_dates_to_dateframe(df)

Takes dataframe with Season and Game columns and adds a Date column (for that game)

Parameters:df – dataframe
Returns:dataframe with one more column
scrapenhl2.scrape.schedules.generate_season_schedule_file(season, force_overwrite=True)

Reads season schedule from NHL API and writes to file.

The output contains the following columns:

  • Season: int, the season
  • Date: str, the dates
  • Game: int, the game id
  • Type: str, the game type (for preseason vs regular season, etc)
  • Status: str, e.g. Final
  • Road: int, the road team ID
  • RoadScore: int, number of road team goals
  • RoadCoach: str, ‘N/A’ when this function is run (edited later with road coach name)
  • Home: int, the home team ID
  • HomeScore: int, number of home team goals
  • HomeCoach: str, ‘N/A’ when this function is run (edited later with home coach name)
  • Venue: str, the name of the arena
  • Result: str, ‘N/A’ when this function is run (edited accordingly later from PoV of home team: W, OTW, SOL, etc)
  • PBPStatus: str, ‘Not scraped’ when this function is run (edited accordingly later)
  • TOIStatus: str, ‘Not scraped’ when this function is run (edited accordingly later)
Parameters:
  • season – int, the season
  • force_overwrite – bool. If True, generates entire file from scratch. If False, only redoes when not Final previously.
Returns:

Nothing

scrapenhl2.scrape.schedules.get_current_season()

Returns the current season.

Returns:The current season variable (generated at import from _get_current_season)
scrapenhl2.scrape.schedules.get_game_data_from_schedule

This is a helper method that uses the schedule file to isolate information for current game (e.g. teams involved, coaches, venue, score, etc.)

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

dict of game data

scrapenhl2.scrape.schedules.get_game_date(season, game)

Returns the date of this game

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str

scrapenhl2.scrape.schedules.get_game_result(season, game)

Returns the result of this game for home team (e.g. W, SOL)

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, the result

scrapenhl2.scrape.schedules.get_game_status(season, game)

Returns the status of this game (e.g. Final, In Progress)

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, the status

scrapenhl2.scrape.schedules.get_home_score(season, game)

Returns the home score from this game

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

int, the score

scrapenhl2.scrape.schedules.get_home_team(season, game, returntype='id')

Returns the home team from this game

Parameters:
  • season – int, the season
  • game – int, the game
  • returntype – str, ‘id’ or ‘name’
Returns:

float or str, depending on returntype

scrapenhl2.scrape.schedules.get_road_score(season, game)

Returns the road score from this game

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

int, the score

scrapenhl2.scrape.schedules.get_road_team(season, game, returntype='id')

Returns the road team from this game

Parameters:
  • season – int, the season
  • game – int, the game
  • returntype – str, ‘id’ or ‘name’
Returns:

float or str, depending on returntype

scrapenhl2.scrape.schedules.get_season_schedule(season)

Gets the season’s schedule file from memory.

Parameters:season – int, the season
Returns:dataframe (originally from /scrape/data/other/[season]_schedule.feather)
scrapenhl2.scrape.schedules.get_season_schedule_filename(season)

Gets the filename for the season’s schedule file

Parameters:season – int, the season
Returns:str, /scrape/data/other/[season]_schedule.feather
scrapenhl2.scrape.schedules.get_season_schedule_url(season)

Gets the url for a page containing all of this season’s games (Sep 1 to Jun 25) from NHL API.

Parameters:season – int, the season
Returns:str, https://statsapi.web.nhl.com/api/v1/schedule?startDate=[season]-09-01&endDate=[season+1]-06-25
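
The URL template above can be filled in with simple string formatting. A minimal sketch, assuming season is the starting year as an int; the function name is illustrative.

```python
def season_schedule_url_sketch(season):
    """Build the schedule URL following the template documented above."""
    return (
        "https://statsapi.web.nhl.com/api/v1/schedule"
        f"?startDate={season}-09-01&endDate={season + 1}-06-25"
    )


print(season_schedule_url_sketch(2017))
```
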
scrapenhl2.scrape.schedules.get_team_games(season=None, team=None, startdate=None, enddate=None)

Returns list of games played by team in season.

Just calls get_team_schedule with the provided arguments, returning the series of games from that dataframe.

Parameters:
  • season – int, the season
  • team – int or str, the team
  • startdate – str or None
  • enddate – str or None
Returns:

series of games

scrapenhl2.scrape.schedules.get_team_schedule(season=None, team=None, startdate=None, enddate=None)

Gets the schedule for given team in given season. Or if startdate and enddate are specified, searches between those dates. If season and startdate (and/or enddate) are specified, searches that season between those dates.

Parameters:
  • season – int, the season
  • team – int or str, the team
  • startdate – str, YYYY-MM-DD
  • enddate – str, YYYY-MM-DD
Returns:

dataframe

scrapenhl2.scrape.schedules.get_teams_in_season(season)

Returns all teams that have a game in the schedule for this season

Parameters:season – int, the season
Returns:set of team IDs
scrapenhl2.scrape.schedules.schedule_setup()

Reads current season and schedules into memory.

Returns:nothing
scrapenhl2.scrape.schedules.write_season_schedule(df, season, force_overwrite)

A helper method that writes the season schedule file to disk (in feather format for fast read/write)

Parameters:
  • df – the season schedule dataframe
  • season – the season
  • force_overwrite – bool. If True, overwrites entire file. If False, only redoes when not Final previously.
Returns:

Nothing

Manipulate schedules

This module contains methods related to generating and manipulating schedules.

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_coaches(pbp, season, game)

Uses the PbP to update coach info for this game.

Parameters:
  • pbp – json, the pbp for this game
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_pbp_scrape(season, game)

Updates the schedule file saying that specified game’s pbp has been scraped.

Parameters:
  • season – int, the season
  • game – int, the game, or list of ints
Returns:

updated schedule

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_result(season, game, result)

Updates the season schedule file with game result (which are listed ‘N/A’ at schedule generation)

Parameters:
  • season – int, the season
  • game – int, the game
  • result – str, the result from home team perspective
Returns:

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_result_using_pbp(pbp, season, game)

Uses the PbP to update results for this game.

Parameters:
  • pbp – json, the pbp for this game
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_toi_scrape(season, game)

Updates the schedule file saying that specified game’s toi has been scraped.

Parameters:
  • season – int, the season
  • game – int, the game, or list of int
Returns:

nothing

Scrape play by play

This module contains methods for scraping pbp.

scrapenhl2.scrape.scrape_pbp.get_game_from_url(season, game)

Gets the page containing information for specified game from NHL API.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, the page at the url

scrapenhl2.scrape.scrape_pbp.get_game_pbplog_filename(season, game)

Returns the filename of the raw html pbp for this game

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/raw/pbp/[season]/[game].html

scrapenhl2.scrape.scrape_pbp.get_game_pbplog_url(season, game)

Gets the url for a page containing pbp information for specified game from HTML tables.

Parameters:
  • season – int, the season
  • game – int, the game

Returns:

str, e.g. http://www.nhl.com/scores/htmlreports/20072008/PL020001.HTM
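
The example URL suggests a pattern of the season's two years followed by "PL0" and the game number. The sketch below reproduces that example; it assumes five-digit game numbers such as 20001 and is not taken from the library's source.

```python
def pbplog_url_sketch(season, game):
    """Build an HTML play-by-play report URL matching the example above."""
    return (
        "http://www.nhl.com/scores/htmlreports/"
        f"{season}{season + 1}/PL0{game}.HTM"
    )


print(pbplog_url_sketch(2007, 20001))
```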

scrapenhl2.scrape.scrape_pbp.get_game_raw_pbp_filename(season, game)

Returns the filename of the raw pbp file

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/raw/pbp/[season]/[game].zlib

scrapenhl2.scrape.scrape_pbp.get_game_url(season, game)

Gets the url for a page containing information for specified game from NHL API.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, https://statsapi.web.nhl.com/api/v1/game/[season]0[game]/feed/live
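
Filling in the template is a matter of concatenating the season, a literal 0, and the game number. A minimal sketch with an illustrative function name, assuming both arguments are ints:

```python
def game_url_sketch(season, game):
    """Build the live-feed URL following the template documented above."""
    return f"https://statsapi.web.nhl.com/api/v1/game/{season}0{game}/feed/live"


print(game_url_sketch(2017, 20001))
```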

scrapenhl2.scrape.scrape_pbp.get_raw_html_pbp(season, game)

Loads the html file containing this game’s play by play from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, the html pbp

scrapenhl2.scrape.scrape_pbp.get_raw_pbp(season, game)

Loads the compressed json file containing this game’s play by play from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

json, the json pbp

scrapenhl2.scrape.scrape_pbp.save_raw_html_pbp(page, season, game)

Takes the bytes page containing html pbp information and saves as such

Parameters:
  • page – bytes
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.scrape_pbp.save_raw_pbp(page, season, game)

Takes the bytes page containing pbp information and saves to disk as a compressed zlib.

Parameters:
  • page – bytes. str(page) would yield a string version of the json pbp
  • season – int, the season
  • game – int, the game
Returns:

nothing
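
The save and load pair for raw pbp amounts to a zlib round trip over the page bytes. The sketch below shows that idea with the standard library only; the helper names, the file name, and the sample payload are made up for illustration, and the library's exact on-disk layout may differ.

```python
import json
import os
import tempfile
import zlib


def save_compressed_sketch(page, path):
    """Compress the raw bytes with zlib and write them to path."""
    with open(path, "wb") as handle:
        handle.write(zlib.compress(page))


def load_compressed_json_sketch(path):
    """Read path, decompress, and parse the result as JSON."""
    with open(path, "rb") as handle:
        return json.loads(zlib.decompress(handle.read()).decode("utf-8"))


# Round trip in a temporary folder with a placeholder payload.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "20001.zlib")
    save_compressed_sketch(b'{"example": [1, 2, 3]}', path)
    restored = load_compressed_json_sketch(path)

print(restored)
```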

scrapenhl2.scrape.scrape_pbp.scrape_game_pbp(season, game, force_overwrite=False)

This method scrapes the pbp for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If False and file exists already, won’t scrape again
Returns:

bool, False if not scraped, else True

scrapenhl2.scrape.scrape_pbp.scrape_game_pbp_from_html(season, game, force_overwrite=True)

This method scrapes the html pbp for the given game. Use for live games.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If False and file exists already, won’t scrape again
Returns:

bool, False if not scraped, else True

scrapenhl2.scrape.scrape_pbp.scrape_pbp_setup()

Creates raw pbp folders if need be

Returns:
scrapenhl2.scrape.scrape_pbp.scrape_season_pbp(season, force_overwrite=False)

Scrapes and parses pbp from the given season.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, rescrapes all games. If false, only previously unscraped ones
Returns:

nothing

Parse play by play

This module contains methods for parsing PBP.

scrapenhl2.scrape.parse_pbp.get_5v5_corsi_pm(season, game, cfca=None)

Returns a dataframe from home team perspective. Each row is a Corsi event, with time and note of whether it’s positive or negative for home team.

Parameters:
  • season – int, the season
  • game – int, the game
  • cfca – str, or None. If you specify ‘cf’, returns CF only. For CA, use ‘ca’. None returns CF - CA.
Returns:

dataframe with columns Time and HomeCorsi

scrapenhl2.scrape.parse_pbp.get_game_parsed_pbp_filename(season, game)

Returns the filename of the parsed pbp file

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/parsed/pbp/[season]/[game].zlib

scrapenhl2.scrape.parse_pbp.get_parsed_pbp(season, game)

Loads the compressed json file containing this game’s play by play from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

json, the json pbp

scrapenhl2.scrape.parse_pbp.parse_game_pbp(season, game, force_overwrite=False)

Reads the raw pbp from file, updates player IDs, updates player logs, and parses the JSON to a pandas DF and writes to file. Also updates team logs accordingly.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

True if parsed, False if not

scrapenhl2.scrape.parse_pbp.parse_game_pbp_from_html(season, game, force_overwrite=False)

Reads the raw pbp from file, updates player IDs, updates player logs, and parses the JSON to a pandas DF and writes to file. Also updates team logs accordingly.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

True if parsed, False if not

scrapenhl2.scrape.parse_pbp.parse_pbp_setup()

Creates parsed pbp folders if need be

Returns:nothing
scrapenhl2.scrape.parse_pbp.parse_season_pbp(season, force_overwrite=False)

Parses pbp from the given season.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, parses all games. If false, only previously unparsed ones
Returns:

nothing

scrapenhl2.scrape.parse_pbp.read_events_from_page(rawpbp, season, game)

This method takes the json pbp and returns a pandas dataframe with the following columns:

  • Index: int, index of event
  • Period: str, period of event. In regular season, could be 1, 2, 3, OT, or SO. In playoffs, 1, 2, 3, 4, 5…
  • MinSec: str, m:ss, time elapsed in period
  • Time: int, time elapsed in game
  • Event: str, the event name
  • Team: int, the team id. Note that this is switched to blocked team for blocked shots to ease Corsi calculations.
  • Actor: int, the acting player id. Switched with recipient for blocks (see above)
  • ActorRole: str, e.g. for faceoffs there is a “Winner” and “Loser”. Switched with recipient for blocks (see above)
  • Recipient: int, the receiving player id. Switched with actor for blocks (see above)
  • RecipientRole: str, e.g. for faceoffs there is a “Winner” and “Loser”. Switched with actor for blocks (see above)
  • X: int, the x coordinate of event (or NaN)
  • Y: int, the y coordinate of event (or NaN)
  • Note: str, additional notes, which may include penalty duration, assists on a goal, etc.
Parameters:
  • rawpbp – json, the raw json pbp
  • season – int, the season
  • game – int, the game
Returns:

pandas dataframe, the pbp in a nicer format
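
To illustrate how the Time column relates to Period and MinSec, here is a minimal sketch. It is not the library's code: the helper name elapsed_game_seconds is hypothetical, it assumes every earlier period lasted 20 minutes, and it only handles numeric periods (OT and SO would need separate handling).

```python
def elapsed_game_seconds(period, minsec, period_length=1200):
    """Convert a numeric period and an 'm:ss' string to seconds elapsed in the game.

    Assumption: each earlier period lasted period_length seconds (20 minutes).
    """
    minutes, seconds = minsec.split(":")
    return (int(period) - 1) * period_length + int(minutes) * 60 + int(seconds)


print(elapsed_game_seconds("1", "0:00"))   # 0
print(elapsed_game_seconds("2", "5:30"))   # 1530
print(elapsed_game_seconds("3", "19:59"))  # 3599
```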

scrapenhl2.scrape.parse_pbp.save_parsed_pbp(pbp, season, game)

Saves the pandas dataframe containing pbp information to disk as an HDF5.

Parameters:
  • pbp – df, a pandas dataframe with the pbp of the game
  • season – int, the season
  • game – int, the game
Returns:

nothing

Scrape TOI

This module contains methods for scraping TOI.

scrapenhl2.scrape.scrape_toi.get_game_raw_toi_filename(season, game)

Returns the filename of the raw toi file for this game

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, /scrape/data/raw/toi/[season]/[game].zlib

scrapenhl2.scrape.scrape_toi.get_home_shiftlog_filename(season, game)

Returns the filename of the parsed toi html home shifts

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, /scrape/data/raw/pbp/[season]/[game]H.html

scrapenhl2.scrape.scrape_toi.get_home_shiftlog_url(season, game)

Gets the url for a page containing shift information for specified game from HTML tables for home team.

Parameters:
  • season – int, the season
  • game – int, the game

Returns:

str, e.g. http://www.nhl.com/scores/htmlreports/20072008/TH020001.HTM

scrapenhl2.scrape.scrape_toi.get_raw_html_toi(season, game, homeroad)

Loads the html file containing this game’s toi from disk.

Parameters:
  • season – int, the season
  • game – int, the game
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

str, the html toi

scrapenhl2.scrape.scrape_toi.get_raw_toi(season, game)

Loads the compressed json file containing this game’s shifts from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

dict, the json shifts

scrapenhl2.scrape.scrape_toi.get_road_shiftlog_filename(season, game)

Returns the filename of the parsed toi html road shifts

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, /scrape/data/raw/pbp/[season]/[game]R.html

scrapenhl2.scrape.scrape_toi.get_road_shiftlog_url(season, game)

Gets the url for a page containing shift information for specified game from HTML tables for road team.

Parameters:
  • season – int, the season
  • game – int, the game

Returns:

str, e.g. http://www.nhl.com/scores/htmlreports/20072008/TV020001.HTM

scrapenhl2.scrape.scrape_toi.get_shift_url(season, game)

Gets the url for a page containing shift information for specified game from NHL API.

Parameters:
  • season – int, the season
  • game – int, the game

Returns:

str, http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId=[season]0[game]
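
The three URL patterns documented above can be reproduced with plain string formatting. This is a sketch under assumptions, not the library's implementation: the function names are hypothetical, season is taken to be the starting year (e.g. 2007 for 2007-08), and game is taken to be a five-digit number such as 20001.

```python
def html_shiftlog_url(season, game, homeroad):
    """Build an HTML shift report URL following the examples in this section.

    homeroad: 'H' gives the home report (TH prefix), 'R' the road report (TV prefix).
    """
    prefix = {"H": "TH", "R": "TV"}[homeroad]
    return "http://www.nhl.com/scores/htmlreports/{0:d}{1:d}/{2:s}{3:06d}.HTM".format(
        season, season + 1, prefix, game)


def api_shift_url(season, game):
    """Build the shift chart URL in the [season]0[game] pattern."""
    return "http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId={0:d}0{1:d}".format(
        season, game)


print(html_shiftlog_url(2007, 20001, "H"))
print(html_shiftlog_url(2007, 20001, "R"))
print(api_shift_url(2017, 20001))
```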

scrapenhl2.scrape.scrape_toi.save_raw_toi(page, season, game)

Takes the bytes page containing shift information and saves to disk as a compressed zlib.

Parameters:
  • page – bytes. str(page) would yield a string version of the json shifts
  • season – int, the season
  • game – int, the game
Returns:

nothing
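
As a rough illustration of saving a bytes page as compressed zlib and reading it back as a dict, the sketch below uses only the standard library. The helper names, the file name, and the JSON keys are hypothetical; the library's actual file layout may differ.

```python
import json
import os
import tempfile
import zlib


def save_compressed(page, filename):
    """Compress a bytes page with zlib and write it to filename."""
    with open(filename, "wb") as f:
        f.write(zlib.compress(page))


def load_compressed_json(filename):
    """Read a zlib-compressed file and decode its JSON content to a dict."""
    with open(filename, "rb") as f:
        return json.loads(zlib.decompress(f.read()).decode("utf-8"))


# Hypothetical payload; the real API response has its own structure.
page = json.dumps({"data": [{"playerId": 8471214, "shiftNumber": 1}]}).encode("utf-8")

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "20001.zlib")
    save_compressed(page, path)
    shifts = load_compressed_json(path)

print(shifts["data"][0]["playerId"])
```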

scrapenhl2.scrape.scrape_toi.save_raw_toi_from_html(page, season, game, homeroad)

Takes the bytes page containing shift information and saves to disk as html.

Parameters:
  • page – bytes. str(page) would yield a string version of the html shifts
  • season – int, the season
  • game – int, the game
  • homeroad – str, ‘H’ or ‘R’
Returns:

nothing

scrapenhl2.scrape.scrape_toi.scrape_game_toi(season, game, force_overwrite=False)

This method scrapes the toi for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If False and the file already exists, won’t scrape again
Returns:

nothing
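
The force_overwrite behavior used throughout these modules can be pictured as a skip-if-exists check. The following is a minimal sketch of that pattern, assuming a hypothetical helper scrape_if_needed and a stand-in fetch function. Unlike the documented methods, it returns a bool so the outcome is easy to see.

```python
import os
import tempfile


def scrape_if_needed(filename, fetch, force_overwrite=False):
    """Write fetch() to filename unless the file exists and force_overwrite is False.

    Returns True if the file was written, False if the scrape was skipped.
    """
    if os.path.exists(filename) and not force_overwrite:
        return False
    with open(filename, "wb") as f:
        f.write(fetch())
    return True


calls = []


def fake_fetch():
    """Stand-in for a network request; records each call."""
    calls.append(1)
    return b"<html>shifts</html>"


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "20001H.html")
    results = [
        scrape_if_needed(path, fake_fetch),                        # file missing: scrapes
        scrape_if_needed(path, fake_fetch),                        # file exists: skipped
        scrape_if_needed(path, fake_fetch, force_overwrite=True),  # forced: scrapes again
    ]

print(results, len(calls))
```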

scrapenhl2.scrape.scrape_toi.scrape_game_toi_from_html(season, game, force_overwrite=True)

This method scrapes the toi html logs for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If False and the file already exists, won’t scrape again
Returns:

nothing

scrapenhl2.scrape.scrape_toi.scrape_season_toi(season, force_overwrite=False)

Scrapes and parses toi from the given season.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, rescrapes all games. If false, only previously unscraped ones
Returns:

nothing

scrapenhl2.scrape.scrape_toi.scrape_toi_setup()

Creates raw toi folders if need be

Returns:nothing

Parse TOI

This module contains methods for parsing TOI.

scrapenhl2.scrape.parse_toi.get_game_parsed_toi_filename(season, game)

Returns the filename of the parsed toi file for this game

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, /scrape/data/parsed/toi/[season]/[game].zlib

scrapenhl2.scrape.parse_toi.get_melted_home_road_5v5_toi(season, game)

Reads parsed TOI for this game, filters for 5v5 TOI, and melts from wide to long on player

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

(home_df, road_df), each with columns Time, PlayerID, and Team (which will be H or R)
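
To show what filtering for 5v5 and melting from wide to long means, here is a standard-library sketch. It works on a list of dicts rather than a dataframe (pandas.melt would be the usual tool), and the column names HomeSkaters, RoadSkaters, H1, H2, R1, and R2 are hypothetical stand-ins for the parsed TOI layout.

```python
def melt_5v5(rows, home_columns, road_columns):
    """Keep 5v5 rows, then emit one record per (second, player) for each team."""
    home_long, road_long = [], []
    for row in rows:
        if row["HomeSkaters"] != 5 or row["RoadSkaters"] != 5:
            continue  # not 5v5, so drop this second
        for col in home_columns:
            home_long.append({"Time": row["Time"], "PlayerID": row[col], "Team": "H"})
        for col in road_columns:
            road_long.append({"Time": row["Time"], "PlayerID": row[col], "Team": "R"})
    return home_long, road_long


# Two seconds of a toy game with two tracked players per side.
rows = [
    {"Time": 1, "HomeSkaters": 5, "RoadSkaters": 5, "H1": 101, "H2": 102, "R1": 201, "R2": 202},
    {"Time": 2, "HomeSkaters": 5, "RoadSkaters": 4, "H1": 101, "H2": 103, "R1": 201, "R2": 203},
]
home_long, road_long = melt_5v5(rows, ["H1", "H2"], ["R1", "R2"])
print(home_long)
print(road_long)
```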

scrapenhl2.scrape.parse_toi.get_parsed_toi(season, game)

Loads the compressed json file containing this game’s shifts from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

json, the json shifts

scrapenhl2.scrape.parse_toi.parse_game_toi(season, game, force_overwrite=False)

Parses TOI from json for this game

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

nothing

scrapenhl2.scrape.parse_toi.parse_game_toi_from_html(season, game, force_overwrite=False)

Parses TOI from the html shift log from this game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

nothing

scrapenhl2.scrape.parse_toi.parse_season_toi(season, force_overwrite=False)

Parses toi from the given season. Final games covered only.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, parses all games. If false, only previously unparsed ones
Returns:

nothing

scrapenhl2.scrape.parse_toi.parse_toi_setup()

Creates parsed toi folders if need be

Returns:nothing
scrapenhl2.scrape.parse_toi.read_shifts_from_html_pages(rawtoi1, rawtoi2, teamid1, teamid2, season, game)

Aggregates information from two html pages given into a dataframe with one row per second and one col per player.

Parameters:
  • rawtoi1 – str, html page of shift log for teamid1
  • rawtoi2 – str, html page of shift log for teamid2
  • teamid1 – int, team id corresponding to rawtoi1
  • teamid2 – int, team id corresponding to rawtoi2
  • season – int, the season
  • game – int, the game
Returns:

dataframe

scrapenhl2.scrape.parse_toi.read_shifts_from_page(rawtoi, season, game)

Turns JSON shift start-ends into TOI matrix with one row per second and one col per player

Parameters:
  • rawtoi – dict, json from NHL API
  • season – int, the season
  • game – int, the game
Returns:

dataframe
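
The idea of turning shift start-ends into a per-second structure can be sketched as follows. This is an illustration only: the function name and input format are hypothetical, the result is a list of on-ice player ids per second rather than a dataframe, and it assumes a shift from start to end covers seconds start + 1 through end.

```python
def shifts_to_seconds(shifts, game_length):
    """Expand (player_id, start, end) shifts into who is on the ice each second.

    Entry i of the result lists the players on ice during second i + 1.
    """
    on_ice = [set() for _ in range(game_length)]
    for player, start, end in shifts:
        for second in range(start, min(end, game_length)):
            on_ice[second].add(player)
    return [sorted(players) for players in on_ice]


# Player 1 skates 0:00-0:03, player 2 skates 0:02-0:05.
matrix = shifts_to_seconds([(1, 0, 3), (2, 2, 5)], game_length=5)
print(matrix)  # [[1], [1], [1, 2], [2], [2]]
```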

scrapenhl2.scrape.parse_toi.save_parsed_toi(toi, season, game)

Saves the pandas dataframe containing shift information to disk as an HDF5.

Parameters:
  • toi – df, a pandas dataframe with the shifts of the game
  • season – int, the season
  • game – int, the game
Returns:

nothing

Team information

This module contains methods related to the team info files.

scrapenhl2.scrape.team_info.add_team_to_info_file(teamid)

In case we come across teams that are not in the default list (1-110), use this method to add them to the file.

Parameters:teamid – int, the team ID
Returns:(tid, tabbrev, tname)
scrapenhl2.scrape.team_info.generate_team_ids_file(teamids=None)

Reads all team id URLs and stores information to disk. Has the following information:

  • ID: int
  • Abbreviation: str (three letters)
  • Name: str (full name)
Parameters:teamids – iterable of int. Tries to access team ids as listed in teamids. If None, goes from 1-110.
Returns:nothing
scrapenhl2.scrape.team_info.get_team_colordict()

Get the team color dictionary

Returns:a dictionary of IDs to tuples of hex colors
scrapenhl2.scrape.team_info.get_team_colors(team)

Returns primary and secondary color for this team.

Parameters:team – str or int, the team
Returns:tuple of hex colors
scrapenhl2.scrape.team_info.get_team_info_file()

Returns the team information dataframe from memory. This is stored as a feather file for fast read/write.

Returns:dataframe from /scrape/data/other/TEAM_INFO.feather
scrapenhl2.scrape.team_info.get_team_info_filename()

Returns the team information filename

Returns:str, /scrape/data/other/TEAM_INFO.feather
scrapenhl2.scrape.team_info.get_team_info_from_url(teamid)

Pulls ID, abbreviation, and name from the NHL API.

Parameters:teamid – int, the team ID
Returns:(id, abbrev, name)
scrapenhl2.scrape.team_info.get_team_info_url(teamid)

Gets the team url from the NHL API.

Parameters:teamid – int, the team ID
Returns:str, http://statsapi.web.nhl.com/api/v1/teams/[teamid]
scrapenhl2.scrape.team_info.team_as_id

A helper method. If team entered is int, returns that. If team is str, returns integer id of that team.

Parameters:team – int, or str
Returns:int, the team ID
scrapenhl2.scrape.team_info.team_as_str

A helper method. If team entered is str, returns that. If team is int, returns string name of that team.

Parameters:
  • team – int, or str
  • abbreviation – bool, whether to return 3-letter abbreviation or full name
Returns:

str, the team name
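
The pass-through behavior of these two helpers can be sketched as below. The lookup table, the function names as_id and as_str, and the default abbreviation=True are assumptions made for illustration; the library reads its own team info file instead.

```python
# Hypothetical two-team lookup table, for illustration only.
TEAMS = {
    15: ("WSH", "Washington Capitals"),
    5: ("PIT", "Pittsburgh Penguins"),
}


def as_id(team):
    """Return team unchanged if it is an int, else look up its id by abbreviation or name."""
    if isinstance(team, int):
        return team
    for team_id, (abbrev, name) in TEAMS.items():
        if team in (abbrev, name):
            return team_id
    raise KeyError(team)


def as_str(team, abbreviation=True):
    """Return team unchanged if it is a str, else its abbreviation or full name."""
    if isinstance(team, str):
        return team
    abbrev, name = TEAMS[team]
    return abbrev if abbreviation else name


print(as_id("WSH"), as_id(15))
print(as_str(15), as_str(15, abbreviation=False), as_str("WSH"))
```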

scrapenhl2.scrape.team_info.team_setup()

This method loads the team info df into memory

Returns:nothing
scrapenhl2.scrape.team_info.write_team_info_file(df)

Writes the team information file. This is stored as a feather file for fast read/write.

Parameters:df – the (team information) dataframe to write to file
Returns:nothing

Teams

This module contains methods related to team logs.

scrapenhl2.scrape.teams.get_team_pbp(season, team)

Returns the pbp of given team in given season across all games.

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

df, the pbp of given team in given season

scrapenhl2.scrape.teams.get_team_pbp_filename(season, team)

Returns filename of the PBP log for this team and season

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

scrapenhl2.scrape.teams.get_team_toi(season, team)

Returns the toi of given team in given season across all games.

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

df, the toi of given team in given season

scrapenhl2.scrape.teams.get_team_toi_filename(season, team)

Returns filename of the TOI log for this team and season

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

scrapenhl2.scrape.teams.team_setup()

Creates team log-related folders.

Returns:nothing
scrapenhl2.scrape.teams.update_team_logs(season, force_overwrite=False, force_games=None)

This method looks at the schedule for the given season and writes pbp for scraped games to file. It also adds the strength at each pbp event to the log. It only includes games that have both PBP and TOI.

Parameters:
  • season – int, the season
  • force_overwrite – bool, whether to generate from scratch
  • force_games – None or iterable of games to force_overwrite specifically
Returns:

nothing
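
One way to read the game-selection rule described above (only games with both PBP and TOI, with optional forcing) is sketched here. The function games_to_log and its arguments are hypothetical, and the handling of force_games reflects an assumed interpretation rather than the library's confirmed logic.

```python
def games_to_log(pbp_games, toi_games, already_logged=(), force_overwrite=False, force_games=None):
    """Pick games that have both PBP and TOI and still need to be written to the team log.

    Assumption: previously logged games are skipped unless force_overwrite is True
    or the game appears in force_games.
    """
    eligible = set(pbp_games) & set(toi_games)
    if not force_overwrite:
        forced = set(force_games or ())
        eligible -= set(already_logged) - forced
    return sorted(eligible)


pbp = [20001, 20002, 20003]
toi = [20001, 20002, 20004]

print(games_to_log(pbp, toi, already_logged=[20001]))                       # [20002]
print(games_to_log(pbp, toi, already_logged=[20001], force_games=[20001]))  # [20001, 20002]
print(games_to_log(pbp, toi, already_logged=[20001], force_overwrite=True)) # [20001, 20002]
```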

scrapenhl2.scrape.teams.write_team_pbp(pbp, season, team)

Writes the given pbp dataframe to file.

Parameters:
  • pbp – df, the pbp of given team in given season
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

nothing

scrapenhl2.scrape.teams.write_team_toi(toi, season, team)

Writes team TOI log to file

Parameters:
  • toi – df, team toi for this season
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

nothing