Welcome to scrapenhl2’s documentation!

Table of Contents

Scrape

The scrapenhl2.scrape module contains methods useful for scraping.

Useful examples

Updating data:

from scrapenhl2.scrape import autoupdate
autoupdate.autoupdate()

Get the season schedule:

from scrapenhl2.scrape import schedules
schedules.get_season_schedule(2017)

Convert between player ID and player name:

from scrapenhl2.scrape import players
pname = 'Alex Ovechkin'
players.player_as_id(pname)

pid = 8471214
players.player_as_str(pid)

There’s much more, and feel free to submit pull requests with whatever you find useful.

Methods

The functions in these modules are organized pretty logically under the module names.

Autoupdate

This module contains methods for automatically scraping and parsing games.

scrapenhl2.scrape.autoupdate.autoupdate(season=None)

Run this method to update local data. It reads the schedule file for given season and scrapes and parses previously unscraped games that have gone final or are in progress. Use this for 2010 or later.

Parameters:season – int, the season. If None (default), will do current season
Returns:nothing
scrapenhl2.scrape.autoupdate.delete_game_html(season, game)

Deletes html files. HTML files are used for live game charts, but deleted in favor of JSONs when games go final.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.autoupdate.read_final_games(games, season)
Parameters:
  • games
  • season
Returns:

scrapenhl2.scrape.autoupdate.read_inprogress_games(inprogressgames, season)

Saves these games to file via html (for toi) and json (for pbp)

Parameters:inprogressgames – list of int
Returns:
Events

This module contains methods related to PBP events.

scrapenhl2.scrape.events.convert_event(event)

Converts to a more convenient, standardized name (see get_event_dictionary)

Parameters:event – str, the event name
Returns:str, shortened event name
scrapenhl2.scrape.events.event_setup()

Loads event dictionary into memory

Returns:nothing
scrapenhl2.scrape.events.get_event_dictionary()

Returns the abbreviation: long name event mapping (in lowercase)

Returns:dict of str:str
scrapenhl2.scrape.events.get_event_longname

A method for translating event abbreviations to full names (for pbp matching)

Parameters:eventname – str, the event name
Returns:the non-abbreviated event name
Games

This module contains methods related to scraping games.

scrapenhl2.scrape.games.find_recent_games(team1, team2=None, limit=1)

A convenience function that lists the most recent in progress or final games for specified team(s)

Parameters:
  • team1 – str, a team
  • team2 – str, a team (optional)
  • limit – How many games to return
Returns:

df with relevant rows

scrapenhl2.scrape.games.get_player_5v5_log_filename(season)

Gets the filename for the season’s player log file. Includes 5v5 CF, CA, TOI, and more.

Parameters:season – int, the season
Returns:str, /scrape/data/other/[season]_player_log.feather
scrapenhl2.scrape.games.most_recent_game_id(team1, team2)

A convenience function to get the most recent game (this season) between two teams.

Parameters:
  • team1 – str, a team
  • team2 – str, a team
Returns:

int, a game number

General helpers

This module contains general helper methods. None of these methods have dependencies on other scrapenhl2 modules.

scrapenhl2.scrape.general_helpers.add_sim_scores(df, name)

Adds fuzzywuzzy’s token set similarity scores to provded dataframe

Parameters:
  • df – pandas dataframe with column Name
  • name – str, name to compare to
Returns:

df with an additional column SimScore

scrapenhl2.scrape.general_helpers.anti_join(df1, df2, **kwargs)

Anti-joins two dataframes.

Parameters:
  • df1 – dataframe
  • df2 – dataframe
  • kwargs – keyword arguments as passed to pd.DataFrame.merge (except for ‘how’). Specifically, need join keys.
Returns:

dataframe

scrapenhl2.scrape.general_helpers.check_number(obj)

A helper method to check if obj is int, float, np.int64, etc. This is frequently needed, so is helpful.

Parameters:obj – the object to check the type
Returns:bool
scrapenhl2.scrape.general_helpers.check_number_last_first_format(name)

Checks if specified name looks like “8 Ovechkin, Alex”

Parameters:name – str
Returns:bool
scrapenhl2.scrape.general_helpers.check_types(obj)

A helper method to check if obj is int, float, np.int64, or str. This is frequently needed, so is helpful.

Parameters:obj – the object to check the type
Returns:bool
scrapenhl2.scrape.general_helpers.fill_join(df1, df2, **kwargs)

Uses data from df2 to fill in missing values from df1. Helpful when you have to join using multiple data sources. Preserves data order. Won’t work when joining introduces duplicates.

Parameters:
  • df1 – dataframe
  • df2 – dataframe
  • kwargs – keyword arguments as passed to pd.DataFrame.merge (except for ‘how’ and ‘suffixes’)
Returns:

dataframe

scrapenhl2.scrape.general_helpers.flip_first_last(name)

Changes Ovechkin, Alex to Alex Ovechkin. Also changes to title case.

Parameters:name – str
Returns:str, flipped if applicable
scrapenhl2.scrape.general_helpers.fuzzy_match_player(name_provided, names, minimum_similarity=50)

This method checks similarity between each entry in names and the name_provided using token set matching and returns the entry that matches best. Returns None if no similarity is greater than minimum_similarity. (See e.g. http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/)

Parameters:
  • name_provided – str, name to look for
  • names – list (or ndarray, or similar) of
  • minimum_similarity – int from 0 to 100, minimum similarity. If all are below this, returns None.
Returns:

str, string in names that best matches name_provided

scrapenhl2.scrape.general_helpers.get_initials(pname)

Splits name on spaces and returns first letter from each part.

Parameters:pname – str, player name
Returns:str, player initials
scrapenhl2.scrape.general_helpers.get_lastname(pname)

Splits name on first space and returns second part.

Parameters:pname – str, player name
Returns:str, player last name
scrapenhl2.scrape.general_helpers.infer_season_from_date

Looks at a date and infers the season based on that: Year-1 if month is Aug or before; returns year otherwise.

Parameters:date – str, YYYY-MM-DD
Returns:int, the season. 2007-08 would be 2007.
scrapenhl2.scrape.general_helpers.intervals(lst, interval_pct=10)

A method that divides list into intervals and returns tuples indicating each interval mark. Useful for giving updates when cycling through games.

Parameters:
  • lst – lst to divide
  • interval_pct – int, pct for each interval to represent. e.g. 10 means it will mark every 10%.
Returns:

a list of tuples of (index, value)

scrapenhl2.scrape.general_helpers.log_exceptions(fn)

A decorator that wraps the passed in function and logs exceptions should one occur

Parameters:function – the function
Returns:nothing
scrapenhl2.scrape.general_helpers.melt_helper(df, **kwargs)

Earlier versions of pandas do not support pd.DataFrame.melt. This helps to bridge the gap. It first tries df.melt, and if that doesn’t work, it uses pd.melt.

Parameters:
  • df – dataframe
  • kwargs – arguments to pd.melt or pd.DataFrame.melt.
Returns:

melted dataframe

scrapenhl2.scrape.general_helpers.mmss_to_secs(strtime)

Converts time from mm:ss to seconds

Parameters:strtime – str, mm:ss
Returns:int
scrapenhl2.scrape.general_helpers.once_per_second(fn, calls_per_second=1)

A decorator that sleeps for one second after executing the function. Used when scraping NHL site. This also means all functions that access the internet sleep for a second.

Parameters:fn – the function
Returns:nothing
scrapenhl2.scrape.general_helpers.period_contribution(x)

Turns period–1, 2, 3, OT, etc–into # of seconds elapsed in game until start. :param x: str or int, 1, 2, 3, etc :return: int, number of seconds elapsed until start of specified period

scrapenhl2.scrape.general_helpers.print_and_log(message, level='info', print_and_log=True)

A helper method that prints message to console and also writes to log with specified level.

Parameters:
  • message – str, the message
  • level – str, the level of log: info, warn, error, critical
  • print_and_log – bool. If False, logs only.
Returns:

nothing

scrapenhl2.scrape.general_helpers.remove_leading_number(string)

Will convert 8 Alex Ovechkin to Alex Ovechkin, or Alex Ovechkin to Alex Ovechkin

Parameters:string – a string
Returns:string without leading numbers
scrapenhl2.scrape.general_helpers.start_logging()

Clears out logging folder, and starts the log in this folder

scrapenhl2.scrape.general_helpers.try_to_access_dict(base_dct, *keys, **kwargs)

A helper method that accesses base_dct using keys, one-by-one. Returns None if a key does not exist.

Parameters:
  • base_dct – dict, a dictionary
  • keys – str, int, or other valid dict keys
  • kwargs – can specify default using kwarg default_return=0, for example.
Returns:

obj, base_dct[key1][key2][key3]… or None if a key is not in the dictionary

scrapenhl2.scrape.general_helpers.try_url_n_times(url, timeout=5, n=5)

A helper method that tries to access given url up to five times, returning the page.

Parameters:
  • url – str, the url to access
  • timeout – int, number of secs to wait before timeout. Default 5.
  • n – int, the max number of tries. Default 5.
Returns:

bytes

Organization

This module contains paths to folders.

scrapenhl2.scrape.organization.check_create_folder(*args)

A helper method to create a folder if it doesn’t exist already

Parameters:args – list of str, the parts of the filepath. These are joined together with the base directory
Returns:nothing
scrapenhl2.scrape.organization.get_base_dir()

Returns the base directory of this package (one directory up from this file)

Returns:str, the base directory
scrapenhl2.scrape.organization.get_other_data_folder()

Returns the folder containing other data

Returns:str, /scrape/data/other/
scrapenhl2.scrape.organization.get_parsed_data_folder()

Returns the folder containing parsed data

Returns:str, /scrape/data/parsed/
scrapenhl2.scrape.organization.get_raw_data_folder()

Returns the folder containing raw data

Returns:str, /scrape/data/raw/
scrapenhl2.scrape.organization.get_season_parsed_pbp_folder(season)

Returns the folder containing parsed pbp for given season

Parameters:season – int, current season
Returns:str, /scrape/data/parsed/pbp/[season]/
scrapenhl2.scrape.organization.get_season_parsed_toi_folder(season)

Returns the folder containing parsed toi for given season

Parameters:season – int, current season
Returns:str, /scrape/data/raw/toi/[season]/
scrapenhl2.scrape.organization.get_season_raw_pbp_folder(season)

Returns the folder containing raw pbp for given season

Parameters:season – int, current season
Returns:str, /scrape/data/raw/pbp/[season]/
scrapenhl2.scrape.organization.get_season_raw_toi_folder(season)

Returns the folder containing raw toi for given season

Parameters:season – int, current season
Returns:str, /scrape/data/raw/toi/[season]/
scrapenhl2.scrape.organization.get_season_team_pbp_folder(season)

Returns the folder containing team pbp logs for given season

Parameters:season – int, current season
Returns:str, /scrape/data/teams/pbp/[season]/
scrapenhl2.scrape.organization.get_season_team_toi_folder(season)

Returns the folder containing team toi logs for given season

Parameters:season – int, current season
Returns:str, /scrape/data/teams/toi/[season]/
scrapenhl2.scrape.organization.get_team_data_folder()

Returns the folder containing team log data

Returns:str, /scrape/data/teams/
scrapenhl2.scrape.organization.organization_setup()

Creates other folder if need be

Returns:nothing
Players

This module contains methods related to individual player info.

scrapenhl2.scrape.players.check_default_player_id(playername)

E.g. For Mike Green, I should automatically assume we mean 8471242 (WSH/DET), not 8468436. Returns None if not in dict. Ideally improve code so this isn’t needed.

Parameters:playername – str
Returns:int, or None
scrapenhl2.scrape.players.generate_player_ids_file()

Creates a dataframe with these columns:

  • ID: int, player ID
  • Name: str, player name
  • DOB: str, date of birth
  • Hand: char, R or L
  • Pos: char, one of C/R/L/D/G

It will be populated with Alex Ovechkin to start. :return: nothing

scrapenhl2.scrape.players.generate_player_log_file()

Run this when no player log file exists already. This is for getting the datatypes right. Adds Alex Ovechkin in Game 1 vs Pittsburgh in 2016-2017.

Returns:nothing
scrapenhl2.scrape.players.get_player_handedness

Retrieves handedness of player

Parameters:player – str or int, the player name or ID
Returns:str, player hand (L or R)
scrapenhl2.scrape.players.get_player_ids_file()

Returns the player information file. This is stored as a feather file for fast read/write.

Returns:/scrape/data/other/PLAYER_INFO.feather
scrapenhl2.scrape.players.get_player_info_from_url(playerid)

Gets ID, Name, Hand, Pos, DOB, Height, Weight, and Nationality from the NHL API.

Parameters:playerid – int, the player id
Returns:dict with player ID, name, handedness, position, etc
scrapenhl2.scrape.players.get_player_log_file()

Returns the player log file from memory.

Returns:dataframe, the log
scrapenhl2.scrape.players.get_player_log_filename()

Returns the player log filename.

Returns:str, /scrape/data/other/PLAYER_LOG.feather
scrapenhl2.scrape.players.get_player_position

Retrieves position of player

Parameters:player – str or int, the player name or ID
Returns:str, player position (e.g. C, D, R, L, G)
scrapenhl2.scrape.players.get_player_url(playerid)

Gets the url for a page containing information for specified player from NHL API.

Parameters:playerid – int, the player ID
Returns:str, https://statsapi.web.nhl.com/api/v1/people/[playerid]
scrapenhl2.scrape.players.player_as_id

A helper method. If player entered is int, returns that. If player is str, returns integer id of that player.

Parameters:
  • playername – int, or str, the player whose names you want to retrieve
  • filterids – a tuple of players to choose from. Needs to be tuple else caching won’t work.
  • dob – yyyy-mm-dd, use to help when multiple players have the same name
Returns:

int, the player ID

scrapenhl2.scrape.players.player_as_str

A helper method. If player is int, returns string name of that player. Else returns standardized name.

Parameters:
  • playerid – int, or str, player whose name you want to retrieve
  • filterids – a tuple of players to choose from. Needs to be tuple else caching won’t work. Probably not needed but you can use this method to go from part of the name to full name, in which case it may be helpful.
Returns:

str, the player name

scrapenhl2.scrape.players.player_setup()

Loads team info file into memory.

Returns:nothing
scrapenhl2.scrape.players.playerlst_as_id(playerlst, exact=False, filterdf=None)

Similar to player_as_id, but less robust against errors, and works on a list of players.

Parameters:
  • players – a list of int, or str, players whose IDs you want to retrieve.
  • exact – bool. If True, looks for exact matches. If False, does not, using player_as_id (but will be slower)
  • filterdf – df, a dataframe of players to choose from. Defaults to all.
Returns:

a list of int/float

scrapenhl2.scrape.players.playerlst_as_str(players, filterdf=None)

Similar to player_as_str, but less robust against errors, and works on a list of players

Parameters:
  • players – a list of int, or str, players whose names you want to retrieve
  • filterdf – df, a dataframe of players to choose from. Defaults to all.
Returns:

a list of str

scrapenhl2.scrape.players.rescrape_player(playerid)

If you notice that a player name, position, etc, is outdated, call this method on their ID. It will re-scrape their data from the NHL API.

Parameters:playerid – int, their ID. Also accepts str, their name.
Returns:nothing
scrapenhl2.scrape.players.update_player_ids_file(playerids, force_overwrite=False)

Adds these entries to player IDs file if need be.

Parameters:
  • playerids – a list of IDs
  • force_overwrite – bool. If True, will re-scrape data for all player ids. If False, only new ones.
Returns:

nothing

scrapenhl2.scrape.players.update_player_ids_from_page(pbp)

Reads the list of players listed in the game file and adds to the player IDs file if they are not there already.

Parameters:pbp – json, the raw pbp
Returns:nothing
scrapenhl2.scrape.players.update_player_log_file(playerids, seasons, games, teams, statuses)

Updates the player log file with given players. The player log file notes which players played in which games and whether they were scratched or played.

Parameters:
  • playerids – int or str or list of int
  • seasons – int, the season, or list of int the same length as playerids
  • games – int, the game, or list of int the same length as playerids
  • teams – str or int, the team, or list of int the same length as playerids
  • statuses – str, or list of str the same length as playerids
Returns:

nothing

scrapenhl2.scrape.players.update_player_logs_from_page(pbp, season, game)

Takes the game play by play and adds players to the master player log file, noting that they were on the roster for this game, which team they played for, and their status (P for played, S for scratch).

Parameters:
  • season – int, the season
  • game – int, the game
  • pbp – json, the pbp of the game
Returns:

nothing

scrapenhl2.scrape.players.write_player_ids_file(df)

Writes the given dataframe to disk as the player ids mapping.

Parameters:df – pandas dataframe, player ids file
Returns:nothing
scrapenhl2.scrape.players.write_player_log_file(df)

Writes the given dataframe to file as the player log filename

Parameters:df – pandas dataframe
Returns:nothing
Schedules

This module contains methods related to season schedules.

scrapenhl2.scrape.schedules.attach_game_dates_to_dateframe(df)

Takes dataframe with Season and Game columns and adds a Date column (for that game)

Parameters:df – dataframe
Returns:dataframe with one more column
scrapenhl2.scrape.schedules.generate_season_schedule_file(season, force_overwrite=True)

Reads season schedule from NHL API and writes to file.

The output contains the following columns:

  • Season: int, the season
  • Date: str, the dates
  • Game: int, the game id
  • Type: str, the game type (for preseason vs regular season, etc)
  • Status: str, e.g. Final
  • Road: int, the road team ID
  • RoadScore: int, number of road team goals
  • RoadCoach str, ‘N/A’ when this function is run (edited later with road coach name)
  • Home: int, the home team ID
  • HomeScore: int, number of home team goals
  • HomeCoach: str, ‘N/A’ when this function is run (edited later with home coach name)
  • Venue: str, the name of the arena
  • Result: str, ‘N/A’ when this function is run (edited accordingly later from PoV of home team: W, OTW, SOL, etc)
  • PBPStatus: str, ‘Not scraped’ when this function is run (edited accordingly later)
  • TOIStatus: str, ‘Not scraped’ when this function is run (edited accordingly later)
Parameters:
  • season – int, the season
  • force_overwrite – bool. If True, generates entire file from scratch. If False, only redoes when not Final previously.
Returns:

Nothing

scrapenhl2.scrape.schedules.get_current_season()

Returns the current season.

Returns:The current season variable (generated at import from _get_current_season)
scrapenhl2.scrape.schedules.get_game_data_from_schedule

This is a helper method that uses the schedule file to isolate information for current game (e.g. teams involved, coaches, venue, score, etc.)

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

dict of game data

scrapenhl2.scrape.schedules.get_game_date(season, game)

Returns the date of this game

Parameters:
  • season – int, the game
  • game – int, the season
Returns:

str

scrapenhl2.scrape.schedules.get_game_result(season, game)

Returns the result of this game for home team (e.g. W, SOL)

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

int, the score

scrapenhl2.scrape.schedules.get_game_status(season, game)

Returns the status of this game (e.g. Final, In Progress)

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

int, the score

scrapenhl2.scrape.schedules.get_home_score(season, game)

Returns the home score from this game

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

int, the score

scrapenhl2.scrape.schedules.get_home_team(season, game, returntype='id')

Returns the home team from this game

Parameters:
  • season – int, the game
  • game – int, the season
  • returntype – str, ‘id’ or ‘name’
Returns:

float or str, depending on returntype

scrapenhl2.scrape.schedules.get_road_score(season, game)

Returns the road score from this game

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

int, the score

scrapenhl2.scrape.schedules.get_road_team(season, game, returntype='id')

Returns the road team from this game

Parameters:
  • season – int, the game
  • game – int, the season
  • returntype – str, ‘id’ or ‘name’
Returns:

float or str, depending on returntype

scrapenhl2.scrape.schedules.get_season_schedule(season)

Gets the the season’s schedule file from memory.

Parameters:season – int, the season
Returns:dataframe (originally from /scrape/data/other/[season]_schedule.feather)
scrapenhl2.scrape.schedules.get_season_schedule_filename(season)

Gets the filename for the season’s schedule file

Parameters:season – int, the season
Returns:str, /scrape/data/other/[season]_schedule.feather
scrapenhl2.scrape.schedules.get_season_schedule_url(season)

Gets the url for a page containing all of this season’s games (Sep 1 to Jun 26) from NHL API.

Parameters:season – int, the season
Returns:str, https://statsapi.web.nhl.com/api/v1/schedule?startDate=[season]-09-01&endDate=[season+1]-06-25
scrapenhl2.scrape.schedules.get_team_games(season=None, team=None, startdate=None, enddate=None)

Returns list of games played by team in season.

Just calls get_team_schedule with the provided arguments, returning the series of games from that dataframe.

Parameters:
  • season – int, the season
  • team – int or str, the team
  • startdate – str or None
  • enddate – str or None
Returns:

series of games

scrapenhl2.scrape.schedules.get_team_schedule(season=None, team=None, startdate=None, enddate=None)

Gets the schedule for given team in given season. Or if startdate and enddate are specified, searches between those dates. If season and startdate (and/or enddate) are specified, searches that season between those dates.

Parameters:
  • season – int, the season
  • team – int or str, the team
  • startdate – str, YYYY-MM-DD
  • enddate – str, YYYY-MM-DD
Returns:

dataframe

scrapenhl2.scrape.schedules.get_teams_in_season(season)

Returns all teams that have a game in the schedule for this season

Parameters:season – int, the season
Returns:set of team IDs
scrapenhl2.scrape.schedules.schedule_setup()

Reads current season and schedules into memory.

Returns:nothing
scrapenhl2.scrape.schedules.write_season_schedule(df, season, force_overwrite)

A helper method that writes the season schedule file to disk (in feather format for fast read/write)

Parameters:
  • df – the season schedule datafraome
  • season – the season
  • force_overwrite – bool. If True, overwrites entire file. If False, only redoes when not Final previously.
Returns:

Nothing

Manipulate schedules

This module contains methods related to generating and manipulating schedules.

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_coaches(pbp, season, game)

Uses the PbP to update coach info for this game.

Parameters:
  • pbp – json, the pbp for this game
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_pbp_scrape(season, game)

Updates the schedule file saying that specified game’s pbp has been scraped.

Parameters:
  • season – int, the season
  • game – int, the game, or list of ints
Returns:

updated schedule

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_result(season, game, result)

Updates the season schedule file with game result (which are listed ‘N/A’ at schedule generation)

Parameters:
  • season – int, the season
  • game – int, the game
  • result – str, the result from home team perspective
Returns:

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_result_using_pbp(pbp, season, game)

Uses the PbP to update results for this game.

Parameters:
  • pbp – json, the pbp for this game
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.manipulate_schedules.update_schedule_with_toi_scrape(season, game)

Updates the schedule file saying that specified game’s toi has been scraped.

Parameters:
  • season – int, the season
  • game – int, the game, or list of int
Returns:

nothing

Scrape play by play

This module contains methods for scraping pbp.

scrapenhl2.scrape.scrape_pbp.get_game_from_url(season, game)

Gets the page containing information for specified game from NHL API.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, the page at the url

scrapenhl2.scrape.scrape_pbp.get_game_pbplog_filename(season, game)

Returns the filename of the parsed pbp html game pbp

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/raw/pbp/[season]/[game].html

scrapenhl2.scrape.scrape_pbp.get_game_pbplog_url(season, game)

Gets the url for a page containing pbp information for specified game from HTML tables.

Parameters:
  • season – int, the season
  • game – int, the game

:return : str, e.g. http://www.nhl.com/scores/htmlreports/20072008/PL020001.HTM

scrapenhl2.scrape.scrape_pbp.get_game_raw_pbp_filename(season, game)

Returns the filename of the raw pbp folder

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/raw/pbp/[season]/[game].zlib

scrapenhl2.scrape.scrape_pbp.get_game_url(season, game)

Gets the url for a page containing information for specified game from NHL API.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, https://statsapi.web.nhl.com/api/v1/game/[season]0[game]/feed/live

scrapenhl2.scrape.scrape_pbp.get_raw_html_pbp(season, game)

Loads the html file containing this game’s play by play from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, the html pbp

scrapenhl2.scrape.scrape_pbp.get_raw_pbp(season, game)

Loads the compressed json file containing this game’s play by play from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

json, the json pbp

scrapenhl2.scrape.scrape_pbp.save_raw_html_pbp(page, season, game)

Takes the bytes page containing html pbp information and saves as such

Parameters:
  • page – bytes
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.scrape_pbp.save_raw_pbp(page, season, game)

Takes the bytes page containing pbp information and saves to disk as a compressed zlib.

Parameters:
  • page – bytes. str(page) would yield a string version of the json pbp
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.scrape_pbp.scrape_game_pbp(season, game, force_overwrite=False)

This method scrapes the pbp for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If file exists already, won’t scrape again
Returns:

bool, False if not scraped, else True

scrapenhl2.scrape.scrape_pbp.scrape_game_pbp_from_html(season, game, force_overwrite=True)

This method scrapes the html pbp for the given game. Use for live games.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If file exists already, won’t scrape again
Returns:

bool, False if not scraped, else True

scrapenhl2.scrape.scrape_pbp.scrape_pbp_setup()

Creates raw pbp folders if need be

Returns:
scrapenhl2.scrape.scrape_pbp.scrape_season_pbp(season, force_overwrite=False)

Scrapes and parses pbp from the given season.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, rescrapes all games. If false, only previously unscraped ones
Returns:

nothing

Parse play by play

This module contains methods for parsing PBP.

scrapenhl2.scrape.parse_pbp.get_5v5_corsi_pm(season, game, cfca=None)

Returns a dataframe from home team perspective. Each row is a Corsi event, with time and note of whether it’s positive or negative for home team.

Parameters:
  • season – int, the season
  • game – int, the game
  • cfca – str, or None. If you specify ‘cf’, returns CF only. For CA, use ‘ca’. None returns CF - CA.
Returns:

dataframe with columns Time and HomeCorsi

scrapenhl2.scrape.parse_pbp.get_game_parsed_pbp_filename(season, game)

Returns the filename of the parsed pbp folder

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/parsed/pbp/[season]/[game].zlib

scrapenhl2.scrape.parse_pbp.get_parsed_pbp(season, game)

Loads the compressed json file containing this game’s play by play from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

json, the json pbp

scrapenhl2.scrape.parse_pbp.parse_game_pbp(season, game, force_overwrite=False)

Reads the raw pbp from file, updates player IDs, updates player logs, and parses the JSON to a pandas DF and writes to file. Also updates team logs accordingly.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

True if parsed, False if not

scrapenhl2.scrape.parse_pbp.parse_game_pbp_from_html(season, game, force_overwrite=False)

Reads the raw pbp from file, updates player IDs, updates player logs, and parses the JSON to a pandas DF and writes to file. Also updates team logs accordingly.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

True if parsed, False if not

scrapenhl2.scrape.parse_pbp.parse_pbp_setup()

Creates parsed pbp folders if need be

Returns:nothing
scrapenhl2.scrape.parse_pbp.parse_season_pbp(season, force_overwrite=False)

Parses pbp from the given season.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, parses all games. If false, only previously unparsed ones
Returns:

nothing

scrapenhl2.scrape.parse_pbp.read_events_from_page(rawpbp, season, game)

This method takes the json pbp and returns a pandas dataframe with the following columns:

  • Index: int, index of event
  • Period: str, period of event. In regular season, could be 1, 2, 3, OT, or SO. In playoffs, 1, 2, 3, 4, 5…
  • MinSec: str, m:ss, time elapsed in period
  • Time: int, time elapsed in game
  • Event: str, the event name
  • Team: int, the team id. Note that this is switched to blocked team for blocked shots to ease Corsi calculations.
  • Actor: int, the acting player id. Switched with recipient for blocks (see above)
  • ActorRole: str, e.g. for faceoffs there is a “Winner” and “Loser”. Switched with recipient for blocks (see above)
  • Recipient: int, the receiving player id. Switched with actor for blocks (see above)
  • RecipientRole: str, e.g. for faceoffs there is a “Winner” and “Loser”. Switched with actor for blocks (see above)
  • X: int, the x coordinate of event (or NaN)
  • Y: int, the y coordinate of event (or NaN)
  • Note: str, additional notes, which may include penalty duration, assists on a goal, etc.
Parameters:
  • rawpbp – json, the raw json pbp
  • season – int, the season
  • game – int, the game
Returns:

pandas dataframe, the pbp in a nicer format

scrapenhl2.scrape.parse_pbp.save_parsed_pbp(pbp, season, game)

Saves the pandas dataframe containing pbp information to disk as an HDF5.

Parameters:
  • pbp – df, a pandas dataframe with the pbp of the game
  • season – int, the season
  • game – int, the game
Returns:

nothing

Scrape TOI

This module contains methods for scraping TOI.

scrapenhl2.scrape.scrape_toi.get_game_raw_toi_filename(season, game)

Returns the filename of the raw toi folder

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/raw/toi/[season]/[game].zlib

scrapenhl2.scrape.scrape_toi.get_home_shiftlog_filename(season, game)

Returns the filename of the parsed toi html home shifts

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

str, /scrape/data/raw/pbp/[season]/[game]H.html

scrapenhl2.scrape.scrape_toi.get_home_shiftlog_url(season, game)

Gets the url for a page containing shift information for specified game from HTML tables for home team.

Parameters:
  • season – int, the season
  • game – int, the game

:return : str, e.g. http://www.nhl.com/scores/htmlreports/20072008/TH020001.HTM

scrapenhl2.scrape.scrape_toi.get_raw_html_toi(season, game, homeroad)

Loads the html file containing this game’s toi from disk.

Parameters:
  • season – int, the season
  • game – int, the game
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

str, the html toi

scrapenhl2.scrape.scrape_toi.get_raw_toi(season, game)

Loads the compressed json file containing this game’s shifts from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

dict, the json shifts

scrapenhl2.scrape.scrape_toi.get_road_shiftlog_filename(season, game)

Returns the filename of the parsed toi html road shifts

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/raw/pbp/[season]/[game]H.html

scrapenhl2.scrape.scrape_toi.get_road_shiftlog_url(season, game)

Gets the url for a page containing shift information for specified game from HTML tables for road team.

Parameters:
  • season – int, the season
  • game – int, the game

:return : str, e.g. http://www.nhl.com/scores/htmlreports/20072008/TV020001.HTM

scrapenhl2.scrape.scrape_toi.get_shift_url(season, game)

Gets the url for a page containing shift information for specified game from NHL API.

Parameters:
  • season – int, the season
  • game – int, the game

:return : str, http://www.nhl.com/stats/rest/shiftcharts?cayenneExp=gameId=[season]0[game]

scrapenhl2.scrape.scrape_toi.save_raw_toi(page, season, game)

Takes the bytes page containing shift information and saves to disk as a compressed zlib.

Parameters:
  • page – bytes. str(page) would yield a string version of the json shifts
  • season – int, the season
  • game – int, the game
Returns:

nothing

scrapenhl2.scrape.scrape_toi.save_raw_toi_from_html(page, season, game, homeroad)

Takes the bytes page containing shift information and saves to disk as html.

Parameters:
  • page – bytes. str(page) would yield a string version of the json shifts
  • season – int, he season
  • game – int, the game
  • homeroad – str, ‘H’ or ‘R’
Returns:

nothing

scrapenhl2.scrape.scrape_toi.scrape_game_toi(season, game, force_overwrite=False)

This method scrapes the toi for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If file exists already, won’t scrape again
Returns:

nothing

scrapenhl2.scrape.scrape_toi.scrape_game_toi_from_html(season, game, force_overwrite=True)

This method scrapes the toi html logs for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If file exists already, won’t scrape again
Returns:

nothing

scrapenhl2.scrape.scrape_toi.scrape_season_toi(season, force_overwrite=False)

Scrapes and parses toi from the given season.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, rescrapes all games. If false, only previously unscraped ones
Returns:

nothing

scrapenhl2.scrape.scrape_toi.scrape_toi_setup()

Creates raw toi folders if need be

Returns:
Parse TOI

This module contains methods for parsing TOI.

scrapenhl2.scrape.parse_toi.get_game_parsed_toi_filename(season, game)

Returns the filename of the parsed toi folder

Parameters:
  • season – int, current season
  • game – int, game
Returns:

str, /scrape/data/parsed/toi/[season]/[game].zlib

scrapenhl2.scrape.parse_toi.get_melted_home_road_5v5_toi(season, game)

Reads parsed TOI for this game, filters for 5v5 TOI, and melts from wide to long on player

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

(home_df, road_df), each with columns Time, PlayerID, and Team (which will be H or R)

scrapenhl2.scrape.parse_toi.get_parsed_toi(season, game)

Loads the compressed json file containing this game’s shifts from disk.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

json, the json shifts

scrapenhl2.scrape.parse_toi.parse_game_toi(season, game, force_overwrite=False)

Parses TOI from json for this game

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

nothing

scrapenhl2.scrape.parse_toi.parse_game_toi_from_html(season, game, force_overwrite=False)

Parses TOI from the html shift log from this game.

Parameters:
  • season – int, the season
  • game – int, the game
  • force_overwrite – bool. If True, will execute. If False, executes only if file does not exist yet.
Returns:

nothing

scrapenhl2.scrape.parse_toi.parse_season_toi(season, force_overwrite=False)

Parses toi from the given season. Final games covered only.

Parameters:
  • season – int, the season
  • force_overwrite – bool. If true, parses all games. If false, only previously unparsed ones
Returns:

scrapenhl2.scrape.parse_toi.parse_toi_setup()

Creates parsed toi folders if need be

Returns:
scrapenhl2.scrape.parse_toi.read_shifts_from_html_pages(rawtoi1, rawtoi2, teamid1, teamid2, season, game)

Aggregates information from two html pages given into a dataframe with one row per second and one col per player.

Parameters:
  • rawtoi1 – str, html page of shift log for team id1
  • rawtoi2 – str, html page of shift log for teamid2
  • teamid1 – int, team id corresponding to rawtoi1
  • teamid2 – int, team id corresponding to rawtoi1
  • season – int, the season
  • game – int, the game
Returns:

dataframe

scrapenhl2.scrape.parse_toi.read_shifts_from_page(rawtoi, season, game)

Turns JSON shift start-ends into TOI matrix with one row per second and one col per player

Parameters:
  • rawtoi – dict, json from NHL API
  • season – int, the season
  • game – int, the game
Returns:

dataframe

scrapenhl2.scrape.parse_toi.save_parsed_toi(toi, season, game)

Saves the pandas dataframe containing shift information to disk as an HDF5.

Parameters:
  • toi – df, a pandas dataframe with the shifts of the game
  • season – int, the season
  • game – int, the game
Returns:

nothing

Team information

This module contains methods related to the team info files.

scrapenhl2.scrape.team_info.add_team_to_info_file(teamid)

In case we come across teams that are not in the default list (1-110), use this method to add them to the file.

Parameters:teamid – int, the team ID
Returns:(tid, tabbrev, tname)
scrapenhl2.scrape.team_info.generate_team_ids_file(teamids=None)

Reads all team id URLs and stores information to disk. Has the following information:

  • ID: int
  • Abbreviation: str (three letters)
  • Name: str (full name)
Parameters:teamids – iterable of int. Tries to access team ids as listed in teamids. If not, goes from 1-110.
Returns:nothing
scrapenhl2.scrape.team_info.get_team_colordict()

Get the team color dictionary

Returns:a dictionary of IDs to tuples of hex colors
scrapenhl2.scrape.team_info.get_team_colors(team)

Returns primary and secondary color for this team.

Parameters:team – str or int, the team
Returns:tuple of hex colors
scrapenhl2.scrape.team_info.get_team_info_file()

Returns the team information dataframe from memory. This is stored as a feather file for fast read/write.

Returns:dataframe from /scrape/data/other/TEAM_INFO.feather
scrapenhl2.scrape.team_info.get_team_info_filename()

Returns the team information filename

Returns:str, /scrape/data/other/TEAM_INFO.feather
scrapenhl2.scrape.team_info.get_team_info_from_url(teamid)

Pulls ID, abbreviation, and name from the NHL API.

Parameters:teamid – int, the team ID
Returns:(id, abbrev, name)
scrapenhl2.scrape.team_info.get_team_info_url(teamid)

Gets the team url from the NHL API.

Parameters:teamid – int, the team ID
Returns:str, http://statsapi.web.nhl.com/api/v1/teams/[teamid]
scrapenhl2.scrape.team_info.team_as_id

A helper method. If team entered is int, returns that. If team is str, returns integer id of that team.

Parameters:team – int, or str
Returns:int, the team ID
scrapenhl2.scrape.team_info.team_as_str

A helper method. If team entered is str, returns that. If team is int, returns string name of that team.

Parameters:
  • team – int, or str
  • abbreviation – bool, whether to return 3-letter abbreviation or full name
Returns:

str, the team name

scrapenhl2.scrape.team_info.team_setup()

This method loads the team info df into memory

Returns:nothing
scrapenhl2.scrape.team_info.write_team_info_file(df)

Writes the team information file. This is stored as a feather file for fast read/write.

Parameters:df – the (team information) dataframe to write to file
Returns:nothing
Teams

This module contains method related to team logs.

scrapenhl2.scrape.teams.get_team_pbp(season, team)

Returns the pbp of given team in given season across all games.

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

df, the pbp of given team in given season

scrapenhl2.scrape.teams.get_team_pbp_filename(season, team)

Returns filename of the PBP log for this team and season

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

scrapenhl2.scrape.teams.get_team_toi(season, team)

Returns the toi of given team in given season across all games.

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

df, the toi of given team in given season

scrapenhl2.scrape.teams.get_team_toi_filename(season, team)

Returns filename of the TOI log for this team and season

Parameters:
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

scrapenhl2.scrape.teams.team_setup()

Creates team log-related folders.

Returns:nothing
scrapenhl2.scrape.teams.update_team_logs(season, force_overwrite=False, force_games=None)

This method looks at the schedule for the given season and writes pbp for scraped games to file. It also adds the strength at each pbp event to the log. It only includes games that have both PBP and TOI.

Parameters:
  • season – int, the season
  • force_overwrite – bool, whether to generate from scratch
  • force_games – None or iterable of games to force_overwrite specifically
Returns:

nothing

scrapenhl2.scrape.teams.write_team_pbp(pbp, season, team)

Writes the given pbp dataframe to file.

Parameters:
  • pbp – df, the pbp of given team in given season
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

nothing

scrapenhl2.scrape.teams.write_team_toi(toi, season, team)

Writes team TOI log to file

Parameters:
  • toi – df, team toi for this season
  • season – int, the season
  • team – int or str, the team abbreviation.
Returns:

Manipulate

The scrapenhl2.manipulate module contains methods useful for scraping.

Useful examples

Add on-ice players to a file:

from scrapenhl.manipulate import add_onice_players as onice
onice.add_players_to_file('/Users/muneebalam/Downloads/zone_entries.csv', 'WSH', time_format='elapsed')
# Will output zone_entries_on-ice.csv in Downloads, with WSH players and opp players on-ice listed.

See documentation below for more information and additional arguments to add_players_to_file.

Methods

General
scrapenhl2.manipulate.manipulate.add_score_adjustment_to_team_pbp(df)

Adds AdjFF and AdjFA

Parameters:df – dataframe
Returns:dataframe with extra columns
scrapenhl2.manipulate.manipulate.convert_to_all_combos(df, fillval=0, *args)

This method takes a dataframe and makes sure all possible combinations of given arguments are present. For example, if you want df to have all combos of P1 and P2, it will create a dataframe with all possible combos, left join existing dataframe onto that, and return that df. Uses fillval to fill all non-key columns.

Parameters:
  • df – the pandas dataframe
  • fillval – obj, the value with which to fill. Default fill is 0
  • args – str, column names, or tuples of combinations of column names
Returns:

df with all combos of columns specified

scrapenhl2.manipulate.manipulate.count_by_keys(df, *args)

A convenience method that isolates specified columns in the dataframe and gets counts. Drops when keys have NAs.

Parameters:
  • df – dataframe
  • args – str, column names in dataframe
Returns:

df, dataframe with each of args and an additional column, Count

scrapenhl2.manipulate.manipulate.filter_for_corsi(pbp)

Filters given dataframe for goal, shot, miss, and block events

Parameters:pbp – a dataframe with column Event
Returns:pbp, filtered for corsi events
scrapenhl2.manipulate.manipulate.filter_for_event_types(pbp, eventtype)

Filters given dataframe for event type(s) specified only.

Parameters:
  • pbp – dataframe. Need a column titled Event
  • eventtype – str or iterable of str, e.g. Goal, Shot, etc
Returns:

dataframe, filtered

scrapenhl2.manipulate.manipulate.filter_for_fenwick(pbp)

Filters given dataframe for SOG only.

Parameters:pbp – dataframe. Need a column titled Event
Returns:dataframe. Only rows where Event == ‘Goal’ or Event == ‘Shot’
scrapenhl2.manipulate.manipulate.filter_for_five_on_five(df)

Filters given dataframe for 5v5 rows

Parameters:df – dataframe, columns HomeStrength + RoadStrength or TeamStrength + OppStrength
Returns:dataframe
scrapenhl2.manipulate.manipulate.filter_for_goals(pbp)

Filters given dataframe for goals only.

Parameters:pbp – dataframe. Need a column titled Event
Returns:dataframe. Only rows where Event == ‘Goal’
scrapenhl2.manipulate.manipulate.filter_for_sog(pbp)

Filters given dataframe for SOG only.

Parameters:pbp – dataframe. Need a column titled Event
Returns:dataframe. Only rows where Event == ‘Goal’ or Event == ‘Shot’
scrapenhl2.manipulate.manipulate.filter_for_team(pbp, team)

Filters dataframe for rows where Team == team

Parameters:
  • pbp – dataframe. Needs to have column Team
  • team – int or str, team ID or name
Returns:

dataframe with rows filtered

scrapenhl2.manipulate.manipulate.generate_5v5_player_log(season)

Takes the play by play and adds player 5v5 info to the master player log file, noting TOI, CF, etc. This takes awhile because it has to calculate TOICOMP. :param season: int, the season :return: nothing

scrapenhl2.manipulate.manipulate.generate_player_toion_toioff(season)

Generates TOION and TOIOFF at 5v5 for each player in this season. :param season: int, the season :return: df with columns Player, TOION, TOIOFF, and TOI60.

scrapenhl2.manipulate.manipulate.generate_toicomp(season)

Generates toicomp at a player-game level :param season: int, the season :return: df,

scrapenhl2.manipulate.manipulate.get_5v5_player_game_cfca(season, team)

Gets CFON, CAON, CFOFF, and CAOFF by game for given team in given season.

Parameters:
  • season – int, the season
  • team – int, team id
Returns:

df with game, player, CFON, CAON, CFOFF, and CAOFF

scrapenhl2.manipulate.manipulate.get_5v5_player_game_gfga(season, team)

Gets GFON, GAON, GFOFF, and GAOFF by game for given team in given season.

Parameters:
  • season – int, the season
  • team – int, team id
Returns:

df with game, player, GFON, GAON, GFOFF, and GAOFF

scrapenhl2.manipulate.manipulate.get_5v5_player_game_shift_startend(season, team)

Generates shift starts and ends for shifts that start and end at 5v5–OZ, DZ, NZ, OtF.

Parameters:
  • season – int, the season
  • team – int or str, the team
Returns:

dataframe with shift starts and ends

scrapenhl2.manipulate.manipulate.get_5v5_player_game_toi(season, team)

Gets TOION and TOIOFF by game and player for given team in given season. :param season: int, the season :param team: int, team id :return: df with game, player, TOION, and TOIOFF

scrapenhl2.manipulate.manipulate.get_5v5_player_game_toicomp(season, team)

Calculates data for QoT and QoC at a player-game level for given team in given season. :param season: int, the season :param team: int, team id :return: df with game, player,

scrapenhl2.manipulate.manipulate.get_5v5_player_log(season, force_create=False)
Parameters:
  • season – int, the season
  • force_create – bool, create from scratch even if it exists?
Returns:

scrapenhl2.manipulate.manipulate.get_5v5_player_log_filename(season)
Parameters:season – int, the season
Returns:
scrapenhl2.manipulate.manipulate.get_5v5_player_season_toi(season, team)

Gets TOION and TOIOFF by player for given team in given season. :param season: int, the season :param team: int, team id :return: df with game, player, TOION, and TOIOFF

scrapenhl2.manipulate.manipulate.get_directions_for_xy_for_game(season, game)

It doesn’t seem like there are rules for whether positive X in XY event locations corresponds to offensive zone events, for example. Best way is to use fields in the the json.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

dict indicating which direction home team is attacking by period

scrapenhl2.manipulate.manipulate.get_directions_for_xy_for_season(season, team)

Gets directions for team specified using get_directions_for_xy_for_game

Parameters:
  • season – int, the season
  • team – int or str, the team
Returns:

dataframe

scrapenhl2.manipulate.manipulate.get_game_h2h_corsi(season, games, cfca=None)

This method gets H2H Corsi at 5v5 for the given game(s).

Parameters:
  • season – int, the season
  • games – int, the game, or list of int, the games
  • cfca – str, or None. If you specify ‘cf’, returns CF only. For CA, use ‘ca’. None returns CF - CA.
Returns:

a df with [P1, P1Team, P2, P2Team, CF, CA, C+/-]. Entries will be duplicated, as with get_game_h2h_toi.

scrapenhl2.manipulate.manipulate.get_game_h2h_toi(season, games)

This method gets H2H TOI at 5v5 for the given game.

Parameters:
  • season – int, the season
  • games – int, the game, or list of int, the games
Returns:

a df with [P1, P1Team, P2, P2Team, TOI]. Entries will be duplicated (one with given P as P1, another as P2)

scrapenhl2.manipulate.manipulate.get_line_combos(season, game, homeroad='H')

Returns a df listing the 5v5 line combinations used in this game for specified team, and time they each played together

Parameters:
  • season – int, the game
  • game – int, the season
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

pandas dataframe with columns P1, P2, P3, Secs. May contain duplicates

scrapenhl2.manipulate.manipulate.get_micah_score_adjustment()

See http://hockeyviz.com/txt/senstats

Returns:dataframe: HomeScoreDiff, HomeFFWeight, and HomeFAWeight
scrapenhl2.manipulate.manipulate.get_pairings(season, game, homeroad='H')

Returns a df listing the 5v5 pairs used in this game for specified team, and time they each played together

Parameters:
  • season – int, the game
  • game – int, the season
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

pandas dataframe with columns P1, P2, Secs. May contain duplicates

scrapenhl2.manipulate.manipulate.get_pbp_events(*args, **kwargs)

A general method that yields a generator of dataframes of PBP events subject to given limitations.

Keyword arguments are applied as “or” conditions for each individual keyword (e.g. multiple teams) but as “and” conditions otherwise.

The non-keyword arguments are event types subject to “or” conditions:

  • ‘fac’ or ‘faceoff’
  • ‘shot’ or ‘sog’ or ‘save’
  • ‘hit’
  • ‘stop’ or ‘stoppage’
  • ‘block’ or ‘blocked shot’
  • ‘miss’ or ‘missed shot’
  • ‘give’ or ‘giveaway’
  • ‘take’ or ‘takeaway’
  • ‘penl’ or ‘penalty’
  • ‘goal’
  • ‘period end’
  • ‘period official’
  • ‘period ready’
  • ‘period start’
  • ‘game scheduled’
  • ‘gend’ or ‘game end’
  • ‘shootout complete’
  • ‘chal’ or ‘official challenge’
  • ‘post’, which is not an officially designated event but will be searched for

Dataframes are returned season-by-season to save on memory. If you want to operate on all seasons, process this data before going to the next season.

Defaults to return all regular-season and playoff events by all teams.

Supported keyword arguments:

  • add_on_ice: bool. If True, adds on-ice players for each time.
  • players_on_ice: str or int, or list of them, player IDs or names of players on ice for event.
  • players_on_ice_for: like players_on_ice, but players must be on ice for team that “did” event.
  • players_on_ice_ag: like players_on_ice, but players must be on ice for opponent of team that “did” event.
  • team, str or int, or list of them. Teams to filter for.
  • team_for, str or int, or list of them. Team that committed event.
  • team_ag, str or int, or list of them. Team that “received” event.
  • home_team: str or int, or list of them. Home team.
  • road_team: str or int, or list of them. Road team.
  • start_date: str or date, will only return data on or after this date. YYYY-MM-DD
  • end_date: str or date, will only return data on or before this date. YYYY-MM-DD
  • start_season: int, will only return events in or after this season. Defaults to 2010-11.
  • end_season: int, will only return events in or before this season. Defaults to current season.
  • season_type: int or list of int. 1 for preseason, 2 for regular, 3 for playoffs, 4 for ASG, 6 for Oly, 8 for WC.
    Defaults to 2 and 3.
  • start_game: int, start game. Applies only to start season. Game ID will be this, or greater.
  • end_game: int, end game. Applies only to end season. Game ID will be this, or smaller.
  • acting_player: str or int, or list of them, players who committed event (e.g. took a shot).
  • receiving_player: str or int, or list of them, players who received event (e.g. took a hit).
  • strength_hr: tuples or list of them, e.g. (5, 5) or ((5, 5), (4, 4), (3, 3)). This is (Home, Road).
    If neither strength_hr nor strength_to is specified, uses 5v5.
  • strength_to: tuples or list of them, e.g. (5, 5) or ((5, 5), (4, 4), (3, 3)). This is (Team, Opponent).
    If neither strength_hr nor strength_to is specified, uses 5v5.
  • score_diff: int or list of them, acceptable score differences (e.g. 0 for tied, (1, 2, 3) for up by 1-3 goals)
  • start_time: int, seconds elapsed in game. Events returned will be after this.
  • end_time: int, seconds elapsed in game. Events returned will be before this.
Parameters:
  • args – str, event types to search for (applied “OR”, not “AND”)
  • kwargs – keyword arguments specifying filters (applied “AND”, not “OR”)
Returns:

df, a pandas dataframe

scrapenhl2.manipulate.manipulate.get_player_positions()

Use to get player positions :return: df with colnames ID and position

scrapenhl2.manipulate.manipulate.get_player_toi(season, game, pos=None, homeroad='H')

Returns a df listing 5v5 ice time for each player for specified team.

Parameters:
  • season – int, the game
  • game – int, the season
  • pos – specify ‘L’, ‘C’, ‘R’, ‘D’ or None for all
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

pandas df with columns Player, Secs

scrapenhl2.manipulate.manipulate.get_player_toion_toioff_file(season, force_create=False)
Parameters:
  • season – int, the season
  • force_create – bool, should this be read from file if possible, or created from scratch
Returns:

scrapenhl2.manipulate.manipulate.get_player_toion_toioff_filename(season)
Parameters:season – int, the season
Returns:
scrapenhl2.manipulate.manipulate.get_toicomp_file(season, force_create=False)

If you want to rewrite the TOI60 file, too, then run get_player_toion_toioff_file with force_create=True before running this method. :param season: int, the season :param force_create: bool, should this be read from file if possible, or created from scratch :return:

scrapenhl2.manipulate.manipulate.get_toicomp_filename(season)
Parameters:season – int, the season
Returns:
scrapenhl2.manipulate.manipulate.infer_zones_for_faceoffs(df, directions, xcol='X', ycol='Y', timecol='Time', focus_team=None, season=None, faceoffs=True)

Inferring zones for events from XY is hard–this method takes are of that by referencing against the JSON’s notes on which team went which direction in which period.

This method notes several different zones for faceoffs:

  • OL (offensive zone, left)
  • OR (offensive zone, right)
  • NOL (neutral zone, near offensive blueline, left)
  • NOR (neutral zone, near offensive blueline, right)
  • NDL (neutral zone, near defensive blueline, left)
  • NDR (neutral zone, near defensive blueline, right)
  • DL (defensive zone, left)
  • DR (defensive zone, right)
  • N (center ice)

This method can also handle non-faceoff events, using three zones

  • N (neutral)
  • O (offensive)
  • D (defensive)
Parameters:
  • df – dataframe with columns Game, specified xcol, and specified ycol
  • directions – dataframe with columns Game, Period, and Direction (‘left’ or ‘right’)
  • xcol – str, the column containing X coordinates in df
  • ycol – str, the column containing Y coordinates in df
  • timecol – str, the column containing the time in seconds.
  • focus_team – int, str, or None. Directions are stored with home perspective. So specify focus team and will flip when focus_team is on the road. If None, does not do the extra home/road flip. Necessitates Season column in df.
  • season – int, the season
  • faceoffs – bool. If True will use the nine zones above. If False, only the three.
Returns:

dataframe with extra column EventLoc

scrapenhl2.manipulate.manipulate.merge_onto_all_team_games_and_zero_fill(df, season, team)

A method that gets all team games from this season and left joins df onto it on game, then zero fills NAs. Makes sure you didn’t miss any games and get NAs later.

Parameters:
  • df – dataframe with columns Game and PlayerID or Player
  • season – int, the season
  • team – int or str, the team
Returns:

dataframe

scrapenhl2.manipulate.manipulate.player_columns_to_name(df, columns=None)

Takes a dataframe and transforms specified columns of player IDs into names. If no columns provided, searches for defaults: H1, H2, H3, H4, H5, H6, HG (and same seven with R)

Parameters:
  • df – A dataframe
  • columns – a list of strings, or None
Returns:

df, dataframe with same column names, but columns now names instead of IDs

scrapenhl2.manipulate.manipulate.save_5v5_player_log(df, season)
Parameters:season – int, the season
Returns:nothing
scrapenhl2.manipulate.manipulate.save_player_toion_toioff_file(df, season)
Parameters:
  • df
  • season – int, the season
Returns:

scrapenhl2.manipulate.manipulate.save_toicomp_file(df, season)
Parameters:
  • df
  • season – int, the season
Returns:

scrapenhl2.manipulate.manipulate.team_5v5_score_state_summary_by_game(season)

Uses the team TOI log to group by team and game and score state for this season. 5v5 only.

Parameters:season – int, the season
Returns:dataframe, grouped by team, strength, and game
scrapenhl2.manipulate.manipulate.team_5v5_shot_rates_by_score(season)

Uses the team TOI and PBP logs to group by team and game and score state for this season. 5v5 only.

Parameters:season – int, the season
Returns:dataframe, grouped by team, strength, and game. Also columns for TOI, CF, and CA
scrapenhl2.manipulate.manipulate.time_to_mss(sectime)

Converts a number of seconds to m:ss format

Parameters:sectime – int, a number of seconds
Returns:str, sectime in m:ss
Add on-ice players

Add on-ice players to a file by specifying filename and columns from which to infer time elapsed in game.

scrapenhl2.manipulate.add_onice_players.add_onice_players_to_df(df, focus_team, season, gamecol, player_output='ids')

Uses the _Secs column in df, the season, and the gamecol to join onto on-ice players.

Parameters:
  • df – dataframe
  • focus_team – str or int, team to focus on. Its players will be listed in first in sheet.
  • season – int, the season
  • gamecol – str, the column with game IDs
  • player_output – str, use ‘names’ or ‘nums’ or ‘ids’. Currently ‘nums’ is not supported.
Returns:

dataframe with team and opponent players

scrapenhl2.manipulate.add_onice_players.add_players_to_file(filename, focus_team, season=None, gamecol='Game', periodcol='Period', timecol='Time', time_format='elapsed', update_data=False, player_output='names')

Adds names of on-ice players to the end of each line, and writes to file in the same folder as input file. Specifically, adds 1 second to the time in the spreadsheet and adds players who were on the ice at that time.

You cannot necessarily trust results when times coincide with stoppages–and it’s worth checking faceoffs as well.

Parameters:
  • filename – str, the file to read. Will save output as this filename but ending in “on-ice.csv”
  • focus_team – str or int, e.g. ‘WSH’ or ‘WPG’
  • season – int. For 2007-08, use 2007. Defaults to current season.
  • gamecol – str. The column holding game IDs (e.g. 20001). By default, looks for column called “Game”
  • periodcol – str. The column holding period number/name (1, 2, 3, 4 or OT, etc). By default: “Period”
  • timecol – str. The column holding time in period in M:SS format.
  • time_format – str, how to interpret timecol. Use ‘elapsed’ or ‘remaining’. E.g. the start of a period is 0:00 with elapsed and 20:00 in remaining.
  • update_data – bool. If True, will autoupdate() data for given season. If not, will not update game data. Use when file includes data from games not already scraped.
  • player_output – str, use ‘names’ or ‘nums’. Currently only supports ‘names’
Returns:

nothing

scrapenhl2.manipulate.add_onice_players.add_times_to_file(df, periodcol, timecol, time_format)

Uses specified periodcol, timecol, and time_format col to calculate _Secs, time elapsed in game.

Parameters:
  • df – dataframe
  • periodcol – str, the column that holds period name/number (1, 2, 3, 4 or OT, etc)
  • timecol – str, the column that holds time in m:ss format
  • time_format – use ‘elapsed’ (preferred) or ‘remaining’. This refers to timecol: e.g. 120 secs elapsed in the 2nd period might be listed as 2:00 in timecol, or as 18:00.
Returns:

dataframe with extra column _Secs, time elapsed in game.

TOI and Corsi for combinations of players

This module contains methods for generating H2H data for games

scrapenhl2.manipulate.combos.get_game_combo_corsi(season, game, player_n=2, cfca=None, *hrcodes)

This method gets H2H Corsi at 5v5 for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • player_n – int. E.g. 1 gives you a list of players and TOI, 2 gives you h2h, 3 gives you groups of 3, etc.
  • cfca – str, or None. If you specify ‘cf’, returns CF only. For CA, use ‘ca’. None returns CF - CA.
  • hrcodes – to limit exploding joins, specify strings containing ‘H’ and ‘R’ and ‘A’, each of length player_n For example, if player_n=3, specify ‘HHH’ to only get home team player combos. If this is left unspecified, will do all combos, which can be problematic when player_n > 3. ‘R’ for road, ‘H’ for home, ‘A’ for all (both)
Returns:

a df with [P1, P1Team, P2, P2Team, TOI, etc]. Entries will be duplicated.

scrapenhl2.manipulate.combos.get_game_combo_toi(season, game, player_n=2, *hrcodes)

This method gets H2H TOI at 5v5 for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • player_n – int. E.g. 1 gives you a list of players and TOI, 2 gives you h2h, 3 gives you groups of 3, etc.
  • hrcodes – to limit exploding joins, specify strings containing ‘H’ and ‘R’ and ‘A’, each of length player_n For example, if player_n=3, specify ‘HHH’ to only get home team player combos. If this is left unspecified, will do all combos, which can be problematic when player_n > 3. ‘R’ for road, ‘H’ for home, ‘A’ for all (both)
Returns:

a df with [P1, P1Team, P2, P2Team, TOI, etc]. Entries will be duplicated.

scrapenhl2.manipulate.combos.get_team_combo_corsi(season, team, games, n_players=2)

Gets combo Corsi for team for specified games

Parameters:
  • season – int, the season
  • team – int or str, team
  • games – int or iterable of int, games
  • n_players – int. E.g. 1 gives you player TOI, 2 gives you 2-player group TOI, 3 makes 3-player groups, etc
Returns:

dataframe

scrapenhl2.manipulate.combos.get_team_combo_toi(season, team, games, n_players=2)

Gets 5v5 combo TOI for team for specified games

Parameters:
  • season – int, the season
  • team – int or str, team
  • games – int or iterable of int, games
  • n_players – int. E.g. 1 gives you player TOI, 2 gives you 2-player group TOI, 3 makes 3-player groups, etc
Returns:

dataframe

Plot

The scrapenhl2.plot module contains methods useful for plotting.

Useful examples

First, import:

from scrapenhl2.plot import *

Get the H2H for an in-progress game:

live_h2h('WSH', 'EDM')
_images/example_h2h.png

Get the Corsi timeline as well, but don’t update data this time:

live_timeline('WSH', 'EDM', update=False)
_images/example_timeline.png

Save the timeline of a memorable game to file:

game_timeline(2016, 30136, save_file='/Users/muneebalam/Desktop/WSH_TOR_G6.png')

More methods being added regularly.

App

This package contains a lightweight app for browsing charts and doing some data manipulations.

Launch using:

import scrapenhl2.plot.app as app
app.browse_game_charts()
# app.browse_player_charts()
# app.browse_team_charts()

It will print a link in your terminal–follow it. The page looks something like this:

_images/game_page_screenshot.png

The dropdowns also allow you to search–just start typing.

Methods (games)

Game H2H
_images/WSH-TOR_G6.png

This module contains methods for creating a game H2H chart.

scrapenhl2.plot.game_h2h.game_h2h(season, game, save_file=None)

Creates the grid H2H charts seen on @muneebalamcu

Parameters:
  • season – int, the season
  • game – int, the game
  • save_file – str, specify a valid filepath to save to file. If None, merely shows on screen.
Returns:

nothing

scrapenhl2.plot.game_h2h.live_h2h(team1, team2, update=True, save_file=None)

A convenience method that updates data then displays h2h for most recent game between specified tams.

Parameters:
  • team1 – str or int, team
  • team2 – str or int, other team
  • update – bool, should data be updated first?
  • save_file – str, specify a valid filepath to save to file. If None, merely shows on screen.
Returns:

nothing

Corsi timeline
_images/WSH-TOR_G6_timeline.png

This module has methods for creating a game corsi timeline.

scrapenhl2.plot.game_timeline.game_timeline(season, game, save_file=None)

Creates a shot attempt timeline as seen on @muneebalamcu

Parameters:
  • season – int, the season
  • game – int, the game
  • save_file – str, specify a valid filepath to save to file. If None, merely shows on screen. Specify ‘fig’ to return the figure
Returns:

nothing, or the figure

scrapenhl2.plot.game_timeline.get_goals_for_timeline(season, game, homeroad, granularity='sec')

Returns a list of goal times

Parameters:
  • season – int, the season
  • game – int, the game
  • homeroad – str, ‘H’ for home and ‘R’ for road
  • granularity – can respond in minutes (‘min’), or seconds (‘sec’), elapsed in game
Returns:

a list of int, seconds elapsed

scrapenhl2.plot.game_timeline.live_timeline(team1, team2, update=True, save_file=None)

A convenience method that updates data then displays timeline for most recent game between specified tams.

Parameters:
  • team1 – str or int, team
  • team2 – str or int, other team
  • update – bool, should data be updated first?
  • save_file – str, specify a valid filepath to save to file. If None, merely shows on screen.
Returns:

nothing

Methods (teams)

Team TOI by score
_images/Score_states_2015.png

This module contains methods for making a stacked bar graph indicating how much TOI each team spends in score states.

scrapenhl2.plot.team_score_state_toi.get_score_state_graph_title(season)
Parameters:season – int, the season
Returns:
scrapenhl2.plot.team_score_state_toi.score_state_graph(season)

Generates a horizontal stacked bar graph showing how much 5v5 TOI each team has played in each score state for given season.

Parameters:season – int, the season
Returns:
Team lineup CF%
_images/Caps_lineup_cf.png

This module contains methods to generate a graph showing player CF%. 18 little graphs, 1 for each of 18 players.

scrapenhl2.plot.team_lineup_cf.team_lineup_cf_graph(team, **kwargs)

This method builds a 4x5 matrix of rolling CF% line graphs. The left 4x3 matrix are forward lines and the top-right 3x2 are defense pairs.

Parameters:
  • team – str or id, team to build this graph for
  • kwargs – need to specify the following as iterables of names: l1, l2, l3, l4, p1, p2, p3. Three players for each of the ‘l’s and two for each of the ‘p’s.
Returns:

figure, or nothing

Team shot rates by score
_images/Caps_shot_score_parallel.png _images/Caps_shot_rates_score_scatter.png

This module creates a scatterplot for specified team with shot attempt rates versus league median from down 3 to up 3.

scrapenhl2.plot.team_score_shot_rate.team_score_shot_rate_parallel(team, startseason, endseason=None, save_file=None)
Parameters:
  • team
  • startseason
  • endseason
  • save_file
Returns:

scrapenhl2.plot.team_score_shot_rate.team_score_shot_rate_scatter(team, startseason, endseason=None, save_file=None)
Parameters:
  • team – str or int, team
  • startseason – int, the starting season (inclusive)
  • endseason – int, the ending season (inclusive)
Returns:

nothing

Methods (individuals)

Player rolling CF and GF
_images/Ovechkin_rolling_cf.png

This module creates rolling CF% and GF% charts

scrapenhl2.plot.rolling_cf_gf.rolling_player_cf(player, **kwargs)

Creates a graph with CF% and CF% off. Defaults to roll_len of 25.

Parameters:
  • player – str or int, player to generate for
  • kwargs – other filters. See scrapenhl2.plot.visualization_helper.get_and_filter_5v5_log for more information.
Returns:

nothing, or figure

scrapenhl2.plot.rolling_cf_gf.rolling_player_gf(player, **kwargs)

Creates a graph with GF% and GF% off. Defaults to roll_len of 40.

Parameters:
  • player – str or int, player to generate for
  • kwargs – other filters. See scrapenhl2.plot.visualization_helper.get_and_filter_5v5_log for more information.
Returns:

nothing, or figure

Player rolling boxcars
_images/Oshie_boxcars.png

This module contains methods for creating the rolling boxcars stacked area graph.

scrapenhl2.plot.rolling_boxcars.calculate_boxcar_rates(df)

Takes the given dataframe and makes the following calculations:

  • Divides col ending in GFON, iA2, iA1, and iG by one ending in TOI
  • Adds iG to iA1, calls result iP1
  • Adds iG and iA1 to iA2, calls result iP
  • Adds /60 to ends of iG, iA1, iP1, iA2, iP, and GFON
Parameters:df – dataframe
Returns:dataframe with columns changed as specified, and only those mentioned above selected.
scrapenhl2.plot.rolling_boxcars.rolling_player_boxcars(player, **kwargs)

A method to generate the rolling boxcars graph.

Parameters:
  • player – str or int, player to generate for
  • kwargs – other filters. See scrapenhl2.plot.visualization_helper.get_and_filter_5v5_log for more information.
Returns:

nothing, or figure

Methods (individual comparisons)

Team D-pair shot rates
_images/Caps_d_pairs.png

This module contains methods for creating a scatterplot of team defense pair shot rates.

scrapenhl2.plot.defense_pairs.drop_duplicate_pairs(rates)

The shot rates dataframe has duplicates–e.g. in one row Orlov is PlayerID1 and Niskanen PlayerID2, but in another Niskanen is PlayerID1 and Orlov is playerID2. This method will select only one, using the following rules:

  • For mixed-hand pairs, pick the one where P1 is the lefty and P2 is the righty
  • For other pairs, arrange by PlayerID. The one with the smaller ID is P1 and the larger, P2.
Parameters:rates – dataframe as created by get_dpair_shot_rates
Returns:dataframe, rates with half of rows dropped
scrapenhl2.plot.defense_pairs.get_dpair_shot_rates(team, startdate, enddate)

Gets CF/60 and CA/60 by defenseman duo (5v5 only) for this team between given range of dates

Parameters:
  • team – int or str, team
  • startdate – str, start date
  • enddate – str, end date (inclusive)
Returns:

dataframe with PlayerID1, PlayerID2, CF, CA, TOI (in secs), CF/60 and CA/60

scrapenhl2.plot.defense_pairs.team_dpair_shot_rates_scatter(team, min_pair_toi=50, **kwargs)

Creates a scatterplot of team defense pair shot attempr rates.

Parameters:
  • team – int or str, team
  • min_pair_toi – int, number of minutes for pair to qualify
  • kwargs – Use season- or date-range-related kwargs only.
Returns:

Usage

This module creates static and animated usage charts.

scrapenhl2.plot.usage.animated_usage_chart(**kwargs)
Parameters:kwargs
Returns:
scrapenhl2.plot.usage.parallel_coords_team_comparison(**kwargs)
Parameters:kwargs
Returns:nothing, or figure
scrapenhl2.plot.usage.parallel_usage_chart(**kwargs)
Parameters:kwargs – Defaults to take last month of games for all teams.
Returns:nothing, or figure

Helper methods

This method contains utilities for visualization.

scrapenhl2.plot.visualization_helper.add_cfpct_ref_lines_to_plot(ax, refs=None)

Adds reference lines to specified axes. For example, it could add 50%, 55%, and 45% CF% lines.

50% has the largest width and is solid. 40%, 60%, etc will be dashed with medium width. Other numbers will be dotted and have the lowest width.

Also adds little labels in center of pictured range.

Parameters:
  • ax – axes. CF should be on the X axis and CA on the Y axis.
  • refs – None, or a list of percentages (e.g. [45, 50, 55]). Defaults to every 5% from 35% to 65%
Returns:

nothing

scrapenhl2.plot.visualization_helper.add_good_bad_fast_slow(margin=0.05, bottomleft='Slower', bottomright='Better', topleft='Worse', topright='Faster')

Adds better, worse, faster, slower, to current matplotlib plot. CF60 should be on the x-axis and CA60 on the y-axis. Also expands figure limits by margin (default 5%). That means you should use this before using, say, add_cfpct_ref_lines_to_plot.

Parameters:
  • margin – expand figure limits by margin. Defaults to 5%.
  • bottomleft – label to put in bottom left corner
  • bottomright – label to put in bottom right corner
  • topleft – label to put in top left corner
  • topright – label to put in top right corner
Returns:

nothing

scrapenhl2.plot.visualization_helper.filter_5v5_for_player(df, **kwargs)

This method filters the given dataframe for given player(s), if specified

Parameters:
  • df – dataframe
  • kwargs – relevant one is player
Returns:

dataframe, filtered for specified players

scrapenhl2.plot.visualization_helper.filter_5v5_for_team(df, **kwargs)

This method filters the given dataframe for given team(s), if specified

Parameters:
  • df – dataframe
  • kwargs – relevant one is team
Returns:

dataframe, filtered for specified players

scrapenhl2.plot.visualization_helper.filter_5v5_for_toi(df, **kwargs)

This method filters the given dataframe for minimum or max TOI or TOI60.

This method groups at the player level. So if a player hits the minimum total but not for one or more teams they played for over the the relevant time period, they will be included.

Parameters:
  • df – dataframe
  • kwargs – relevant ones are min_toi, max_toi, min_toi60, and max_toi60
Returns:

dataframe, filtered for specified players

scrapenhl2.plot.visualization_helper.format_number_with_plus(stringnum)

Converts 0 to 0, -1 to -1, and 1 to +1 (for presentation purposes).

Parameters:stringnum – int
Returns:str, transformed as specified above.
scrapenhl2.plot.visualization_helper.generic_5v5_log_graph_title(figtype, **kwargs)

Generates a figure title incorporating parameters from kwargs:

[Fig type] for [player, or multiple players, or team] [date range] [rolling window, if applicable] [TOI range, if applicable] [TOI60 range, if applicable]

Methods for individual graphs can take this list and arrange as necessary.

Parameters:
  • figtype – str brief description, e.g. Rolling CF% or Lineup CF%
  • kwargs – See get_and_filter_5v5_log
Returns:

list of strings

scrapenhl2.plot.visualization_helper.get_5v5_df_start_end(**kwargs)

This method retrieves the correct years of the 5v5 player log and concatenates them.

Parameters:kwargs – the relevant ones here are startseason and endseason
Returns:dataframe
scrapenhl2.plot.visualization_helper.get_and_filter_5v5_log(**kwargs)

This method retrieves the 5v5 log and filters for keyword arguments provided to the original method. For example, rolling_player_cf calls this method first.

Currently supported keyword arguments:

  • startseason: int, the season to start with. Defaults to current - 3.
  • startdate: str, yyyy-mm-dd. Defaults to Sep 15 of startseason
  • endseason: int, the season to end with (inclusive). Defaults to current
  • enddate: str, yyyy-mm-dd. Defaults to June 21 of endseason + 1
  • roll_len: int, calculates rolling sums over this variable.
  • roll_len_days: int, calculates rolling sum over this time window
  • player: int or str, player ID or name
  • players: list of int or str, player IDs or names
  • min_toi: float, minimum TOI for a player for inclusion in minutes.
  • max_toi: float, maximum TOI for a player for inclusion in minutes.
  • min_toi60: float, minimum TOI60 for a player for inclusion in minutes.
  • max_toi60: float, maximum TOI60 for a player for inclusion in minutes.
  • team: int or str, filter data for this team only
  • add_missing_games: bool. If True will add in missing rows for missing games. Must also specify team.

Developer’s note: when adding support for new kwargs, also add support in _generic_graph_title

Parameters:kwargs – e.g. startseason, endseason.
Returns:df, filtered
scrapenhl2.plot.visualization_helper.get_enddate_from_kwargs(**kwargs)

Returns 6/21 of endseason + 1, or enddate

scrapenhl2.plot.visualization_helper.get_line_slope_intercept(x1, y1, x2, y2)

Returns slope and intercept of lines defined by given coordinates

scrapenhl2.plot.visualization_helper.get_startdate_enddate_from_kwargs(**kwargs)

Returns startseason and endseason kwargs. Defaults to current - 3 and current

scrapenhl2.plot.visualization_helper.hex_to_rgb(value, maxval=256)

Return (red, green, blue) for the hex color given as #rrggbb.

scrapenhl2.plot.visualization_helper.insert_missing_team_games(df, **kwargs)
Parameters:
  • df – dataframe, 5v5 player log or part of it
  • kwargs – relevant ones are ‘team’ and ‘add_missing_games’
Returns:

dataframe with added rows

scrapenhl2.plot.visualization_helper.make_5v5_rolling_days(df, **kwargs)

Takes rolling sums based on roll_len_days kwarg. E.g. 30 for a ~monthly rolling sum.

Parameters:
  • df – dataframe
  • kwargs – the relevant one is roll_len_days, int
Returns:

dataframe with extra columns

scrapenhl2.plot.visualization_helper.make_5v5_rolling_gp(df, **kwargs)

Takes rolling sums of numeric columns and concatenates onto the dataframe. Will exclude season, game, player, and team.

Parameters:
  • df – dataframe
  • kwargs – the relevant one is roll_len
Returns:

dataframe with extra columns

scrapenhl2.plot.visualization_helper.make_color_darker(hex=None, rgb=None, returntype='hex')

Makes specified color darker. This is done by converting to rgb and multiplying by 50%.

Parameters:
  • hex – str. Specify either this or rgb.
  • rgb – 3-tuple of floats 0-255. Specify either this or hex
  • returntype – str, ‘hex’ or ‘rgb’
Returns:

a hex or rgb color, input color but darker

scrapenhl2.plot.visualization_helper.make_color_lighter(hex=None, rgb=None, returntype='hex')

Makes specified color lighter. This is done by converting to rgb getting closer to 255 by 50%.

Parameters:
  • hex – str. Specify either this or rgb.
  • rgb – 3-tuple of floats 0-255. Specify either this or hex
  • returntype – str, ‘hex’ or ‘rgb’
Returns:

a hex or rgb color, input color but lighter

scrapenhl2.plot.visualization_helper.parallel_coords(backgrounddf, foregrounddf, groupcol, legendcol=None, axis=None)
Parameters:
  • backgrounddf
  • foregrounddf
  • groupcol – For inline labels (e.g. initials)
  • legendcol – So you can provide another groupcol for legend (e.g. name)
  • axis
Returns:

scrapenhl2.plot.visualization_helper.parallel_coords_background(dataframe, groupcol, axis=None)
Parameters:
  • dataframe
  • groupcol
  • axis
  • zorder
  • alpha
  • color
  • label
Returns:

scrapenhl2.plot.visualization_helper.parallel_coords_foreground(dataframe, groupcol, axis=None)
Parameters:
  • dataframe
  • groupcol
  • axis
  • zorder
  • alpha
  • color
  • label
Returns:

scrapenhl2.plot.visualization_helper.parallel_coords_xy(dataframe, groupcol)
Parameters:
  • dataframe – data in wide format
  • groupcol – column to use as index (e.g. playername)
Returns:

column dictionary, dataframe in long format

scrapenhl2.plot.visualization_helper.rgb_to_hex(red, green, blue)

Return color as #rrggbb for the given RGB color values.

scrapenhl2.plot.visualization_helper.savefilehelper(**kwargs)

Saves current matplotlib figure, or saves to file, or displays

Parameters:kwargs – searches for ‘save_file’. If not found or None, displays figure. If ‘fig’, returns figure. If a filepath, saves.
Returns:nothing, or a figure

This module is from SO. It adds labels for lines on the lines themselves.

scrapenhl2.plot.label_lines.labelLine(line, x, label=None, align=True, **kwargs)

Labels line with line2D label data

scrapenhl2.plot.label_lines.labelLines(lines, align=True, xvals=None, **kwargs)

Labels lines in a line graph

Support

Feel free to contact me with questions or suggestions.

Github

Link.

Collaboration

I’m happy to partner with you in development efforts–just shoot me a message, or just get started by resolving a GitHub issue or suggested enhancement. Please also let me know if you’d like to alpha- or beta-test my code.

Donations

If you would like to support my work, please donate money to a charity of your choice. Many large charities do great work all around the world (e.g. Médecins Sans Frontières), but don’t forget that your support is often more critical for local/small charities. Also consider that small regular donations are sometimes better than one large donation.

You can vet a charity you’re targeting using a charity rating website.

If you do make a donation, make me happy and leave a record here.. (It’s anonymous.)

Introduction

scrapenhl2 is a python package for scraping and manipulating NHL data pulled from the NHL website.

Installation

You need python3 and the python scientific stack (e.g. numpy, matplotlib, pandas, etc). Easiest way is to simply use Anaconda. To be safe, make sure you have python 3.5+, matplotlib 2.0+, and pandas 0.20+.

Next, if you are on Windows, you need to get python-Levenshtein. You can find it here. Download the appropriate .whl file–connect your version of python with the “cp” you see and use the one with “amd64” if you have an AMD 64-bit processor–and navigate to your downloads folder in command line. For example:

cd
cd muneebalam
cd Downloads

Next, install the whl file using pip:

pip install [insert filename here].whl

(Sometimes, this errors out and says you need Visual Studio C++ tools. You can download and install the 2015 version from here.)

Now, all users can open up terminal or command line and enter:

pip install scrapenhl2

(If you have multiple versions of python installed, you may need to alter that command slightly.)

For now, installation should be pretty quick, but in the future it may take awhile (depending on how many past years’ files I make part of the package).

As far as coding environments go, I recommend jupyter notebook or Pycharm Community. Some folks also like the PyDev plugin in Eclipse. The latter two are full-scale applications, while the former launches in your browser. Open up terminal or command line and run:

jupyter notebook

Then navigate to your coding folder, start a new Python file, and you’re good to go.

Use

Note that because this is in pre-alpha/alpha, syntax and use may be buggy and subject to change.

On startup, when you have an internet connection and some games have gone final since you last used the package, open up your python environment and update:

from scrapenhl2.scrape import autoupdate
autoupdate.autoupdate()

Autoupdate should update you regularly on its progress; be patient.

To get a game H2H, use:

from scrapenhl2.plot import game_h2h
season = 2016
game = 30136
game_h2h.game_h2h(season, game)
_images/WSH-TOR_G6.png

To get a game timeline, use:

from scrapenhl2.plot import game_timeline
season = 2016
game = 30136
game_timeline.game_timeline(season, game)
_images/WSH-TOR_G6_timeline.png

To get a player rolling CF% graph, use:

from scrapenhl2.plot import rolling_cf_gf
player = 'Ovechkin'
rolling_games = 25
start_year = 2015
end_year = 2017
rolling_cf_gf.rolling_player_cf(player, rolling_games, start_year, end_year)
_images/Ovechkin_rolling_cf.png

This package is targeted for script use, so I recommend familiarizing yourself with python. (This is not intended to be a replacement for a site like Corsica.)

Look through the documentation at Read the Docs and the examples on Github. Also always feel free to contact me with questions or suggestions.

Contact

Twitter.

Collaboration

I’m happy to partner with you in development efforts–just shoot me a message or submit a pull request. Please also let me know if you’d like to alpha- or beta-test my code.

Donations

If you would like to support my work, please donate money to a charity of your choice. Many large charities do great work all around the world (e.g. Médecins Sans Frontières), but don’t forget that your support is often more critical for local/small charities. Also consider that small regular donations are sometimes better than one large donation.

You can vet a charity you’re targeting using a charity rating website.

If you do make a donation, make me happy and leave a record here.. (It’s anonymous.)

Change log

1/13/18: Various bug fixes, some charts added.

11/10/17: Switched from Flask to Dash, bug fixes.

11/5/17: Bug fixes and method to add on-ice players to file. More refactoring.

10/28/17: Major refactoring. Docs up and running.

10/21/17: Added basic front end. Committed early versions of 2017 logs.

10/16/17: Added initial versions of game timelines, player rolling corsi, and game H2H graphs.

10/10/17: Bug fixes on scraping and team logs. Started methods to aggregate 5v5 game-by-game data for players.

10/7/17: Committed code to scrape 2010 onward and create team logs; still bugs to fix.

9/24/17: Committed minimal structure.

Major outstanding to-dos

  • Bring in old play by play and shifts from HTML
  • More examples
  • More graphs
  • More graphs in Dash app

Indices and tables