Manipulate

The scrapenhl2.manipulate module contains methods useful for scraping.

Useful examples

Add on-ice players to a file:

from scrapenhl2.manipulate import add_onice_players as onice
onice.add_players_to_file('/Users/muneebalam/Downloads/zone_entries.csv', 'WSH', time_format='elapsed')
# Will output zone_entries_on-ice.csv in Downloads, with WSH players and opp players on-ice listed.

See documentation below for more information and additional arguments to add_players_to_file.

Methods

General

scrapenhl2.manipulate.manipulate.add_score_adjustment_to_team_pbp(df)

Adds AdjFF and AdjFA

Parameters: df – dataframe
Returns: dataframe with extra columns

scrapenhl2.manipulate.manipulate.convert_to_all_combos(df, fillval=0, *args)

This method takes a dataframe and makes sure all possible combinations of given arguments are present. For example, if you want df to have all combos of P1 and P2, it will create a dataframe with all possible combos, left join existing dataframe onto that, and return that df. Uses fillval to fill all non-key columns.

Parameters:
  • df – the pandas dataframe
  • fillval – obj, the value with which to fill. Default fill is 0
  • args – str, column names, or tuples of combinations of column names
Returns:

df with all combos of columns specified
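
The approach described above (build every combination, left join the existing data, fill the gaps) can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: all_combos_sketch is a hypothetical name, and it only handles plain column names, not the tuples of column names that args also accepts.

```python
import itertools
import pandas as pd

def all_combos_sketch(df, fillval=0, *cols):
    """Hypothetical stand-in: make sure every combination of values in cols is present."""
    cols = list(cols)
    # Unique values observed in each key column
    uniques = [df[c].unique() for c in cols]
    # Cartesian product of those values, one row per combination
    grid = pd.DataFrame(list(itertools.product(*uniques)), columns=cols)
    # Left join the existing data onto the full grid, then fill non-key gaps
    return grid.merge(df, how='left', on=cols).fillna(fillval)

df = pd.DataFrame({'P1': ['A', 'B'], 'P2': ['X', 'Y'], 'TOI': [10, 20]})
full = all_combos_sketch(df, 0, 'P1', 'P2')
print(full)
```

The two combinations missing from df (A with Y, B with X) appear in the result with TOI filled by fillval.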

scrapenhl2.manipulate.manipulate.count_by_keys(df, *args)

A convenience method that isolates specified columns in the dataframe and gets counts. Drops when keys have NAs.

Parameters:
  • df – dataframe
  • args – str, column names in dataframe
Returns:

df, dataframe with each of args and an additional column, Count
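
A rough pandas equivalent of this behaviour, assuming the counts come from a simple group-by on the key columns. count_by_keys_sketch and the sample data are hypothetical, used only for illustration.

```python
import pandas as pd

def count_by_keys_sketch(df, *cols):
    """Hypothetical stand-in: count rows per combination of the given columns."""
    cols = list(cols)
    return (df[cols]
            .dropna()                    # drop rows where any key is NA
            .groupby(cols).size()        # count rows per key combination
            .reset_index(name='Count'))  # back to a flat dataframe

events = pd.DataFrame({'Team': ['WSH', 'WSH', 'PIT', None],
                       'Event': ['Shot', 'Shot', 'Goal', 'Shot']})
counts = count_by_keys_sketch(events, 'Team', 'Event')
print(counts)
```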

scrapenhl2.manipulate.manipulate.filter_for_corsi(pbp)

Filters given dataframe for goal, shot, miss, and block events

Parameters: pbp – a dataframe with column Event
Returns: pbp, filtered for corsi events

scrapenhl2.manipulate.manipulate.filter_for_event_types(pbp, eventtype)

Filters given dataframe for event type(s) specified only.

Parameters:
  • pbp – dataframe. Need a column titled Event
  • eventtype – str or iterable of str, e.g. Goal, Shot, etc
Returns:

dataframe, filtered
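
The other filter_for_* helpers in this section can be thought of as special cases of this one. A minimal sketch of the idea, assuming the filter is a membership test on the Event column; filter_events_sketch is a hypothetical name, and the event labels in the sample are illustrative.

```python
import pandas as pd

def filter_events_sketch(pbp, eventtype):
    """Hypothetical stand-in: keep only rows whose Event is one of the given types."""
    if isinstance(eventtype, str):
        eventtype = {eventtype}          # allow a single event type as a string
    return pbp[pbp['Event'].isin(set(eventtype))]

pbp = pd.DataFrame({'Event': ['Goal', 'Shot', 'Hit', 'Faceoff'],
                    'Team': ['WSH', 'PIT', 'WSH', 'PIT']})
goals = filter_events_sketch(pbp, 'Goal')
sog = filter_events_sketch(pbp, {'Goal', 'Shot'})
```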

scrapenhl2.manipulate.manipulate.filter_for_fenwick(pbp)

Filters given dataframe for Fenwick events (unblocked shot attempts): goals, shots on goal, and misses.

Parameters: pbp – dataframe. Need a column titled Event
Returns: dataframe. Only rows where Event is a goal, shot, or miss

scrapenhl2.manipulate.manipulate.filter_for_five_on_five(df)

Filters given dataframe for 5v5 rows

Parameters: df – dataframe, columns HomeStrength + RoadStrength or TeamStrength + OppStrength
Returns: dataframe

scrapenhl2.manipulate.manipulate.filter_for_goals(pbp)

Filters given dataframe for goals only.

Parameters: pbp – dataframe. Need a column titled Event
Returns: dataframe. Only rows where Event == ‘Goal’

scrapenhl2.manipulate.manipulate.filter_for_sog(pbp)

Filters given dataframe for SOG only.

Parameters: pbp – dataframe. Need a column titled Event
Returns: dataframe. Only rows where Event == ‘Goal’ or Event == ‘Shot’

scrapenhl2.manipulate.manipulate.filter_for_team(pbp, team)

Filters dataframe for rows where Team == team

Parameters:
  • pbp – dataframe. Needs to have column Team
  • team – int or str, team ID or name
Returns:

dataframe with rows filtered

scrapenhl2.manipulate.manipulate.generate_5v5_player_log(season)

Takes the play by play and adds player 5v5 info to the master player log file, noting TOI, CF, etc. This takes a while because it has to calculate TOICOMP.

Parameters: season – int, the season
Returns: nothing

scrapenhl2.manipulate.manipulate.generate_player_toion_toioff(season)

Generates TOION and TOIOFF at 5v5 for each player in this season.

Parameters: season – int, the season
Returns: df with columns Player, TOION, TOIOFF, and TOI60.

scrapenhl2.manipulate.manipulate.generate_toicomp(season)

Generates toicomp at a player-game level.

Parameters: season – int, the season
Returns: df,

scrapenhl2.manipulate.manipulate.get_5v5_player_game_cfca(season, team)

Gets CFON, CAON, CFOFF, and CAOFF by game for given team in given season.

Parameters:
  • season – int, the season
  • team – int, team id
Returns:

df with game, player, CFON, CAON, CFOFF, and CAOFF

scrapenhl2.manipulate.manipulate.get_5v5_player_game_gfga(season, team)

Gets GFON, GAON, GFOFF, and GAOFF by game for given team in given season.

Parameters:
  • season – int, the season
  • team – int, team id
Returns:

df with game, player, GFON, GAON, GFOFF, and GAOFF

scrapenhl2.manipulate.manipulate.get_5v5_player_game_shift_startend(season, team)

Generates shift starts and ends for shifts that start and end at 5v5: OZ, DZ, NZ, or OtF (on the fly).

Parameters:
  • season – int, the season
  • team – int or str, the team
Returns:

dataframe with shift starts and ends

scrapenhl2.manipulate.manipulate.get_5v5_player_game_toi(season, team)

Gets TOION and TOIOFF by game and player for given team in given season.

Parameters:
  • season – int, the season
  • team – int, team id
Returns:

df with game, player, TOION, and TOIOFF

scrapenhl2.manipulate.manipulate.get_5v5_player_game_toicomp(season, team)

Calculates data for QoT and QoC at a player-game level for given team in given season.

Parameters:
  • season – int, the season
  • team – int, team id
Returns:

df with game, player,

scrapenhl2.manipulate.manipulate.get_5v5_player_log(season, force_create=False)
Parameters:
  • season – int, the season
  • force_create – bool, create from scratch even if it exists?
Returns:

scrapenhl2.manipulate.manipulate.get_5v5_player_log_filename(season)
Parameters: season – int, the season
Returns:

scrapenhl2.manipulate.manipulate.get_5v5_player_season_toi(season, team)

Gets TOION and TOIOFF by player for given team in given season.

Parameters:
  • season – int, the season
  • team – int, team id
Returns:

df with game, player, TOION, and TOIOFF

scrapenhl2.manipulate.manipulate.get_directions_for_xy_for_game(season, game)

It doesn’t seem like there are rules for whether positive X in XY event locations corresponds to offensive zone events, for example. The best way is to use fields in the JSON.

Parameters:
  • season – int, the season
  • game – int, the game
Returns:

dict indicating which direction home team is attacking by period

scrapenhl2.manipulate.manipulate.get_directions_for_xy_for_season(season, team)

Gets directions for team specified using get_directions_for_xy_for_game

Parameters:
  • season – int, the season
  • team – int or str, the team
Returns:

dataframe

scrapenhl2.manipulate.manipulate.get_game_h2h_corsi(season, games, cfca=None)

This method gets H2H Corsi at 5v5 for the given game(s).

Parameters:
  • season – int, the season
  • games – int, the game, or list of int, the games
  • cfca – str, or None. If you specify ‘cf’, returns CF only. For CA, use ‘ca’. None returns CF - CA.
Returns:

a df with [P1, P1Team, P2, P2Team, CF, CA, C+/-]. Entries will be duplicated, as with get_game_h2h_toi.

scrapenhl2.manipulate.manipulate.get_game_h2h_toi(season, games)

This method gets H2H TOI at 5v5 for the given game(s).

Parameters:
  • season – int, the season
  • games – int, the game, or list of int, the games
Returns:

a df with [P1, P1Team, P2, P2Team, TOI]. Entries will be duplicated (one with given P as P1, another as P2)
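
To see why entries come back duplicated, here is a minimal sketch of one way to compute head-to-head TOI: self-join a per-second on-ice table and count shared seconds. The on-ice table layout (Time, Player, Team) is an assumption made for illustration and is not taken from the library.

```python
import pandas as pd

# Hypothetical per-second on-ice table: one row per player per second
onice = pd.DataFrame({
    'Time':   [1, 1, 1, 2, 2, 2],
    'Player': ['A', 'B', 'C', 'A', 'B', 'D'],
    'Team':   ['H', 'H', 'R', 'H', 'H', 'R'],
})

left = onice.rename(columns={'Player': 'P1', 'Team': 'P1Team'})
right = onice.rename(columns={'Player': 'P2', 'Team': 'P2Team'})

# Self-join on time: every ordered pair of players on the ice in the same second
pairs = left.merge(right, on='Time')
pairs = pairs[pairs['P1'] != pairs['P2']]   # drop a player paired with themselves

# Seconds shared by each ordered pair
h2h = (pairs.groupby(['P1', 'P1Team', 'P2', 'P2Team']).size()
       .reset_index(name='TOI'))
print(h2h)
```

Because the join produces ordered pairs, A with B and B with A each get their own row with the same TOI.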

scrapenhl2.manipulate.manipulate.get_line_combos(season, game, homeroad='H')

Returns a df listing the 5v5 line combinations used in this game for specified team, and time they each played together

Parameters:
  • season – int, the season
  • game – int, the game
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

pandas dataframe with columns P1, P2, P3, Secs. May contain duplicates

scrapenhl2.manipulate.manipulate.get_micah_score_adjustment()

See http://hockeyviz.com/txt/senstats

Returns: dataframe with columns HomeScoreDiff, HomeFFWeight, and HomeFAWeight

scrapenhl2.manipulate.manipulate.get_pairings(season, game, homeroad='H')

Returns a df listing the 5v5 pairs used in this game for specified team, and time they each played together

Parameters:
  • season – int, the season
  • game – int, the game
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

pandas dataframe with columns P1, P2, Secs. May contain duplicates

scrapenhl2.manipulate.manipulate.get_pbp_events(*args, **kwargs)

A general method that yields a generator of dataframes of PBP events subject to given limitations.

Keyword arguments are applied as “or” conditions for each individual keyword (e.g. multiple teams) but as “and” conditions otherwise.

The non-keyword arguments are event types subject to “or” conditions:

  • ‘fac’ or ‘faceoff’
  • ‘shot’ or ‘sog’ or ‘save’
  • ‘hit’
  • ‘stop’ or ‘stoppage’
  • ‘block’ or ‘blocked shot’
  • ‘miss’ or ‘missed shot’
  • ‘give’ or ‘giveaway’
  • ‘take’ or ‘takeaway’
  • ‘penl’ or ‘penalty’
  • ‘goal’
  • ‘period end’
  • ‘period official’
  • ‘period ready’
  • ‘period start’
  • ‘game scheduled’
  • ‘gend’ or ‘game end’
  • ‘shootout complete’
  • ‘chal’ or ‘official challenge’
  • ‘post’, which is not an officially designated event but will be searched for

Dataframes are returned season-by-season to save on memory. If you want to operate on all seasons, process this data before going to the next season.

Defaults to return all regular-season and playoff events by all teams.

Supported keyword arguments:

  • add_on_ice: bool. If True, adds on-ice players for each time.
  • players_on_ice: str or int, or list of them, player IDs or names of players on ice for event.
  • players_on_ice_for: like players_on_ice, but players must be on ice for team that “did” event.
  • players_on_ice_ag: like players_on_ice, but players must be on ice for opponent of team that “did” event.
  • team: str or int, or list of them. Teams to filter for.
  • team_for: str or int, or list of them. Team that committed event.
  • team_ag: str or int, or list of them. Team that “received” event.
  • home_team: str or int, or list of them. Home team.
  • road_team: str or int, or list of them. Road team.
  • start_date: str or date, will only return data on or after this date. YYYY-MM-DD
  • end_date: str or date, will only return data on or before this date. YYYY-MM-DD
  • start_season: int, will only return events in or after this season. Defaults to 2010-11.
  • end_season: int, will only return events in or before this season. Defaults to current season.
  • season_type: int or list of int. 1 for preseason, 2 for regular, 3 for playoffs, 4 for ASG, 6 for Oly, 8 for WC.
    Defaults to 2 and 3.
  • start_game: int, start game. Applies only to start season. Game ID will be this, or greater.
  • end_game: int, end game. Applies only to end season. Game ID will be this, or smaller.
  • acting_player: str or int, or list of them, players who committed event (e.g. took a shot).
  • receiving_player: str or int, or list of them, players who received event (e.g. took a hit).
  • strength_hr: tuples or list of them, e.g. (5, 5) or ((5, 5), (4, 4), (3, 3)). This is (Home, Road).
    If neither strength_hr nor strength_to is specified, uses 5v5.
  • strength_to: tuples or list of them, e.g. (5, 5) or ((5, 5), (4, 4), (3, 3)). This is (Team, Opponent).
    If neither strength_hr nor strength_to is specified, uses 5v5.
  • score_diff: int or list of them, acceptable score differences (e.g. 0 for tied, (1, 2, 3) for up by 1-3 goals)
  • start_time: int, seconds elapsed in game. Events returned will be after this.
  • end_time: int, seconds elapsed in game. Events returned will be before this.
Parameters:
  • args – str, event types to search for (applied “OR”, not “AND”)
  • kwargs – keyword arguments specifying filters (applied “AND”, not “OR”)
Returns:

df, a pandas dataframe
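
The “or within a keyword, and across keywords” rule can be illustrated without the library. This is a sketch of the filtering logic only: filter_sketch and the column names Team and ScoreDiff are hypothetical and say nothing about how get_pbp_events is implemented.

```python
import pandas as pd

def filter_sketch(df, **filters):
    """Hypothetical stand-in: OR within each keyword, AND across keywords."""
    mask = pd.Series(True, index=df.index)
    for col, allowed in filters.items():
        if isinstance(allowed, (str, int)):
            allowed = [allowed]               # a single value is a one-item list
        # isin() is the "or" within one keyword; &= is the "and" across keywords
        mask &= df[col].isin(allowed)
    return df[mask]

pbp = pd.DataFrame({'Team': ['WSH', 'PIT', 'NYR', 'WSH'],
                    'Event': ['Goal', 'Goal', 'Shot', 'Shot'],
                    'ScoreDiff': [0, 1, 0, 2]})

# Rows where (Team is WSH or PIT) and (ScoreDiff is 0)
result = filter_sketch(pbp, Team=['WSH', 'PIT'], ScoreDiff=0)
print(result)
```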

scrapenhl2.manipulate.manipulate.get_player_positions()

Use to get player positions.

Returns: df with colnames ID and position

scrapenhl2.manipulate.manipulate.get_player_toi(season, game, pos=None, homeroad='H')

Returns a df listing 5v5 ice time for each player for specified team.

Parameters:
  • season – int, the season
  • game – int, the game
  • pos – specify ‘L’, ‘C’, ‘R’, ‘D’ or None for all
  • homeroad – str, ‘H’ for home or ‘R’ for road
Returns:

pandas df with columns Player, Secs

scrapenhl2.manipulate.manipulate.get_player_toion_toioff_file(season, force_create=False)
Parameters:
  • season – int, the season
  • force_create – bool, should this be read from file if possible, or created from scratch
Returns:

scrapenhl2.manipulate.manipulate.get_player_toion_toioff_filename(season)
Parameters: season – int, the season
Returns:

scrapenhl2.manipulate.manipulate.get_toicomp_file(season, force_create=False)

If you want to rewrite the TOI60 file, too, then run get_player_toion_toioff_file with force_create=True before running this method.

Parameters:
  • season – int, the season
  • force_create – bool, should this be read from file if possible, or created from scratch
Returns:

scrapenhl2.manipulate.manipulate.get_toicomp_filename(season)
Parameters: season – int, the season
Returns:

scrapenhl2.manipulate.manipulate.infer_zones_for_faceoffs(df, directions, xcol='X', ycol='Y', timecol='Time', focus_team=None, season=None, faceoffs=True)

Inferring zones for events from XY is hard. This method takes care of that by referencing the JSON’s notes on which team went which direction in which period.

This method notes several different zones for faceoffs:

  • OL (offensive zone, left)
  • OR (offensive zone, right)
  • NOL (neutral zone, near offensive blueline, left)
  • NOR (neutral zone, near offensive blueline, right)
  • NDL (neutral zone, near defensive blueline, left)
  • NDR (neutral zone, near defensive blueline, right)
  • DL (defensive zone, left)
  • DR (defensive zone, right)
  • N (center ice)

This method can also handle non-faceoff events, using three zones

  • N (neutral)
  • O (offensive)
  • D (defensive)
Parameters:
  • df – dataframe with columns Game, specified xcol, and specified ycol
  • directions – dataframe with columns Game, Period, and Direction (‘left’ or ‘right’)
  • xcol – str, the column containing X coordinates in df
  • ycol – str, the column containing Y coordinates in df
  • timecol – str, the column containing the time in seconds.
  • focus_team – int, str, or None. Directions are stored from the home team’s perspective, so specify focus_team and the directions will be flipped for games where focus_team is on the road. If None, does not do the extra home/road flip. Necessitates a Season column in df.
  • season – int, the season
  • faceoffs – bool. If True will use the nine zones above. If False, only the three.
Returns:

dataframe with extra column EventLoc
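
For the three-zone (non-faceoff) case, the idea can be sketched as a join against the directions table followed by a coordinate flip. Several things here are assumptions for illustration only: a Period column already present in the events, X running from -100 to 100 with blue lines at ±25, and Direction naming the end the team attacks. The library's own handling of timecol, focus_team, and the nine faceoff zones is not reproduced.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs
events = pd.DataFrame({'Game': [20001, 20001, 20001],
                       'Period': [1, 1, 2],
                       'X': [60, -10, 60]})
directions = pd.DataFrame({'Game': [20001, 20001],
                           'Period': [1, 2],
                           'Direction': ['right', 'left']})

merged = events.merge(directions, how='left', on=['Game', 'Period'])

# Flip X so that positive always means "toward the attacking end" (assumed convention)
x_att = np.where(merged['Direction'] == 'right', merged['X'], -merged['X'])

# Assumed blue-line positions at +/-25 feet from center ice
merged['EventLoc'] = np.select([x_att > 25, x_att < -25], ['O', 'D'], default='N')
print(merged)
```

The same X value of 60 lands in the offensive zone in period 1 and the defensive zone in period 2, because the attacking direction changes between periods.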

scrapenhl2.manipulate.manipulate.merge_onto_all_team_games_and_zero_fill(df, season, team)

A method that gets all team games from this season and left joins df onto it on game, then zero fills NAs. Makes sure you didn’t miss any games and get NAs later.

Parameters:
  • df – dataframe with columns Game and PlayerID or Player
  • season – int, the season
  • team – int or str, the team
Returns:

dataframe
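
A minimal pandas sketch of the left-join-and-zero-fill idea. The schedule dataframe all_games and the CFON column are hypothetical sample data, and crossing games with players before joining is an assumption about how player-level rows would be kept intact; how='cross' needs pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical list of all games the team played this season
all_games = pd.DataFrame({'Game': [20001, 20002, 20003]})

# Player-game data with one game missing (the player recorded nothing in 20002)
df = pd.DataFrame({'Game': [20001, 20003],
                   'Player': ['A', 'A'],
                   'CFON': [5, 7]})

# Every game for every player that appears in df
players = df[['Player']].drop_duplicates()
grid = all_games.merge(players, how='cross')

# Left join the data onto the full grid and zero-fill the missing game
filled = grid.merge(df, how='left', on=['Game', 'Player']).fillna(0)
print(filled)
```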

scrapenhl2.manipulate.manipulate.player_columns_to_name(df, columns=None)

Takes a dataframe and transforms specified columns of player IDs into names. If no columns provided, searches for defaults: H1, H2, H3, H4, H5, H6, HG (and same seven with R)

Parameters:
  • df – A dataframe
  • columns – a list of strings, or None
Returns:

df, dataframe with same column names, but columns now names instead of IDs
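
The transformation amounts to mapping each ID column through an ID-to-name lookup. In this sketch the lookup dict, the IDs, and ids_to_names_sketch are all made up for illustration; the library resolves names from its own player data.

```python
import pandas as pd

def ids_to_names_sketch(df, id_to_name, columns=None):
    """Hypothetical stand-in: replace player IDs with names in the given columns."""
    if columns is None:
        # Default on-ice columns: H1..H6, HG and R1..R6, RG, if present
        defaults = [hr + slot for hr in 'HR' for slot in ['1', '2', '3', '4', '5', '6', 'G']]
        columns = [c for c in defaults if c in df.columns]
    out = df.copy()
    for col in columns:
        out[col] = out[col].map(id_to_name)
    return out

lookup = {1: 'Player One', 2: 'Player Two'}     # made-up IDs and names
pbp = pd.DataFrame({'Event': ['Shot', 'Goal'], 'H1': [1, 2], 'R1': [2, 1]})
named = ids_to_names_sketch(pbp, lookup)
print(named)
```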

scrapenhl2.manipulate.manipulate.save_5v5_player_log(df, season)
Parameters: season – int, the season
Returns: nothing

scrapenhl2.manipulate.manipulate.save_player_toion_toioff_file(df, season)
Parameters:
  • df
  • season – int, the season
Returns:

scrapenhl2.manipulate.manipulate.save_toicomp_file(df, season)
Parameters:
  • df
  • season – int, the season
Returns:

scrapenhl2.manipulate.manipulate.team_5v5_score_state_summary_by_game(season)

Uses the team TOI log to group by team and game and score state for this season. 5v5 only.

Parameters: season – int, the season
Returns: dataframe, grouped by team, strength, and game

scrapenhl2.manipulate.manipulate.team_5v5_shot_rates_by_score(season)

Uses the team TOI and PBP logs to group by team and game and score state for this season. 5v5 only.

Parameters: season – int, the season
Returns: dataframe, grouped by team, strength, and game. Also columns for TOI, CF, and CA

scrapenhl2.manipulate.manipulate.time_to_mss(sectime)

Converts a number of seconds to m:ss format

Parameters: sectime – int, a number of seconds
Returns: str, sectime in m:ss
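
The conversion itself is a divmod by 60 with zero-padded seconds. A minimal sketch under a hypothetical name, not the library's code:

```python
def time_to_mss_sketch(sectime):
    """Hypothetical stand-in: convert a number of seconds to m:ss."""
    minutes, seconds = divmod(int(sectime), 60)
    # Minutes are not padded; seconds always take two digits
    return '{0:d}:{1:02d}'.format(minutes, seconds)

print(time_to_mss_sketch(125))
```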

Add on-ice players

Add on-ice players to a file by specifying filename and columns from which to infer time elapsed in game.

scrapenhl2.manipulate.add_onice_players.add_onice_players_to_df(df, focus_team, season, gamecol, player_output='ids')

Uses the _Secs column in df, the season, and the gamecol to join onto on-ice players.

Parameters:
  • df – dataframe
  • focus_team – str or int, team to focus on. Its players will be listed first in the sheet.
  • season – int, the season
  • gamecol – str, the column with game IDs
  • player_output – str, use ‘names’ or ‘nums’ or ‘ids’. Currently ‘nums’ is not supported.
Returns:

dataframe with team and opponent players

scrapenhl2.manipulate.add_onice_players.add_players_to_file(filename, focus_team, season=None, gamecol='Game', periodcol='Period', timecol='Time', time_format='elapsed', update_data=False, player_output='names')

Adds names of on-ice players to the end of each line, and writes to file in the same folder as input file. Specifically, adds 1 second to the time in the spreadsheet and adds players who were on the ice at that time.

You cannot necessarily trust results when times coincide with stoppages, and it’s worth checking faceoffs as well.

Parameters:
  • filename – str, the file to read. Will save output as this filename but ending in “on-ice.csv”
  • focus_team – str or int, e.g. ‘WSH’ or ‘WPG’
  • season – int. For 2007-08, use 2007. Defaults to current season.
  • gamecol – str. The column holding game IDs (e.g. 20001). By default, looks for column called “Game”
  • periodcol – str. The column holding period number/name (1, 2, 3, 4 or OT, etc). By default: “Period”
  • timecol – str. The column holding time in period in M:SS format.
  • time_format – str, how to interpret timecol. Use ‘elapsed’ or ‘remaining’. E.g. the start of a period is 0:00 with elapsed and 20:00 in remaining.
  • update_data – bool. If True, will autoupdate() data for given season. If not, will not update game data. Use when file includes data from games not already scraped.
  • player_output – str, use ‘names’ or ‘nums’. Currently only supports ‘names’
Returns:

nothing

scrapenhl2.manipulate.add_onice_players.add_times_to_file(df, periodcol, timecol, time_format)

Uses specified periodcol, timecol, and time_format to calculate _Secs, time elapsed in game.

Parameters:
  • df – dataframe
  • periodcol – str, the column that holds period name/number (1, 2, 3, 4 or OT, etc)
  • timecol – str, the column that holds time in m:ss format
  • time_format – use ‘elapsed’ (preferred) or ‘remaining’. This refers to timecol: e.g. 120 secs elapsed in the 2nd period might be listed as 2:00 in timecol, or as 18:00.
Returns:

dataframe with extra column _Secs, time elapsed in game.
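
The elapsed/remaining distinction can be sketched for a single row as below. This assumes numeric period values and 20-minute (1200-second) periods; named periods such as OT are not handled here, and elapsed_game_secs is a hypothetical helper rather than part of the library.

```python
def elapsed_game_secs(period, time_str, time_format='elapsed', period_len=1200):
    """Hypothetical helper: seconds elapsed in game from a period number and an m:ss time."""
    minutes, seconds = time_str.split(':')
    secs = int(minutes) * 60 + int(seconds)
    if time_format == 'remaining':
        # Convert time remaining in the period to time elapsed in the period
        secs = period_len - secs
    # Add the full length of every earlier period
    return (int(period) - 1) * period_len + secs

# 120 seconds into the 2nd period, written both ways
print(elapsed_game_secs(2, '2:00', 'elapsed'))
print(elapsed_game_secs(2, '18:00', 'remaining'))
```

Both calls describe the same moment, so they give the same number of seconds elapsed in game.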

TOI and Corsi for combinations of players

This module contains methods for generating H2H data for games

scrapenhl2.manipulate.combos.get_game_combo_corsi(season, game, player_n=2, cfca=None, *hrcodes)

This method gets H2H Corsi at 5v5 for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • player_n – int. E.g. 1 gives you a list of players and TOI, 2 gives you h2h, 3 gives you groups of 3, etc.
  • cfca – str, or None. If you specify ‘cf’, returns CF only. For CA, use ‘ca’. None returns CF - CA.
  • hrcodes – to limit exploding joins, specify strings containing ‘H’ and ‘R’ and ‘A’, each of length player_n. For example, if player_n=3, specify ‘HHH’ to only get home team player combos. If this is left unspecified, will do all combos, which can be problematic when player_n > 3. Use ‘R’ for road, ‘H’ for home, ‘A’ for all (both).
Returns:

a df with [P1, P1Team, P2, P2Team, TOI, etc]. Entries will be duplicated.

scrapenhl2.manipulate.combos.get_game_combo_toi(season, game, player_n=2, *hrcodes)

This method gets H2H TOI at 5v5 for the given game.

Parameters:
  • season – int, the season
  • game – int, the game
  • player_n – int. E.g. 1 gives you a list of players and TOI, 2 gives you h2h, 3 gives you groups of 3, etc.
  • hrcodes – to limit exploding joins, specify strings containing ‘H’ and ‘R’ and ‘A’, each of length player_n. For example, if player_n=3, specify ‘HHH’ to only get home team player combos. If this is left unspecified, will do all combos, which can be problematic when player_n > 3. Use ‘R’ for road, ‘H’ for home, ‘A’ for all (both).
Returns:

a df with [P1, P1Team, P2, P2Team, TOI, etc]. Entries will be duplicated.
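
To get a feel for why hrcodes helps, the sketch below counts how many ordered player groups each code allows. The roster lists and combos_for_code are hypothetical; the library builds its combinations with dataframe joins rather than this helper.

```python
import itertools

def combos_for_code(code, home, road):
    """Hypothetical helper: ordered player groups allowed by an hrcode such as 'HH' or 'HR'."""
    pools = {'H': home, 'R': road, 'A': home + road}
    groups = itertools.product(*(pools[ch] for ch in code))
    # A player cannot appear twice in the same group
    return [g for g in groups if len(set(g)) == len(g)]

home = ['A', 'B', 'C']   # made-up rosters
road = ['X', 'Y']

print(len(combos_for_code('HH', home, road)))   # home pairs only
print(len(combos_for_code('AA', home, road)))   # pairs drawn from both teams
```

Restricting to one team's players keeps the number of groups far smaller than drawing from both teams, and the gap widens as player_n grows.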

scrapenhl2.manipulate.combos.get_team_combo_corsi(season, team, games, n_players=2)

Gets combo Corsi for team for specified games

Parameters:
  • season – int, the season
  • team – int or str, team
  • games – int or iterable of int, games
  • n_players – int. E.g. 1 gives you player TOI, 2 gives you 2-player group TOI, 3 makes 3-player groups, etc
Returns:

dataframe

scrapenhl2.manipulate.combos.get_team_combo_toi(season, team, games, n_players=2)

Gets 5v5 combo TOI for team for specified games

Parameters:
  • season – int, the season
  • team – int or str, team
  • games – int or iterable of int, games
  • n_players – int. E.g. 1 gives you player TOI, 2 gives you 2-player group TOI, 3 makes 3-player groups, etc
Returns:

dataframe