DS220 Final Project¶
Kush Lalwani Premier League Stats
About the Dataset¶
https://www.kaggle.com/datasets/zaeemnalla/premier-league/
This dataset from Kaggle consists of two CSV files.
Both files contain data about Premier League teams from the 2006/07 season to the 2017/18 season.
The first file contains the result of every fixture played over the 12-season span.
The second file contains cumulative statistics for each team over the course of each season, covering 40 different metrics.
We will use the pandas and matplotlib libraries to answer the following questions:
- Which team has the highest conversion rate per scoring attempt and is there a correlation between scoring attempts and goals scored?
- What is the outcome of penalty kicks against a team?
- Which team has had the highest number of cards per game, and is there a relationship between win-loss ratio and discipline?
- How do teams typically score their goals?
- Which team has the highest number of saves, and is there a correlation between the number of saves and the number of clean sheets?
- What is the distribution of types of passes for every team?
- What is the distribution of the results of a team's scoring attempts?
Importing dataset into DataFrame¶
import pandas as pd
import matplotlib.pyplot as plt
plData = pd.read_csv("PLstats.csv")
Cleaning the Data¶
This dataset contains some very useful statistics about teams over 12 seasons; however, not every statistic was tracked in all of those years.
For example, headed clearances, through balls, and dispossessions were not tracked in the first season of the dataset.
Back passes and big chances were also not tracked until the 2010/11 season.
Otherwise, on scrolling through the dataset, everything looked good except for the 'saves' statistic, which does not appear to have correct numbers in some cells. In the first few seasons listed, the values are much lower than in later seasons, and the column also contains a few null values.
I do not think it is fair to replace these values with averages or some other number, because many factors affect how many saves a team makes. For example, a new goalkeeper can significantly change the number of saves.
To work around this issue, I will start my analysis with the 2013/14 season. I feel that five seasons is long enough to derive useful insights.
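One way to check which seasons are affected is to count the missing values per season before dropping anything. The sketch below uses a small made-up frame (the `toy` name and all of its values are illustrative, not rows from `PLstats.csv`); the same calls should apply to `plData` while 'season' is still a regular column.

```python
import pandas as pd

# Toy stand-in for plData: illustrative values only, not real dataset rows
toy = pd.DataFrame({
    "team": ["A", "B", "A", "B"],
    "season": ["2006-2007", "2006-2007", "2013-2014", "2013-2014"],
    "saves": [None, 12.0, 110.0, 95.0],
    "big_chance_missed": [None, None, 40.0, 35.0],
})

# Count missing values in each stat column, season by season
nulls_per_season = (
    toy.drop(columns=["team"])
       .set_index("season")
       .isna()
       .groupby(level="season")
       .sum()
)
print(nulls_per_season)
```

A season whose row is all zeros has no missing statistics, which makes it a candidate starting point for the analysis.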
plData.head()
| team | wins | losses | goals | total_yel_card | total_red_card | total_scoring_att | ontarget_scoring_att | hit_woodwork | att_hd_goal | ... | total_cross | corner_taken | touches | big_chance_missed | clearance_off_line | dispossessed | penalty_save | total_high_claim | punches | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manchester United | 28.0 | 5.0 | 83.0 | 60.0 | 1.0 | 698.0 | 256.0 | 21.0 | 12.0 | ... | 918.0 | 258.0 | 25686.0 | NaN | 1.0 | NaN | 2.0 | 37.0 | 25.0 | 2006-2007 |
| 1 | Chelsea | 24.0 | 3.0 | 64.0 | 62.0 | 4.0 | 636.0 | 216.0 | 14.0 | 16.0 | ... | 897.0 | 231.0 | 24010.0 | NaN | 2.0 | NaN | 1.0 | 74.0 | 22.0 | 2006-2007 |
| 2 | Liverpool | 20.0 | 10.0 | 57.0 | 44.0 | 0.0 | 668.0 | 214.0 | 15.0 | 8.0 | ... | 1107.0 | 282.0 | 24150.0 | NaN | 1.0 | NaN | 0.0 | 51.0 | 27.0 | 2006-2007 |
| 3 | Arsenal | 19.0 | 8.0 | 63.0 | 59.0 | 3.0 | 638.0 | 226.0 | 19.0 | 10.0 | ... | 873.0 | 278.0 | 25592.0 | NaN | 1.0 | NaN | 0.0 | 88.0 | 27.0 | 2006-2007 |
| 4 | Tottenham Hotspur | 17.0 | 12.0 | 57.0 | 48.0 | 3.0 | 520.0 | 184.0 | 6.0 | 5.0 | ... | 796.0 | 181.0 | 22200.0 | NaN | 2.0 | NaN | 0.0 | 51.0 | 24.0 | 2006-2007 |
5 rows × 42 columns
plData.tail()
| team | wins | losses | goals | total_yel_card | total_red_card | total_scoring_att | ontarget_scoring_att | hit_woodwork | att_hd_goal | ... | total_cross | corner_taken | touches | big_chance_missed | clearance_off_line | dispossessed | penalty_save | total_high_claim | punches | season | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 235 | Huddersfield Town | 9.0 | 19.0 | 28.0 | 62.0 | 3.0 | 362.0 | 109.0 | 8.0 | 5.0 | ... | 765.0 | 165.0 | 22619.0 | 21.0 | 6.0 | 416.0 | 2.0 | 31.0 | 24.0 | 2017-2018 |
| 236 | Swansea City | 8.0 | 21.0 | 28.0 | 51.0 | 1.0 | 338.0 | 103.0 | 8.0 | 3.0 | ... | 694.0 | 150.0 | 22775.0 | 26.0 | 1.0 | 439.0 | 3.0 | 44.0 | 15.0 | 2017-2018 |
| 237 | Southampton | 7.0 | 16.0 | 37.0 | 63.0 | 2.0 | 450.0 | 145.0 | 15.0 | 7.0 | ... | 800.0 | 227.0 | 24639.0 | 37.0 | 4.0 | 379.0 | 1.0 | 29.0 | 13.0 | 2017-2018 |
| 238 | Stoke City | 7.0 | 19.0 | 35.0 | 62.0 | 1.0 | 384.0 | 132.0 | 8.0 | 8.0 | ... | 598.0 | 136.0 | 20368.0 | 33.0 | 3.0 | 402.0 | 0.0 | 27.0 | 14.0 | 2017-2018 |
| 239 | West Bromwich Albion | 6.0 | 19.0 | 31.0 | 73.0 | 1.0 | 378.0 | 114.0 | 7.0 | 10.0 | ... | 784.0 | 176.0 | 20552.0 | 28.0 | 3.0 | 446.0 | 0.0 | 40.0 | 5.0 | 2017-2018 |
5 rows × 42 columns
# Drop every row that has a missing value in any column
plData.dropna(inplace=True)
# Drop rows whose 'saves' value is implausibly low (below 60)
plData = plData[plData["saves"] >= 60].copy()
print(plData.shape)
(100, 42)
The data has now been cleaned of nulls and of the implausibly low 'saves' values. The shape is consistent with what we expect: there are 20 teams in the Premier League and we are looking at 5 seasons of data, so there should be 100 rows of statistics.
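A row count of 100 does not by itself show that every season has exactly 20 teams, so a per-season count is a slightly stronger check. This is a minimal sketch on a made-up frame; `cleaned` and `EXPECTED_TEAMS` are hypothetical names, and the toy data uses 3 teams per season instead of 20.

```python
import pandas as pd

# Toy stand-in for the cleaned plData (2 seasons x 3 teams instead of 5 x 20)
cleaned = pd.DataFrame({
    "team": ["A", "B", "C", "A", "B", "D"],
    "season": ["2016-2017"] * 3 + ["2017-2018"] * 3,
})

EXPECTED_TEAMS = 3  # would be 20 for the real Premier League data

# Number of rows per season; every season should have one row per team
rows_per_season = cleaned.groupby("season").size()
all_complete = bool((rows_per_season == EXPECTED_TEAMS).all())

# No (team, season) pair should appear twice
has_duplicates = bool(cleaned.duplicated(subset=["team", "season"]).any())
print(rows_per_season, all_complete, has_duplicates)
```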
There is another problem with the dataset, however: searching through the data calls for a better index than the row number. One would think to use the team name as the index, but many teams have remained in the Premier League for multiple seasons, so the team name alone would produce duplicate index labels and lookups by team would be ambiguous. Luckily, pandas supports multi-indexing, which is similar to creating a primary key from more than one attribute. I will use the team name and the season as the index.
plData.set_index(['team','season'],inplace = True)
print(plData.index)
MultiIndex([( 'Manchester City', '2013-2014'),
( 'Liverpool', '2013-2014'),
( 'Chelsea', '2013-2014'),
( 'Arsenal', '2013-2014'),
( 'Everton', '2013-2014'),
( 'Tottenham Hotspur', '2013-2014'),
( 'Manchester United', '2013-2014'),
( 'Newcastle United', '2013-2014'),
( 'Southampton', '2013-2014'),
( 'Crystal Palace', '2013-2014'),
( 'Stoke City', '2013-2014'),
( 'Swansea City', '2013-2014'),
( 'West Ham United', '2013-2014'),
( 'Aston Villa', '2013-2014'),
( 'Hull City', '2013-2014'),
( 'Sunderland', '2013-2014'),
( 'Fulham', '2013-2014'),
( 'Norwich City', '2013-2014'),
( 'Cardiff City', '2013-2014'),
( 'West Bromwich Albion', '2013-2014'),
( 'Chelsea', '2014-2015'),
( 'Manchester City', '2014-2015'),
( 'Arsenal', '2014-2015'),
( 'Manchester United', '2014-2015'),
( 'Tottenham Hotspur', '2014-2015'),
( 'Liverpool', '2014-2015'),
( 'Southampton', '2014-2015'),
( 'Swansea City', '2014-2015'),
( 'Stoke City', '2014-2015'),
( 'Crystal Palace', '2014-2015'),
( 'Everton', '2014-2015'),
( 'West Ham United', '2014-2015'),
( 'Leicester City', '2014-2015'),
( 'West Bromwich Albion', '2014-2015'),
( 'Aston Villa', '2014-2015'),
( 'Newcastle United', '2014-2015'),
( 'Hull City', '2014-2015'),
( 'Queens Park Rangers', '2014-2015'),
( 'Burnley', '2014-2015'),
( 'Sunderland', '2014-2015'),
( 'Leicester City', '2015-2016'),
( 'Arsenal', '2015-2016'),
( 'Manchester City', '2015-2016'),
( 'Manchester United', '2015-2016'),
( 'Tottenham Hotspur', '2015-2016'),
( 'Southampton', '2015-2016'),
( 'Liverpool', '2015-2016'),
( 'West Ham United', '2015-2016'),
( 'Stoke City', '2015-2016'),
( 'Chelsea', '2015-2016'),
( 'Swansea City', '2015-2016'),
( 'Watford', '2015-2016'),
( 'AFC Bournemouth', '2015-2016'),
( 'Crystal Palace', '2015-2016'),
( 'Everton', '2015-2016'),
( 'West Bromwich Albion', '2015-2016'),
( 'Newcastle United', '2015-2016'),
( 'Norwich City', '2015-2016'),
( 'Sunderland', '2015-2016'),
( 'Aston Villa', '2015-2016'),
( 'Chelsea', '2016-2017'),
( 'Tottenham Hotspur', '2016-2017'),
( 'Arsenal', '2016-2017'),
( 'Manchester City', '2016-2017'),
( 'Liverpool', '2016-2017'),
( 'Manchester United', '2016-2017'),
( 'Everton', '2016-2017'),
( 'AFC Bournemouth', '2016-2017'),
( 'Crystal Palace', '2016-2017'),
( 'Leicester City', '2016-2017'),
( 'Southampton', '2016-2017'),
( 'Swansea City', '2016-2017'),
( 'West Bromwich Albion', '2016-2017'),
( 'West Ham United', '2016-2017'),
( 'Burnley', '2016-2017'),
( 'Stoke City', '2016-2017'),
( 'Watford', '2016-2017'),
( 'Hull City', '2016-2017'),
( 'Sunderland', '2016-2017'),
( 'Middlesbrough', '2016-2017'),
( 'Manchester City', '2017-2018'),
( 'Manchester United', '2017-2018'),
( 'Tottenham Hotspur', '2017-2018'),
( 'Chelsea', '2017-2018'),
( 'Liverpool', '2017-2018'),
( 'Arsenal', '2017-2018'),
( 'Burnley', '2017-2018'),
( 'Everton', '2017-2018'),
( 'Leicester City', '2017-2018'),
( 'Newcastle United', '2017-2018'),
( 'AFC Bournemouth', '2017-2018'),
( 'Crystal Palace', '2017-2018'),
( 'Watford', '2017-2018'),
( 'West Ham United', '2017-2018'),
('Brighton and Hove Albion', '2017-2018'),
( 'Huddersfield Town', '2017-2018'),
( 'Swansea City', '2017-2018'),
( 'Southampton', '2017-2018'),
( 'Stoke City', '2017-2018'),
( 'West Bromwich Albion', '2017-2018')],
names=['team', 'season'])
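With a (team, season) index in place, rows can be looked up by label. The following is a minimal sketch on a made-up frame with the same index layout; the team labels and goal counts are illustrative only.

```python
import pandas as pd

# Toy frame with the same (team, season) index layout; values are made up
toy = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "season": ["2016-2017", "2017-2018", "2016-2017", "2017-2018"],
    "goals": [50.0, 60.0, 40.0, 30.0],
}).set_index(["team", "season"])

# One team in one season: pass the full (team, season) tuple
goals_a_1718 = toy.loc[("A", "2017-2018"), "goals"]

# Every season for one team: select on the first index level
team_a = toy.loc["A"]

# Every team in one season: cross-section on the 'season' level
season_1718 = toy.xs("2017-2018", level="season")
print(goals_a_1718, team_a, season_1718, sep="\n")
```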
Answering the questions¶
Which team has the highest conversion rate per scoring attempt and is there a correlation between scoring attempts and goals scored?¶
We do this by dividing goals by total scoring attempts. We select only those two columns into a new dataframe and add a new column, 'conv_rate', for the conversion rate.
conv_rate_frame = plData[['goals','total_scoring_att']].copy()
conversion_rate = conv_rate_frame['goals'] / conv_rate_frame['total_scoring_att']
conv_rate_frame['conv_rate'] = conversion_rate
conv_rate_frame.sort_values(by='conv_rate', ascending=False,inplace= True)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 8))
conv_rate_frame.plot.bar(ax=ax, y='conv_rate')
plt.xlabel('Teams and Seasons')
plt.ylabel('Conversion Rate')
plt.title('Premier League Goal Conversion Rates from 2013/14 to 2017/18')
plt.tight_layout()
plt.show()
From the above bar graph we can see that Manchester City in the 2017-2018 season were the most clinical team at scoring goals with a conversion rate of about 16%. The lowest conversion rate was held by Norwich City in 2013-2014 with just about 6% of their attempts leading to goals.
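The highest and lowest bars can also be read off programmatically rather than by eye. This sketch assumes a frame shaped like `conv_rate_frame`; the `toy` frame and its numbers are made up for illustration.

```python
import pandas as pd

# Toy stand-in for conv_rate_frame; the numbers are illustrative only
toy = pd.DataFrame(
    {"goals": [30.0, 100.0, 45.0], "total_scoring_att": [500.0, 650.0, 450.0]},
    index=pd.MultiIndex.from_tuples(
        [("A", "2013-2014"), ("B", "2017-2018"), ("C", "2015-2016")],
        names=["team", "season"],
    ),
)
toy["conv_rate"] = toy["goals"] / toy["total_scoring_att"]

# idxmax / idxmin return the (team, season) label of the extreme value
best = toy["conv_rate"].idxmax()
worst = toy["conv_rate"].idxmin()
print(best, round(toy.loc[best, "conv_rate"], 3))
print(worst, round(toy.loc[worst, "conv_rate"], 3))
```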
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(conv_rate_frame['total_scoring_att'], conv_rate_frame['goals'])
plt.xlabel('Total Scoring Attempts')
plt.ylabel('Goals')
plt.title('Scatter Plot of Goals vs Total Scoring Attempts')
plt.grid(True)
plt.tight_layout()
correlation = conv_rate_frame['total_scoring_att'].corr(conv_rate_frame['goals'])
plt.annotate(f'Correlation: {correlation:.4f}', xy=(0.5, 0.95), xycoords='axes fraction', ha='center', fontsize=12)
plt.show()
The second plot is a scatter plot of goals against total scoring attempts, annotated with the correlation between the two. The correlation is relatively high at 0.8573, which indicates a strong positive relationship. From the graph we can conclude that, typically, a team with more scoring attempts also scores more goals.
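For reference, the value returned by `Series.corr` is the Pearson correlation coefficient by default, which is the covariance of the two columns divided by the product of their standard deviations. The sketch below checks this on made-up numbers (not rows from the dataset).

```python
import pandas as pd

# Illustrative numbers only, not rows from the dataset
attempts = pd.Series([400.0, 450.0, 520.0, 600.0, 680.0])
goals = pd.Series([35.0, 48.0, 50.0, 71.0, 80.0])

# Pearson r = covariance(x, y) / (std(x) * std(y))
r_manual = attempts.cov(goals) / (attempts.std() * goals.std())

# Series.corr uses the Pearson method by default
r_pandas = attempts.corr(goals)
print(round(r_manual, 4), round(r_pandas, 4))
```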
What is the outcome of penalty kicks against a team?¶
We find the outcome of penalties by recording all of the possible outcomes in a new dataframe. Only the statistics for the defending team are recorded, because counting both attacking and defending penalties would track each penalty twice. With this in mind, the new dataframe is created using just 'penalties conceded', 'penalty goals conceded', and 'penalty saves'.
From this dataframe, a new column called 'penalties missed against' can be derived. It is the result of subtracting 'penalty goals conceded' and 'penalty saves' from 'penalties conceded'.
pk_result_frame = plData[['penalty_conceded','pen_goals_conceded','penalty_save']].copy()
pen_miss_against = plData['penalty_conceded'] - plData['penalty_save'] - plData['pen_goals_conceded']
pk_result_frame['pen_miss_ag'] = pen_miss_against
labels = ['Penalties Scored', 'Penalties Saved', 'Penalties Missed']
sizes = [pk_result_frame['pen_goals_conceded'].sum(), pk_result_frame['penalty_save'].sum(), pk_result_frame['pen_miss_ag'].sum()]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.1f%%', colors=['lime', 'orange', 'red'],wedgeprops=dict(edgecolor='black', linewidth=1))
ax.axis('equal')
plt.title('Distribution of Penalty Outcomes')
plt.show()
This pie chart shows the results of every penalty taken over the five years that span the dataset. Throughout those five years, 76.6% of the penalties taken were scored. 17% of the penalties taken ended up being saved by the opposition goalkeeper. Finally, 6.4% of the penalties taken missed the goal entirely, either going wide or above the crossbar.
According to this website: https://www.researchgate.net/publication/308873835_A_Quantitative_Analysis_of_Penalty-Kicks_in_the_English_Premier_League
Average penalty statistics in the Premier League are:
- Scored: 79.5%
- Saved: 16.2%
- Missed: 4.3%
This matches the statistics derived from the dataset fairly closely: the largest difference between the two sources is 2.9 percentage points, for penalties scored.
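The gap for each outcome can be worked out directly from the two sets of percentages quoted above. This is a small arithmetic sketch; the dictionary names are illustrative.

```python
# Percentages quoted above: this dataset vs. the cited Premier League averages
dataset = {"scored": 76.6, "saved": 17.0, "missed": 6.4}
cited = {"scored": 79.5, "saved": 16.2, "missed": 4.3}

# Absolute gap per outcome, in percentage points
gaps = {k: round(abs(dataset[k] - cited[k]), 1) for k in dataset}
largest = max(gaps, key=gaps.get)
print(gaps, largest)
```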
Which team has had the highest number of cards per game, and is there a relationship between win-loss ratio and discipline?¶
We need a new dataframe containing 'wins', 'losses', 'yellow cards', and 'red cards'. A new column called 'total cards' is the sum of 'yellow cards' and 'red cards'. Another new column called 'win loss ratio' is 'wins' divided by 'losses'. Because every team plays 38 league games per season, ranking teams by total cards gives the same order as ranking them by cards per game.
wl_card_frame = plData[['wins','losses','total_yel_card','total_red_card']].copy()
# Total cards = yellow cards + red cards
wl_card_frame['tot_cards'] = wl_card_frame['total_yel_card'] + wl_card_frame['total_red_card']
# Win-loss ratio = wins / losses (draws are not counted)
wl_card_frame['wl_ratio'] = wl_card_frame['wins'] / wl_card_frame['losses']
wl_card_frame.sort_values(by='tot_cards', ascending=False,inplace= True)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 8))
wl_card_frame.plot.bar(ax=ax, y='tot_cards')
plt.xlabel('Teams and Seasons')
plt.ylabel('Total Cards')
plt.title('Premier League Yellow and Red Cards from 2013/14 to 2017/18')
plt.tight_layout()
plt.show()
The bar graph above shows the total cards accumulated by each team in descending order. Sunderland in 2014-15 received the most, with 96 total cards. The most disciplined team was Arsenal in 2015-16, with only 44 cards throughout the season.
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(wl_card_frame['tot_cards'], wl_card_frame['wl_ratio'])
plt.xlabel('Total Cards')
plt.ylabel('Win-Loss Ratio')
plt.title('Scatter Plot of Win-Loss Ratio vs Total Cards')
plt.grid(True)
plt.tight_layout()
correlation = wl_card_frame['tot_cards'].corr(wl_card_frame['wl_ratio'])
plt.annotate(f'Correlation: {correlation:.4f}', xy=(0.5, 0.95), xycoords='axes fraction', ha='center', fontsize=12)
plt.show()
The second plot is a scatterplot showing the relationship between a team's total cards and its win-loss ratio. As the scatterplot shows, there is not much of a relationship between the two statistics. The correlation confirms this: at -0.1186, it is a very weak negative relationship. The negative sign means that more cards tend to go with a lower win-loss ratio, but because the correlation is so weak, this should not be given much weight.
How do teams typically score their goals?¶
This question needs a dataframe made up of all the goal-scoring categories in the dataset. The new dataframe has 'headed goals', 'penalty goals', 'freekick goals', 'fastbreak goals', 'own goals', and a new column called 'other', which is 'goals' minus all of the other columns.
goal_method_frame = plData[['goals','att_hd_goal','att_pen_goal','att_freekick_goal','goal_fastbreak','own_goals']].copy()
other_goals = plData['goals'] - plData['att_hd_goal'] - plData['att_pen_goal'] - plData['att_freekick_goal'] - plData['goal_fastbreak'] - plData['own_goals']
goal_method_frame['other_goals'] = other_goals
labels = ['Headed Goals', 'Penalty Goals', 'Freekick Goals', 'Fast Break Goals','Own Goals','Other Goals']
sizes = [goal_method_frame['att_hd_goal'].sum(), goal_method_frame['att_pen_goal'].sum(), goal_method_frame['att_freekick_goal'].sum(), goal_method_frame['goal_fastbreak'].sum(), goal_method_frame['own_goals'].sum(),goal_method_frame['other_goals'].sum()]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.2f%%', startangle=90,colors=['dodgerblue', 'lime', 'gold','darkorange','firebrick','mediumorchid'],wedgeprops=dict(edgecolor='black', linewidth=1))
ax.axis('equal')
plt.title('Distribution of Scoring Methods')
plt.show()
As seen in the above pie chart, the majority of goals (65.14%) fall into the 'other' category. These are mostly regular open-play goals that do not fit into any of the listed categories. The second most common method of scoring is through headers (17.18%). These headed goals can come from a variety of situations, such as crosses, corner kicks, or freekicks. Next are goals from penalty kicks, which are awarded for fouls or handling offences in the box and account for 6.76% of the goals over the five seasons. Next are goals from a fast break or counter attack (4.62%), followed by own goals, where a player puts the ball into their own net (3.66%). Finally, the least common method of scoring is a direct freekick (2.65%), where the ball is scored directly from the freekick. Overall, only about 10% of goals come directly from a dead ball (penalties and direct freekicks). The remaining roughly 90% is an upper bound on open-play goals, because some headed goals come from corners and freekicks.
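The dead-ball share follows from the pie-chart percentages reported above. A small arithmetic sketch (the dictionary keys are illustrative labels, not column names from the dataset):

```python
# Pie-chart shares reported above, in percent
shares = {
    "other": 65.14, "headed": 17.18, "penalty": 6.76,
    "fast_break": 4.62, "own_goal": 3.66, "free_kick": 2.65,
}

# Goals scored directly from a dead ball: penalties and direct free kicks
dead_ball = round(shares["penalty"] + shares["free_kick"], 2)

# Everything else; this still includes headers from corners and free kicks
remainder = round(sum(shares.values()) - dead_ball, 2)
print(dead_ball, remainder)
```

The shares add up to 100.01 rather than 100 because each one was rounded to two decimals by the chart.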
Which team has the highest number of saves, and is there a correlation between the number of saves and the number of clean sheets?¶
I will make a new dataframe that consists of 'saves' and 'clean sheets'. A clean sheet is a game in which the team concedes no goals, so it should be a reasonably good descriptor of goalkeeper performance.
save_frame = plData[['saves','clean_sheet']].copy()
save_frame.sort_values(by='saves', ascending=False,inplace= True)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(18, 8))
save_frame.plot.bar(ax=ax, y='saves')
plt.xlabel('Teams and Seasons')
plt.ylabel('Total Saves')
plt.title('Premier League Saves from 2013/14 to 2017/18')
plt.tight_layout()
plt.show()
As we can see from the bar chart, the team with the most saves was Sunderland in 2016-2017, with 176 saves over the season. That Sunderland side was a very poor team that finished at the bottom of the Premier League. The team with the fewest saves was Manchester City in 2017-2018, with just 62. That Manchester City team finished in first place and won the Premier League, breaking many records. A slight pattern can be seen in the bar graph: better teams typically make fewer saves, while worse teams make more. We will look at this further with the correlation.
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(save_frame['saves'], save_frame['clean_sheet'])
plt.xlabel('Saves')
plt.ylabel('Clean Sheets')
plt.title('Scatter Plot of Clean Sheets vs Goalkeeper Saves')
plt.grid(True)
plt.tight_layout()
correlation = save_frame['saves'].corr(save_frame['clean_sheet'])
plt.annotate(f'Correlation: {correlation:.4f}', xy=(0.5, 0.95), xycoords='axes fraction', ha='center', fontsize=12)
plt.show()
Shown above is a scatter plot of clean sheets against goalkeeper saves. The points loosely follow a downward-sloping line, which is confirmed by the correlation of -0.4765. This is a moderate negative relationship: teams that make more saves typically keep fewer clean sheets. Real Premier League examples help explain it. Good teams often have strong defences that protect their goalkeeper from having to make many saves, and those defences also tend to keep more clean sheets. Bad teams often have leaky defences, so their goalkeepers face many shots that need saving and they keep fewer clean sheets. This reading of the correlation also agrees with the bar chart above.
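One rough way to compare the three correlations reported so far is to square them. For a straight-line fit between two variables, r squared is the share of variance in one variable that the other accounts for. The sketch below only reuses the reported values; the dictionary keys are illustrative labels.

```python
# Correlations reported in the sections above
reported = {
    "attempts_vs_goals": 0.8573,
    "cards_vs_win_loss": -0.1186,
    "saves_vs_clean_sheets": -0.4765,
}

# r squared for each pair, as a fraction between 0 and 1
r_squared = {name: r ** 2 for name, r in reported.items()}
for name, value in r_squared.items():
    print(f"{name}: {value:.3f}")
```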
What is the distribution of types of passes for every team?¶
This will be done by compiling all types of passes into another dataframe. The columns will be 'total passes', 'through passes', 'long balls', 'back passes', and 'crosses'. I will also make a column called 'other' to account for all other types of passes.
pass_frame = plData[['total_pass','total_through_ball','total_long_balls','backward_pass','total_cross']].copy()
other = plData['total_pass'] - plData['total_through_ball'] - plData['total_long_balls'] - plData['backward_pass']- plData['total_cross']
pass_frame['other_pass'] = other
labels = ['Through Balls', 'Long Balls', 'Back Passes', 'Crosses','Other Passes']
sizes = [pass_frame['total_through_ball'].sum(), pass_frame['total_long_balls'].sum(), pass_frame['backward_pass'].sum(), pass_frame['total_cross'].sum(),pass_frame['other_pass'].sum()]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.2f%%',labeldistance=1.35, pctdistance=1.2, startangle=180,colors=['#FF5733', '#33FF57', '#3373FF', '#FF33C7', '#FFD700'],wedgeprops=dict(edgecolor='black', linewidth=1))
ax.axis('equal')
plt.title('Distribution of Passes')
plt.show()
As seen in the pie chart above, the most common type of pass is 'other passes' (65.58%). These are passes that go sideways or forward without falling into one of the named categories. Next are back passes, which account for 15.04% of passes. These are passes played backwards, typically when a player has no forward pass available, and they could be common among teams that like to keep possession of the ball. Next are long balls at 14.46%, which are common among more defensive-minded teams. These teams don't tend to hold the ball for long and usually send it far up the field in an attempt to create chances. Then there are crosses and through balls, which account for 4.59% and 0.33% respectively. These two types of passes are typically used to create goals directly, which is why they are not as common as the other types. Crosses are played from wide areas in the hope that an attacker can put away the chance. Through balls are played through the last line of defence and are usually very difficult to complete, which could be why they are so uncommon.
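The pie chart aggregates the whole league. To compare individual teams, each pass type can be divided by that team's own total. This is a minimal sketch on a made-up frame; the team names and counts are hypothetical, and only two of the pass columns are included for brevity.

```python
import pandas as pd

# Toy stand-in for pass_frame; team names and counts are made up
toy = pd.DataFrame(
    {
        "total_pass": [20000.0, 12000.0],
        "total_long_balls": [2000.0, 3000.0],
        "backward_pass": [4000.0, 1500.0],
    },
    index=pd.Index(["Possession FC", "Direct United"], name="team"),
)

# Divide each pass type by that row's total to get a per-team share in percent
pass_types = ["total_long_balls", "backward_pass"]
team_shares = toy[pass_types].div(toy["total_pass"], axis=0) * 100
print(team_shares.round(1))
```

On the real frame, summing over seasons first with `pass_frame.groupby(level='team').sum()` should give one row per team before the division.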
What is the distribution of the results of a team's scoring attempts?¶
This will be done by creating a new dataframe from the columns 'total scoring attempts', 'goals', 'saves', 'outfield blocks', 'on target shots', and 'hit woodwork'. An 'off target shots' column has to be derived by arithmetic between the existing columns, because some of these statistics encompass others.
score_frame = plData[['total_scoring_att']].copy()
off_frame = plData['total_scoring_att'] - plData['ontarget_scoring_att'] - plData['hit_woodwork']
hit_frame = plData['hit_woodwork'].copy()
goals = plData['goals'].copy()
saved = plData['saves'].copy()
blocked = plData['outfielder_block'].copy()
score_frame['off_frame'] = off_frame
score_frame['hit_frame'] = hit_frame
score_frame['goals'] = goals
score_frame['saved'] = saved
score_frame['blocked'] = blocked
labels = ['Off Frame', 'Hit Woodwork', 'Goals', 'Saved','Blocked']
sizes = [score_frame['off_frame'].sum(), score_frame['hit_frame'].sum(), score_frame['goals'].sum(), score_frame['saved'].sum(),score_frame['blocked'].sum()]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.2f%%', startangle=180,colors=['#92B5E7', '#7F8BC0', '#64B590', '#E48696', '#DFA163'],wedgeprops=dict(edgecolor='black', linewidth=1))
ax.axis('equal')
plt.title('Distribution of Scoring Attempts')
plt.show()
This pie chart shows how scoring attempts typically end. Of all attempts, 52.71% are off target. This is made up of shots that hit the woodwork (post or crossbar) (1.98%) and shots that go wide of or over the goal (50.73%). For attempts counted as on target there are three outcomes: the shot is blocked by an opposing outfield player (21.03%), saved by the goalkeeper (17.95%), or results in a goal (8.30%). This helps explain why some soccer games are so low scoring. Over half the time, a player does not get the shot on target, and when a shot is on target, there is only a 17.55% chance of scoring. It also shows that over the five-season span, defenders have been very good at directly preventing goals.
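The on-target scoring chance quoted above is a conditional share: goals divided by the attempts this breakdown treats as on target. A small arithmetic sketch using the reported pie-chart percentages:

```python
# Pie-chart shares reported above, in percent of all scoring attempts
off_target = 50.73
hit_woodwork = 1.98
blocked = 21.03
saved = 17.95
goal = 8.30

# Attempts treated as on target in this breakdown
on_target = blocked + saved + goal

# Chance of a goal given that the attempt was on target, in percent
goal_given_on_target = goal / on_target * 100
print(round(on_target, 2), round(goal_given_on_target, 2))
```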
Conclusion¶
Overall, this project has left me with lots of new ideas on how to interpret data. I have used new skills such as working with the pandas and matplotlib libraries. I am really interested in the topic that I chose, and in my own free time I might try to come up with more questions to answer about this dataset. In the future, I hope to get a data science job in sports, because this project was really enjoyable for me. This dataset has many more insights left to derive, many of which would be worth highlighting.