A common need for data processing is grouping records by column(s). In today’s article, we’re summarizing the Python Pandas dataframe operations.
These possibilities involve the counting of workers in each department of a company, the measurement of the average salaries of male and female staff in each department, and the calculation of the average salary of staff of various ages.
To be able to manage much of the grouping activities easily, Pandas has a groupby() function. But there are certain functions that are difficult for the feature to perform.
Let’s study these “difficult” activities here and attempt to provide possible solutions. Groupby is one of the most important features of Pandas. It is used according to the split-apply-combine technique for grouping and summarising data.
The Dataset for Demonstrating Dataframe Operations
Taken from https://www.kaggle.com/kaggle/sf-salaries. It is a salary dataset of employees in San Francisco over the duration of 3 years:
import pandas as pd import numpy as np df = pd.read_csv("Salaries.csv",index_col="Id") df![]()
Aggregation by a single column
For a certain column, you group records and then perform aggregate over each group. This is the easiest application of the technique above.
df.groupby('Year')['Year'].count()![]()
Aggregation by multiple columns
You group several fields of records and then do an aggregate over each group.
df['BasePay'] = df['BasePay'].apply(pd.to_numeric, downcast="float", errors="coerce") df.groupby(['Year','JobTitle'],as_index=False).mean()![]()
Aggregation by a computed column
The grouping key does not represent informational indexes and must be determined on the basis of current data. A key like that is called a measured column.
df.groupby(df['BasePay']-100000).mean()![]()
Multiple aggregations
How about working with multiple columns at once? Let’s take a look.
1. An aggregate on any of several columns
(On each of many tables, you execute one kind of aggregate):
df.groupby('Year',as_index=False).agg({'EmployeeName':'count','BasePay':'mean'})![]()
In order to get the aggregation completed, related columns and the involving aggregate operations are passed into the function in the form of a dictionary, where columns are keys and aggregates are values.
2. Multiple aggregates on one column
On a single column, you can execute more than one form of aggregate:
df.groupby('Year').agg(['std','mean']).reset_index()![]()
3. Multiple aggregates over multiple columns
You can summarize numerous columns in which a single column comprises several aggregates. The aggregate operation can also be user-defined:
df.groupby('Year').agg({'BasePay':['count','mean'],'JobTitle':'max'})![]()
Aggregation functions available
Here are the 13 features of aggregation available in Pandas and a brief description of what it does.
- Mean(): Mean Estimate of Classes
- Sum(): Compute the sum of the values of a group
- Size(): Compute sizes of community
- Count(): Computing the group count
- Std(): Standard group deviation
- Var(): Variance computation of groups
- Sem(): The default average error for classes
- Describe(): Produces figures that are informative
- First(): The first of the group values are calculated.
- Last(): Last estimation of group values
DataFrame Join Operations
We understand the advantages of taking a two-relationship Cartesian product, which gives us all the potential tuples that are coupled together.
In some instances, however, it might not be possible for us to take a Cartesian product where we find immense relationships with thousands of tuples with a considerable amount of attributes.
Join is a variation of the sorting process of a Cartesian product +. Two tuples from separate ties are paired by a join action if a given join condition is fulfilled.
pandas.DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
You can replace the “how” parameter with ‘left’, ‘right’, ‘outer’, or ‘inner’. The default value is ‘left’.
Constructing DataFrame from a dictionary.
d = {'col1': [1, 2], 'col2': [3, 4]} df = pd.DataFrame(data=d) df
col1 col2 0 1 3 1 2 4
Notice that the inferred dtype is int64.
df.dtypes
col1 int64 col2 int64 dtype: object
To enforce a single dtype:
df = pd.DataFrame(data=d, dtype=np.int8) df.dtypes
col1 int8 col2 int8 dtype: object
Constructing DataFrame from numpy ndarray:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c']) df2
a b c 0 1 2 3 1 4 5 6 2 7 8 9
Ending Note
The Python dataframe operations is a very vast topic and there are a lot of variables and functions that you can work with. If you’d like to learn more about the dataframe operations, visit the Pandas dataframe documentation here.