Python Pandas Dataframe Operations Summary With Examples

A common need for data processing is grouping records by column(s). In today’s article, we’re summarizing the Python Pandas dataframe operations.

These possibilities involve the counting of workers in each department of a company, the measurement of the average salaries of male and female staff in each department, and the calculation of the average salary of staff of various ages.

To be able to manage much of the grouping activities easily, Pandas has a groupby() function. But there are certain functions that are difficult for the feature to perform.

Let’s study these “difficult” activities here and attempt to provide possible solutions. Groupby is one of the most important features of Pandas. It is used according to the split-apply-combine technique for grouping and summarising data.

The Dataset for Demonstrating Dataframe Operations

Taken from https://www.kaggle.com/kaggle/sf-salaries. It is a salary dataset of employees in San Francisco over the duration of 3 years:

import pandas as pd
import numpy as np
df = pd.read_csv("Salaries.csv",index_col="Id")
df
aggregate by Year
Dataset

Aggregation by a single column

For a certain column, you group records and then perform aggregate over each group. This is the easiest application of the technique above.

df.groupby('Year')['Year'].count()
aggregate by Year
aggregate by Year

Aggregation by multiple columns

You group several fields of records and then do an aggregate over each group.

df['BasePay'] = df['BasePay'].apply(pd.to_numeric, downcast="float", errors="coerce")
df.groupby(['Year','JobTitle'],as_index=False).mean()
Multiple aggregates over multiple columns
aggregate by Year and Job

Aggregation by a computed column

The grouping key does not represent informational indexes and must be determined on the basis of current data. A key like that is called a measured column.

Multiple aggregates over multiple columns

aggregate by (Year-2000)
df.groupby(df['BasePay']-100000).mean()
aggregated on counts and means
aggregate by adjusted Basepay

Multiple aggregations

How about working with multiple columns at once? Let’s take a look.

1. An aggregate on any of several columns

(On each of many tables, you execute one kind of aggregate):

df.groupby('Year',as_index=False).agg({'EmployeeName':'count','BasePay':'mean'})
aggregated on counts and means
aggregated on counts and means

In order to get the aggregation completed, related columns and the involving aggregate operations are passed into the function in the form of a dictionary, where columns are keys and aggregates are values.

2. Multiple aggregates on one column

On a single column, you can execute more than one form of aggregate:

df.groupby('Year').agg(['std','mean']).reset_index()
Multiple aggregates over multiple columns
st-dev and mean aggregation on column Year

3. Multiple aggregates over multiple columns

You can summarize numerous columns in which a single column comprises several aggregates. The aggregate operation can also be user-defined:

df.groupby('Year').agg({'BasePay':['count','mean'],'JobTitle':'max'})
Multiple aggregates over multiple columns
Multiple aggregates over multiple columns

Aggregation functions available

Here are the 13 features of aggregation available in Pandas and a brief description of what it does.

  • Mean(): Mean Estimate of Classes
  • Sum(): Compute the sum of the values of a group
  • Size(): Compute sizes of community
  • Count(): Computing the group count
  • Std(): Standard group deviation
  • Var(): Variance computation of groups
  • Sem(): The default average error for classes
  • Describe(): Produces figures that are informative
  • First(): The first of the group values are calculated.
  • Last(): Last estimation of group values

DataFrame Join Operations

We understand the advantages of taking a two-relationship Cartesian product, which gives us all the potential tuples that are coupled together.

In some instances, however, it might not be possible for us to take a Cartesian product where we find immense relationships with thousands of tuples with a considerable amount of attributes.

Join is a variation of the sorting process of a Cartesian product +. Two tuples from separate ties are paired by a join action if a given join condition is fulfilled.

pandas.DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

You can replace the “how” parameter with ‘left’, ‘right’, ‘outer’, or ‘inner’. The default value is ‘left’.

Python Pandas Dataframe Operations Summary

Types Of Joins

Constructing DataFrame from a dictionary.

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 3
1 2 4

Notice that the inferred dtype is int64.

df.dtypes
col1 int64
col2 int64
dtype: object

To enforce a single dtype:

df = pd.DataFrame(data=d, dtype=np.int8)
df.dtypes
col1 int8
col2 int8
dtype: object

Constructing DataFrame from numpy ndarray:

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2
  a b c
0 1 2 3
1 4 5 6
2 7 8 9

Ending Note

The Python dataframe operations is a very vast topic and there are a lot of variables and functions that you can work with. If you’d like to learn more about the dataframe operations, visit the Pandas dataframe documentation here.

By admin

Leave a Reply

%d bloggers like this: