Enhancing Data Processing with Aggregate Functions in Snowflake Snowpark

In data analysis, efficiently summarizing and processing large datasets is crucial for gaining actionable insights. Snowflake Snowpark, combined with the power of Python, offers a robust environment for performing advanced data operations, including the use of aggregate functions. These functions allow you to perform calculations across multiple rows and return summarized results, making them indispensable for data processing and analysis.

Why this blog?

Navigating large datasets can be daunting, especially when trying to extract meaningful insights. This comprehensive guide dives deep into the use of aggregate functions in Snowflake Snowpark, showcasing how they can streamline your data workflows. Whether you’re new to Snowpark or an experienced user, this article will equip you with the knowledge to enhance your data processing and achieve more efficient, accurate results.

Summarizing Data Efficiently with SQL and Python

When working with large datasets, the ability to efficiently summarize and analyze data is crucial. Aggregate functions are key to this process, offering the ability to perform calculations across sets of values and return a single, summarized result. These functions are especially powerful when used with the GROUP BY clause, allowing you to compute meaningful insights from grouped rows in your SQL queries.

Snowflake Snowpark, a developer framework for Snowflake, enhances this capability by integrating seamlessly with Python, providing a robust environment for data manipulation and analysis. By leveraging Snowflake’s processing power through Snowpark’s Python API, you can unlock advanced analytics and simplify complex data operations.

Let us explore how to effectively use aggregate functions in Snowflake Snowpark with Python, enabling you to elevate your data analysis to the next level.

1. Aggregate Functions in Snowpark

The DataFrame.agg method in Snowpark is used to aggregate data within a DataFrame. This method accepts any valid Snowflake aggregate function names as input, allowing you to perform calculations on multiple rows and return a single result.

There are multiple approaches to passing DataFrame columns to the DataFrame.agg method for executing aggregate calculations:

  • A Column object
  • A tuple where the first element is either a column object or name, and the second element specifies the aggregate function
  • A list combining any of the above
  • A dictionary that maps column names to the corresponding aggregate functions

2. Demonstration of Aggregate Functions Using the DataFrame.agg Method in Snowpark

To execute aggregate calculations using the DataFrame.agg method, follow these steps:

STEP 1: Initiate a connection with Snowflake from Snowpark by utilizing the Session class.
STEP 2: Import the required aggregate functions, such as min, max, and sum, from the snowflake.snowpark.functions package.
STEP 3: Create a DataFrame that holds the data on which the aggregate functions will be applied.
STEP 4: Apply aggregate calculations to the DataFrame using the DataFrame.agg method.
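The first two steps can be sketched as follows. The connection parameters shown (account, user, and so on) are placeholders, not real values; replace them with your own credentials before running:

```python
# STEP 1: Initiate a connection with Snowflake using the Session class
from snowflake.snowpark import Session

# Placeholder connection parameters -- substitute your own account details
connection_parameters = {
    "account": "<your_account_identifier>",
    "user": "<your_username>",
    "password": "<your_password>",
    "warehouse": "<your_warehouse>",
    "database": "<your_database>",
    "schema": "<your_schema>",
}

session = Session.builder.configs(connection_parameters).create()

# STEP 2: Import the required aggregate functions
from snowflake.snowpark.functions import col, min, max, avg, sum
```

Note that Snowpark's min, max, and sum shadow the Python built-ins of the same name, so import only what you need.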

# Demonstration
Consider the following EMPLOYEE data for demonstrating aggregate functions in Snowpark:

employee_data = [
    [1,'TONY',24000],
    [2,'STEVE',17000],
    [3,'BRUCE',9000],
    [4,'WANDA',20000],
    [5,'VICTOR',12000],
    [6,'STEPHEN',10000]
]

employee_schema = ["EMP_ID", "EMP_NAME", "SALARY"]

df_employee = session.create_dataframe(employee_data, schema=employee_schema)
df_employee.show()

Output:
------------------------------------
|"EMP_ID"  |"EMP_NAME"  |"SALARY"  |
------------------------------------
|1         |TONY        |24000     |
|2         |STEVE       |17000     |
|3         |BRUCE       |9000      |
|4         |WANDA       |20000     |
|5         |VICTOR      |12000     |
|6         |STEPHEN     |10000     |
------------------------------------

2.1. Passing a DataFrame Column Object

Before performing aggregate calculations, import the necessary aggregate function methods from the snowflake.snowpark.functions package.

# Importing the Aggregate Function methods
from snowflake.snowpark.functions import col, min, max, avg

# Passing a Column object to DataFrame.agg method
df_employee.agg(max("SALARY"), min("SALARY")).show()

Output:
-----------------------------
| MAX(SALARY) | MIN(SALARY) |
-----------------------------
| 24000       | 9000        |
-----------------------------

df_employee.agg(max(col("SALARY")), min(col("SALARY"))).show()

Output:
-----------------------------
| MAX(SALARY) | MIN(SALARY) |
-----------------------------
| 24000       | 9000        |
-----------------------------

2.2. Passing a Tuple with Column Name and Aggregate Function

You can also pass a tuple containing the column name and aggregate function to the DataFrame.agg method:

df_employee.agg(("SALARY", "max"), ("SALARY", "min")).show()

Output:
-----------------------------
| MAX(SALARY) | MIN(SALARY) |
-----------------------------
| 24000       | 9000        |
-----------------------------

2.3. Passing a List of Column Objects and Tuples

You can pass a list of columns and tuples to the DataFrame.agg method:

df_employee.agg([("SALARY", "min"), ("SALARY", "max"), avg(col("SALARY"))]).show()

Output:
----------------------------------------
| MIN(SALARY) | MAX(SALARY) | AVG(SALARY) |
----------------------------------------
| 9000        | 24000       | 15333.33    |
----------------------------------------

2.4. Passing a Dictionary Mapping Column Name to Aggregate Function

Alternatively, you can pass a dictionary mapping column names to aggregate functions:

df_employee.agg({"SALARY": "min"}).show()

Output:
---------------
| MIN(SALARY) |
---------------
| 9000        |
---------------

3. Aggregate Functions Using the DataFrame.select Method in Snowpark

The DataFrame.select method allows you to create a new DataFrame by specifying particular column expressions. You can incorporate aggregate functions within these expressions to select and manipulate data effectively.

# Aggregate functions using select method
df_employee.select(min("SALARY"), max("SALARY")).show()

Output:
-----------------------------
| MIN(SALARY) | MAX(SALARY) |
-----------------------------
| 9000        | 24000       |
-----------------------------

4. Renaming the Output of Aggregate Fields

You can rename the output fields generated by aggregate functions to new column names by using the Column.as_ or Column.alias methods, as demonstrated below:

# Renaming column names
df_employee.agg(min("SALARY").as_("min_sal"), max("SALARY").alias("max_sal")).show()

Output:
---------------------
| MIN_SAL | MAX_SAL |
---------------------
| 9000    | 24000   |
---------------------

df_employee.select(min("SALARY").as_("MIN_SAL"), max("SALARY").alias("MAX_SAL")).show()

Output:
---------------------
| MIN_SAL | MAX_SAL |
---------------------
| 9000    | 24000   |
---------------------


5. Using the Result of an Aggregate Function as an Input

To illustrate this concept, consider a scenario where you need to retrieve details of the employee with the highest salary. This can be achieved with the following SQL query:

-- Retrieve employee details with the highest salary
SELECT * FROM EMPLOYEES WHERE SALARY IN (
  SELECT MAX(SALARY) FROM EMPLOYEES
);

In this example:

  • The maximum salary in the table is calculated using an aggregate function.
  • The calculated salary is then used as a filter to retrieve the corresponding employee details.


Now, let us understand how the same can be achieved in Snowpark.

The DataFrame.collect method in Snowpark is utilized to retrieve the results after all defined calculations on a DataFrame have been executed. This method consolidates the output into a list of Row objects, allowing you to work with the collected data programmatically. Each Row in the list represents a single record from the DataFrame, containing the calculated values from the executed operations.

In the code below, the maximum salary is determined using the DataFrame.agg method, and the result is assigned to a variable by calling the DataFrame.collect method.

max_sal = df_employee.agg(max("SALARY").alias("MAX_SALARY")).collect()

The following code demonstrates that the max_sal variable is of type list and displays the value it holds.

print(type(max_sal))
-----------------
| <class 'list'> |
-----------------
print(max_sal)
----------------------------
| [Row(MAX_SALARY=24000)] |
----------------------------

Extract the max salary amount from the list object as shown below.

max_sal = max_sal[0]['MAX_SALARY']

print(type(max_sal))
-----------------
| <class 'int'> |
-----------------

print(max_sal)
----------
| 24000  |
----------

The `DataFrame.filter` method allows you to extract rows from a DataFrame that satisfy a specified condition, similar to how the `WHERE` clause functions in SQL.

The following code extracts the employee details with max salary.

# Get employee details with max salary

df_employee.filter(col("SALARY") == max_sal).show()
------------------------------------
|"EMP_ID"  |"EMP_NAME"  |"SALARY"  |
------------------------------------
|1         |TONY        |24000     |
------------------------------------
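As an aside, the same result can be obtained without bringing the aggregate value back to the client at all. One possible approach (a sketch, reusing the df_employee DataFrame from above) is to join the employee DataFrame against its own aggregated maximum, which keeps the entire computation inside Snowflake:

```python
from snowflake.snowpark.functions import col, max

# Compute the maximum salary as a single-row DataFrame (no collect needed)
df_max = df_employee.agg(max(col("SALARY")).alias("MAX_SALARY"))

# Join employees against the aggregate so the filtering happens server-side
df_employee.join(df_max, col("SALARY") == col("MAX_SALARY")) \
    .select("EMP_ID", "EMP_NAME", "SALARY") \
    .show()
```

This variant avoids a client round trip, which can matter when the aggregate feeds further DataFrame operations rather than Python-side logic.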
