Window Function in SQL (Chapter 7)

Jinendrasingh
8 min readSep 13, 2023

--

“Unlock the Power of Analytical Magic in Your Database”

In the world of SQL, analytical tasks are made elegant and efficient with the enchanting capabilities of Window Functions. Imagine effortlessly calculating running totals, ranking data, and peeking into the past and future of your dataset. In this blog, we embark on a journey into the realm of Window Functions, exploring their spells of insight and mastery in the realm of SQL databases. Get ready to wield the magic of analytics like never before!

A window function in SQL is a specialized function that performs calculations on a specific set of rows related to the current row within the result set of a query, allowing us to perform analytical tasks like ranking, aggregation, and running calculations across rows.

Unlike aggregate functions like SUM, COUNT, or AVG, which produce a single result for each group of rows, window functions return a value for each row in the result set based on a specific window or frame of rows.

Window functions are often used to perform calculations that involve comparing a row to other rows in the result set, such as calculating running totals, ranking rows based on specific criteria, finding the difference between the current row and the previous row, and many other analytical tasks.

Let's first understand the frame and window of rows.

In SQL window functions, a “frame” and a “window of rows” refer to the subset of rows within the result set that the window function operates on for each row. These concepts are essential for understanding how window functions work and how they calculate results based on a specific set of rows.

Here’s a breakdown of these terms:

Window of Rows:

The “window of rows” or “window” refers to the group of rows within the result set that is considered when performing calculations with a window function. This window is defined using the OVER clause of the window function and typically includes the current row. The window can be further customized to include other rows relative to the current row. It is determined by the PARTITION BY, ORDER BY, and ROWS BETWEEN clauses within the OVER clause.

Frame:

The “frame” is a specific subset of rows within the window of rows that is used for calculations. It is defined by the ROWS BETWEEN clause within the OVER clause. The ROWS BETWEEN clause specifies how many rows preceding and following the current row should be included in the frame. The frame allows us to control which rows are considered when applying the window function.

Common frame specifications include:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: This includes all rows from the start of the window (unbounded preceding) up to and including the current row.

ROWS BETWEEN n PRECEDING AND m FOLLOWING: This includes a specified number of rows before (preceding) and after (following) the current row.

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: This includes all rows in the window, from the start to the end.

Aggregate functions used with the OVER() clause:

  1. This example uses the SUM() function to calculate the running total of a numeric column, such as sales, within a specific window of rows ordered by date.
SELECT
date,
sales,
SUM(sales) OVER (ORDER BY date) AS running_total
FROM
sales_data;

2. In this example, the AVG() function, calculates the average sales within each product category. The PARTITION BY clause is used to group the rows by product category.

SELECT
product_category,
sales,
AVG(sales) OVER (PARTITION BY product_category) AS avg_sales_by_category
FROM
sales_data;

3. Here, the MAX() function finds the maximum sales within a rolling window of the last 3 rows, ordered by date.

SELECT
date,
sales,
MAX(sales) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS max_sales_in_window
FROM
sales_data;

RANK(), DENSE_RANK(), and ROW_NUMBER() :

These window functions in SQL are used to assign rankings or unique identifiers to rows within a result set. They are often used in analytical queries to determine the relative position of rows based on specific criteria. Here’s an explanation of each function:

RANK()

This window function assigns a unique rank to each row within a result set based on the values in one or more columns. Rows with the same values receive the same rank, and the next rank is skipped. If multiple rows share the same rank, the next rank is adjusted accordingly.

SELECT
student_name,
exam_score,
RANK() OVER (ORDER BY exam_score DESC) AS rank
FROM
exam_results;

In this query, the RANK() function assigns a rank to each student based on their exam scores in descending order. If two students have the same exam score, they will receive the same rank, and the next rank will be skipped.

DENSE_RANK()

DENSE_RANK() is similar to RANK(), but it assigns ranks without skipping any values, even if multiple rows share the same rank. This means that consecutive rows with the same values will have the same rank, and there are no gaps in ranking.

SELECT
student_name,
exam_score,
DENSE_RANK() OVER (ORDER BY exam_score DESC) AS dense_rank
FROM
exam_results;

In this query, the DENSE_RANK() function assigns ranks to students based on their exam scores in descending order. If two students have the same exam score, they will receive the same rank, and there are no gaps in ranking.

ROW_NUMBER()

ROW_NUMBER() is a window function that assigns a unique sequential number to each row within a result set. It does not consider the values in any column but simply assigns an incrementing number to each row.

SELECT
student_name,
ROW_NUMBER() OVER (ORDER BY student_name) AS row_num
FROM
students;

In this query, the ROW_NUMBER() function assigns a unique row number to each student based on their name, ordered alphabetically. Each row receives a different sequential number.

FIRST_VALUE() and LAST_VALUE()

In SQL, FIRST_VALUE() and LAST_VALUE() are window functions that allow us to retrieve the first and last values in a specified window or frame of rows within a result set, respectively.

Here’s a brief explanation of each function:

FIRST_VALUE()

The FIRST_VALUE() function returns the value of a specified column from the first row within the defined window or frame of rows. we can use this function to extract the first value based on a specific ordering or partitioning criteria.

SELECT
product_name,
price,
FIRST_VALUE(price) OVER (ORDER BY price ASC) AS first_price
FROM
products;

In this query, FIRST_VALUE() is used to retrieve the lowest price (price) for each product by ordering the rows in ascending order of price.

LAST_VALUE()

The LAST_VALUE() function, on the other hand, returns the value of a specified column from the last row within the defined window or frame of rows. We can use this function to extract the last value based on a specific ordering or partitioning criteria.

SELECT
product_name,
price,
LAST_VALUE(price) OVER (ORDER BY price ASC) AS last_price
FROM
products;

It’s important to note that the behavior of these functions can be influenced by the ROWS BETWEEN clause within the OVER() clause.

By default, when we use FIRST_VALUE() or LAST_VALUE() without specifying the ROWS BETWEEN clause in the OVER() clause, the entire partition or result set is considered as the window frame. This means that the function will return the first and last values from the entire partition or result set, not just the adjacent rows.

To demonstrate, consider the following example:

Let’s say we have a table sales_data that records the sales of products in different regions, and we want to find the first and last sales amounts for each region. Here’s a query that uses FIRST_VALUE() and LAST_VALUE() without specifying the ROWS BETWEEN clause:

SELECT
region,
product_name,
sales_amount,
FIRST_VALUE(sales_amount) OVER (PARTITION BY region ORDER BY sales_date) AS first_sale_amount,
LAST_VALUE(sales_amount) OVER (PARTITION BY region ORDER BY sales_date) AS last_sale_amount
FROM
sales_data;

In this query, we are trying to find the first and last sale amounts for each region. However, without the ROWS BETWEEN clause, both FIRST_VALUE() and LAST_VALUE() will consider the entire partition (in this case, each region) as the window frame. This will not give us the expected result because it will return the same value for both the first and last sale amounts within each region.

To get the correct result, we need to specify the ROWS BETWEEN clause to define the frame explicitly. For example, if we want to find the first and last sale amounts within each region based on the sales_date, we can use the ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING clause, like this:

SELECT
region,
product_name,
sales_amount,
FIRST_VALUE(sales_amount) OVER (PARTITION BY region ORDER BY sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_sale_amount,
LAST_VALUE(sales_amount) OVER (PARTITION BY region ORDER BY sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_sale_amount
FROM
sales_data;

With the ROWS BETWEEN clause specifying UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, we ensure that the entire partition (all rows within the same region) is considered for both the first and last values within each region. This way, we get the correct first and last sale amounts for each region based on the sales_date.

NTH_VALUE()

The NTH_VALUE() function is a window function in SQL that allows us to retrieve the value of a specified column from the nth row within a defined window or frame of rows within a result set. It is particularly useful when we want to extract a value from a specific position within an ordered set of rows.

NTH_VALUE(column_name, n) OVER (
PARTITION BY partition_column
ORDER BY order_column
ROWS BETWEEN unbounded preceding AND unbounded following
) AS nth_value_result
  • column_name: The column from which we want to retrieve values.
  • n: The position of the row from which we want to retrieve the value.
  • PARTITION BY: Optional clause to divide the result set into partitions or groups.
  • ORDER BY: Specifies the order of rows within each partition.
  • ROWS BETWEEN unbounded preceding AND unbounded following: Defines the entire window or frame of rows for the function.

Here’s an example of how to use the NTH_VALUE() function:

Suppose we have a table called employees that contains information about employees and their salaries. We want to find the salary of the third-highest-paid employee within each department. We can use the NTH_VALUE() function to achieve this:

SELECT
department,
employee_name,
salary,
NTH_VALUE(salary, 3) OVER (PARTITION BY department ORDER BY salary DESC) AS third_highest_salary
FROM
employees;

LAG() and LEAD()

The LAG() and LEAD() functions are window functions in SQL that allow us to access the values from the previous and next rows within a specified window or frame of rows in a result set, respectively. These functions are especially useful for performing calculations that involve comparing the current row with its neighboring rows.

Here’s an explanation of each function:

LAG()

The LAG() function retrieves the value of a specified column from the previous row within a defined window or frame of rows. It is often used to calculate the difference or change between the current row and the previous row.

#syntex
LAG(column_name, n) OVER (
PARTITION BY partition_column
ORDER BY order_column
) AS lagged_value

#Example:
SELECT
employee_name,
salary,
LAG(salary, 1) OVER (PARTITION BY department ORDER BY hire_date) AS previous_salary
FROM
employees;

In this query, the LAG() function retrieves the salary of the previous employee in the same department based on the hire date order.

LEAD()

The LEAD() function retrieves the value of a specified column from the next row within a defined window or frame of rows. It is the counterpart to LAG() and is often used to predict or access future values.

#syntex
LEAD(column_name, n) OVER (
PARTITION BY partition_column
ORDER BY order_column
) AS leading_value

#Example
SELECT
employee_name,
salary,
LEAD(salary, 1) OVER (PARTITION BY department ORDER BY hire_date) AS next_salary
FROM
employees;

In this query, the LEAD() function retrieves the salary of the next employee in the same department based on the hire date order.

“Empower Your SQL Skills”

Discover more SQL tips and tricks to enhance your data querying prowess. Stay tuned for more insightful articles and SQL magic on our blog!

--

--

Jinendrasingh
Jinendrasingh

Written by Jinendrasingh

An Aspiring Data Analyst and Computer Science grad. Sharing insights and tips on data analysis through my blog. Join me on my journey!

No responses yet