SELECT

Purpose

Use the SELECT statement to retrieve data from tables.

Since TrainDB is implemented based on Apache Calcite, please refer to the SQL Reference of Apache Calcite for exact SELECT queries.

This document focuses on the SELECT APPROXIMATE statement for approximate queries in TrainDB.

Syntax

Diagram

select

selectItemList

selectItem

tableExpression

tableReference

tablePrimary

groupItemList

groupItem

orderItemList

orderItem

Keywords and Parameters

APPROXIMATE

Specify APPROXIMATE to indicate the query is an approximate query. If the following conditions are satisfied, the query is processed as an approximate query. Otherwise, this keyword is ignored and the query is processed as an exact query. * selectItemList includes aggregate functions that support approximation queries. * There is a model or a synopsis that learned the columns specified in selectItemList.

Currently, avg, sum, count, stddev, and variance aggregate functions are supported for approximate queries.

The GROUP BY, HAVING, and ORDER BY clauses can be used together in an approximate query.

selectItemList

Specify the list of columns or expressions to be retrieved.

FROM tableExpression

Specify the tables from which data is selected. Additionally, aliases and column lists can be specified.

WHERE booleanExpression

Specifies search or filter conditions to retrieve only the rows which satisfy the conditions. If the WHERE clause is omitted, all rows are retrieved.

GROUP BY groupItemList

The GROUP BY clause groups retrieved rows based on the expressions in groupItemList. The column specified in the GROUP BY clause must be included in the selectItemList.

HAVING booleanExpression

The HAVING clause can be used to retrieved the grouped results that satisfy the specified conditions. If the HAVING clause is omitted, all grouped results are retrieved.

ORDER BY orderItemList

The ORDER BY clause sorts retrieved rows by the specified order of the expressions in orderItemList. If the ORDER BY clause is omitted, the retrieved rows are retuned in an arbitrary order.

WITHIN numeric_literal SECONDS

The WITHIN clause specifies an execution time limit of the input approximate query. The query processing time is not guaranteed, but the specified time limit serves as an optimization hint for selecting synopses or models to be used. Currently, a simple policy that selects a synopsis according to the number of rows has been implemented, and the number of rows per second can be adjusted with a configuration parameter.

SYNOPSIS hint

A synopsis to be used for approximate query processing can be specified as a hint, such as SELECT /*+ SYNOPSIS(synopsis_name) */ APPROXIMATE. If the specified synopsis exists and can be used to process the input query, the query will be processed using the synopsis.

Examples

Approximate Query

The following statement approximately retrieves the sum of the reordered columns of the order_products table in the instacart schema.

SELECT APPROXIMATE sum(reordered) FROM instacart.order_products;

The WHERE, GROUP BY, or ORDER BY clauses can also be specified according to the user intent.

SELECT APPROXIMATE sum(reordered) FROM instacart.order_products
WHERE add_to_cart_order < 5;
SELECT APPROXIMATE sum(reordered) FROM instacart.order_products
GROUP BY add_to_cart_order
ORDER BY add_to_cart_order ASC;