SELECT
Purpose
Use the SELECT statement to retrieve data from tables.
Since TrainDB is implemented based on Apache Calcite, please refer to the SQL Reference of Apache Calcite for exact SELECT queries.
This document focuses on the SELECT APPROXIMATE statement for approximate queries in TrainDB.
Syntax
Diagram
select
selectItemList
selectItem
tableExpression
tableReference
tablePrimary
groupItemList
groupItem
orderItemList
orderItem
Keywords and Parameters
APPROXIMATE
Specify APPROXIMATE to indicate the query is an approximate query.
If the following conditions are satisfied, the query is processed as an approximate query.
Otherwise, this keyword is ignored and the query is processed as an exact query.
* selectItemList includes aggregate functions that support approximation queries.
* There is a model or a synopsis that learned the columns specified in selectItemList.
Currently, avg, sum, count, stddev, and variance aggregate functions are supported for approximate queries.
The GROUP BY, HAVING, and ORDER BY clauses can be used together in an approximate query.
selectItemList
Specify the list of columns or expressions to be retrieved.
FROM tableExpression
Specify the tables from which data is selected. Additionally, aliases and column lists can be specified.
WHERE booleanExpression
Specifies search or filter conditions to retrieve only the rows which satisfy the conditions.
If the WHERE clause is omitted, all rows are retrieved.
GROUP BY groupItemList
The GROUP BY clause groups retrieved rows based on the expressions in groupItemList.
The column specified in the GROUP BY clause must be included in the selectItemList.
HAVING booleanExpression
The HAVING clause can be used to retrieved the grouped results that satisfy the specified conditions.
If the HAVING clause is omitted, all grouped results are retrieved.
ORDER BY orderItemList
The ORDER BY clause sorts retrieved rows by the specified order of the expressions in orderItemList.
If the ORDER BY clause is omitted, the retrieved rows are retuned in an arbitrary order.
WITHIN numeric_literal SECONDS
The WITHIN clause specifies an execution time limit of the input approximate query.
The query processing time is not guaranteed, but the specified time limit serves as an optimization hint for selecting synopses or models to be used. Currently, a simple policy that selects a synopsis according to the number of rows has been implemented, and the number of rows per second can be adjusted with a configuration parameter.
SYNOPSIS hint
A synopsis to be used for approximate query processing can be specified as a hint, such as SELECT /*+ SYNOPSIS(synopsis_name) */ APPROXIMATE. If the specified synopsis exists and can be used to process the input query, the query will be processed using the synopsis.
Examples
Approximate Query
The following statement approximately retrieves the sum of the reordered columns of the order_products table in the instacart schema.
SELECT APPROXIMATE sum(reordered) FROM instacart.order_products;
The WHERE, GROUP BY, or ORDER BY clauses can also be specified according to the user intent.
SELECT APPROXIMATE sum(reordered) FROM instacart.order_products
WHERE add_to_cart_order < 5;
SELECT APPROXIMATE sum(reordered) FROM instacart.order_products
GROUP BY add_to_cart_order
ORDER BY add_to_cart_order ASC;