How to group by a subselect result that uses TOP to reduce read rows (Sybase) - sybase-ase

I'm trying to do a group by on a subselect result to reduce the amount of data processed.
My_table has more than 20 million rows.
example:
SELECT TOP 100 A.Column FROM (
SELECT TOP 500 Column FROM My_table) A
GROUP BY A.Column
I want the query to work with only 500 rows from my table, but when I use GROUP BY it takes a long time, as if it were grouping the whole 20 million rows when I'm only grouping 500.
Is there a way to make the SQL engine work with only the 500 rows?

If it's irrelevant to you which 500 rows are used, have you considered using set rowcount?
set {rowcount number, textsize number} – causes the SAP ASE server to stop processing the query (select, insert, update, or delete) after the specified number of rows are affected. The number can be a numeric literal with no decimal point or a local variable of type integer.
Infocenter source
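As a rough sketch of how that could be applied here (the temp table #sample is illustrative and not from the original post): capture 500 rows into a temp table while rowcount is in effect, reset it, then group the small table.
set rowcount 500
SELECT Column INTO #sample FROM My_table
set rowcount 0
SELECT Column FROM #sample GROUP BY Column
DROP TABLE #sample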

Related

LIMIT does not limit the number of results - what will do the trick?

Running the query below, I get output that is longer than 10 rows. When I change the "LIMIT 10" clause to "LIMIT 0", I get empty results.
What is the right way to extract 10 results out of the returned values in BigQuery?
SELECT
UNIQUE(index) AS sample
FROM
[dataset:table]
LIMIT 10
The result is 79318 rows.
thanks,
eilalan
Most likely your table has repeated fields, so when you use BigQuery Legacy SQL it does return 10 rows, but in the UI they are flattened, with every element of the repeated field presented as its own row; that is why you see more than 10.
To see this, just use BigQuery Standard SQL, which does not flatten the result and instead shows it hierarchically in the UI.
So it depends on what you need: if you need the original 10 rows, you already have them; they are just flattened for you in the UI (see the explanation above). If you need 10 rows from the flattened result, you need to FLATTEN it first (a sketch follows).
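For instance, a rough Legacy SQL sketch of that (assuming index is a single repeated field, as in the question):
SELECT index AS sample
FROM FLATTEN([dataset:table], index)
LIMIT 10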
But the best way is just to use BigQuery Standard SQL. In case index is just a single repeated field (not a record):
SELECT DISTINCT sample
FROM `dataset.table`, UNNEST(index) sample
LIMIT 10
If index is a repeated record, you should specify a field (for example, id) in that record:
SELECT DISTINCT sample.id
FROM `dataset.table`, UNNEST(index) sample
LIMIT 10

What is the max limit of group_concat/string_agg in BigQuery output?

I am using group_concat/string_agg (possibly varchar) and want to ensure that BigQuery won't drop any of the concatenated data.
BigQuery will not drop data if a particular query runs out of memory; you will get an error instead. You should try to keep your row sizes below ~100MB, since beyond that you'll start getting errors. You can try creating a large string with an example like this:
#standardSQL
SELECT STRING_AGG(word) AS words FROM `bigquery-public-data.samples.shakespeare`;
There are 164,656 rows in this table, and this query creates a string with 1,168,286 characters (around a megabyte in size). You'll start to see an error if you run a query that requires more than something on the order of hundreds of megabytes on a single node of execution, though:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus)) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
This results in an error:
Resources exceeded during query execution.
If you click on the "Explanation" tab in the UI, you can see that the failure happened during stage 1 while building the results of STRING_AGG. In this case, the string would have been 3,303,599,000 characters long, or approximately 3.3 GB in size.
Adding to Elliot's answer - how to fix:
This query (Elliot's) fails:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus)) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));
But you can LIMIT the number of strings concatenated to get a working solution:
#standardSQL
SELECT STRING_AGG(CONCAT(word, corpus) LIMIT 10) AS words
FROM `bigquery-public-data.samples.shakespeare`
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000));

BigQuery gives Response Too Large error for whole dataset but not for equivalent subqueries

I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO8601 format but with a space, a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01)
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data that I generated. I originally used several fields and tried to get hourly summaries for a month (grouping by hour, where hour is defined with an alias in the SELECT part of the query). When that failed I tried switching to daily. When that failed I reduced the columns involved. That also failed when using COUNT(DISTINCT xxx, 1000000), but it worked when I only did one day's worth. (It also works if I remove the 1000000 parameter, but since that does work with the one-day query, it seems the query planner is not separating things as I would expect.)
The column checked with COUNT(DISTINCT) has cardinality 16,000, and the GROUP BY columns have cardinalities 2 and 20, for a total of just 1,200 expected rows over the month. Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64MB in the total size of results that are allowed. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not in the final response but in the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change "GROUP BY" to "GROUP EACH BY", which alters the way the query is executed. This is a feature that is currently experimental and, as such, is not yet documented.
For your query, since you reference fields named in the select in the group by, you might need to do this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, b, c, d
from [myproject.mytable]
)
group EACH by day,b,c,d
order by day,b,c,d asc

BigQuery error “Response too large to return” when using COUNT(DISTINCT …)

I have a dataset with ~20M rows and I'm observing the following behavior.
The query below returns the error "Response too large to return". The 'id' field is shared among multiple records and the 'field' field has some arbitrary value for each record. I would expect that the result set should only contain 10 rows, well below the query response limit.
SELECT id, COUNT(DISTINCT field)
FROM [my.dataset]
GROUP BY id
LIMIT 10
However, when the DISTINCT keyword is removed from the COUNT aggregation function, BigQuery returns 10 results as expected.
SELECT id, COUNT(field)
FROM [my.dataset]
GROUP BY id
LIMIT 10
I don't understand why the first query returns an error and the second completes successfully. Shouldn't both queries return the same number of rows?
It's not the result size that is causing this response, it's the intermediate size of the data generated by your COUNT DISTINCT query.
Note: COUNT DISTINCT returns a statistical approximation beyond 1,000 values. You can alter that threshold by passing a second argument, the number of values up to which the count stays exact, such as: COUNT(DISTINCT your_field, 500)
See: https://developers.google.com/bigquery/docs/query-reference#aggfunctions
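For illustration, here is that syntax applied to the query from the question (500 is just the value mentioned above; pick whatever exactness/memory trade-off you need):
SELECT id, COUNT(DISTINCT field, 500)
FROM [my.dataset]
GROUP BY id
LIMIT 10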
This behavior is due to the design of BigQuery, which makes it so quick: data is queried via separate nodes, and results are aggregated at mixers. A COUNT will tally the total number of results and combine the answer, but COUNT DISTINCT needs to keep track of potentially millions of separate values and then combine them later. Therefore a COUNT DISTINCT can create a lot of data and could potentially exceed the internal maximum for individual nodes.
Note also that currently, BigQuery LIMIT clauses are applied after the entire result set is determined.

QUERY speed with limit and milion records

Hi, I have a 7-million-record DB table for testing query speed.
I tested my two queries, which are the same query with different LIMIT parameters:
query 1 -
SELECT *
FROM table
LIMIT 20, 50;
query 2 -
SELECT *
FROM table
LIMIT 6000000, 6000030;
query exec times are:
query 1 - 0.006 sec
query 2 - 5.500 sec
In both of these queries, I am fetching the same number of records, but in the second case it takes much more time. Can someone please explain the reasons behind this?
Without looking into it too closely, my assumption is that this occurs because the first query only has to read to the 50th record to return results, whereas the second query has to read six million before returning results. Basically, the first query just shorts out quicker.
I would assume that this has an incredible amount to do with the makeup of the table - field types and keys, etc.
If a record is made up of fixed-length fields (e.g. CHAR vs. VARCHAR), then the DBMS can just calculate where the nth record starts and jump there. If it's variable-length, then it has to read the records to determine where the nth record starts. Similarly, I'd further assume that tables with appropriate primary keys would be quicker to query than those without such keys.
I think the slowdown is tied to the fact that you are using limits with offsets and are querying the table with no additional context for indexing. It's possible the first is just faster because it can get to the offset quicker.
It's the difference between returning 50 rows and 6000030 rows (or ~1 million rows, since you said there were only 7 million rows).
With two arguments, the first argument specifies the offset of the first row to return, and the second specifies the maximum number of rows to return. The offset of the initial row is 0 (not 1):
SELECT * FROM tbl LIMIT 5,10; # Retrieve rows 6-15
http://dev.mysql.com/doc/refman/5.0/en/select.html
Also, I think you're looking for 30-row pages, so your queries should be using 30 as the second parameter in the LIMIT clause:
SELECT *
FROM table
LIMIT 20, 30;
SELECT *
FROM table
LIMIT 6000000, 30;
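If the table has an indexed, monotonically increasing key (the id column below is an assumption, not something stated in the question), a common alternative to a large offset is to seek on that key, so the server can jump straight to the page instead of scanning past six million rows:
SELECT *
FROM table
WHERE id > 6000000
ORDER BY id
LIMIT 30;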
