Maximum number of columns that can be used in a GROUP BY clause


#1

Hi Team,

I would like to know the maximum number of columns that can be used in the GROUP BY clause.

With Regards
Ullas


#2

Hi @UllasShanthakumar,

There should be no inherent limit to the number of columns you can group by, other than the amount of memory required to execute the GROUP BY.

Hopefully this helps.

Regards


#3

Hi Darwin,

Thanks for your response.
Is there any way to calculate the amount of memory required by the GROUP BY?
I have a scenario where I need to perform rolling counts (on a day-to-day basis over a span of 15 days) by grouping on 17 fields.
The query fails with the following exception:
Exception: Other error: code -1133870

I have raised a separate issue for the same.

The machine I am running on has 8 CPUs and 32 GB of memory.

The mapd.conf file has only the following entries:
port = 9091
http-port = 9090
data = "/var/lib/mapd/data"
null-div-by-zero = true

[web]
port = 9092
frontend = "/opt/mapd/frontend"

Let us know if you have any suggestions on fine-tuning parameters, and also whether there is a way to calculate the amount of memory required by the GROUP BY clause.

With Regards
Ullas Shanthakumar.


#4

Hi @UllasShanthakumar,

Sorry to hear of this. Are other queries running OK? (i.e. can you do queries with 1…2…3 groups?) And does the exception give a line number?

Can you give us a sense of the size of your table (number of rows) and the types of the columns you are grouping on? Also, could you retrieve the cardinalities for those columns, and at least give us a range? You can check the cardinality of a column by running `SELECT approx_count_distinct(col_name) FROM table_name`.
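
If it helps, here is a quick sketch for checking the cardinality of several columns with the pymapd Python client; the connection settings, table name, and column list are placeholders you would swap in for your own:

```python
# Sketch: report approx_count_distinct for each group-by column via pymapd.
# Connection settings, table name, and column names below are placeholders.
from pymapd import connect

con = connect(user="mapd", password="HyperInteractive",
              host="localhost", port=9091, dbname="mapd")

table_name = "my_table"                      # replace with your table
group_by_cols = ["col_1", "col_2", "col_3"]  # replace with your 17 columns

cur = con.cursor()
for col in group_by_cols:
    cur.execute("SELECT approx_count_distinct({0}) FROM {1}"
                .format(col, table_name))
    cardinality = cur.fetchone()[0]
    print("{0}: ~{1} distinct values".format(col, cardinality))
```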

Finally, when you say 8 CPU, I assume you mean 8 cores, correct? Are there any GPUs on the system as well?

MapD will try to use what is called a perfect hash group by when the cardinalities are reasonable (this speeds up hashing the results into group bins), but given the high number of columns it's likely that we are using our baseline group-by strategy, which leverages open-address hashing with linear probing.

In general, that type of hash is going to take 8 bytes per column × (number of columns grouped + number of group columns projected + number of measures) × number of rows / fill factor, the latter of which might be 1/2 (50%). So if you are re-projecting all the group-by columns and calculating a single measure, that might be 8 × (17 + 17 + 1) / (1/2) = 560 bytes per output row. We also currently have each CPU thread (if you are indeed running on CPU) use its own output space to minimize contention, so if you are using more than one thread (by default each thread is responsible for 32M rows of data, unless you tweak `fragment_size`), you'd have to multiply by the number of threads as well.

We will likely do some optimizations in the future around enabling different column widths, and even eliding the redundant projection of the group-by key, but you'll still need a non-negligible amount of memory when the number of distinct groups gets large.
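
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python; the distinct-group count, thread count, and fill factor below are illustrative assumptions rather than values measured on your system:

```python
# Back-of-the-envelope estimate for the baseline GROUP BY hash buffer,
# following the 8-bytes-per-slot layout sketched above.
# All inputs are illustrative assumptions; plug in your own numbers.

BYTES_PER_SLOT = 8

def groupby_bytes(num_group_cols, num_projected_group_cols, num_measures,
                  num_groups, num_threads=1, fill_factor=0.5):
    """Rough memory needed for the open-address (baseline) group-by hash."""
    slots_per_row = num_group_cols + num_projected_group_cols + num_measures
    bytes_per_row = BYTES_PER_SLOT * slots_per_row / fill_factor
    # Each CPU thread gets its own output buffer, so scale by thread count.
    return bytes_per_row * num_groups * num_threads

# Example: 17 group-by columns, all 17 re-projected, one count measure,
# an assumed 10 million distinct groups, and 8 CPU threads.
estimate = groupby_bytes(17, 17, 1, num_groups=10_000_000, num_threads=8)
print("roughly %.1f GB" % (estimate / 1e9))
```

With those assumed inputs the estimate would already exceed the 32 GB on your machine, which is why the cardinality numbers above matter so much.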

Once you give us more details on your query and setup, we'll try to repro internally. At the very least we should be giving a clearer out-of-memory error, if that is indeed what is happening here.

Regards