Behavior of GROUP BY (double datatype multiplication)


#1

I am trying to understand how GROUP BY performs.

In my table I have

col1 INT with 12 unique values
col2 TEXT ENCODING DICT(8) with 45 unique values
col3 TEXT ENCODING DICT(8) with 4 unique values
col4 INT

with 11 million rows.

  1. If I run SELECT col1, SUM(col4) from table1 GROUP BY col1, the query is super fast.

  2. If I run SELECT col1, col2, col3, SUM(col4) from table1 GROUP BY col1, col2, col3, the query takes 3 seconds.

  3. If I run SELECT col1, col3, SUM(col4) from table1 GROUP BY col1, col3, the query takes 60 seconds!!!

I am wondering why the third one takes so long, and in particular longer than the second one, even though it has one less grouping level.

CentOS 7.3, GTX 1080 8GB, 128GB RAM, MapD 3.2.2

These are the EXPLAIN CALCITE outputs:

1) LogicalAggregate(group=[{0}], EXPR$1=[SUM($1)])
  LogicalProject(col1=[$3], $f1=[*($13, $14)])
    EnumerableTableScan(table=[[mapd, table1]])

2) LogicalAggregate(group=[{0, 1, 2}], EXPR$3=[SUM($3)])
  LogicalProject(col1=[$3], col2=[$43], col3=[$26], $f3=[*($13, $14)])
    EnumerableTableScan(table=[[mapd, table1]])

3) LogicalAggregate(group=[{0, 1}], EXPR$2=[SUM($2)])
   LogicalProject(col1=[$3], col3=[$26], $f2=[*($13, $14)])
     EnumerableTableScan(table=[[mapd, table1]])


#2

Try this query on the example data provided with MapD:

select flight_year,dest_city,origin_city,sum(arrdelay+depdelay) from flights_2008_7M group by 1,2,3;

It has a cardinality of >5k and it should run quite fast (20ms or so).

I guess you are querying a view; maybe the problem lies there.


#3

Thanks for your quick support, aznable.

I am using a plain table, not a view.

BTW, after importing the flights_2008_7M dataset, the query you suggested ran in 20ms.

Now I would like to troubleshoot the issue in my table; what are the best practices in this case?


#4

I just found that the cause is the DOUBLE datatype (and therefore it is not strictly related to the GROUP BY function).

If the metric I am grouping over is DOUBLE*DOUBLE, the computation is VERY slow.

Casting it to FLOAT brings MapD Core back to its usual millisecond speed.

Now, the lower precision of the FLOAT datatype wouldn't be a problem for aggregate totals, but it IS A BIG problem when computing unit values, where we need precise numeric results to several decimal places.
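The precision gap between the two types can be seen outside MapD. A minimal Python sketch (the unit price is a made-up value) round-trips a 64-bit float through IEEE-754 single precision, which is what casting DOUBLE to FLOAT amounts to:

```python
import struct

def to_float32(x):
    # Round-trip a Python float (64-bit) through IEEE-754 single precision
    return struct.unpack('f', struct.pack('f', x))[0]

unit_price = 0.123456789012345    # hypothetical unit value with many decimals
print(unit_price)                 # ~15-16 significant digits preserved
print(to_float32(unit_price))     # only ~7 significant digits survive
```

So for totals the relative error stays small, but a unit value loses roughly half its significant digits, which matches the concern above.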


#5

I thought you were querying a view because of this part in the plan of all the queries:

$f3=[*($13, $14)])

It looks like a multiplication, but I can't see one in your queries. A 128-bit operation is very expensive on GPUs because there is no hardware support for it; maybe they generate IR code optimized for CPUs.


#6

Thanks, aznable, the actual queries have SUM(col_y * col_x) where col_y and col_x are DOUBLE.

Now, what do you suggest in order to speed up the dashboard? Is there something I can change?


#7

Here is an idea I haven't tested yet, assuming the problem is the multiplication between two doubles.

Try changing sum(a*b) to sum(a)*sum(b) and let's see if the optimizer changes the plan and performs the multiplication in the reduction phase instead of for every row.

Writing the SQL to change this behaviour is a trivial task, but I guess you are using Immerse, so you don't have full control over the SQL:

select col1,col2,col3,sum_col4*sum_col5
from (select col1,col2,col3,sum(col4) as sum_col4,sum(col5) as sum_col5 from table1 group by col1,col2,col3)

With a query like that, the double multiplications would number just a few thousand instead of millions.
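One caveat worth stating plainly: SUM(a)*SUM(b) is not mathematically equivalent to SUM(a*b), so this rewrite changes the results, not just where the multiplication happens. A tiny Python sketch with made-up rows shows both the deferred multiplication and the divergence:

```python
from collections import defaultdict

# Hypothetical rows: (group_key, col4, col5)
rows = [("a", 2.0, 3.0), ("a", 4.0, 5.0), ("b", 1.0, 7.0)]

# SUM(col4 * col5): one double multiplication per input row
sum_of_products = defaultdict(float)
for key, c4, c5 in rows:
    sum_of_products[key] += c4 * c5

# SUM(col4) * SUM(col5): multiplications deferred to one per group
partial = defaultdict(lambda: [0.0, 0.0])
for key, c4, c5 in rows:
    partial[key][0] += c4
    partial[key][1] += c5
product_of_sums = {k: s4 * s5 for k, (s4, s5) in partial.items()}

print(dict(sum_of_products))  # {'a': 26.0, 'b': 7.0}
print(product_of_sums)        # {'a': 48.0, 'b': 7.0} -- differs for group 'a'
```

The rewrite is therefore only useful as a diagnostic for where the cost comes from, or when the dashboard metric really is a product of two independent totals.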

I will try to reproduce this on my system anyway.

I loaded 11M records into a table and tried a query like select <text encoding dict(32) column>, <smallint column>, sum(double*double) from table group by 1, 2, and the response time is around 30ms with a cardinality of 1300.
Changing the query to sum(double)*sum(double), the plan, as you can see, changes as intended:

LogicalProject(TEXT_KAEN_KPRV_DESC_PROVINCIA=[$0], FK_FSPE_ESFI_ID=[$1], EXPR$2=[*($2, $3)])
  LogicalAggregate(group=[{0, 1}], agg#0=[SUM($2)], agg#1=[SUM($3)])
    LogicalProject(TEXT_KAEN_KPRV_DESC_PROVINCIA=[$0], FK_FSPE_ESFI_ID=[$1], IMPO_FSPE_IMP_CC=[$2], IMPO_FSPE_IMP_CR=[$3])
      EnumerableTableScan(table=[[mapd, fspe_double3]])

But the response time is around 60ms, so it is higher. I would take a look at the logs to be sure no cache thrashing is occurring that forces disk reads (it would be weird, because the memory consumption would have to be very large).


#8

Thanks for your support, aznable; here are some additional details:

  1. The table is 34M rows; the 11M in my early tests was a subset resulting from filtering.
  2. Your change to the query didn't help: I get the same time, 46 to 50 seconds.
  3. I am not sure the DOUBLE multiplication is the cause of the trouble: I performed several tests, as follows, and got some weird results:

Table: 34M rows, 55 columns.

Datatypes:
col1		: INT NOT NULL				cardinality = 12
col2		: TEXT ENCODING DICT(8)		cardinality = 3
col3		: TEXT ENCODING DICT(8)		cardinality = 21
col4		: TEXT ENCODING DICT		cardinality = 235
col5		: TEXT ENCODING DICT(16)	cardinality = 59
value1		: DOUBLE
value2		: DOUBLE



select col3, sum(value1*value2)						from table1   group by 1				/* time = 4 milliseconds */

select col1, sum(value1*value2)						from table1   group by 1				/* time = 6 milliseconds */
	
select col2, sum(value1*value2)						from table1   group by 1				/* time = 6 milliseconds */

select col1, col3, col2, col4, sum(value1*value2)	from table1   group by 1, 2, 3, 4		/* time = 1.2 seconds    */

select col1, col4, sum(value1*value2)				from table1   group by 1, 2				/* time = 1.43 seconds   */

select col1, col3, sum(value1*value2)				from table1   group by 1, 2				/* time = 4.5 seconds    */

select col1, col5, sum(value1*value2)				from table1   group by 1, 2				/* time = 46.5 seconds   */

select col1, col5, col2, sum(value1*value2)			from table1   group by 1, 2, 3			/* time = 46.5 seconds   */

select col1, col3, col2, sum(value1*value2)			from table1   group by 1, 2, 3			/* time = 46.9 seconds   */

select col1, col2, sum(value1*value2)				from table1   group by 1, 2				/* time = 48.5 seconds   */

Take, for example, these two tests:

select col1, col4, sum(value1*value2) from table1 group by 1, 2 /* time = 1.43 seconds */

and

select col1, col5, sum(value1*value2) from table1 group by 1, 2 /* time = 46.5 seconds */

What could be the culprit for the difference in performance?

and also these two:

select col1, col2, sum(value1*value2) from table1 group by 1, 2 /* time = 48.5 seconds */

and

select col1, col3, col2, col4, sum(value1*value2) from table1 group by 1, 2, 3, 4 /* time = 1.2 seconds */

where both col1 and col2 appear in the second query, yet alongside two other columns they lead to a completely different speed.

Weird…


#9

Hi,

Please include your mapd_server.INFO log for these query runs so we can investigate.

Plus, run explain select col1, col4, sum(value1*value2) from table1 group by 1, 2 and explain select col1, col5, sum(value1*value2) from table1 group by 1, 2 and send us the output.

Regards


#10

Hi,

Additionally, could you share the results and timings of the following queries:

select col1, col4, count(*) from table1 group by 1, 2

select col1, col5, count(*) from table1 group by 1, 2

We need to see if any of your result groups are particularly busy.

Regards


#11

Hello,
I executed an

explain select col1, col4, sum(value1*value2) from table1 group by 1, 2
Result:
IR for the GPU:
===============

; Function Attrs: uwtable
define void @query_group_by_template(i8** nocapture readnone %byte_stream, i8* nocapture readonly %literals, i64* nocapture readnone %row_count_ptr, i64* nocapture readonly %frag_row_off_ptr, i32* %max_matched_ptr, i64* %agg_init_val, i64** %group_by_buffers, i64** %small_groups_buffer, i32 %frag_idx, i64* %join_hash_tables, i32* %total_matched, i32* %error_code) #21 {
.entry:
  %0 = getelementptr i8*, i8** %byte_stream, i32 0
  %1 = load i8*, i8** %0
  %2 = getelementptr i8*, i8** %byte_stream, i32 1
  %3 = load i8*, i8** %2
  %4 = getelementptr i8*, i8** %byte_stream, i32 2
  %5 = load i8*, i8** %4
  %6 = getelementptr i8*, i8** %byte_stream, i32 3
  %7 = load i8*, i8** %6
  %8 = load i64, i64* %row_count_ptr, align 8
  %9 = load i32, i32* %max_matched_ptr, align 4
  %crt_matched = alloca i32
  %old_total_matched = alloca i32
  %10 = call i32 @pos_start_impl(i32* %error_code)
  %11 = call i32 @pos_step_impl()
  %12 = call i32 @group_buff_idx_impl()
  %13 = sext i32 %10 to i64
  %14 = getelementptr i64*, i64** %group_by_buffers, i32 %12
  %15 = load i64*, i64** %14, align 8
  %16 = call i64* @init_shared_mem_nop(i64* %15, i32 0)
  %17 = icmp slt i64 %13, %8
  br i1 %17, label %.loop.preheader, label %.exit

.loop.preheader:                                  ; preds = %.entry
  %18 = sext i32 %11 to i64
  br label %.forbody

.forbody:                                         ; preds = %.forbody, %.loop.preheader
  %pos = phi i64 [ %13, %.loop.preheader ], [ %20, %.forbody ]
  %19 = call i32 @row_func(i64* %16, i64* null, i32* %crt_matched, i32* %total_matched, i32* %old_total_matched, i64* %agg_init_val, i64 %pos, i64* %frag_row_off_ptr, i64* %row_count_ptr, i8* %literals, i8* %1, i8* %3, i8* %5, i8* %7, i64* %join_hash_tables)
  %20 = add i64 %pos, %18
  %21 = icmp slt i64 %20, %8
  br i1 %21, label %.forbody, label %._crit_edge

._crit_edge:                                      ; preds = %.forbody
  br label %.exit

.exit:                                            ; preds = %._crit_edge, %.entry
  call void @write_back_nop(i64* %15, i64* %16, i32 0)
  ret void
}

; Function Attrs: alwaysinline
define i32 @row_func(i64* %group_by_buff, i64* %small_group_by_buff, i32* %crt_match, i32* %total_matched, i32* %old_total_matched, i64* %agg_init_val, i64 %pos, i64* %frag_row_off, i64* %num_rows_per_scan, i8* %literals, i8* %col_buf0, i8* %col_buf1, i8* %col_buf2, i8* %col_buf3, i64* %join_hash_tables) #22 {
entry:
  %0 = load i64, i64* %frag_row_off
  br i1 true, label %filter_true, label %filter_false

filter_true:                                      ; preds = %entry
  %1 = alloca i64, i32 2
  %2 = call i64 @fixed_width_int_decode(i8* %col_buf0, i32 4, i64 %pos)
  %3 = trunc i64 %2 to i32
  %4 = sext i32 %3 to i64
  %5 = getelementptr i64, i64* %1, i32 0
  store i64 %4, i64* %5
  %6 = call i64 @fixed_width_int_decode(i8* %col_buf3, i32 4, i64 %pos)
  %7 = trunc i64 %6 to i32
  %8 = call i64 @translate_null_key_int32_t(i32 %7, i32 -2147483648, i32 235)
  %9 = sext i32 %7 to i64
  %10 = getelementptr i64, i64* %1, i32 1
  store i64 %8, i64* %10
  %11 = call i32 @perfect_key_hash(i64* %1)
  %12 = call i64* @get_matching_group_value_perfect_hash(i64* %group_by_buff, i32 %11, i64* %1, i32 2, i32 5)
  %13 = getelementptr i64, i64* %12, i32 0
  %14 = sext i32 %3 to i64
  call void @agg_id_shared(i64* %13, i64 %14)
  %15 = getelementptr i64, i64* %12, i32 1
  %16 = sext i32 %7 to i64
  call void @agg_id_shared(i64* %15, i64 %16)
  %17 = call double @fixed_width_double_decode(i8* %col_buf1, i64 %pos)
  %18 = call double @fixed_width_double_decode(i8* %col_buf2, i64 %pos)
  %19 = call double @mul_double_nullable(double %17, double %18, double 0x10000000000000)
  %20 = getelementptr i64, i64* %12, i32 2
  call void @agg_sum_double_skip_val_shared(i64* %20, double %19, double 0x10000000000000)
  br label %filter_false

filter_false:                                     ; preds = %filter_true, %entry
  ret i32 0
}

and an

explain select col1, col5, sum(value1*value2) from table1 group by 1, 2
Result:
IR for the GPU:
===============

; Function Attrs: uwtable
define void @query_group_by_template(i8** nocapture readnone %byte_stream, i8* nocapture readonly %literals, i64* nocapture readnone %row_count_ptr, i64* nocapture readonly %frag_row_off_ptr, i32* %max_matched_ptr, i64* %agg_init_val, i64** %group_by_buffers, i64** %small_groups_buffer, i32 %frag_idx, i64* %join_hash_tables, i32* %total_matched, i32* %error_code) #21 {
.entry:
  %0 = getelementptr i8*, i8** %byte_stream, i32 0
  %1 = load i8*, i8** %0
  %2 = getelementptr i8*, i8** %byte_stream, i32 1
  %3 = load i8*, i8** %2
  %4 = getelementptr i8*, i8** %byte_stream, i32 2
  %5 = load i8*, i8** %4
  %6 = getelementptr i8*, i8** %byte_stream, i32 3
  %7 = load i8*, i8** %6
  %8 = load i64, i64* %row_count_ptr, align 8
  %9 = load i32, i32* %max_matched_ptr, align 4
  %crt_matched = alloca i32
  %old_total_matched = alloca i32
  %10 = call i32 @pos_start_impl(i32* %error_code)
  %11 = call i32 @pos_step_impl()
  %12 = call i32 @group_buff_idx_impl()
  %13 = sext i32 %10 to i64
  %14 = getelementptr i64*, i64** %group_by_buffers, i32 %12
  %15 = load i64*, i64** %14, align 8
  %16 = call i64* @init_shared_mem_nop(i64* %15, i32 0)
  %17 = icmp slt i64 %13, %8
  br i1 %17, label %.loop.preheader, label %.exit

.loop.preheader:                                  ; preds = %.entry
  %18 = sext i32 %11 to i64
  br label %.forbody

.forbody:                                         ; preds = %.forbody, %.loop.preheader
  %pos = phi i64 [ %13, %.loop.preheader ], [ %20, %.forbody ]
  %19 = call i32 @row_func(i64* %16, i64* null, i32* %crt_matched, i32* %total_matched, i32* %old_total_matched, i64* %agg_init_val, i64 %pos, i64* %frag_row_off_ptr, i64* %row_count_ptr, i8* %literals, i8* %1, i8* %3, i8* %5, i8* %7, i64* %join_hash_tables)
  %20 = add i64 %pos, %18
  %21 = icmp slt i64 %20, %8
  br i1 %21, label %.forbody, label %._crit_edge

._crit_edge:                                      ; preds = %.forbody
  br label %.exit

.exit:                                            ; preds = %._crit_edge, %.entry
  call void @write_back_nop(i64* %15, i64* %16, i32 0)
  ret void
}

; Function Attrs: alwaysinline
define i32 @row_func(i64* %group_by_buff, i64* %small_group_by_buff, i32* %crt_match, i32* %total_matched, i32* %old_total_matched, i64* %agg_init_val, i64 %pos, i64* %frag_row_off, i64* %num_rows_per_scan, i8* %literals, i8* %col_buf0, i8* %col_buf1, i8* %col_buf2, i8* %col_buf3, i64* %join_hash_tables) #22 {
entry:
  %0 = load i64, i64* %frag_row_off
  br i1 true, label %filter_true, label %filter_false

filter_true:                                      ; preds = %entry
  %1 = alloca i64, i32 2
  %2 = call i64 @fixed_width_int_decode(i8* %col_buf0, i32 4, i64 %pos)
  %3 = trunc i64 %2 to i32
  %4 = sext i32 %3 to i64
  %5 = getelementptr i64, i64* %1, i32 0
  store i64 %4, i64* %5
  %6 = call i64 @fixed_width_unsigned_decode(i8* %col_buf3, i32 2, i64 %pos)
  %7 = trunc i64 %6 to i32
  %8 = trunc i32 %7 to i16
  %9 = call i32 @cast_uint16_t_to_int32_t_nullable(i16 %8, i16 -1, i32 -2147483648)
  %10 = call i64 @translate_null_key_int32_t(i32 %9, i32 -2147483648, i32 59)
  %11 = sext i32 %9 to i64
  %12 = getelementptr i64, i64* %1, i32 1
  store i64 %10, i64* %12
  %13 = call i32 @perfect_key_hash(i64* %1)
  %14 = call i64* @get_matching_group_value_perfect_hash(i64* %group_by_buff, i32 %13, i64* %1, i32 2, i32 5)
  %15 = getelementptr i64, i64* %14, i32 0
  %16 = sext i32 %3 to i64
  call void @agg_id_shared(i64* %15, i64 %16)
  %17 = getelementptr i64, i64* %14, i32 1
  %18 = sext i32 %9 to i64
  call void @agg_id_shared(i64* %17, i64 %18)
  %19 = call double @fixed_width_double_decode(i8* %col_buf1, i64 %pos)
  %20 = call double @fixed_width_double_decode(i8* %col_buf2, i64 %pos)
  %21 = call double @mul_double_nullable(double %19, double %20, double 0x10000000000000)
  %22 = getelementptr i64, i64* %14, i32 2
  call void @agg_sum_double_skip_val_shared(i64* %22, double %21, double 0x10000000000000)
  br label %filter_false

filter_false:                                     ; preds = %filter_true, %entry
  ret i32 0
}

and ran the two queries; here is the clean mapd_server.INFO

In addition, the timing for the following two queries is very fast:

select col1, col4, count(*) from table1 group by 1, 2 /* time = 1 millisecond */

select col1, col5, count(*) from table1 group by 1, 2 /* time = 2 milliseconds */

And the results are:

col4 group count

col5 group count


#12

Hi Dwayne,
any hints? Can I help in any way?

