Immerse 4.0 suboptimal query generation for choroplet


#1

Just installed the CE 4 and i noticed that the query generation for choroplet Viz is suboptimal performance wise.

With a modest dataset of 123M of records based on a view the query Immerse generated is the following

SELECT mapd_states.rowid, v_flights_log.dep_state as key0, avg(arrdelay) as color FROM v_flights_log, mapd_states WHERE (v_flights_log.dep_state = mapd_states.abbr) AND ((CAST(flight_timestamp AS TIMESTAMP(0)) >= CAST(β€˜2001-11-02 19:15:58’ AS TIMESTAMP(0)) AND CAST(flight_timestamp AS TIMESTAMP(0)) <= CAST(β€˜2002-01-29 03:07:30’ AS TIMESTAMP(0)))) GROUP BY mapd_states.rowid, key0;

this costs over 20G of ram consumption and a big performance hit on query wall time.

generating the query this way

with v_flights_log as (select dep_state,avg(arrdelay) avg_delay from v_flights_log WHERE (v_flights_log.dep_state = mapd_states.abbr) AND ((CAST(flight_timestamp AS TIMESTAMP(0)) >= CAST(β€˜2001-11-02 19:15:58’ AS TIMESTAMP(0)) AND CAST(flight_timestamp AS TIMESTAMP(0)) <= CAST(β€˜2002-01-29 03:07:30’ AS TIMESTAMP(0)))) group by 1) SELECT mapd_states.rowid, v_flights_log.dep_state as key0, avg_delay as color FROM v_flights_log, mapd_states WHERE (v_flights_log.dep_state = mapd_states.abbr) ;

the response time of the query is reduced to 60-70 ms and the memory consuption falls to 1.3GB.


#2

Hi @aznable,

Thanks for this.

To get your query to run I had to rewrite it a bit (removing the inner v_flights_log in the CTE) to be this:

with v_flights as (select dest_state,avg(arrdelay) avg_delay from flights WHERE ((CAST(dep_timestamp AS TIMESTAMP(0)) >= CAST('2001-11-02 19:15:58' AS TIMESTAMP(0)) AND CAST(dep_timestamp AS TIMESTAMP(0)) <= CAST('2002-01-29 03:07:30' AS TIMESTAMP(0)))) group by 1) SELECT mapd_states.rowid, v_flights.dest_state as key0, avg_delay as color FROM v_flights, mapd_states WHERE (v_flights.dest_state = mapd_states.abbr)

That ran in 45-55ms on a machine with 123M records, which compares favorably to the Immerse generated query which takes ~85-90ms.

Another way to do this is with our new LAST_SAMPLE operator, which is even a bit faster at 38 ms.

SELECT last_sample(mapd_states.rowid), flights.dest_state as key0, avg(arrdelay) as color FROM flights, mapd_states WHERE (flights.dest_state = mapd_states.abbr) GROUP BY key0

The likely performance and memory culprit is the needless double group by, although we should measure the LAST_SAMPLE query above for memory consumption. We’ll likely try to add something more optimized for the next release.

Regards


#3

thanks @darwin for pointing me out to last_sample, but using it with my view isn’t helping too much

to make the queries run (both of them), i had to disable the watchdog and as you can see from the calcite explain plan the join between the view and the geo table produce an unuseful projection

LogicalProject(EXPR$0=[$1], key0=[$0], color=[$2])
  LogicalAggregate(group=[{0}], EXPR$0=[LAST_SAMPLE($1)], color=[AVG($2)])
    LogicalProject(key0=[$6], rowid=[$33], arrdelay=[$18])
      LogicalFilter(condition=[AND(=($6, $25), >=(CAST($8):TIMESTAMP(0), CAST('2001-11-02 19:15:58'):TIMESTAMP(0) NOT NULL), <=(CAST($8):TIMESTAMP(0), CAST('2002-01-29 03:07:30'):TIMESTAMP(0) NOT NULL))])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalProject(dep_airport=[$32], dep_city=[$33], dep_state=[$34], dep_state_3166_2=[$35], arr_airport=[$41], arr_city=[$42], arr_state=[$43], arr_state_3166_2=[$44], flight_timestamp=[$0], c1_year=[$1], c2_month=[$2], dayofmonth=[$3], dayofweek=[$4], deptime=[$5], arrtime=[$7], uniquecarrier=[$9], actualelapsedtime=[$12], airtime=[$14], arrdelay=[$15], depdelay=[$16], origin=[$17], dest=[$18], distance=[$19], ucarrier_desc=[$50])
            LogicalJoin(condition=[=($9, $49)], joinType=[inner])
              LogicalJoin(condition=[=($18, $40)], joinType=[inner])
                LogicalJoin(condition=[=($17, $31)], joinType=[inner])
                  EnumerableTableScan(table=[[mapd, flights]])
                  EnumerableTableScan(table=[[mapd, airports]])
                EnumerableTableScan(table=[[mapd, airports2]])
              EnumerableTableScan(table=[[mapd, carriers]])
          EnumerableTableScan(table=[[mapd, mapd_states]])

i will retry materializing the view and/or trying in another system


#4

the problem reply on the plan generated when joining a view with another table.

e.g.

mapdql> create view v_test_flights as select * from flights where c1_year=2008;
mapdql> explain calcite select count() from v_test_flights join airports on origin=iata;
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
  LogicalProject($f0=[0])
    LogicalJoin(condition=[=($17, $30)], joinType=[inner])
      LogicalProject(flight_timestamp=[$0], c1_Year=[$1], c2_Month=[$2], DayofMonth=[$3], DayOfWeek=[$4], DepTime=[$5], CRSDepTime=[$6], ArrTime=[$7], CRSArrTime=[$8], UniqueCarrier=[$9], FlightNum=[$10], TailNum=[$11], ActualElapsedTime=[$12], CRSElapsedTime=[$13], AirTime=[$14], ArrDelay=[$15], DepDelay=[$16], Origin=[$17], Dest=[$18], Distance=[$19], TaxiIn=[$20], TaxiOut=[$21], Cancelled=[$22], CancellationCode=[$23], Diverted=[$24], CarrierDelay=[$25], WeatherDelay=[$26], NASDelay=[$27], SecurityDelay=[$28], LateAircraftDelay=[$29])
        LogicalFilter(condition=[=(CAST($1):INTEGER, 2008)])
          EnumerableTableScan(table=[[mapd, flights]])
      EnumerableTableScan(table=[[mapd, airports]])

1 rows returned.
Execution time: 162 ms, Total time: 162 ms
mapdql> select count() from v_test_flights join airports on origin=iata;
7000787
1 rows returned.
Execution time: 2088 ms, Total time: 2091 ms
mapdql> select count() from v_test_flights join airports on origin=iata;
7000787
1 rows returned.
Execution time: 1339 ms, Total time: 1340 ms

so i think i opened a topic in the wrong section with a wrong title; i am sorry