Filtering and ordering of large dataset


#1

Hi,

my use-case is: I have a real large dataset (>50.000.000 Rows) which I want to filter and sort. 5M rows works fine, next test with 10M rows I get the following error:

select * from com10 where column_19 > 200 and column_5 <15600 order by column_19 asc limit 100;
Exception: Query couldn’t keep the entire working set of columns in GPU memory

Is there a workaround for this or simply not possible?

Best,
Peter


#2

Hi @pvoringer, what sort of hardware are you running on? Typically adding more GPUs or a GPU with more VRAM will allow you to run on GPU.

You can switch the execute mode for your session in mapdql by typing in the \cpu flag. In a future release we are looking to add functionality to automatically detect if there is not enough GPU memory available and intelligently fall back to CPU directly. Right now there is a flag to allow retry on CPU as described here: Is there a memory replacement mechanism for mapd?.

Finally we are about to land functionality which will significantly accelerate a sort with a limit like you have here by using a GPU-accelerated heap algorithm. Right now we are not efficient for this sort of query (sort-by without a grouping).


#3

Hi @darwin,

it’s a AWS p2.8xlarge instance, so 8GPUs, 96GB GPU RAM, 32vCPUs, 488GB CPU RAM, so should be more than enought for 5M rows, when p2.xlarge (1/8 of everything above) can handle 1M rows.

If I have a fallback to CPU I don’t need to use mapd, then I can you every other in-memory engine (Ignite, Hazelcast, …).

I have a limit/offset on my query, so it should normaly be easy to not sort everything or only a part of it on the fly. Is there a plan when you will have this sorting with limit what you mentioned in your reply? Without this my use case will not be able on mapd and I will unfortunatelly move forward to something else for now.

Best,
Peter


#4

Hi ,

i took a look to your query and the engine has to sort the entire result after filtering because you are asking “give me the top 100 records with the lower values of column_19”.

i guess the problems is relying how mapd run this query, i wrote a similar query an looking at calcite plan seems that the server projects all columns before perform the sort, but the memory consumption on GPU is negligible

for best performance i had to rewrite the query in this way (nume_prog_fattura contains unique values)

select * from fatture join (select nume_prog_fattura from fatture where giorni_pagamento>20 and IMPO_TOT_PAGATO<400 order by giorni_pagamento limit 100) as a on a.nume_prog_fattura=fatture.nume_prog_fattura;

this query runs on a second or so, the number of records after the filtering is 8.3m, the total number of records of table fatture is >50m, the GPU’s memory consumption is 400mb.

if result of selection is bigger you have to disable watchdog because i guess there is a soft limit defined on the software with the parameter enable-watchdog=false .

with both parameters the sort looks to be performed on cpu, the filtering on gpu


#5

I want to point out that GPU-accelerated heap sort is already in master (ie you can build a version with this) and will be in 3.2.2. It should speed up non-grouped sort with limit operations significantly.


#6

Hi,

Seems to be a rush on the responses :slight_smile: I was writing up some details when the flood came

Just wanted to show what is coming down the pipe

Here is an example on 180M records on a commodity 2 GPU system that I had handy

mapdql> select trip_distance, tip_amount from nyc_taxi_nolion_2014 where trip_distance > 3 and tip_amount > 5 order by tip_amount asc limit 10;
trip_distance|tip_amount
7.710000|5.010000
4.000000|5.010000
4.720000|5.010000
8.500000|5.010000
3.600000|5.010000
4.500000|5.010000
7.460000|5.010000
3.650000|5.010000
7.090000|5.010000
7.600000|5.010000
10 rows returned.
Execution time: 342 ms, Total time: 342 ms
mapdql> select count(*) from nyc_taxi_nolion_2014;
EXPR$0
180953629
1 rows returned.
Execution time: 26 ms, Total time: 27 ms

Also as highlighted by @aznable all the data that passes the filter does need to be sorted so this query required over 9M record to be sorted out of the original 180M

mapdql> select count(*) from nyc_taxi_nolion_2014 where trip_distance > 3 and tip_amount > 5;
EXPR$0
9252933
1 rows returned.
Execution time: 32 ms, Total time: 32 ms

I am trying to be careful to not sound defensive, but I am interested to ask if you have run any of these tests on a CPU in memory systems like (Ignite, Hazelcast, …) on a 180M (or 50M) record data set and have some results you would be willing to share? Our own testing has shown that our CPU performance is often significantly faster than the dedicated CPU in memory DB’s, but are very interested to hear of others experiences.

Regards


#7

Hi,

thanks for all the answers.

@dwayneberry I will try to do the 180M rows test on AWS Aurora and Ignite (we tried Hazelcast at the start, but we got better results with Ignite) and share the results.
@darwin When functionality is already in 3.2.2 I will have a look at it again. Do you know when the “default” AWS images getting updated? I only see the image is version 3, but it’s not more specific.

Best,
Peter


#8

Hi,

You can check latest verion here

Currently we are 3.2.1

New AWS instances get the latest version on creation. To update an existing image run the following series of commands.

sudo systemctl stop mapd_server
sudo systemctl stop mapd_web_server
cd /raidStorage/installs
curl "https://aws-ee:Dohshe4phahj2eeloxeingiTeeVahmix@builds.mapd.com/aws-ee/mapd-latest-Linux-x86_64-render-aws-ee.tar.gz" | tar xzf -
latest=$(ls -t1 |  head -n 1)
ln -sfn /raidStorage/installs/${latest} /raidStorage/prod/mapd
sudo systemctl start mapd_server
sudo systemctl start mapd_web_server

We will be going over to yum shortly to simplify this process

regards


#9

@Darwin

i built from actual master, but the sort looks to be done on cpu and selecting a table the sort is performed on GPU (judging by the response times) while selecting other tables it takes the old path and the sort is performed on CPU; is there a specific sintax for top_n queries?

@dwayneberry

yes even using the CPU only MPAD is faster than other in-memory solution i tried;
e.g. mapd CPU-only on a machine with a spec-int rate of 200 is 3-4 times faster than an oracle with in-memory option running on a machine with 520 spec-int rate using joins, grouping and filtering on a dataset of 120m records.

anyway now i tried with exasol 6.0 installed on the same machine of MAPD (200 specint-rate) with this query
select * from flights where distance <202 and arrdelay >9 order by arrdelay limit 100; (3m after filtering) and it takes
2.5 secs while mapd (cpu+gpu) 1.5, mapd (cpu only) 2 secs

selecting more records changing filtering parameters
select * from flights where distance <502 and arrdelay >9 order by arrdelay limit 100; (14m after filtering)

exasol 12 secs, mapd 21 secs (a lot of time spent on projection and some paging on mapd)

rewriting the query to force the final projection on top_n records only

select * from flights join (select flight_timestamp,flightnum,deptime from flights where distance <502 and arrdelay >9 order by arrdelay limit 100) as a on a.flight_timestamp=flights.flight_timestamp and a.flightnum=flights.flightnum and flights.deptime=a.deptime;

3.1 secs cpu+gpu and 3.7 secs cpu-only without paging.

on other scenarios (join,group by etc.) mapd is significantly faster than exasol