Performance CPU ONLY


#1

Hello
We are doing a quick proof of concept to see if MapD could replace our OLAP cube.

We’ve been able to load most of our data from the database fairly easily, and the performance seems acceptable.

However, another part of our data load requires fetching some static data from the MapD database to enrich each record before inserting it into a table that contains about 150 columns (a fact table).

So view it this way:

  1. Get a record.
  2. Enrich it with data already in MapD.
  3. Insert it, one record at a time.

This process is extremely slow.
We tried batching records instead:

  1. Get n records.
  2. Process all n records in parallel and enrich them.
  3. Batch-insert them.

Still slow.
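The batched workflow above can be sketched like this (class, record, and method names are hypothetical; in the real pipeline the enrich step is one or more Thrift queries against MapD per record, which is where the time goes):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class BatchEnrich {

    // Stand-ins for the raw and enriched record types.
    record Raw(int id) {}
    record Enriched(int id, String dimValue) {}

    // Stand-in for the per-record lookup; in the real pipeline each call
    // is a round trip to MapD over Thrift.
    static Enriched enrich(Raw r, Map<Integer, String> dim) {
        return new Enriched(r.id(), dim.getOrDefault(r.id() % 3, "?"));
    }

    static List<Enriched> processBatch(List<Raw> batch, Map<Integer, String> dim) {
        // Step 2: enrich all records of the batch with a parallel stream.
        return batch.parallelStream()
                    .map(r -> enrich(r, dim))
                    .collect(Collectors.toList());
        // Step 3 (not shown): send the enriched batch to MapD in one
        // bulk insert instead of row-by-row statements.
    }

    public static void main(String[] args) {
        Map<Integer, String> dim = Map.of(0, "A", 1, "B", 2, "C");
        List<Raw> batch = IntStream.range(0, 5000)
                                   .mapToObj(Raw::new)
                                   .collect(Collectors.toList());
        System.out.println(processBatch(batch, dim).size()); // prints 5000
    }
}
```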

If we only do steps 1 and 3 with 5000 messages, we get decent performance. Using Dropwizard Metrics to measure it, this is what we get (times in ms):
count=370,
min=3.663432,
max=1039.916237,
mean=463.2558883220592,
stddev=40.449861693479185,
median=457.15358

So on average it takes about half a second to insert 5000 records (roughly 10,800 rows/s), which is not too bad considering the number of columns.

Now, when we add the pre-processing (enrichment) step before the insert, we cannot process that many messages per batch, because the pre-processing is extremely slow.

So this time, we go from batches of 5000 down to batches of 20 messages.
Pre-processing 7500 messages (20 at a time in parallel, using Java parallel streams) gives these timings (in ms):
count=7511,
min=254.77307,
max=1339.014478,
mean=763.5231923230446,
stddev=290.4453464479874,
median=671.951292
On average it takes about 700 ms to enrich our data.

We have 5 tables we enrich from:
1 table: 2 columns, 41 records, fragment_size default
1 table: 6 columns, 20k records, fragment_size default
1 table: 28 columns, 600k records, fragment_size 30k
1 table: 64 columns, 200k records, fragment_size 10k
1 table: 7 columns, 11M records, fragment_size 500k
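For reference, fragment_size in MapD is set per table at creation time; a sketch of the DDL for the largest table above (table and column names are placeholders, column list abbreviated):

```sql
-- hypothetical DDL; only fragment_size reflects the setup described above
CREATE TABLE dim_large (
  id  BIGINT,
  val TEXT ENCODING DICT
) WITH (fragment_size = 500000);
```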

The only other server property we have changed is cpu-buffer-mem-bytes, set to 4 GB.

The machine running MapD Server CE 4.0.2 has:
16 cores, hyper-threaded, showing 32 CPUs
380 GB of memory
I am not sure about the hard drive, but I am pretty sure it is an SSD.

Running a profiler showed a lot of time spent in Thrift.

Now, I understand that this is the CPU-only version,
but I was not expecting such poor performance.

Is there anything else we can do to improve performance other than trying with GPUs?

Thank you very much


#2

If my assumptions are correct, I don't think using GPUs would help you much; you just have to change your workflow if you want to use MapD for the transformation stage of an ETL process.

My assumption is that you are reading records from a staging database (or similar), doing one or more SELECTs in MapD to get data from your (dimension?) tables, then inserting into the fact table. Is this right?

You noticed that serial processing is slow, so you rightly decided to hide latency by parallelizing the process. But MapD is an analytical MPP database: it runs just ONE query at a time, using all your hardware resources to return a response as fast as possible. This works well for heavy analytical queries, not for small lookup queries.

You could try loading a staging table with a batch insert, running a query on MapD that joins the staging table with the tables already in MapD, then doing another batch insert into the fact table (I don't know if this is possible, because I have no idea what kind of transformations you are doing).
If you want to save round trips between MapD and Java, you can issue a COPY command on MapD to save the query result to a text file on the local machine, then another COPY command to load the data into the fact table.
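The COPY-based workflow above could look roughly like this (table, column, and file names are placeholders; the exact transformations depend on your enrichment logic):

```sql
-- 1) bulk-load the raw batch into a staging table
COPY staging_tbl FROM '/tmp/batch.csv';

-- 2) run the enrichment join once, set-based, exporting the result
COPY (SELECT s.col1, s.col2, d.attr
      FROM staging_tbl s
      JOIN dim_tbl d ON d.id = s.dim_id)
  TO '/tmp/enriched.csv';

-- 3) bulk-load the enriched rows into the fact table
COPY fact_tbl FROM '/tmp/enriched.csv';
```

This replaces thousands of small lookup queries with one analytical join, which is the kind of work MapD is built for.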