GPU query takes longer than CPU


#1

Hi, I'm doing some simple research comparing GPU vs CPU performance. To do that I created a table with 216M rows and 152 columns, and I deployed one container with nvidia-docker and another plain one so I can compare them both.

I ran the same query in both environments, and here are the results:
select x1,cast((sum(x2)/1024/1024/1024) as float) as y1 from table where x2 is not null group by x1 order by y1 desc limit 10;

Nvidia-docker:
Execution time: 46558 ms, Total time: 46564 ms
mapdql> \memory_summary
MapD Server CPU Memory Summary:
MAX USE ALLOCATED FREE
309351.94 MB 2478.79 MB 4096.00 MB 1617.21 MB

MapD Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE
[0] 10835.04 MB 1380.16 MB 2048.00 MB 667.84 MB
[1] 10835.04 MB 1098.63 MB 2048.00 MB 949.37 MB

CPU docker:
Execution time: 2602 ms, Total time: 2626 ms
mapdql> \memory_summary
MapD Server CPU Memory Summary:
MAX USE ALLOCATED FREE
309351.94 MB 2478.79 MB 4096.00 MB 1617.21 MB

MapD Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE

I'm using a Tesla K80, and my server has 2 Intel E5-2680v3 CPUs (2.5GHz / 30M cache / 2133MHz, 12-core, 120W).
Am I doing something wrong, or am I just misunderstanding the GPU concept?


#2

Hi, did you run the query multiple times or only once? The data is lazily loaded to the GPU, so the first query will usually be slower than subsequent ones because it includes the time required to move the data from CPU mem or disk over to the GPU.

For example, on my desktop a 216M row table of (x1 smallint, x2 bigint) takes about 12 seconds to run your query cold, but only 50ms once the data is on the GPU. CPU takes ~160ms warm. (4x GTX1080Ti, 1950x)

What are the types of x1 and x2? In mapdql, run \d tablename.

If GPU is still slow after running the query multiple times, could you try reducing the table down to just the two columns you're querying? Only the queried columns are loaded, but if the reduced table is faster than the full one, that would give us some hints as to where the problem might be.
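For example, assuming your MapD version supports CREATE TABLE AS SELECT (table names here are placeholders), the reduced table can be built like this:

```sql
-- Hypothetical names: copy only the two queried columns into a new table
CREATE TABLE mytable_slim AS
  SELECT x1, x2 FROM mytable;

-- Then run the same aggregation against the slim table and compare timings
SELECT x1,
       CAST((SUM(x2)/1024/1024/1024) AS FLOAT) AS y1
FROM mytable_slim
WHERE x2 IS NOT NULL
GROUP BY x1
ORDER BY y1 DESC
LIMIT 10;
```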


Also, you can switch between CPU and GPU modes in the same session; there is no need to run two separate containers. GPU mode is enabled by default if available, and you can switch between the two using \cpu and \gpu.
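A minimal sketch of such a session (the query is illustrative):

```sql
-- In a single mapdql session:
\gpu
SELECT COUNT(*) FROM mytable;  -- executes on GPU
\cpu
SELECT COUNT(*) FROM mytable;  -- same query, now executes on CPU
\gpu
```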


#3

Thanks Andrew! In the example above I only ran the query once. I knew that loading the data to the GPU was slow, but I wasn't aware of how much time it takes, so I didn't compare them after the first run; this is one of my first GPU approaches.
Running the query several times does the trick: now I see the GPU query takes about 200-300ms once the data is already loaded to the GPU, versus 350-450ms for the same query on CPU.
The type of x1 is BIGINT while x2 is TEXT.
Again, thanks!


#4

A difference of 44 seconds between CPU and GPU can only be explained by disk reads from a very slow disk; the CPU run found a big chunk of the data in the filesystem cache.


#5

I didn’t understand your response. In my understanding, under ideal circumstances the CPU run should be finding the data entirely in RAM, and the GPU run entirely in GPU ram.

I'm currently waiting on more GPU cards, so at the moment I'm not too concerned that I'm also running into the bottleneck of transferring data from system RAM to GPU RAM.

I do not benchmark by running the same query twice, because on some RDBMS that could result in getting the answer from a result cache. If the optimizer is smart enough to know the data didn't change and the same question was asked twice, it may simply repeat the answer and not calculate it. That's fine in the real world, but it isn't a test of the database's speed, as defined by its ability to get results for an ad hoc query.

I would do a different query on the same column, to make sure the column was loaded to GPU ram, and then test the query I wanted, running it only once.

By that I mean, on other RDBMS I’m sure to clear the result cache before running a query again, unless I was specifically testing the result cache.
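A sketch of that protocol, assuming a table `t` and the columns from the original query (names are placeholders):

```sql
-- 1) Warm-up: a *different* query touching the same columns, so x1 and x2
--    get loaded into GPU RAM without populating any result cache
SELECT COUNT(x1), COUNT(x2) FROM t WHERE x2 IS NOT NULL;

-- 2) The query under test, run exactly once against warm data
SELECT x1,
       CAST((SUM(x2)/1024/1024/1024) AS FLOAT) AS y1
FROM t
WHERE x2 IS NOT NULL
GROUP BY x1
ORDER BY y1 DESC
LIMIT 10;
```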

Anyway - could MAPD clarify, this database has no result cache?


#6

I was referring to the runtimes posted in the first message of the thread,

and

the difference cannot be explained by PCIe transfer time alone, but it's likely you were querying the columns involved for the first time, so the data had to be read from disk. Every system has a filesystem cache, so the huge difference could be explained by the state of the FS caches.

MapD uses system memory and GPU memory like caches, so if the data isn't in any cache it has to be read from disk; then the system RAM is populated and, if present, the GPU RAM. After that, the query is executed.

At the present moment MapD does not use any result cache, so the timings are always valid, even if you run the same query twice.

To improve performance on CPU you can try changing the fragment size, because if you are using the default one (32 million rows) you are using just 8 of your 24 cores.
If you want to save memory without sacrificing performance, you should normalize your data if you can; MapD is very fast on joins (especially GPU-side), and GPU RAM is limited.
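Fragment size is set per table at creation time. A sketch, assuming a 216M-row table and 24 cores (the column list is abbreviated):

```sql
-- The default fragment_size of 32M rows yields only ~7 fragments for 216M rows,
-- and fragments are the unit of CPU parallelism. A fragment size of ~9M rows
-- yields ~24 fragments, one per core.
CREATE TABLE my_table (
  x1 BIGINT,
  x2 BIGINT
  -- ... remaining columns ...
) WITH (fragment_size = 9000000);
```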


#7

Well, the documentation states: "Even though MapD is an 'in-memory' database, when the database first starts up it needs to read data from disk."

That's where I got the impression that startup was when things load into memory. Although I definitely did experience a slowdown on the first run of a query as data went into GPU RAM.

Would you agree that, say, you ran this first:

  SELECT SUM(COLUMN1) FROM TABLE; -- that loaded COLUMN1 from disk?

And if you ran

  SELECT AVG(COLUMN1) FROM TABLE;

and subsequently the same query again:

  SELECT AVG(COLUMN1) FROM TABLE;

Runs 2 and 3 should be approximately the same? Neither has the benefit of a result cache, and all we are going for is making sure the data is in GPU RAM, which was accomplished by step 1?

I honestly don't recall it working that way, but I'll double-check when I get home. I will also try the fragment-size step mentioned in the thread…


#8

That is correct. We cache the data in two tiers, not the results.