Hot, Warm, and Cold Data


#1

Hello…

How does MapD categorize data as hot/warm/cold? Do we have to train the database by running multiple queries, or is it something we can set up when we create the tables?

Basically, I can categorize the data beforehand and decide where it should be processed, CPU RAM or GPU RAM. But does that need to be part of the database architecture, i.e., decided while creating the tables? Or does it have some AI based on which the database decides?

These may be vague questions… but I would like to understand how to set up this data (hot/warm/cold) in MapD. Any help would be appreciated.

Thanks,


#2

No, it basically works this way:

When you execute a query, the data needed for processing (filtering, sums, group by) is read from disk, placed in system RAM, then copied into GPU RAM and processed; then, for the final projection, the results are read back into system memory and you get the result of your query.

When you launch another query, the additional data is read from disk; if there is enough system/GPU memory, the additional data is appended to the memory pools, and if there isn't, the pools are freed to accommodate the new data.

So it's an automatic cache mechanism and you don't have to do anything; note that the data you have in GPU memory is always also in system RAM.
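
As a minimal sketch of what that means in practice (the table and column names here are hypothetical, and the comments describe the expected behavior rather than measured numbers), you can watch the cache warm up just by running the same query twice:

-- First run is "cold": the touched columns travel disk -> system RAM -> GPU RAM,
-- so the elapsed time includes the disk read and the PCIe transfer.
select region, sum(amount)
from sales
group by region;

-- Second run is "hot": region and amount are already resident in GPU RAM,
-- so no disk read or PCIe transfer is needed and the query returns faster.
select region, sum(amount)
from sales
group by region;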


#3

Thanks. However, do we account for the data transfer time between CPU and GPU in this case? I agree the GPU is meant to speed things up, but what about the cost of the transfer between CPU and GPU?

Also, as I read through other blogs, the best hardware is needed to be able to use GPU power. Would commodity hardware be able to handle it?

Just thinking along the lines of hardware, so we can plan it out well.


#4

Hi,

There is a cost to moving the data from disk to CPU and then from CPU to GPU; there is a cost to all data movement, as hopefully we all understand. However, the speed between CPU and GPU on a GPU server is generally many times faster than the speed from disk to CPU: a fast disk subsystem might deliver 1-1.5GB/s, whereas CPU-to-GPU PCIe bandwidth for a single modern card is 4GB/s+ (a modern dual Intel CPU server with 8 GPUs can have 80 lanes of PCIe 3, which is theoretically 80GB/s of bandwidth, and with the new AMD motherboards and CPUs we are talking about 128 lanes of PCIe 3). So it is not as much of a bottleneck as it initially seems.

MapD, as a GPU in-memory database, tries to absolutely minimize the movement of data between any of our cache layers whenever possible; if data changes in the system, we try to move only the subset of data that has been added, rather than reloading entire tables or columns.
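
To put rough numbers on that (a back-of-envelope sketch using the illustrative bandwidth figures above, not measurements), the transfer times for, say, a 6GB working set can be computed directly in any SQL shell that supports FROM-less SELECT:

-- Illustrative only: 6GB working set, 1.5GB/s disk vs. ~12GB/s effective PCIe 3 x16.
select 6.0 / 1.5  as disk_to_cpu_seconds,  -- ~4.0s from disk
       6.0 / 12.0 as cpu_to_gpu_seconds;   -- ~0.5s over PCIe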

On the hardware front, we have some details here that hopefully can help you decide what might be appropriate for the size of your problem.

Regards,


#5

I did some benchmarks on a gaming desktop (super commodity hardware… sub-1500€ total cost).

Peak bandwidth of the PCIe bus is 15GB/sec, and the dataset is a mid-sized star schema: one fact table of 455M rows and three dimensions.

With a query like this:

select dim1_desc,
       dim2_desc,
       count(*),
       sum(fact_measure1),
       avg(fact_measure1),
       stddev(fact_measure1),
       stddev(fact_measure2),
       avg(fact_measure1)
from fact
join dim1 on fact.dim1_key = dim1.dim1_key  -- join keys assumed; the original post omitted the ON clauses
join dim2 on fact.dim2_key = dim2.dim2_key
join dim3 on fact.dim3_key = dim3.dim3_key
group by 2, 1
order by 2, 1

The memory needed for the joins and the rest of the operations is:

field                   bytes  total size (MB)
join/group by field 1     1        434.51
join/group by field 2     4       1738.04
join field 3              1        434.51
calc field 1              4       1738.04
calc field 2              4       1738.04

total                             6083.13 MB
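
(As a quick sanity check, assuming each size is simply row count × bytes per value, the row count works out to roughly 455.6M, consistent with the ~455M fact table above; that exact figure is inferred, not from the original numbers.)

-- Illustrative check: per-column footprint = rows x bytes per value, in MB.
select 455600000 * 1 / 1048576.0 as one_byte_column_mb,   -- ~434.5 MB
       455600000 * 4 / 1048576.0 as four_byte_column_mb;  -- ~1738.0 MB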

The timings:
GPU accelerated, data only in system memory, GPU cache uninitialized: 1780ms
GPU accelerated, data only in system memory, GPU cache initialized: 1580ms
GPU accelerated, data already in GPU memory: 590ms
CPU only (4 cores): 2920ms

So the transfer took more or less a second (1780ms cold minus 590ms hot ≈ 1.2s for the ~6GB working set), i.e., a transfer rate of roughly 5GB/sec; I guess the gap to the 15GB/sec peak is software overhead from the driver and MapD's caching mechanism.

As you can see, even on sub-1500€ hardware the performance is consistent and the caching mechanism isn't that taxing; with better hardware with more PCIe lanes, it's going to be even better.


#6

This is awesome!! Thanks a lot, @aznable!!

I did look at some other benchmarks done overall, and they seem consistent with what you have. This will really help in building my case studies.