Use Case: Large Customer-Order-Item Dataset


#1

Hello,

I am currently doing some research on in-memory databases, both CPU- and GPU-accelerated. I don't know if this is the right place to ask about our use case, but I will give it a shot.

We have 3 tables:

  1. Order (600GB, 20 cols wide, 3 billion rows)
  2. Customer (60GB, 12 cols wide, 1 million rows)
  3. Item (200GB, 32 cols wide, 500 million rows)

The star schema is standard, with Order as the fact table. We have about 500 analysts using this at the same time. Today our solution is based on Hadoop and aggregates.

Let's say 200 GB of the data is hot data. To use MapD, what would our setup need to look like, just to get an overview of the infrastructure cost?

I also have a couple of questions regarding MapD:

  • Does MapD keep all the table data loaded in DRAM?
  • Does MapD fill out VRAM with everything it needs for the SQL query?
  • What happens if the data doesn’t fit, will it fail, run slower?
  • Is there logic to keep datasets that are used by other concurrent users (hot data)?
  • Complex queries require a lot more memory to create the result set; what happens if this overflows the VRAM, or is the result stored somewhere else?
  • Will it be a problem for MapD to share the GPU machines with other tenants?
  • We also have event streams of 100 GB / day. I understand it is not feasible to keep all of this in memory, but is MapD planning to develop a component for loading on demand from HDD storage, e.g. Hadoop? Or is MapD purely for a set of data you know you will query, much like the MOLAP cube use cases?
  • I read that upsert is coming in the future; is it on this year's roadmap?

Thank you for your time,
Robert


#2

I will try to respond:

  1. Does MapD keep all the table data loaded in DRAM? It loads into DRAM, then transfers to VRAM only the columns needed by the query for filtering, grouping, and calculating; for simple projections the data stays in DRAM (see the sketch after this list).
  2. Does MapD fill out VRAM with everything it needs for the SQL query? Mainly yes, see the previous response.
  3. What happens if the data doesn't fit, will it fail, run slower? In the CE it fails, but the software is improving at a steady rate.
  4. Will it be a problem for MapD to share the GPU machines with other tenants? No, but depending on the other software/queries, performance could be affected.
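
For example (a hypothetical query; the column names are only illustrative), only the columns the query actually touches get moved to the GPU:

-- Hypothetical query: filter, group-by and aggregate columns go to VRAM.
SELECT customer_id,               -- group-by key     -> VRAM
       SUM(price)                 -- aggregate input  -> VRAM
FROM orders
WHERE order_date >= '2017-01-01'  -- filter column    -> VRAM
GROUP BY customer_id;
-- A simple projection such as SELECT order_number FROM orders LIMIT 100
-- can instead be served from DRAM without touching VRAM.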

Btw this software is really fast: on a single i5-4670K with a GTX 1080 Ti (so a desktop-level card) and a small dataset (120M rows in the main table plus other lookup tables), it is between 20-30 times faster than Oracle 12c with the In-Memory option running on a 16-core machine (2x E5-2640 v2) when launching the queries serially, and 80-100 times faster with concurrent queries, with a comparable level of compression on the data and minimal tuning effort (mainly you have to choose the right encoding according to your data). The load throughput is also very good (700k rows/sec on a 22-column dataset on a quad-core CPU).

Even without using a graphics card, my i5 is at least 10 times faster than the 16-core Oracle machine.


#3

Thank you for your reply.

Does MapD keep all the table data loaded in DRAM? It loads into DRAM, then transfers to VRAM only the columns needed by the query for filtering, grouping, and calculating; for simple projections the data stays in DRAM.

This means we would need 860GB of DRAM and 200GB of VRAM, plus any extra DRAM or VRAM needed for the result set, depending on how it is stored.

Thank you,
Robert


#4

As far as I know, the data that needs to stay in VRAM is the data needed by the query that is currently running. So if your query references 1/4 of your columns over the entire dataset (extent-level statistics would save some memory and processing), you could need enough VRAM to fit about 25% of the size of your dataset; if another query needs other data, the old data in VRAM is "swapped out" to make room for the new data.
You should test carefully which encoding to use for each column; the defaults proposed by the import tool aren't the best space-wise.
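
For example, a minimal sketch of what explicit encodings could look like (the table, the column names and every ENCODING choice here are only assumptions for illustration; check the MapD CREATE TABLE documentation and your real value ranges before settling on anything):

-- Untested illustration: verify each ENCODING against your data and MapD version.
CREATE TABLE orders_example (
  order_number TEXT ENCODING DICT(32),  -- dictionary-encode the repeated string key
  item_id      INTEGER,                 -- hundreds of millions of distinct ids need the full 32 bits
  price        DOUBLE,
  customer_id  BIGINT,
  order_date   DATE ENCODING FIXED(32)  -- assumed option: store the date in 32 bits instead of 64
);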


#5

Hi,

Robert, glad you are reviewing MapD.

Are the GB sizes you are sharing the size of the original CSV files?

MapD uses memory lazily, so unless your queries require every column and every row of every table, it is unlikely you will need everything in VRAM and DRAM. Also, what is stored in memory is compressed as much as possible.

Ideally, to help you size this, we would need to see your schema and have a discussion around the types of queries you expect to run.

When moving from an older-style database to MapD and Immerse, people are often in a "report generation" mindset, as that is what current tools focus on. With MapD and Immerse you can rethink the way you interact with the data: build dashboards to explore the real data in depth rather than producing large "Excel"-like reports, and let the analysts drill down into the subset of data they are specifically interested in.

regards


#6

Use case:
Look at the user journey (the page flow resulting in an order) for one item genre and get the top 10 orders with the most sales for the years 2015-2017, segmented by gender.

The tables below are schemas for this specific use case; the real tables have more columns. The SQL statement is not tested, but it should give you a good picture. Also, the table sizes are for the uncompressed text files, and all our data is in Hadoop.

CREATE TABLE orders( --contains the order lines
order_number string,
item_id int,
price double,
customer_id bigint)
PARTITIONED BY (order_date date)
STORED AS PARQUET;

CREATE TABLE page_views(
event_datetime timestamp,
order_number string,
page_sequence int,
page_name string,
duration int)
PARTITIONED BY (event_date date)
STORED AS PARQUET;

CREATE TABLE customers(
customer_id bigint,
gender string)
STORED AS PARQUET;

CREATE TABLE items(
item_id int,
genre string)
STORED AS PARQUET;

SELECT t1.order_date,
t1.order_number,
t1.price,
t3.gender,
t2.page_sequence,
t2.page_name,
t2.duration -- duration the user stayed on the page
FROM ( SELECT t1.order_date,
t1.order_number,
SUM(t1.price)
FROM ORDERS t1
JOIN ITEMS t2 ON t1.item_id = t2.item_id
WHERE t2.genre = 'Some Genre' AND Year( t1.order_date ) IN ( 2015, 2016, 2017 )
GROUP BY t1.order_number
ORDER BY t1.price DESC
LIMIT 10 ) t1
LEFT JOIN PAGE_EVENTS t2
ON t1.order_number = t2.order_number
LEFT JOIN CUSTOMERS t3
ON t1.customer_id = t3.customer_id
ORDER BY t1.order_number, t1.price, t2.page_sequence

Resultset
order_date | order_number | price | gender | page_sequence | page_name | duration
2017-01-01 | 12345 | 155.0 | Male | 1 | Main | 9
2017-01-01 | 12345 | 155.0 | Male | 2 | Item | 5
2017-01-01 | 12345 | 155.0 | Male | 3 | Cart | 16
2017-02-01 | 12346 | 1644.2 | Female | 1 | Main | 32
2017-02-01 | 12346 | 1644.2 | Female | 2 | Search | 29
2017-02-01 | 12346 | 1644.2 | Female | 3 | Cart | 9

Page views event stream uncompressed size
16GB of page-view data per day, resulting in a total of about 17TB for the years 2015-2017

This query runs for a couple of minutes with a 22TB allocation using Tez with a lot of optimization settings

Sorry for the unstructured post


#7

Hi,

What is the record count of the PAGE_EVENTS/page_views table?

I am also assuming you want to order by the sum of the order

so the inner query would be more like:

SELECT
   t1.order_date,t1.order_number,SUM(t1.price) ordersum
   FROM ORDERS t1
   JOIN ITEMS t2 ON t1.item_id = t2.item_id
   WHERE t2.genre = 'Some Genre'
   AND Year( t1.order_date ) IN ( 2015,2016,2017 )
   GROUP BY t1.order_number
   ORDER BY ordersum DESC LIMIT 10

I am doing the calculations for sizing using the schema you shared and these numbers:

Order ( 3 billion rows)
Customer ( 1 million rows)
Item ( 500 million rows)

but I need the record count on PAGE_EVENTS

Ignoring PAGE_EVENTS, just to get a ballpark worst case:

ORDER x 3B

item_id (join)
order_date (filter)
order_number (groupby)
price(order by)
customer_id (join)

~20 bytes x 3B = ~60GB

ITEM 500M

item_id (join)
genre (filter)

~8 bytes x 500M = ~4GB

CUSTOMER 100M

customer_id (join)

~4 bytes x 100M = ~400MB
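
Summing those rough per-table figures, and noting that each P40 card carries 24GB of VRAM:

~60GB (ORDER) + ~4GB (ITEM) + ~0.4GB (CUSTOMER) = ~64.4GB
3 x 24GB = 72GB of VRAM, so the hot columns fit with some headroom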

So on a machine with 3 P40 GPUs we should be able to run a query like this in a timely manner. Assuming that PAGE_EVENTS has some volume, I would think we might want to look at 4 GPUs.

regards


#8

Excellent post, thank you.

The row count is 48 billion for the years 2015-2017 for page views.

Ok, so if I understand this correctly, all the data for the calculation needs to fit in VRAM, while DRAM needs to fit the complete set of data. What about loading from HDD? For instance, if we were to keep the years 2016 and 2017 in memory but 2015 in Hadoop, that doesn't work now and is not on the roadmap?

Perhaps your calculations change with the updated row count, but with the current calculation, would 4 P40s (4 x 24GB = 96GB VRAM) and 1TB of DRAM do the trick? (18TB of DRAM if we were to store all the page views too)


#9

Hi

Yes the size of page_views certainly affects the calculations

There are two columns that need to go to the GPU:

order_number (join)
page_sequence (order by)

so the total VRAM for 2015-2017 is ~8 bytes x 48B = ~384GB

This now brings the total memory required to 448GB VRAM.

So assuming we can still apply the same shard key (date), we could distribute this data across 3 machines similar to Amazon's p2.16xlarge (192GB VRAM each).
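
As a rough sanity check on that node count, using the estimates above:

3 x 192GB = 576GB >= 448GB, i.e. roughly 150GB of active columns per node once the data is sharded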

These calculations are based on you wanting to rapidly query data from any year in this time span.

Regarding your question about DRAM: only the columns used in the calculation plus the columns used in the projection get loaded into DRAM; the entire table contents do not get loaded unless they are used.

What stays on disk is defined by your query access patterns: if you never ran a query that accessed 2015 data (i.e. your filters were always date > 2015), that data would never be moved off the disk. If and when you then ran a query on the older data, only then would it be loaded into VRAM. We try to be careful with all RAM and lazy-load data as much as we possibly can.
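
For example (a hedged sketch reusing the column names from your schema), as long as the dashboards only issue queries like the one below, the 2015 fragments of page_views never leave disk; they would be loaded the first time a filter actually covers them:

-- Only fragments whose event_date metadata can pass this filter are read from
-- disk into DRAM/VRAM; the 2015 data stays on disk untouched.
SELECT page_name,
       COUNT(*) AS views
FROM page_views
WHERE event_date >= '2016-01-01'
GROUP BY page_name;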

regards


#10

Thank you Dwayne, I appreciate you taking the time. I have a much better picture now of what MapD can do and how it would fit into our organization.

Happy to hear that reading from HDD is not an issue. The next step would be to invest in some machines to run a PoC. I will keep the sharding in mind for distribution; I am not familiar with the table creation process in MapD yet.

Thank you.


#11

If you use the page_views table mainly as a (giant) lookup in your query, you could consider these two options:

  1. Split the query into two statements (see the sketch after this list). The first would run on GPU and is a CREATE TABLE AS SELECT that builds an intermediate table, doing the first aggregation over orders and items and then joining customers. The second statement would join the page_views table to the intermediate table and could run in a reasonable time on CPU.

  2. Leave the page_views table in an index-capable database and run the query through a data virtualization tool, like Informatica Data Services or similar, so part of the query would run on MapD and the final part (the lookup one) on an external database.
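
A minimal, untested sketch of option 1, reusing the table and column names from earlier in the thread (the intermediate table name is made up, and the exact CREATE TABLE AS SELECT support depends on your MapD version):

-- Step 1 (intended for GPU): build the small top-10 intermediate table.
CREATE TABLE top_orders AS
SELECT o.order_date,
       o.order_number,
       c.gender,
       SUM(o.price) AS ordersum
FROM orders o
JOIN items i ON o.item_id = i.item_id
JOIN customers c ON o.customer_id = c.customer_id
WHERE i.genre = 'Some Genre'
  AND EXTRACT(YEAR FROM o.order_date) IN (2015, 2016, 2017)
GROUP BY o.order_date, o.order_number, c.gender
ORDER BY ordersum DESC
LIMIT 10;

-- Step 2: join the huge page_views table against the 10-row intermediate table;
-- this part can fall back to CPU if it does not fit in VRAM.
SELECT t.order_date, t.order_number, t.ordersum, t.gender,
       p.page_sequence, p.page_name, p.duration
FROM top_orders t
LEFT JOIN page_views p ON t.order_number = p.order_number
ORDER BY t.order_number, p.page_sequence;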


#12

Thank you aznable, yes, going for a mixed solution might be an idea. We have a lot of aggregates running on the page-view data, so it is not only used for lookups, but it is worth noting. I have yet to implement MapD, so I don't know how flexible and easy it is to do something like you described, but I am eager to get started.


#13

It would not be so difficult; MapD is straightforward, and the software has the option to fall back to CPU if for whatever reason the GPU is unable to complete the query. So with 72GB of VRAM you would be able to process the first aggregation and the join with the customers table, creating a small intermediate table; then, assuming you have enough RAM on your system, the CPU would perform the join between page_views as the inner table and the small intermediate table you created.
MapD is also very fast on CPU when it has to aggregate or join with small outer tables (an i5-4670K is only about 5x slower than a GTX 1080 Ti), but if the outer tables have a significant number of records (I tried joining an inner table of 200M rows with an outer one of 35M), the penalty is bigger (roughly 10x).

Of course, in the specific scenario with data virtualization the query would run faster, because the access to page_views would be a few dozen nested-loop indexed accesses.