Exception: Hash join failed, reason(s): Not enough memory for the columns involved in join


#1

I have a table with 1,814,733,986 rows and 42 columns, approximately 360 GB of data. I am executing the following query:
select * from tbl1 a join tbl1 b on a.some_id = b.some_id;
and I am getting the exception below:
Exception: Hash join failed, reason(s): Not enough memory for the columns involved in join

Configuration (Community Edition):
8 vCPUs, 52 GB RAM, 1 Tesla K80 GPU on Google Cloud Platform.

I am also facing a problem with a basic query like
select * from tbl where col = 'some_value';
which takes forever to respond and never returns a result.


#2

Hi @sumit7986 -

With the two queries you’ve specified, you are asking MapD to return all 1.8B rows during a query. Surely that’s not actually what you want to do? The error message you are seeing means your dataset doesn’t fit in GPU memory, and a single Tesla K80 isn’t going to be able to handle it.

If you add a LIMIT 1000 clause to your queries, you might see a result, but the best thing you can do is specify only the columns you are actually interested in analyzing. MapD will load just those columns into memory.
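For example, something along these lines (the column names here are placeholders):

select a.some_id, a.col1, b.col2
from tbl1 a
join tbl1 b on a.some_id = b.some_id
limit 1000;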

Thanks,
Randy


#3

Given that it’s a self join, and the join key is unlikely to be unique, the result cardinalities are likely to be much higher: if each key matches only a thousand other keys, 1.8B rows × 1,000 matches already puts the result in the trillions. So even after taking out the select * or adding a limit, this is unlikely to work (in any other in-memory DB as well).

Do you mind sharing what you are trying to achieve? Perhaps then we could be of more help.

Regards.


#4

Regarding the second query: in general, MapD is not optimized for large projections without limits. Even if your predicate is selective, because we store data in large blocks as an atomic unit, all of the data (360 GB) may need to come off disk to satisfy the select *. As @randyzwitch mentioned, it will work better if you specify just the columns you need.

We may optimize for such use cases in the future (pulling only the specific rows that pass a predicate from disk rather than pulling block-wise, or adding iterator support), but for the analytical workloads in MapD’s sweet spot this has not been a high priority.


#5

Hi,
Sumit and I are looking into the possibility of moving our workload from Google’s BigQuery to MapD.
We regularly have such big tables to query, and we work with our data science team to help them analyze huge sets of data.
This is one query that was taking us 130 seconds on BigQuery.
@randyzwitch we have put limits on the query and have faced the same issue.
So if MapD pulls all the data into GPU memory, then we would need a huge number of GPUs to hold the 356 GB, which is not possible.

Also, does that mean MapD is only meant to query tables that fit in the GPUs’ memory?

We have also tried running both queries with just 2-3 columns and have faced the same issue.


#6

Hi @anantmb,

Thanks for this info. To be clear, when you do a projection (i.e. select *) of columns that do not involve any calculations (e.g. x * 2), those columns are only brought into CPU memory. However, that still won’t fit into the 52 GB of CPU memory you have on that instance.

Apart from the join, is your goal just to return all values (or a subset of columns) matching a certain filter? MapD can do that, but most of our users and customers would likely do some additional analytics in situ (i.e. aggregates and group bys) such that a significant portion of the database is not returned.

To begin with, you’ll likely need a bigger instance (i.e. with 4 or 8 GPUs) to have a shot at running that query even without the select *. Then it becomes a matter of the cardinality of that join. With more GPUs, it may be possible with sharding. In BigQuery, can you report the result of select count(*) from tbl1 a join tbl1 b on a.some_id = b.some_id;?


#7

Hi Randy,

I have now tried with selected columns only and also applied the limit as you suggested, but the query fails with the same exception.

mapdql> select a.col1, a.col2, a.col3, a.col4 from tbl1 a join tbl1 b on a.some_column = b.some_column limit 10;
Exception: Hash join failed, reason(s): Not enough memory for the columns involved in join

Thanks-
Sumit


#8

Hi @randyzwitch,

I also tried increasing the GPUs from 1 to 4, but I am still getting the same exception.

mapdql> select a.col1, a.col2, a.col3, a.col4 from tbl1 a join tbl1 b on a.some_column = b.some_column limit 10;
Exception: Hash join failed, reason(s): Not enough memory for the columns involved in join

The data type of the column I am joining on is TEXT ENCODING DICT.

I am not able to find out the exact reason for the exception I am getting.

Thanks-
Sumit


#9

Hi @sumit7986,

You are facing two problems with your queries.

In the first one, the self join, you are running out of GPU memory because the engine has to materialize at least the joined columns in GPU memory; the join itself fails, so limiting the number of columns or rows doesn’t help. You can try the \cpu command in mapdql and then rerun your query; maybe 52 GB of CPU memory is enough.
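For example, in a mapdql session (query shortened for illustration):

mapdql> \cpu
mapdql> select a.col1 from tbl1 a join tbl1 b on a.some_id = b.some_id limit 10;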

Without limits, especially if you are asking for a lot of columns, you will likely have a problem at the projection stage which, as @darwin stated before, is done on CPU, so it will use system memory and CPU resources only; assuming you have enough memory, you could still face performance problems, so it’s better to push calculations down into the MapD engine to reduce the number of rows returned.

To use multiple GPUs on a join between two tables with a high number of records, you have to shard both of them on the join column.

Take a look at the docs regarding table definition:
http://docs.mapd.com/latest/5_tables.html
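A minimal sketch of a sharded table definition, assuming a 4-GPU instance and some_id as the join key (column names are placeholders; see the linked docs for the full syntax):

create table tbl1 (
  some_id text encoding dict(32),
  col1 double,
  shard key (some_id)
) with (shard_count = 4);

Both sides of the join need the same shard key and shard count (a self join like yours satisfies this automatically) so each GPU only joins its own shard.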


#10

Hi,

I have now tried a different query and am getting the exception below:

mapdql> select * from transactions where TransId in ( select TransId from ( select SKU,TransId,count(distinct(Promotion)) as Cnt from transactions where Promotion is not null and length(Promotion)>4 and Promotion <> '0000000000' and EXTRACT(YEAR FROM TransDate_dt)=2017 and TransId in (select TransId from ( select ,case when Quantity < 0 then 1 else 2 end as Tag from transactions where TransId in (select TransId from transactions where Promotion is not null group by 1 ) ) group by 1 having avg(Tag)=2 ) group by 1,2 ) where Cnt>1 group by 1 ) order by CustomerId,SKU;
Exception: Query would use too much memory

Configuration (Community Edition):
8 vCPUs, 52 GB RAM, 4 Tesla K80 GPUs on Google Cloud Platform.

Thanks-
Sumit


#11

Hi,

I have now tried increasing the GPUs to 8 and am still getting a memory exception with this simple query:

mapdql> select TransId from transactions where Promotion is not null group by 1;
Exception: Query would use too much memory

Configuration (Community Edition):
8 vCPUs, 100 GB RAM, 8 Tesla K80 GPUs on Google Cloud Platform.


#12

In cases like these, it’s useful to start from the beginning. Does any super-simple query work, like "select col from table limit 100"? Same thing with a simple group-by query. Testing these will tell us whether there is a GCP install problem.

Once the GCP install is confirmed to be working correctly, where does the query fail when you try to run each piece of this?

select 
* 
from transactions 
where TransId in ( 
  select 
  TransId 
  from ( 
        select 
        SKU,
        TransId,
        count(distinct(Promotion)) as Cnt 
        from transactions 
        where Promotion is not null and 
        length(Promotion)>4 and 
        Promotion <> '0000000000' and 
        EXTRACT(YEAR FROM TransDate_dt)=2017 and 
        TransId in (
                    select 
                    TransId 
                    from ( 
                          select 
                          ,case when Quantity < 0 then 1 else 2 end as Tag 
                          from transactions 
                          where TransId in (
                                            select 
                                            TransId 
                                            from transactions 
                                            where Promotion is not null 
                                            group by 1) 
                          ) 
                     group by 1 
                     having avg(Tag)=2 
                     ) group by 1,2 
        ) where Cnt>1 group by 1 
                ) 
order by CustomerId,SKU;

It’s likely that the levels of nesting in this query are what’s causing issues. It’s also likely that it can be rewritten to apply all of the predicates as early as possible, so that you are operating on a much smaller subset of your data.
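As a rough illustration of that idea (a sketch only, not a drop-in equivalent of the query above, since the intermediate steps differ), the promotion filters can be applied once in an inner subquery, and the outer query joined against that much smaller result:

select t.*
from transactions t
join (
      select TransId
      from transactions
      where Promotion is not null and
            length(Promotion) > 4 and
            Promotion <> '0000000000' and
            EXTRACT(YEAR FROM TransDate_dt) = 2017
      group by TransId
      having count(distinct Promotion) > 1
     ) f on t.TransId = f.TransId
order by t.CustomerId, t.SKU;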


#13

Hi Randy,

I have tried this simple query and it fails:
mapdql> select TransId from transactions where Promotion is not null group by 1;
Exception: Query would use too much memory

Thanks-
Sumit


#14

Hi Randy,

The query below returns a result:
mapdql> select SKU, count(*) from transactions group by 1;

Thanks-
Sumit


#15

Great. So in this case, we’ve validated that your install is working as expected. I suspect your issue here is that TransId has so many distinct values that your MapD instance isn’t large enough to return all of them to your console / keep them in memory. SKU probably has a few thousand values at most, which fit in memory easily.
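A quick way to compare the two cardinalities (a sketch; an exact count(distinct) on a column this large can itself be expensive, so the approximate version is used for TransId):

select count(distinct SKU) from transactions;
select approx_count_distinct(TransId) from transactions;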

So now you are back to sizing your VM properly. I’m not familiar with GCP specifically, but for a table of this expected size and cardinality, we usually use something like 4 or more GPUs / 48 GB of GPU RAM and 256 GB of CPU RAM or more. If you’re going to be regularly manipulating the entire 1.8-billion-row table, you’re going to need more hardware. This is a slightly different thought process than BigQuery, which as I understand it auto-scales for the user.


#16

Hi Randy,

I tried with the following configuration:

8 vCPUs, 100 GB RAM, 8 Tesla K80 GPUs on Google Cloud Platform.

Thanks-
Sumit


#17

Please see the following link, which shows recommended hardware sizing, specifically the table "Hardware Sizing Schedule":

https://www.mapd.com/docs/latest/4_hardware_configuration_guide.html

With 8 Tesla K80s, you don’t have nearly enough CPU RAM. It is recommended that you have between 4x and 8x as much CPU RAM as GPU RAM.
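To put rough numbers on that (assuming GCP’s K80s, which expose 12 GB per GPU): 8 × 12 GB = 96 GB of GPU RAM, so the guideline calls for roughly 384–768 GB of CPU RAM. The 100 GB instance is far below that range, and even 415 GB is at the low end.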


#18

You should try disabling the watchdog for the instance. Another user had a similar problem and solved it by adding a column with the same contents, text dictionary encoded.

I suspect that for distincts or group bys the engine builds a bitmap sized by the maximum value of the column, regardless of its actual cardinality.
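A minimal sketch of that workaround, assuming you can reload the table (the extra column and table name here are hypothetical):

create table transactions2 (
  TransId bigint,
  TransId_dict text encoding dict(32), -- same values as TransId, dictionary encoded
  Promotion text encoding dict(32)
);
-- after reloading the data, group by the encoded copy instead:
select TransId_dict from transactions2 where Promotion is not null group by 1;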

Check this topic for reference


#19

Hi

I have now tried with the machine configuration below:
32 vCPUs, 415 GB RAM, 8 Tesla K80 GPUs on Google Cloud Platform.
I am not able to increase the CPU RAM above 415 GB because Google Cloud Platform does not allow it.

and the same query still fails with the same exception:

mapdql> select distinct(TransId) from PosB_all_data where Promotion is not null;
Exception: Query would use too much memory

Thanks-
Sumit


#20

Hi @sumit7986

It’s the watchdog that’s blocking query execution, regardless of the host/GPU memory.

You can disable it by adding this line to mapd.conf
enable-watchdog = false
and restarting the instance (a minimal sketch follows),

or by changing the data type of your field (assuming you actually have skewed values).
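For reference, the sketch, assuming a default systemd install (the config path and service name may differ on your machine):

# /var/lib/mapd/mapd.conf
enable-watchdog = false

then restart the server, e.g. sudo systemctl restart mapd_server.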