Hash Join exception and multi-table JOIN?


#1

I run the following query,
SELECT count(a.imsi),sum(a.download) from test_a a JOIN test_zc1 b ON a.imsi =b.imsi;
and it returns an exception:
Exception: Hash join failed, reason: Could not build a 1-to-1 correspondence for columns involved in equijoin
The test_a table contains 200 millions of records, and the test_zc1 contains 800 millions. So what’s the meaning of the exception? I’m sure that the imsi attribute is unique in both of the tables.
More, I have some doubts about the JOIN operation in Mapd.

Whether Mapd support multi-table JOIN or not?
Can I treat the subquery result table as the right table of JOIN ? Like this query:

 select MSISDN,20140926,AREA_ID,USER_ID from DATUM.MD_PER_INF_ATTRIBUTE_DAY_03 z
 join (select S00000000023 from EDC_SUB_TEMP_10000049978_ZZ5 a 
 left join DATUM.MD_VW_PER_USER02_DAY_03 b on b.USER_ID=a.S00000000023 and b.DAY_NUMBER = 25 and b.IS_HMD_MDR_MG=1 where b.USER_ID is null ) y  
 on z.USER_ID=y.S00000000023 where z.DAY_NUMBER = 26 and z.AREA_ID in (select t.area_id from edc_area t where t.parent_area_id in (1000515));

#2

Hi @neko940709,

Thanks for your question. As per our docs, the fact table (table with non-unique keys) always has to go first in the join followed by any dimension tables (tables with unique keys). Can you reorder the join and give that a try?

Its a restriction that we might relax at some point so stay tuned!

Regards


#3

Hi @darwin ,
Thanks for your answer. I’ve changed the order of the tables, and also tried to modify the type of imsi into text encoding dict then reload the data, but still got the same exception.
This is the result of the DISTINCT operation of these two tables:

mapdql> select distinct count(test_a.imsi) from test_a;
EXPR$0
19999998
mapdql> select count(test_a.imsi) from test_a;
EXPR$0
19999998
mapdql> select distinct count(test_zc1.imsi) from test_zc1;
EXPR$0
79999998
mapdql> select count(test_zc1.imsi) from test_zc1;
EXPR$0
79999998

This is the exception:

mapdql> SELECT count(a.imsi),sum(a.download) from test_zc1 b JOIN test_a a ON b.imsi =a.imsi;
E0702 11:57:35.148865 103746 MapDHandler.cpp:2331] Exception: Hash join failed, reason: Could not build a 1-to-1 correspondence for columns involved in equijoin
Exception: Hash join failed, reason: Could not build a 1-to-1 correspondence for columns involved in equijoin

#4

Hi

I ran the following step with the latest build of mapd-core open source code

create data

awk 'BEGIN { for (i = 1; i <= 80000000; ++i) print i,",",i }' > test_a.csv
awk 'BEGIN { for (i = 1; i <= 20000000; ++i) print i,",",i }' > test_b.csv

I then ran the following queries in mapdql which seems to duplicate your steps

mapdql> create table test_a (imsi text, download bigint);
mapdql> create table test_b (b_imsi text, b_download bigint);
mapdql> copy test_a from '/data/test_a.csv' with (header='false');
Result
Loaded: 80000000 recs, Rejected: 0 recs in 111.367000 secs
mapdql> copy test_b from '/data/test_b.csv' with (header='false');
Result
Loaded: 20000000 recs, Rejected: 0 recs in 30.025000 secs
mapdql> select count(*) from test_a;
EXPR$0
80000000
mapdql> select distinct count(imsi) from test_a;
EXPR$0
80000000
mapdql> select count(*) from test_b;
EXPR$0
20000000
mapdql> select distinct count(b_imsi) from test_b;
EXPR$0
20000000
mapdql> SELECT count(a.imsi),sum(a.download) from test_a a JOIN test_b b ON a.imsi =b.b_imsi;
EXPR$0|EXPR$1
20000000|200000010000000
mapdql> \version
MapD Server Version: 3.1.1dev-20170702-2966fac6
mapdql> 

Could you try these steps and see where your behaviour diverges, as based on your details above this should simulate roughly your environment.

Please confirm exactly which version of MapD you are running

regards


#5

Its a restriction that we might relax at some point so stay tuned!

@darwin - Is this on the roadmap? This is a big limiter for us at the moment.


#6

Hi @juanschwartz, do you mean one-to-many joins, i.e. where there are duplicate join keys in the dimension table? If so that will be landing in 3.2.0 which should be released this week.


#7

@darwin :

Correct. Currently you can only join tables with unique rows on the dimension tables. Ideally, I’d like to be able to join on multiple non-unique table columns. We are in the process of building a query engine on top of MapD as part of a proof of concept and we’ve had to work around the existing limitation. While we’ve had some success, removing the limitation at the cost of performance or requiring more hardware would be welcomed.

Thanks again!