Can Broken Pipe error also indicate data corruption?


#1

I have a very simple query like the one below:
select a, b, c, d, e,count(d) from mytable group by a, b, c, d, e HAVING COUNT(d) > 1;

mytable occupies ~1 TB of disk space, has 31 columns, and is stored as a single table on a single-node community MapD 4.0.

When I run the above query through mapdql, I receive the following message:
Thrift: Tue Jul 17 20:59:31 2018 TSocket::write_partial() send() <Host: localhost Port: 9091>Broken pipe

In the INFO log I see the following:
I0717 20:58:12.672221 7887 BufferMgr.cpp:283] ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
F0717 20:59:17.350100 10372 StringDictionaryProxy.cpp:91] Check failed: it != transient_int_to_str_.end()

Does the above error indicate a memory overrun? Also, ever since this error occurred, the following queries run for an indefinite time, whereas they used to respond in sub-seconds. Does this indicate data corruption? I restarted the server and tried again.

select count(*) from dfd_result;
select * from dfd_result limit 1;

Am I missing something?


#2

@mbaxi before we get into the error, can you tell me whether simpler queries work? Can you select fewer columns? Can you select all of those columns without the GROUP BY clause?


#3

@mbaxi also what type of data is in those columns?


#4

Yes, I am able to run simple queries and even GROUP BY queries with filters. The GROUP BY query I mentioned above scans the entire table's data, so I suspect a memory overrun … I was not able to check \memory_summary while the query was running.

I manually cleaned up the page cache and restarted the mapd server; the count and projection queries then worked fine as before, so the data wasn't corrupted, but the mapd server was somehow taking an indefinite time to run even simple queries.

Columns a to d are of type TEXT ENCODING DICT(16) and column e is of type BIGINT.

Hope this helps.


#5

The cache memory consumption of your query is 16 bytes per row: 8 for the first 4 fields and 8 for the BIGINT one. So if you have 1B rows, you would need just 16 GB for caching. Looking at the error, it seems there is a problem with the data dictionary of a text-encoded field.
You can get performance problems because of the use of BIGINT. Could you tell me the max, min, and distinct count of that column?
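A minimal sketch of that arithmetic, for illustration only (the per-column sizes and the 1-billion row count are assumptions stated in the post, not measured values):

```python
# Estimate the group-by cache footprint described above.
# Assumed sizes: four 2-byte dictionary-encoded columns (a-d) + one 8-byte BIGINT (e).
column_sizes = [2, 2, 2, 2, 8]           # bytes per row for a, b, c, d, e
bytes_per_row = sum(column_sizes)        # 16 bytes per row
num_rows = 1_000_000_000                 # assumption: ~1B rows, for illustration

total_gb = bytes_per_row * num_rows / 1e9  # decimal GB
print(bytes_per_row, total_gb)             # 16 bytes/row -> 16.0 GB
```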


#6

Here are the details for each field:

Field   Datatype                 Distinct count   Min             Max             Size (bytes)   Memory (GB)
a       TEXT ENCODING DICT(16)   9                NA              NA              2              2.19
b       TEXT ENCODING DICT(16)   54               NA              NA              2              2.19
c       TEXT ENCODING DICT       864              NA              NA              4              4.38
d       TEXT ENCODING DICT       21600            NA              NA              4              4.38
e       BIGINT                   54491            1527778921000   1530478850900   8              8.75
Total                                                                             20             21.88
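The per-column figures above are all consistent with a single implied row count; a quick sketch backing it out (the row count is inferred from the table, so it is an estimate, and GB is taken as decimal, i.e. 1e9 bytes):

```python
# Back out the row count implied by the table above, then recompute each
# column's in-memory footprint from its per-row size.
sizes_bytes = {"a": 2, "b": 2, "c": 4, "d": 4, "e": 8}
mem_gb = {"a": 2.19, "b": 2.19, "c": 4.38, "d": 4.38, "e": 8.75}

num_rows = mem_gb["a"] * 1e9 / sizes_bytes["a"]   # ~1.095 billion rows
for col, size in sizes_bytes.items():
    est_gb = size * num_rows / 1e9
    print(col, round(est_gb, 2))   # matches the Memory (GB) column within rounding

total_gb = sum(sizes_bytes.values()) * num_rows / 1e9
print(round(total_gb, 2))          # close to the 21.88 GB total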

The amount of memory consumed by the following queries is the same as you mentioned (num rows * size of data type):
select distinct <column> from mytable;
select <column>, count(*) from dfd_result group by <column>;

Next I ran the following query; memory utilization increased by ~22 GB, which is as expected.
select a, b, c, d, e,count(d) from mytable group by a, b, c, d, e HAVING COUNT(d) > 1;

It took 20 min for the failure to occur; here is the log:
I0719 17:30:28.634575 2783 Calcite.cpp:316] User mbaxi catalog test sql 'select a, b, c, d, e,count(d) from mytable group by a, b, c, d,e HAVING COUNT(d) > 1;'
I0719 17:31:05.237534 3011 MapDHandler.cpp:332] User mbaxi connected to database test
I0719 17:46:55.143383 2793 FileMgr.cpp:174] Completed Reading table's file metadata, Elapsed time : 985283ms Epoch: 6 files read: 2264 table location: '/data/mapd/data/mapd_data/table_3_1/'
I0719 17:46:57.562595 2793 Catalog.cpp:2322] Instantiating Fragmenter for table mytable took 987704ms
I0719 17:46:58.069736 2783 Calcite.cpp:329] Time in Thrift 18 (ms), Time in Java Calcite server 989417 (ms)
I0719 17:46:58.159112 2783 Catalog.cpp:2375] Time to load Dictionary 3_1 was 46ms
I0719 17:46:58.189857 2783 Catalog.cpp:2375] Time to load Dictionary 3_6 was 30ms
I0719 17:46:58.253661 2783 Catalog.cpp:2375] Time to load Dictionary 3_2 was 63ms
I0719 17:46:58.284938 2783 Catalog.cpp:2375] Time to load Dictionary 3_5 was 31ms
I0719 17:46:58.352494 9990 BufferMgr.cpp:283] ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
I0719 17:47:46.448696 10437 BufferMgr.cpp:283] ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
I0719 17:48:19.384099 10589 BufferMgr.cpp:283] ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
I0719 17:48:54.944201 10571 BufferMgr.cpp:283] ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
I0719 17:49:32.283368 10485 BufferMgr.cpp:283] ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
I0719 17:50:13.684221 10665 BufferMgr.cpp:283] ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
F0719 17:51:51.272341 2783 StringDictionary.cpp:318] Check failed: string_id < static_cast<int32_t>(str_count_) (32800 vs. 9)

I re-ran the query after removing the HAVING COUNT(d) > 1 filter. The query still failed in ~20 min, but the error logged was slightly different:
F0719 18:23:36.634987 19775 StringDictionaryProxy.cpp:91] Check failed: it != transient_int_to_str_.end()

Should I increase the log level? Would that be helpful?