Loading errors into MapD


#1

Hello!

My hardware specs: dual Intel Xeon 8-core CPUs, 8x GTX 1080 Ti GPUs, 128GB of RAM, and enough HDD space.

I started doing some loads into the DB, into a 10-column table. Initially I started with 45 million records and that went through without any issues. The load was done in small batches (5-10 million records per batch). Once the load was done, I was able to do some analysis through the UI.
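
For reference, each batch was loaded with a plain COPY from mapdql, something like this (path simplified, not my exact file name):

mapdql> copy dc_op_addr from '/data/addr_batch_01.csv' with (header='false');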

Next I consolidated all my files (about 280 million records) and did a load. This failed at the 90 million mark. Another try with 120 million also failed, with a partial load in the DB. The error that I noticed is 'not enough host memory'.

Any recommendations on load parameters to set, or should I go with smaller batches? Are there any other OS-level parameters that need tweaking?

_Ben


#2

I loaded over 500M records (I can't remember the batch size) on a machine with 16GB of RAM without problems. Could you share the DDL of the table? Are you using dictionary encoding for the text fields (assuming that is possible)?


#3

Here you go:

mapdql> \d dc_op_addr
CREATE TABLE dc_op_addr (
lon FLOAT,
lat FLOAT,
nbr TEXT ENCODING DICT(32),
addr TEXT ENCODING DICT(32),
unit TEXT ENCODING DICT(32),
city TEXT ENCODING DICT(32),
district TEXT ENCODING DICT(32),
region TEXT ENCODING DICT(32),
postalcode TEXT ENCODING DICT(32),
id TEXT ENCODING DICT(32),
hash TEXT ENCODING DICT(32))

Also, here is a data snapshot:

-84.5852233,38.2056161,138,ANNE JENNINGS WAY,GEORGETOWN,40324,4830a9d356d77601
-84.584946,38.2059878,135,ANNE JENNINGS WAY,GEORGETOWN,40324,eaf13cea21bd0d50
-84.5253863,38.2198444,100,JENNIFER DR,GEORGETOWN,40324,868a306c07f9ce6b
-84.660558,38.3592474,5221,LONG LICK PIKE,STAMPING GROUND,40379,31b9048b31ee9ec6

Does it make sense to adjust the encoding size?
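
For example, would something like this make more sense for the lower-cardinality columns (the sizes below are just a guess on my part, not tested)?

city TEXT ENCODING DICT(16),
district TEXT ENCODING DICT(16),
region TEXT ENCODING DICT(8),
postalcode TEXT ENCODING DICT(16),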

_Ben


#4

Hi,

Can you share the log where the error occurred, and the exact COPY command you were executing?

I have two current suspicions. The first is some odd formatting in the CSV being loaded; have you done some basic sanity tests on the load CSV, like wc, to make sure it doesn't have a single line that is huge?
The second is environmental, such as a limited number of file handles available to mapd_server. Please check what the ulimit -n setting is for the user running mapd_server. If you are running MapD via startmapd rather than via systemd, this can be an issue.
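
For example (the path below is a placeholder for your actual file):

wc -l /path/to/load.csv                                      # total line count
wc -L /path/to/load.csv                                      # longest line length (GNU coreutils)
awk -F',' '{print NF}' /path/to/load.csv | sort -n | uniq -c # how many lines have each field count
ulimit -n                                                    # open-file limit for the user running mapd_server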

There is normally no advantage in going with smaller batches, but in this case I am going to recommend it to help identify which part of your data is problematic.
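
If you do go with smaller batches, GNU split is a simple way to carve up the consolidated file (sizes and names below are just an example):

split -l 10000000 -d --additional-suffix=.csv addr_all.csv addr_batch_

That produces addr_batch_00.csv, addr_batch_01.csv, and so on, each with 10 million lines, which can then be loaded with separate COPY statements.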

Regards


#5

That's weird; I just did a load of 360M records in a single batch on my desktop and I haven't had any problem.

mapdql> copy b_dma_fspe_spese_el from '/opt/opendata/b_dma_fspe_spese_el11.out' with (header='false', delimiter='|', max_reject=100000) ;
Loaded: 362625809 recs, Rejected: 1 recs in 299.089000 secs
1 rows returned.
Execution time: 299090 ms, Total time: 299114 ms

As @dwayneberry said, maybe it is a soft limit in the OS; check the limits for the user running mapd with ulimit -a.

If it can help, here is the output on my desktop:
mapd@zion-ubuntu:~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 62948
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 62948
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Anyway, check your CSV files and set a limit on the number of errors on loading to easily identify the badly formatted lines.
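
For example, a deliberately low max_reject makes a badly formatted file fail fast instead of silently skipping rows (table and path below are placeholders):

mapdql> copy dc_op_addr from '/path/to/addr_batch_00.csv' with (header='false', max_reject=100);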


#6

Here is the log snapshot:
I0912 16:14:42.088706 31013 Calcite.cpp:249] User mapd catalog mapd sql 'SELECT conv_4326_900913_x(lon) as x,conv_4326_900913_y(lat) as y,us_op_addr.rowid FROM us_op_addr WHERE (lon >= -123.60153375983421 AND lon <= -71.90225595578937) AND (lat >= 26.14909859875476 AND lat <= 51.429683875113426) AND MOD(us_op_addr.rowid * 265445761, 4294967296) < 88382569 LIMIT 2000000'
I0912 16:14:42.099082 31013 Calcite.cpp:262] Time in Thrift 1 (ms), Time in Java Calcite server 9 (ms)
I0912 16:14:42.185405 31013 MapDHandler.cpp:1920] render_vega-COMPLETED Total: 100 (ms), Total Execution: 24 (ms), Total Render: 76 (ms)
I0912 16:14:48.090440 31013 MapDHandler.cpp:563] sql_execute :2FVfwRsQufXxU1Q7iTQXvpR0AUuYcnRz:query_str:SELECT postalcode as key0,COUNT(*) AS x,APPROX_COUNT_DISTINCT(city) AS y FROM us_op_addr WHERE (lon >= -123.60153375983421 AND lon <= -71.90225595578937) AND (lat >= 26.14909859875476 AND lat <= 51.429683875113426) GROUP BY key0 ORDER BY x DESC,y DESC LIMIT 30
I0912 16:14:48.091356 31013 Calcite.cpp:249] User mapd catalog mapd sql 'SELECT postalcode as key0,COUNT(*) AS x,APPROX_COUNT_DISTINCT(city) AS y FROM us_op_addr WHERE (lon >= -123.60153375983421 AND lon <= -71.90225595578937) AND (lat >= 26.14909859875476 AND lat <= 51.429683875113426) GROUP BY key0 ORDER BY x DESC,y DESC LIMIT 30'
I0912 16:14:48.105998 31013 Calcite.cpp:262] Time in Thrift 1 (ms), Time in Java Calcite server 13 (ms)
E0912 16:14:48.276177 31957 ExecutionDispatch.cpp:287] Failed to allocate 3399884800 bytes of memory
E0912 16:14:48.286497 31958 ExecutionDispatch.cpp:287] Failed to allocate 3399884800 bytes of memory
E0912 16:14:48.349323 31959 ExecutionDispatch.cpp:287] Failed to allocate 3399884800 bytes of memory
E0912 16:14:48.392462 31960 ExecutionDispatch.cpp:287] Failed to allocate 3399884800 bytes of memory
E0912 16:14:48.392942 31013 MapDHandler.cpp:2818] Exception: Not enough host memory to execute the query

Now that I am putting all the pieces together: the load failed due to a high number of rejections. While the load was going on, I had a colleague start creating a query on the same table to view the results in real time. It was a bubble chart that was being experimented with when the above error popped up.

The above memory allocation translates to roughly 3.2GB. I noticed in the server start-up messages that the Calcite server allocates 1GB of memory for its JVM process. Is that something that can be increased or tweaked, so I can avoid the above error?

(We are good on the loading side. The file was a bit inconsistent, with extra columns, and the total number of rejects exceeded 1M, hence the failure.)

_Ben


#7

Hi,

The query that is failing is trying to set aside a large amount of memory to store the results of the COUNT and APPROX_COUNT_DISTINCT. That is a concern, because there are only about 40K zip codes in the US and a similar number of cities, so I don't believe it should be trying to use so much memory. My first step would be to look at the data. My suspicion is that, due to the ETL issues you mentioned, you may have populated the postalcode or city columns with bogus data that gives one or both of these columns a very large cardinality. Could you please do a bit of a spot check on those two columns to confirm they are sane? An APPROX_COUNT_DISTINCT() is a good quick way to do this.
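
For example, something along these lines (using the table name from your log) will show the cardinality of each column:

mapdql> SELECT APPROX_COUNT_DISTINCT(postalcode) AS n_postalcodes, APPROX_COUNT_DISTINCT(city) AS n_cities FROM us_op_addr;

If either number comes back in the millions rather than the tens of thousands, the data in that column is suspect.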

On your second question, adjusting the Calcite memory usage will not help with this issue. It can be adjusted with the config parameter calcite-max-mem.
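
If you do want to change it for other reasons, it is, as far as I recall, a value in MB set in mapd.conf (or via the equivalent command-line flag), for example:

calcite-max-mem = 2048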

Regards