MapD hangs when using a DB copy from an EBS Snapshot


#1

Hi,

I’m trying to use a copy of our production database in a test environment. The production MapD instance’s data directory is on an EBS volume mounted on the EC2 instance running MapD (file system is ext4 on Ubuntu). I took a snapshot of the production EBS volume, and on a new test MapD EC2 instance I mounted a new EBS volume using this snapshot from production, with the intent of having this test MapD instance have a copy of production data.

Before starting up the test MapD instance, I clean out all of the log files in data/mapd_log and remove the lock file at data/mapd_server_pid.lck. I then start the test MapD instance. It starts up fine, writes its log files to the mapd_log/ directory, and creates a new data/mapd_server_pid.lck file. However, when I try to run anything in mapdql it just hangs. Simply doing a \t to list tables hangs indefinitely. Queries hang as well.
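For reference, the pre-start cleanup I'm doing amounts to roughly this (the helper name is just for illustration, and the data directory path is from my setup):

```shell
#!/bin/sh
# Sketch of the pre-start cleanup on the restored (snapshot-copied) data
# directory, per the steps described above. The function name and the
# example path are illustrative, not anything MapD-specific.
clean_mapd_data_dir() {
    data_dir="$1"
    # drop the old production log files copied over in the snapshot
    rm -f "$data_dir"/mapd_log/*
    # drop the stale PID lock file left behind by the production server
    rm -f "$data_dir/mapd_server_pid.lck"
}

# e.g. clean_mapd_data_dir /mapd-storage/data
```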

I’m trying to figure out where it’s running into a problem. Is there some sort of debugging knob I can turn to see what the MapD server is doing and why it’s hanging? All I see in the log, after all of the startup info, is this while it’s hanging:

I1020 12:59:22.741915   759 MapDHandler.cpp:349] User XXXXX connected to database XXXXXX
I1020 12:59:24.183737   759 FileMgr.cpp:116] Read table metadata, Epoch is 5787 for table data at '/mapd-storage/data/mapd_data/table_2_1/'

Or is there something special I need to do to MapD before taking the EBS snapshot so that it is in a “clean” state?

Thanks in advance.
BP


#2

Testing this further, this may have nothing to do with EBS or the snapshot. I have a MapD instance running in Docker on AWS ECS. The database has data loaded, and I can run queries against it fine. I kill that ECS service, leaving the data directory and associated files behind (they’re mounted as a volume in the Docker container). Then I start up another MapD Docker container on the same machine, pointing its data directory/volume to the directory the original MapD instance was using. The server starts up fine, but I see the same hanging behavior in mapdql.

So it seems like this is more about how MapD is shut down than anything. Anyone have any ideas what would cause this, and the “proper” way to shut down a MapD instance so that the data files are safe and re-usable when the server is fired back up? Is there a specific signal we should send it to indicate it’s about to exit?
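To frame the question, what I'd expect a "proper" stop to look like is something along these lines. Note the assumptions: that the lock file holds the server PID, and that SIGTERM triggers a clean shutdown — the signal choice is exactly what I'm asking about, not confirmed MapD behavior:

```shell
#!/bin/sh
# Sketch of a graceful stop via the PID lock file. ASSUMES the lock file
# contains the server PID and that SIGTERM triggers MapD's clean shutdown
# path -- neither is confirmed; that is the open question above.
stop_mapd() {
    pidfile="$1"
    pid=$(cat "$pidfile") || return 1
    kill -TERM "$pid" 2>/dev/null || return 1
    # give the server up to ~30s to flush its state and exit
    i=0
    while kill -0 "$pid" 2>/dev/null; do
        i=$((i + 1))
        [ "$i" -ge 30 ] && return 1
        sleep 1
    done
    return 0
}

# e.g. stop_mapd /mapd-storage/data/mapd_server_pid.lck
```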


#3

OK, I can actually see the hang in the INFO log when I run the first query on the new MapD server instance:

I1020 20:01:53.941259   754 Calcite.cpp:260] Time in Thrift 13 (ms), Time in Java Calcite server 385827 (ms)
I1020 20:01:56.339835   763 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 71 ms GPU_MGR:15
I1020 20:01:56.341030   765 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 73 ms GPU_MGR:13
I1020 20:01:56.341763   766 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 73 ms GPU_MGR:12
I1020 20:01:56.342319   767 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 74 ms GPU_MGR:11
I1020 20:01:56.343431   764 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 75 ms GPU_MGR:14
I1020 20:01:56.344161   768 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 72 ms GPU_MGR:10
I1020 20:01:56.344682   769 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 72 ms GPU_MGR:0
I1020 20:01:56.346340   770 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 74 ms GPU_MGR:1
I1020 20:01:56.348371   772 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 75 ms GPU_MGR:3
I1020 20:01:56.350420   774 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 77 ms GPU_MGR:5
I1020 20:01:56.352376   776 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 79 ms GPU_MGR:7
I1020 20:01:56.354427   775 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 78 ms GPU_MGR:6
I1020 20:01:56.356390   773 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 80 ms GPU_MGR:4
I1020 20:01:56.358302   771 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 82 ms GPU_MGR:2
I1020 20:01:56.360303   777 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 84 ms GPU_MGR:8
I1020 20:01:56.362239   778 BufferMgr.cpp:266] ALLOCATION slab of 4194304 pages (2147483648B) created in 85 ms GPU_MGR:9
I1020 20:01:56.365770   754 MapDHandler.cpp:573] sql_execute-COMPLETED Total: 388265 (ms), Execution: 388264 (ms)

..... subsequent queries are normal speed.......

I1020 20:20:25.814374   753 MapDHandler.cpp:561] sql_execute :Fec90Pc2VxBmtDihvjbeWyIIpqNAaZmo:query_str:select count(*) from XXXXXX;
I1020 20:20:25.815446   753 Calcite.cpp:247] User XXXXXX catalog XXXXXX sql 'select count(*) from XXXXXX;'
I1020 20:20:25.838557   753 Calcite.cpp:260] Time in Thrift 3 (ms), Time in Java Calcite server 20 (ms)
I1020 20:20:25.866287   753 MapDHandler.cpp:573] sql_execute-COMPLETED Total: 50 (ms), Execution: 49 (ms)

So the first query run after restarting a MapD instance with a lot of data spends over 6 minutes in “Java Calcite server”. What exactly is that doing, and is there any way to speed it up?


#4

Hi

how are you adding records to your DB?

regards


#5

Using COPY FROM to load the data via CSV files.


#6

Hi

It looks like you are running an older version of MapD. Please update to 3.2.4 to get more logging on your DB restart.

It is loading the metadata of the table. There is a longer discussion here: https://github.com/mapd/mapd-core/issues/104

How large of a table are you using?

regards


#7

I’m running version mapd-ee-3.2.3-20170922-fbc71bb-Linux-x86_64-render

There’s only one table in the database and it has 500 million rows.


#8

Ah yes, that issue describes what I’m seeing to a T.

Thanks! Will keep an eye on that issue.


#9

Hi,

There are significant improvements in v3.2.4; you should download it and try it out.

regards


#10

Definitely will. Thanks!