Memory keeps increasing, causing mapd to crash


#1

I have a database in mapd-core with around 2M rows and I want to visualize it as a pointmap in the frontend.

But when I reloaded it a few times, the cpu memory (in htop) kept increasing (by around 0.2GB each time I rendered the image) and did not decrease after the process finished. This accumulation caused my mapd to crash and need to be restarted.

My first guess was that my javascript program calling MapdCon's renderVega function did not disconnect its connection to mapd afterwards. But I think I did disconnect it, following what I read in the documentation.

In my case, what is the cause of the increasing memory?

I would really appreciate your help and response.
Thanks.


#2

Hi, sorry for mentioning you @dwayneberry. I have the same problem as @easymavinmind. Is there another way to free the cpu memory on the server that mapd has already consumed?


#3

@easymavinmind can you tell us a bit more about your javascript program and how it is implemented so we can try to determine if your theory is a good place to start?

Also, what does your instance of MapD look like? Is it Community Edition installed locally?


#4
function init() {
  var vegaOptions = {}
  var con = new MapdCon()
    .protocol("http")
    .host(myhostserver)
    .port(myport)
    .dbName("mapd")
    .user("mapd")
    .password("HyperInteractive")
    .connect(function(error, con) {
      con.renderVega(1, JSON.stringify(exampleVega), vegaOptions, function(error, result) {
        if (error) {
          console.log(error.message);
        }
        else {
          var blobUrl = `data:image/png;base64,${result.image}`;
          google.maps.event.addDomListener(window, 'load', initMap(blobUrl));
          setTimeout(function(){
            console.log(blobUrl);
            $("#loading").addClass("hide"); 
          }, 1000)
        }
      });
    })
    .disconnect((error, con) => console.log(error, con))
    .sessionId() === null;
}

That is my javascript program…
I have installed the mapd community edition on my remote server and access it from my local machine.

Also, when I queried mapdql via jdbc, the cpu's memory kept increasing and was never flushed (when I checked \memory_summary in mapdql, the used memory was always xx MB for cpu and gpu). Is that normal? And what should I do to reduce its memory usage and prevent my server from crashing?

Thank you for your reply :slight_smile:


#5

Also, if you can share your mapd_server.INFO log, that would help us as well.

It could be a concurrent users issue, or maybe a cache issue. The settings for both of these are not (yet) configurable, but we rarely see issues around this unless there’s another issue at play.


#6

@easymavinmind one of our engineers has investigated. We would still like to see the log to narrow down what’s happening, but he suspects there are two bugs that we can file and fix. I’m going to see if I can get him to answer directly, but to sum up:

  1. disconnect does not automatically flush the render cache
  2. we don’t currently expose the max concurrent render users setting

However, we’re also checking to see if our docs are incorrect with regard to calling disconnect.


#7

When you say mapd crashes, are you saying that it starts swapping due to out-of-memory and the OS ultimately kills the process? Getting the INFO log of the server will help in determining the cause of the crash.

As for memory increases during render, first of all, currently on an explicit disconnect, any render caches associated with the session don’t get cleaned up. This is an oversight and will be addressed in the next release.

Secondly, the server handles up to 500 concurrent render sessions/caches. Unfortunately this is not configurable currently. We can make this a configurable startup option if need be. With your script, if you were to run it hundreds of times, you’d see cpu memory steadily increase until the 500th iteration, at which point memory would stabilize.

So for clarity, here’s how the render cache mechanism works:

  • Up to 500 caches can be stored (not configurable yet)
  • After a render session cache has been idle for 5 minutes, it gets cleared. (this also isn’t configurable)
  • There are a number of objects that make up the render session cache. Most are very small, but the biggest hit to cpu memory is a cached auxiliary buffer used for subsequent hit-test calls. The size of this buffer is determined by the dimensions of the image in the renderVega call.

There’s a handful of things that can be done to improve the footprint of these caches, and they should be addressed soon.

But to help you out in the meantime, are you really in need of constantly connecting and disconnecting in this way? Not only will you run into cache issues, but it is obviously a bottleneck in several ways.

This usually isn’t an issue because the number of concurrent render users tends to be small, and again, unused sessions get cleaned up after 5 minutes.

But if you have a case where the number of concurrent render users is high, then we'll need to get some better solutions for you.

Overall, the best approach is to keep one render session open for the entirety of one's use, and to pass a consistent id as the first argument to the renderVega() call for each separate window or view. For example, if you wanted a backend-rendered pointmap and a scatterplot on the same dashboard for the same session, keep a unique id for each and pass it to their respective renderVega() calls. This will ultimately create a separate cache for each window per unique id.
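As a rough sketch of that pattern (the connection details are from the snippet above; pointmapVega, scatterplotVega, and the two callbacks are placeholders your app would define, not working values):

var con = new MapdCon()
  .protocol("http")
  .host(myhostserver)
  .port(myport)
  .dbName("mapd")
  .user("mapd")
  .password("HyperInteractive")
  .connect(function(error, session) {
    if (error) { console.log(error.message); return; }
    // One long-lived session; each view keeps its own unique widget id,
    // so every renderVega call re-uses that view's render session cache.
    session.renderVega(1, JSON.stringify(pointmapVega), {}, onPointmapRendered);
    session.renderVega(2, JSON.stringify(scatterplotVega), {}, onScatterplotRendered);
  });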

Lastly, the javascript snippet above has a slight issue: the connect call is async, but you call disconnect immediately, before the connect and subsequent renderVega complete. You're likely not actually disconnecting here. Just wanted to point that out. If you got that code snippet from the docs, could you point us to where it appears so we can correct it?

Thanks.


#8

@easy @vastcharade, thank you for your responses, I really appreciate them.

I followed the instructions at https://mapd.github.io/mapd-connector/docs/#disconnect for calling disconnect after renderVega, and at https://pypi.org/project/JayDeBeApi/ for the jaydebeapi module in Python, but neither seems to close the connection.

My mapd crashed because cpu ram kept increasing until it reached the maximum and then spilled into swap memory; my server then gave very slow responses and I had to manually restart the mapd_server and mapd_web_server.

In my case, when I did a new query, the memory increased by 200MB. My cpu ram and swap memory total 47GB; does that mean I can only query around 47/0.2 ≈ 235 times?

I have an idea to restart my mapd when the memory nears the limit, but that would require the user to refresh the mapd frontend, wouldn't it?

@vastcharade, you said that
“After a render session cache has been idle for 5 minutes, it gets cleared”
In which memory does the cache live that gets deleted after 5 minutes idle?

For your suggestion to keep one render session, do you have example code?
And does it mean that the cache will be kept in the user's memory?

Also, can you show me the appropriate way to connect and disconnect around renderVega?

Thank you for your time…
At the moment, my server is down and being maintained, so I can't post the mapd_server.INFO. Next time, I will post it in this thread.


#9

I followed the instructions at https://mapd.github.io/mapd-connector/docs/#disconnect for calling disconnect after renderVega, and at https://pypi.org/project/JayDeBeApi/ for the jaydebeapi module in Python, but neither seems to close the connection.

Right, but because you’re chaining async calls like this:

var con = new MapdCon()
    .connect(...)
    .disconnect(...)

disconnect gets called before a connection is really established (because connect is asynchronous), so the disconnect() call does nothing. Perhaps it should throw an error or warning instead of silently returning, but that is why the connections you're making aren't really being closed. You have to call disconnect only after a successful connection has been made. Again, this won't fix the memory issues you've described; I'm just pointing out why your disconnect calls weren't actually disconnecting.
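Since you asked for the appropriate way to connect and disconnect, here is a minimal sketch with disconnect moved inside the callbacks, so it only fires once the connect and render have completed (host, credentials, and exampleVega are placeholders, as in your snippet):

var con = new MapdCon()
  .protocol("http")
  .host(myhostserver)
  .port(myport)
  .dbName("mapd")
  .user("mapd")
  .password("HyperInteractive")
  .connect(function(error, session) {
    if (error) { console.log(error.message); return; }
    session.renderVega(1, JSON.stringify(exampleVega), {}, function(error, result) {
      if (error) console.log(error.message);
      // ... use result.image here ...
      // Only now, after the connection was established and the render
      // finished, does calling disconnect actually close the session.
      session.disconnect(function(error, session) {
        if (error) console.log(error.message);
      });
    });
  });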

In my case, when I did a new query, the memory increased by 200MB. My cpu ram and swap memory total 47GB; does that mean I can only query around 47/0.2 ≈ 235 times?

200MB every query seems extreme. This should not be happening. I’d need to see your log to get a better understanding of the vega, query and data used.

Again, there is some caching done during render queries, maxing out at 500 different render sessions. For example, if each active render session were to render a 1024x1024 image, that roughly equates to caching about 13MB of cpu memory. So, if all 500 sessions were used concurrently, that comes to approx 6.5GB of memory. There are certainly things we can do to improve that footprint. But at this point I just don't understand why you're seeing 200MB bumps per session. That would only make sense if you were rendering massive images (on the order of 17 megapixels, going by the cache's per-pixel footprint).

I have an idea to restart my mapd when the memory nears the limit, but that would require the user to refresh the mapd frontend, wouldn't it?

Yes, such a design would require a reconnection in the frontend. But that just sounds like a workaround for your memory issues. We should see if we can really address those first.

In which memory does the cache live that gets deleted after 5 minutes idle?

The render session cache, which is cpu memory, would get cleared. In the above example, where the rendered image was 1024x1024, that cache would be approx 13MB of cpu memory.

For your suggestion to keep one render session, do you have example code?
And does it mean that the cache will be kept in the user's memory?

Sorry, I don’t have great example code for you, but basically you would do this:

function init() {
  var con = new MapdCon()
    .protocol("http")
    .host(myhostserver)
    .port(myport)
    .dbName(mydb)
    .user(myuser)
    .password(mypwd)
    .connect(function(error, con) {
      // setup events/callbacks to continuously call con.renderVega()
      // while this connection is valid

      // can even setup an event that would trigger a disconnect at
      // the appropriate time here
    })
}

By preserving the connection and running renderVega repeatedly for that session (and widget id), you would re-use that 13MB render session cache over and over again.
By reconnecting, you're building a new 13MB cache every time (and again, these caches are only cleared once they've been idle for 5 minutes).
My point is, keep an established connection open to rerun queries/renders and only make new connections when it makes sense to do so.

I hope this helps.

Chris


#10

mapd_server.INFO.zip (416.0 KB)

Here is my mapd_server.INFO as promised; hope it depicts my mapd settings, logs, and bugs.


#11

What an explanation!
It helps me understand mapd better…

I will restructure my javascript to connect and disconnect in the proper way.

Btw, I have uploaded my mapd_server.INFO; hope that helps you and me analyze where my 200MB of memory comes from. :joy::joy:


#12

@easy

How do I check whether the session cache has been cleared after 5 minutes?

Also, can you describe the hardware behind https://www.mapd.com/demos/, such as CPU cores and RAM, GPU RAM, etc.? I saw that you have a huge amount of data there, around 400M tweets, with many queries fired every time I move or click with my mouse, and yet (it seems) the server never goes down…
In my case, I inserted around 200M rows with 25+ fields, and each time I run a query, the cpu ram increases significantly and can even crash mapd (a query such as select distinct id,latitude,longitude from database limit 1000000, or another aggregation).


#13

You should see a line similar to this in the mapd_server.INFO log:

QueryRenderManager - purging 2 idle connections.

Do note that the LRU cache here is not purged every 5 minutes by a scheduled task. The architecture is currently designed to purge these render session caches upon an actual render call.

So, take the following scenario for example:

  1. render_vega() is called by session id 1, widget id 1
  2. ten minutes passes: the previous render session cache for session 1, widget id 1 will not have been cleared.
  3. render_vega() is called by session id 2, widget id 1 - at this point the idle render session cache for session 1, widget id 1 is purged
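A rough sketch of that timeline in connector terms (sessionA, sessionB, and vegaSpec are placeholders; the timing is only illustrative):

// Session A renders; its render session cache is created.
sessionA.renderVega(1, JSON.stringify(vegaSpec), {}, function(error, result) {
  setTimeout(function() {
    // Ten minutes later, session A's cache is idle but still resident,
    // since nothing has triggered a purge. This render from session B
    // is what finally evicts any cache idle for longer than 5 minutes,
    // session A's included.
    sessionB.renderVega(1, JSON.stringify(vegaSpec), {}, function(error, result) {});
  }, 10 * 60 * 1000);
});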

@easy would you be able to direct @easymavinmind with the second half of his question?

I'm not an expert on the query side of things, but I'm pretty sure that SELECT distinct id, .... is an expensive query, both in terms of performance and memory, particularly with high-cardinality columns. The 200M rows, and the fact that the column you're running DISTINCT against is called "id", lead me to believe this may be a culprit for the large memory increases you're seeing. Again, I'm not sure here, but it is suspect.

Also, are your server and client running on the same machine?


#14

Thank you (again) for your comprehensive answer, Christopher.

So, in the mapd_server.INFO, can I inspect how much memory has been purged for each cache?

No, they run on different machines.
But what if I ran them on the same machine? Would there be any big difference?


#15

Unfortunately you cannot currently inspect the real size of render session caches in the log. You can calculate an estimate from the size of the rendered image though: width * height * 12 == approximate cache size in bytes. We can do better about this footprint and properly log the cache size. This is on our TODO list.
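As a quick worked example of that estimate (a hypothetical helper, not part of any MapD API):

// width * height * 12 == approximate render session cache size in bytes
function estimateRenderCacheBytes(width, height) {
  return width * height * 12;
}
estimateRenderCacheBytes(1024, 1024); // 12582912 bytes, i.e. the ~13MB figure mentioned earlier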

But I don’t believe these caches are ultimately the reason for your memory issues (unless you really do need 500+ concurrent render users).

From the mapd_server.INFO log you uploaded, there are lines such as this:

cpuSlabSize is 4096M
ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0

That slab allocation is called twice, so it seems more than 8GB of cpu memory is used there.

and then this guy, which is where it seems things started to go haywire:

I0517 13:19:55.042361  7785 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.049919  7786 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.050570  7787 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.050938  7788 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.051473  7789 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.061585  7790 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.062219  7791 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.062853  7792 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.063414  7793 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.063946  7794 BufferMgr.cpp:39] OOM trace:
runImpl:341
alloc_gpu_mem:33 : device_id 0, num_bytes 6559884288
I0517 13:19:55.064829  7018 RelAlgExecutor.cpp:1635] Query ran out of GPU memory, punt to CPU

This punt to cpu seems to be the big problem here. The render session caches, although they contribute to the overall memory, tend not to be the big culprit. For clarity, the log you provided used a total of 44 individual render session caches, and 42 of those were purged, so the final memory footprint on the render session cache side would be relatively small.

But this deserves closer examination from us, and I am not an expert on cpu memory usage on the query side, so I'll have to punt it to someone who is.

We’ll get back to you.


#16

@easymavinmind To help us along, could you provide the machine stats of your server (cpu type, mem, #gpus, gpu type, gpu mem)?

Also, from your INFO log it seems you're primarily making use of the twitter_eco and twitter_eco_test tables. Can you tell us how many rows are in each of those tables and give us the CREATE TABLE statement used for each?
You can get the CREATE TABLE statement by running \d twitter_eco from a mapdql terminal. This info may help with diagnosing the issue.


#17

When I posted the above mapd_server.INFO, I forgot which cpu specs I was using :sweat: (I have upgraded my hardware since I faced the cpu memory crashes). At the moment, I use an AMD Ryzen 7 1700, an 8-core processor with 2 threads per core, totaling 16 cpus, with 64GB of memory (upgraded from 16GB) plus 16GB of swap, and a GTX 1070 for my gpu.

After this upgrade, my uptime is indeed much longer than before, but at some point the memory still reaches its limit and mapd needs to be restarted. I'm just curious how many users it can handle, benchmarking against your demo, which is able to handle many users and lots of data and seemingly never crashes.

As for the table statement, most of the columns are TEXT ENCODING DICT(32), except for lon/lat, which I set to FLOAT, and the timestamps, which are BIGINT… so I think it's an expensive table :))