Logistic regression on top of MapD?


#1

Hello experts, I have a question I hope you can help me with.

A few days ago MapD was open-sourced and a collaboration between MapD and H2O.ai was announced. I would like to know whether, as of today, there is a way to run a logistic regression on a large data set in MapD without leaving the GPU stack.

For example, let's say we have a data set of 100 million rows in MapD running on GPUs. How can we run a logistic regression on that data set without moving the data outside the GPU stack? It is not clear to me whether I can currently run some XYZ software on top of MapD without moving data, or whether, unfortunately, I have to move the data outside the GPU stack in order to do the logistic regression.

Best regards.


#2

Hi,

Thanks for looking at MapD.

With the MapD API call sql_execute_gpudf you can tell MapD to run a query and leave the result set on the GPU for further processing. The format we are using for this interchange is Arrow. What is returned after the query is a pointer to this buffer, so you can do your further processing of the result set without the data ever having to leave the GPU.
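As a rough sketch of how that looks from Python, the example below assumes the pymapd client (whose select_ipc_gpu method wraps this kind of call) and uses hypothetical connection parameters, table, and column names:

```python
# Minimal sketch, assuming the pymapd Python client. select_ipc_gpu returns the
# result set as an Arrow-backed GPU dataframe, so the rows stay in GPU memory.
# Connection parameters and table/column names are hypothetical.
import pymapd

con = pymapd.connect(user="mapd", password="HyperInteractive",
                     host="localhost", port=9091, dbname="mapd")

# Run the query; the result never leaves the GPU.
gdf = con.select_ipc_gpu("SELECT feature_1, feature_2, label FROM my_table")

# gdf points at the GPU-resident buffer; downstream GPU libraries can
# consume it without copying the data back to the host.
print(gdf.columns)
```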

We are currently adding more documentation around this.

Regards


#3

Hi

If you would like to see an example implementation, you should review

https://github.com/gpuopenanalytics

It has an end-to-end Jupyter notebook and Dockerfile showing a regression running on MapD, with the data flowing from MapD to Continuum to H2O. A rough sketch of that pipeline is below.
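For orientation, this is only a sketch of the shape of that pipeline, not the notebook's actual code. It assumes pymapd for the query and h2o4gpu's scikit-learn-style LogisticRegression for the model, with hypothetical table and column names:

```python
# Sketch of the MapD -> GPU dataframe -> H2O pipeline shape. Assumes pymapd
# and h2o4gpu; table and column names are hypothetical.
import numpy as np
import pymapd
import h2o4gpu

con = pymapd.connect(user="mapd", password="HyperInteractive",
                     host="localhost", port=9091, dbname="mapd")

# Query result lands in GPU memory as an Arrow-backed GPU dataframe.
gdf = con.select_ipc_gpu("SELECT amount, balance, defaulted FROM loans")

# For simplicity this sketch copies the columns to host numpy arrays before
# fitting; the GOAI notebook demonstrates the GPU-to-GPU hand-off instead.
X = np.column_stack([gdf["amount"].to_array(), gdf["balance"].to_array()])
y = gdf["defaulted"].to_array()

# h2o4gpu exposes a scikit-learn-style API; the solver itself runs on the GPU.
clf = h2o4gpu.LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:10]))
```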

Regards


#4

Many thanks!

In fact, I asked the same question on the H2O.ai mailing list and they also pointed me to the implementation you mentioned.

I’m going to do some testing.

Best regards,

Iván