https://www.mapd.com/blog/mapd-pandas-arrow/


#1

https://www.mapd.com/blog/mapd-pandas-arrow/

Hello, I am trying to reproduce the example and got stuck selecting shared CPU data from MapD into pandas.

Everything works (connection, loading data into MapD, simple executes, …) until I run this code against one of the sample tables.
The table with drugs returns the same error.

query = "SELECT depdelay, arrdelay FROM flights_2008_10k limit 100"
df = con.select_ipc(query)
df.head()

I got this message:


ModuleNotFoundError Traceback (most recent call last)
in ()
1 query = "SELECT depdelay, arrdelay FROM flights_2008_10k limit 100"
----> 2 df = con.select_ipc(query)
3 df.head()

~\Anaconda3\lib\site-packages\pymapd\connection.py in select_ipc(self, operation, parameters, first_n)
296 raise ImportError("pandas is required for select_ipc")
297
--> 298 from .shm import load_buffer
299
300 if parameters is not None:

ModuleNotFoundError: No module named 'pymapd.shm'

Please help. I have pandas, pymapd and pyarrow installed.
I’ve tried the client side on both Ubuntu and Windows in a Jupyter notebook, with Python 3.5.2 and Python 3.6.4.


#2

Hi @Vyacheslav, sorry to hear you’re having issues with pymapd. Can you let me know how you installed pymapd? Did you use conda-forge, pip or GitHub?


#3

pip3 on Ubuntu, pip on Windows
Thanks!


#4

Unfortunately, I can confirm this is indeed a bug (running on Ubuntu):

I suspect it’s an issue with us changing our Thrift bindings in MapD-Core, but not updating pymapd. Hopefully we can get this resolved soon.

Thanks,
Randy


#5

@Vyacheslav If you are installing via pip, you need pyximport, which is part of Cython, to load the shm module (a .pyx file). It comes by default in conda builds.
Then, when importing pymapd, also run import pyximport; pyximport.install()
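
Before trying either fix, it may help to confirm whether the shm extension is really missing from the installed package. The following is a minimal sketch using only the standard library; module_available is a hypothetical helper name, and pymapd.shm is simply the module named in the traceback in post #1.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the import system can locate `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. pymapd itself) is missing
        # or when an already-loaded module has no import spec.
        return False


# Report on the package and on the extension module from the traceback.
for mod in ("pymapd", "pymapd.shm"):
    print(mod, "found" if module_available(mod) else "MISSING")
```

If pymapd is found but pymapd.shm is reported as MISSING, the install did not ship or build the extension, which matches the ModuleNotFoundError above.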


#6

I think we’ve found a temporary workaround, if you’re willing to use conda:

conda install -c conda-forge pymapd
conda install -c conda-forge pyarrow==0.7.1

# packages in environment at /home/mapdadmin/miniconda3/envs/condainstall:
#
# Name                    Version                   Build  Channel
arrow-cpp                 0.7.1                    py36_2    conda-forge
backcall                  0.1.0                    py36_0  
boost-cpp                 1.66.0                        1    conda-forge
bzip2                     1.0.6                         1    conda-forge
ca-certificates           2018.4.16                     0    conda-forge
certifi                   2018.4.16                py36_0    conda-forge
decorator                 4.3.0                    py36_0  
icu                       58.2                          0    conda-forge
intel-openmp              2018.0.0                      8  
ipython                   6.3.1                    py36_0  
ipython_genutils          0.2.0            py36hb52b0d5_0  
jedi                      0.12.0                   py36_1  
libgcc-ng                 7.2.0                hdf63c60_3  
libgfortran-ng            7.2.0                hdf63c60_3  
mkl                       2018.0.2                      1  
mkl_fft                   1.0.2                    py36_0    conda-forge
mkl_random                1.0.1                    py36_0    conda-forge
ncurses                   5.9                          10    conda-forge
numpy                     1.14.2           py36hdbf6ddf_1  
openssl                   1.0.2o                        0    conda-forge
pandas                    0.22.0                   py36_1    conda-forge
parquet-cpp               1.3.0.post                    2    conda-forge
parso                     0.2.0                    py36_0  
pexpect                   4.5.0                    py36_0  
pickleshare               0.7.4            py36h63277f8_0  
pip                       9.0.3                    py36_0    conda-forge
prompt_toolkit            1.0.15           py36h17d85b1_0  
ptyprocess                0.5.2            py36h69acd42_0  
pyarrow                   0.7.1                    py36_1    conda-forge
pygments                  2.2.0            py36h0d3125c_0  
pymapd                    0.3.2                    py36_0    conda-forge
python                    3.6.5                         1    conda-forge
python-dateutil           2.7.2                      py_0    conda-forge
pytz                      2018.4                     py_0    conda-forge
readline                  7.0                           0    conda-forge
setuptools                39.1.0                   py36_0    conda-forge
simplegeneric             0.8.1                    py36_2  
six                       1.11.0                   py36_1    conda-forge
sqlalchemy                1.2.7            py36h65ede16_0    conda-forge
sqlite                    3.20.1                        2    conda-forge
thrift                    0.10.0                   py36_0    conda-forge
tk                        8.6.7                         0    conda-forge
traitlets                 4.3.2            py36h674d592_0  
wcwidth                   0.1.7            py36hdf4376a_0  
wheel                     0.31.0                   py36_0    conda-forge
xz                        5.2.3                         0    conda-forge
zlib                      1.2.11                        0    conda-forge

Obviously, this is sub-optimal, but if you need to get started immediately, the two install commands above should do it. And in the meantime, we’ll work on getting the build issues straightened out.


#7

Thanks.
I did a completely fresh install on a new Ubuntu machine.

I used the latest full Anaconda, then applied the two conda-forge commands listed above:

  1. install pymapd and
  2. downgrade to old pyarrow (0.7.1)

Again, everything in the example works, including the database itself, Immerse, SQL, etc.
But I still cannot get the query result into pandas.

Now I get a new error message: ValueError: Invalid shared memory key …

Thanks for your help, I will continue tomorrow :)


#8

"Invalid shared memory key" is a subtle issue: shared memory only works if pymapd is running on the same computer as MapD. If they are on the same computer, the result can stay where it is in memory and you avoid an I/O step.

But if you’re remote, the results have to travel across the wire before you can build a pandas dataframe locally.
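
A rough sketch of that distinction, assuming con is a pymapd connection: try the shared-memory path first and fall back to an ordinary cursor when the key is rejected. select_rows is a hypothetical helper; it relies only on select_ipc and execute, both mentioned in this thread, and it assumes the cursor follows the DB-API convention of putting the column name first in each description entry.

```python
def select_rows(con, query):
    """Try the shared-memory path first, then fall back to a normal cursor.

    Returns ("ipc", dataframe) when select_ipc succeeds, otherwise
    ("wire", (column_names, rows)) fetched through con.execute.
    """
    try:
        return "ipc", con.select_ipc(query)
    except ValueError as exc:
        # The thread shows "Invalid shared memory key ..." as a ValueError;
        # any other ValueError is passed through unchanged.
        if "shared memory" not in str(exc):
            raise
    cursor = con.execute(query)
    columns = [col[0] for col in cursor.description]  # DB-API: name is first
    return "wire", (columns, list(cursor))
```

In the "wire" case the rows can be turned into a dataframe with pd.DataFrame(rows, columns=columns), at the cost of the extra transfer described above.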


#9

Thanks. Everything on the same box - Ubuntu 16.04 LTS, osboxes.org image.

Example

df = pd.DataFrame({"A": [1, 2], "B": ["c", "d"]})
con.create_table("table_name", df, preserve_index=False)
con.load_table_columnar("table_name", df, preserve_index=False)

Works great.

Then

query = "SELECT A FROM table_name"
df = con.select_ipc(query)
df.head()

And finally - error.


ValueError Traceback (most recent call last)
in ()
1 query = "SELECT A FROM table_name"
----> 2 df = con.select_ipc(query)
3 df.head()

~/anaconda3/lib/python3.6/site-packages/pymapd/connection.py in select_ipc(self, operation, parameters, first_n)
306 )
307
--> 308 sm_buf = load_buffer(tdf.sm_handle, tdf.sm_size)
309 df_buf = load_buffer(tdf.df_handle, tdf.df_size)
310

pymapd/shm.pyx in pymapd.shm.load_buffer()

pymapd/shm.pyx in pymapd.shm.load_buffer()

ValueError: Invalid shared memory key 1681692777

In any case, if shared memory does not work over the network, this approach is not feasible except in very limited use cases.


#10

I realized the probable cause of the error while watching your video about Azure deployment.

I used the same approach: MapD is running in one container, while the Jupyter kernel runs on the VM or in another container. That is most probably why shared memory and Arrow do not work: Docker isolates the MapD process’s memory blocks from other containers.
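
One way to check this guess on Linux is to compare IPC namespaces. The sketch below assumes the key in the error is a System V shared memory key, which is only valid inside the IPC namespace where it was created; ipc_namespace_id is a hypothetical helper that reads the standard /proc/<pid>/ns/ipc symlink.

```python
import os


def ipc_namespace_id(pid="self"):
    """Return the Linux IPC namespace label for a process, or None.

    Reads the /proc/<pid>/ns/ipc symlink, whose target looks like
    'ipc:[<number>]'.
    """
    try:
        return os.readlink(f"/proc/{pid}/ns/ipc")
    except OSError:
        # Not Linux, no /proc, or no permission to inspect that process.
        return None


print("IPC namespace of this process:", ipc_namespace_id())
```

Running it once in the Jupyter kernel and once inside the MapD container should print different labels if the two are isolated from each other. Docker's --ipc option to docker run (for example --ipc=host) is the usual way to share an IPC namespace, though whether that resolves this particular setup has not been tested here.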

BTW, Docker can be installed with just:
apt install docker.io


#11

True, but the docker.io apt package is maintained by the community, while the method in my video is maintained by Docker. Going with the latter seems more robust to me.