Importing 12 GB of localization data


#1

Hello,

I want to give MapD a shot to test its GPU rendering capabilities on localization data displayed on maps.

I am having trouble importing data from a Postgres database.

First, I tried using the SQL importer, but it fails every time.
On launching the command there is heavy disk activity for a while. Then it stalls for several minutes, and finally I get a Java exception saying that there is not enough heap space (I also tried with the -Xmx2048m JVM option).

Second, I decided to export the data to a CSV file. The CSV file is correctly generated by Postgres. However, when I import it into MapD (after having created the table with the correct columns) using:

COPY database FROM 'file'

the geospatial data are truncated: 51.274007 becomes 51.000000 (I used the REAL type for the “lon” and “lat” columns).

I don’t know where to go from here.

Am I missing something?

Thanks


#2

Hi,

Thanks for using the MapD Database.

Sorry about the issues you are experiencing.

Would it be possible for you to share your schema definitions and a small sample of your data?

Could you also share the logs and an example of how you are invoking SQLImporter? It is always a good test to run the SQL query you pass to SQLImporter outside the import process, to make sure the source DB can execute the query.

Regards


#3

Here is the schema definition for the gse_positions table in the MapD database:

CREATE TABLE IF NOT EXISTS gse_positions ( 
    id BIGINT NOT NULL,
    lat REAL,
    lon REAL,
    ts TIMESTAMP,
    driver_id BIGINT,
    gse_id BIGINT,
    ca TIMESTAMP,
    ua TIMESTAMP
);

Note that I also tried with DOUBLE instead of REAL for the coordinates.

Here is the schema definition of the table in the PostgreSQL database:

   Column    |           Type           |                       Modifiers                        | Storage | Stats target | Description 
-------------+--------------------------+--------------------------------------------------------+---------+--------------+-------------
 id          | integer                  | not null default nextval('position_id_seq'::regclass)  | plain   |              | 
 latitude    | double precision         | not null                                               | plain   |              | 
 longitude   | double precision         | not null                                               | plain   |              | 
 happened_at | timestamp with time zone | not null                                               | plain   |              | 
 driver_id   | integer                  |                                                        | plain   |              | 
 gse_id      | integer                  | not null                                               | plain   |              | 
 created_at  | timestamp with time zone |                                                        | plain   |              | 
 updated_at  | timestamp with time zone |                                                        | plain   |              | 

Finally, here is some example data:

54534304,45.71844,5.077481,2016-09-03 03:31:12.786+00,867,686,2016-09-03 03:31:12.796+00,2016-09-03 03:31:12.796+00
54534318,51.279663,6.759588,2016-09-03 03:31:24.426+00,431,805,2016-09-03 03:31:24.434+00,2016-09-03 03:31:24.434+00
54534348,51.279663,6.759588,2016-09-03 03:31:54.402+00,431,805,2016-09-03 03:31:54.412+00,2016-09-03 03:31:54.412+00
54534354,51.27555,6.757296,2016-09-03 03:32:01.556+00,431,997,2016-09-03 03:32:01.566+00,2016-09-03 03:32:01.566+00
54534362,51.277565,6.762561,2016-09-03 03:32:11.906+00,431,1108,2016-09-03 03:32:11.923+00,2016-09-03 03:32:11.923+00
54534364,51.274006,6.757626,2016-09-03 03:32:12.96+00,431,798,2016-09-03 03:32:12.969+00,2016-09-03 03:32:12.969+00
54534368,43.664645,7.209125,2016-09-03 03:32:16.461+00,398,297,2016-09-03 03:32:16.466+00,2016-09-03 03:32:16.466+00
54534370,45.717558,5.072843,2016-09-03 03:32:16.806+00,,278,2016-09-03 03:32:16.847+00,2016-09-03 03:32:16.847+00
54534373,43.443071,5.216035,2016-09-03 03:32:22.39+00,1227,391,2016-09-03 03:32:22.401+00,2016-09-03 03:32:22.401+00
54534377,47.156123,-1.600302,2016-09-03 03:32:24.81+00,834,84,2016-09-03 03:32:24.821+00,2016-09-03 03:32:24.821+00

Here is what I have in MapD after importing this file:

id|lat|lon|ts|driver_id|gse_id|ca|ua
54534318|51.000000|6.000000|2016-09-03 03:31:24|431|805|2016-09-03 03:31:24|2016-09-03 03:31:24
54534348|51.000000|6.000000|2016-09-03 03:31:54|431|805|2016-09-03 03:31:54|2016-09-03 03:31:54
54534354|51.000000|6.000000|2016-09-03 03:32:01|431|997|2016-09-03 03:32:01|2016-09-03 03:32:01
54534362|51.000000|6.000000|2016-09-03 03:32:11|431|1108|2016-09-03 03:32:11|2016-09-03 03:32:11
54534364|51.000000|6.000000|2016-09-03 03:32:12|431|798|2016-09-03 03:32:12|2016-09-03 03:32:12
54534368|43.000000|7.000000|2016-09-03 03:32:16|398|297|2016-09-03 03:32:16|2016-09-03 03:32:16
54534370|45.000000|5.000000|2016-09-03 03:32:16|NULL|278|2016-09-03 03:32:16|2016-09-03 03:32:16
54534373|43.000000|5.000000|2016-09-03 03:32:22|1227|391|2016-09-03 03:32:22|2016-09-03 03:32:22
54534377|47.000000|-1.000000|2016-09-03 03:32:24|834|84|2016-09-03 03:32:24|2016-09-03 03:32:24

Thanks,


#4

Hi,

Loading that exact data on my system seems to work fine:

id|lat|lon|ts|driver_id|gse_id|ca|ua
54534304|45.718441|5.077481|2016-09-03 03:31:12|867|686|2016-09-03 03:31:12|2016-09-03 03:31:12
54534318|51.279663|6.759588|2016-09-03 03:31:24|431|805|2016-09-03 03:31:24|2016-09-03 03:31:24
54534348|51.279663|6.759588|2016-09-03 03:31:54|431|805|2016-09-03 03:31:54|2016-09-03 03:31:54
54534354|51.275551|6.757296|2016-09-03 03:32:01|431|997|2016-09-03 03:32:01|2016-09-03 03:32:01
54534373|43.443069|5.216035|2016-09-03 03:32:22|1227|391|2016-09-03 03:32:22|2016-09-03 03:32:22
54534377|47.156124|-1.600302|2016-09-03 03:32:24|834|84|2016-09-03 03:32:24|2016-09-03 03:32:24
54534362|51.277565|6.762561|2016-09-03 03:32:11|431|1108|2016-09-03 03:32:11|2016-09-03 03:32:11
54534364|51.274006|6.757626|2016-09-03 03:32:12|431|798|2016-09-03 03:32:12|2016-09-03 03:32:12
54534368|43.664646|7.209125|2016-09-03 03:32:16|398|297|2016-09-03 03:32:16|2016-09-03 03:32:16
54534370|45.717560|5.072843|2016-09-03 03:32:16|NULL|278|2016-09-03 03:32:16|2016-09-03 03:32:16

I am wondering if this may be a localization issue around the decimal point (.) versus the comma (,) in the numbers. Do you have a locale setting in your environment which may be causing number parsing to expect , rather than . in the input?
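As a side note, that kind of truncation can be reproduced in plain Java with locale-aware number parsing. This is a hypothetical standalone demo of the effect, not the parsing code MapD itself uses: a locale whose decimal separator is a comma stops parsing at the first '.', leaving only the integer part.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class LocaleParseDemo {
    // Parse a string using the given locale's number rules. Parsing stops
    // at the first character the locale does not treat as part of a number.
    public static double parseAs(Locale loc, String s) throws ParseException {
        return NumberFormat.getInstance(loc).parse(s).doubleValue();
    }

    public static void main(String[] args) throws ParseException {
        // A period-decimal locale keeps the full value.
        System.out.println(parseAs(Locale.US, "51.274007"));
        // A comma-decimal locale stops at the '.', leaving only the
        // integer part -- the same truncation seen in the output above.
        System.out.println(parseAs(Locale.FRANCE, "51.274007"));
    }
}
```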

Try loading this data

54534304|45,71844|5,077481|2016-09-03 03:31:12,786+00|867|686|2016-09-03 03:31:12,796+00|2016-09-03 03:31:12,796+00
54534318|51,279663|6,759588|2016-09-03 03:31:24,426+00|431|805|2016-09-03 03:31:24,434+00|2016-09-03 03:31:24,434+00
54534348|51,279663|6,759588|2016-09-03 03:31:54,402+00|431|805|2016-09-03 03:31:54,412+00|2016-09-03 03:31:54,412+00
54534354|51,27555|6,757296|2016-09-03 03:32:01,556+00|431|997|2016-09-03 03:32:01,566+00|2016-09-03 03:32:01,566+00
54534362|51,277565|6,762561|2016-09-03 03:32:11,906+00|431|1108|2016-09-03 03:32:11,923+00|2016-09-03 03:32:11,923+00
54534364|51,274006|6,757626|2016-09-03 03:32:12,96+00|431|798|2016-09-03 03:32:12,969+00|2016-09-03 03:32:12,969+00
54534368|43,664645|7,209125|2016-09-03 03:32:16,461+00|398|297|2016-09-03 03:32:16,466+00|2016-09-03 03:32:16,466+00
54534370|45,717558|5,072843|2016-09-03 03:32:16,806+00||278|2016-09-03 03:32:16,847+00|2016-09-03 03:32:16,847+00
54534373|43,443071|5,216035|2016-09-03 03:32:22,39+00|1227|391|2016-09-03 03:32:22,401+00|2016-09-03 03:32:22,401+00
54534377|47,156123|-1,600302|2016-09-03 03:32:24,81+00|834|84|2016-09-03 03:32:24,821+00|2016-09-03 03:32:24,821+00

with this command:

copy gse_positions from '<yourlocation>/gse.dsv' with (header='false', delimiter='|');

Regards


#5

Hi,

I am using the fr_FR.UTF-8 locale, and importing your snippet worked. This is definitely the problem (, replaces . as the decimal separator in French-style numbers).

Running the mapd server with the LC_NUMERIC environment variable set to C seems to be a working hack. Would that be the advised way to handle this case?

Regards,


#6

Hi again,

I did not respond to you concerning SQLImporter; here is the logged error:


>  java -Xms1024m -Xmx2048m -cp /opt/mapd-ce-3.0.0-20170507-7626e30-Linux-x86_64-render/bin/mapd-1.0-SNAPSHOT-jar-with-dependencies.jar:/usr/share/java/postgresql-jdbc/postgresql-9.4.1212.jre6.jar com.mapd.utility.SQLImporter -t gse_positions -su <user> -sp <password> -c"jdbc:postgresql://<myip>/<mydb>" -ss "select * from position" -tr -b 100m
# I also tried without the -b and -tr options

2017-05-29 09:14:48 INFO  SQLImporter:executeQuery:178 - Connecting to database url :jdbc:postgresql://12.13.14.23/localeez
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2121)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:288)
	at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:430)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:356)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:303)
	at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:289)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:266)
	at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:233)
	at com.mapd.utility.SQLImporter.executeQuery(SQLImporter.java:189)
	at com.mapd.utility.SQLImporter.doWork(SQLImporter.java:167)
	at com.mapd.utility.SQLImporter.main(SQLImporter.java:44)


#7

rvlander

When you did not set the -b parameter, did you still see the same out-of-memory error from Postgres?

It would seem Postgres is requiring more heap than you have allocated.

Have you tried upping your -Xmx setting? How much memory does the machine running SQLImporter have?

Regards


#8

Hi,

I hadn’t noticed, but the error without the -b option is different (I also get the same error using -Xmx12000m):

2017-05-30 10:27:33 INFO  SQLImporter:executeQuery:178 - Connecting to database url :jdbc:postgresql://12.13.14.23/localeez
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.lang.Class.getDeclaredFields0(Native Method)
	at java.lang.Class.privateGetDeclaredFields(Class.java:2583)
	at java.lang.Class.getDeclaredField(Class.java:2068)
	at java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl$1.run(AtomicReferenceFieldUpdater.java:323)
	at java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl$1.run(AtomicReferenceFieldUpdater.java:321)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.util.concurrent.atomic.AtomicReferenceFieldUpdater$AtomicReferenceFieldUpdaterImpl.<init>(AtomicReferenceFieldUpdater.java:320)
	at java.util.concurrent.atomic.AtomicReferenceFieldUpdater.newUpdater(AtomicReferenceFieldUpdater.java:110)
	at java.sql.SQLException.<clinit>(SQLException.java:372)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2121)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:288)
	at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:430)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:356)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:303)
	at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:289)
	at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:266)
	at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:233)
	at com.mapd.utility.SQLImporter.executeQuery(SQLImporter.java:189)
	at com.mapd.utility.SQLImporter.doWork(SQLImporter.java:167)
	at com.mapd.utility.SQLImporter.main(SQLImporter.java:44)

Trying even with -Xmx14000m, I don’t have enough RAM:

2017-05-30 10:37:57 INFO  SQLImporter:executeQuery:178 - Connecting to database url :jdbc:postgresql://12.13.14.23/localeez
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000063c680000, 1326972928, 0) failed; error='Ne peut allouer de la mémoire' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 1326972928 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/gandre/hs_err_pid24861.log

#9

Hi,

It appears that by default the Postgres JDBC driver tries to load the entire result set into memory before processing it:
https://jdbc.postgresql.org/documentation/91/query.html#fetchsize-example
So depending on how large your source table is, it may be trying to load a lot of data into memory all at once.

There is a workaround for this that can be added to the SQLImporter code to restrict the size of each fetch it makes from the DB.
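For reference, the standard pattern with the Postgres JDBC driver is to disable autocommit and set a fetch size, which makes the driver use a server-side cursor and stream rows in batches instead of buffering the whole result set in the heap. A minimal sketch follows; the connection arguments and the 10,000-row batch size are illustrative assumptions, not what SQLImporter actually does internally.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingFetchSketch {
    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            System.out.println("usage: StreamingFetchSketch <jdbc-url> <user> <password>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            // The Postgres driver only streams results when autocommit is off;
            // otherwise setFetchSize is ignored and the entire result set is
            // materialized in the JVM heap before the first row is returned.
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.setFetchSize(10_000); // rows fetched per round trip (illustrative value)
                try (ResultSet rs = st.executeQuery("select * from position")) {
                    long rows = 0;
                    while (rs.next()) {
                        rows++; // process one cursor-buffered row at a time
                    }
                    System.out.println("fetched " + rows + " rows");
                }
            }
        }
    }
}
```

A larger fetch size trades memory for fewer round trips; the only hard requirement for streaming is that autocommit is off and the fetch size is non-zero.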

I have uploaded a new jar file for SQLImporter which hopefully should work for you with Postgres; please give it a try. The jar is here: https://s3-us-west-2.amazonaws.com/mapd-artifact/mapd-1.0-SNAPSHOT-jar-with-dependencies.jar

Regards


#10

Hi,

This works now. However, it is much slower than importing via a CSV file as an intermediate step. It might be due to the fetch size parameter being too small when calling the driver.

Also, if the locales differ between the database host OSes, the previously mentioned problem (, versus . parsing of decimal numbers) shows up; this might be annoying for some users.

I am discovering MapD more and more, and it is very exciting.

Regards,