Datatype POINT and issue on the COPY to table


#1

Hello,

I am trying to populate MapD table with NYC 1 Billion Taxi Data with COPYing from csv files (comma delimited).

This data has been previously imported into DB2 and now we want to migrate it into MapD after some transformations for the visualisations and performance testing in MapD.
I may add that I splitted the data into pieces (different files) and below I am reffering to the file containing proxy 100m rows.

Unfortunately, it does not load the rows into MapD returning a message
“Loader truncated due to reject count. Processed : 8208756 recs, Rejected: 100103 recs in 1158.681000 secs”
The table in fact has 0 rows loaded.

I checked the log files and the problem is occuring with the POINT column I defined.
I see the log file is populated with two categories of error messages:
"E0820 13:52:02.317701 74108 geo_types.cpp:436] Geospatial Import Error: GeoGeoFactory Error: unsupported geometry type
E0820 13:52:02.334509 74165 Importer.cpp:1698] Input exception thrown: Cannot read geometry to insert into column PICKUP_POINT. Row discarded, issue at column

However, before loading the 100M dataset I did the process testing on a small samples, (up to 100 rows) and it completed succesfully with rows loaded into table.
Now repeating the same for the full 100M table results in the output provided in this topic (0 rows loaded).

Would you know the reason for this to be happening?

I am enlosing also csv sample of a few records (I now see 1 row rejected after COPY and 14 loaded properly)
EDIT: As a new user I see I cannot add attachments so please find it below, sorry about the form:
1.89,1,“2015-07-21”,2015,7,21,13,3, 294.000000000000000,1,2,0,1,-73.988868713378910, 40.777416229248050,-73.979072570800780, 40.784881591796880,3, 0.92, 5.50, 0.00, 0.50, 0.00, 0.30,1, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.988868713378910 40.777416229248050)”,“POINT(-73.979072570800780 40.784881591796880)”
0.00,0,“2015-07-21”,2015,7,21,13,3, 1961.000000000000000,1,1,0,1,-73.978378295898430, 40.766502380371100,-73.996704101562500, 40.725467681884770,1, 3.30, 21.00, 0.00, 0.50, 0.00, 0.30,2, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.978378295898430 40.766502380371100)”,“POINT(-73.996704101562500 40.725467681884770)”
2.65,1,“2015-07-21”,2015,7,21,13,3, 1046.000000000000000,1,1,0,1,-73.988815307617180, 40.742641448974610,-73.990455627441410, 40.719516754150400,1, 2.00, 12.50, 0.00, 0.50, 0.00, 0.30,1, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.988815307617180 40.742641448974610)”,“POINT(-73.990455627441410 40.719516754150400)”
0.00,0,“2015-07-21”,2015,7,21,13,3, 908.000000000000000,1,1,0,1,-74.006896972656260, 40.716228485107430,-73.994018554687500, 40.751300811767580,1, 2.70, 12.50, 0.00, 0.50, 0.00, 0.30,2, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-74.006896972656260 40.716228485107430)”,“POINT(-73.994018554687500 40.751300811767580)”
0.50,1,“2015-07-21”,2015,7,21,13,3, 796.000000000000000,1,1,0,1,-73.973236083984370, 40.747772216796880,-73.973449707031250, 40.753063201904300,1, 0.40, 9.00, 0.00, 0.50, 0.00, 0.30,1, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.973236083984370 40.747772216796880)”,“POINT(-73.973449707031250 40.753063201904300)”
1.50,1,“2015-07-21”,2015,7,21,13,3, 7.000000000000000,1,1,1,1,-73.861778259277340, 40.768520355224610,-73.861770629882820, 40.768520355224610,1, 9.90, 2.50, 0.00, 0.50, 0.00, 0.30,1, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.861778259277340 40.768520355224610)”,“POINT(-73.861770629882820 40.768520355224610)”
2.56,1,“2015-07-21”,2015,7,21,13,3, 1127.000000000000000,1,2,0,1,-73.978584289550780, 40.736675262451170,-73.983016967773440, 40.747680664062500,1, 1.13, 12.00, 0.00, 0.50, 0.00, 0.30,1, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.978584289550780 40.736675262451170)”,“POINT(-73.983016967773440 40.747680664062500)”
1.46,1,“2015-07-21”,2015,7,21,13,3, 375.000000000000000,1,2,0,1,-73.978958129882820, 40.751361846923840,-73.991058349609380, 40.739608764648450,1, 1.22, 6.50, 0.00, 0.50, 0.00, 0.30,1, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.978958129882820 40.751361846923840)”,“POINT(-73.991058349609380 40.739608764648450)”
0.00,0,“2015-07-21”,2015,7,21,13,3, 1152.000000000000000,1,2,0,1,-73.948028564453120, 40.774600982666020,-73.980636596679680, 40.769989013671880,6, 2.58, 14.00, 0.00, 0.50, 0.00, 0.30,2, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.948028564453120 40.774600982666020)”,“POINT(-73.980636596679680 40.769989013671880)”
2.36,1,“2015-07-21”,2015,7,21,13,3, 847.000000000000000,1,2,0,1,-73.952377319335940, 40.777011871337890,-73.978340148925780, 40.772190093994150,1, 2.03, 11.00, 0.00, 0.50, 0.00, 0.30,1, 0.00, 0.00, 0.00, 89.00, 74.00, 2.46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.952377319335940 40.777011871337890)”,“POINT(-73.978340148925780 40.772190093994150)”
0.00,0,“2009-12-31”,2009,12,31,9,5, 660.000000000000000,1,4,-73.979487000000000, 40.784778000000010,-73.972328000000000, 40.763970000000010,2, 1.85, 8.09, 0.00, 0.50, 0.00,2, 0.23, 0.00, 1.50, 35.00, 29.00, 3.58,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,“POINT(-73.979487000000000 40.784778000000010)”,“POINT(-73.972328000000000 40.763970000000010)”
0.00,0,“2009-12-11”,2009,12,11,1,6, 480.000000000000100,1,4,-74.001198000000000, 40.727628000000010,-74.013680000000000, 40.710360000000010,1, 1.88, 7.29, 0.50, 0.50, 0.00,2, 0.00, 0.00, 0.00, 29.00, 21.00, 12.75,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-74.001198000000000 40.727628000000010)”,“POINT(-74.013680000000000 40.710360000000010)”
1.75,1,“2009-12-10”,2009,12,10,21,5, 720.000000000000000,1,4,-73.974522000000000, 40.760445000000010,-73.987215000000000, 40.738387000000010,2, 1.86, 8.50, 0.50, 0.50, 0.00,1, 0.00, 0.00, 0.00, 46.00, 27.00, 12.30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.974522000000000 40.760445000000010)”,“POINT(-73.987215000000000 40.738387000000010)”
0.00,0,“2009-12-10”,2009,12,10,19,5, 840.000000000000000,1,4,-74.014165000000000, 40.717630000000010,-73.998047000000000, 40.737842000000010,1, 2.52, 9.69, 1.00, 0.50, 0.00,2, 0.00, 0.00, 0.00, 46.00, 27.00, 12.30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-74.014165000000000 40.717630000000010)”,“POINT(-73.998047000000000 40.737842000000010)”
0.00,0,“2009-12-10”,2009,12,10,19,5, 840.000000000000000,1,4,-73.996438000000000, 40.743167000000010,-73.980323000000000, 40.765568000000010,5, 2.85, 10.10, 1.00, 0.50, 0.00,2, 0.00, 0.00, 0.00, 46.00, 27.00, 12.30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,“POINT(-73.996438000000000 40.743167000000010)”,“POINT(-73.980323000000000 40.765568000000010)”

The table DDL is:
CREATE TABLE mapd_test(
TIP_AMOUNT NUMERIC(5,2),
DOES_TIP INTEGER,
DATE_OF_TRIP DATE,
YEAR_OF_TRIP TEXT ENCODING DICT(8),
MONTH_OF_TRIP TEXT ENCODING DICT(8),
DAY_OF_TRIP TEXT ENCODING DICT(8),
HOUR_OF_TRIP TEXT ENCODING DICT(8),
DAYOFWEEK TEXT ENCODING DICT(8),
TRAVELTIME DOUBLE,
CAB_TYPE TEXT ENCODING DICT,
VENDOR_ID TEXT ENCODING DICT,
STORE_AND_FWD_FLAG INTEGER,
RATE_CODE_ID INTEGER,
PICKUP_LONGITUDE DOUBLE,
PICKUP_LATITUDE DOUBLE,
DROPOFF_LONGITUDE DOUBLE,
DROPOFF_LATITUDE DOUBLE,
PASSENGER_COUNT INTEGER,
TRIP_DISTANCE NUMERIC(5,2),
FARE_AMOUNT NUMERIC(5,2),
EXTRA NUMERIC(5,2),
MTA_TAX NUMERIC(5,2),
TOLLS_AMOUNT NUMERIC(5,2),
IMPROVEMENT_SURCHARGE NUMERIC(5,2),
PAYMENT_TYPE TEXT ENCODING DICT,
TRIP_TYPE TEXT ENCODING DICT,
PRECIPITATION NUMERIC(5,2),
SNOW_DEPTH NUMERIC(5,2),
SNOWFALL NUMERIC(5,2),
MAX_TEMPERATURE NUMERIC(5,2),
MIN_TEMPERATURE NUMERIC(5,2),
AVERAGE_WIND_SPEED NUMERIC(5,2),
H_NEW_YEARS TEXT ENCODING DICT,
H_MARTIN_LUTHER TEXT ENCODING DICT(8),
H_VALENTINES TEXT ENCODING DICT(8),
H_PRESIDENTS TEXT ENCODING DICT(8),
H_ST_PATRICKS TEXT ENCODING DICT(8),
H_APRILS_FOOL TEXT ENCODING DICT(8),
H_EASTER_SUNDAY TEXT ENCODING DICT(8),
H_GOOD_FRIDAY TEXT ENCODING DICT(8),
H_MEMORIAL TEXT ENCODING DICT(8),
H_INDEPENDENCE TEXT ENCODING DICT(8),
H_LABOR TEXT ENCODING DICT(8),
H_COLUMBUS TEXT ENCODING DICT(8),
H_HALLOWEEN TEXT ENCODING DICT(8),
H_VETERANS TEXT ENCODING DICT(8),
H_THANKSGIVING TEXT ENCODING DICT(8),
H_CHRISTMAS TEXT ENCODING DICT(8),
H_NEW_YEARS_EVE TEXT ENCODING DICT(8),
H_BLACK_FRIDAY TEXT ENCODING DICT(8),
H_SUPERBOWL TEXT ENCODING DICT(8),
PICKUP_POINT POINT,
DROPOFF_POINT POINT);

Best regards,
Mateusz.


#2

the rows of year 2009 lacks some fields, so the import fail.

try to copy and paste your data on a spreadsheet to identify the misaligned columns


#3

Hi @aznable,
Thanks for your answer.
Misalignment is not the case. Double-checked with Excel as well - columns are aligned, 2009 just have some fields empty - NULLS.
However, I now read logs for this provided small sample and MapD rejects one row because of the, indeed, fields mismatch: E0821 11:01:34.310122 77112 Importer.cpp:1500] Incorrect Row (expected 53 columns, has 52).
That’s weird because as I said Excel shows exactly same no of fields, as well as I imported same sample into SPSS and it’s okey, also “,” sign count is same for all rows.

Though, this what we are speaking about now may be not crucial here as the problem is that my 100M load doesn’t input ANY records. This sample I gave here I made up manually and, although all seems okey for me, something could happen for this one row. Still, it loads all other records in a correct manner.
For the 100M I do not interfere with the files at all and depend on the csv exported directly from DB2.


#4

i tried to import you sample on a spreadsheet, and this is the result

the sample looks misaligned to me.

anyway maybe the problem is that you are hitting the default of rejects for a single copy command; you can change the default value of 100k with the max_reject property of with clause

so:

copy table from ‘*.csv’ with ( max_reject=10000000 )


#5

@aznable
Hi, changing the parameter of rejected rows helped with getting the data loaded into MapD! Thank you :slight_smile:
the query now is eg,:
COPY TRIPS_ANALYTICS_REPORTING FROM ‘/home/cds1inst/trips_analytics_2009.csv’ WITH (header=‘false’, max_reject = 200000000);

PS. For the sample fields misalignment this is the issue of copying the data here into the browser. As I mentioned I couldn’t upload any file as a new user, sorry about that.


#6

Hi - yes these type of errors are time consuming in any load situation.

Reducing the problem to a single row I did actually counting only 52 so I added an extra field and then started getting an unsupported geometry type error.

Reducing the problem further to simply two point fields, I had my data (that worked) and a reduced version of your file (cut and pasted from your report) which didn’t work; as in
worked - “POINT(-73.988868713378910 40.777416229248050)”,“POINT(-73.979072570800780 40.784881591796880)”
Error - “POINT(73.988868713378910 40.777416229248050)”,“POINT(-73.979072570800780 40.784881591796880)”

Cannot read geometry to insert into column p. Row discarded, issue at column : 2 data :[“POINT(73.988868713378910 40.777416229248050)”, “POINT(-73.979072570800780 40.784881591796880)”]

Which was when I noticed that the quotes are different. I did a hex dump of the file I creates from you data and it contained a wide character for the quote. This may be have been introduced into the data when I cut and pasted from the browser - but might be worth a look. Hope that helps.

Jack

  • the copy command I used was copy xyz from 'xxx.csv' with (quoted='true', quote='"', header='false') ;
  • would it be possible for you to post your copy command?
  • and I counted the fields using Vi rather than excel

#7

@JackRobert
Thanks for the answer.
Yes, I confir I also get an unusual quote in character e.g. in Notepad++ when copying the data from the browser.
The files in my env are of standard quote char.

For your last three points:

  • the copy command I used was copy xyz from 'xxx.csv' with (quoted='true', quote='"', header='false') ;
    I tested your command and it doesn’t change my load behaviour. I see in docs first two params are by default ‘true’ and ‘"’.
  • would it be possible for you to post your copy command?
    It’s in my last post -
    COPY TRIPS_ANALYTICS_REPORTING FROM ‘/home/cds1inst/trips_analytics_2009.csv’ WITH (header=‘false’, max_reject = 200000000);
  • and I counted the fields using Vi rather than excel
    I get it, I’m not so fluent in Linux so I actually used SPSS :slight_smile:

#8

Good to know! Will know to ignore crazy quotes in future.

I constructed a file with the first two lines of your sample data (replacing the quote) and it failed to load with Incorrect Row (expected 53 columns, has 52); as expected. i add an extra field to the each row (an extra zero just before the point fields) and it loaded. I also recounted the fields and there is a mismatch. In the attached file I’ve split every field from your first row of sample data onto a line. Counting lines there are 52 not 53.

Jack

nyc_bad.csv (419 Bytes)


#9

anyway if you can’t share the rejected columns in the forum you could place them on external systems


#10

@aznable
Yes, you’re right. Just I didn’t expect to be having this problem on the forum’s browser.
Now I got granted with upload attachments permission so it won’t be the case in the future.
Thank you,


#11

@JackRobert
Thanks I understand. Then it is an issue of copying data into the browser and I can assure that original file is okey with 53 fields. (I had a problem of only 14 rows out of 15 rows loading from the sample but it happened to be just lack of the end of line marker for the last row; in other words lack of “enter” button hit :slight_smile:

The main problem has been solved with @aznable tip to include the max_reject parameter so we can skip the sample data issue here for now.

Apparently, as per logs, the rows are discarded in load if they contain a NULL value in a POINT datatype field:


, where the last two fields are specified in DDL as points and are missing for those rows (marked in red).

Did I miss any parameter related to accepting rows with null values in points fields?
(I am depending here on the following doc: https://www.mapd.com/docs/latest/6_loading_data.html )


#12

Great - glad to hear you are on the way.


#13

hi @MateuszWysocki,

i cant see a way to insert a NULL on a geometry field; if you are going to use a point map on export phase you would skip all rows not containing coords, or substituting with arbitrary coords like point (0 0); this way you will be amazed by the speed of csv importer of mapd :wink:


#14

@aznable
Yes, definitely that makes sense. Just wanted to make sure what are the properties of this datatype in such occurences :slight_smile:
So thank you guys for all your responses, I got all my answers here :slight_smile: