Import of S3 data only includes a fraction of the data


#1

I run a COPY into MapD from an S3 bucket. However, the copy finishes having imported only about a third of the files in the bucket. The command-line result reports 0 rejected rows; it is as if a large number of files in the bucket were simply ignored.

I have double-checked the S3 permissions and everything is fine.

The S3 bucket has around 3000 files (compressed CSV). The dataset is around 1.2 billion rows, but this is just my initial test before I move on to a much larger dataset.

Is there a number-of-files limit in MapD when doing a COPY from S3? It may be a coincidence, but I would guess it is failing at around the 1000-file mark.


#2

After a bit of investigation, it looks like MapD does not page through the S3 bucket listing correctly. The C++ AWS library returns up to 1000 items per request, along with a flag indicating whether more results are available; the caller is then expected to request additional pages until all of the items have been retrieved.

This bug in MapD limits the number of files that can be imported from an S3 bucket to 1000, without indicating to the user that additional files have been ignored.
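To illustrate the paging pattern described above, here is a self-contained sketch. It simulates the S3 listing API with a fake `list_objects` function (the struct and function names are illustrative stand-ins, not the real AWS SDK types such as `Aws::S3::Model::ListObjectsV2Result`), and shows the continuation loop that MapD appears to be missing:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for one page of an S3 listing response. Names are
// illustrative; the real AWS C++ SDK exposes equivalent fields.
struct ListPage {
    std::vector<std::string> keys;  // up to 1000 keys per page
    bool is_truncated;              // true if more results remain
    std::string next_token;         // continuation token for the next call
};

// Fake bucket with `total` objects, listed 1000 at a time.
ListPage list_objects(int total, const std::string& token) {
    int start = token.empty() ? 0 : std::stoi(token);
    int end = std::min(start + 1000, total);
    ListPage page;
    for (int i = start; i < end; ++i)
        page.keys.push_back("file_" + std::to_string(i) + ".csv.gz");
    page.is_truncated = end < total;
    page.next_token = std::to_string(end);
    return page;
}

// The correct pattern: keep requesting pages while the truncation flag
// is set, feeding the continuation token back in. Stopping after the
// first page is exactly the 1000-file cutoff described above.
std::vector<std::string> list_all_keys(int total) {
    std::vector<std::string> keys;
    std::string token;
    ListPage page;
    do {
        page = list_objects(total, token);
        keys.insert(keys.end(), page.keys.begin(), page.keys.end());
        token = page.next_token;
    } while (page.is_truncated);
    return keys;
}
```

With a 3000-object bucket, this loop makes three listing calls and returns all 3000 keys, whereas reading only the first page would return 1000.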

I have raised an issue on GitHub.


#3

I am not sure what the best workaround is.

Options that come to mind are:

  1. Copy the S3 bucket to the local drive, then run COPY from local storage
  2. Divide the buckets up so that each always contains fewer than 1000 items, and run COPY against each bucket in turn
  3. Combine the smaller files, so that the bucket contains fewer than 1000 files

Option 1 seems to be the least work at the moment, but if anybody has any better ideas I would be interested.
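For options 2 and 3, the splitting logic is simple enough to sketch. This hypothetical helper (not part of MapD) chunks a list of object keys into batches that each stay under the 1000-item limit, one batch per COPY:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Split a list of object keys into batches of fewer than 1000 items,
// so each batch can live under its own bucket/prefix and be imported
// with a separate COPY. Purely illustrative.
std::vector<std::vector<std::string>> batch_keys(
        const std::vector<std::string>& keys,
        std::size_t max_per_batch = 999) {
    std::vector<std::vector<std::string>> batches;
    for (std::size_t i = 0; i < keys.size(); i += max_per_batch) {
        std::size_t end = std::min(i + max_per_batch, keys.size());
        batches.emplace_back(keys.begin() + i, keys.begin() + end);
    }
    return batches;
}
```

For a 3000-file bucket this yields four batches (999 + 999 + 999 + 3), each safely below the limit.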


#4

@ArnoldJ, thanks for raising the issue on GitHub, that’s certainly something we’ll need to take care of.

How are you using MapD: MapD Cloud, or MapD Community Edition/Enterprise? If it's Cloud, then your suggestion #1 currently makes sense. However, if you are using Community or Enterprise Edition, you can use the mapdql utility to incrementally load more files into a table.


#5

I am currently running the Community Edition of MapD on an EC2 VM.

I have just copied the S3 bucket to the machine and then run a local COPY.


#6

Glad you found a temporary workaround. I have highlighted this issue to our engineering team; hopefully it's something quick for us to patch.

If you run into any other issues, please do let us know.


#7

Hi @ArnoldJ -

Our 4.1.1 release yesterday fixed this issue.

https://www.mapd.com/docs/latest/7_0_release.html#4-1-1-released-september-4-2018