Load data from S3 + new files


Two questions. First, is it possible to load files from an S3 path into a table? If I have a number of files that follow a specific path, can I load all the files that match that pattern?

For example: s3://bucket-name/folder/* --> loads all files under folder into table.

Second question: is it possible to adapt the stream insert to automatically load files deposited into an S3 bucket, without using a Lambda function?



For loading directly from an S3 bucket we would suggest using the StreamInsert utility included in our SampleCode. Assuming a CSV file on S3, the syntax would look like this:

aws s3 cp s3://mapd.com/streamTest/aws_xmas2015.csv - | /raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 mapd -u mapd -p HyperInteractive --batch 100000

The options for StreamInsert are:

StreamInsert --help
Usage: <table name> <database name> {-u|--user} <user> {-p|--passwd} <password> [{--host} <hostname>][--port <port number>][--delim <delimiter>][--null <null string>][--line <line delimiter>][--batch <batch size>][{-t|--transform} transformation ...][--retry_count <num_of_retries>] [--retry_wait <wait in secs>][--print_error][--print_transform]

So in my example I am loading into a table called ‘taxi_xmas_2015’ with the insert batch size set to 100,000 records. As the file was comma-separated (the default), I needed no additional options.

The table needs to exist before StreamInsert is started.

To load multiple files from the same directory, I suspect you could use the recursive copy option on aws s3 cp, but I have not tried that myself.
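For what it's worth, `aws s3 cp --recursive` copies to the filesystem rather than to stdout, so one way to stream every object under a prefix is to list the keys and pipe each one through StreamInsert in turn. A rough sketch, untested against a live bucket; the bucket, table, and StreamInsert path are placeholders taken from the example above:

```shell
# Sketch: stream every object under an S3 prefix into StreamInsert.
# Assumes `aws s3 ls` prints "date time size key" (4 fields) per object;
# "PRE prefix/" lines for sub-prefixes are skipped by the NF check.
stream_s3_prefix() {
  prefix="$1"   # e.g. s3://bucket-name/folder/
  table="$2"    # target table (must already exist)
  aws s3 ls "$prefix" | awk 'NF == 4 {print $4}' | while read -r key; do
    aws s3 cp "${prefix}${key}" - |
      /raidStorage/prod/mapd/SampleCode/StreamInsert "$table" mapd \
        -u mapd -p HyperInteractive --batch 100000
  done
}
```

Usage would be, for example, `stream_s3_prefix s3://bucket-name/folder/ taxi_xmas_2015`.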

I assume you mean having something automatically detect a new file in an S3 bucket and then load it? I am not sure about all the options there; I would have to look into what S3 offers for triggering behavior when a file arrives. Let us know what you find.
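One Lambda-free option (my assumption, not a MapD feature) is a simple poller that remembers which keys it has already loaded; S3 event notifications delivered to an SQS queue would be more robust. A rough sketch with placeholder bucket and table names:

```shell
# Sketch: poll an S3 prefix and stream any not-yet-seen key into StreamInsert.
# Crude (no locking, no ordering guarantees); all names are placeholders.
poll_and_load() {
  prefix="$1"; table="$2"; seen="${3:-/tmp/seen_keys}"
  touch "$seen"
  while true; do
    aws s3 ls "$prefix" | awk 'NF == 4 {print $4}' | while read -r key; do
      grep -qxF "$key" "$seen" && continue   # skip keys already loaded
      aws s3 cp "${prefix}${key}" - |
        /raidStorage/prod/mapd/SampleCode/StreamInsert "$table" mapd \
          -u mapd -p HyperInteractive --batch 100000
      echo "$key" >> "$seen"                 # remember this key
    done
    sleep 60   # poll interval in seconds
  done
}
```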


Just to double-check: do the files need to go to a specific directory, or can they go to any path on the system?



With the above example the S3 file will not land on the local filesystem at all; it is piped directly into MapD via StreamInsert.


Do you support gzipped CSV? We keep all our data in that format; otherwise we will have to download and decompress before loading.


Just adding a zcat to the pipe stream should work:

aws s3 cp s3://mapd.com/streamTest/aws_xmas2015.csv.gz - | zcat | /raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 mapd -u mapd -p HyperInteractive --batch 100000


How do we add the Access Key ID and Secret Access Key to the command?


Hi @mrchhetry -

You would have your keys set up using aws configure for the one-liner @dwayneberry provided.
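As an alternative to running `aws configure` interactively, the standard AWS CLI also reads credentials from environment variables, which can be handy in scripts (placeholder values below; this is standard AWS CLI behavior, nothing MapD-specific):

```shell
# Standard AWS CLI credential environment variables (placeholder values).
# The aws CLI picks these up automatically; no MapD-side configuration needed.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"
export AWS_SECRET_ACCESS_KEY="example-secret-key"
```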



MapD now supports direct S3 import.

See https://www.mapd.com/docs/latest/4_import_data.html#importing-data-from-amazon-s3