Load data from S3 + new files


#1

Two questions: first, is it possible to load files from an S3 path into a table? If I have a number of files under a specific path, is it possible to load all files that match that pattern?

For example: s3://bucket-name/folder/* --> loads all files under folder into table.

Second question: is it possible to adapt StreamInsert to automatically load files deposited into an S3 bucket, without using a Lambda function?


#2

Q1.

For loading directly from an S3 bucket we would suggest using the StreamInsert utility included in our SampleCode. Assuming a CSV file on S3, the syntax would look like this:

aws s3 cp s3://mapd.com/streamTest/aws_xmas2015.csv - | /raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 mapd -u mapd -p HyperInteractive --batch 100000

The options for StreamInsert are:

StreamInsert --help
Usage: <table name> <database name> {-u|--user} <user> {-p|--passwd} <password> [{--host} <hostname>][--port <port number>][--delim <delimiter>][--null <null string>][--line <line delimiter>][--batch <batch size>][{-t|--transform} transformation ...][--retry_count <num_of_retries>] [--retry_wait <wait in secs>][--print_error][--print_transform]

So in my example I am loading into a table called ‘taxi_xmas_2015’ with the insert batch size set to 100,000 records. As the file was comma-separated (the default), I needed no additional options.
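If your file used a different delimiter, say pipe-separated, the --delim option from the usage above should cover it. A sketch with a hypothetical file name:

aws s3 cp s3://mapd.com/streamTest/aws_xmas2015_pipe.csv - | /raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 mapd -u mapd -p HyperInteractive --delim '|' --batch 100000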

The table needs to exist before StreamInsert is started.
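For example, you could create it through mapdql first (the bin path and the columns here are hypothetical; match the columns to your CSV):

/raidStorage/prod/mapd/bin/mapdql mapd -u mapd -p HyperInteractive
mapdql> CREATE TABLE taxi_xmas_2015 (pickup_datetime TIMESTAMP, passenger_count SMALLINT, trip_distance FLOAT, total_amount FLOAT);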

To load multiple files from the same directory, I suspect you could use the recursive copy option on aws s3 cp, but I have not tried that myself.
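If the recursive copy cannot stream to stdout, an untested alternative is to list the objects under the prefix and pipe them through one at a time. A rough sketch, assuming the files sit directly under the folder:

aws s3 ls s3://bucket-name/folder/ | awk 'NF==4 {print $4}' | while read f; do
  aws s3 cp "s3://bucket-name/folder/$f" - | /raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 mapd -u mapd -p HyperInteractive --batch 100000
done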

Q2.
I assume you mean having something automatically detect a new file in an S3 bucket and then load it automatically? I am not sure about all the options there; I would have to look into what S3 offers for reacting when a file arrives. Let us know what you find.
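One Lambda-free pattern worth investigating: configure the bucket to send s3:ObjectCreated event notifications to an SQS queue, then run a small poller next to MapD. A rough, untested sketch (the queue URL is a placeholder, and it assumes the bucket notification is already wired to the queue):

QUEUE_URL=https://sqs.us-east-1.amazonaws.com/123456789012/mapd-ingest   # hypothetical queue
while true; do
  # long-poll for one S3 event notification
  MSG=$(aws sqs receive-message --queue-url "$QUEUE_URL" --wait-time-seconds 20 --max-number-of-messages 1)
  [ -z "$MSG" ] && continue
  BUCKET=$(echo "$MSG" | jq -r '.Messages[0].Body | fromjson | .Records[0].s3.bucket.name')
  KEY=$(echo "$MSG" | jq -r '.Messages[0].Body | fromjson | .Records[0].s3.object.key')
  RECEIPT=$(echo "$MSG" | jq -r '.Messages[0].ReceiptHandle')
  # stream the new object straight into MapD, then acknowledge the message
  aws s3 cp "s3://$BUCKET/$KEY" - | /raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 mapd -u mapd -p HyperInteractive --batch 100000
  aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$RECEIPT"
done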


#3

Just to double-check, do the files need to go to a specific directory, or can they go to any path on the system?


#4

Hi,

With the above example, the S3 file will not land on the local filesystem at all; it is piped directly into MapD via StreamInsert.


#5

Do you support gzipped (.gz) CSV files? We keep all our data in that format; otherwise we will have to download and decompress before loading.


#6

Just adding a zcat to the pipe should work:

aws s3 cp s3://mapd.com/streamTest/aws_xmas2015.csv.gz - | zcat | /raidStorage/prod/mapd/SampleCode/StreamInsert taxi_xmas_2015 mapd -u mapd -p HyperInteractive --batch 100000

#7

How do we add the Access Key ID and Secret Access Key to the command?


#8

Hi @mrchhetry -

You would set up your keys with aws configure before running the one-liner @dwayneberry provided.
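For reference, aws configure prompts for both values interactively and stores them in ~/.aws/credentials; the standard environment variables also work for a one-off run (key values below are placeholders):

aws configure
# AWS Access Key ID [None]: AKIA...
# AWS Secret Access Key [None]: ...
# Default region name [None]: us-east-1
# Default output format [None]: json

export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...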


#9

Hi,

MapD now supports direct S3 import.

See https://www.mapd.com/docs/latest/4_import_data.html#importing-data-from-amazon-s3
and
https://www.mapd.com/docs/latest/6_loading_data.html#importing-aws-s3-files
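With the native S3 import, a COPY FROM along these lines should work (the WITH options are as I recall them from the docs linked above; the bucket path and key values are placeholders):

mapdql> COPY taxi_xmas_2015 FROM 's3://bucket-name/folder/aws_xmas2015.csv' WITH (s3_access_key = 'AKIA...', s3_secret_key = '...', s3_region = 'us-east-1');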

regards