Newline characters in files loaded by COPY FROM


#1

One of the columns in our CSV files that we are loading into MapD via COPY FROM is text that may contain new line characters. According to the COPY FROM documentation, even if a field is quoted, the new lines within that column will be treated as line separators, thus rejecting the record.

Is there any way to have COPY FROM ignore new line characters in quoted fields, or otherwise “escape” them somehow so they are not treated as line delimiters?


#2

Hi,

You must run the COPY FROM in single thread mode

Please see docs here

regards


#3

Sorry, I’m not seeing how that helps? According to the docs:

quote: a single-character string for quoting a field. The default quote character is double quote ". All characters inside quotes are imported “as is,” except for line delimiters.

So even if I chose some other random character (say tab, \t) as my line delimiter, if a tab character happens to shows up in my quoted text field, it will still be treated as a line delimiter by COPY FROM.

Or am I misunderstanding that line from the docs?


#4

Hi,

Its not about changing the delimiter, its about how the file is processed, the link I sent should have had highlighting on the thread parameter.

See comment at bottom of COPY

Note: by default, the CSV parser assumes one row per line. To import a file with multiple lines in a single field, specify threads = 1 in the WITH clause.

this will allow the new lines to be ignored

regards


#5

I’m obviously missing something here. In that case, how do you tell COPY FROM where one line ends and the next begins if you take new lines out altogether? Perhaps you could give a real example to illustrate how this works?


#6

Hi Brian,

To clarify, in single threaded mode newline characters inside the quoted field will be ignored in the sense that the CSV parser won’t treat them as row delimiters, it instead will just treat them like any other (non-quote) character.

foo.csv:

"embedded
new
line", 1
no new line, 2

And then:

mapdql> create table foo (a text, b int);
mapdql> copy foo from '/path/to/foo.csv' with (header='false',threads=1);
mapdql> select * from foo;
a|b
embedded
new
line|1
no new line|2

#7

Hi,

Input table is

1,"hello world"
2,"broken
line"
3, " multiple
 broken
 line"
4,"basic line"

Using standard parallel COPY FROM

create table brokenline (i1 int, t1 text);

mapdql> copy brokenline from '~/Datasets/epoch/text.csv' with ( header='false');
Result
Loaded: 2 recs, Rejected: 5 recs in 0.137000 secs
mapdql> select * from brokenline;
i1|t1
1|hello world
4|basic line
mapdql>

Now with single thread

create table testline (i1 int, t1 text);
mapdql> copy testline from '~/Datasets/epoch/text.csv' with (threads=1, header='false');
Result
Loaded: 4 recs, Rejected: 0 recs in 0.113000 secs
mapdql> select * from testline;
i1|t1
1|hello world
2|broken
line
3| multiple
 broken
 line
4|basic line

Hope this fills in the blanks

regards


#8

Ahhhhh, I get it now. Putting one of these examples on the COPY FROM documentation page may help clarify the single threaded point for folks like me who didn’t get it from the existing doc. :slight_smile:

I’m assuming there’s a performance penalty on how fast a file can be loaded when running in single threaded mode?