How to make sure header row populates column names?


#1

I’ve tried importing various CSVs and TSVs, both with and without quote-enclosed strings, with \n and \r\n line endings, etc., and can’t seem to figure out how to consistently get MapD to use the header row to populate column names. When it fails, every field is marked as a string, even if the only non-numeric values are in the header row. Is there a way to make this work consistently?


#2

Thanks for trying MapD.

Could you post a sample file of the data you’re trying to import? That will greatly simplify debugging and will make it easier to file an issue if necessary.

The rules used for deciding whether or not there’s a header are:

  • first row is detected as text
  • there exists at least one column whose fields after the first are not detected as text
  • there are at least three rows
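Roughly, the heuristic looks like the sketch below. This is an illustration of the three rules, not the actual MapD detection code; the function names and the int/float "is it text?" test are my own simplifications.

```python
import csv


def looks_like_text(value):
    """A value counts as 'text' if it parses as neither int nor float."""
    for cast in (int, float):
        try:
            cast(value)
            return False
        except ValueError:
            pass
    return True


def has_header(rows):
    """Apply the three rules above to a list of parsed CSV rows."""
    if len(rows) < 3:                       # rule 3: at least three rows
        return False
    if not all(looks_like_text(v) for v in rows[0]):
        return False                        # rule 1: first row is text
    # rule 2: at least one column is non-text below the first row
    return any(
        all(not looks_like_text(row[col]) for row in rows[1:])
        for col in range(len(rows[0]))
    )


rows = [["name", "count"], ["a", "1"], ["b", "2"]]
print(has_header(rows))  # True: text header over a numeric column
```

If the first data row trips any of these rules (for example, a numeric value in the first row), the whole first row is treated as data and every column falls back to string.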

Unfortunately Immerse does not currently allow you to override the header setting, though that is possible if you create the table and import data using the command line mapdql client. See: https://www.mapd.com/docs/latest/mapd-core-guide/loading-data/?highlight=header


#3

Sure thing - here’s the most recent version I’ve tried, but it’s kind of a weird one (pipe delimiters and forced quotes).
The full file is about 23000 lines. Here’s a 100 line sample.

If you’re curious, I can see a few of the options I’ve tried in my script…

        giantwriter = csv.writer(gf, delimiter='|', quotechar='"', quoting=csv.QUOTE_ALL)
        # giantwriter = csv.writer(gf, delimiter='|', quotechar='"', lineterminator='\r\n')
        # giantwriter = csv.writer(gf, delimiter='\t', quotechar='"')
        # giantwriter = csv.writer(gf, dialect='excel')

I also tried removing spaces from the header fields:

giantrow = [n.replace(" ", "_").strip() for n in giantrow]

BTW, your form doesn’t permit uploading files with the ‘.tsv’ extension. Might be worth a look.

100lines.csv (575.7 KB)

Thanks!


#4

Could you try removing the commas from your numbers? So instead of 5,411 make it simply 5411. Hopefully pretty easy since the file is pipe-delimited.
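If it helps, here’s one way to strip those commas with Python’s csv module (a sketch; note it removes commas from *every* field, which is fine only if no legitimate text field contains one):

```python
import csv
import io


def strip_thousands(text, delimiter="|"):
    """Remove commas from every field of a delimited file body."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter)
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        # "5,411" becomes "5411"; unaffected fields pass through as-is
        writer.writerow([field.replace(",", "") for field in row])
    return out.getvalue()


print(strip_thousands('city|pop\r\nNYC|"8,336,817"\r\n'))
```

Going through the csv module (rather than a plain string replace) keeps the quoting consistent while rewriting the fields.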


#5

Thanks! I hadn’t caught that. With all the commas removed (and checking types with int() and float()), the data types now populate correctly.

The header row is still not working, though. Any thoughts? It’s kind of odd, really, since Immerse shows the strings as the first row yet correctly recognizes the column type as int.

100lines.csv (369.8 KB)


#6

Quick solution: prepend a letter to any column names that start with a number (we don’t allow column names to start with a number anyway).
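For example, something like this (the helper and the `c` prefix are just an illustration, not a built-in):

```python
def fix_names(names, prefix="c"):
    """Prefix digit-leading column names so they form valid identifiers."""
    return [prefix + n if n[:1].isdigit() else n for n in names]


print(fix_names(["2017_sales", "region"]))  # ['c2017_sales', 'region']
```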

This was an interesting one: the date and time type-detection code is marking some of those column names as potentially valid times. I’ll look into reworking that a bit.


#7

Ah perfect! That did the trick. Thanks.