Maximum Ingest Rates Possible


#1

Does anyone have any benchmark information related to the maximum possible ingest rates on relatively high-performance hardware?


#2

@HokieKev our sales engineers do have some good information, but anything I relay would be specific to the ingestion method and how the data is structured, particularly with regard to dictionary encoding and the number of columns. Can you share more information about that?

In one POC, a single node with 9 writers hit a million rows per second; a sketch of that multi-writer pattern follows.
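Purely as an illustration of that multi-writer pattern (not the actual POC code), here is a minimal Python sketch using the pymapd client. The credentials, table name, and the `make_shards()` helper are assumptions for the example:

```python
# Hypothetical sketch: N concurrent writers, each with its own connection,
# pushing row batches into the same table via pymapd's load_table().
from concurrent.futures import ThreadPoolExecutor
import pymapd

N_WRITERS = 9

def writer(shard):
    # One connection per writer so the batches can land in parallel.
    con = pymapd.connect(user="mapd", password="HyperInteractive",
                         host="localhost", dbname="mapd")
    for batch in shard:                   # shard: an iterable of row batches
        con.load_table("flights", batch)  # batch: a list of tuples
    con.close()

shards = make_shards(N_WRITERS)  # hypothetical helper: split the source N ways
with ThreadPoolExecutor(max_workers=N_WRITERS) as pool:
    pool.map(writer, shards)
```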


#3

That’s a tough question to answer, because ingestion throughput is affected by several factors; as @easy already said, it depends on how many columns your table has and what types they are, but you have to take other factors into account too, especially if you are using a machine with a lot of cores.

Let’s talk about ingesting data using the copy command, so the source data is local to the server (I am assuming you are using a local file system, not a network one).
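For reference, a basic COPY issued through the pymapd client might look like the sketch below; the table name, file path, and WITH options are illustrative:

```python
# Minimal COPY sketch via pymapd; path and WITH options are illustrative.
import pymapd

con = pymapd.connect(user="mapd", password="HyperInteractive",
                     host="localhost", dbname="mapd")
# COPY reads a file local to the *server*, so this path must exist on the
# server's file system, not the client's.
con.execute("COPY lineitem FROM '/data/lineitem.tbl' "
            "WITH (delimiter='|', header='false');")
```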

  1. Are you using a compressed file as a source? Decompression uses just one core, so the speed of a single core becomes the limiting factor: it cannot feed data fast enough to the rest of the CPU cores doing the inserts (see the micro-benchmark sketch below).

  2. With an uncompressed file, the disk subsystem could be the limiting factor, because you have to read roughly 10x the data of a compressed file; so, is your disk subsystem fast enough to feed the CPU cores of your machine?

  3. MapD loads data onto disk, not directly into RAM, so it has to write to disk. If you have enough spare RAM to cache the disk writes this may not be a problem, but if you are loading a lot of data you will likely generate a lot of disk writes, and again the disk subsystem can be the limiting factor.

  4. Are you using dictionary-encoded strings? They can save a lot of disk writes if cardinality is low, but if cardinality is high they will put serious pressure on the disk subsystem and cause serialization problems while loading the dictionary (see the DDL sketch below).
    e.g.
    the lineitem table of TPC-H has 16 columns and two dict-encoded strings (one with extremely high cardinality)
    the flights table of the ASA Data Expo 2009 has 29 columns and 6 dict-encoded strings

Guess what? The first one ingests at 800k rows/sec, while the second one reaches 1600k rows/sec.
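To make the dictionary-encoding trade-off concrete, here is a hedged DDL sketch; the column names and dictionary widths are illustrative, not the exact TPC-H schema:

```python
# DDL sketch contrasting dictionary widths; names and sizes are illustrative.
import pymapd

con = pymapd.connect(user="mapd", password="HyperInteractive",
                     host="localhost", dbname="mapd")
con.execute("""
CREATE TABLE lineitem_demo (
  l_orderkey BIGINT,
  l_shipmode TEXT ENCODING DICT(8),   -- low cardinality: small, cheap dictionary
  l_comment  TEXT ENCODING DICT(32)   -- near-unique values: the dictionary itself
                                      -- becomes a write/serialization cost
)
""")
# For extremely high cardinality, TEXT ENCODING NONE sidesteps the
# dictionary entirely, trading storage for ingest speed.
```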
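And to check points 1 and 2 on your own hardware, a quick probe like this sketch shows whether single-core decompression or the raw disk read is the slower side (file paths are illustrative assumptions):

```python
# Micro-benchmark sketch: compare gzip-decompression throughput against raw
# sequential read throughput on the same data.
import gzip
import time

def throughput_mb_s(open_fn, path):
    t0, total = time.perf_counter(), 0
    with open_fn(path) as f:
        while chunk := f.read(1 << 20):   # 1 MiB reads
            total += len(chunk)
    return total / (time.perf_counter() - t0) / 1e6

print("gzip:", throughput_mb_s(lambda p: gzip.open(p, "rb"), "/data/lineitem.tbl.gz"))
print("raw :", throughput_mb_s(lambda p: open(p, "rb"), "/data/lineitem.tbl"))
```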

If we extended this to different APIs, loading data over the network, and so on, things would get even more complicated.

In my experience the ingestion performance of MapD is very good (and I am saying “very good” only to be prudent).