MapD slow on text fields


#1

In benchmarking the following query:

SELECT * from test order by c0 asc limit 1

Out of ten tests we get:

Took: 7877 ms. (7.877 s)
Took: 15617 ms. (15.617 s)
Took: 8067 ms. (8.067 s)
Took: 15924 ms. (15.924 s)
Took: 8057 ms. (8.057 s)
Took: 15864 ms. (15.864 s)
Took: 15455 ms. (15.455 s)
Took: 15245 ms. (15.245 s)
Took: 7857 ms. (7.857 s)
Took: 15624 ms. (15.624 s)

Note that c0 is a text field.

Why is this query so slow? And is there a way to fix it, or a workaround, so that a query that takes 0.1 s in MySQL performs acceptably?


#2

You can try a query like the following to mitigate your performance problem.

If the c0 column has high cardinality:

select * from test join (select rowid from test order by c0 asc limit 1) o on o.rowid = test.rowid;

If the c0 column has low cardinality, you can try:

select * from test join (select distinct c0 from test order by c0 limit 1) a on a.c0 = test.c0 limit 1;
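As a sanity check that the rowid-join form returns the same row as the plain ORDER BY, here is a minimal sketch using SQLite as a stand-in for MapD (the table name test and column c0 match the thread; the data and the extra column c1 are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (c0 TEXT, c1 INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)",
                 [("banana", 1), ("apple", 2), ("cherry", 3)])

# Original form: sort the whole table, keep one row.
direct = conn.execute(
    "SELECT * FROM test ORDER BY c0 ASC LIMIT 1").fetchone()

# Rewritten form: sort only to find the winning rowid, then
# join back to the table to fetch the full row.
rewritten = conn.execute(
    "SELECT test.* FROM test "
    "JOIN (SELECT rowid FROM test ORDER BY c0 ASC LIMIT 1) o "
    "ON o.rowid = test.rowid").fetchone()

print(direct)      # ('apple', 2)
print(rewritten)   # ('apple', 2)
```

The idea is that the inner subquery sorts a single narrow column instead of materializing every projected column through the sort.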

The query is faster on MySQL because MySQL:

  1. is using an index (educated guess), so it doesn't have to sort anything; it just traverses a few blocks of a B-tree, gets the rowid of the row, and then projects (basically it reads a block from disk and gets all the data)
  2. is a row-oriented database, so it is better suited to retrieving small batches of rows, and its code is written and optimized to do exactly that

My guess is that to execute the same query, MapD decodes the column (I am assuming a dictionary-encoded text field) into system memory and sorts all n_rows.
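The index point above can be illustrated with SQLite (used here only as a convenient stand-in; MapD has no B-tree indexes): the query plan contains an explicit sort step, which disappears once a B-tree index on c0 exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (c0 TEXT)")

# Without an index, the plan includes a temp B-tree sort step.
plan_no_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM test ORDER BY c0 ASC LIMIT 1"
).fetchall()

conn.execute("CREATE INDEX idx_c0 ON test (c0)")

# With the index, the engine scans the index in sorted order,
# so no sort step appears in the plan.
plan_with_index = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM test ORDER BY c0 ASC LIMIT 1"
).fetchall()

# The fourth column of each plan row is a human-readable detail string.
for row in plan_no_index:
    print(row[3])
for row in plan_with_index:
    print(row[3])
```

With an index, satisfying ORDER BY … LIMIT 1 costs a handful of B-tree block reads regardless of table size, which is why the MySQL timing is so far ahead here.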


#3

Hi,

How many rows are in the table? What is the cardinality of c0? Is it possible to get access to your benchmark details and data?

We have no indexes, so with the SQL expressed like this the sort is fully executed on the c0 column, and we have not optimized for this particular style of query plan.

To get a more performant result, try rewriting your query as follows:

select * from test where c0 in (select c0 from test group by c0 order by c0 asc limit 1) limit 1;

A little long winded but I hope you will find better performance.
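A quick way to check this rewrite, again with SQLite standing in for MapD (the duplicate rows are invented to mimic a low-cardinality c0):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (c0 TEXT, c1 INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)",
                 [("banana", 1), ("apple", 2), ("apple", 3), ("cherry", 4)])

# GROUP BY collapses the sort input to the distinct c0 values,
# then the outer query fetches one matching row.
row = conn.execute(
    "SELECT * FROM test WHERE c0 IN "
    "(SELECT c0 FROM test GROUP BY c0 ORDER BY c0 ASC LIMIT 1) "
    "LIMIT 1"
).fetchone()
print(row[0])  # 'apple'
```

Note that when several rows share the minimum c0, the outer LIMIT 1 without an ORDER BY returns an arbitrary one of them, so only the c0 value is deterministic.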

Using the online 400M-row Twitter dataset, the following query comes back in around 100 ms:

mapdql> select * from tweets_new where county_state in (select county_state from tweets_new group by county_state order by county_state asc limit 1) limit 1;
tweet_id|tweet_time|lat|lon|sender_id|sender_name|location|source|reply_to_user_id|reply_to_tweet_id|lang|followers|followees|tweet_count|join_time|tweet_text|country|admin1|admin2|place_name|state_abbr|county_state|origin|hashtags|tweet_tokens|goog_x|goog_y|is_exact
888073821797875713|2017-07-20 16:31:13|34.180248|-82.376106|1942415563|e_d_simpson|South Carolina|Twitter for Android|NULL|NULL|fr|2148|2125|19525|2013-10-07 00:14:10|#ThrowbackThursday 2015 "X" Tour @PhilipsArena @edsheeran #Epic https://t.co/9ziqr0UW6V|US|South Carolina|Abbeville|Abbeville|SC|Abbeville County, SC|android|{#throwbackthursday, #epic}|{#throwbackthursday, 2015, x, tour, philipsarena, edsheeran, #epic, https://tco/9ziqr0uw6v}|-9170066.000000|4053030.500000|false
1 rows returned.
Execution time: 80 ms, Total time: 81 ms
mapdql>

It seems you also posted on Stack Overflow (so I have also responded there).

regards


#4

Hi,

Thanks for the response. We'll test these out and respond with results
early next week. Yes, I also posted on Stack Overflow – thank you for the
response there as well. We'll keep our questions in the forum from now on,
seeing that you guys are so responsive here.

Thanks,

David


#5

Hi,

Additionally, are there plans to support the performance of SELECT * … ORDER BY
on a text field?

The dataset we were testing with was 10M rows.

Thanks,

David