About sort & concat


#1

Hi
We are testing with the Community Edition, and we found that ‘order by’ is very slow. Other sort functions are amazingly fast, but ‘order by’ is not. Do you have any idea why?

Next, I want to know whether a ‘concat’ function is supported. The concat function I mean combines two strings; the following page explains it well:

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions026.htm
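
For example, the behaviour we mean is what that Oracle page describes (Oracle syntax shown only to illustrate what we want, not something we expect to run in MapD as-is):

SELECT CONCAT('Hello, ', 'World') FROM DUAL;
-- or equivalently, with Oracle's || operator:
SELECT 'Hello, ' || 'World' FROM DUAL;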

Does MapD offer this feature? If not, do you plan to add it?

Finally, there is a limit when sorting: results are capped at one million rows. Is there a way around this limit?

Spec

OS: CentOS 7 (installed on VMware ESXi 6.0)
CPU: Intel® Xeon® CPU E5-2620 v4 @ 2.10GHz (8 cores in use)
RAM: 16GB
GPU: GeForce GTX 1060 6GB

Thanks


#2

Hi,

Thanks for giving MapD a try.

Could you give a bit more detail about the column you are ordering by? What other sort functions are you using?

There is currently no concat function.

Could you elaborate on the use case in which you need to generate and project an ordered list of more than a million values?

regards


#3

First, about concat.
What we want is the ability to combine strings.
The other database we use provides this as a function called concat, but it does not need to be a function named concat. Is there any way to combine two strings in MapD?

Next, about sort.
We use queries of the following form:
Select column from test_table order by time_column;
We have millions of rows that must be sorted in chronological order. It is something we need to do frequently, but it is difficult because of the one-million-row limit. That is why I asked how we can get around the limit.

Lastly, we also used group by, and it gave us satisfactory results.

Thanks for your reply.


#4

In addition, when using select, millions of rows are not immediately available. To view them I have to give a limit, but the limit is capped at one million.
We have to handle several million more rows than that. Is there a way around the limit?


#5

Hi,

We currently have no concat function, nor any other way to concatenate strings.

I am still not clear on your use case. I understand you have millions of records (we deal with many datasets containing billions of records).

I do not understand under what conditions you need to send back a list of 1 million sorted results.

As a fast, interactive analytical database we have not seen any really good use cases for this.

What we would normally recommend is returning some kind of binned group by. In your example, binning around date would look something like this:

Assume cstart is your timestamp column. This table has about 300m records.

mapdql> select date_trunc (day,cstart),count(*) from mydata group by date_trunc (day, cstart) order by date_trunc (day, cstart) desc limit 20;
EXPR$0|EXPR$1
2016-09-08 00:00:00|350415
2016-09-07 00:00:00|1236516
2016-09-06 00:00:00|1524534
2016-09-05 00:00:00|1622904
2016-09-04 00:00:00|1823895
2016-09-03 00:00:00|1791741
2016-09-02 00:00:00|1784010
2016-09-01 00:00:00|1860900
2016-08-31 00:00:00|1873596
2016-08-30 00:00:00|1997988
2016-08-29 00:00:00|1973811
2016-08-28 00:00:00|2009865
2016-08-27 00:00:00|1956750
2016-08-26 00:00:00|1962429
2016-08-25 00:00:00|2010954
2016-08-24 00:00:00|2012358
2016-08-23 00:00:00|2073966
2016-08-22 00:00:00|1986390
2016-08-21 00:00:00|2068050
2016-08-20 00:00:00|1993596
20 rows returned.
Execution time: 107 ms, Total time: 107 ms
mapdql> 

I am returning 20 rows here, but we might want to return hundreds of days (or whatever granularity is relevant).

Then we would home in on the data we are interested in with a series of finer filter queries (see the sketch below), until we reach the data that is relevant to the analyst, rather than drowning them in the delay of delivering 1 million records… really, what are they going to do with those records next?
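
To make that concrete, a drill-down on the example above might look something like the following. This is only a sketch reusing the hypothetical mydata table and cstart timestamp from above, narrowing from days to hours within one day; it assumes your version accepts timestamp literals written this way:

mapdql> select date_trunc (hour, cstart), count(*) from mydata where cstart >= '2016-09-07 00:00:00' and cstart < '2016-09-08 00:00:00' group by date_trunc (hour, cstart) order by date_trunc (hour, cstart) desc;

Each step filters down to the region of interest and re-bins at a finer granularity, so the result set stays small.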

This is not to say we won’t lift the sorting constraints over time, but we are targeting an interactive experience for our end users, and that does not normally include delivering 1 million results, ordered or not. We are trying to lift people out of the REPORT-style mentality and allow them to EXPLORE the real data.

Have you reviewed the dashboards at https://www.mapd.com/demos/ ?

You will see that many of them have time charts spanning many billions of records, which work effectively with this binning approach as you drill into the data.

I hope this helps explain our approach.

regards


#6

Okay, we understand your point. However, we do not use the search results to generate statistics. All of the retrieved data is used with its own meaning, and it is analyzed in one place. That is why we sort it in chronological order.
If we split it all up, we would have to send hundreds of copies of the same query to another table.
We need to get the full results even if it is a bit slow. The simplest solution we can think of is to remove the one-million-row limit.
Is there a solution for this?


#7

Hi DTMN,

I read your description a couple of times and am interested in the problem, but I am not sure I understand all of the details.

It sounds like you want to generate output for time-ordered data, that you may want to group it (I am not sure about the grouping part), and that you do not want interactive analysis so much as to export the time-ordered results.

So if I have all of that correct, are you exporting the data? Like this example from the MapD docs:

COPY (SELECT * FROM tweets) TO '/tmp/tweets.csv';
COPY (SELECT * FROM tweets ORDER BY tweet_time LIMIT 10000) TO
'/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'true', header = 'false');

Please let me know if I am following you. (Note: the last time I did a large export to CSV, I did encounter some sort of output limitation. If that is your use case, and the CSV limit still applies in this version, you may have to export in groups, as I believe you pointed out. There might be another workaround, if I am following you, that is.)
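
If the output limit does still apply, one way to export in groups is to split the COPY by time range. This is only a sketch reusing the tweets table and tweet_time column from the docs example above, with made-up date boundaries; adjust the ranges so each chunk stays under whatever row or sort limit applies in your version:

COPY (SELECT * FROM tweets WHERE tweet_time >= '2016-08-01' AND tweet_time < '2016-09-01' ORDER BY tweet_time) TO '/tmp/tweets_2016_08.csv';
COPY (SELECT * FROM tweets WHERE tweet_time >= '2016-09-01' AND tweet_time < '2016-10-01' ORDER BY tweet_time) TO '/tmp/tweets_2016_09.csv';

Concatenating the chunk files afterwards, outside the database, would reconstruct the full time-ordered output.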