Support MAX on dict column


#1

The use case is viewing the string values of various categorical columns for multiple aggregated rows. The query errors with MAX on strings or arrays not supported yet.

When is this support planned and are there any workarounds?


#2

Hi - we’re currently working on string function support and will update this when it lands. Can you illustrate the exact use case (query) that you’d like support for, so we can take this into consideration?


#3

Sure. A simple use case would look like this:

SELECT
        SUM(member.paid_amount) as paid_amount,
        MAX(CASE
                WHEN member.month_key IN ('2016-12-01') THEN member.home_city
        END) AS home_city,
        MAX(CASE
                WHEN member.month_key IN ('2016-12-01') THEN member.home_state
        END) AS home_state,
        member.member_id
FROM
        member
WHERE
        member.month_key BETWEEN '2016-01-01' AND '2016-12-01'
GROUP BY
	member.member_id

Numeric values are aggregated and categorical ones are pulled from their last month in the selected date window.


#4

Hi,

It seems to me that in this example the MAX() is purely used to get a value in a group by. To do a MAX() for this process seems more than you need. Would an aggregate called say SAMPLE() which just picks a single value be enough for your purposes?

regards


#5

You are correct. MAX is just a workaround to get at the value. An aggregate like SAMPLE would work great.


#6

Though having MAX would allow for simultaneous compatibility with our current db Redshift.


#7

Hi,

Its worth keeping in mind MAX() is a potentially computationally intense thing to do over billions of records when really all you need is a SAMPLE().

regards


#8

why not just group by home_city and home_state?


#9

Hi,

I would guess the purpose is so they get a single value for the paid_amount for the entire elapsed period not broken down by city and state, but they want a state and city (ANY city or state for the period for a given member_id) to “complete” the output record on the summary for the period

regards


#10

@dwayneberry yes, i first thought this was trying to find the latest(by time period) city/state but the MAX seems to apply to the city, state, not the time period (which is fixed) - so SAMPLE seems more appropriate


#11

@billmaimone: the specific use case is what @dwayneberry describes.

As @niviksha mentioned, “latest” is ultimately what I’m trying to do but SAMPLE seems like a more immediate solution since it doesn’t require specifying according to which column latest applies.

I’d appreciate any updates on this request as progress is made. This appears to be the last major issue in switching from Redshift for my query engine. Thanks for the great support all along the way.


#12

Though lots of products could find a way to execute this with MAX, there is this opportunity to do a far more optimal implementation here. I did a quick look around to see if any other db had a function with these semantics but didn’t see any. @niviksha Do you know any? If there is a precedent for what to call this it’d be best to follow it. Implementing this would be pretty straightforward, probably just pick the first element encountered in each new grouping set.


#13

Hi @dallinregehr, as of version 4.0 we now have LAST_SAMPLE in the product which does exactly this.

Please let us know if you have any questions.

Regards