Mar 16, 2011

How can I group all (other) columns?

Short Answer

Explanation

When you write a grouping query (i.e., one with a by clause) that lists columns in its select clause, the result nests all of the values from the selected columns. However, if no columns are listed in the select clause, the result only includes each column’s last value for each key.

For example, consider the following table:

trade: ([] time:  `time$   ();
           sym:   `symbol$ ();
           price: `float$  ();
           size:  `int$    ());
`trade insert (09:30:00t + til 1000; 
	1000 # 10 ? `3; 
	10.0 + 1000 ? 15.0; 
	100 + 100 * 1000 ? 10);

If we execute a query with a column in its select clause, the result is nested:

q)select price by sym from trade
sym| price                                         ..
---| ----------------------------------------------..
dgh| 13.68204 18.47666 19.59256 23.92803 12.75742 2..
dmo| 19.39943 19.1779  12.71798 10.9528  24.89916 2..
ekh| 22.16764 19.13283 23.42802 11.1409  24.42824 1..
gap| 16.33531 11.47509 13.90435 16.40958 22.59581 2..
log| 23.23925 17.83933 21.00673 14.97167 18.14587 1..
q)

However, if the select clause is empty, then the query returns only the last item from each group:

q)select by sym from trade
sym| time         price    size
---| --------------------------
dgh| 09:30:00.992 22.55807 1000
dmo| 09:30:00.999 22.72169 1000
ekh| 09:30:00.993 24.27423 300
gap| 09:30:00.995 20.08908 900
log| 09:30:00.998 24.42545 600
q)

Usually, this behavior is what you want. However, if you want to group all columns (except the grouping column), you have two options:

List each column explicitly in your select clause,

q)select time, price, size by sym from trade
sym| time                                          ..
---| ----------------------------------------------..
dgh| 09:30:00.002 09:30:00.012 09:30:00.022 09:30:0..
dmo| 09:30:00.009 09:30:00.019 09:30:00.029 09:30:0..
ekh| 09:30:00.003 09:30:00.013 09:30:00.023 09:30:0..
gap| 09:30:00.005 09:30:00.015 09:30:00.025 09:30:0..
log| 09:30:00.008 09:30:00.018 09:30:00.028 09:30:0..
q)

or,

Use xgroup:

q)`sym xgroup trade
sym| time                                          ..
---| ----------------------------------------------..
pln| 09:30:00.000 09:30:00.010 09:30:00.020 09:30:0..
nch| 09:30:00.001 09:30:00.011 09:30:00.021 09:30:0..
dgh| 09:30:00.002 09:30:00.012 09:30:00.022 09:30:0..
ekh| 09:30:00.003 09:30:00.013 09:30:00.023 09:30:0..
ojn| 09:30:00.004 09:30:00.014 09:30:00.024 09:30:0..
q)

The more columns your table has, the more attractive xgroup will be. However, every time we tried, xgroup was slightly slower:

q)\t do[10000; select time, price, size by sym from trade]
246
q)\t do[10000; `sym xgroup trade]
275
q)

On the other hand, if you want the code in question to be generically applied to several tables, xgroup is much better than the alternatives.

Notice that xgroup, in contrast to select by, does not sort the result by its key. If you need the rows sorted by key, follow xgroup with xasc:

q)`sym xasc `sym xgroup trade
sym| time                                          ..
---| ----------------------------------------------..
dgh| 09:30:00.002 09:30:00.012 09:30:00.022 09:30:0..
dmo| 09:30:00.009 09:30:00.019 09:30:00.029 09:30:0..
ekh| 09:30:00.003 09:30:00.013 09:30:00.023 09:30:0..
gap| 09:30:00.005 09:30:00.015 09:30:00.025 09:30:0..
log| 09:30:00.008 09:30:00.018 09:30:00.028 09:30:0..
q)

The cost of the sort will make the xgroup significantly slower than using select:

q)\t do[10000; `sym xasc `sym xgroup trade]
479
q)