Entries in where function (2)

How do I get the length of a list?

Short answer:

count list
or
count each column_name
to get the length of a list-valued field in a query.


The length of any list can be found using the count function; the only complication occurs with certain applications of count in queries. We'll illustrate with a table of simulated trade data:

q)trade
time         sym price size
---------------------------
09:30:17.623 glo 26.79 3200
09:30:26.602 ojb 28.17 1600
09:30:33.657 fjc 91.31 5400
09:30:40.884 knb 59.7  2900
09:31:19.256 apl 1.26  7900
..
q)

The following query uses count to return the number of rows that satisfy the where clause:

q)exec count i from trade where sym = `aif
6
q)

This is usually what you want.

However, suppose we created another table (perhaps to improve the performance of certain queries) that grouped prices for each symbol:

q)grouped: select price by sym from trade
q)grouped
sym| price                                                   ..
---| --------------------------------------------------------..
acl| 18.1 85.5 86.31 45.65 10.91 31.25 5.49 51.66 1.02 31.82 ..
aif| 44.02 47.74 11.83 14.28 85.11 99.09                     ..
alk| 51.12 16.7 78.71 96.56 26.55 32.09 33.66 30.01 83.24 30...
amg| 59.25 56.18 92.63 46.57 57.25 14.58 69.21 88.25 94.1 28...
aog| 92.59 51.52 96.95 83.2 6.21 59.77 44.19 94.19 3.13 41.94..
..
q)

In table grouped, the price column is of type list-of-float - note the uppercase type letter F:

q)meta grouped
c    | t f a
-----| -----
sym  | s    
price| F    

It appears that we can retrieve the prices for a given symbol easily enough:

q)exec price from grouped where sym = `aif
44.02 47.74 11.83 14.28 85.11 99.09
q)

Suppose, however, that we want to know how many trades occurred for a particular symbol:

q)exec count price from grouped where sym = `aif
1    
q)

What went wrong? A where function yields a list of row indexes that meet the constraints, and then each projection (i.e., the column names between exec and from — in this case, price) yields a list of corresponding field values. Since the number of rows that met our sole constraint is 1, the result of the projection is an untyped list with a single element:

q)type exec price from grouped where sym = `aif
0h
q)count exec price from grouped where sym = `aif
1
q)

Projection results are the arguments to aggregation functions in queries. In other words, the untyped list in our example is the same as the one passed to count. Since the projection's single element contains the list of prices we want to count, the way out is to combine count with first:

q)exec count first price from grouped where sum = `aif
6
q)

Applying the same logic when counting the trades for every symbol, we need to use count each:

q)select count price by sym from grouped
sym| price
---| -----
acl| 1    
aif| 1    
alk| 1    
amg| 1    
aog| 1    
..
q)select count each price by sym from grouped
sym| price
---| -----
acl| 14   
aif| 6    
alk| 10   
amg| 12   
aog| 13   
..
q)

Constraints in where clauses behave the same as well. Consider the following select statement:

q)t
x y z
-------------
a 1 "xyzzy"
b 2 "grue"
c 3 "frobozz"
q)select from t where 5 < count z
x y z
-----
q)

Newcomers to q often expect the above query to return the rows from t whose z column has more than 5 characters (i.e., c 3 frobozz). Rather than counting the contents of each z field, however, count is actually counting the list t `z:

q)count t `z
3
q)t `z
"xyzzy"
"foobar"
"frobozz"
q)

This insight suggests the solution:

q)count each t `z
5 6 7
q)select from t where 5 < count each z
x y z
-------------
c 3 "frobozz"
q)

Does the order of field constraints matter in the performance of a select query?

Short answer: Yes, put the most restrictive constraint first.

Although expressions in q are usually evaluated from right to left, there is one exception: the constraints in a where clause are evaluated from left to right. Consider the following example:

q)t: ([] x: `a`b`c; y: 1 2 3)
q)select from t where x < `c, y > 1
x y
---
b 2
q)

The comma separating x < `c from y > 1 is not the join operator; instead, it delimits the constraints of the where clause. If you need to use the join operator in a where clause, use parentheses:

q)select from t where x in (`a, `b), y > 1
x y
---
b 2
q)

Each constraint is evaluated in the usual order, i.e., right-to-left:

q)select from t where x < first -1 # `p`c, y > 1
x y
---
b 2
q)

Because constraints are evaluated from left to right, putting the most restrictive constraint first will speed up queries on large tables. On a date partitioned database, for example, the placement of a constraint on the virtual date column will make a marked performance difference:

select from large_table where date = 2011.02.25, name = `JOE

is significantly faster than

select from large_table where name = `JOE, date = 2011.02.25

The latter query will attempt to load the entire contents of table for rows matching name `JOE prior to narrowing the result set down to a single date. kdb does not provide behind-the-scenes optimization in cases like these.