I have a very big text file. Can I read a text file in chunks?

By defining a function to handle one chunk

process_chunk: {[list_of_lines]
    // Split and parse each line in list_of_lines and
    // then - most likely - create some sort of output.
    // After all, the idea here is to avoid holding
    // all of the data in memory.
    }

and passing your chunk handler along with the file handle to .Q.fs:

bytes_read: .Q.fs[process_chunk; `:big_file_name]

For details on how to parse lines of text inside your chunk handler, see this related faq.
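As a minimal sketch of what a chunk handler might look like, the following keeps a running total instead of retaining the parsed data. The global name total and the name=value file layout are assumptions made for illustration; substitute whatever parsing and output your file needs.

```q
// Hypothetical global: a running total that survives between chunks.
total: 0

// Each call receives one chunk as a list of strings, one per line.
// Parse the chunk as name=value pairs with 0: and fold the values into
// the running total, so only one chunk is held in memory at a time.
process_chunk: {[list_of_lines]
  parsed: ("SI"; "=") 0: list_of_lines;
  total:: total + sum parsed 1}

bytes_read: .Q.fs[process_chunk; `:big_file_name]
```

Note the use of :: to assign to the global from inside the function; a plain : would create a local variable that is discarded when the handler returns.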

How do I read in a text file?

Short answer:

1. (types; delimiter) 0: `:filename
2. .Q.fs[chunk_handler; `:filename]

Although there are many ways to read an ASCII file in q – depending on the content, how big the file is, and what you want to do with it – most of the time you will use one of two methods. The first method is for files you want to read into memory in their entirety, while the other approach is for situations in which you want to deal with the file in chunks. The latter scenario is covered in this related faq.

If the file is small enough (compared to the available memory in your system), you can read it all in as a list of lines in one go using read0. Given the following file, lines.txt,

foo=10
bar=20
baz=30

we can write

q)lines: read0 `:lines.txt

To break up the lines, we use the vs (vector from scalar) function, applying the /: (each right) adverb so that we split each line:

q)split: "=" vs/: lines
q)split
"foo" "10"
"bar" "20"
"baz" "30"

At this point, you probably want to parse each piece of text into its corresponding type to facilitate fast searching or arithmetic etc. You use the $ (cast) operator to do this, passing an uppercase type character as its left argument:

q)"S" $ "foo"
`foo
q)"I" $ "10"
10

You may remember from this related faq that you can convert a list of items at once:

q)"S" $ ("foo"; "bar"; "baz")
`foo`bar`baz
q)"I" $ ("10"; "20"; "30")
10 20 30

But wait! There’s more! If you pass a list of type characters as the left argument to $, you can parse multiple lists:

q)"SI" $ (("foo"; "bar"; "baz"); ("10"; "20"; "30"))
foo bar baz
10  20  30

The list of type characters can be as long as you like:

q)"SSFI*" $ ("foo"; "bar"; "10.5"; "47"; "left as a string")
`foo
`bar
10.5
47
"left as a string"

Thus, we can parse our file with the following code:

q)"SI" $ flip "=" vs/: read0 `:lines.txt
foo bar baz
10  20  30

Since this sequence of operations is so common, it has been wrapped up in an overload of that workhorse of text I/O, 0: (load text). The trick is to pass a pair as the left argument to 0: where the first element of the pair is the string of type characters and the second element of the pair is the delimiter between each name and value in the file:

q)("SI"; "=") 0: `:lines.txt
foo bar baz
10  20  30

Putting it all together, we can turn our file into a table like so:

q)flip `name`val ! ("SI"; "=") 0: `:lines.txt
name val
--------
foo  10
bar  20
baz  30
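If you read files of this shape often, the expression can be wrapped in a small helper. This is only a sketch; the function name read_pairs and the threshold in the query are made up for illustration.

```q
// Hypothetical helper: read a name=value file into a two-column table.
read_pairs: {[file] flip `name`val ! ("SI"; "=") 0: file}

t: read_pairs `:lines.txt
// Ordinary q-sql then works on the parsed, typed columns.
select from t where val > 15
```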

Using 0: instead of the combination of read0, vs and $ is faster and less memory-intensive. The differences become significant as the file size grows:

q)system "wc trade_small.csv"
" 1000001 1000001 29997921 trade_small.csv"
q)\ts ("TSIF"; ",") 0: `:trade_small.csv
554 20971840j
q)\ts "TSIF" $ flip "," vs/: read0 `:trade_small.csv
2649 232389280j

As a consequence of the 0: function’s superior memory efficiency, it can handle much larger files than the other approach:

q)system "wc trade_big.csv"
" 10000001 10000001 299888328 trade_big.csv"
q)\ts ("TSIF"; ",") 0: `:trade_big.csv
5672 335544640j
q)\ts "TSIF" $ flip "," vs/: read0 `:trade_big.csv

If you don’t actually need to parse the file contents, then read0 (by itself) is fine.

By the way, if you only want to grab part of the file, you can pass a triple to read0 in order to read a subset of the bytes. You’ll still get a list of lines broken on newlines:

q)offset: 3
q)number_of_bytes_to_read: 6
q)read0 (`:lines.txt; offset; number_of_bytes_to_read)
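
To make the byte arithmetic concrete, here is what that call would return, assuming lines.txt contains exactly the three lines foo=10, bar=20 and baz=30, each terminated by a single newline character. Offsets count from zero, so offset 3 is the first = sign.

```q
// Assumes lines.txt is exactly "foo=10\nbar=20\nbaz=30\n".
// The 6 bytes starting at offset 3 are "=10\nba", which read0 splits on the newline.
q)read0 (`:lines.txt; 3; 6)
"=10"
"ba"
```

Because the range is measured in bytes rather than lines, the result can begin or end partway through a line, as it does here.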

Unless your file has fixed-length records, however, you may find it easier to use the system function to get exactly the lines you want, assuming you have the head and tail utilities (or similar) available. For example,

q)first_line: first system "head -1 lines.txt"

(Note the call to first; system always returns a list of strings, even when there is only one.) This particular example is handy when you need to examine the start of a file to figure out how to read it properly.
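
One way to use that first line, sketched here with made-up variable names, is to check which delimiter it contains and how many fields it yields before choosing the type string for 0:.

```q
first_line: first system "head -1 lines.txt"

// Hypothetical checks on the first line before deciding how to parse the file:
// does the candidate delimiter occur, and how many fields does it produce?
has_equals: "=" in first_line
field_count: count "=" vs first_line
```

The field count tells you how many type characters the left argument of 0: needs.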

This work is licensed under a Creative Commons License.
The views and opinions expressed herein are those of the authors and do not necessarily reflect those of any other person or legal entity.
Kdb+ is the registered trademark of Kx Systems, Inc.