Ben Godfrey

Archive for February, 2006

Aggregate operations in shell

You might be familiar with aggregate functions in SQL: count(*) and friends. I wanted to perform aggregate operations on files of tab-separated data. I’d run an aggregate query in SQL, get a set of result rows, and then need to total one column.

I output the data to TSV, using a mysql -e command and redirecting the output to a file. The data looks something like:

id      viewcount       comments
127067  341     44
127076  66      2
127077  158     6
127111  379     25
127112  83      3
127119  105     10
127131  47      0
127135  51      1
127137  133     17

I found that awk has the required mix of unix command line tool sensibilities and numeric functions. Here awk accumulates the 3rd value of each row and displays the total.

awk '{x+=$3} END {print x}' data.tsv

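awk is basically C-like and has a small core set of built-in numeric functions. To calculate the mean, divide the accumulated total by the built-in variable NR, which is incremented each time a record (line) is processed and so holds the total line count by the time we reach the END block. Note that NR counts the header line too, so for a file like the one above divide by NR - 1. (The header doesn’t upset the total, because awk treats the text “comments” as 0.)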
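
As a rough sketch of the sum and the mean together: the snippet below recreates the sample rows, skips the header line with NR > 1 and keeps its own row counter instead of dividing by NR. The file name data.tsv matches the command above; the counter n and the shell variables total and mean are just names I picked for illustration.

```shell
# Recreate the sample data as a tab-separated file, header included.
# printf reuses the format string for each group of three arguments.
printf '%s\t%s\t%s\n' \
  id viewcount comments \
  127067 341 44 \
  127076 66 2 \
  127077 158 6 \
  127111 379 25 \
  127112 83 3 \
  127119 105 10 \
  127131 47 0 \
  127135 51 1 \
  127137 133 17 > data.tsv

# Total of the 3rd column, skipping the header (record 1).
total=$(awk -F'\t' 'NR > 1 {x += $3} END {print x}' data.tsv)

# Mean of the 3rd column: n counts data rows only, so the header
# does not skew the divisor. The n > 0 guard avoids dividing by zero.
mean=$(awk -F'\t' 'NR > 1 {x += $3; n++} END {if (n > 0) print x / n}' data.tsv)

echo "total=$total mean=$mean"   # prints: total=108 mean=12
```

Setting -F'\t' is optional for this data, since awk splits on any whitespace by default, but it keeps the columns straight if a field ever contains a space.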

Doctests vs unit tests

OK, I might have to eat my words here. Doctests are quite cool as small dabs here and there, but the module doesn’t have the flexibility of unit tests. It’s much harder to chunk the tests and refer to them individually. This is hugely exacerbated by the whole Django model magic madness. Hopefully most of the gymnastics I’m doing in testing at the moment will become redundant when magic-removal hits the shelves. In the meantime I’m considering moving all my tests back to unit tests, which will be a pain because of the whole from django.models... boilerplate that’s required everywhere.

Also, having the tests in with code is quite cool in some ways, but it does make getting around the files tricky when the blocks get large. So much for literate programming.

Testing Django apps

I’ve started work on a new, quite large Django project. It will be six weeks until I’ve implemented everything in the specification, so I wanted to get in early with some tests.

Building on the code and ideas of Hugo, Ian Maurer and Sune Kirkeby, I created a simple Python module that provides:

  • A testenv_setup method that modifies the database settings in the current DJANGO_SETTINGS_MODULE to point to an in-memory sqlite database, installs a specified set of models into that virgin database and changes the MEDIA_ROOT to point to a folder in /tmp/.
  • A testenv_teardown method which cleans up the temporary MEDIA_ROOT.
  • A unit test base class (derived from unittest.TestCase) that uses testenv_setup and testenv_teardown.
  • A new handler, TestHandler, derived from WSGIHandler, which lets the user construct requests to the Django processing framework very easily (e.g. get_response("/my/url/")). It returns the original objects created for the response and lets exceptions bubble up, rather than handing back only the HTML, which makes tests easy to write. Of course the HTML is there too and, like Sune Kirkeby, I push it into a BeautifulSoup object for handy parsing.

The result is a simple module that can be used to create either doctests or unit tests that test both models and views. I prefer doctests, because they just seem more pythonic to me. Unit tests are great, but there’s a weighty Javaness to them: they’re clunkier to write, which is critical. Doctests are also a better fit for Django’s model magic. Because of the DB settings monkey-patching, unit tests must import the models after setUp, within each test function. This just adds tonnes of lines of cruft. With doctests, I just have a block of imports at the top of each test docstring.

Web 2.0 logo jam

There’s at least a good few billion in investment here!

Syncasting

I have a bluetooth dongle for my (old) Powerbook. I keep all my events in iCal and all my contacts in Address Book. I’m organised and I appreciate having this data on the run. So why do I never synchronise my phone to my computer? It only takes about five minutes, but it’s just hassle. I have to take that five minutes out of whatever else I’m doing.

One alternative is to use wholly online services, with mobile versions, or browse them phone-side through Opera Mini. In my case I’m sure I can find a service that can provide WebDAV calendars for iCal and perhaps even LDAP for Address Book. Whenever I need to get to my information while I’m out, I just go to my bookmarks on my phone. But now I’m restricted to the places where I can get signal. If I’m sat on the underground I’ve got a powerful device with a large memory, but none of the data I need. This is something that a lot of mobile app developers seem to have missed. It’s just about OK to assume that a desktop user has connectivity, but a mobile user is in and out of coverage all the time and the bandwidth is dog slow. The user gets a much better experience if you assume the device is offline and go online whenever you get the chance.

So, instead, wouldn’t it be great if I could use podcasting to get my contact, event and other data to my handset on a regular schedule? I’d set up RSS feeds for all of my key information, and those channels would include data in standard formats. For contacts, send vCards as enclosures, or use the hCard microformat. My phone would have options to poll my feeds regularly, and I could do it on demand as well. In my case I’d set my phone to update every 24 hours; that would be plenty given how often I update and would keep my data traffic fairly low. If I’m out of range or battery when the allotted time comes, my phone would fail silently and retry later on, perhaps giving me an error once a threshold of retries had failed. My operator would provide a central service for uniting my disparate feeds (or I could do this myself), so my phone only has to check one place. If there’s no new information, the data transferred is tiny.

The only problem with this idea is how to implement it. AFAIK it can’t be done with a Java MIDlet because these must be explicitly run by the user. The whole benefit of podcasting here, as in other applications, is that it’s a transparent background task. The user should not be involved.

Now some of you might say stop being lazy and get your Bluetooth on, but if you think this is an interesting idea, maybe we can get grass roots support to force it on the handset manufacturers and network operators :-). Maybe Dave Winer is reading right now. If you think this has potential, link to this post with the link text “syncasting” and we’ll see what happens.

Actually, I did once meet a guy who did some UI work at Nokia and SE. Maybe I can twist his arm…

The 11th lie of the entrepreneur

Guy Kawasaki adds the following to his top ten lies of entrepreneurs:

“We’ll generate a lot of traffic and monetize it with Google AdSense”

He spills the beans on his blog’s numbers in a very open way for a VC. It’s certainly useful for comparison.