Recent Blog Posts

Using git & Python to autogen changelogs

2014-01-14T00:00:00Z

Background

As part of the communication process at work, devs maintain changelogs for some of our projects. Each changelog is a single RELEASE NOTES.md file in the project root, where each line is a Markdown hyperlink to the pull request that introduced the change. These pull request links are then grouped together by date of release. The changelog looks like:

## v1.7 2013/03/17
* [#100](https://github.com/courseload/project/pull/100) - Finalized previously preliminary stuff
* [#99](https://github.com/courseload/project/pull/99) - Did some preliminary stuff

## v1.6.4 2013/03/14
* [#98](https://github.com/courseload/project/pull/98) - Made dongles brighter.
* [#97](https://github.com/courseload/project/pull/97) - Improved widget performance by 3.8x

At first, these were created by having devs also update RELEASE NOTES.md with each pull request. This distributed the workload, but it also made having multiple pull requests a big pain in the ass since the same file, usually the same line in the same file, was being modified by multiple pull requests. So we stopped that practice and instead moved to a hand-made RELEASE NOTES.md file, maintained by these de facto primaries. Obviously this kind of work is sub-optimal and ripe for automation. For months though, streamlining the process fell far down on the priority list until I just couldn't take it anymore.

git log

When I am automating a repetitive task like this, my goal is to write as little code as possible. In this case, that means massaging the output of git log to get as close as possible to the final format of the changelog lines. To start with, I only want to output merge commits. We can do that with:

git log --merges

This is good, but it shows a lot of extra information I'd have to parse out. If you'll notice in my example above, the lines in RELEASE NOTES.md are formatted like [#<pull request number>](https://github.com/courseload/project/pull/<pull request number>) - <pull request description>. So we notice right away we need two things from git log:

The commit message of the merge. Think of this as the subject line of an email. We want this because this has the number of the pull request.
The pull request description, which works out to be, for the sake of this blog post, the equivalent of the first line of the body of the aforementioned email.

This git command gets us this info without a bunch of cruft:

git log --pretty=format:'%s%n%b' --merges

But let's get really close now to the desired final output:

git log --pretty=format:'%s%n* [#{pr_num}](https://github.com/courseload/project/pull/{pr_num}) - %b' --merges

Now, every merge commit appears as a two-line entry. The first is the merge commit message. The second is the pull request description. As a bonus, the second line looks almost exactly like a changelog line, except with Python string-formatting placeholders in place of the PR number.
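To make the placeholder idea concrete, here is a minimal sketch of the parsing step on its own, run against a hard-coded string instead of a real repository. The sample text, branch names and the helper name changelog_lines are made up for illustration; only the "Merge pull request #N" subject format and the {pr_num} template come from the commands above.

```python
import re

# Made-up text shaped like the two-line-per-merge git log output above.
SAMPLE = (
    "Merge pull request #100 from courseload/finalize\n"
    "* [#{pr_num}](https://github.com/courseload/project/pull/{pr_num})"
    " - Finalized previously preliminary stuff\n"
    "Merge pull request #99 from courseload/prelim\n"
    "* [#{pr_num}](https://github.com/courseload/project/pull/{pr_num})"
    " - Did some preliminary stuff\n"
)

# Group 1: the PR number from the subject line.
# Group 2: the template line that follows it.
PATTERN = re.compile(r"^Merge pull request #(\d+).*\n(\* [^\n]*)$",
                     re.MULTILINE)


def changelog_lines(log_text):
    """Fill each template line with the PR number from its subject line.

    Caveat: str.format() will choke on a PR description that contains
    literal curly braces.
    """
    return [template.format(pr_num=num)
            for num, template in PATTERN.findall(log_text)]


if __name__ == "__main__":
    for line in changelog_lines(SAMPLE):
        print(line)
```

Each subject line supplies the number, and the line after it is the template that number gets dropped into.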

Python

It's great that we have just the info we want, but I know we're also going to need to do two things:

Parse out the pull request number from the git log output, and
Use the PR number to create the changelog entry

By running the above git log command via subprocess.check_output I can automate all this with this script:

#!/usr/bin/env python
"""This script generates release notes for each merged pull request from
git merge-commit messages.
Usage:
 `python release.py <start_commit> <end_commit> [-f FILEPATH] [-t TAG] [-d DATE]`
For example, if you wanted to find the diff between version 1.0 and 1.2,
and write the output to the release notes file, you would type the
following:
 `python release.py 1.0 1.2 -f CHANGELOG.md`
"""
import datetime
import os.path as op
import re
import subprocess
from collections import deque

PROJECT_URI = "https://github.com/foo/bar"

def commit_msgs(start_commit, end_commit):
    """Run the git command that outputs the merge commits (both subject
    and body) to stdout, and return the output.
    """
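    # The literal single quotes end up in git's output (no shell strips
    # them); the pattern in release_note_lines() relies on the closing one.
    fmt_string = ("'%s%n* [#{pr_num}]"
                  "(" + PROJECT_URI + "/pull/{pr_num}) - %b'")
    return subprocess.check_output([
        "git",
        "log",
        "--pretty=format:%s" % fmt_string,
        "--merges", "%s..%s" % (start_commit, end_commit)],
        universal_newlines=True)  # return text, not bytes, on Python 3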


def release_note_lines(msgs):
    """Parse the lines from git output and format the strings using the
    pull request number.
    """
    ptn = r"Merge pull request #(\d+).*\n([^\n]*)'$"
    pairs = re.findall(ptn, msgs, re.MULTILINE)
    return deque(body.format(pr_num=pr_num) for pr_num, body in pairs)


def release_header_line(version, release_date=None):
    release_date = release_date or datetime.date.today().strftime('%Y/%m/%d')
    return "## %s - %s" % (version, release_date)


def prepend(filename, lines, release_header=False):
    """Write `lines` (i.e. release notes) to file `filename`."""
    if op.exists(filename):
        with open(filename, 'r+') as f:
            first_line = f.read()
            f.seek(0, 0)
            f.write('\n\n'.join([lines, first_line]))
    else:
        with open(filename, 'w') as f:
            f.write(lines)
            f.write('\n')


if __name__ == "__main__":
    import argparse
    import datetime

    parser = argparse.ArgumentParser()
    parser.add_argument('start_commit', metavar='START_COMMIT_OR_TAG')
    parser.add_argument('end_commit', metavar='END_COMMIT_OR_TAG')
    parser.add_argument('--filepath', '-f',
                        help="Absolute path to output file.")
    parser.add_argument('--tag', '-t', metavar='NEW_TAG')
    parser.add_argument(
        '--date', '-d', metavar='RELEASE_DATE',
        help="Date of release for listed patch notes. Use yyyy/mm/dd format.")
    args = parser.parse_args()
    start, end = args.start_commit, args.end_commit
    lines = release_note_lines(commit_msgs(start, end))

    if args.tag:
        lines.appendleft(release_header_line(args.tag, args.date))

    lines = '\n'.join(lines)

    if args.filepath:
        filename = op.abspath(args.filepath)
        prepend(filename, lines)
    else:
        print(lines)

To view the output in stdout, at the command line type:

$ ./release.py 1.7 HEAD

Or, specify an output file:

$ ./release.py 1.7 HEAD -f ./RELEASE\ NOTES.md

Conclusion

One additional step I took is to create a git alias for the git log command, but prettied up a bit, for when I want to just scan through the differences from one version to the next. If you'd like to do the same, add the following to the [alias] section of ~/.gitconfig:

lm = log --pretty=format:'%Cred%h%Creset %C(bold blue)<%an>%Creset \
  -%C(yellow)%d%Creset %C(bold cyan)%s %Cgreen(%cr)%n%Creset%n - %b%n' \
  --abbrev-commit --date=relative --merges

You can also achieve the same effect by entering the following at the CLI:

git config --global alias.lm "log --pretty=format:'%Cred%h%Creset \
  %C(bold blue)<%an>%Creset -%C(yellow)%d%Creset %C(bold cyan)%s \
  %Cgreen(%cr)%n%Creset%n - %b%n' --abbrev-commit --date=relative --merges"

(The escaped newlines aren't necessary; I'm only including them to keep the line length down on the page.)

Please leave a comment if you have questions or spot an error. Thanks.

How to Run a Windows Service As A Linux Daemon

2012-10-19T00:00:00Z

Premise: You've got a Windows service that you want to run on a Linux server

Problem: Your code is written using the .NET framework and some language that targets the CLR (C#, VB, Clojure-CLR, etc.)

Solution: Mono is an open-source implementation of the .NET framework. By installing mono you gain access to a ton of useful stuff, but the relevant item here is the mono-service executable. (Installing mono is out of the scope of this blog post, but odds are pretty good mono is available from your distro's package management system.)

Once installed, you can run your compiled code like so:

mono-service SomeExecutable.exe

By default, this creates a lockfile in /tmp. You can change this by using the -l:<lockfile> option. This is great, because now your service is running in the background! However, this is really flimsy; what if the process dies? What if the server needs to be rebooted? To solve this I'm using supervisor.

Get It Running In 4 Steps

Once you've got supervisor and mono installed, follow these steps:

1. Create a supervisor file in /etc/supervisor/conf.d/ with a descriptive name. We'll use mysvc.conf.

2. Edit mysvc.conf so it looks similar to this (see footnotes 1 and 2):

[program:mysvc]
command=mono-service MyWindowsService.exe --no-daemon
directory=/path/to/executable
user=someuser
stdout_logfile=/home/someuser/mysvc/out.log
redirect_stderr=true

3. Run sudo supervisorctl update. This will reload the config file you edited above.
4. To confirm that your process started, run ps aux|grep mono. You should see it in the process list.
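If you'd rather script the setup, here is a rough Python sketch that renders the same config block and makes sure the log directory exists first (see footnote 1). The function names render_conf and ensure_log_dir are made up for this example, and the values are the placeholders from the config above; writing the file into /etc/supervisor/conf.d/ and reloading supervisor are still up to you.

```python
import os

# Same keys as the mysvc.conf example above; the values are filled in below.
CONF_TEMPLATE = """[program:{name}]
command=mono-service {executable} --no-daemon
directory={directory}
user={user}
stdout_logfile={logfile}
redirect_stderr=true
"""


def render_conf(name, executable, directory, user, logfile):
    """Return the text of a supervisor [program:x] section."""
    return CONF_TEMPLATE.format(name=name, executable=executable,
                                directory=directory, user=user,
                                logfile=logfile)


def ensure_log_dir(logfile):
    """Create the directory that will hold the logfile if it is missing."""
    log_dir = os.path.dirname(logfile)
    if log_dir and not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    return log_dir


if __name__ == "__main__":
    print(render_conf("mysvc", "MyWindowsService.exe",
                      "/path/to/executable", "someuser",
                      "/home/someuser/mysvc/out.log"))
```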

Conclusion

Hope this helps. Supervisor has a ton of different options for configuring how a process runs; it's worth it to RTFM.

Footnotes

1. The directory specified in your stdout_logfile parameter must already exist. If you try to start the mysvc process without creating it, supervisor will throw an error. Also, the user parameter should be set to a user that has permissions to write to the directory where you're keeping the stdout_logfile. Please consult the relevant supervisor docs for more about users & processes.

2. You must use the --no-daemon flag. It keeps mono-service in the foreground instead of letting it detach, which is what allows supervisor to manage the process and capture/redirect stdout/stderr to a logfile.

Larry the Software Guy

2012-10-05T00:00:00Z

Anil Dash published a blog post today I think is a victim of a bad title: "The Blue Collar Coder." I normally skim over the "Is programming an art, craft or science?" discussions but there were a couple of very smart programmers discussing it on Twitter, and I joined in. During the conversation, I vacillated between agreeing with Anil's proposition and agreeing with Alex Feinberg.

I think the title is poor because programming will never be "blue collar." Anil knows that; he more or less admits it was basically caste-baiting in the first sentence of the final paragraph. Unfortunately, I think people reacted to the notion of a programmer being considered "blue collar" more than the real points I think he was trying to make. The tl;dr of Anil's blog post seems to be:

1. A CS degree is overkill for most job openings
2. The "tech community" (??) should be focused on creating lots of jobs, not entrepreneurship
3. Huge amounts of good for people & business can be done by creating a vocational training program for software development

I don't even want to touch (1) because people seem to have such ridiculously strong feelings one way or the other (and possession of a CS degree seems to be no indicator of which way those feelings will go). I don't have a CS degree, and I am enjoying my career. I recognize though that in a few years maybe I'll be bored of the nature of problems I'm working on and maybe getting that degree would have been a smart move after all. In other words, I don't have an opinion on this because I don't know what I don't know.

The second point is eyeroll-worthy, in my opinion, because I think the impression that the "tech community" is hyper-focused on producing "the next Zuckerberg" is the result of Hacker News's own "reality distortion field" about startups. Hacker News is the modern equivalent of a sweaty, manic Steve Ballmer trying to pump up a room full of nerds, but instead of "Developers! Developers! Developers!", HN is chanting, "Startups! Startups! Startups!" But what're you gonna do? HN exists for a very specific reason: startup news. Point being that it's not good or bad that this reality distortion effect exists, but you have to seek other perspectives.

I agree with the third point. Full stop. My SWAG (pretty light on the "S") is higher ed could serve more people with lower per-person costs, deliver employees to the job market with high skills, while maintaining/building a reputation as a high-quality institution by offering associate's degree & certification programs in software development compared to the current BS/BA in CS.

This is where I think Anil's points get lost because of the title, illustrated by something Alex F. wrote:

There will be demand for "non-programmers who code" for sure, but these positions will still require analytical thinking.

Maybe I'm misreading it, but the implication seems to be that "blue collar" implies work where analytical thinking is optional. There's no less analytical thinking in e.g. managing inventory, building windmills, etc. My opinion, based on my military experience, is that there are many smart and savvy people out there with great analytical abilities, who couldn't get into or complete a CS degree. For these people an associate of applied science degree or 1-year certificate in software development would be FAR more accessible. Not only that, I'd wager the distribution of skill among graduates would look pretty close to that of most CS programs. What I'm saying is, in my short time doing this I've met some dumb/bad/lazy programmers with CS degrees from universities with respected programs.

Now obviously I don't do much manual labor anymore, but I'm proud of and enjoy the maintenance work I do. Most programming IMO boils down to the equivalent of "blue collar" work: refactoring code you or someone else wrote; patching over and smoothing out ugly spots; squashing bugs that have been around so long they're just considered part of the product. This isn't something I'm claiming I discovered by the way; this is a conclusion other people have drawn that is supported by my own anecdotes.

REST API for search results

2012-02-07T00:00:00Z

Updated: So after talking with the author of Tastypie I added the SearchDeclarativeMetaclass and SearchOptions to handle inheritance of the metaclass attributes on SearchResource. I almost entirely copied his ModelDeclarativeMetaclass and it works well. In-house, we further subclass SearchResource to model our job postings data in our search index, and it works great.

So, first things first: django-tastypie is pretty great. If you're running a Django web application and want to expose your data via a REST API, tastypie will do it. I got everything up-and-running in just a few hours (95% reading, 5% writing).

Tastypie -- written by Daniel Lindsley, the guy behind django-haystack -- uses a Resource class to handle all the API hairiness; it comes with a ModelResource subclass out of the box to provide an interface to a Django model & the ORM. If you want a better explanation, or want to know more, go read the docs.

Speaking of the documentation, there is an example Resource subclass in the docs' cookbook, though that was more about adding search to an existing resource. We want to serve resources -- i.e. Solr documents -- exclusively from Lucene. Our resource is literally a document from the search engine, so we needed a class to model that behavior. (You can read more about how we use Solr here.) To accomplish this, I put together this SearchResource subclass which others may find useful.

If you use Haystack, you know that it goes to great lengths to emulate the API of Django's ORM to provide a familiar interface to the search index. In that vein, SearchResource emulates the ModelResource class.

One issue we have in-house is that there are in some cases discrepancies between the semantics we want to expose as part of our API and the fields we're going to be leveraging to look up resources. To address that, I created a map of querystring parameters to the actual fields in the search index in which their values would be sought:

class JobSearchResource(SearchResource):
    field_aliases = {
        'city': 'city_exact__exact',
        'state': 'state_exact__exact',
        'country': 'country_exact__exact',
        'company': 'company_exact__exact',
        'title': None,
        'date_new': None,
        'uid': None
    }

    # <snip declared fields>

    def __init__(self, **kwargs):
        super(JobSearchResource, self).__init__(**kwargs)
        self._meta.index_fields = self.field_aliases.keys()

We use field_aliases.keys() to populate index_fields, so now we need to add in logic to look up those keys and replace them in the query logic with the fields we actually want to search against. In this case, we want to search against (country|state|city|company)_exact, which, if you're familiar with Lucene, are stored, unanalyzed fields. We use Haystack's __exact lookup which has the effect of turning the term query into a phrase by wrapping it in quotes, e.g. q=country_exact:"United States". We don't want tokenized field lookup because we don't want to match, say, "United Kingdom" when we are looking for "United States" due to the match on "United." (There are a million ways to do this of course, but this is how we chose to do it.)

Now we need to override SearchResource.build_filters:

# Needs `import operator` at module level (plus
# `from functools import reduce` on Python 3).
def build_filters(self, filters=None):
    terms = []

    if filters is None:
        filters = {}

    for param_alias, value in filters.items():

        if param_alias not in self._meta.index_fields:
            continue

        # An alias mapped to None means "search the field of the same name".
        param = self.field_aliases.get(param_alias) or param_alias # <---
        tokens = value.split(self._meta.lookup_sep)
        field_queries = []

        for token in tokens:

            if token:
                field_queries.append(self._meta.query_object((param,
                                                              token)))

        terms.append(reduce(operator.or_,
                            filter(lambda x: x, field_queries)))

    if terms:
        return reduce(operator.and_, filter(lambda x: x, terms))
    else:
        return terms

Note the line with the commented <---: This is where the alias->index field translation takes place. If you find yourself with a need to alias search fields this may be a solution for you.
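For anyone who wants to poke at the aliasing idea outside of Tastypie and Haystack, here is a self-contained sketch of the same shape of logic. The Term class is a hypothetical stand-in for a query object such as Haystack's SQ, the comma separator is an assumption (the real code uses self._meta.lookup_sep), and build_query is a made-up name; none of this is the actual SearchResource code.

```python
import operator
from functools import reduce


class Term(object):
    """Hypothetical stand-in for a composable query object."""

    def __init__(self, expr):
        self.expr = expr

    def __or__(self, other):
        return Term("(%s OR %s)" % (self.expr, other.expr))

    def __and__(self, other):
        return Term("(%s AND %s)" % (self.expr, other.expr))


# Querystring parameter -> index field; None means "same name as the alias".
FIELD_ALIASES = {
    "city": "city_exact__exact",
    "state": "state_exact__exact",
    "title": None,
}


def build_query(filters, aliases=FIELD_ALIASES, lookup_sep=","):
    """OR together the values for one field, then AND the fields together."""
    terms = []
    for alias, value in sorted(filters.items()):
        if alias not in aliases:
            continue  # ignore parameters the API does not expose
        field = aliases[alias] or alias  # the alias -> index field step
        tokens = [token for token in value.split(lookup_sep) if token]
        if not tokens:
            continue
        field_terms = [Term('%s:"%s"' % (field, token)) for token in tokens]
        terms.append(reduce(operator.or_, field_terms))
    return reduce(operator.and_, terms) if terms else None


if __name__ == "__main__":
    query = build_query({"city": "Austin,Dallas", "state": "Texas"})
    print(query.expr)
```

Unlike the method above, this sketch returns None when nothing matches and skips parameters whose value is empty, which avoids calling reduce on an empty list.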

Finally, I made the decision to force some additional configuration overhead -- about 5 attributes on the metaclass -- in order to completely preserve the amazing extensibility of Haystack. I know that in-house we subclass just about everything from Haystack, including the SearchQuerySet; I assume there are others out there doing the same, and more, so you are not forced to use Haystack's built-in SQ object to compose query trees if you've created your own. (If you have I'd be curious to see it.)

Let me know in the comments if you have any problems, spot bugs or think I'm an idiot.

Displacing MySQL with...Solr?

2011-12-29T00:00:00Z

We recently completed a big refactor at work, the intent of which was implementing search for one of our products, a Django-based web CMS called DirectSEO. It did not take long, however, to realize that by choosing Solr as our search backend, we had the opportunity to make some much-needed optimizations. Now, after analyzing three weeks' worth of data related to the refactor, I can say the time investment has yielded real, measurable gains. They came mainly from removing some very expensive database calls from our views, then fetching the same data via calls to the Solr index. This resulted in a simplified code base and decreased page-load times. This post is intended to explain a bit about our approach to leveraging Solr's feature set.

(This is my first truly technical post so I'm sure I'm leaving things out, or explaining poorly. Please contact me or leave comments if I didn't cover something in enough detail or if you've got any questions.)

Some Background

As part of their membership in DirectEmployers, member organizations are provided with a job board on a domain of their choosing to present their job listings in an SEO-friendly way. These sites often live on the .jobs TLD; however, members can -- and often do -- use subdomains of their own site for their job board. An example of each: Lockheed-Martin (.jobs); Arrow Electronics (other).

How It Works

The job boards are generated dynamically. Members give us some basic information -- header images, brand colors, and so forth -- which we use to create a site configuration. This configuration is then referenced to look up all the jobs associated with a particular member organization. Sometimes, a member organization may have multiple job sites catering to specific job categories: IBM Brazil or Lockheed-Martin InfoSec, for example. In these cases, the corpus of jobs for that member organization is then refined to only include jobs which fall into that category.

From here, users can drill down into the jobs using standard navigation links which we generate based on facets for title, location and custom facets we call Saved Search (not to be confused with saved-searches).

Implementation Details

Simply put, we use Django to deal with MySQL, and we use Django-Haystack to deal with Solr. We run our own fork of Haystack, which capitalizes on some hacks in my own fork of pysolr.

Our saved-search app gives our members a way to create and maintain persistent, user-defined queries. In practice we use these to create sites like the aforementioned Lockheed-Martin InfoSec. They also give our members the ability to create custom job verticals. Hilton has saved searches built around departments; Unilever has a saved search for "hot jobs" they want to fill quickly.

Architectural Aside

A problem arises, however, when a site has a lot of saved searches. But to understand the problem, I should explain a little bit about how our data is stored in the database and how it gets indexed.

Each job listing is a row on our joblisting table. This is currently the only table Solr indexes. Haystack uses a module called search_indexes.py to set the parameters in schema.xml. In it, we specify model fields to index directly, plus several fields Haystack calls "prepared fields," which contain denormalized or calculated data. Native model fields like title, state, country, etc., can be used to create facets. Facets are what you see under "Filter by (Title|City|State|Country)" here. Something like the below snippet will return all the values for those fields along with counts of each (which is what faceting is):

sqs = SearchQuerySet().facet('title_slab').facet('city_slab')\
                      .facet('state_slab').facet('country_slab')
facet_counts = sqs.facet_counts()['fields']

("slabs" are calculated fields such that the city_slab field would have a format like:

"/manassas/virginia/usa/jobs/::Manassas, VA"

We use these to precalculate URL segments in the index so we can keep string manipulation to a minimum in the application. We split on "::" and handle those substrings as needed.)
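As a rough illustration of that split, here is a small sketch that turns facet results into navigation-link data. It assumes each facet entry is a (value, count) pair, which is how Haystack reports field facets; the helper names and the count of 12 are made up for the example.

```python
def split_slab(slab):
    """Split a slab into its URL path and its display label."""
    url_path, _, label = slab.partition("::")
    return url_path, label


def nav_links(facet_pairs):
    """Turn (slab, count) facet pairs into dicts a template could render."""
    links = []
    for slab, count in facet_pairs:
        url_path, label = split_slab(slab)
        links.append({"href": url_path, "text": label, "count": count})
    return links


if __name__ == "__main__":
    # Shaped like facet_counts['city_slab'] from the snippet above.
    city_facets = [("/manassas/virginia/usa/jobs/::Manassas, VA", 12)]
    for link in nav_links(city_facets):
        print("%(text)s (%(count)d) -> %(href)s" % link)
```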

However, since saved searches are ad-hoc filters that can be composed of any permutation of index fields, they cannot be properly faceted. This means that to get counts of job listings for each saved search, we'd normally have to perform a single HTTP request for each.

To circumvent this costly routine, I hacked up pysolr to implement support for Solr's field collapsing/group query functionality, then wrote a backend to support it. The effect is that for n saved searches configured for a particular site, only one query is required; the saved search concept would otherwise involve far too many HTTP requests to be practical.
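To give a feel for the "one query instead of n" idea, here is a sketch that builds the querystring for a single Solr request carrying several group.query parameters, one per saved search. This is not the pysolr hack or the Haystack backend described above, just an illustration of Solr's result-grouping parameters; the saved-search queries and field names (department, hot_job) are invented.

```python
from urllib.parse import parse_qs, urlencode


def grouped_search_params(base_query, saved_searches):
    """Build one querystring that asks Solr to group by several queries."""
    params = [("q", base_query), ("group", "true")]
    # One group.query per saved search, all sent in the same request.
    params.extend(("group.query", query) for query in saved_searches)
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical saved searches for one site.
    saved = ["department:engineering", "hot_job:true"]
    querystring = grouped_search_params("*:*", saved)
    print(querystring)
    print(parse_qs(querystring)["group.query"])
```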

Haystack & Solr Setup

On the Python side, we use Haystack's RealTimeSearchIndex class as the basis for our index. In short, it's exactly the same as the SearchIndex class, but with post-save/delete listeners for the joblisting table. It gets us as close as we really need to get to ElasticSearch-style real-time search. While Solr 4.0 is going to have "near real-time" search, it's just not a feature we have a need for now. If that changes in the future, we'll re-evaluate.

For Solr, we run two servers in a master-slave configuration. The master handles the real-time updates. The (read-only) slave handles all the queries, and is set to do replication checks every 60 seconds. The side effect of this is that when the master is handling a large volume of updates, average query response time by the slave slows by 50-75ms. For comparison, it normally takes around 200ms for our application to calculate and return an HTTP response.

The one caveat for using Solr in this way is that unlike some other document databases, there is absolutely no notion of relations whatsoever. Plus, obviously, it wouldn't be responsible to use Solr as a primary datastore (A good read on why can be found in this response on SO).

Performance & Reliability

Performance has improved measurably, especially on pages with a lot of jobs, a lot of facets and a lot of saved searches. Some very costly SQL queries have been eliminated. By utilizing Solr's query-tuning parameters like facet.mincount, start and rows, we've kept the amount of data transferred per request low. Using Solr to power saved searches eliminates a lot of complexity from our code base.

Getting data reliability right has taken longer, involving some diligent bug-hunting. I've spent the past four months learning about how Solr works, how to intelligently leverage Haystack's API, and implementing some features of Solr in Haystack that aren't included out-of-the-box. It is important to keep in mind that a Solr match is not necessarily binary. A thing might match, it might not, but more likely it will "kinda" match. Tightening up queries as needed is vital if you want exact results only. One of my big hurdles in getting this working right was making sure matches were fuzzy where they should be fuzzy, and exact where they should be exact.

Finally, I think that as we add more features to our application, we'll have to start putting standard RDBMS queries back into play in some areas. For the past 3 months I've been rewiring a Django application, cutting out the old relational stuff and replacing it with simpler, faster methods. It is a dramatic shift. As time goes on we'll be building out more features that will require relational information.

Conclusion

Utilizing Solr in this way is both ordinary and novel. It's novel because when people think of Solr, they think of a search box with a button that says "Search". You click on the button and get results. It's ordinary because Solr is, after all, a document database. It stores documents in a flat structure, and you compose queries to retrieve them. Not exotic, unusual or special in any way. In a use case such as ours, however, where the need for relations is minimal and practically all of our content is generated based on text searching, Solr is great.

How I Became a Programmer

2011-11-23T00:00:00Z

I posted a very brief response to a post on HackerNews yesterday challenging the notion that 8 weeks of guided tutelage on Ruby on Rails is not going to produce someone who you might consider a "junior RoR developer." It did not garner many upvotes so I figured that like most conversation on the Internet it faded into the general ambient chatter. Imagine my surprise when I woke up to a couple handfuls' worth of emails from around the world asking me what I did, how I did it, and how I got a job. I'm assuming, judging by the relatively small amount of mail I got from a random aside on HN*, that there must be a lot of people who are trying to figure out how to pursue a career in programming.

First, A Disclaimer or Two

Please note that this blog post is entitled, "How I Became a Programmer", not, "How You Can Become a Programmer." I'm not a self-help guru or wise or even a particularly good programmer. I did, however, decide at an inflection point in my life to pursue something vigorously and it paid off. Any insights gleaned from my experience are yours to make on your own; I doubt I'll have much insight for your personal situation.

Also, after consulting with my girlfriend, my total time of dedicated effort to becoming a paid programmer was actually about 12 weeks, not ~10 as I stated in the post I linked to above. So, there you go.

My Story: tl;dr

In brief: I left the Marine Corps after more than a decade in July 2010. I got a job at the state lottery as a PR flak in August of that year, and lost it in mid-February. In mid-May I got hired as a part-time "junior User Experience engineer" at DirectEmployers Association. By late August I was a full-time, regular old "User Experience engineer."

When I lost my job I decided that I was done doing PR; I wanted to be a programmer. I took my tax return and stretched it out on a ramen and water diet. My family (dad, mostly...) was nervous as hell. In that February to May span I spent basically every waking moment learning to program, learning about Linux, and learning about computer science. I taught myself Python, I taught myself Django, I learned some functional and imperative programming, and got semi-decent at the Linux command line.

Voila. Without further ado, I'm going to write about what I didn't do, then dive into the questions I got via email.

What I Didn't Do

One of the things that was asked in almost every email was, "How did you learn Django in 11 weeks?"

I want to make it clear that I didn't set out to learn Django per se. Django is just a very nice toolkit of abstractions that makes creating web applications easy using Python. As far as I'm concerned learning Django was incidental to learning to program. I did not -- and still don't -- want to be considered a "Django developer." I'm not even sure I want to refer to myself as "a Python programmer."

In other words, I do not feel that I would be as modestly competent as I am today if I had spent an inordinate time becoming an expert at the abstraction layer of Django, instead of learning the concepts that make Django work.

Questions From Email

Did you begin with web or book resources?

Yes I did. :) Django has excellent documentation, but StackOverflow is a much more comprehensive help source. On more general topics, I believe that MIT's OpenCourseware Introduction to Computer Science video lecture series was one of the first real computer science resources I consumed. I watched through lecture 13 or something.

What kind of hours were you putting in on a daily and weekly basis?

A lot. Sometimes 8, sometimes 12, sometimes 16. I was a willfully unemployed single parent, so I not only had a passion for programming, I was also hungry (figuratively speaking) and desperate. I put myself in a position where I had no room to be lazy or complacent. I think above all else that made me work 10x harder. I didn't play video games, I didn't watch TV, I didn't sleep all day. All I did all day every day was code, hack, program and develop.

Did you have a mentor of any kind?

I did indeed. A very smart guy was and is my mentor still, though I've learned enough that I don't rely on him as much for guidance as I used to. He mentored my metamorphosis into a programmer in nearly every way. Some specific ways he provided leadership: Practical programming knowledge (especially Python & Django); command-line expertise; got me up-and-running with emacs & vim; career advice. It helps that he is a very successful & well-respected guy who has a reputation for informed skepticism.

Was there anything from your previous background and experience that you feel was a particular asset in your self-guided studies?

Not really. I was a computer geek from way back, had a few BBSes in the late 80s (yes, I'm a child of the 80s & 90s), learned QBasic & VisualBasic back in the day, and tinkered with Python for a few years off and on... mostly off. Other than that, nope.

How did you come to choose Django to study?

The guy whose career I was trying to emulate had made a very successful career for himself with Django. Pretty straightforward from there.

Would you mind sharing your learning process?

I want to restate that I am not a self-help guru or particularly special in any way. I just worked hard because I was hungry and in a self-made corner where I had no choice but to succeed. I consumed everything I could that would get me to a place where I could make money doing something I love. That was my learning process. Seriously.

I would appreciate it if you can show me how you learned Django and give me any tips/tricks sites/books to look at to learn Django or even HTML/CSS, JavaScript (Front-end Engineering stuff)

I don't have any tips or tricks to learning except just doing it. I spent a lot of long (but enjoyable) hours learning stuff.

As I said above, I did not and do not consider it fruitful to "learn Django," "learn Ruby on Rails," or "learn Noir." I think a contributor to my success was learning the languages and the concepts behind them, then using a web framework to better learn that language. I learned the framework incidentally to my education in the language.

Go read the Django docs, join #django on irc.freenode.net and ask questions constantly. That's what I did and it worked ok for me. But honestly I didn't just sit down and read stuff most of the time. Usually I was making things in order to learn concepts better, then reading in support of my goals. I'm a hands-on learner. Some people aren't, but I am so it worked for me. Decide on your own if that's good for you.

As far as HTML & CSS there is just so much information out there, and they're such straightforward concepts. I learned as much HTML & CSS as I needed to do what I needed to do. I did not memorize much about how HTML & CSS work, i.e. syntax & semantics. I don't know right off the top of my head how to create a gradient, but I do know right off the top of my head how to find out. I think that's the important thing.

How did you show the company your skills? Did you show them the projects you've made?

GitHub, GitHub, GitHub. I can't emphasize it enough. Make stuff, put it on GitHub, show people you're passionate and smart and curious.

Also, network. Attend meetups. Meet people. Tweet. Blog. Interact with the community around your language(s). Get to know people. Demonstrate to the world that you really love programming. The week before I saw the job posting for my first programming job I delivered a lightning talk on Fabric, Python's Capistrano analog. That got me on a few people's radar.

Conclusion

If I had to summarize the big overview of how I did what I did, I'd say:

Ask questions, be curious, be passionate
Learn a language, not a web framework for god's sake.
Work hard
Network, attend meetups, tweet, blog, be social and show people you'd be fun to work with, and a credit to the team.
(Optional) Put yourself in a position of desperation, so there is no choice but to succeed

My final point really is that I got lucky. I'm not an amazing developer. At the end of the day I'm a newb and I still have a lot to learn. My career is just beginning but I am proud of the effort I put into changing my life. I hope my experiences can help some other folks.

* I should note that I was already of a mind to blog about this since my cousin Jeff has also taken up programming after leaving the environmental consultancy business.

Export ALL Your Facebook Photos Easily

2011-07-01T00:00:00Z

It's no secret that Google+ is gaining new users as fast as the acceptance pipeline will let invitees click "Make me an account."

I love G+, and am thrilled that someone has finally, IMO, smashed Facebook's reign as top dog. There's been a poverty of choice for years when it comes to the social stuff. Google has hit it out of the park. If you are undecided about trying out G+, do it. It's well worth it.

At any rate, on to why I'm writing. If there's a way to download all your Facebook photos in one fell swoop, I don't know what it is. Of course, I don't use Facebook apps or anything, so I'm sure there's something there. It's just easier for me to write it myself.

The script below downloads all of your pictures from your Facebook account and stores them in whatever directory you specify (default is your current working directory). Additionally, it creates a subdirectory for each album and tucks each photo into the appropriate subdir. This way, when you go to upload them to Picasa, you can just create whatever Picasa folder, and just "select all" in a particular album subdirectory for easy uploadin'.

I guess I could plug this in to the Picasa API, and may do so this weekend.

import optparse
import os
import re
import subprocess
import sys
import urllib2

import facepy

from mytoken import token, username

def get_photos(dl_dir):
    dest = os.path.abspath(dl_dir)
    p = re.compile(r"[,!'\ /]")
    fb_photos = find_photos()
    for album in fb_photos:
        albname = p.sub("_", album).lower()
        mk_album_dirs(dest, albname)
        folder = albname
        for img_url in fb_photos[album]['images']:
            img_name = img_url.split('/')[-1]
            url = urllib2.urlopen(img_url)

            # Images are binary data, so open the file in binary mode.
            with open("%s/%s/%s" % (dest, folder, img_name), 'wb') as f:
                meta = url.info()
                filesize = int(meta.getheaders("Content-Length")[0])
                #print "Downloading: %s Bytes: %s" % (img_name, filesize)
                filesize_dl = 0
                blocksize = 8192
                while True:
                    buff = url.read(blocksize)
                    if not buff:
                        break

                    # Count the bytes actually read; the last block is
                    # usually shorter than blocksize.
                    filesize_dl += len(buff)
                    f.write(buff)
                    status = r"%10d [%3.2f%%]" % (filesize_dl,
                                                  filesize_dl * 100. / filesize)
                    status = status + chr(8)*(len(status)+1)
                    #print status,

def find_photos():
    '''
    Creates a dictionary, with album id as key and a list of images
    in the album as the value.
    '''
    albums = {}
    graph = facepy.GraphAPI(token)
    my_albums = graph.get("%s/albums" % username)
    for album in my_albums:
        albums[album['name']] = {}
        albums[album['name']]['id'] = album['id']
        my_pics = graph.get("%s/photos?limit=100" % album['id'])
        albums[album['name']]['images'] = [pic['source'] for pic in my_pics]
    return albums

def mk_album_dirs(dest, album):
    '''
    Create a subfolder for each facebook album.
    '''
    if not os.path.exists("%s/%s" % (dest, album)):
        os.mkdir("%s/%s" % (dest, album))
    return

if __name__ == "__main__":
    d = os.getcwd()
    parser = optparse.OptionParser()
    parser.add_option("-d", "--dest", action="store", type="string",
                      dest="dest_dir", default=os.getcwd(),
                      help=("Specify the directory where you want your photos t"
                            "o be downloaded. Photos will be downloaded to cur"
                            "rent working dir by default."))
    args = sys.argv[1:]
    (options, args) = parser.parse_args(args)
    get_photos(options.dest_dir)
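The two fiddly bits of the script, turning an album name into a directory name and copying a download block by block, can be tried without touching Facebook at all. What follows is only a minimal sketch in Python 3: album_dirname and copy_in_blocks are hypothetical names that do not appear in the script above, the album name is made up, and an in-memory BytesIO stands in for the urllib2 response.

```python
import io
import re

# Same characters the script replaces: comma, bang, apostrophe, space, slash.
UNSAFE = re.compile(r"[,!' /]")


def album_dirname(album):
    """Turn an album title into a lowercase, filesystem-friendly name."""
    return UNSAFE.sub("_", album).lower()


def copy_in_blocks(src, dst, blocksize=8192):
    """Copy src to dst one block at a time and return the bytes written."""
    written = 0
    while True:
        buff = src.read(blocksize)
        if not buff:
            break
        dst.write(buff)
        # Count what was actually read, not the nominal block size.
        written += len(buff)
    return written


if __name__ == "__main__":
    print(album_dirname("Summer Trip, 2010!"))
    fake_download = io.BytesIO(b"x" * 20000)  # stand-in for a real response
    saved = io.BytesIO()
    print(copy_in_blocks(fake_download, saved))
```

Counting len(buff) rather than the block size keeps the running total equal to the real file size, which matters if the commented-out progress line is ever switched back on.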

Changing Careers at 31

2011-06-17T00:00:00Z

I won't bury the lead: About a month ago, I got my first job as a programmer after years of working in PR and marketing.

As I noted here, I spent this spring as a "stay-at-home dad," and spent practically every waking moment becoming a better programmer, with the intent of joining the ranks of professional hackers and getting an awesome job making awesome things. Well, a few days after my last blog post, an acquaintance I'd made through a local Python meetup tweeted a job opening. I responded, interviewed, and amazingly enough, got the job.

I should point out that I live in Indiana. Development jobs using Python are extremely rare, and one using Django is rarer still. In fact, as far as I know, I may very well have snagged the only job in Indiana that offered the opportunity to work with both Python and Django.

I consider myself very fortunate. It is a great place to work, with smart people, and every day I do interesting things. Every day I learn something new. Working with geeks is very different than working with marketers. My boss's bookshelf is filled with books like *Leading Geeks*. When I talk about something I read on HN, there's a conversation, not a bunch of blank stares.

Though I get up at 5:30am to get Emma off to day camp and drop my girlfriend off downtown for her classes at IUPUI, I practically bounce out of bed. I love going to work. I'm a little disappointed when I have to go home for the night. Putting in those long hours reading and hacking has paid off. Best decision ever.

If you're curious, at work I'm working on deployment automation. It's not super sexy objectively speaking, but I feel like I've achieved a moderate level of expertise with Fabric. Plus, it has been a great way to learn the ins and outs of the various systems we use at work. Eventually I hope to roll it up into the Django admin panel and make provisioning and deployment as easy as clicking a few radio buttons.

Chebyshev polynomials in LaTeX

2011-05-13T00:00:00Z

I'm recovering from an obsession with Chebyshev polynomials. Despite the fancy title and somewhat-intimidating definition, Chebyshev polynomials are actually a fantastic shortcut -- relative to what we're taught from the book -- to expanding trigonometric multiple-angle expressions like cos(6x).
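For reference, here is the identity behind that shortcut, worked for the cos(6x) case. These are the standard textbook definitions, not something taken from the script in this post:

```latex
\[
T_n(\cos\theta) = \cos(n\theta), \qquad
T_0(x) = 1, \quad T_1(x) = x, \quad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)
\]
\[
T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1
\;\Longrightarrow\;
\cos(6x) = 32\cos^6 x - 48\cos^4 x + 18\cos^2 x - 1
\]
```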

I was originally going to write a script that calculated the Chebyshev polynomials, but when I learned Python's SciPy library already has a function, I "pivoted." Instead I wrote the script below, which calculates the polynomial using scipy.special.orthogonal.chebyt(), then creates a LaTeX-formatted string representation of the equation. For example, the output for the ninth-degree Chebyshev polynomial is rendered thusly:

Here's the code, it should be pretty straightforward:

import sys
import math
from scipy.special import orthogonal as orth

def chebyTex(n):
    '''Returns a LaTeX-formatted string for a Chebyshev polynomial of
    order n.'''
    c = orth.chebyt(n)
    coeffs = []
    for i in c:
        if i >= 1 or i <= -1:
            coeffs.append(int(round(i)))
        else:
            pass

    pows = [coeffs.index(i)*2 for i in coeffs]
    pows.sort(reverse=True)

    # The only "magic" in this function is some string manipulation to
    # handle the LaTeX formatting for super- and subscript characters.
    arrays = zip(coeffs, pows)
    latex_string = 'T_{%s}(x) = ' % n
    for array in arrays:
        z = n-arrays.index(array)*2
        if arrays[-1] != array:
            latex_string += r'%sx' % array[0]
            latex_string += r'^{%s} + ' % z
        else:
            if not n % 2:
                latex_string += '%s' % array[0]
            else:
                latex_string += '%sx' % array[0]

    return latex_string


if __name__ == '__main__':
    s = chebyTex(int(sys.argv[1]))
    print s
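For comparison, here is a rough sketch of the script I originally had in mind: no SciPy, just the recurrence T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x) on plain lists of integers. It is written for Python 3, and cheby_coeffs and cheby_tex are illustrative names, not part of the script above. Unlike chebyTex(), it folds each coefficient's sign into the operator, so it prints "- 576x^{7}" rather than "+ -576x^{7}".

```python
def cheby_coeffs(n):
    """Return the integer coefficients of T_n, lowest power first."""
    prev, cur = [1], [0, 1]  # T_0 = 1, T_1 = x
    if n == 0:
        return prev
    for _ in range(n - 1):
        # 2x * T_k: shift every coefficient up one power and double it.
        nxt = [0] + [2 * c for c in cur]
        # Subtract T_{k-1}, term by term.
        for i, c in enumerate(prev):
            nxt[i] -= c
        prev, cur = cur, nxt
    return cur


def cheby_tex(n):
    """Return a LaTeX-formatted string for T_n, highest power first."""
    terms = []
    for power, c in reversed(list(enumerate(cheby_coeffs(n)))):
        if c == 0:
            continue
        mag = abs(c)
        if power == 0:
            body = str(mag)
        else:
            coef = '' if mag == 1 else str(mag)
            var = 'x' if power == 1 else 'x^{%d}' % power
            body = coef + var
        terms.append(('-' if c < 0 else '+', body))

    sign, body = terms[0]
    out = ('-' if sign == '-' else '') + body
    for sign, body in terms[1:]:
        out += ' %s %s' % (sign, body)
    return 'T_{%d}(x) = %s' % (n, out)


if __name__ == '__main__':
    print(cheby_tex(9))
    # T_{9}(x) = 256x^{9} - 576x^{7} + 432x^{5} - 120x^{3} + 9x
```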

It would be trivial to connect to something like MathBin to pull down and store the resulting image, but that was beyond the scope of this little script.

Python-Powered Smash'n'Grab

2011-05-12T00:00:00Z

After watching and listening to my girlfriend wrestle with her course schedule for the fall semester, I got a "big idea" for another project. It's not ready to see the light of day, but suffice it to say it involves a better way of scheduling classes.

To start down the road of iterating on my project, I needed data. Specifically, I needed the schedule of every course offered by IUPUI in the fall: what days of the week each class was held, and at what times.

I assumed it would be as easy as sending a data request to the university helpdesk. I also assumed it would take 1-3 weeks for them to respond with the data. After all, they do make the schedule available as a PDF. That's clearly autogenerated, so they must have raw data sitting in a database somewhere, right?

The response I got was indecisive and confusing:

"Hi Matt,

I'm sorry but we currently don't have a way for students to obtain this type of information. Contact your instructor or department to see if they can provide a dataset for you.

Also, the IUPUI Registrar's website might help build your own dataset http://registrar.iupui.edu/schedule.html. Thanks,

SIS Help Desk"

(Emphasis added)

I sent a follow-up email asking what the bolded text actually means, but got nothing back. So instead of waiting, I decided to just make my own.

I used Python's lxml library to power a script that scrapes IUPUI's Schedule of Classes sub-site. Then the script builds a JSON document populated with the course data relevant to my project. The structure of the sub-site, thankfully, is RESTful, which made writing the logic much easier.

I won't bore you with the nitty-gritty whys and wherefores of the problems I ran into here (plus, my code is commented). scrapeDepts() initializes the JSON file and populates it with department names:

import string
import json
import time
import sys
import os
import re
import lxml.html as lh

jsonSched = 'sched.json'

def scrapeDepts():
    '''Scrape the departments and export to json.'''
    divMain = parse('http://registrar.iupui.edu/enrollment/4118/index.html')
    depts = [link.text for link in divMain.findall(".//a")]
    deptDict = {}

    for dept in depts:
        d = parse('http://registrar.iupui.edu/enrollment/4118/classes/%s/inde'
                  'x.html' % dept)
        crs = [{a.text.replace(' ', ''): {}} for a in d.findall(".//a") if
               a.text.startswith(dept)]
        deptDict[dept] = crs

    with open(os.path.abspath(jsonSched), 'w+') as f:
        json.dump(deptDict, f)

    return

scrapeCourses() is heavily commented for my own sanity. I've got probably more list comprehensions than I need, but they're more readable this way. Plus, it works, and the part of the process the list comprehensions handle isn't going to impact total run time in any appreciable way on a dataset this small.

def scrapeCourses():
    '''Scrape the courses for each department.'''
    item_counter = 0
    with open(os.path.abspath('sched.json'), 'r+') as f:
        deptDict = json.load(f)

    for key in deptDict.keys():
        # item_counter keeps a running tally of all the department and
        # course pages the parser touches. It increments once for a dept.
        # page, and once for each course page on the department.
        item_counter += 1
        for d in deptDict[key]:
            item_counter += 1
            try:
                # Some courses did not parse properly in scrapeDepts() so
                # I had to include this try/except block to handle
                # IOErrors.
                f = parse('http://registrar.iupui.edu/enrollment/4118/class'
                          'es/%s/%s.html' % (key, d.keys()[0]))
            except:
                continue

            # This is lxml syntax to find all <pre></pre> tags. `.//foo`
            # finds all <foo></foo> tags.
            pre = f.findall(".//pre")[0]

            # The text content for the <pre> tag on a given dept/course
            # web page comes through as an unformatted block of text. `t`
            # is a list comprehension that splits this block of text into
            # separate lines, including each separate line iff. it has at
            # least one character. This conditional is necessary because
            # splitlines() will include empty strings as lines. e.g.:
            #
            #    ['hello world', '', 'my name is matt', '', 'how are you']
            t = [l.strip() for l in pre.text_content().splitlines() if
                 len(l.strip()) > 0]

            # `lines` is a list comprehension to gather all the lines
            # from `t` that began with a digit. This is a heuristic
            # particular to registrar.iupui.edu.
            lines = [line for line in t if line[0] in string.digits]

            for line in lines:
                sid = 'session%d' % lines.index(line)
                d[d.keys()[0]][sid] = {'time': '',
                                       'days': ''}
                try:
                    # This regex matches string segments like:
                    #    '03:30P-04:45P     MWF'
                    # Exceptions are caused when a course is closed,
                    # or when the times of the class are TBD.
                    reg = re.search(r"(?P<time>\d+:\d+[AP]-\d+:\d+[AP]\W+[MTWRF"
                                    "]{1,5})", line)
                    dt = reg.group('time').split()
                    time = dt[0]
                    days = dt[1]
                    d[d.keys()[0]][sid]['time'] = time
                    d[d.keys()[0]][sid]['days'] = days
                except AttributeError:
                    d[d.keys()[0]][sid]['time'] = 'UNK'
                    d[d.keys()[0]][sid]['days'] = 'UNK'
                    continue
                except IndexError:
                    d[d.keys()[0]][sid]['time'] = 'CLOSED'
                    d[d.keys()[0]][sid]['days'] = 'CLOSED'
                    continue

    with open(os.path.abspath('sched.json'), 'w') as f:
        json.dump(deptDict, f)

    return item_counter
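The regex is the heart of scrapeCourses(), so it's worth poking at on its own. Below is a small Python 3 sketch that applies the same pattern to a single line. parse_session is a hypothetical helper name, and the sample lines are made up to resemble a schedule row rather than copied from the registrar's site. The fallbacks mirror the script: no match means 'UNK', and a match that doesn't split cleanly into time and days means 'CLOSED'.

```python
import re

# Same pattern as in scrapeCourses(), joined into one raw string.
TIME_DAYS = re.compile(r"(?P<time>\d+:\d+[AP]-\d+:\d+[AP]\W+[MTWRF]{1,5})")


def parse_session(line):
    """Pull the meeting time and days out of one schedule line."""
    match = TIME_DAYS.search(line)
    if match is None:
        # No time range on the line at all (e.g. times are TBD).
        return {'time': 'UNK', 'days': 'UNK'}
    parts = match.group('time').split()
    if len(parts) != 2:
        # Same case the script catches as IndexError.
        return {'time': 'CLOSED', 'days': 'CLOSED'}
    return {'time': parts[0], 'days': parts[1]}


if __name__ == '__main__':
    # Hypothetical rows, for illustration only.
    print(parse_session("12345  03:30P-04:45P     MWF   SL 108   Staff"))
    print(parse_session("12346  ARR               TBD"))
```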

parse() is a helper function for scrapeDepts() and scrapeCourses().

def parse(link):
    print >> sys.stderr, "Parsing %s" % link[-15:]
    ind = lh.parse(link)
    print >> sys.stderr, "Parsing complete. Fetching div#main"
    main = [div for div in ind.findall(".//div") if div.get("id") == "main"]
    print >> sys.stderr, "Fetch complete. Returning to main process."
    return main[0]

maketime() is basically the function I had been wanting to write in the first place, if I had been provided with some legit raw data. It flattens the JSON into a more manageable data structure: a list with one [course, session, time, days] entry per class session. Then, using the time library, it parses the string describing each course's start and end times into time.struct_time objects, and finally uses struct_time's attributes to turn each pair into a list of two integers.

def maketime():
    with open('sched.json', 'r') as f:
        sched = json.load(f)
    courses = []

    for k in sched.iterkeys():
        for i in sched[k]:
            for j in i.iterkeys():
                for h in i[j]:
                    courses.append([j, h, i[j][h]['time'], i[j][h]['days']])

    # Strip out all 'UNK' and 'CLOSED' courses: real times start with a digit.
    courses = [course for course in courses if course[2][:1].isdigit()]

    for course in courses:
        times = []
        for t in course[2].split('-'):
            # The schedule abbreviates AM/PM to A/P, so append 'M' before
            # handing the string to strptime's %p directive.
            y = time.strptime(t + 'M', "%I:%M%p")
            # Encode as an HHMM integer, e.g. 03:30P -> 1530.
            times.append(y.tm_hour * 100 + y.tm_min)
        course[2] = times

    return courses

if __name__ == '__main__':
    scrapeDepts()
    ic = scrapeCourses()
    print ic
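To make the shape of the data concrete, here is a small Python 3 sketch of what maketime() is after: flatten the nested JSON and turn a string like '03:30P-04:45P' into a pair of HHMM integers. to_hhmm and flatten are hypothetical names, and the sample schedule is invented; it only copies the department / course / session nesting that the scraper writes to sched.json.

```python
import time


def to_hhmm(span):
    """Convert a range like '03:30P-04:45P' to [1530, 1645]."""
    result = []
    for part in span.split('-'):
        # The schedule writes A/P; strptime's %p wants AM/PM.
        parsed = time.strptime(part + 'M', "%I:%M%p")
        result.append(parsed.tm_hour * 100 + parsed.tm_min)
    return result


def flatten(sched):
    """Return [course, session, [start, end], days] rows, skipping
    sessions whose time is 'UNK' or 'CLOSED'."""
    rows = []
    for course_list in sched.values():
        for entry in course_list:
            for course, sessions in entry.items():
                for sid, info in sessions.items():
                    if info['time'][:1].isdigit():
                        rows.append([course, sid, to_hhmm(info['time']),
                                     info['days']])
    return rows


if __name__ == '__main__':
    # Invented sample in the same nesting as sched.json.
    sample = {'MATH': [{'MATH16500': {
        'session0': {'time': '03:30P-04:45P', 'days': 'MWF'},
        'session1': {'time': 'UNK', 'days': 'UNK'},
    }}]}
    print(flatten(sample))
    # [['MATH16500', 'session0', [1530, 1645], 'MWF']]
```

Storing times as HHMM integers makes it cheap to compare two sessions for overlap later on, which is the kind of question a scheduling tool has to answer constantly.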

Turned out I was scraping about 2,850 individual pages to compile the data. Running this script took about an hour each time I ran it. At least now I'm past that and can move on with the rest of the project, which I hope to start this weekend.