Recent Blog Postshttp://mattdeboard.net/feed.atom2014-01-14T00:00:00ZRecent blog postsWerkzeugUsing git & Python to autogen changelogshttp://mattdeboard.net/2014/01/14/automatic-changelog-generation-with-git2014-01-14T00:00:00Z<a href="http://mattdeboard.net">Matt DeBoard</a><div class="section" id="background">
<h2>Background</h2>
<p>As part of the communication process at work, devs maintain changelogs for some of our projects. Each changelog is a single <cite>RELEASE NOTES.md</cite> file in the project root, where each line is a Markdown hyperlink to the pull request that introduced the change. These pull request links are grouped by release date. The changelog looks like:</p>
<pre class="literal-block">
## v1.7 2013/03/17
* [#100](https://github.com/courseload/project/pull/100) - Finalized previously preliminary stuff
* [#99](https://github.com/courseload/project/pull/99) - Did some preliminary stuff
## v1.6.4 2013/03/14
* [#98](https://github.com/courseload/project/pull/98) - Made dongles brighter.
* [#97](https://github.com/courseload/project/pull/97) - Improved widget performance by 3.8x
</pre>
<p>At first, devs created these by updating <cite>RELEASE NOTES.md</cite> in each pull request. This distributed the workload, but it also made having multiple open pull requests a big pain in the ass, since the same file, usually the same line in the same file, was being modified by every one of them. So we stopped that practice and moved to a hand-made <cite>RELEASE NOTES.md</cite> file, maintained by the projects' de facto primaries. Obviously this kind of work is sub-optimal and ripe for automation. For months, though, streamlining the process sat far down the priority list, until I just couldn't take it anymore.</p>
</div>
<div class="section" id="git-log">
<h2>git log</h2>
<p>When I automate a repetitive task like this, my goal is to write as little code as possible. In this case, that means massaging the output of <cite>git log</cite> to get as close to the desired final format of the changelog lines as possible. Every line in the changelog corresponds to a merged pull request, so the first step is to output only merge commits. We can do that with:</p>
<pre class="literal-block">
git log --merges
</pre>
<p>This is good, but it shows a lot of extra information I'd have to parse out. If you'll notice in my example above, the lines in <cite>RELEASE NOTES.md</cite> are formatted like <tt class="docutils literal"><span class="pre">[#<pull</span> request <span class="pre">number>](https://github.com/courseload/project/pull/<pull</span> request number>) - <pull request description></tt>. So we notice right away we need two things from <cite>git log</cite>:</p>
<ol class="arabic simple">
<li>The commit message of the merge. Think of this as the subject line of an email. We want this because this has the number of the pull request.</li>
<li>The pull request description, which works out to be, for the sake of this blog post, the equivalent of the first line of the body of the aforementioned email.</li>
</ol>
<p>This git command gets us this info without a bunch of cruft:</p>
<pre class="literal-block">
git log --pretty=format:'%s%n%b' --merges
</pre>
<p>But let's get really close now to the desired final output:</p>
<pre class="literal-block">
git log --pretty=format:'%s%n* [#{pr_num}](https://github.com/courseload/project/pull/{pr_num}) - %b' --merges
</pre>
<p>Now every merge commit appears as a two-line entry. The first line is the merge commit subject; the second is the pull request description. For bonus points, the second line looks almost exactly like a changelog line, except that <cite>{pr_num}</cite> placeholders for Python's <cite>str.format</cite> stand in for the PR number.</p>
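<p>To make the placeholder trick concrete, here is a minimal Python 3 sketch of what happens to that two-line output. The sample text, the branch names in it, and the <cite>changelog_lines</cite> helper are all made up for illustration; real <cite>git log</cite> output may carry extra blank lines.</p>

```python
import re

# Hypothetical sample of the two-line entries the git log command prints.
# The branch names after "from" are invented for this example.
SAMPLE_LOG = (
    "Merge pull request #100 from courseload/finalize\n"
    "* [#{pr_num}](https://github.com/courseload/project/pull/{pr_num})"
    " - Finalized previously preliminary stuff\n"
    "Merge pull request #99 from courseload/prelim\n"
    "* [#{pr_num}](https://github.com/courseload/project/pull/{pr_num})"
    " - Did some preliminary stuff\n"
)

# Group 1 is the PR number from the subject line; group 2 is the whole
# following line, which still contains the {pr_num} placeholders.
PATTERN = re.compile(r"^Merge pull request #(\d+).*\n(.*)$", re.MULTILINE)


def changelog_lines(log_output):
    """Fill each template line with the PR number found on the line above it."""
    return [
        template.format(pr_num=pr_num)
        for pr_num, template in PATTERN.findall(log_output)
    ]


for line in changelog_lines(SAMPLE_LOG):
    print(line)
```

<p>One caveat with this approach: a pull request description that contains literal curly braces will trip up <cite>str.format</cite>, so such lines would need escaping or a different substitution method.</p>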
</div>
<div class="section" id="python">
<h2>Python</h2>
<p>It's great that we have just the info we want, but I know we're also going to need to do two things:</p>
<ol class="arabic simple">
<li>Parse out the pull request number from the <cite>git log</cite> output, and</li>
<li>Use the PR number to create the changelog entry</li>
</ol>
<p>By running the above <cite>git log</cite> command via <cite>subprocess.check_output</cite> I can automate all this with <a class="reference external" href="https://gist.github.com/mattdeboard/68f7009e847e36e6c107">this script</a>:</p>
<blockquote>
<div class="highlight"><pre><span class="c">#!/usr/bin/env python</span>
<span class="sd">"""This script generates release notes for each merged pull request from</span>
<span class="sd">git merge-commit messages.</span>

<span class="sd">Usage:</span>

<span class="sd">    `python release.py <start_commit> <end_commit> [-f FILEPATH] [-t NEW_TAG] [-d RELEASE_DATE]`</span>

<span class="sd">For example, if you wanted to find the diff between version 1.0 and 1.2,</span>
<span class="sd">and write the output to the release notes file, you would type the</span>
<span class="sd">following:</span>

<span class="sd">    `python release.py 1.0 1.2 -f CHANGELOG.md`</span>
<span class="sd">"""</span>
<span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">os.path</span> <span class="kn">as</span> <span class="nn">op</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">deque</span>

<span class="n">PROJECT_URI</span> <span class="o">=</span> <span class="s">"https://github.com/foo/bar"</span>

<span class="k">def</span> <span class="nf">commit_msgs</span><span class="p">(</span><span class="n">start_commit</span><span class="p">,</span> <span class="n">end_commit</span><span class="p">):</span>
    <span class="sd">"""Run the git command that outputs the merge commits (both subject</span>
    <span class="sd">and body) to stdout, and return the output.</span>
    <span class="sd">"""</span>
    <span class="c"># No shell is involved, so the single quotes reach git literally and</span>
    <span class="c"># show up in the output; the regex below relies on the trailing one.</span>
    <span class="n">fmt_string</span> <span class="o">=</span> <span class="p">(</span><span class="s">"'</span><span class="si">%s</span><span class="s">%n* [#{pr_num}]"</span>
                  <span class="s">"("</span> <span class="o">+</span> <span class="n">PROJECT_URI</span> <span class="o">+</span> <span class="s">"/pull/{pr_num}) - %b'"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">subprocess</span><span class="o">.</span><span class="n">check_output</span><span class="p">([</span>
        <span class="s">"git"</span><span class="p">,</span>
        <span class="s">"log"</span><span class="p">,</span>
        <span class="s">"--pretty=format:</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="n">fmt_string</span><span class="p">,</span>
        <span class="s">"--merges"</span><span class="p">,</span> <span class="s">"</span><span class="si">%s</span><span class="s">..</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">start_commit</span><span class="p">,</span> <span class="n">end_commit</span><span class="p">)])</span>

<span class="k">def</span> <span class="nf">release_note_lines</span><span class="p">(</span><span class="n">msgs</span><span class="p">):</span>
    <span class="sd">"""Parse the lines from git output and format the strings using the</span>
    <span class="sd">pull request number.</span>
    <span class="sd">"""</span>
    <span class="n">ptn</span> <span class="o">=</span> <span class="s">r"Merge pull request #(\d+).*\n([^\n]*)'$"</span>
    <span class="n">pairs</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="n">ptn</span><span class="p">,</span> <span class="n">msgs</span><span class="p">,</span> <span class="n">re</span><span class="o">.</span><span class="n">MULTILINE</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">deque</span><span class="p">(</span><span class="n">body</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">pr_num</span><span class="o">=</span><span class="n">pr_num</span><span class="p">)</span> <span class="k">for</span> <span class="n">pr_num</span><span class="p">,</span> <span class="n">body</span> <span class="ow">in</span> <span class="n">pairs</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">release_header_line</span><span class="p">(</span><span class="n">version</span><span class="p">,</span> <span class="n">release_date</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">release_date</span> <span class="o">=</span> <span class="n">release_date</span> <span class="ow">or</span> <span class="n">datetime</span><span class="o">.</span><span class="n">date</span><span class="o">.</span><span class="n">today</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s">'%Y/%m/</span><span class="si">%d</span><span class="s">'</span><span class="p">)</span>
    <span class="k">return</span> <span class="s">"## </span><span class="si">%s</span><span class="s"> - </span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">version</span><span class="p">,</span> <span class="n">release_date</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">prepend</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">lines</span><span class="p">,</span> <span class="n">release_header</span><span class="o">=</span><span class="bp">False</span><span class="p">):</span>
    <span class="sd">"""Write `lines` (i.e. release notes) to file `filename`."""</span>
    <span class="k">if</span> <span class="n">op</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">'r+'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">first_line</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
            <span class="n">f</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'</span><span class="se">\n\n</span><span class="s">'</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">lines</span><span class="p">,</span> <span class="n">first_line</span><span class="p">]))</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
            <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">argparse</span>

    <span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'start_commit'</span><span class="p">,</span> <span class="n">metavar</span><span class="o">=</span><span class="s">'START_COMMIT_OR_TAG'</span><span class="p">)</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'end_commit'</span><span class="p">,</span> <span class="n">metavar</span><span class="o">=</span><span class="s">'END_COMMIT_OR_TAG'</span><span class="p">)</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--filepath'</span><span class="p">,</span> <span class="s">'-f'</span><span class="p">,</span>
                        <span class="n">help</span><span class="o">=</span><span class="s">"Absolute path to output file."</span><span class="p">)</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">'--tag'</span><span class="p">,</span> <span class="s">'-t'</span><span class="p">,</span> <span class="n">metavar</span><span class="o">=</span><span class="s">'NEW_TAG'</span><span class="p">)</span>
    <span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span>
        <span class="s">'--date'</span><span class="p">,</span> <span class="s">'-d'</span><span class="p">,</span> <span class="n">metavar</span><span class="o">=</span><span class="s">'RELEASE_DATE'</span><span class="p">,</span>
        <span class="n">help</span><span class="o">=</span><span class="s">"Date of release for listed patch notes. Use yyyy/mm/dd format."</span><span class="p">)</span>
    <span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
    <span class="n">start</span><span class="p">,</span> <span class="n">end</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">start_commit</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">end_commit</span>
    <span class="n">lines</span> <span class="o">=</span> <span class="n">release_note_lines</span><span class="p">(</span><span class="n">commit_msgs</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">end</span><span class="p">))</span>

    <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">tag</span><span class="p">:</span>
        <span class="n">lines</span><span class="o">.</span><span class="n">appendleft</span><span class="p">(</span><span class="n">release_header_line</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">tag</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">date</span><span class="p">))</span>

    <span class="n">lines</span> <span class="o">=</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">filepath</span><span class="p">:</span>
        <span class="n">filename</span> <span class="o">=</span> <span class="n">op</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">filepath</span><span class="p">)</span>
        <span class="n">prepend</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">lines</span><span class="p">)</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">print</span> <span class="n">lines</span>
</pre></div>
</blockquote>
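<p>The prepend step is the one part of the script that touches an existing file, so here is a stand-alone Python 3 sketch of the same idea for experimenting outside a repo. It is a variation rather than the script's own code: the <cite>prepend_text</cite> name is hypothetical, and it reads the old content and then rewrites the file instead of seeking back to the start of an <cite>r+</cite> handle.</p>

```python
import os


def prepend_text(path, new_text):
    """Put new_text at the top of the file at path, creating it if needed.

    Existing content is kept below the new text, separated by a blank line.
    """
    old_text = ""
    if os.path.exists(path):
        with open(path) as f:
            old_text = f.read()

    with open(path, "w") as f:
        f.write(new_text)
        # A blank line separates releases; a brand-new file just gets a
        # trailing newline.
        f.write("\n\n" if old_text else "\n")
        f.write(old_text)
```

<p>Calling <cite>prepend_text</cite> once per release keeps the newest release at the top of the file, which matches the ordering in the changelog example at the start of this post.</p>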
<p>To view the output in stdout, at the command line type:</p>
<pre class="literal-block">
$ ./release.py 1.7 HEAD
</pre>
<p>Or, specify an output file:</p>
<pre class="literal-block">
$ ./release.py 1.7 HEAD -f ./RELEASE\ NOTES.md
</pre>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>One additional step I took was to create a git alias for a prettied-up version of the git log command, for when I just want to scan through the differences from one version to the next. If you'd like to do the same, add the following to the <cite>[alias]</cite> section of <cite>~/.gitconfig</cite>:</p>
<pre class="literal-block">
lm = log --pretty=format:'%Cred%h%Creset %C(bold blue)<%an>%Creset \
-%C(yellow)%d%Creset %C(bold cyan)%s %Cgreen(%cr)%n%Creset%n - %b%n' \
--abbrev-commit --date=relative --merges
</pre>
<p>You can also achieve the same effect by entering the following at the CLI:</p>
<pre class="literal-block">
git config --global alias.lm "log --pretty=format:'%Cred%h%Creset \
%C(bold blue)<%an>%Creset -%C(yellow)%d%Creset %C(bold cyan)%s \
%Cgreen(%cr)%n%Creset%n - %b%n' --abbrev-commit --date=relative --merges"
</pre>
<p>(The escaped newlines aren't necessary; I'm only including them to keep the line length down on the page.)</p>
<p>Please leave a comment if you have questions or spot an error. Thanks.</p>
</div>
How to Run a Windows Service As A Linux Daemonhttp://mattdeboard.net/2012/10/19/how-to-run-windows-service-as-linux-daemon2012-10-19T00:00:00Z<a href="http://mattdeboard.net">Matt DeBoard</a><p><strong>Premise:</strong> You've got a Windows service that you want to run on a Linux server</p>
<p><strong>Problem:</strong> Your code is written using the .NET framework and some language that targets the CLR (C#, VB, Clojure-CLR, etc.)</p>
<p><strong>Solution:</strong> <a class="reference external" href="http://www.mono-project.com/Main_Page">Mono</a> is an open-source implementation of the .NET framework. By installing mono you gain access to a ton of useful stuff, but the relevant item here is the <cite>mono-service</cite> executable. (Installing mono is out of the scope of this blog post, but odds are pretty good mono is available from your distro's package management system.)</p>
<p>Once installed, you can run your compiled code like so:</p>
<pre class="literal-block">
mono-service SomeExecutable.exe
</pre>
<p>By default, this creates a lockfile in <cite>/tmp</cite>; you can change its location with the <cite>-l:<lockfile></cite> option. This is great, because now your service is running in the background! However, it's really flimsy: what if the process dies? What if the server needs to be rebooted? To solve this I'm using <a class="reference external" href="http://supervisord.org/">supervisor</a>.</p>
<div class="section" id="get-it-running-in-4-steps">
<h2>Get It Running In 4 Steps</h2>
<p>Once you've got supervisor and mono installed, follow these steps:</p>
<ol class="arabic">
<li><p class="first">Create a supervisor file in <cite>/etc/supervisor/conf.d/</cite> with a descriptive name. We'll use <cite>mysvc.conf</cite>.</p>
</li>
<li><p class="first">Edit <cite>mysvc.conf</cite> so it looks similar to this<sup>1,2</sup></p>
<pre class="literal-block">
[program:mysvc]
command=mono-service MyWindowsService.exe --no-daemon
directory=/path/to/executable
user=someuser
stdout_logfile=/home/someuser/mysvc/out.log
redirect_stderr=true
</pre>
</li>
<li><p class="first">Run <cite>sudo supervisorctl update</cite>. This makes supervisor reread its config files and start the program you added above.</p>
</li>
<li><p class="first">To confirm that your process started, run <cite>ps aux|grep mono</cite>. You should see it in the process list.</p>
</li>
</ol>
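<p>Since supervisor refuses to start a program whose log directory is missing (see footnote 1), it can be handy to check that before step 3. Below is a minimal Python 3 sketch, assuming the config is plain INI text that <cite>configparser</cite> can read; the <cite>ensure_log_dirs</cite> helper is hypothetical and not part of supervisor.</p>

```python
import configparser
import os


def ensure_log_dirs(conf_text):
    """Create the parent directory of each stdout_logfile in conf_text.

    Returns the list of directories that were checked or created.
    """
    # Interpolation is disabled so values containing '%' are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)

    log_dirs = []
    for section in parser.sections():
        logfile = parser.get(section, "stdout_logfile", fallback=None)
        if not logfile:
            continue
        log_dir = os.path.dirname(logfile)
        if log_dir:
            os.makedirs(log_dir, exist_ok=True)
            log_dirs.append(log_dir)
    return log_dirs
```

<p>This only creates the directory. It does not change ownership, so the account named in the <cite>user</cite> parameter still needs write permission there.</p>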
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Hope this helps. Supervisor has a ton of different options for configuring how a process runs, so it's worth it to RTFM.</p>
<div class="section" id="footnotes">
<h3>Footnotes</h3>
<p><strong>1.</strong> The directory specified in your <cite>stdout_logfile</cite> parameter must already exist. If you try to start the <cite>mysvc</cite> process without creating it, supervisor will throw an error. Also, the <cite>user</cite> parameter should be set to a user that has permissions to write to the directory where you're keeping the <cite>stdout_logfile</cite>. Please consult the relevant <a class="reference external" href="http://supervisord.org/configuration.html#program-x-section-values">supervisor docs</a> for more about users & processes.</p>
<p><strong>2</strong>. You must use the <cite>--no-daemon</cite> flag. It avoids creation of the lockfile and keeps <cite>mono-service</cite> in the foreground, which supervisor requires of the programs it manages and which lets it capture/redirect stdout/stderr to a logfile.</p>
</div>
</div>
Larry the Software Guyhttp://mattdeboard.net/2012/10/05/larry-the-software-guy2012-10-05T00:00:00Z<a href="http://mattdeboard.net">Matt DeBoard</a><p>Anil Dash published <a class="reference external" href="http://dashes.com/anil/2012/10/the-blue-collar-coder.html">a blog post</a> today I think is a victim of a bad title: "The Blue Collar Coder." I normally skim over the "Is programming an art, craft or science?" discussions but there were a couple of very smart programmers discussing it on Twitter, and I joined in. During the conversation, I vacillated between agreeing with Anil's proposition and agreeing with <a class="reference external" href="https://twitter.com/strlen/status/254369312884805632">Alex Feinberg</a>.</p>
<p>I think the title is poor because programming will never be "blue collar." Anil knows that; he more or less admits it was basically caste-baiting in the first sentence of the final paragraph. Unfortunately, I think people reacted to the notion of a programmer being considered "blue collar" more than the <em>real</em> points I think he was trying to make. The tl;dr of Anil's blog post seems to be:</p>
<ol class="arabic simple">
<li>A CS degree is overkill for most job openings</li>
<li>The "tech community" (??) should be focused on creating lots of jobs, not entrepreneurship</li>
<li>Huge amounts of good for people & business can be done by creating a vocational training program for software development</li>
</ol>
<p>I don't even want to touch (1) because people seem to have such ridiculously strong feelings one way or the other (and possession of a CS degree seems to be no indicator of which way those feelings will go). I don't have a CS degree, and I am enjoying my career. I recognize though that in a few years maybe I'll be bored of the nature of problems I'm working on and maybe getting that degree would have been a smart move after all. In other words, I don't have an opinion on this because I don't know what I don't know.</p>
<p>The second point is eyeroll-worthy, in my opinion, because I think the impression that the "tech community" is hyper-focused on producing "the next Zuckerberg" is the result of Hacker News's own "reality distortion field" about startups. Hacker News is the modern equivalent of a sweaty, manic Steve Ballmer trying to pump up a room full of nerds, but instead of "Developers! Developers! Developers!", HN is chanting, "Startups! Startups! Startups!" But what're you gonna do? HN exists for a very specific reason: startup news. Point being that it's not good or bad that this reality distortion effect exists, but you have to seek other perspectives.</p>
<p>I agree with the third point. Full stop. My <a class="reference external" href="http://en.wikipedia.org/wiki/Scientific_Wild-Ass_Guess">SWAG</a> (pretty light on the "S") is that, by offering associate's degree & certification programs in software development, higher ed could serve more people at a lower per-person cost than it does with the current BS/BA in CS, and deliver highly skilled employees to the job market, all while maintaining or building a reputation as a high-quality institution.</p>
<p>This is where I think Anil's points get lost because of the title, illustrated by something Alex F. wrote:</p>
<blockquote>
There will be demand for "non-programmers who code" for sure, but these positions will still require analytical thinking.</blockquote>
<p>Maybe I'm misreading it, but the implication seems to be that "blue collar" implies work where analytical thinking is optional. There's no less analytical thinking in e.g. managing inventory, <a class="reference external" href="http://jacquesmattheij.com/how-to-build-a-windmill-ii">building windmills</a>, etc. My opinion, based on my military experience, is that there are many smart and savvy people out there with great analytical abilities who couldn't get into or complete a CS degree. For these people an associate of applied science degree or a 1-year certificate in software development would be FAR more accessible. Not only that, I'd wager the distribution of skill among graduates would look pretty close to that of most CS programs. What I'm saying is, in my short time doing this I've met some dumb/bad/lazy programmers with CS degrees from universities with respected programs.</p>
<p>Now obviously I don't do much manual labor anymore, but I'm proud of and enjoy the maintenance work I do. Most programming IMO boils down to the equivalent of "blue collar" work: refactoring code you or someone else wrote; patching over and smoothing out ugly spots; squashing bugs that have been around so long they're just considered part of the product. This isn't something I'm claiming I discovered by the way; this is a conclusion other people have drawn that is supported by my own anecdotes.</p>
REST API for search resultshttp://mattdeboard.net/2012/02/07/haystack-resource-for-django-tastypie2012-02-07T00:00:00Z<a href="http://mattdeboard.net">Matt DeBoard</a><p><strong>Updated:</strong> <em>So after talking with the author of Tastypie I added the</em> <cite>SearchDeclarativeMetaclass</cite> <em>and</em> <cite>SearchOptions</cite> <em>to handle inheritance of the metaclass attributes on</em> <cite>SearchResource</cite>. <em>I almost entirely copied his</em> <cite>ModelDeclarativeMetaclass</cite> <em>and it works well. In-house, we further subclass</em> <cite>SearchResource</cite> <em>to model our job postings data in our search index, and it works great.</em></p>
<p>So, first things first: <a class="reference external" href="https://github.com/toastdriven/django-tastypie">django-tastypie</a> is pretty great. If you're running a Django web application and want to expose your data via a REST API, tastypie will do it. I got everything up-and-running in just a few hours (95% reading, 5% writing).</p>
<p>Tastypie -- written by <a class="reference external" href="https://twitter.com/#!/daniellindsley">Daniel Lindsley</a>, the guy behind <a class="reference external" href="http://haystacksearch.org">django-haystack</a> -- uses a <cite>Resource</cite> class to handle all the API hairiness; it comes with a <cite>ModelResource</cite> subclass out of the box to provide an interface to a Django model & the ORM. If you want a better explanation, or want to know more, go <a class="reference external" href="http://django-tastypie.readthedocs.org/en/latest/index.html">read the docs</a>.</p>
<p>Speaking of the documentation, there is an example <cite>Resource</cite> subclass in the docs' <a class="reference external" href="http://readthedocs.org/docs/django-tastypie/en/latest/cookbook.html#adding-search-functionality">cookbook</a>, though that was more about adding search to an existing resource. We want to serve resources -- i.e. Solr documents -- exclusively from Lucene. Our resource is literally a document from the search engine, so we needed a class to model that behavior. (You can read more about how we use Solr <a class="reference external" href="http://mattdeboard.net/2011/12/29/displacing-mysql-with-solr/">here</a>.) To accomplish this, I put together <a class="reference external" href="https://github.com/mattdeboard/mattdeboard.net/blob/master/2012/02/07/resources.py">this</a> <cite>SearchResource</cite> subclass which others may find useful.</p>
<p>If you use Haystack, you know that it goes to great lengths to emulate the API of Django's ORM to provide a familiar interface to the search index. In that vein, <cite>SearchResource</cite> emulates the <cite>ModelResource</cite> class.</p>
<p>One issue we have in-house is that the semantics we want to expose in our API sometimes differ from the fields we actually use to look up resources. To address that, I created a map from querystring parameters to the fields in the search index in which their values will be sought:</p>
<div class="highlight"><pre><span class="k">class</span> <span class="nc">JobSearchResource</span><span class="p">(</span><span class="n">SearchResource</span><span class="p">):</span>
    <span class="n">field_aliases</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">'city'</span><span class="p">:</span> <span class="s">'city_exact__exact'</span><span class="p">,</span>
        <span class="s">'state'</span><span class="p">:</span> <span class="s">'state_exact__exact'</span><span class="p">,</span>
        <span class="s">'country'</span><span class="p">:</span> <span class="s">'country_exact__exact'</span><span class="p">,</span>
        <span class="s">'company'</span><span class="p">:</span> <span class="s">'company_exact__exact'</span><span class="p">,</span>
        <span class="s">'title'</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span>
        <span class="s">'date_new'</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span>
        <span class="s">'uid'</span><span class="p">:</span> <span class="bp">None</span>
    <span class="p">}</span>

    <span class="o"><</span><span class="n">snip</span> <span class="n">declared</span> <span class="n">fields</span><span class="o">></span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">JobSearchResource</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">__init__</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_meta</span><span class="o">.</span><span class="n">index_fields</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">field_aliases</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
</pre></div>
<p>We use <cite>field_aliases.keys()</cite> to populate <cite>index_fields</cite>, so now we need to add in logic to look up those keys and replace them in the query logic with the fields we actually want to search against. In this case, we want to search against <cite>(country|state|city|company)_exact</cite>, which, if you're familiar with Lucene, are stored, unanalyzed fields. We use Haystack's <cite>__exact</cite> lookup which has the effect of turning the term query into a phrase by wrapping it in quotes, e.g. <cite>q=country_exact:"United States"</cite>. We don't want tokenized field lookup because we don't want to match, say, "United Kingdom" when we are looking for "United States" due to the match on "United." (There are a million ways to do this of course, but this is how we chose to do it.)</p>
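<p>To illustrate the aliasing and the effect of <cite>__exact</cite> without pulling in Tastypie or Haystack, here is a plain Python 3 sketch. The <cite>resolve_field</cite> and <cite>to_query</cite> helpers are hypothetical, and treating a <cite>None</cite> alias as "search the field with the same name" is my assumption for this example, not something the real classes are guaranteed to do.</p>

```python
FIELD_ALIASES = {
    "city": "city_exact__exact",
    "state": "state_exact__exact",
    "country": "country_exact__exact",
    "company": "company_exact__exact",
    "title": None,
    "date_new": None,
    "uid": None,
}


def resolve_field(alias, aliases=FIELD_ALIASES):
    """Map a public querystring parameter to an index lookup.

    Returns None for parameters that are not exposed at all. A None value
    in the map is assumed to mean "use the parameter name as the field".
    """
    if alias not in aliases:
        return None
    return aliases[alias] or alias


def to_query(lookup, value):
    """Render a lookup as a Lucene-style term, quoting exact lookups."""
    field, _, lookup_type = lookup.partition("__")
    if lookup_type == "exact":
        # Quoting turns the value into a phrase, so "United States" cannot
        # match "United Kingdom" on the shared token "United".
        return '%s:"%s"' % (field, value)
    return "%s:%s" % (field, value)


print(to_query(resolve_field("country"), "United States"))
```

<p>The point of the map is that API consumers only ever see <cite>country</cite>, while the query that reaches the index is built against <cite>country_exact</cite>.</p>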
<p>Now we need to override <cite>SearchResource.build_filters</cite>:</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">build_filters</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">filters</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="n">terms</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">if</span> <span class="n">filters</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">filters</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">param_alias</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">filters</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
        <span class="k">if</span> <span class="n">param_alias</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">_meta</span><span class="o">.</span><span class="n">index_fields</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">param</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">field_aliases</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">param_alias</span><span class="p">,</span> <span class="n">param_alias</span><span class="p">)</span> <span class="c"># <---</span>
        <span class="n">tokens</span> <span class="o">=</span> <span class="n">value</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_meta</span><span class="o">.</span><span class="n">lookup_sep</span><span class="p">)</span>
        <span class="n">field_queries</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">token</span> <span class="ow">in</span> <span class="n">tokens</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">token</span><span class="p">:</span>
                <span class="n">field_queries</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_meta</span><span class="o">.</span><span class="n">query_object</span><span class="p">((</span><span class="n">param</span><span class="p">,</span>
                                                              <span class="n">token</span><span class="p">)))</span>
        <span class="n">terms</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">reduce</span><span class="p">(</span><span class="n">operator</span><span class="o">.</span><span class="n">or_</span><span class="p">,</span>
                            <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="n">field_queries</span><span class="p">)))</span>
    <span class="k">if</span> <span class="n">terms</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">reduce</span><span class="p">(</span><span class="n">operator</span><span class="o">.</span><span class="n">and_</span><span class="p">,</span> <span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">,</span> <span class="n">terms</span><span class="p">))</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">terms</span>
</pre></div>
<p>Note the line with the commented <cite><---</cite>: This is where the alias->index field translation takes place. If you find yourself with a need to alias search fields this may be a solution for you.</p>
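<p>Stripped of the Haystack plumbing, the translation step looks roughly like this. It's a standalone sketch: the <cite>FIELD_ALIASES</cite> mapping, the <cite>translate_filters</cite> name and the comma separator are all assumptions for illustration, not the resource's actual configuration:</p>

```python
# Hypothetical alias map: public query parameter -> index field to search.
FIELD_ALIASES = {
    "country": "country_exact",
    "state": "state_exact",
    "city": "city_exact",
    "company": "company_exact",
}


def translate_filters(filters, aliases=FIELD_ALIASES, lookup_sep=","):
    """Return (index_field, token) pairs for every recognised alias."""
    pairs = []
    for alias, value in sorted(filters.items()):
        if alias not in aliases:
            continue  # unknown parameters are skipped, as in build_filters
        for token in value.split(lookup_sep):
            if token:  # ignore empty tokens such as a trailing separator
                pairs.append((aliases[alias], token))
    return pairs


print(translate_filters({"country": "United States,Canada", "page": "2"}))
# [('country_exact', 'United States'), ('country_exact', 'Canada')]
```

<p>The caller only ever sees <cite>country</cite>; the <cite>_exact</cite> suffix stays an internal detail of the index.</p>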
<p>Finally, I made the decision to force some additional configuration overhead -- about 5 attributes on the metaclass -- in order to completely preserve the amazing extensibility of Haystack. I know that <a class="reference external" href="http://directemployersfoundation.org">in-house</a> we subclass just about everything from Haystack, including the <cite>SearchQuerySet</cite>; I assume there are others out there doing the same, and more, so you are not forced to use Haystack's built-in <cite>SQ</cite> object to compose query trees if you've created your own. (If you have I'd be curious to see it.)</p>
<p>Let me know in the comments if you have any problems, spot bugs or think I'm an idiot.</p>
Displacing MySQL with...Solr?
http://mattdeboard.net/2011/12/29/displacing-mysql-with-solr
2011-12-29T00:00:00Z
<a href="http://mattdeboard.net">Matt DeBoard</a>
<p>We recently completed a big refactor at <a class="reference external" href="http://directemployers.org">work</a>, the aim of which was to implement search for one of our products, a Django-based web CMS called DirectSEO. It did not take long, however, to realize that by choosing Solr as our search backend, we had the opportunity to make some much-needed optimizations. Now, after analyzing three weeks' worth of data related to the refactor, I can say the time investment has yielded real, measurable gains. They came mainly from removing some very expensive database calls from our views, then fetching the same data via calls to the <a class="reference external" href="http://lucene.apache.org/solr/">Solr</a> index. This resulted in a simplified code base and decreased page-load times. This post is intended to explain a bit about our approach to leveraging Solr's feature set.</p>
<p>(This is my first truly technical post so I'm sure I'm leaving things out, or explaining poorly. Please contact me or leave comments if I didn't cover something in enough detail or if you've got any questions.)</p>
<div class="section" id="some-background">
<h2>Some Background</h2>
<p>As part of their membership in DirectEmployers, member organizations are provided with a job board on a domain of their choosing to present their job listings in an SEO-friendly way. These sites often live on the <a class="reference external" href="http://en.wikipedia.org/wiki/.jobs">.jobs TLD</a>; however, members can -- and often do -- use subdomains of their own site for their job board. An example of each: <a class="reference external" href="http://lockheedmartin.jobs">Lockheed-Martin</a> (.jobs); <a class="reference external" href="http://jobsearch.arrow.com">Arrow Electronics</a> (other).</p>
<div class="section" id="how-it-works">
<h3>How It Works</h3>
<p>The job boards are generated dynamically. Members give us some basic information -- header images, brand colors, and so forth -- which we use to create a site configuration. This configuration is then referenced to look up all the jobs associated with a particular member organization. Sometimes, a member organization may have multiple job sites catering to specific job categories: <a class="reference external" href="http://ibm-brazil.jobs">IBM Brazil</a> or <a class="reference external" href="http://lockheedmartin-infosec.jobs">Lockheed-Martin InfoSec</a>, for example. In these cases, the corpus of jobs for that member organization is then refined to only include jobs which fall into that category.</p>
<p>From here, users can drill down into the jobs using standard navigation links, which we generate from facets for title and location, plus custom facets we call <a class="reference external" href="https://github.com/DirectEmployers/saved-search">Saved Search</a> (not to be confused with <a class="reference external" href="https://github.com/toastdriven/saved_searches">saved-searches</a>).</p>
</div>
</div>
<div class="section" id="implementation-details">
<h2>Implementation Details</h2>
<p>Simply put, we use Django to deal with MySQL, and we use <a class="reference external" href="http://haystacksearch.org">Django-Haystack</a> to deal with Solr. We run our <a class="reference external" href="https://github.com/DirectEmployers/django-haystack">own fork</a> of Haystack, which capitalizes on some hacks in my own <a class="reference external" href="https://github.com/mattdeboard/pysolr">fork of pysolr</a>.</p>
<p>Our saved-search app gives our members a way to create and maintain persistent, user-defined queries. In practice we use these to create sites like the aforementioned <a class="reference external" href="http://lockheedmartin-infosec.jobs">Lockheed-Martin InfoSec</a>. They also give our members the ability to create custom job verticals. <a class="reference external" href="http://hiltonworldwide.jobs">Hilton</a> has saved searches built around departments; <a class="reference external" href="http://unilevercareers.jobs">Unilever</a> has a saved search for "hot jobs" they want to fill quickly.</p>
<div class="section" id="architectural-aside">
<h3>Architectural Aside</h3>
<p>A problem arises, however, when a site has a lot of saved searches. But to understand the problem, I should explain a little bit about how our data is stored in the database and how it gets indexed.</p>
<p>Each job listing is a row on our <cite>joblisting</cite> table. This is currently the only table Solr indexes. Haystack uses a module called <a class="reference external" href="http://p.mattdeboard.net/search_indexes.py.html">search_indexes.py</a> to set the parameters in <cite>schema.xml</cite>. In it, we specify model fields to index directly, plus several fields Haystack calls "prepared fields," which contain denormalized or calculated data. Native model fields like <cite>title</cite>, <cite>state</cite>, <cite>country</cite>, etc., can be used to create <a class="reference external" href="http://www.lucidimagination.com/devzone/technical-articles/faceted-search-solr">facets</a>. Facets are what you see under "Filter by (Title|City|State|Country)" <a class="reference external" href="http://arinc.jobs/">here</a>. Something like the below snippet will return all the values for those fields along with counts of each (which is what faceting is):</p>
<div class="highlight"><pre><span class="n">sqs</span> <span class="o">=</span> <span class="n">SearchQuerySet</span><span class="p">()</span><span class="o">.</span><span class="n">facet</span><span class="p">(</span><span class="s">'title_slab'</span><span class="p">)</span><span class="o">.</span><span class="n">facet</span><span class="p">(</span><span class="s">'city_slab'</span><span class="p">)</span>\
<span class="o">.</span><span class="n">facet</span><span class="p">(</span><span class="s">'state_slab'</span><span class="p">)</span><span class="o">.</span><span class="n">facet</span><span class="p">(</span><span class="s">'country_slab'</span><span class="p">)</span>
<span class="n">facet_counts</span> <span class="o">=</span> <span class="n">sqs</span><span class="o">.</span><span class="n">facet_counts</span><span class="p">()[</span><span class="s">'fields'</span><span class="p">]</span>
</pre></div>
<p>("slabs" are calculated fields such that the <cite>city_slab</cite> field would have a format like:</p>
<pre class="literal-block">
"/manassas/virginia/usa/jobs/::Manassas, VA"
</pre>
<p>We use these to precalculate URL segments in the index so we can keep string manipulation to a minimum in the application. We split on "::" and handle those substrings as needed.)</p>
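<p>As a sketch of that last step (the <cite>parse_slab</cite> helper and the count of 12 are made up for illustration, and I'm assuming the facet data arrives as (value, count) pairs):</p>

```python
def parse_slab(slab):
    """Split a slab into its URL path and its display label."""
    path, _, label = slab.partition("::")
    return path, label


# Shaped like one entry of the facet counts; the count itself is invented.
city_facets = [("/manassas/virginia/usa/jobs/::Manassas, VA", 12)]

for slab, count in city_facets:
    path, label = parse_slab(slab)
    print('<a href="%s">%s (%d)</a>' % (path, label, count))
# <a href="/manassas/virginia/usa/jobs/">Manassas, VA (12)</a>
```

<p>Because the path is already in the index, the view never has to slugify a city or state name at request time.</p>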
<p>However, since saved searches are ad-hoc filters that can be composed of any permutation of index fields, they cannot be properly faceted. This means that to get counts of job listings for each saved search, we'd normally have to perform a single HTTP request for each.</p>
<p>To circumvent this costly routine, I hacked up pysolr to implement support for Solr's <a class="reference external" href="http://wiki.apache.org/solr/FieldCollapsing">field collapsing/group query functionality</a>, then wrote <a class="reference external" href="https://github.com/DirectEmployers/saved-search/blob/master/saved_search/groupsearch.py">a backend</a> to support it. The effect is that for <em>n</em> saved searches configured for a particular site, only one query is required; the saved search concept would otherwise involve far too many HTTP requests to be practical.</p>
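<p>To give a feel for the single-request idea, here's a minimal sketch that builds the query parameters by hand rather than going through my pysolr fork. <cite>group</cite> and <cite>group.query</cite> are Solr's result-grouping parameters; the <cite>grouped_params</cite> helper and the two saved-search queries are hypothetical:</p>

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def grouped_params(base_query, saved_searches):
    """Build the parameters for one Solr request covering every saved search."""
    params = [
        ("q", base_query),
        ("group", "true"),
    ]
    # One group.query per saved search, all sent in the same request.
    params.extend(("group.query", query) for query in saved_searches)
    return params


# Hypothetical saved searches configured for a single site.
saved = ['title:"security engineer"', "city_exact:Manassas"]
print(urlencode(grouped_params("*:*", saved)))
```

<p>Each <cite>group.query</cite> should come back as its own group in the response, so the counts for all <em>n</em> saved searches arrive together instead of in <em>n</em> round trips.</p>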
</div>
<div class="section" id="haystack-solr-setup">
<h3>Haystack & Solr Setup</h3>
<p>On the Python side, we use Haystack's <a class="reference external" href="http://docs.haystacksearch.org/dev/searchindex_api.html#realtimesearchindex">RealTimeSearchIndex</a> class as the basis for our index. In short, it's the exact same as the SearchIndex class, but with post-save/delete listeners for the jobListing table. It gets us as close as we really need to get to ElasticSearch-style real-time search. While Solr 4.0 is going to have "near real-time" search, it's just not a feature we have a need for now. If that changes in the future, we'll re-evaluate.</p>
<p>For Solr, we run two servers in a master-slave configuration. The master handles the real-time updates. The (read-only) slave handles all the queries, and is set to do replication checks every 60 seconds. The side effect of this is that when the master is handling a large volume of updates, average query response time by the slave slows by 50-75ms. For comparison, it normally takes around 200ms for our application to calculate and return an HTTP response.</p>
<p>The one caveat for using Solr in this way is that unlike some other document databases, there is absolutely no notion of relations whatsoever. Plus, obviously, it wouldn't be responsible to use Solr as a primary datastore (A good read on why can be found in <a class="reference external" href="http://stackoverflow.com/questions/4960952/when-to-consider-solr/4961973#4961973">this</a> response on SO).</p>
</div>
</div>
<div class="section" id="performance-reliability">
<h2>Performance & Reliability</h2>
<p>Performance has improved measurably, especially on <a class="reference external" href="http://lockheedmartin.jobs">pages with a lot of jobs, a lot of facets and a lot of saved searches</a>. Some very costly SQL queries have been eliminated. By utilizing Solr's query-tuning tools like <cite>facet.mincount</cite>, <cite>start</cite> and <cite>offset</cite>, we've kept the amount of data transferred per request low. Using Solr to power saved searches eliminates a lot of complexity from our code base.</p>
<p>Getting data reliability right has taken longer, involving some diligent bug-hunting. I've spent the past four months learning how Solr works and how to intelligently leverage Haystack's API, and implementing some features of Solr in Haystack that aren't included out of the box. It is important to keep in mind that a Solr match is not necessarily binary. A thing might match, it might not, but more likely it will "kinda" match. Tightening up queries as needed is vital if you want exact results <em>only</em>. One of my big hurdles in getting this working right was making sure matches were fuzzy where they should be fuzzy, and exact where they should be exact.</p>
<p>Finally, I think that as we add more features to our application, we'll have to start putting standard RDBMS queries back into play in some areas. For the past 3 months I've been rewiring a Django application, cutting out the old relational stuff and replacing it with simpler, faster methods. It is a dramatic shift. As time goes on we'll be building out more features that will require relational information.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Utilizing Solr in this way is both ordinary and novel. It's novel because when people think of Solr, they think of a search box with a button that says "Search". You click on the button and get results. It's ordinary because Solr is, after all, a document database. It stores documents in a flat structure, and you compose queries to retrieve them. Not exotic, unusual or special in any way. In a use case such as ours, however, where the need for relations is minimal and practically all of our content is generated based on text searching, Solr is great.</p>
</div>
How I Became a Programmer
http://mattdeboard.net/2011/11/23/how-i-became-a-programmer
2011-11-23T00:00:00Z
<a href="http://mattdeboard.net">Matt DeBoard</a>
<p>I posted a <a class="reference external" href="http://news.ycombinator.com/item?id=3268469">very brief response</a> to a post on HackerNews yesterday challenging the notion that 8 weeks of guided tutelage on <a class="reference external" href="http://rubyonrails.org/">Ruby on Rails</a> is not going to produce someone who you might consider a "junior RoR developer." It did not garner many upvotes so I figured that like most conversation on the Internet it faded into the general ambient chatter. Imagine my surprise when I woke up to a couple handfuls' worth of emails from around the world asking me what I did, how I did it, and how I got a job. I'm assuming, judging by the relatively small amount of mail I got from a random aside on HN*, that there must be a lot of people who are trying to figure out how to pursue a career in programming.</p>
<div class="section" id="first-a-disclaimer-or-two">
<h2>First, A Disclaimer or Two</h2>
<p>Please note that this blog post is entitled, "How <strong>I</strong> Became a Programmer", not, "How <strong>You</strong> Can Become a Programmer." I'm not a self-help guru or wise or even a particularly good programmer. I did, however, decide at an inflection point in my life to pursue something vigorously and it paid off. Any insights gleaned from my experience are yours to make on your own; I doubt I'll have much insight for your personal situation.</p>
<p>Also, after consulting with my girlfriend, my total time of dedicated effort to becoming a paid programmer was actually about 12 weeks, not ~10 as I stated in the post I linked to above. So, there you go.</p>
</div>
<div class="section" id="my-story-tl-dr">
<h2>My Story: tl;dr</h2>
<p>In brief: I left the Marine Corps after more than a decade in July 2010. I got a job at the state lottery as a PR flak in August of that year, and lost it in mid-February. In mid-May I got hired as a part-time "junior User Experience engineer" at <a class="reference external" href="http://directemployers.org">DirectEmployers Association</a>. By late August I was a full-time, regular old "User Experience engineer."</p>
<p>When I lost my job I decided that I was done doing PR; I wanted to be a programmer. I took my tax return and stretched it out on a ramen and water diet. My family (dad, mostly...) was nervous as hell. In that February to May span I spent basically every waking moment learning to program, learning about Linux, and learning about computer science. I taught myself Python, I taught myself Django, I learned some functional and imperative programming, and got semi-decent at the Linux command line.</p>
<p>Voila. Without further ado, I'm going to write about what I didn't do, then dive into the questions I got via email.</p>
</div>
<div class="section" id="what-i-didn-t-do">
<h2>What I Didn't Do</h2>
<p>One of the things that was asked in almost every email was, "How did you learn Django in 11 weeks?"</p>
<p>I want to make it clear that I didn't set out to learn Django per se. Django is just a very nice toolkit of abstractions that makes creating web applications easy using Python. As far as I'm concerned learning Django was incidental to learning to program. I did not -- and still don't -- want to be considered a "Django developer." I'm not even sure I want to refer to myself as "a Python programmer."</p>
<p>In other words, I do not feel that I would be as modestly competent as I am today if I had spent an inordinate time becoming an expert at the abstraction layer of Django, instead of learning the concepts that make Django work.</p>
</div>
<div class="section" id="questions-from-email">
<h2>Questions From Email</h2>
<p><strong>Did you begin with web or book resources?</strong></p>
<p>Yes I did. :) <a class="reference external" href="http://djangoproject.com">Django</a> has excellent documentation, but <a class="reference external" href="http://stackoverflow.com">StackOverflow</a> is a much more comprehensive help source. On more general topics, I believe that MIT's OpenCourseware <a class="reference external" href="http://www.youtube.com/watch?v=k6U-i4gXkLM">Introduction to Computer Science</a> video lecture series was one of the first real computer science resources I consumed. I watched through lecture 13 or something.</p>
<p><strong>What kind of hours were you putting in on a daily and weekly basis?</strong></p>
<p>A lot. Sometimes 8, sometimes 12, sometimes 16. I was a willfully unemployed single parent, so I not only had a passion for programming, I was also hungry (figuratively speaking) and desperate. I put myself in a position where I had no room to be lazy or complacent. I think above all else that made me work 10x harder. I didn't play video games, I didn't watch TV, I didn't sleep all day. All I did all day every day was code, hack, program and develop.</p>
<p><strong>Did you have a mentor of any kind?</strong></p>
<p>I did indeed. A very smart guy was and is my mentor still, though I've learned enough that I don't rely on him as much for guidance as I used to. He mentored my metamorphosis into a programmer in nearly every way. Some specific ways he provided leadership: Practical programming knowledge (especially Python & Django); command-line expertise; got me up-and-running with emacs & vim; career advice. It helps that he is a very successful & well-respected guy who has a reputation for informed skepticism.</p>
<p><strong>Was there anything from your previous background and experience that you feel was a particular asset in your self-guided studies?</strong></p>
<p>Not really. I was a computer geek from way back, had a few BBSes in the late 80s (yes, I'm a child of the 80s & 90s), learned QBasic & VisualBasic back in the day, and tinkered with Python for a few years off and on... mostly off. Other than that, nope.</p>
<p><strong>How did you come to choose Django to study?</strong></p>
<p>The <a class="reference external" href="http://bretthoerner.com">guy</a> whose career I was trying to emulate had made a very successful career for himself with Django. Pretty straightforward from there.</p>
<p><strong>Would you mind sharing your learning process?</strong></p>
<p>I want to restate that I am not a self-help guru or particularly special in any way. I just worked hard because I was hungry and in a self-made corner where I had no choice but to succeed. I consumed everything I could that would get me to a place where I could make money doing something I love. That was my learning process. Seriously.</p>
<p><strong>I would appreciate it if you can show me how you learned Django and give me any tips/tricks sites/books to look at to learn Django or even HTML/CSS, JavaScript (Front-end Engineering stuff)</strong></p>
<p>I don't have any tips or tricks to learning except just doing it. I spent a lot of long (but enjoyable) hours learning stuff.</p>
<p>As I said above, I did not and do not consider it fruitful to "learn Django," "learn Ruby on Rails," or "learn <a class="reference external" href="http://webnoir.org">Noir</a>." I think a contributor to my success was learning the languages and the concepts behind them, then using a web framework to better learn that language. I learned the framework incidentally to my education in the language.</p>
<p>Go read the Django docs, join #django on irc.freenode.net and ask questions constantly. That's what I did and it worked ok for me. But honestly I didn't just sit down and read stuff most of the time. Usually I was making things in order to learn concepts better, then reading in support of my goals. I'm a hands-on learner. Some people aren't, but I am so it worked for me. Decide on your own if that's good for you.</p>
<p>As far as HTML & CSS there is just so much information out there, and they're such straightforward concepts. I learned as much HTML & CSS as I needed to do what I needed to do. I did not memorize much about how HTML & CSS work, i.e. syntax & semantics. I don't know right off the top of my head how to create a gradient, but I do know right off the top of my head how to find out. I think that's the important thing.</p>
<p><strong>How did you show the company your skills? Did you show them the projects you've made?</strong></p>
<p><a class="reference external" href="http://github.com/mattdeboard">Github, Github, Github</a>. I can't emphasize it enough. Make stuff, put it on github, show people you're passionate and smart and curious.</p>
<p>Also, network. Attend meetups. Meet people. Tweet. Blog. Interact with the community around your language(s). Get to know people. Demonstrate to the world that you really love programming. The week before I saw the job posting for my first programming job I delivered a lightning talk on <a class="reference external" href="http://fabfile.org">Fabric</a>, Python's Capistrano analog. That got me on a few people's radar.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>If I had to summarize the big overview of how I did what I did, I'd say:</p>
<ol class="arabic simple">
<li>Ask questions, be curious, be passionate</li>
<li>Learn a language, not a web framework for god's sake.</li>
<li>Work hard</li>
<li>Network, attend meetups, tweet, blog, be social and show people you'd be fun to work with and a credit to the team.</li>
<li>(Optional) Put yourself in a position of desperation, so there is no choice but to succeed</li>
</ol>
<p>My final point really is that I got lucky. I'm not an amazing developer. At the end of the day I'm a newb and I still have a lot to learn. My career is just beginning but I am proud of the effort I put into changing my life. I hope my experiences can help some other folks.</p>
<p>* <em>I should note that I was already of a mind to blog about this since my cousin Jeff has also taken up programming after leaving the environmental consultancy business.</em></p>
</div>
Export ALL Your Facebook Photos Easily
http://mattdeboard.net/2011/07/01/facebook-photo-export
2011-07-01T00:00:00Z
<a href="http://mattdeboard.net">Matt DeBoard</a>
<p>It's no secret that <a class="reference external" href="http://plus.google.com">Google+</a> is gaining new users as fast as the acceptance pipeline will let invitees click "Make me an account."</p>
<p>I love G+, and am thrilled that someone has finally, IMO, smashed Facebook's reign as top dog. There's been a poverty of choice for years when it comes to the social stuff. Google has hit it out of the park. If you are undecided about trying out G+, do it. It's well worth it.</p>
<p>At any rate, on to why I'm writing. If there's a way to download all your Facebook photos in one fell swoop, I don't know what it is. Of course, I don't use Facebook apps or anything, so I'm sure there's something there. It's just easier for me to write a script myself.</p>
<p>The script below downloads all of the pictures from your Facebook account and stores them in whatever directory you specify (default is your current working directory). Additionally, it creates a subdirectory for each album and tucks each photo into the appropriate subdir. This way, when you go to upload them to <a class="reference external" href="http://picasaweb.google.com">Picasa</a>, you can just create whatever Picasa folder, and just "select all" in a particular album subdirectory for easy uploadin'.</p>
<p>I guess I could plug this in to the Picasa API, and may do so this weekend.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">optparse</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">import</span> <span class="nn">facepy</span>
<span class="kn">from</span> <span class="nn">mytoken</span> <span class="kn">import</span> <span class="n">token</span><span class="p">,</span> <span class="n">username</span>
<span class="k">def</span> <span class="nf">get_photos</span><span class="p">(</span><span class="n">dl_dir</span><span class="p">):</span>
    <span class="n">dest</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="n">dl_dir</span><span class="p">)</span>
    <span class="n">p</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">r"[,!'\ /]"</span><span class="p">)</span>
    <span class="n">fb_photos</span> <span class="o">=</span> <span class="n">find_photos</span><span class="p">()</span>
    <span class="k">for</span> <span class="n">album</span> <span class="ow">in</span> <span class="n">fb_photos</span><span class="p">:</span>
        <span class="n">albname</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">"_"</span><span class="p">,</span> <span class="n">album</span><span class="p">)</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span>
        <span class="n">mk_album_dirs</span><span class="p">(</span><span class="n">dest</span><span class="p">,</span> <span class="n">albname</span><span class="p">)</span>
        <span class="n">folder</span> <span class="o">=</span> <span class="n">albname</span>
        <span class="k">for</span> <span class="n">img_url</span> <span class="ow">in</span> <span class="n">fb_photos</span><span class="p">[</span><span class="n">album</span><span class="p">][</span><span class="s">'images'</span><span class="p">]:</span>
            <span class="n">img_name</span> <span class="o">=</span> <span class="n">img_url</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'/'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
            <span class="n">url</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">img_url</span><span class="p">)</span>
            <span class="c"># Binary mode: image bytes must be written untouched.</span>
            <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="s">/</span><span class="si">%s</span><span class="s">/</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">dest</span><span class="p">,</span> <span class="n">folder</span><span class="p">,</span> <span class="n">img_name</span><span class="p">),</span> <span class="s">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                <span class="n">meta</span> <span class="o">=</span> <span class="n">url</span><span class="o">.</span><span class="n">info</span><span class="p">()</span>
                <span class="n">filesize</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">meta</span><span class="o">.</span><span class="n">getheaders</span><span class="p">(</span><span class="s">"Content-Length"</span><span class="p">)[</span><span class="mi">0</span><span class="p">])</span>
                <span class="c">#print "Downloading: %s Bytes: %s" % (img_name, filesize)</span>
                <span class="n">filesize_dl</span> <span class="o">=</span> <span class="mi">0</span>
                <span class="n">blocksize</span> <span class="o">=</span> <span class="mi">8192</span>
                <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                    <span class="n">buff</span> <span class="o">=</span> <span class="n">url</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">blocksize</span><span class="p">)</span>
                    <span class="k">if</span> <span class="ow">not</span> <span class="n">buff</span><span class="p">:</span>
                        <span class="k">break</span>
                    <span class="c"># Count the bytes actually read; the last block is usually short.</span>
                    <span class="n">filesize_dl</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">buff</span><span class="p">)</span>
                    <span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">buff</span><span class="p">)</span>
                    <span class="n">status</span> <span class="o">=</span> <span class="s">r"</span><span class="si">%10d</span><span class="s"> [</span><span class="si">%3.2f%%</span><span class="s">]"</span> <span class="o">%</span> <span class="p">(</span><span class="n">filesize_dl</span><span class="p">,</span>
                                                  <span class="n">filesize_dl</span> <span class="o">*</span> <span class="mf">100.</span> <span class="o">/</span> <span class="n">filesize</span><span class="p">)</span>
                    <span class="n">status</span> <span class="o">=</span> <span class="n">status</span> <span class="o">+</span> <span class="nb">chr</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">status</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
                    <span class="c">#print status,</span>
<span class="k">def</span> <span class="nf">find_photos</span><span class="p">():</span>
<span class="sd">'''</span>
<span class="sd"> Creates a dictionary, with album id as key and a list of images</span>
<span class="sd"> in the album as the value.</span>
<span class="sd"> '''</span>
<span class="n">albums</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">graph</span> <span class="o">=</span> <span class="n">facepy</span><span class="o">.</span><span class="n">GraphAPI</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>
<span class="n">my_albums</span> <span class="o">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="s">/albums"</span> <span class="o">%</span> <span class="n">username</span><span class="p">)</span>
<span class="k">for</span> <span class="n">album</span> <span class="ow">in</span> <span class="n">my_albums</span><span class="p">:</span>
<span class="n">albums</span><span class="p">[</span><span class="n">album</span><span class="p">[</span><span class="s">'name'</span><span class="p">]]</span> <span class="o">=</span> <span class="p">{}</span>
<span class="n">albums</span><span class="p">[</span><span class="n">album</span><span class="p">[</span><span class="s">'name'</span><span class="p">]][</span><span class="s">'id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">album</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span>
<span class="n">my_pics</span> <span class="o">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="s">/photos?limit=100"</span> <span class="o">%</span> <span class="n">album</span><span class="p">[</span><span class="s">'id'</span><span class="p">])</span>
<span class="n">albums</span><span class="p">[</span><span class="n">album</span><span class="p">[</span><span class="s">'name'</span><span class="p">]][</span><span class="s">'images'</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">pic</span><span class="p">[</span><span class="s">'source'</span><span class="p">]</span> <span class="k">for</span> <span class="n">pic</span> <span class="ow">in</span> <span class="n">my_pics</span><span class="p">]</span>
<span class="k">return</span> <span class="n">albums</span>
<span class="k">def</span> <span class="nf">mk_album_dirs</span><span class="p">(</span><span class="n">dest</span><span class="p">,</span> <span class="n">album</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Create a subfolder for each facebook album.</span>
<span class="sd"> '''</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="s">/</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">dest</span><span class="p">,</span> <span class="n">album</span><span class="p">)):</span>
<span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="s">"</span><span class="si">%s</span><span class="s">/</span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">dest</span><span class="p">,</span> <span class="n">album</span><span class="p">))</span>
<span class="k">return</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">()</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">optparse</span><span class="o">.</span><span class="n">OptionParser</span><span class="p">()</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_option</span><span class="p">(</span><span class="s">"-d"</span><span class="p">,</span> <span class="s">"--dest"</span><span class="p">,</span> <span class="n">action</span><span class="o">=</span><span class="s">"store"</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="s">"string"</span><span class="p">,</span>
<span class="n">dest</span><span class="o">=</span><span class="s">"dest_dir"</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">(),</span>
<span class="n">help</span><span class="o">=</span><span class="p">(</span><span class="s">"Specify the directory where you want your photos t"</span>
<span class="s">"o be downloaded. Photos will be downloaded to cur"</span>
<span class="s">"rent working dir by default."</span><span class="p">))</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span>
<span class="p">(</span><span class="n">options</span><span class="p">,</span> <span class="n">args</span><span class="p">)</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">(</span><span class="n">args</span><span class="p">)</span>
<span class="n">get_photos</span><span class="p">(</span><span class="n">options</span><span class="o">.</span><span class="n">dest_dir</span><span class="p">)</span>
</pre></div>
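<p>One detail of the download loop is worth isolating: the running total should grow by the number of bytes actually read, because the final chunk is usually shorter than the block size. Below is a minimal sketch of that arithmetic in Python 3. The helper name <cite>copy_with_progress</cite> and the in-memory streams standing in for the URL and the output file are assumptions for illustration, not part of the original script.</p>

```python
import io


def copy_with_progress(src, dst, total, blocksize=8192):
    """Copy src to dst in chunks; return (bytes_copied, percent_done)."""
    copied = 0
    while True:
        chunk = src.read(blocksize)
        if not chunk:
            break
        dst.write(chunk)
        # Count what was actually read, not the nominal block size:
        # the last chunk is usually shorter than blocksize.
        copied += len(chunk)
    percent = copied * 100.0 / total if total else 0.0
    return copied, percent


if __name__ == "__main__":
    payload = b"x" * 20000  # hypothetical stand-in for image bytes
    out = io.BytesIO()
    copied, percent = copy_with_progress(io.BytesIO(payload), out, len(payload))
    print("%10d [%3.2f%%]" % (copied, percent))
```

<p>Because the counter tracks real bytes, the percentage cannot overshoot 100 on the last partial block.</p>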
Changing Careers at 31http://mattdeboard.net/2011/06/17/career-change-in-your-30s-is-possible2011-06-17T00:00:00Z<a href="http://mattdeboard.net">Matt DeBoard</a><p>I won't bury the lead: About a month ago, I got my first job as a programmer after years of working in PR and marketing.</p>
<p>As I noted <a class="reference external" href="http://mattdeboard.net/2011/05/04/no-regrets">here</a>, I spent this spring as a "stay-at-home dad," and spent practically every waking moment becoming a better programmer, with the intent of joining the ranks of professional hackers and getting an awesome job making awesome things. Well, a few days after <a class="reference external" href="http://mattdeboard.net/2011/05/13/chebyshev-polynomials-in-latex">my last blog post</a>, an acquaintance I'd met through a local Python meetup <a class="reference external" href="http://twitter.com/#!/wehrlock/status/68811203329261568">tweeted a job opening</a>. I responded, interviewed, and amazingly enough, got the job.</p>
<p>I should point out that I live in Indiana. Development jobs using Python are <em>extremely</em> rare, and one using Django is rarer still. In fact, as far as I know, I may very well have snagged the only job <em>in Indiana</em> that offered the opportunity to work with both Python and Django.</p>
<p>I consider myself very fortunate. It is a great place to work, with smart people, and every day I do interesting things. Every day I learn something new. Working with geeks is <em>very</em> different from working with marketers. My boss's bookshelf is filled with books like <a class="reference external" href="http://www.amazon.com/Leading-Geeks-Manage-Deliver-Technology/dp/0787961485/ref=sr_1_1?ie=UTF8&qid=1308409661&sr=8-1"><em>Leading Geeks</em></a>. When I talk about something I read on <a class="reference external" href="http://news.ycombinator.com">HN</a>, there's a conversation, not a bunch of blank stares.</p>
<p>Though I get up at 5:30am to get Emma off to day camp and drop my girlfriend off downtown for her classes at <a class="reference external" href="http://iupui.edu">IUPUI</a>, I practically bounce out of bed. I love going to work. I'm a little disappointed when I have to go home for the night. Putting in those long hours reading and hacking has paid off. Best decision ever.</p>
<p>If you're curious, at work I'm working on deployment automation. It's not super sexy objectively speaking, but I feel like I've achieved a moderate level of expertise with <a class="reference external" href="http://fabfile.org">Fabric</a>. Plus, it has been a great way to learn the ins and outs of the various systems we use at work. Eventually I hope to roll it up into the Django admin panel and make provisioning and deployment as easy as clicking a few radio buttons.</p>
Chebyshev polynomials in LaTeXhttp://mattdeboard.net/2011/05/13/chebyshev-polynomials-in-latex2011-05-13T00:00:00Z<a href="http://mattdeboard.net">Matt DeBoard</a><p>I'm recovering from an obsession with <a class="reference external" href="http://mathworld.wolfram.com/ChebyshevPolynomialoftheFirstKind.html">Chebyshev polynomials</a>. Despite the fancy title and somewhat-intimidating definition, Chebyshev polynomials are actually a fantastic shortcut -- relative to what we're taught from the book -- to expanding trigonometric multiple-angle expressions like <cite>cos(6x)</cite>.</p>
<p>I was originally going to write a script that calculated the Chebyshev polynomials, but when I learned Python's <a class="reference external" href="http://www.scipy.org/">SciPy</a> library already has a function for that, I "pivoted." Instead I wrote the script below, which calculates the polynomial using scipy.special.orthogonal.chebyt(), then creates a <a class="reference external" href="http://www.latex-project.org/">LaTeX</a>-formatted string representation of the polynomial. For example, the output for the ninth-degree Chebyshev polynomial is rendered thusly:</p>
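<p>To make the shortcut concrete, here is the worked case for <cite>cos(6x)</cite>. It uses only the defining property of Chebyshev polynomials of the first kind and their standard three-term recurrence. The angle is written as &theta; here to keep it distinct from the polynomial variable <cite>x</cite>:</p>

```latex
T_n(\cos\theta) = \cos(n\theta), \qquad
T_0(x) = 1, \quad T_1(x) = x, \quad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)

T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1
\quad\Longrightarrow\quad
\cos(6\theta) = 32\cos^6\theta - 48\cos^4\theta + 18\cos^2\theta - 1
```

<p>Substituting <cite>x = cos &theta;</cite> into the polynomial gives the expansion directly, with no repeated use of the angle-addition formulas.</p>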
<img alt="http://mathbin.net/equations/62360_0.png" src="http://mathbin.net/equations/62360_0.png" />
<p>Here's the code; it should be pretty straightforward:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">scipy.special</span> <span class="kn">import</span> <span class="n">orthogonal</span> <span class="k">as</span> <span class="n">orth</span>

<span class="k">def</span> <span class="nf">chebyTex</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
    <span class="sd">'''Returns a LaTeX-formatted string for a Chebyshev polynomial of</span>
    <span class="sd">order n.'''</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">orth</span><span class="o">.</span><span class="n">chebyt</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
    <span class="n">coeffs</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">>=</span> <span class="mi">1</span> <span class="ow">or</span> <span class="n">i</span> <span class="o"><=</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
            <span class="n">coeffs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">i</span><span class="p">)))</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">pass</span>
    <span class="n">pows</span> <span class="o">=</span> <span class="p">[</span><span class="n">coeffs</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">*</span><span class="mi">2</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">coeffs</span><span class="p">]</span>
    <span class="n">pows</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="c"># The only "magic" in this function is some string manipulation to</span>
    <span class="c"># handle the LaTeX formatting for super- and subscript characters.</span>
    <span class="n">arrays</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="n">coeffs</span><span class="p">,</span> <span class="n">pows</span><span class="p">)</span>
    <span class="n">latex_string</span> <span class="o">=</span> <span class="s">'T_{</span><span class="si">%s</span><span class="s">}(x) = '</span> <span class="o">%</span> <span class="n">n</span>
    <span class="k">for</span> <span class="n">array</span> <span class="ow">in</span> <span class="n">arrays</span><span class="p">:</span>
        <span class="n">z</span> <span class="o">=</span> <span class="n">n</span><span class="o">-</span><span class="n">arrays</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">array</span><span class="p">)</span><span class="o">*</span><span class="mi">2</span>
        <span class="k">if</span> <span class="n">arrays</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">!=</span> <span class="n">array</span><span class="p">:</span>
            <span class="n">latex_string</span> <span class="o">+=</span> <span class="s">r'</span><span class="si">%s</span><span class="s">x'</span> <span class="o">%</span> <span class="n">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="n">latex_string</span> <span class="o">+=</span> <span class="s">r'^{</span><span class="si">%s</span><span class="s">} + '</span> <span class="o">%</span> <span class="n">z</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">n</span> <span class="o">%</span> <span class="mi">2</span><span class="p">:</span>
                <span class="n">latex_string</span> <span class="o">+=</span> <span class="s">'</span><span class="si">%s</span><span class="s">'</span> <span class="o">%</span> <span class="n">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">latex_string</span> <span class="o">+=</span> <span class="s">'</span><span class="si">%s</span><span class="s">x'</span> <span class="o">%</span> <span class="n">array</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">latex_string</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">s</span> <span class="o">=</span> <span class="n">chebyTex</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
    <span class="k">print</span> <span class="n">s</span>
</pre></div>
<p>It would be trivial to connect to something like <a class="reference external" href="http://mathbin.net">MathBin</a>, then pull down and store the resulting image, but that was beyond the scope of this little script.</p>
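<p>For anyone without SciPy handy, roughly the same string can be built from the three-term recurrence alone. This is a sketch in Python 3, not the script above: the function names <cite>chebyshev_coeffs</cite> and <cite>cheby_tex</cite> are made up for illustration, and it writes negative terms with an explicit minus sign rather than as "+ -".</p>

```python
def chebyshev_coeffs(n):
    """Integer coefficients of T_n, lowest power first, built with the
    recurrence T_{k+1}(x) = 2x*T_k(x) - T_{k-1}(x)."""
    prev, cur = [1], [0, 1]  # T_0 = 1, T_1 = x
    if n == 0:
        return prev
    for _ in range(n - 1):
        doubled = [0] + [2 * c for c in cur]  # multiply T_k by 2x
        padded = prev + [0] * (len(doubled) - len(prev))
        prev, cur = cur, [a - b for a, b in zip(doubled, padded)]
    return cur


def cheby_tex(n):
    """LaTeX string for T_n, highest power first, with explicit signs."""
    coeffs = chebyshev_coeffs(n)
    terms = []
    for power in range(n, -1, -1):
        c = coeffs[power]
        if c == 0:
            continue  # T_n only has every other power
        sign = "-" if c < 0 else "+"
        mag = abs(c)
        if power == 0:
            body = str(mag)
        else:
            body = ("" if mag == 1 else str(mag)) + "x"
            if power > 1:
                body += "^{%d}" % power
        terms.append((sign, body))
    first_sign, first_body = terms[0]
    out = ("-" if first_sign == "-" else "") + first_body
    for sign, body in terms[1:]:
        out += " %s %s" % (sign, body)
    return "T_{%d}(x) = %s" % (n, out)


if __name__ == "__main__":
    print(cheby_tex(6))
```

<p>Working with exact integers also sidesteps the rounding step needed when the coefficients come back as floats.</p>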
Python-Powered Smash'n'Grabhttp://mattdeboard.net/2011/05/12/smash-n-grab-with-lxml2011-05-12T00:00:00Z<a href="http://mattdeboard.net">Matt DeBoard</a><p>After watching and listening to my girlfriend wrestle with her course schedule for the fall semester, I got a "big idea" for another project. It's not ready to see the light of day, but suffice it to say it involves a better way of scheduling classes.</p>
<p>To start down the road of iterating on my project, I needed data. Specifically, I needed the schedule of every course offered by <a class="reference external" href="http://iupui.edu">IUPUI</a> in the fall: what days of the week each class was held, and at what times.</p>
<p>I assumed it would be as easy as sending a data request to the university helpdesk. I also assumed it would take 1-3 weeks for them to respond with the data. After all, they do make the schedule available <a class="reference external" href="http://registrar.iupui.edu/enrollment/4118/4118_standard.pdf">as a PDF</a>. That's clearly autogenerated, so they must have raw data sitting in a database somewhere, right?</p>
<p>The response I got was indecisive and confusing:</p>
<blockquote>
<p>"Hi Matt,</p>
<p>I'm sorry but <strong>we currently don't have a way for students to obtain this type of information</strong>. Contact your instructor or department to see if they can provide a dataset for you.</p>
<p>Also, the IUPUI Registrar's website might help build your own dataset <a class="reference external" href="http://registrar.iupui.edu/schedule.html">http://registrar.iupui.edu/schedule.html</a>.
Thanks,</p>
<p>SIS Help Desk"</p>
</blockquote>
<p><em>(Emphasis added)</em></p>
<p>I sent a follow-up email asking what the bolded text actually means, but got nothing back. So instead of waiting, I decided to just build my own dataset.</p>
<p>I used Python's <a class="reference external" href="http://lxml.de/">lxml library</a> to power a script that scrapes IUPUI's <a class="reference external" href="http://registrar.iupui.edu/enrollment/4118/">Schedule of Classes</a> sub-site. Then the script builds a JSON document populated with the course data relevant to my project. The structure of the sub-site, thankfully, is RESTful, which made writing the logic much easier.</p>
<p>I won't bore you with the nitty-gritty whys and wherefores of the problems I ran into here (plus, my code is commented). scrapeDepts() initializes the JSON file and populates it with department names:</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">string</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">lxml.html</span> <span class="kn">as</span> <span class="nn">lh</span>

<span class="n">jsonSched</span> <span class="o">=</span> <span class="s">'sched.json'</span>

<span class="k">def</span> <span class="nf">scrapeDepts</span><span class="p">():</span>
    <span class="sd">'''Scrape the departments and export to json.'''</span>
    <span class="n">divMain</span> <span class="o">=</span> <span class="n">parse</span><span class="p">(</span><span class="s">'http://registrar.iupui.edu/enrollment/4118/index.html'</span><span class="p">)</span>
    <span class="n">depts</span> <span class="o">=</span> <span class="p">[</span><span class="n">link</span><span class="o">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">divMain</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">".//a"</span><span class="p">)]</span>
    <span class="n">deptDict</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">dept</span> <span class="ow">in</span> <span class="n">depts</span><span class="p">:</span>
        <span class="n">d</span> <span class="o">=</span> <span class="n">parse</span><span class="p">(</span><span class="s">'http://registrar.iupui.edu/enrollment/4118/classes/</span><span class="si">%s</span><span class="s">/inde'</span>
                  <span class="s">'x.html'</span> <span class="o">%</span> <span class="n">dept</span><span class="p">)</span>
        <span class="n">crs</span> <span class="o">=</span> <span class="p">[{</span><span class="n">a</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' '</span><span class="p">,</span> <span class="s">''</span><span class="p">):</span> <span class="p">{}}</span> <span class="k">for</span> <span class="n">a</span> <span class="ow">in</span> <span class="n">d</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">".//a"</span><span class="p">)</span> <span class="k">if</span>
               <span class="n">a</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="n">dept</span><span class="p">)]</span>
        <span class="n">deptDict</span><span class="p">[</span><span class="n">dept</span><span class="p">]</span> <span class="o">=</span> <span class="n">crs</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="n">jsonSched</span><span class="p">),</span> <span class="s">'w+'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">json</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">deptDict</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
    <span class="k">return</span>
</pre></div>
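<p>The link-filtering step can be tried offline. lxml follows the ElementTree API, so this Python 3 sketch uses the standard library's <cite>xml.etree.ElementTree</cite> on a small, well-formed snippet instead of fetching the real site. The sample markup, the course numbers, and the helper name <cite>courses_for</cite> are all invented for illustration. Real-world HTML is rarely well-formed XML, which is where lxml's forgiving parser earns its keep.</p>

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a department index page (not real site markup).
SAMPLE_PAGE = """
<html><body><div id="main">
  <a href="CSCI23000.html">CSCI 23000</a>
  <a href="CSCI24000.html">CSCI 24000</a>
  <a href="../index.html">Back to departments</a>
</div></body></html>
"""


def courses_for(dept, page_source):
    """Return [{course_id: {}}, ...] for links whose text starts with dept."""
    root = ET.fromstring(page_source.strip())
    courses = []
    for a in root.findall(".//a"):
        text = a.text or ""  # a.text is None for empty anchors
        if text.startswith(dept):
            # "CSCI 23000" -> "CSCI23000", used as the JSON key
            courses.append({text.replace(" ", ""): {}})
    return courses


if __name__ == "__main__":
    print(json.dumps({"CSCI": courses_for("CSCI", SAMPLE_PAGE)}, indent=2))
```

<p>Each course starts out mapped to an empty dict, leaving a slot for the session data that gets filled in later.</p>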
<p>scrapeCourses() is heavily commented for my own sanity. I've got probably more list comprehensions than I need, but they're more readable this way. Plus, it works, and the part of the process the list comprehensions handle isn't going to impact total run time in any appreciable way on a dataset this small.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">scrapeCourses</span><span class="p">():</span>
    <span class="sd">'''Scrape the courses for each department.'''</span>
    <span class="n">item_counter</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="s">'sched.json'</span><span class="p">),</span> <span class="s">'r+'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">deptDict</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">deptDict</span><span class="o">.</span><span class="n">keys</span><span class="p">():</span>
        <span class="c"># item_counter keeps a running tally of all the department and</span>
        <span class="c"># course pages the parser touches. It increments once for a dept.</span>
        <span class="c"># page, and once for each course page on the department.</span>
        <span class="n">item_counter</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">deptDict</span><span class="p">[</span><span class="n">key</span><span class="p">]:</span>
            <span class="n">item_counter</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="c"># Some courses did not parse properly in scrapeDepts() so</span>
                <span class="c"># I had to include this try/except loop to handle</span>
                <span class="c"># IOErrors.</span>
                <span class="n">f</span> <span class="o">=</span> <span class="n">parse</span><span class="p">(</span><span class="s">'http://registrar.iupui.edu/enrollment/4118/class'</span>
                          <span class="s">'es/</span><span class="si">%s</span><span class="s">/</span><span class="si">%s</span><span class="s">.html'</span> <span class="o">%</span> <span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]))</span>
            <span class="k">except</span><span class="p">:</span>
                <span class="k">continue</span>
            <span class="c"># This is lxml syntax to find all <pre></pre> tags. `.//foo`</span>
            <span class="c"># finds all <foo></foo> tags.</span>
            <span class="n">pre</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">".//pre"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
            <span class="c"># The text content for the <pre> tag on a given dept/course</span>
            <span class="c"># web page comes through as an unformatted block of text. `t`</span>
            <span class="c"># is a list comprehension that splits this block of text into</span>
            <span class="c"># separate lines, including each separate line iff. it has at</span>
            <span class="c"># least one character. This conditional is necessary because</span>
            <span class="c"># splitlines() will include empty strings as lines. e.g.:</span>
            <span class="c">#</span>
            <span class="c"># ['hello world', '', 'my name is matt', '', 'how are you']</span>
            <span class="n">t</span> <span class="o">=</span> <span class="p">[</span><span class="n">l</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">pre</span><span class="o">.</span><span class="n">text_content</span><span class="p">()</span><span class="o">.</span><span class="n">splitlines</span><span class="p">()</span> <span class="k">if</span>
                 <span class="nb">len</span><span class="p">(</span><span class="n">l</span><span class="o">.</span><span class="n">strip</span><span class="p">())</span> <span class="o">></span> <span class="mi">0</span><span class="p">]</span>
            <span class="c"># `lines` is a list comprehension to gather all the lines</span>
            <span class="c"># from `t` that began with a digit. This is a heuristic</span>
            <span class="c"># particular to registrar.iupui.edu.</span>
            <span class="n">lines</span> <span class="o">=</span> <span class="p">[</span><span class="n">line</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">t</span> <span class="k">if</span> <span class="n">line</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="ow">in</span> <span class="n">string</span><span class="o">.</span><span class="n">digits</span><span class="p">]</span>
            <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">:</span>
                <span class="n">sid</span> <span class="o">=</span> <span class="s">'session</span><span class="si">%d</span><span class="s">'</span> <span class="o">%</span> <span class="n">lines</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="n">sid</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="s">'time'</span><span class="p">:</span> <span class="s">''</span><span class="p">,</span>
                                       <span class="s">'days'</span><span class="p">:</span> <span class="s">''</span><span class="p">}</span>
                <span class="k">try</span><span class="p">:</span>
                    <span class="c"># This regex matches string segments like:</span>
                    <span class="c"># '03:30P-04:45P MWF'</span>
                    <span class="c"># Exceptions are caused when a course is closed,</span>
                    <span class="c"># or when the times of the class are TBD.</span>
                    <span class="n">reg</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">r"(?P<time>\d+:\d+[AP]-\d+:\d+[AP]\W+[MTWRF"</span>
                                   <span class="s">"]{1,5})"</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span>
                    <span class="n">dt</span> <span class="o">=</span> <span class="n">reg</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s">'time'</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
                    <span class="n">time</span> <span class="o">=</span> <span class="n">dt</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
                    <span class="n">days</span> <span class="o">=</span> <span class="n">dt</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
                    <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="n">sid</span><span class="p">][</span><span class="s">'time'</span><span class="p">]</span> <span class="o">=</span> <span class="n">time</span>
                    <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="n">sid</span><span class="p">][</span><span class="s">'days'</span><span class="p">]</span> <span class="o">=</span> <span class="n">days</span>
                <span class="k">except</span> <span class="ne">AttributeError</span><span class="p">:</span>
                    <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="n">sid</span><span class="p">][</span><span class="s">'time'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'UNK'</span>
                    <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="n">sid</span><span class="p">][</span><span class="s">'days'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'UNK'</span>
                    <span class="k">continue</span>
                <span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>
                    <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="n">sid</span><span class="p">][</span><span class="s">'time'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'CLOSED'</span>
                    <span class="n">d</span><span class="p">[</span><span class="n">d</span><span class="o">.</span><span class="n">keys</span><span class="p">()[</span><span class="mi">0</span><span class="p">]][</span><span class="n">sid</span><span class="p">][</span><span class="s">'days'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'CLOSED'</span>
                    <span class="k">continue</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="s">'sched.json'</span><span class="p">),</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">json</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">deptDict</span><span class="p">,</span> <span class="n">f</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">item_counter</span>
</pre></div>
<p>parse() is a helper function for scrapeDepts() and scrapeCourses(). It parses the page at the given link and returns that page's div#main element.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="n">link</span><span class="p">):</span>
    <span class="k">print</span> <span class="o">>></span> <span class="n">sys</span><span class="o">.</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Parsing </span><span class="si">%s</span><span class="s">"</span> <span class="o">%</span> <span class="n">link</span><span class="p">[</span><span class="o">-</span><span class="mi">15</span><span class="p">:]</span>
    <span class="n">ind</span> <span class="o">=</span> <span class="n">lh</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">link</span><span class="p">)</span>
    <span class="k">print</span> <span class="o">>></span> <span class="n">sys</span><span class="o">.</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Parsing complete. Fetching div#main"</span>
    <span class="n">main</span> <span class="o">=</span> <span class="p">[</span><span class="n">div</span> <span class="k">for</span> <span class="n">div</span> <span class="ow">in</span> <span class="n">ind</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="s">".//div"</span><span class="p">)</span> <span class="k">if</span> <span class="n">div</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">"id"</span><span class="p">)</span> <span class="o">==</span> <span class="s">"main"</span><span class="p">]</span>
    <span class="k">print</span> <span class="o">>></span> <span class="n">sys</span><span class="o">.</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Fetch complete. Returning to main process."</span>
    <span class="k">return</span> <span class="n">main</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</pre></div>
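<p>As a rough sketch of the same lookup, the filter over every div can be collapsed into a single XPath-style predicate. This is an illustration under assumptions, not the scraper's actual code: it uses the standard library's xml.etree.ElementTree on a tiny hand-built tree instead of lxml on a live page, and find_main is a hypothetical helper name.</p>

```python
import xml.etree.ElementTree as ET


def find_main(root):
    """Return the first div whose id is "main", or None if there is none."""
    # The [@id='main'] predicate does the filtering that parse() does by
    # hand with a list comprehension over every div.
    return root.find(".//div[@id='main']")


# Build a tiny stand-in document (hypothetical; not the real course pages).
page = ET.Element("html")
body = ET.SubElement(page, "body")
ET.SubElement(body, "div", {"id": "nav"}).text = "menu"
ET.SubElement(body, "div", {"id": "main"}).text = "course listings"

main = find_main(page)
print("%s: %s" % (main.get("id"), main.text))
```

<p>One difference worth keeping in mind: this version returns None when the div is missing, whereas parse() raises IndexError through main[0], so any caller that relies on catching that exception would need adjusting.</p>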
<p>maketime() is basically the function I had wanted to write in the first place, had I been provided with some legit raw data. It loads the machine-readable data from sched.json and flattens it into a much more manageable data structure, in this case a list. Then, using the <a class="reference external" href="http://docs.python.org/library/time.html">time library</a>, it parses the string describing each course's start and end times into <a class="reference external" href="http://docs.python.org/library/time.html#time.struct_time">time.struct_time</a> objects. Finally, I use struct_time's attributes to turn those into a list of integers, so that 1:30PM becomes 1330.</p>
<div class="highlight"><pre><span class="k">def</span> <span class="nf">maketime</span><span class="p">():</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'sched.json'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">sched</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
    <span class="n">courses</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">sched</span><span class="o">.</span><span class="n">iterkeys</span><span class="p">():</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">sched</span><span class="p">[</span><span class="n">k</span><span class="p">]:</span>
            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">i</span><span class="o">.</span><span class="n">iterkeys</span><span class="p">():</span>
                <span class="k">for</span> <span class="n">h</span> <span class="ow">in</span> <span class="n">i</span><span class="p">[</span><span class="n">j</span><span class="p">]:</span>
                    <span class="n">courses</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">j</span><span class="p">,</span> <span class="n">h</span><span class="p">,</span> <span class="n">i</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">h</span><span class="p">][</span><span class="s">'time'</span><span class="p">],</span> <span class="n">i</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">h</span><span class="p">][</span><span class="s">'days'</span><span class="p">]])</span>
    <span class="c"># Strip out all 'UNK' and 'CLOSED' courses.</span>
    <span class="n">courses</span> <span class="o">=</span> <span class="p">[</span><span class="n">course</span> <span class="k">for</span> <span class="n">course</span> <span class="ow">in</span> <span class="n">courses</span> <span class="k">if</span> <span class="n">course</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">(</span><span class="s">'UNK'</span><span class="p">,</span> <span class="s">'CLOSED'</span><span class="p">)]</span>
    <span class="k">for</span> <span class="n">course</span> <span class="ow">in</span> <span class="n">courses</span><span class="p">:</span>
        <span class="n">timeSplit</span> <span class="o">=</span> <span class="n">course</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'-'</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">timeSplit</span><span class="p">:</span>
            <span class="n">y</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">strptime</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="s">"%I:%M%p"</span><span class="p">)</span>
            <span class="c"># Zero-pad the minutes so that 9:05 becomes 905 rather than 95.</span>
            <span class="n">minute</span> <span class="o">=</span> <span class="s">'</span><span class="si">%02d</span><span class="s">'</span> <span class="o">%</span> <span class="n">y</span><span class="o">.</span><span class="n">tm_min</span>
            <span class="n">hour</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">tm_hour</span><span class="p">)</span>
            <span class="n">y</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">hour</span><span class="o">+</span><span class="n">minute</span><span class="p">)</span>
            <span class="n">timeSplit</span><span class="p">[</span><span class="n">timeSplit</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">t</span><span class="p">)]</span> <span class="o">=</span> <span class="n">y</span>
        <span class="n">course</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">timeSplit</span>
    <span class="k">return</span> <span class="n">courses</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">scrapeDepts</span><span class="p">()</span>
    <span class="n">ic</span> <span class="o">=</span> <span class="n">scrapeCourses</span><span class="p">()</span>
    <span class="k">print</span> <span class="n">ic</span>
</pre></div>
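<p>To make the flatten-then-convert steps concrete, here is a minimal sketch run on hand-written sample data. The layout of SAMPLE is inferred from the loops in maketime(), the course names, section ids and times are made up, and flatten and to_military are hypothetical helper names. Treat it as an illustration under those assumptions, not as the script's real data.</p>

```python
import time

# Hypothetical sample in the shape maketime() appears to expect:
# {department: [{course: {section: {'time': ..., 'days': ...}}}]}
SAMPLE = {
    "MATH": [
        {"M118": {"1001": {"time": "9:05AM-10:20AM", "days": "MW"},
                  "1002": {"time": "UNK", "days": "UNK"}}},
        {"M119": {"2001": {"time": "CLOSED", "days": "CLOSED"}}},
    ],
}


def flatten(sched):
    """Turn the nested schedule into [course, section, time, days] rows."""
    rows = []
    for dept_courses in sched.values():
        for course in dept_courses:
            for name, sections in course.items():
                for sid, info in sections.items():
                    rows.append([name, sid, info["time"], info["days"]])
    return rows


def to_military(span):
    """Convert a 'start-end' string of 12-hour times into two integers."""
    result = []
    for part in span.split("-"):
        parsed = time.strptime(part, "%I:%M%p")
        # hour * 100 + minute keeps the minutes zero-padded for free.
        result.append(parsed.tm_hour * 100 + parsed.tm_min)
    return result


# Drop the sentinel rows first, then convert what is left.
rows = [r for r in flatten(SAMPLE) if r[2] not in ("UNK", "CLOSED")]
for row in rows:
    row[2] = to_military(row[2])
print(rows)
```

<p>Working with hour * 100 + minute instead of gluing strings together sidesteps the zero-padding question entirely, at the cost of drifting a little from the string-based approach in the script.</p>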
<p>It turned out I was scraping about 2,850 individual pages to compile the data, so each run of this script took about an hour. At least now I'm past that and can move on with the rest of the project, which I <em>hope</em> to start this weekend.</p>