Python-Powered Smash'n'Grab

written on Thursday, May 12, 2011

After watching and listening to my girlfriend wrestle with her course schedule for the fall semester, I got a "big idea" for another project. It's not ready to see the light of day, but suffice it to say it involves a better way of scheduling classes.

To start down the road of iterating on my project, I needed data. Specifically, I needed the schedule of every course offered by IUPUI in the fall: what days of the week each class was held, and at what times.

I assumed it would be as easy as sending a data request to the university helpdesk. I also assumed it would take 1-3 weeks for them to respond with the data. After all, they do make the schedule available as a PDF. That's clearly autogenerated, so they must have raw data sitting in a database somewhere, right?

The response I got was indecisive and confusing:

"Hi Matt,

I'm sorry but we currently don't have a way for students to obtain this type of information. Contact your instructor or department to see if they can provide a dataset for you.

Also, the IUPUI Registrar's website might **help build your own dataset**.

Thanks,

SIS Help Desk"

(Emphasis added)

I sent a follow-up email asking what the bolded text actually meant, but got nothing back. So instead of waiting, I decided to just make my own.

I used Python's lxml library to power a script that scrapes IUPUI's Schedule of Classes sub-site, then builds a JSON document populated with the course data relevant to my project. Thankfully, the sub-site's URL structure is RESTful, which made writing the scraping logic much easier.
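Before diving into the functions, it helps to see the shape of the JSON the script ends up producing: each department maps to a list of single-course dicts, and each course holds numbered sessions with their time and days. The department, course, and session values below are invented for illustration, not real IUPUI data:

```python
import json

# Hypothetical sample of the scraped schedule's shape (invented names):
# department -> [ {course: {sessionN: {time, days}}} ]
sample = {
    "CSCI": [
        {"CSCI230": {
            "session0": {"time": "03:30P-04:45P", "days": "MWF"},
            "session1": {"time": "UNK", "days": "UNK"},
        }},
    ],
}

print(json.dumps(sample, indent=2))
```

The single-key course dicts are why the code below keeps reaching for d.keys()[0].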

I won't bore you with the nitty-gritty whys and wherefores of the problems I ran into (plus, my code is commented). scrapeDepts() initializes the JSON file and populates it with department names:

import string
import json
import time
import sys
import os
import re
import lxml.html as lh

jsonSched = 'sched.json'

def scrapeDepts():
    '''Scrape the departments and export to json.'''
    divMain = parse('')
    depts = [link.text for link in divMain.findall(".//a")]
    deptDict = {}

    for dept in depts:
        d = parse(''
                  '%s.html' % dept)
        # Guard against anchor tags with no text.
        crs = [{a.text.replace(' ', ''): {}} for a in d.findall(".//a")
               if a.text is not None]
        deptDict[dept] = crs

    with open(os.path.abspath(jsonSched), 'w+') as f:
        json.dump(deptDict, f)


scrapeCourses() is heavily commented for my own sanity. I've probably got more list comprehensions than I need, but they're more readable this way. Plus, it works, and the part of the process the list comprehensions handle isn't going to impact total run time in any appreciable way on a dataset this small.

def scrapeCourses():
    '''Scrape the courses for each department.'''
    item_counter = 0
    with open(os.path.abspath('sched.json'), 'r+') as f:
        deptDict = json.load(f)

    for key in deptDict.keys():
        # item_counter keeps a running tally of all the department and
        # course pages the parser touches. It increments once for a dept.
        # page, and once for each course page in the department.
        item_counter += 1
        for d in deptDict[key]:
            item_counter += 1
            # Some courses did not parse properly in scrapeDepts(), so
            # I had to include this try/except to handle IOErrors.
            try:
                f = parse(''
                          'es/%s/%s.html' % (key, d.keys()[0]))
            except IOError:
                continue
            # This is lxml syntax to find all <pre></pre> tags. `.//foo`
            # finds all <foo></foo> tags.
            pre = f.findall(".//pre")[0]

            # The text content for the <pre> tag on a given dept/course
            # web page comes through as an unformatted block of text. `t`
            # is a list comprehension that splits this block of text into
            # separate lines, including each separate line iff. it has at
            # least one character. This conditional is necessary because
            # splitlines() will include empty strings as lines. e.g.:
            #    ['hello world', '', 'my name is matt', '', 'how are you']
            t = [l.strip() for l in pre.text_content().splitlines() if
                 len(l.strip()) > 0]

            # `lines` is a list comprehension to gather all the lines
            # from `t` that begin with a digit. This heuristic is
            # particular to these pages: session rows start with a
            # section number, so they're the only lines that begin
            # with a digit.
            lines = [line for line in t if line[0] in string.digits]

            for n, line in enumerate(lines):
                sid = 'session%d' % n
                d[d.keys()[0]][sid] = {'time': '',
                                       'days': ''}
                try:
                    # This regex matches string segments like:
                    #    '03:30P-04:45P     MWF'
                    # Exceptions are raised when a course is closed,
                    # or when the times of the class are TBD.
                    reg =r"(?P<time>\d+:\d+[AP]-\d+:\d+[AP]\W+[MTWRF"
                                    r"]{1,5})", line)
                    dt ='time').split()
                    d[d.keys()[0]][sid]['time'] = dt[0]
                    d[d.keys()[0]][sid]['days'] = dt[1]
                except AttributeError:
                    # returned None: no time/days on this line.
                    d[d.keys()[0]][sid]['time'] = 'UNK'
                    d[d.keys()[0]][sid]['days'] = 'UNK'
                except IndexError:
                    d[d.keys()[0]][sid]['time'] = 'CLOSED'
                    d[d.keys()[0]][sid]['days'] = 'CLOSED'

    with open(os.path.abspath('sched.json'), 'w') as f:
        json.dump(deptDict, f)

    return item_counter

parse() is a helper function for scrapeDepts() and scrapeCourses().

def parse(link):
    print >> sys.stderr, "Parsing %s" % link[-15:]
    ind = lh.parse(link)
    print >> sys.stderr, "Parsing complete. Fetching div#main"
    main = [div for div in ind.findall(".//div") if div.get("id") == "main"]
    print >> sys.stderr, "Fetch complete. Returning to main process."
    return main[0]
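Since lxml.html mirrors the stdlib ElementTree API, the div#main lookup in parse() can be sketched offline with xml.etree and an invented snippet of markup:

```python
import xml.etree.ElementTree as ET

# Invented markup standing in for a fetched schedule page.
html = ("<html><body><div id='nav'>menu</div>"
        "<div id='main'><a href='/csci'>CSCI</a></div></body></html>")
root = ET.fromstring(html)

# Same pattern as parse(): find every <div>, keep the one with id="main".
main = [div for div in root.findall(".//div") if div.get("id") == "main"][0]
links = [a.text for a in main.findall(".//a")]
print(links)  # ['CSCI']
```

(lxml.html also exposes a get_element_by_id() helper that would collapse the list comprehension to one call.)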

maketime() is basically the function I'd wanted to write in the first place, had I been handed some legit raw data. It takes the scraped data and turns it into a much more manageable data structure: in this case, a list. Then, using the time library, it transforms the strings describing each course's start and end times, first into time.struct_time objects, and finally, via struct_time's attributes, into plain integers.

def maketime():
    with open('sched.json', 'r') as f:
        sched = json.load(f)
    courses = []

    for k in sched.iterkeys():
        for i in sched[k]:
            for j in i.iterkeys():
                for h in i[j]:
                    courses.append([j, h, i[j][h]['time'], i[j][h]['days']])

    # Strip out all 'UNK' and 'CLOSED' courses.
    courses = [course for course in courses
               if course[2] not in ('UNK', 'CLOSED')]

    for course in courses:
        timeSplit = course[2].split('-')
        for n, t in enumerate(timeSplit):
            # The schedule uses a bare 'A'/'P'; append 'M' so the %p
            # directive can parse it as AM/PM.
            y = time.strptime(t + 'M', "%I:%M%p")

            if y.tm_min == 0:
                minute = '00'
            else:
                minute = str(y.tm_min)

            hour = str(y.tm_hour)
            timeSplit[n] = int(hour + minute)
        course[2] = timeSplit

    return courses
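The core of that conversion, pulled out into a standalone sketch. The helper name is mine, and it assumes the single-letter A/P suffix the schedule pages use:

```python
import time

def to_int(tstr):
    # Hypothetical helper: '03:30P' -> 1530. Append 'M' so strptime's
    # %p directive sees a full 'AM'/'PM'.
    st = time.strptime(tstr + 'M', "%I:%M%p")
    return st.tm_hour * 100 + st.tm_min

print(to_int('03:30P'), to_int('09:00A'))  # 1530 900
```

Multiplying the 24-hour value by 100 and adding the minutes gives the same integer as the string concatenation in maketime() for the on-the-quarter-hour times the schedule uses.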

if __name__ == '__main__':
    ic = scrapeCourses()
    print ic

Turned out I was scraping about 2,850 individual pages to compile the data, and the script took about an hour each time I ran it. At least now I'm past that and can move on with the rest of the project, which I hope to start this weekend.

This entry was tagged lxml, parsing and python