scvalex.net

23. Zingr

Since Google has decided to shut down Google Reader, we should do what any self-respecting software developer would do: complain about it on Hacker News, and write a replacement that works for us. Today, we’ll be writing Zingr (“Zingr is not Google Reader”), a single-user web-based news aggregator in Python, SQLite3, Flask, Mootools, and Knockout.

The goal is to have a single Python script that opens an SQLite3 database, queries the subscribed feeds, fetches them, parses them, inserts the results back into the database, and serves a nice Knockout-based webapp frontend. For the impatient, the finished project is on GitHub and this is what it looks like:

Zingr Web UI

We start with the Python backend. The Flask microframework powers the web interface. Since we have a feed fetcher thread and a web thread, we import threading. We want to make migrations from Google Reader painless, so we support parsing the OPML files it exports subscriptions to; we use xml.dom.minidom for this. We store everything in an sqlite3 database, and we need a few other standard modules for various tasks. Finally, we parse feeds with feedparser (pip install feedparser, or easy_install feedparser).

from __future__ import print_function

from flask import Flask, Response, send_from_directory, request, json
app = Flask(__name__)

from threading import Thread
import xml.dom.minidom
import sqlite3, os, time, logging, datetime

import feedparser

As an aside, we use plain SQL. Although it would be more hip to use something like SQLAlchemy, it seems overkill in this case. Every time we need to access the database, we open a fresh connection to the file named below, do what we need, and drop it.

DB_NAME = "zingr.db"

Backend

The web API we expose is the minimal one for the task at hand.

/             - serve the Zingr HTML
/r/*          - serve static resources
/feeds        - return a JSON with the urls and titles
                of available feeds
/add-feed     - add the feed in the given url
/import-opml  - add the feeds in the uploaded OPML
/feed-entries - return a JSON with the contents and
                metadata of the entries in the given
                feed
/mark-read    - mark the given entry as read (and
                don't return it in future feed-entries)

The database is similarly minimal: we only have tables for feeds and feed entries.

def init_db():
    """Initialise the database, if it does not already exist."""
    if not os.path.exists(DB_NAME):
        with sqlite3.connect(DB_NAME) as db:
            db.execute("CREATE TABLE feeds ( title TEXT, url TEXT PRIMARY KEY ) ")
            db.execute("CREATE TABLE entries ( updated TEXT, feed TEXT, title TEXT, url TEXT, content TEXT, read INTEGER, CONSTRAINT entries_pkey PRIMARY KEY ( feed, url ) )")
            db.commit()

We first provide a function to start the web app:

def start_server():
    app.run(debug=True, use_reloader=False)

We implement the first two API calls by serving static files.

@app.route("/")
def index():
    return send_from_directory(app.root_path, "index.html")

@app.route("/r/<path:filename>")
def resource(filename):
    return send_from_directory("r", filename)

To get the list of available feeds, we query the database. We include with each feed a count of unread entries.

@app.route("/feeds")
def feeds():
    saved_feeds = []
    with sqlite3.connect(DB_NAME) as db:
        saved_feeds = [{"title": title,
                        "url": url}
                       for title, url in db.execute("SELECT title, url FROM feeds").fetchall()]
        for feed in saved_feeds:
            count = db.execute("SELECT COUNT(*) FROM entries WHERE feed=? AND read=0",
                               [feed["url"]]).fetchone()[0]
            feed["count"] = count
    return Response(json.dumps(saved_feeds), mimetype="application/json")
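
As an aside, the per-feed count could also be computed in one query. The following is only a sketch of that alternative, not what Zingr does: it assumes the same two tables as init_db, runs against a throwaway in-memory database, and uses made-up example.com-style URLs. The helper name unread_counts is hypothetical.

```python
import sqlite3

def unread_counts(db):
    """Return one dict per feed with its unread count, using a single query."""
    rows = db.execute(
        "SELECT f.title, f.url, COUNT(e.url) "
        "FROM feeds f LEFT JOIN entries e "
        "ON e.feed = f.url AND e.read = 0 "
        "GROUP BY f.url, f.title ORDER BY f.url").fetchall()
    return [{"title": t, "url": u, "count": c} for t, u, c in rows]

# Throwaway in-memory database with the same schema as init_db().
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE feeds ( title TEXT, url TEXT PRIMARY KEY )")
db.execute("CREATE TABLE entries ( updated TEXT, feed TEXT, title TEXT, "
           "url TEXT, content TEXT, read INTEGER, "
           "CONSTRAINT entries_pkey PRIMARY KEY ( feed, url ) )")
db.execute("INSERT INTO feeds VALUES ('A', 'http://a.example/feed')")
db.execute("INSERT INTO feeds VALUES ('B', 'http://b.example/feed')")
# Feed A has one unread and one read entry; feed B has none.
db.execute("INSERT INTO entries VALUES ('2013-03-01 10:00:00', "
           "'http://a.example/feed', 'one', 'http://a.example/1', '', 0)")
db.execute("INSERT INTO entries VALUES ('2013-03-02 10:00:00', "
           "'http://a.example/feed', 'two', 'http://a.example/2', '', 1)")

counts = unread_counts(db)
print(counts)
```

The LEFT JOIN keeps feeds that have no unread entries, and COUNT(e.url) ignores the NULLs the join produces for them, so such feeds get a count of zero.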

Adding a feed is just a matter of inserting a row into the database. The add-feed API call just exposes this to the web. The import-opml one also needs to get the uploaded OPML file and parse it.

def addFeedToDb(feedUrl, db):
    """Insert feed into database if it is not already present."""
    if db.execute("SELECT * FROM feeds WHERE url = ?", [feedUrl]).fetchone() is not None:
        app.logger.warning("Feed %s already exists" % (feedUrl,))
    else:
        db.execute("INSERT INTO feeds VALUES (?, ?)", [feedUrl, feedUrl])
        db.commit()

@app.route("/add-feed", methods=["POST"])
def addFeed():
    url = request.form.get("url")
    if url:
        app.logger.info("Adding feed %s" % (url,))
        with sqlite3.connect(DB_NAME) as db:
            addFeedToDb(url, db)
        fetch_feeds()
    return feeds()

@app.route("/import-opml", methods=["POST"])
def importOpml():
    opmlFile = request.files.get("opml-file")
    if opmlFile:
        dom = xml.dom.minidom.parse(opmlFile)
        feedElements = dom.getElementsByTagName("outline")
        with sqlite3.connect(DB_NAME) as db:
            for fe in feedElements:
                # Folder outlines carry no xmlUrl attribute; skip them.
                feedUrl = fe.getAttribute("xmlUrl")
                if feedUrl:
                    addFeedToDb(feedUrl, db)
        fetch_feeds()
    return feeds()
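
To see what the OPML parsing step works with, here is a small standalone sketch. The OPML document is a made-up example in the shape Google Reader exports (folders are outline elements that wrap feed outlines), and feed_urls is a hypothetical helper, not part of Zingr.

```python
import xml.dom.minidom

# A made-up OPML export: one folder containing a feed, plus a top-level feed.
OPML = """<opml version="1.0">
  <body>
    <outline title="Programming" text="Programming">
      <outline type="rss" title="Example A" text="Example A"
               xmlUrl="http://a.example/feed" htmlUrl="http://a.example/"/>
    </outline>
    <outline type="rss" title="Example B" text="Example B"
             xmlUrl="http://b.example/rss" htmlUrl="http://b.example/"/>
  </body>
</opml>"""

def feed_urls(opml_text):
    """Return the xmlUrl of every outline element that has one."""
    dom = xml.dom.minidom.parseString(opml_text)
    urls = []
    for outline in dom.getElementsByTagName("outline"):
        # getAttribute returns "" for a missing attribute, e.g. on folders.
        url = outline.getAttribute("xmlUrl")
        if url:
            urls.append(url)
    return urls

urls = feed_urls(OPML)
print(urls)
```

Note that getElementsByTagName walks the whole tree, so nested feeds are found without any recursion on our side.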

We’ve been using HTTP POST requests for every API call that carries data, rather than more RESTful alternatives. We do this to make the API calls uniform: import-opml has to be a POST because it uploads a file, so the other calls that send data are POSTs too.

The feed-entries call is somewhat longer because it deals with more table columns. It returns only unread entries, and sorts them by date, newest-first; we store dates in international (ISO 8601-style) format (2012-12-25 13:00:00 is 1 PM last Christmas), whose fields are zero-padded and run from most to least significant, so a lexicographic sort is enough.

@app.route("/feed-entries", methods=["POST"])
def feedEntries():
    feed_url = request.form.get("url")
    entries = []
    if feed_url is not None:
        with sqlite3.connect(DB_NAME) as db:
            entries = [{"updated": updated,
                        "title": title,
                        "link": link,
                        "content": content,
                        "read": read}
                       for (updated, title, link, content, read)
                       in db.execute("SELECT updated, title, url, content, read FROM entries WHERE feed=? AND read=0",
                                     [feed_url]).fetchall()]
    # Newest first; ISO-style timestamps sort correctly as plain strings.
    entries.sort(key=lambda entry: entry["updated"], reverse=True)
    return Response(json.dumps(entries), mimetype="application/json")
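
The claim that a lexicographic sort is enough is easy to check. This sketch uses made-up timestamps and compares sorting the strings directly with sorting by the parsed datetime values; it also shows a date format for which the trick would not work.

```python
import datetime

stamps = ["2012-12-25 13:00:00", "2013-01-02 09:30:00", "2012-02-10 08:00:00"]

# Sort the raw strings, newest first.
by_text = sorted(stamps, reverse=True)

# Sort by the actual points in time, newest first.
def parse(stamp):
    return datetime.datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")

by_time = sorted(stamps, key=parse, reverse=True)
print(by_text == by_time)

# With month/day/year strings, text order no longer matches time order:
# the 2012 date sorts ahead of the 2013 one.
us_style = sorted(["12/25/2012", "01/02/2013"], reverse=True)
print(us_style)
```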

The last API call, mark-read, is just a row update in the database.

@app.route("/mark-read", methods=["POST"])
def markRead():
    feed_url = request.form.get("feed_url")
    url = request.form.get("url")
    if feed_url is not None and url is not None:
        app.logger.info("Marking as read %s from %s" % (url, feed_url))
        with sqlite3.connect(DB_NAME) as db:
            db.execute("UPDATE entries SET read=1 WHERE feed=? AND url=?", [feed_url, url])
            db.commit()
    return "ok"

We move on to our feed fetching component. First, we need a function that, given a URL, downloads the feed, parses it, and inserts any new entries into the database. Since we’re using feedparser, fetching and parsing feeds is simply the call feedparser.parse(url). We then reformat the date into international format, and insert everything into the database. The most likely way the insertion can fail is if the new entry is already in the table; in this case, we carry on handling other entries.

def fetch_feed(url):
    """Fetch feed, insert new entries into database."""
    try:
        app.logger.info("Fetching feed %s" % (url,))
        feed = feedparser.parse(url)
        with sqlite3.connect(DB_NAME) as db:
            feedTitle = feed.feed.title
            db.execute("UPDATE feeds SET title=? WHERE url=?", [feedTitle, url])
            newEntries = 0
            for entry in feed.entries:
                title = entry.title
                content = entry.description
                link = entry.link
                updated = datetime.datetime(*(entry.published_parsed[0:6])).isoformat(" ")
                try:
                    db.execute("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)",
                               [updated, url, title, link, content, 0])
                    newEntries += 1
                except Exception as e:
                    # We're ignoring entry updates for now.
                    # app.logger.warning("problem inserting %s into %s" % (link, url))
                    pass
            db.commit()
            app.logger.info("Inserted %d new entries" % (newEntries,))
        app.logger.info("Fetched feed %s" % (url,))
    except Exception as e:
        app.logger.warning("Error processing feed %s:\n%s" % (url, str(e)))
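
Two details of fetch_feed are worth trying in isolation: the date reformatting, and what happens when an entry is inserted twice. The sketch below does not call feedparser; it builds a time.struct_time by hand, on the assumption that this is the kind of value published_parsed holds, and it uses an in-memory database with the entries table from init_db. It catches sqlite3.IntegrityError specifically, which is narrower than the catch-all in fetch_feed.

```python
import datetime
import sqlite3
import time

# Stand-in for entry.published_parsed (assumed to be a time.struct_time).
published_parsed = time.strptime("2012-12-25 13:00:00", "%Y-%m-%d %H:%M:%S")

# The first six fields are year, month, day, hour, minute, second.
updated = datetime.datetime(*published_parsed[0:6]).isoformat(" ")
print(updated)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE entries ( updated TEXT, feed TEXT, title TEXT, "
           "url TEXT, content TEXT, read INTEGER, "
           "CONSTRAINT entries_pkey PRIMARY KEY ( feed, url ) )")

row = [updated, "http://a.example/feed", "A title", "http://a.example/1", "body", 0]
inserted = 0
for _ in range(2):
    try:
        db.execute("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)", row)
        inserted += 1
    except sqlite3.IntegrityError:
        # Same (feed, url) pair as an existing row: the primary key rejects it.
        pass

stored = db.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
print(inserted, stored)
```

The second insert is rejected by the composite primary key, so re-fetching a feed does not duplicate entries or reset their read flag.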

To update all the feeds, we run the previous function over all the feeds recorded in the database. Then, we do this periodically on a separate thread.

def fetch_feeds():
    """Fetch all feeds in database."""
    with sqlite3.connect(DB_NAME) as db:
        for url in (row[0] for row in db.execute("SELECT url FROM feeds").fetchall()):
            fetch_feed(url)

def periodically_fetch_feeds():
    app.logger.info("Feed fetcher started")
    while True:
        fetch_feeds()
        time.sleep(10 * 60)     # sleep for 10min
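
The loop above sleeps unconditionally, which is fine for a daemon thread. If we ever wanted to stop the fetcher on request, one possible variation (a sketch, not what Zingr does) is to wait on a threading.Event instead of sleeping. The names run_periodically and fake_fetch are hypothetical, and the interval is shortened so the example finishes quickly.

```python
import threading

def run_periodically(task, interval, stop):
    """Call task() repeatedly, waiting `interval` seconds between calls,
    until the `stop` event is set."""
    while not stop.is_set():
        task()
        # Like time.sleep(interval), but returns early once stop is set.
        stop.wait(interval)

stop = threading.Event()
calls = []

def fake_fetch():
    # Stand-in for fetch_feeds(); asks the loop to stop after three runs.
    calls.append(len(calls))
    if len(calls) == 3:
        stop.set()

worker = threading.Thread(target=run_periodically, args=(fake_fetch, 0.01, stop))
worker.daemon = True
worker.start()
worker.join(5)
print(calls, worker.is_alive())
```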

Finally, we tie our entire backend together in the main function, which initializes the database, sets up logging to a file, starts up the web API and feed fetcher threads, and sets up the program termination.

def main():
    init_db()

    # Setup logging.
    file_handler = logging.FileHandler("zingr.log")
    log_format = "%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]"
    file_handler.setFormatter(logging.Formatter(log_format))
    app.debug_log_format = log_format
    app.logger.setLevel(logging.INFO)
    file_handler.setLevel(logging.INFO)
    app.logger.addHandler(file_handler)
    app.logger.info("zingr starting")

    webserver = Thread(target = start_server)
    webserver.daemon = True
    webserver.start()
    feed_fetcher = Thread(target = periodically_fetch_feeds)
    feed_fetcher.daemon = True
    feed_fetcher.start()
    app.logger.info("zingr started")

    # see http://www.regexprn.com/2010/05/killing-multithreaded-python-programs.html
    while True:
        try:
            webserver.join(100)
            feed_fetcher.join(100)
        except KeyboardInterrupt as e:
            break

if __name__ == "__main__":
    main()

There are a few issues with the design of the backend. First, there’s the lack of error handling. Then, there’s the direct use of the database, which we re-open for each access. Finally, we make no effort to ensure that we close the database cleanly: sqlite3’s with block commits or rolls back the transaction, but it does not close the connection. None of these would be admissible for a “real” project, but this is just a hobby project nobody will ever use (I use NewsBlur).
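
For the last point, one way to close connections deterministically is to wrap them in contextlib.closing. This is only a sketch of that idea, using an in-memory database and the feeds table from init_db; it is not how the Zingr code above is written.

```python
import sqlite3
from contextlib import closing

# closing() calls db.close() on exit; the inner "with db" handles the
# transaction (commit on success, rollback on an exception).
with closing(sqlite3.connect(":memory:")) as db:
    with db:
        db.execute("CREATE TABLE feeds ( title TEXT, url TEXT PRIMARY KEY )")
        db.execute("INSERT INTO feeds VALUES (?, ?)",
                   ["Example", "http://a.example/feed"])
    feed_count = db.execute("SELECT COUNT(*) FROM feeds").fetchone()[0]

# Once the outer block has exited, the connection is unusable.
try:
    db.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(feed_count, closed)
```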

Frontend

The web UI is a single-page HTML5/Javascript application. We don’t use any CSS framework, but we do use the Mootools and Knockout Javascript libraries.

Before talking about Knockout, here’s the webapp’s HTML:

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    <title>zingr</title>
    <link rel="stylesheet" href="/r/screen.css" media="screen">

    <script src="http://ajax.aspnetcdn.com/ajax/knockout/knockout-2.2.1.js"></script>
    <script src="http://ajax.googleapis.com/ajax/libs/mootools/1.4.5/mootools-yui-compressed.js"></script>
    <script src="/r/zingr.js"></script>
  </head>
  <body>
    <header>
      <h1>zingr</h1>
      <div id="controls">
        <form data-bind="submit: addFeedClicked">
          <input id="newFeedInput" data-bind="value: newFeedUrl, visible: addingFeed" />
          <button id="addFeedButton" type="submit">Add Feed</button>
        </form>
        <form data-bind="submit: importOpmlClicked">
          <input id="opmlInput" type="file" data-bind="visible: importingOpml" />
          <button id="importOpmlButton" type="submit">Import OPML</button>
        </form>
        <button id="reloadButton" data-bind="click: reload">Reload</button>
      </div>
    </header>

    <ul id="feeds" data-bind="foreach: feeds">
      <li class="feed" data-bind="click: $root.selectFeed, css: { selected: selected() }">
        <span data-bind="text: title"></span>
        <span class="count" data-bind="text: count, css: { unread: count > 0 }"></span>
      </li>
    </ul>

    <ul id="feedContent" data-bind="foreach: selectedFeedEntries">
      <li class="feedEntry">
        <h3><a data-bind="attr: { href: link }"><span data-bind="text: title"></span></a></h3>
        <div class="date" data-bind="text: updated"></div>
        <div class="content" data-bind="html: content"></div>
      </li>
    </ul>
  </body>
</html>

Unsurprisingly, it has a header with some controls, a list of feeds, and a list of feed entries. More interesting are the various data-bind attributes. When Knockout initializes, it finds these attributes, and binds the HTML elements to the corresponding Javascript variables. Then, when the variable changes, Knockout automatically updates the HTML. The official documentation is the definitive reference, but, to give an example, data-bind="text: title" conceptually means “the text of this HTML element is the value of the Javascript variable title”. So, we’re using Knockout to avoid messing with the DOM manually.

Moving on to the code, we first write a wrapper around console.log, which some old browsers don’t have.

function log() {
    // A bare reference to an undeclared console would throw, so use typeof.
    if (typeof console !== "undefined" && console.log) {
        console.log.apply(console, arguments);
    }
}

We work with just two types of objects, Feeds and FeedEntrys, which encapsulate static information about the feeds and their entries. The logic behind ko.observable is explained further down.

function Feed(feed) {
    var self = this;

    self.title = feed.title;
    self.url = feed.url;
    self.count = feed.count;
    self.selected = ko.observable(false);
}

function FeedEntry(entry, feed) {
    var self = this;

    self.updated = entry.updated;
    self.link = entry.link;
    self.content = entry.content;
    self.title = entry.title;
    self.read = ko.observable(entry.read != 0);
    self.feed = feed;
}

Now comes the interesting part, Knockout’s “model”, or the state of our webapp. First, we need the list of feeds. By making it a ko.observable, we’re letting Knockout handle propagating updates to it. The only function that changes feeds is setFeeds. It takes a list of objects representing feeds, turns them into Feed objects, and updates feeds with the new list (a ko.observable is really a function; to get its value, you call the function without parameters; to update its value, you call the function with one parameter). When setFeeds updates feeds, the HTML in #feeds also updates automatically thanks to data-bind="foreach: feeds", and the data-binds in the child elements.

function AppViewModel() {
    var self = this;

    self.feeds = ko.observable([]);

    self.setFeeds = function(feeds) {
        self.feeds(feeds.map(function(feed) {
            return (new Feed(feed));
        }));
    }
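
The “observable is really a function” idea is not specific to JavaScript. As a rough analogy in Python (a toy, and in no way Knockout’s implementation), here is a callable object that returns its value when called with no arguments, and stores a new value and notifies subscribers when called with one. The class and its subscribe method are made up for this illustration.

```python
class Observable(object):
    """Toy analogue of an observable value: call with no arguments to
    read, call with one argument to write and notify subscribers."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def __call__(self, *args):
        if not args:
            return self._value
        self._value = args[0]
        for callback in self._subscribers:
            callback(self._value)

    def subscribe(self, callback):
        self._subscribers.append(callback)

feeds = Observable([])
rendered = []
# Stand-in for the UI: record every value we are told about.
feeds.subscribe(lambda value: rendered.append(list(value)))

feeds(["http://a.example/feed"])   # write: triggers the subscriber
current = feeds()                  # read
print(current, rendered)
```

In the webapp, the role of the subscriber is played by the data-bind attributes: they are what re-render the HTML when the value changes.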

Next, we need a way to add new feeds. The way the “Add Feed” button works is this: when you click it the first time, a text box appears for you to enter the URL; the second time you click, it actually adds the feed specified in the text box. The data-bind that does this is value: newFeedUrl, visible: addingFeed. So, addingFeed controls the visibility of the #newFeedInput. We then add a feed by calling addFeed, which sends a POST request to our server.

    self.addingFeed = ko.observable(false);
    self.newFeedUrl = ko.observable("");

    self.addFeedClicked = function(e) {
        self.addingFeed(!self.addingFeed());
        if (self.addingFeed()) {
            self.newFeedUrl("");
            $("newFeedInput").focus();
        } else {
            self.addFeed(self.newFeedUrl());
        }
    }

    self.addFeed = function(url) {
        log("Adding feed: ", url);
        (new Request.JSON({
            url: "/add-feed",
            onSuccess: function(feeds) {
                log("Added feed: ", url);
                self.setFeeds(feeds);
            }
        })).send("url="+encodeURIComponent(url));
    }
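
One thing to keep in mind with hand-built POST bodies is that feed URLs often contain ? and & themselves. The sketch below (Python 3, standard library only, with a made-up feed URL) shows how a form decoder such as the one behind request.form would be expected to split an unescaped body, compared with a percent-encoded one.

```python
from urllib.parse import parse_qs, urlencode

feed_url = "http://example.com/feed?format=rss&lang=en"

# Body built by plain string concatenation, with no escaping.
naive_body = "url=" + feed_url
# Body with the value percent-encoded.
encoded_body = urlencode({"url": feed_url})

naive = parse_qs(naive_body)
encoded = parse_qs(encoded_body)

print(naive)    # the URL is cut at the first "&"
print(encoded)  # the URL survives intact
```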

Importing OPMLs uses the same logic as above. The only extra complexity is in importOpml, where we upload a file to the server. Since Mootools doesn’t seem to have a wrapper for this, we do it manually.

    self.importingOpml = ko.observable(false);

    self.importOpmlClicked = function(e) {
        self.importingOpml(!self.importingOpml());
        if (self.importingOpml()) {
            $("opmlInput").focus();
        } else {
            self.importOpml($("opmlInput").files);
        }
    }

    self.importOpml = function(fs) {
        var f = fs[0];
        log("Adding OPML from file: ", f);
        var formData = new FormData();
        formData.append("opml-file", f);
        var req = new XMLHttpRequest();
        req.open("POST", "import-opml");
        req.onload = function(event) {
            var feeds = JSON.parse(event.target.responseText);
            log("Got back feeds: ", feeds);
            self.setFeeds(feeds);
        };
        req.send(formData);
    }

The last button visible to the user is “Reload”, which gets all the feeds from the server. The data-bind for it is click: reload. The reload function is called periodically to keep the webapp synchronized with the server. The special case is the first time it’s called, when it sets the current feed to the first one.

    self.reload = function() {
        (new Request.JSON({
            url: "/feeds",
            onSuccess: function(feeds) {
                log("Reloaded feeds: ", feeds);
                self.setFeeds(feeds);
                if (self.feeds().length > 0 && !self.selectedFeed()) {
                    self.selectFeed(self.feeds()[0]);
                }
            }
        })).get();
    }

Whenever a user clicks on a feed, it becomes the selected one. The data-bind for this is click: $root.selectFeed, css: { selected: selected() }. Note that when a feed becomes selected, it is automatically highlighted in the HTML. We then request feed entries for it.

    self.selectedFeed = ko.observable(null);

    self.selectFeed = function(feed) {
        log("Select feed ", feed);
        if (self.selectedFeed()) {
            self.selectedFeed().selected(false);
        }
        feed.selected(true);
        self.selectedFeed(feed);
        $("feedContent").scrollTo(0);

        self.getFeedEntries(feed);
    }

Getting entries for the selected feed is just a matter of POSTing to the server. When selectedFeedEntries updates, the HTML is also automatically updated.

    self.selectedFeedEntries = ko.observable([]);

    self.getFeedEntries = function(feed) {
        (new Request.JSON({
            url: "/feed-entries",
            onSuccess: function(entries) {
                log("Got entries for ", feed, ": ", entries);

                self.selectedFeedEntries(entries.map(function(entry) {
                    return (new FeedEntry(entry, feed));
                }));
                self.checkRead();
            }
        })).send("url="+encodeURIComponent(feed.url));
    }

Marking a feed entry as read is just a POST to the server; the tricky part is deciding when a feed entry has been read. We just use Google Reader’s algorithm for this: if a feed entry’s end is visible on the screen, we assume it has been read. So, checkRead, which is called every time the user scrolls, goes through all the entries, and marks the ones whose end is visible as read.

    self.markRead = function(entry) {
        log("Marking as read: ", entry.title);
        (new Request({
            url: "/mark-read",
            onSuccess: function() {
                log("Marked as read: ", entry.title);
                entry.read(true);
            }
        })).send("feed_url="+encodeURIComponent(entry.feed.url)+"&url="+encodeURIComponent(entry.link));
    }

    self.checkRead = function() {
        var feedContentE = $("feedContent");
        var feedEntriesE = feedContentE.getElements("li.feedEntry");
        self.selectedFeedEntries().reduce(function (top, entry, i) {
            // top is the offset of this entry's start; bottom is its end.
            var bottom = top + feedEntriesE[i].getSize().y;
            // If the end of an entry has scrolled into view, it is read.
            if (bottom <= feedContentE.scrollTop + feedContentE.clientHeight
                && !entry.read()) {
                self.markRead(entry);
            }
            return bottom;
        }, 0);
    }
}
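
The “end is visible” rule is easier to reason about without the DOM. Here is a sketch of it in Python as a pure function: given the entry heights in pixels, the scroll offset, and the height of the visible area, it returns the indices of the entries that count as read. The function name and the numbers are made up, and the sketch ignores margins between entries.

```python
def entries_read(heights, scroll_top, client_height):
    """Return the indices of entries whose bottom edge lies at or above
    the bottom of the visible area."""
    visible_bottom = scroll_top + client_height
    read = []
    top = 0
    for i, height in enumerate(heights):
        bottom = top + height
        if bottom <= visible_bottom:
            read.append(i)
        top = bottom
    return read

heights = [300, 500, 400]          # three entries, stacked vertically
at_top = entries_read(heights, 0, 600)
scrolled = entries_read(heights, 250, 600)
at_end = entries_read(heights, 600, 600)
print(at_top, scrolled, at_end)
```

With a 600-pixel viewport, only the first entry ends on screen initially; the second counts as read once we have scrolled far enough to see its end at 800 pixels.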

That’s it for Knockout’s model; all we have to do now is set everything up. First, we need a function that resizes the elements so that they fit the entire screen. Then, once the DOM is ready, we resize the layout, initialize Knockout, set up a timer to reload the feeds every ten seconds, and set up the scroll event that triggers checkRead.

function setupLayout() {
    var height = document.getSize().y - $$("header")[0].getSize().y - 20;
    $("feeds").setStyle("height", height + "px");
    $("feedContent").setStyle("height", height + "px");
    var feedContentWidth = document.getSize().x - $("feeds").getSize().x - 28;
    $("feedContent").setStyle("width", feedContentWidth + "px");
}

document.addEvent("domready", function() {
    setupLayout();
    window.addEvent("resize", setupLayout);

    // Model is global.
    model = new AppViewModel();
    ko.applyBindings(model);
    model.reload();
    log("document loaded");

    var updateInterval = 10000;
    var updater = function () {
        model.reload();
        this.delay(updateInterval, this);
    };
    updater.delay(updateInterval, updater);

    $("feedContent").addEvent("scroll", function() {
        model.checkRead();
    });
});

And there you have it: a simple web-based news aggregator. The code took about nine and a half hours to write, and none of it was particularly tricky. Just a guess, I’d say making Zingr multi-user wouldn’t be hard at all, but making it support a lot of users at once would be.