Plan for Moving Off Haskell (Part 2): Technical Details

I’ve started with only the database changes, but auth and api changes are on-topic here too; I just didn’t want to wait to post (and this post is long enough already).

Continuing the discussion from Plan for Moving Off Haskell (Part 1): Overview:

Database Changes

The database is broken into two groups. There’s the authentication tables, which contain basically just “users”— what you log in as. Then there’s the crowdmatch tables, which contain information about pledges. The crowdmatch__patron table links the two: the crowdmatch tables refer to ids in the patron table, and patrons in that table have a user id (usr).

I think this split was a good idea; it makes changes to our auth system much less painful. That’s because many of the crowdmatch tables are there for a historical record. The patron id is repeated many times; we don’t want to have to update it everywhere if user ids change. This design means we only need to update it once per patron.

For now, I am only going to focus on changes to the crowdmatch tables, so I’ll omit the crowdmatch__ prefix in the discussion below.

Current status

Our database is created using Yesod’s default database interface, Persistent, based on a definition in Haskell. The current definition (shortened slightly, you can find the full version in Model.hs) looks like this.

Patron
    usr Integer -- UNIQUE
    created UTCTime
    paymentToken PaymentToken Maybe -- denormalized, can be retrieved from PaymentTokenHistory
    donationPayable DonationUnits -- denormalized, can be calculated from CrowdmatchHistory + PaymentHistory
    pledgeSince UTCTime Maybe -- denormalized, can be calculated from PledgeHistory. Used to check who should be included in the crowdmatch

PledgeHistory
    patron PatronId
    time UTCTime
    action StorageAction -- ENUM('CREATE', 'DELETE')

PaymentTokenHistory
    patron PatronId
    time HistoryTime
    action StorageAction -- ENUM('CREATE', 'DELETE')

DonationHistory
    patron PatronId
    time HistoryTime
    amount DonationUnits
    fee Cents

CrowdmatchHistory
    patron PatronId
    date CrowdmatchDay
    amount DonationUnits -- denormalized, can be calculated from PledgeHistory

I can’t say I understand how it works, but the results are basically what you’d guess.

  • Every table has an id bigint NOT NULL added automatically. A type of PatronId means that column has the same type and that there’s a foreign key constraint so you can only insert values that already exist in Patron.id .
  • A bunch of the types are newTypes (non-Haskellers: read “aliases”[1]) defined in ModelDataTypes.hs; UTCTime and HistoryTime both translate to SQL timestamp.
  • If there’s no Maybe it’s NOT NULL (non-SQLers: read “not nullable”[2]).
  • StorageAction is actually a regular string (err, character varying) under the hood, but Haskell enforces that it will only be one of those two values.

Note: The denormalized fields exist because they’d be expensive to recompute based on the history. That’s fine. It would be nice to actually do the re-computation to double-check that everything is right, just before we charge people.

Should-Be-Uncontroversial Changes

Patron
    usr Integer
    created UTCTime
    paymentToken PaymentToken Maybe
    donationPayable Cents -- was DonationUnits
    pledgeSince UTCTime Maybe

PledgeHistory
    patron PatronId
    time UTCTime
    action StorageAction

PaymentTokenHistory
    patron PatronId
    time HistoryTime
    action StorageAction

PaymentHistory -- was DonationHistory
    patron PatronId
    time HistoryTime
    amount Cents -- was DonationUnits
    fee Cents
    transactionInfo ???? -- [NEW] Research needed to figure out exactly what to store here

CrowdmatchHistory
    patron PatronId
    date CrowdmatchDay
    runAt UTCTime -- [NEW] Suggestions for a better name are welcome
    amount Cents -- was DonationUnits
Rationale for each

The term “Donation” is sometimes confusing – does it mean charging someone or adding to their balance? Clarify it.

  • DonationHistoryPaymentHistory

PaymentHistory is strictly a historical record, so we want to store the same units the transaction actualy happened in.

  • PaymentHistory.amount: DonationUnitsCents

We decided to round to the nearest cent at crowdmatch time (see part 1).

  • CrowdmatchHistory.amount: DonationUnitsCents

This also means that outstanding balances cannot contain fractions of a cent.

  • Patron.donationPayable: DonationUnitsCents

We currently run crowdmatches on the 3rd of the month. Anyone who was pledged on the 1st, and is still pledged on the 3rd, is counted as part of that month’s crowdmatches. This gives patrons 2 days to check how much they’ll be charged and drop their pledge before being counted in the crowdmatch. However, it also means we need to know the date that we ran the crowdmatches on, in order to re-calculate the value. So we need to store that date/time.

  • NEW: CrowdmatchHistory.runAt UTCTime

I’m pretty sure Stripe sends back some metadata about the transaction. I’m also pretty sure we want to store it, for auditability. I’m not sure exactly what that data is, though. Maybe the best thing here would be to use json in postgres and just store their whole response, as a blob.
At the very least, though, we probably want to store:

  • Whether the transaction failed or succeeded (so we can also log failed transactions)
  • Some form of transaction Id (so we can identify it to stripe, even if we didn’t store the full response)

Supporting Different Pledge Types

From the mechanism discussion, it’s clear that there are a lot of possible different variations, and that we might want to try some of them. Once we have multiple projects, we might even want to try different variations at the same time.

A coworker at my day job suggested that json-in-postgres might be a good fit here. Its performance is not as good if you need to query into the json structure, but if you are just retrieving the blob (so that the client can do the calculations), the overhead is pretty reasonable. Here’s what I was thinking.

PledgeHistory
    patron PatronId
    time UTCTime
    action StorageAction
    metadataId PledgeMetadataId

PledgeMetadata -- Better name suggestions?
    info jsonb

For the current iteration of the mechanism, all entries in PledgeHistory would have the same metadataId, and the corresponding PledgeMetadata.info could look something like this. Remember that we only need to include the parameters we want to vary, not all the data, since the client will knows which algorithm to use based on the version field.

{
    "version": 1,
    "capCents": 600
}

If we decided on patron-based matching, we might go with something like this:

{
    "version": 2,
    "goalPatrons": 6000,
    "pledgeCents": 600
}

Or a more flexible/explicit version that can support both patrons and dollars:

{
    "version": 2,
    "goalUnits": "cents",
    "goal": 3600000,
    "pledgeCents": 600
}

If we decided to go with patron-chosen goals, there would be a different row in PledgeMetadata for each Pledge/Goal combination, and the PledgeHistory entries would not all point to the same one. This would run into scaling issues sooner, but I did some really rough back-of-the-napkin estimates (which I inconveniently forgot to write down) and we’d almost certainly be fine until around 10 projects with 10k patrons each.

Anyway, two decisions left:

  1. I’m not 100% decided on doing this; I’d like to get some more eyes on it and make sure there’s not a convincing argument against using json-in-postgres like this.

  2. Should version be a required field in the json blob? If yes, should it be a column outside the json blob instead?

Supporting Multiple Projects

Although we won’t be supporting multiple projects right away, it is just over the horizon. I do not think we should actually add these columns to the DB now. However, we should consider whether they can easily be added later— to make sure it’s just a matter of a new Project table and a bunch of ProjectIds added to existing tables, not any major refactoring the tables that exist.

Patron
    usr Integer
    created UTCTime
    paymentToken PaymentToken Maybe
    donationPayable Cents
    pledgeSince UTCTime Maybe -- Problem

Project
    name String
    -- Whatever other info we need here

PledgeHistory
    patron PatronId
    project ProjectId -- NEW
    time UTCTime
    action StorageAction

PaymentTokenHistory
    patron PatronId
    project ProjectId -- NEW
    time HistoryTime
    action StorageAction
    metadataId PledgeMetadataId

PledgeMetadata
    info jsonb

PaymentHistory
    patron PatronId
    time HistoryTime
    amount Cents
    fee Cents
    transactionInfo ????
    -- Need to verify that we wouldn't get any additional info to record here

CrowdmatchHistory
    patron PatronId
    project ProjectId -- NEW
    date CrowdmatchDay
    runAt UTCTime
    amount Cents

Mostly smooth, except pledgeSince needs to be pulled out into a different table, Pledge, basically a cache of PledgeHistory (or, from the other perspective, PledgeHistory being a log of changes to Pledge). We could do that now, or we could wait— it’s a relatively small refactor, although it does still involve copying data between tables.

Alternatively, I wonder if we just dropped pledgeSince, whether we could find a fast-enough way to query PledgeHistory directly for the most-recent update for some patron and project combination, given that time should be monotonic.


  1. NewTypes are not quite aliases; they use the same representation, but you can’t pass one in the place of the other, so it protects you from type errors. ↩︎

  2. In SQL, everything is nullable by default :tear: ↩︎