Plan for Moving Off Haskell (Part 2): Technical Details

I’ve started with only the database changes, but auth and api changes are on-topic here too; I just didn’t want to wait to post (and this post is long enough already).

Continuing the discussion from Plan for Moving Off Haskell (Part 1): Overview:

Database Changes

The database is broken into two groups. There’s the authentication tables, which contain basically just “users”— what you log in as. Then there’s the crowdmatch tables, which contain information about pledges. The crowdmatch__patron table links the two: the crowdmatch tables refer to ids in the patron table, and patrons in that table have a user id (usr).

I think this split was a good idea; it makes changes to our auth system much less painful. That’s because many of the crowdmatch tables are there for a historical record. The patron id is repeated many times; we don’t want to have to update it everywhere if user ids change. This design means we only need to update it once per patron.

For now, I am only going to focus on changes to the crowdmatch tables, so I’ll omit the crowdmatch__ prefix in the discussion below.

Current status

Our database is created using Yesod’s default database interface, Persistent, based on a definition in Haskell. The current definition (shortened slightly, you can find the full version in Model.hs) looks like this.

Patron
    usr Integer -- UNIQUE
    created UTCTime
    paymentToken PaymentToken Maybe -- denormalized, can be retrieved from PaymentTokenHistory
    donationPayable DonationUnits -- denormalized, can be calculated from CrowdmatchHistory + PaymentHistory
    pledgeSince UTCTime Maybe -- denormalized, can be calculated from PledgeHistory. Used to check who should be included in the crowdmatch

PledgeHistory
    patron PatronId
    time UTCTime
    action StorageAction -- ENUM('CREATE', 'DELETE')

PaymentTokenHistory
    patron PatronId
    time HistoryTime
    action StorageAction -- ENUM('CREATE', 'DELETE')

DonationHistory
    patron PatronId
    time HistoryTime
    amount DonationUnits
    fee Cents

CrowdmatchHistory
    patron PatronId
    date CrowdmatchDay
    amount DonationUnits -- denormalized, can be calculated from PledgeHistory

I can’t say I understand how it works, but the results are basically what you’d guess.

  • Every table has an id bigint NOT NULL added automatically. A type of PatronId means that column has the same type and that there’s a foreign key constraint so you can only insert values that already exist in Patron.id .
  • A bunch of the types are newTypes (non-Haskellers: read “aliases”[1]) defined in ModelDataTypes.hs; UTCTime and HistoryTime both translate to SQL timestamp.
  • If there’s no Maybe it’s NOT NULL (non-SQLers: read “not nullable”[2]).
  • StorageAction is actually a regular string (err, character varying) under the hood, but Haskell enforces that it will only be one of those two values.

Note: The denormalized fields exist because they’d be expensive to recompute based on the history. That’s fine. It would be nice to actually do the re-computation to double-check that everything is right, just before we charge people.

Should-Be-Uncontroversial Changes

Patron
    usr Integer
    created UTCTime
    paymentToken PaymentToken Maybe
    donationPayable Cents -- was DonationUnits
    pledgeSince UTCTime Maybe

PledgeHistory
    patron PatronId
    time UTCTime
    action StorageAction

PaymentTokenHistory
    patron PatronId
    time HistoryTime
    action StorageAction

PaymentHistory -- was DonationHistory
    patron PatronId
    time HistoryTime
    amount Cents -- was DonationUnits
    fee Cents
    transactionInfo ???? -- [NEW] Research needed to figure out exactly what to store here

CrowdmatchHistory
    patron PatronId
    date CrowdmatchDay
    runAt UTCTime -- [NEW] Suggestions for a better name are welcome
    amount Cents -- was DonationUnits
Rationale for each

The term “Donation” is sometimes confusing – does it mean charging someone or adding to their balance? Clarify it.

  • DonationHistoryPaymentHistory

PaymentHistory is strictly a historical record, so we want to store the same units the transaction actualy happened in.

  • PaymentHistory.amount: DonationUnitsCents

We decided to round to the nearest cent at crowdmatch time (see part 1).

  • CrowdmatchHistory.amount: DonationUnitsCents

This also means that outstanding balances cannot contain fractions of a cent.

  • Patron.donationPayable: DonationUnitsCents

We currently run crowdmatches on the 3rd of the month. Anyone who was pledged on the 1st, and is still pledged on the 3rd, is counted as part of that month’s crowdmatches. This gives patrons 2 days to check how much they’ll be charged and drop their pledge before being counted in the crowdmatch. However, it also means we need to know the date that we ran the crowdmatches on, in order to re-calculate the value. So we need to store that date/time.

  • NEW: CrowdmatchHistory.runAt UTCTime

I’m pretty sure Stripe sends back some metadata about the transaction. I’m also pretty sure we want to store it, for auditability. I’m not sure exactly what that data is, though. Maybe the best thing here would be to use json in postgres and just store their whole response, as a blob.
At the very least, though, we probably want to store:

  • Whether the transaction failed or succeeded (so we can also log failed transactions)
  • Some form of transaction Id (so we can identify it to stripe, even if we didn’t store the full response)

Supporting Different Pledge Types

From the mechanism discussion, it’s clear that there are a lot of possible different variations, and that we might want to try some of them. Once we have multiple projects, we might even want to try different variations at the same time.

A coworker at my day job suggested that json-in-postgres might be a good fit here. Its performance is not as good if you need to query into the json structure, but if you are just retrieving the blob (so that the client can do the calculations), the overhead is pretty reasonable. Here’s what I was thinking.

PledgeHistory
    patron PatronId
    time UTCTime
    action StorageAction
    metadataId PledgeMetadataId

PledgeMetadata -- Better name suggestions?
    info jsonb

For the current iteration of the mechanism, all entries in PledgeHistory would have the same metadataId, and the corresponding PledgeMetadata.info could look something like this. Remember that we only need to include the parameters we want to vary, not all the data, since the client will knows which algorithm to use based on the version field.

{
    "version": 1,
    "capCents": 600
}

If we decided on patron-based matching, we might go with something like this:

{
    "version": 2,
    "goalPatrons": 6000,
    "pledgeCents": 600
}

Or a more flexible/explicit version that can support both patrons and dollars:

{
    "version": 2,
    "goalUnits": "cents",
    "goal": 3600000,
    "pledgeCents": 600
}

If we decided to go with patron-chosen goals, there would be a different row in PledgeMetadata for each Pledge/Goal combination, and the PledgeHistory entries would not all point to the same one. This would run into scaling issues sooner, but I did some really rough back-of-the-napkin estimates (which I inconveniently forgot to write down) and we’d almost certainly be fine until around 10 projects with 10k patrons each.

Anyway, two decisions left:

  1. I’m not 100% decided on doing this; I’d like to get some more eyes on it and make sure there’s not a convincing argument against using json-in-postgres like this.

  2. Should version be a required field in the json blob? If yes, should it be a column outside the json blob instead?

Supporting Multiple Projects

Although we won’t be supporting multiple projects right away, it is just over the horizon. I do not think we should actually add these columns to the DB now. However, we should consider whether they can easily be added later— to make sure it’s just a matter of a new Project table and a bunch of ProjectIds added to existing tables, not any major refactoring the tables that exist.

Patron
    usr Integer
    created UTCTime
    paymentToken PaymentToken Maybe
    donationPayable Cents
    pledgeSince UTCTime Maybe -- Problem

Project
    name String
    -- Whatever other info we need here

PledgeHistory
    patron PatronId
    project ProjectId -- NEW
    time UTCTime
    action StorageAction

PaymentTokenHistory
    patron PatronId
    project ProjectId -- NEW
    time HistoryTime
    action StorageAction
    metadataId PledgeMetadataId

PledgeMetadata
    info jsonb

PaymentHistory
    patron PatronId
    time HistoryTime
    amount Cents
    fee Cents
    transactionInfo ????
    -- Need to verify that we wouldn't get any additional info to record here

CrowdmatchHistory
    patron PatronId
    project ProjectId -- NEW
    date CrowdmatchDay
    runAt UTCTime
    amount Cents

Mostly smooth, except pledgeSince needs to be pulled out into a different table, Pledge, basically a cache of PledgeHistory (or, from the other perspective, PledgeHistory being a log of changes to Pledge). We could do that now, or we could wait— it’s a relatively small refactor, although it does still involve copying data between tables.

Alternatively, I wonder if we just dropped pledgeSince, whether we could find a fast-enough way to query PledgeHistory directly for the most-recent update for some patron and project combination, given that time should be monotonic.


  1. NewTypes are not quite aliases; they use the same representation, but you can’t pass one in the place of the other, so it protects you from type errors. ↩︎

  2. In SQL, everything is nullable by default :tear: ↩︎

Authentication

I’ve added a comment to part 1 with a little more background of why we’d like to move to an off-the-shelf auth solution and what it should provide.

I’m not particularly familiar with this area, so I mostly just have questions at this point.

  • Should the solution provide its own page (e.g. we just redirect to a subdomain for login and then it redirects back, essentially an oauth/SSO type flow where the auth system acts like the 3rd party), or is it just a backend?

  • How to integrate with Discourse SSO?

  • What methods do they have for migrating from our old account database?

Some options that I know about are gluu and keycloak. I don’t know how they compare in terms of ease of setup and features.

Database, again

Whatever we choose, I’d also like to make some database changes to support the auth breakout.

Right now we have a usr table. The id column corresponds to the usr field in the Patron table above. Its schema looks like this:

User
    id Int -- unique
    email Text -- unique
    digest ByteString -- password hash
    created UTCTime
    passUpdated UTCTime

Email, password hash, and time of last update are things that should be stored in the auth system. However, I think it’s good to keep the usr table around with just an id and creation date. That will be the stable id to represent a user in our system, even if we end up swapping out auth systems in the future. The patron id could be used here instead, but I like the idea of having the user be something outside the crowdmatching code; I think it’ll make life less painful if we ever want to make changes to the crowdmatching system in the future.

So, for our existing DB, we’d end up with tables like:

User
    id Int -- unique
    created UTCTime

LoginV1
    id Int
    userId Int -- corresponds to id in user table
    email Text -- unique
    digest ByteString -- password hash
    created UTCTime -- I think this is worth duplicating
    passUpdated UTCTime

Depending on how the auth system we choose works, we may end up adding a new table, similar to LoginV1, which maps the auth system identifier to a user id.

API

The API is probably the simplest part. We need routes to:

Auth

Depending on the auth solution we choose, the api might need to have a thin wrapper for some calls (login, logout, signup, confirm signup, reset password, confirm reset, change email).

We’ll end up with some secret value (e.g. token or jwt) which we’ll send along to other api requests in either a cookie or header.

Pledge/Unpledge

This is probably as simple as PUT/DELETE to /p/snowdrift/pledge. Later, when there are variations on pledging, the request body will contain details about that (e.g. amount pledged), but for now it doesn’t need anything.

Pledge is only allowed if the user has a (valid) credit card stored already.

Add/remove credit cards

The frontend will store a credit card with Stripe through their api and receive a Stripe token in return. This token is then sent to us for storage. This should also be a simple PUT/DELETE to something like /user/payment/stripe.

Removing a card is only allowed if the user is not currently pledged and has no outstanding balance (pending donation); if either of those are not true, this should return an error. The body should be a json blob with this typescript type: { errors: { type: string, details: string }[] } In English, that’s basically a list of errors, each of which has a unique type and a description that has more details. This would be used so the frontend can prompt the user to un-pledge or pay off their balance first.

Adding a card if there is already a card added should be allowed, and should replace the current card. Stripe should validate cards, so this is allowed even if there is an outstanding balance.

Pay off a balance (pending donation)

To keep fees below 10%, normally donations are not charged until the pending donation exceeds $3.79. However, if someone wants to stop being a patron and remove their card, we need to provide a way for them to zero out their balance, in order to maintain the integrity of matching. This also means we need to wait to actually zero out the balance until the transaction clears.

For MVP, we may be able to do without (automating) this, and have people contact us so we can run the payout program manually for that person.

Show Pending Donation, Crowdmatch History, and Payment History

For showing on the user’s dashboard.

Show Project Information

At this point it’s just the number of patrons pledged.


The above is not quite detailed enough for a spec if we were to hire a freelancer to implement it, but I think it has all the key information. Anything I missed?

A post was merged into an existing topic: Moving Off Haskell (Part 3): Choosing a Tech Stack

Here’s the order that I think we should proceed:

  1. Make some of the “Uncontroversial DB Changes”
    • DO: Change from DonationUnits to Cents
    • DO: Add runAt timestamp to CrowdmatchHistory table
    • DON’T: Rename the DonationHistory table to PaymentHistory yet.
    • DON’T: Add the transactionInfo field yet.
  2. Start charging patrons (after contacting them and doing test charges on team member cards)
  3. Replace the current website with the new elm site and to-be-written api and to-be-decided auth system.
  4. Make the rest of the DB changes.

This gets the “things that make our life simpler once we start charging people” value before we start actually charging people, without requiring us to make too many changes in the Haskell code base.

edit: note that this is just to get us to the “off of Haskell” stage, not to multiple projects; an alternative to consider is trying to leap-frog and go straight to building for multiple projects.

We’re about due for an update. We’ve had a small change of course.

The above was (and still is) a fine plan if we’re only thinking as far as the “get off Haskell” stage. However, @Adroit has pushed (rightly, I think) for us to consider a slightly longer horizon: our real next goal is to get other projects on the platform, and it would be great if we can rewrite directly from the single-project Haskell backend to a multi-project node backend.

Additionally, we’ve run into an annoyance with the Haskell site: it requires OpenSSL 1.1 to compile, and many distros are now shipping OpenSSL 1.3. Downgrading is both a bad idea and somewhat difficult, so currently the best way to build the site if you’re on a modern distro is to build in a VM on an older version of your distro that still includes OpenSSL 1.1. That’s annoying at the least, and makes even small tweaks to the Haskell site much harder to implement. This makes getting off Haskell in a timely fashion even more important.

There’s a wonderful compromise that achieves both of these goals:

  1. Build a read-only backend api in node, that connects to the current postgres database.
  2. Hook the new Elm frontend up to this read-only api.
  3. Run the Haskell and Elm sites alongside each other. Move the Haskell site to https://old.snowdrift.coop, and move the Elm site to https://snowdrift.coop (instead of https://new.snowdrift.coop, where it is now).
    • On pages that require writes (password reset, pledge/unpledge, add/remove credit card info), include a link to the relevant old.snowdrift.coop page.
  4. Focus on building out multi-project functionality in the node backend, using Payload, as we have ~decided in Moving Off Haskell (Part 3): Choosing a Tech Stack
  5. When the new backend+frontend are ready to handle write functionality, migrate existing data from Postgres to Payload (MongoDB) and retire the Haskell site.

I’m kind of going to insist on this approach[1], but I don’t think anyone will have objections; I’m pretty sure it just makes the most sense. The goal is to be at step 3 by March of this year, and to start charging patrons of Snowdrift.coop around that point as well (once we’ve got our non-profit status cleared up, so we can make sure fees & taxes are right up-front).


  1. By which I mean, I’ve already secretly written most of step 1, to-be-revealed in tomorrow’s weekly meeting ↩︎

How much does using a “node backend” fundamentally tied to the NodeJS packaging ecosystem? If Deno or Bun turn out to be a superior option later, will it be easy enough to transition potentially?

A lot. Well, it’s tied to the extent that the backend uses software that’s only written on NodeJS. To avoid that we’d have to write pretty much everything under the sun from scratch.

Deno may already be technically superior, but its package ecosystem is not, by a long shot. When it does become superior in this way (distant future if ever), it may be “easy enough to transition properly”, yes, because it’s really only the lack of package ecosystem that prevents us from using it right now.

For example, Payload runs on Node, and uses many Node packages. There is no particular reason Payload couldn’t run on Deno, except that many tools and libraries aren’t there yet. Once they are, the Payload team may even choose to enable Deno support themselves. If they don’t, but Deno comes to support all the libraries Payload needs (e.g. passport) and we need (e.g. Stripe), we could feasibly do it ourselves.

1 Appreciation