I have built a Python ETL tool that exports every record in every table in the Knack database and saves it to a data warehouse.
The primary driver for my client is the steep cost of adding rows. To keep his database below his plan's record threshold, we are purging records that are no longer needed.
In this particular use case, not every record is needed in the “live app,” so we are purging rows, archiving them in a data warehouse (MotherDuck), and building simple BI reports (Preset.io) to access the historical data. The data pipeline runs nightly in GitHub Actions.
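For anyone curious about the mechanics, the loader is basically “page through each Knack object’s records endpoint, then land the rows in MotherDuck.” A rough sketch of that shape (the object key, database/table names, and environment variable names below are placeholders, not my actual code):

```python
import os

import duckdb
import pandas as pd
import requests

# Standard Knack object-records auth headers; env var names are placeholders.
KNACK_HEADERS = {
    "X-Knack-Application-Id": os.environ["KNACK_APP_ID"],
    "X-Knack-REST-API-Key": os.environ["KNACK_API_KEY"],
}


def fetch_all_records(object_key: str) -> list[dict]:
    """Page through a Knack object's records endpoint (up to 1000 rows per page)."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"https://api.knack.com/v1/objects/{object_key}/records",
            headers=KNACK_HEADERS,
            params={"page": page, "rows_per_page": 1000},
        )
        resp.raise_for_status()
        body = resp.json()
        records.extend(body["records"])
        if page >= body["total_pages"]:
            return records
        page += 1


# The "md:" prefix opens a MotherDuck database; the token is read from the
# motherduck_token environment variable. "knack_archive" is an example name.
con = duckdb.connect("md:knack_archive")

rows = fetch_all_records("object_1")  # illustrative object key
df = pd.DataFrame(rows)
con.register("knack_rows", df)
con.execute("CREATE OR REPLACE TABLE raw_object_1 AS SELECT * FROM knack_rows")
```

The real tool loops this over every object in the app and adds some bookkeeping, but the core is that simple.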
The tech stack was chosen to be cost-efficient.
There is a good chance this warehouse solution can be added for $0/month. The MotherDuck free tier includes 10GB of storage and 10 compute hours each month, Preset.io has a reasonable free tier, and GitHub Actions does as well.
The code is still pretty entangled with this client’s environment.
Getting Knack data into a SQL database seems useful, and I’ve enjoyed using SQL to aid research and to build SQL views for analytic reports.
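As a small example of the kind of analytic view I mean, a report can live as a view right in MotherDuck and be pointed at from Preset. Table and column names here are made up for illustration:

```python
import duckdb

# Connect to the MotherDuck warehouse (token picked up from the environment).
con = duckdb.connect("md:knack_archive")

# Assumes the archived table has a created_at timestamp column.
con.execute("""
    CREATE OR REPLACE VIEW monthly_record_counts AS
    SELECT date_trunc('month', created_at) AS month,
           count(*)                        AS records_archived
    FROM raw_object_1
    GROUP BY 1
    ORDER BY 1
""")
```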
I’m still fairly new to the Knack realm; most of my other clients are on Python-centric tech stacks (Streamlit, Anvil.works, PostgreSQL, etc.), so I only dive into Knack when needed for this client.
Which is to say, I have no idea whether this is a common enough challenge to be worth my time and effort to build something more reusable from these building blocks.
Feedback welcome. I’d love this community’s perspective on the SQL database / data pipeline / data warehousing problem and solution space.
What data warehouses (Snowflake, BigQuery, ?) and BI tools are folks using? I’ve seen Power BI mentioned in a few posts.
I’m very interested in understanding more about what you have achieved, as I foresee a similar need in the near future. Given that the driver was a large number of records, and I assume you are using the API to grab the data, how are you dealing with the daily API limit? As for the target BI platform, in our case that would be MS Fabric and Power BI.
@LeighBerre96926
Yep. Luckily, for now we upgraded our API limits, so we have the headroom to complete the data loader job.
That said, I built a throttler: you set aside X API requests as a reserve, and the data loader will never eat into it. In other words, it stops loading data when Current Usage + Reserve >= Plan Daily Total.
Our current reserve is about 110% of a recent daily peak, on the assumption that this leaves enough room for typical daily usage to continue without interruption.
The goal at this point was simply to ensure the data loader can’t exhaust the account’s API limit, and it does that.
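Mechanically it’s just that one inequality checked before each batch of requests. A minimal sketch; the numbers and the get_current_usage() helper are placeholders (more on the usage endpoint below):

```python
PLAN_DAILY_TOTAL = 10_000   # illustrative plan limit
RESERVE = 1_650             # ~110% of a recent daily peak outside the loader


def can_make_request(current_usage: int) -> bool:
    """Stop loading once Current Usage + Reserve >= Plan Daily Total."""
    return current_usage + RESERVE < PLAN_DAILY_TOTAL


# Inside the load loop (get_current_usage() is a placeholder for however you
# read today's API usage):
#
# if not can_make_request(get_current_usage()):
#     print("API reserve reached; stopping the data loader for today")
#     break
```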
But since we haven’t hit our API limits, I haven’t started working on sequencing the data so that 100% of it gets loaded over X runs.
That is, if we need 10,000 API requests to load the data and the daily limit is 5,000, how do we minimize the staleness of the data, knowing we cannot refresh the entire dataset in a single run? I have a few ideas, but I haven’t implemented anything since this isn’t a pressing issue right now.
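One of those ideas, purely as a sketch (none of this is implemented): track a last-refreshed timestamp per table and spend each night’s request budget on the stalest tables first.

```python
from datetime import datetime, timedelta


def plan_tonights_load(tables: dict[str, dict], budget: int) -> list[str]:
    """
    tables maps table name -> {"last_refreshed": datetime, "est_requests": int}.
    Greedily pick the stalest tables whose estimated request count still fits
    in tonight's remaining API budget.
    """
    plan, remaining = [], budget
    for name, info in sorted(tables.items(), key=lambda kv: kv[1]["last_refreshed"]):
        if info["est_requests"] <= remaining:
            plan.append(name)
            remaining -= info["est_requests"]
    return plan


# Illustrative only: three tables and a 5,000-request nightly budget.
now = datetime.now()
tables = {
    "orders":    {"last_refreshed": now - timedelta(days=3), "est_requests": 4_000},
    "customers": {"last_refreshed": now - timedelta(days=1), "est_requests": 800},
    "invoices":  {"last_refreshed": now - timedelta(days=2), "est_requests": 2_500},
}
print(plan_tonights_load(tables, budget=5_000))
# -> ['orders', 'customers']; 'invoices' waits for the next night
```

Over a few nights everything gets refreshed, and the tables that matter most could be weighted to come up more often.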
One caveat: the mechanism I was taking advantage of to get the “remaining API for the day” recently stopped working, and my account is in a bit of a funky state. Support suggested that my client unshare everything with me and re-share it, but I’m in the middle of some critical work and don’t want to risk losing a few days while my account access gets squared away.
This endpoint used to work for me for real-time API usage stats:
https://api.knack.com/v1/accounts/{{KNACK_ACCOUNT_ID}}/usage/api
As I said, it stopped working, but my account is also in a funky state right now, so I don’t know whether access to this endpoint disappeared or it’s just my account.
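For reference, this is roughly how I was hitting it from Python. The auth headers shown are an assumption (the same app-id / REST API key pair used for the records API); this endpoint isn’t part of the documented object API and, as noted, it may simply no longer respond:

```python
import os

import requests

account_id = os.environ["KNACK_ACCOUNT_ID"]
url = f"https://api.knack.com/v1/accounts/{account_id}/usage/api"

# Assumed headers; env var names are placeholders. Treat this as untested.
resp = requests.get(
    url,
    headers={
        "X-Knack-Application-Id": os.environ["KNACK_APP_ID"],
        "X-Knack-REST-API-Key": os.environ["KNACK_API_KEY"],
    },
)
resp.raise_for_status()
print(resp.json())  # used to include the remaining API calls for the day
```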
Sounds like you have a really good model there. If it pans out as I expect, I’ll reach out and try to pick your brain a little more.