# Indexing articles on my blog
I had a couple of goals when deciding how to structure the content-processing pipeline for my blog:
- Articles are written in MDX files (Markdown with JSX) so I can conveniently author them in my neovim editor. Initially, I thought about storing them in a database instead of files, but that would increase my costs, since I would have to host a database. I also thought of using a CMS like Strapi, but again, that costs more money than plain static files, and it would make it harder to use my beloved neovim to write content.
- Articles are stored in the `content/articles` directory. The structure of that directory is not strict. I decided to group them into a two-level hierarchy initially: first by year, then by the month of the created date (e.g. `content/articles/2022/06/18-articles-indexes.mdx`). However, I want to allow myself the liberty of having series for which all articles are stored in a single directory, for example `content/articles/fp-ts/01-piping-values.mdx`.
- Each article belongs to one or more topics.
- Each topic has a description.
- Each article has an arbitrary slug. There is no scheme for the slug format.
These goals are great for the content author (i.e. me). I am not so sure the application that serves my blog would be as happy, since, in the naive case, it has to do a lot more work on each request:
- Scan all the articles in the `content/articles` directory.
- Parse their frontmatter.
- Depending on the URL:
  - for article pages (when we know the slug), find the article that matches the slug, parse and show its MDX content,
  - for a single topic page, filter the articles with that topic and list them,
  - for the list of topics, find all topics from all articles and list them,
  - for the home page, sort all articles by their created date and show them.
I am using Next.js as the meta-framework. Thanks to its Static Generation, most of that work is done at build time. That is much better than doing it at request time, since the pages are already prebuilt when the user requests them, leading to a snappier user experience.
This approach would probably scale to hundreds of articles. I could have left it as it was and paid the price of walking and parsing the articles multiple times during each build. That was not good enough for me. I wanted to see if I could reduce the duplicate work.
**Warning:** This is a case of premature optimization. I am solving a performance problem without measuring it and finding out whether it is a bottleneck. I am doing it to gain experience solving such problems and to feel good about myself. Do not do that in a work environment.
## The needs of each page
Let's identify what each page type of my blog needs; a type-level sketch follows the list.
- Home page - an ordered list of articles.
- Topics page - a list of all topics mentioned in at least 1 article.
- Topic page (for each topic) - an ordered list of articles with that topic.
- Article page (for each article slug) - file path to the article file.
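To make these needs concrete, here is a rough sketch of the shapes involved as TypeScript types; the names are mine, not from the actual codebase:

```ts
// Illustrative types only; the names are assumptions, not the real code.
interface ArticleMetadata {
  title: string;
  date: string; // ISO 8601 created date
  tags: string[]; // the topics this article belongs to
  slug: string;
  summary: string;
  readingTimeMin: number;
}

type HomePageData = ArticleMetadata[]; // sorted by created date
type TopicsPageData = string[]; // every topic mentioned in some article

interface TopicPageData {
  description: string;
  articles: ArticleMetadata[];
}

interface ArticlePageData {
  metadata: ArticleMetadata;
  filePath: string; // where to find the MDX content to parse
}
```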
## Indexes are the solution
Similarly to how a database uses indexes to find rows faster, my content-processing pipeline can use indexes to build pages faster than parsing the content files again and again.
My idea was to have a single script that scans all articles and produces a couple of JSON files that serve as indexes. It runs before Next builds the pages. When the pages are being built, these index files are read instead of scanning the articles again.
Aside from building the indexes, the script also runs some validation on the content, like ensuring that all articles have correct frontmatter metadata, there are no duplicate slugs, and all topics have descriptions.
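Here is a minimal, imperative sketch of what such a script can look like, assuming gray-matter for frontmatter parsing; the actual implementation uses fp-ts, and all helper names here are mine:

```ts
// create-indexes.ts - a simplified sketch of the index-building script.
import { promises as fs } from "fs";
import path from "path";
import matter from "gray-matter"; // splits frontmatter from the MDX body

// Only the frontmatter fields this sketch relies on.
interface ArticleMetadata {
  date: string;
  tags: string[];
  slug: string;
  [key: string]: unknown;
}

// Recursively collect every .mdx file under a directory.
async function walkMdxFiles(dir: string): Promise<string[]> {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const nested = await Promise.all(
    entries.map(async (entry) => {
      const fullPath = path.join(dir, entry.name);
      return entry.isDirectory() ? walkMdxFiles(fullPath) : [fullPath];
    })
  );
  return nested.flat().filter((p) => p.endsWith(".mdx"));
}

async function createIndexes(): Promise<void> {
  const filePaths = await walkMdxFiles("content/articles");
  const articles = await Promise.all(
    filePaths.map(async (filePath) => {
      const { data } = matter(await fs.readFile(filePath, "utf-8"));
      return { metadata: data as ArticleMetadata, filePath };
    })
  );

  // One of the validation passes: no two articles may share a slug.
  const slugs = articles.map((a) => a.metadata.slug);
  if (new Set(slugs).size !== slugs.length) {
    throw new Error("Duplicate slugs found");
  }

  // Newest first, so every index is already in display order.
  articles.sort((a, b) => b.metadata.date.localeCompare(a.metadata.date));

  const outDir = "content-indexes";
  await fs.mkdir(path.join(outDir, "topics"), { recursive: true });
  const writeJson = (relPath: string, value: unknown) =>
    fs.writeFile(path.join(outDir, relPath), JSON.stringify(value, null, 2));

  await writeJson("all-articles.json", articles.map((a) => a.metadata));
  await writeJson(
    "slug-reverse-mapping.json",
    Object.fromEntries(articles.map((a) => [a.metadata.slug, a]))
  );

  const topics = [...new Set(articles.flatMap((a) => a.metadata.tags))].sort();
  await writeJson("topics.json", topics);
  for (const topic of topics) {
    await writeJson(`topics/${topic}.json`, {
      articles: articles
        .filter((a) => a.metadata.tags.includes(topic))
        .map((a) => a.metadata),
      // The real script also pairs each topic with its description; where
      // the descriptions live is not shown in this sketch.
    });
  }
}

createIndexes();
```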
## Structure of indexes
The structure of the indexes looks as follows:
```
10:54 $ exa --tree content-indexes/
content-indexes
├── all-articles.json
├── slug-reverse-mapping.json
├── topics
│  ├── blog.json
│  └── test.json
└── topics.json
```
They hold the following contents:
```
11:02 $ cat content-indexes/all-articles.json
[
  {
    "title": "Indexing articles on my blog",
    "date": "2022-06-18T00:00:00.000Z",
    "tags": [
      "blog"
    ],
    "slug": "indexing-blog-articles",
    "summary": "How to avoid duplicate work when serving pages for my blog? Building up indexes that summarize the articles in a concise format is one solution to this problem.",
    "readingTimeMin": 4.13
  },
  ...
]
```
```
11:04 $ cat content-indexes/slug-reverse-mapping.json
{
  "indexing-blog-articles": {
    "metadata": {
      "title": "Indexing articles on my blog",
      "date": "2022-06-18T00:00:00.000Z",
      "tags": [
        "blog"
      ],
      "slug": "indexing-blog-articles",
      "summary": "How to avoid duplicate work when serving pages for my blog? Building up indexes that summarize the articles in a concise format is one solution to this problem.",
      "readingTimeMin": 4.13
    },
    "filePath": "content/articles/2022/06/18-article-indexes.mdx"
  },
  ...
}
```
```
11:08 $ cat content-indexes/topics.json
[
  "blog",
  "test"
]
```
```
11:08 $ cat content-indexes/topics/blog.json
{
  "articles": [
    {
      "title": "Indexing articles on my blog",
      "date": "2022-06-18T00:00:00.000Z",
      "tags": [
        "blog"
      ],
      "slug": "indexing-blog-articles",
      "summary": "How to avoid duplicate work when serving pages for my blog? Building up indexes that summarize the articles in a concise format is one solution to this problem.",
      "readingTimeMin": 4.13
    },
    ...
  ],
  "description": "Articles related to my blog. These could be describing the architecture,\nthe decisions made, etc.\n"
}
```
There is duplication of article metadata in the indexes. That is on purpose. The indexes hold denormalized data. Each index file contains all information necessary to render the route that it is supposed to help serve.
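To sketch how a page then consumes an index, here is roughly what the topic page's `getStaticProps` could look like; this is a simplified sketch, not the actual code (which uses fp-ts), and the file layout is an assumption:

```ts
// pages/topics/[topic].tsx - a simplified sketch of reading one index file.
import { promises as fs } from "fs";
import path from "path";
import type { GetStaticProps } from "next";

export const getStaticProps: GetStaticProps = async ({ params }) => {
  const topic = params?.topic as string;
  // Read the denormalized per-topic index; no MDX parsing happens here.
  const indexPath = path.join(
    process.cwd(),
    "content-indexes",
    "topics",
    `${topic}.json`
  );
  const { articles, description } = JSON.parse(
    await fs.readFile(indexPath, "utf-8")
  );
  return { props: { topic, articles, description } };
};
```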
## Tradeoffs
As with most decisions, this one also comes with some tradeoffs.
Firstly, the indexes can become outdated. I have to explicitly run `npm run create-indexes` to create or update them. If I forget to run it after adding a new article, there is no error message. I simply need to remember to run it once in a while during content authoring. The production environment (aka Vercel) is not affected, because the indexes are always created in the npm `build` script.
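For context, the wiring in `package.json` looks roughly like this; `create-indexes` is the script named above, while the exact commands are my assumption:

```json
{
  "scripts": {
    "create-indexes": "ts-node scripts/create-indexes.ts",
    "build": "npm run create-indexes && next build"
  }
}
```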
Another downside of using indexes is that the code got more complex. Instead of doing the same content-parsing in each route, I had to write code to create and then read these indexes. It was a cool exercise to write most of the logic using fp-ts, but still, that is code that I need to maintain.
## Bonus: incremental generation of article pages
Another premature optimization I could implement was to defer the building of article pages until they are requested by some user. I could imagine that if I had many articles, the build process would take longer, since each article's MDX content would have to be parsed before the build is complete. I didn't measure the MDX parsing performance, which gives me the right to say this is premature optimization. I wanted to overengineer it anyway.
Next.js has a feature called Incremental Static Regeneration. Long story short, it defers building some pages until they are requested by a user. Article pages are a perfect use case. Instead of building all article pages up-front, I could build only a handful of them (perhaps the most popular and the most recent ones) and defer building all the other articles until some user wants to read them. The build process remains fast, since only a constant number of articles are built then, and for the rest, the first reader has to wait a few seconds longer for the article to be processed. I have yet to tell Next which articles to prebuild, but this seems like the best of both worlds.
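Here is a sketch of what that can look like with `getStaticPaths`; the choice to prebuild the ten most recent articles is illustrative, not what my code necessarily does:

```ts
// pages/articles/[slug].tsx - a sketch of deferring most article builds.
import { promises as fs } from "fs";
import path from "path";
import type { GetStaticPaths } from "next";

export const getStaticPaths: GetStaticPaths = async () => {
  const indexPath = path.join(
    process.cwd(),
    "content-indexes",
    "all-articles.json"
  );
  const allArticles: { slug: string }[] = JSON.parse(
    await fs.readFile(indexPath, "utf-8")
  );
  return {
    // Prebuild only a handful of articles; the index is already sorted
    // newest-first, so these are the most recent ones.
    paths: allArticles.slice(0, 10).map((a) => ({ params: { slug: a.slug } })),
    // Build every other article on its first request, then cache the result.
    fallback: "blocking",
  };
};
```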
### Index files not found during incremental generation
Right after deploying my changes to Vercel, I noticed that article pages showed errors due to not being able to read the `all-articles.json` index file. After some analysis, I found out that I had made a couple of assumptions that hold when running the Next server locally, but break in Vercel Edge Functions (which is where the incremental generation happens):
- Next compiles page files and moves them to a different location, so using `__dirname` or `import.meta.url` to find the root of the repository is no longer viable.
- Next only includes a handful of repository files to be available in Edge Functions. The process of figuring out which files are included is called Output File Tracing.
I was using `import.meta.url` as a modern way of finding the location of the current file, based on which I computed the path to the `content-indexes` directory. This worked well locally (both in `next dev` and `next start`). That structure was no longer maintained in Edge Functions, which made these paths invalid. The problem is easy to solve: Next claims that `process.cwd()` will point to the root of the repository. As a bonus, using `process.cwd()` made the code simpler.
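The change boils down to something like this; the relative hops in the broken variant are illustrative:

```ts
import path from "path";
import { fileURLToPath } from "url";

// Before: derive paths from the current file's location. Breaks on Vercel,
// because Next moves compiled pages, so the relative hops no longer line up.
const hereDir = path.dirname(fileURLToPath(import.meta.url));
const indexesDirBroken = path.join(hereDir, "..", "..", "content-indexes");

// After: process.cwd() points at the repository root, locally and on Vercel.
const indexesDir = path.join(process.cwd(), "content-indexes");
```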
Even with correct paths, the `content` directory would not be found when the article pages were generated. Here is what logging the files at `process.cwd()` yielded while generating an article:
```
{
  "filesAtCwd": [
    ".next",
    "___next_launcher.cjs",
    "___vc",
    "node_modules",
    "package.json",
    "pages"
  ]
}
```
Next's automated output file tracing was not smart enough to figure out that I needed both the `content` and `content-indexes` directories to render an article page. Fortunately, there is an `unstable_includeFiles` configuration option you can specify on a page to tell Vercel to include some files to be available during incremental generation (source):
```ts
import type { PageConfig } from "next";

export const config: PageConfig = {
  unstable_includeFiles: ["content-indexes/**/*.*", "content/**/*.*"],
};
```
Then, both directories and their contents were included in the Edge Functions runtime, and the article pages were built correctly.
**Disclaimer:** Notice that the globs end with `**/*.*`. This is because the globs must point to files only. Ending a glob with `**/*` would make it resolve to directories as well, which makes Next happily throw an `EISDIR` error when building on Vercel:
```
Traced Next.js server files in: 377.934ms
[Error: EISDIR: illegal operation on a directory, read] {
  errno: -21,
  code: 'EISDIR',
  syscall: 'read'
}
```
## Conclusion
Building a blogging platform to match my needs is a cool project. So is solving problems prematurely. As I said before, I would not recommend doing the same in a professional setting, but solving interesting problems is always welcome when it only costs me time and gives me a chance to grow as an engineer.
During the process I discovered some caveats related to hosting applications on Vercel. The worst one to debug was its Output File Tracing, since I had to push a commit each time I tweaked some configuration, which was frustrating.
I managed to make the blog platform work in the end. This article is proof of that: it is the first real article to be processed by my content-processing pipeline.