Sometimes, it's hard for me to explain (even to myself) what I do all day at my job. I maintain some pretty simple, static web pages at my job that should be low effort. But recently, I created and fixed a bug that involved so many unexpected things, and manifested in a weird enough way that I thought it would be useful to demonstrate the complexity of a frontend engineer's job.
The Situation
Over the past couple months, my team has been merging two codebases into the same
git repository. The work was fairly simple: two Ember.js apps
that can be merged into one. The apps have a few differences that can be handled
with some if
statements or by consolidating behaviors. For example, if they
use different versions of a dependency, it's easy enough to pick the more recent
one, test and move on.
The Bug
When testing the build from the merged codebase in a QA environment, someone noticed and filed a bug that the top navigation bar appears unstyled for a second.
Debugging: The Hunch
Since this is, as they say, NotMyFirstRodeo, my first hunch is that server side
rendering is broken, skipping a <link rel=stylesheet>
tag injection at runtime.
I've seen this happen before, and I've just fixed a JS exception in SSR the day before.
Because I'm unable to cursorily reproduce the bug, I send it back to the originator
and move on with my day. The next day, I get the same bug back. Not Fixed.
This time I reach out to the originator and they point out that the bug only reproduces in Firefox. I had missed that in the report the day before!
This time, I put in a little more effort. I build the production server locally.
FastBoot (Ember.js's SSR solution) has a different app server for local builds
(using a middleware that injects into the regular development ember server
),
and a different server for prodution (fastboot-app-server
). They're both roughly
the same, and the meat of the SSR is the same fastboot
package, but as luck
would have it, The fastboot
package in our production server is using v3 and
and the one in ember-cli-fastboot
is v2. It hasn't happened yet, but I'm always
paranoid that some day this will cause that special flavor of "works on my machine bug"
that we love to see as developers. eye roll
So I build the production server and spin it up.
DOES. NOT. REPRODUCE.
At this moment, if I had sleeves I would have rolled them up. It's time to dig into the QA environment. See the belly of the beast.
I ssh
into one of the hosts and tail the server logs. These logs should be
forwarded to Splunk, but I want to see them in realtime and I have no idea how
the Splunk team's synchronization works, or how long it takes. (There's a whole
aside here about how long it took me to get access to and setup the credentials
to be able to ssh
into these remote machines. Long story short, it took a long
time when I first got it, and if I ever lost access, I don't know how long it
would take to get again.)
The issue is apparent. FastBoot.require('some-package')
is throwing an exception,
and the stack trace is all over the logs. That's a red flag if I ever saw one!
That would definitely break the server side render and cause an flash of unstyled text.
I still have two remaining mysteries: why can't I reproduce this locally? and
why only Firefox? A server side exception has no bearing on the browser.
Looking at the code, it's clear that some-package
is in the devDependencies
,
and fastbootDependencies
keys. For it to be part of the available dependencies
in the production server, it needs to be in dependencies
.
{
"name": "my-app",
"devDependencies": {
- "some-package": "1.3.0",
},
"dependencies": {
+ "some-package": "1.3.0",
},
"fastbootDependencies": [
"some-package"
]
}
I also confirm that before the repositories merged, this was configured correctly.
An Aside on FastBoot
During ember build
, the ember-cli-fastboot
addon programmatically generates
a package.json
based on the actual source package.json
file. But it needs
packages to be in both the dependencies
section (to get the version) and the
fastbootDependencies
(to allow list it in the sandbox). There are a couple issues
with this approach:
- It's impossible to deterministically install dependencies for production, as there is no lockfile.
- It's very easy to get it wrong (which is still a sensible tradeoff for the extra security!)
Back to Debugging
So why doesn't it reproduce locally?
It turns out that FastBoot.require()
wraps Node's require
, which can
snake up your directory tree to find modules in node_modules/
in parent directories.
Try this in an empty directory
## create a directory tree
mkdir -p parent/middle/leaf
## create a pseudo package in the top level of that tree
mkdir -p parent/node-modules/bliggity-bloo
echo "module.exports.default = 'what up council!'" >> parent/node-modules/bliggity-bloo/index.js`
## go into the leaf directory
cd parent/middle/leaf
## evaluate some code to exercise the pseudo package
##
## The -e flag is to evaluate JS without entering the Node repl.
## You can also enter the Node repl and require('bliggity-bloo') for the
## same effect.
node -e "console.log(require('bliggity-bloo').default)"
Notice that the bliggity-bloo
package is two directories above, but it's
still available to require!
So that explains why I couldn't reproduce the bug locally. My dist
directory
was nested inside my app code directory, which has its own node_modules
that
include development dependencies. So when running the production server using
from my-app/dist/
, FastBoot.require()
worked just fine on my machine even
with the production server!
But if I move my production dist
somewhere else, say:
# move the dist direct to my hom directory
mv my-app/dist ~/
# go there
cd ~/dist
# start the server
I can reproduce the SSR failure with the same stack trace as I saw in the QA server logs!
Debugging: Firefox vs Chrome
Why does the bug only reproduce in Firefox?
I haven't figured this out, but what I did notice is that I could reproduce
in Chrome by simulating a slower CPU. My hunch here is that the Firefox
prioritizes the <link>
injection and then the resource download slightly differently
than Chrome. Without CPU throttling, the flash of unstyled text was not observable
to the human eye.
I would love to know if someone has any ideas!
Zooming out
So why do we inject a <link>
stylesheet at runtime in the first place? Why isn't
it just part of index.html
? The navigation on our website is built by a different
team to be applied across many disparate domains. They provide JS and CSS assets
as a package. To render it, we have to find the right version for the right locale
at runtime and inject that into the DOM. We can usually do this during SSR, but
when that fails, the same functionality runs on the client (the browser) as well.
This isn't the most convenient architecture to maintain, but it's a function of how different teams work together. Conway's Law in full effect.
Conclusion
Sometimes things are more complicated than they seem! Sometimes we make them more complicated than they need to be. Sometimes the truth is somewhere in the middle and complications arise from constraints that we can't control or are too not worth changing.
I don't think this was a particularly hard problem to solve. I knew where to look and I knew next steps for debugging as each step before left me with more questions. But at the very least, it took two cycles of communicaton with the bug originator, two rounds of reproducing the bug, and two rounds of switching contexts. This is the type of thing that eats up time in large organizations. This is the type of thing to be vigilant about. No bugs made it to production, but how often do our processes (or more generally: the status quo) have an opportunity cost that we don't capture?