Three silent broker bugs we caught this week
A log line that fires every five minutes for three weeks isn't noise. It's a real bug you've stopped noticing. This week one user-reported symptom unraveled into three separate broker integration bugs we'd been missing, each failing quietly enough that the only honest signal was a log entry that blended in with the normal warnings. The pattern in all three is the same: the system stopped screaming about something it should have been screaming about. Here are all three, plus what we changed in how we read logs going forward.
The first signal
User reports a button doesn't work. Specifically: "I clicked Connect E*Trade in Safari and got an OAuth 404 error in the dashboard." That's not a normal failure mode — OAuth flows fail in lots of ways (timeouts, expired tokens, signature errors), but a clean Tomcat 404 on the request_token endpoint means the route is gone. Either E*TRADE deprecated it (real), the consumer key got blocked (possible), or our request was malformed (also possible).
We probed. First test: hit the URL directly via curl with a browser-like User-Agent.
$ curl -sS -o /dev/null -w "%{http_code}\n" \
https://api.etrade.com/oauth/request_token
404
OK. Not a User-Agent thing. Then with a proper OAuth signature via the Python requests-oauthlib library:
>>> cli = etrade_client.get_client()
>>> cli.start_oauth()
EtradeError: Token request failed with code 404
Same shape. The endpoint is genuinely returning 404 to whatever we send. So we did the obvious thing — tested with the actual OAuth signature, built by hand:
>>> # ...manually constructed OAuth 1.0a header...
>>> r = requests.post(URL, headers={'Authorization': auth_hdr})
>>> r.status_code
200
>>> r.text[:80]
'oauth_token=X3d9Dbfdl%2FvSk0hQsKO%2FH9FR9PCnaDXOlwP46C7i3dA%3D...'
So now the situation is weird. The endpoint works perfectly from a hand-built call, returns 404 from the long-running server, returns 404 from requests-oauthlib in a fresh Python REPL, and returns 200 from requests-oauthlib with audit_framework imported. Time to actually trace.
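The hand-built header is elided in the transcript above. For anyone who wants to run the same isolation test, here is a minimal sketch of an OAuth 1.0a HMAC-SHA1 Authorization header for a request-token call, using only the standard library. It follows the generic RFC 5849 recipe; the function name, the "oob" callback, and the exact parameter set are assumptions for illustration, not the code we actually ran.

```python
import base64
import hashlib
import hmac
import secrets
import time
from urllib.parse import quote


def _pct(value):
    # RFC 3986 percent-encoding: only unreserved characters stay literal.
    return quote(str(value), safe="")


def build_request_token_header(method, url, consumer_key, consumer_secret,
                               callback="oob", nonce=None, timestamp=None):
    """Return an OAuth 1.0a Authorization header value (hypothetical helper).

    Assumes the URL carries no query string and the request has no body
    parameters, which is the usual shape of a request-token call.
    """
    params = {
        "oauth_callback": callback,
        "oauth_consumer_key": consumer_key,
        "oauth_nonce": secrets.token_hex(16) if nonce is None else nonce,
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(int(time.time()) if timestamp is None else timestamp),
        "oauth_version": "1.0",
    }
    # Signature base string: METHOD & encoded URL & encoded sorted parameters.
    normalized = "&".join(f"{_pct(k)}={_pct(v)}" for k, v in sorted(params.items()))
    base_string = "&".join([method.upper(), _pct(url), _pct(normalized)])
    # No token secret exists yet at the request-token step, hence the bare "&".
    signing_key = f"{_pct(consumer_secret)}&"
    digest = hmac.new(signing_key.encode(), base_string.encode(), hashlib.sha1).digest()
    params["oauth_signature"] = base64.b64encode(digest).decode()
    fields = ", ".join(f'{_pct(k)}="{_pct(v)}"' for k, v in sorted(params.items()))
    return "OAuth " + fields
```

Passing the result as the Authorization header to a plain HTTP client takes the OAuth library, and any Session it holds, out of the picture, which is what makes this a useful control.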
Bug 1 — OAuth stale-pool 404 (long-running processes only)
What we eventually nailed down: the OAuth endpoint works for fresh TCP connections and silently fails for connections reused from a long-lived urllib3 connection pool. The Python requests library, by default, reuses sockets across calls in the same Session. Over hours of uptime, those sockets accumulate state (keepalive timing, or whatever else about an aged connection E*TRADE's WAF doesn't like), and the gateway starts returning 404 (not 401, not 429, but 404) to requests that arrive on those reused sockets.
Trying to reproduce it in a fresh Python REPL doesn't work: the fresh process has no stale pool, so the call goes out on a brand-new TCP socket and the gateway accepts it. That's why my own debugging kept "fixing" it: not by fixing anything, but by spawning a new Python process for each test.
The same bug class hit us once before, in v6.7.39, on the /orders/preview endpoint. The pattern was identical: fresh Python = 200, control_server = HTTP 500 + E*TRADE code 100. We patched that endpoint at the time by retrying with a fresh Session on failure. We hadn't yet seen the failure in the OAuth flow, so we didn't extend the pattern there.
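The retry-with-a-fresh-Session workaround can be sketched roughly as follows. This is an illustration of the pattern, not the v6.7.39 patch itself: the function name, the parameters, and the set of suspect status codes are assumptions. Sessions are duck-typed, so in practice session_factory would be something like requests.Session.

```python
def get_with_fresh_session_retry(url, pooled_session, session_factory,
                                 suspect_statuses=(404, 500), **kwargs):
    """Try the long-lived session first; on a suspect status, retry once
    on a brand-new session so the request leaves on a fresh socket."""
    response = pooled_session.get(url, **kwargs)
    if response.status_code not in suspect_statuses:
        return response
    fresh = session_factory()
    try:
        # A new session has an empty connection pool, so this dials a new socket.
        return fresh.get(url, **kwargs)
    finally:
        fresh.close()
```

The retry is deliberately limited to one attempt: if a fresh socket also fails, the problem is not the pool and the error should surface.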
The fix in v7.8.75 is one line:
oauth = OAuth1Session(...)
oauth.headers["Connection"] = "close" # <- defeats keep-alive
resp = oauth.fetch_request_token(URL)
Setting Connection: close tells the server to drop the connection after the response, which forces urllib3 to dial a fresh socket the next time. Less efficient, but the OAuth flow runs at most a few times per day — not a hot path. Better to be a little slower than to fail silently inside the long-running server.
Pinned in tests/test_etrade_endpoint_pinning.py: start_oauth() and complete_oauth() must both set Connection: close. The pin defends against a future "tidy up the headers, requests handles keep-alive fine" cleanup that would silently reintroduce the bug.
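A pin of this kind might look something like the sketch below, which parses source with the standard ast module and checks that a named function assigns "close" to a "Connection" subscript. The helper name and the sample source are hypothetical; the real test file may be structured differently. It assumes Python 3.9+, where a subscript's slice is the expression node itself.

```python
import ast


def sets_connection_close(source, func_name):
    """True if func_name contains an assignment like x["Connection"] = "close"."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if not (is_func and node.name == func_name):
            continue
        for stmt in ast.walk(node):
            if not isinstance(stmt, ast.Assign):
                continue
            value_is_close = (isinstance(stmt.value, ast.Constant)
                              and stmt.value.value == "close")
            for target in stmt.targets:
                if (value_is_close
                        and isinstance(target, ast.Subscript)
                        and isinstance(target.slice, ast.Constant)
                        and target.slice.value == "Connection"):
                    return True
    return False


# Hypothetical source: one function keeps the header, one has lost it.
SAMPLE = '''
def start_oauth():
    oauth = make_session()
    oauth.headers["Connection"] = "close"
    return oauth

def complete_oauth():
    oauth = make_session()
    return oauth
'''

print(sets_connection_close(SAMPLE, "start_oauth"))     # True
print(sets_connection_close(SAMPLE, "complete_oauth"))  # False
```

Checking the syntax tree rather than grepping for a string means a comment that merely mentions Connection: close cannot satisfy the pin.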
Bug 2 — /v1/accounts/list route deprecation (404 in plain sight)
While debugging the OAuth one, we noticed something in server.log that had been there for weeks:
Broker reconcile failed:
E*Trade GET /v1/accounts/list → 404:
{'raw': '<!doctype html><html lang="en"><head>
<title>HTTP Status 404 — Not Found</title>...'}
Every five minutes. For weeks. The badge color hadn't changed; the dashboard still felt healthy because everything else worked — balance, portfolio, raise-stop — just the reconcile loop was quietly 404-ing. I'd seen the line dozens of times when scrolling logs for other reasons and mentally classified it as "transient API issue."
Then we live-probed against prod E*TRADE with two URL variants:
GET /v1/accounts/list → 404 (Tomcat HTML)
GET /v1/accounts/list.json → 401 (oauth_problem=token_expired)
The 401 is the good error — it means the route works and just needs a fresh token. The 404 means the suffix-less route is genuinely retired on prod. E*TRADE quietly migrated /v1/accounts/list to require the .json suffix without announcing it (or maybe they did, and I missed it; same outcome from where I was sitting).
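The reasoning behind that probe can be written down as a small triage helper. This is a sketch; the labels and function names are made up for illustration, and the status-code meanings are the conventional HTTP ones rather than anything E*TRADE-specific.

```python
from typing import Dict, List


def classify_probe(status_code: int) -> str:
    """Map a probe's HTTP status to what it says about the route."""
    if 200 <= status_code < 300:
        return "route-alive"
    if status_code in (401, 403):
        return "route-alive-auth-needed"   # the route matched; credentials did not
    if status_code == 429:
        return "route-alive-rate-limited"
    if status_code == 404:
        return "route-missing"
    if status_code >= 500:
        return "server-error"
    return "other"


def live_routes(probe_results: Dict[str, int]) -> List[str]:
    """Return the probed paths whose status shows the route still exists."""
    return [path for path, status in probe_results.items()
            if classify_probe(status).startswith("route-alive")]


probes = {"/v1/accounts/list": 404, "/v1/accounts/list.json": 401}
print(live_routes(probes))  # ['/v1/accounts/list.json']
```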
The fix in v7.8.73 is five characters: literally appending .json in etrade_client.py. The interesting part isn't the fix; it's the three weeks of "this isn't important enough to investigate." A 5-minute drumbeat of 404s in server logs is annoying-but-tolerable noise until you realize the consequence: the broker_state daemon couldn't resolve the account key, which broke account-specific reads for every code path that used it, which broke the Reconcile pill, which would have surfaced the issue if it had ever fired correctly. The downstream consequences were all happening invisibly.
Pinned with three regression tests: the function must call the .json variant, must NOT call the suffix-less form, and the bare /v1/accounts/list literal must appear nowhere else in the file. Defends against a "drop the .json, the Accept header is enough" cleanup PR.
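The three checks could be expressed along these lines. This is a sketch under assumptions: list_accounts is a hypothetical function name, and the real tests may inspect the file differently. It collects string constants from the syntax tree, so f-string fragments are covered, and so are docstrings, which is stricter than strictly necessary.

```python
import ast

SUFFIXED = "/v1/accounts/list.json"
BARE = "/v1/accounts/list"


def _strings(node):
    """All string constants under an AST node (f-string fragments included)."""
    return [n.value for n in ast.walk(node)
            if isinstance(n, ast.Constant) and isinstance(n.value, str)]


def _has_bare(text):
    # Strip the good route first so it does not count as a bare match.
    return BARE in text.replace(SUFFIXED, "")


def route_pin_violations(source, func_name):
    """Return a list of violated pins; an empty list means all three hold."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, ast.FunctionDef) and n.name == func_name]
    if not funcs:
        return [f"{func_name} not found"]
    in_func = _strings(funcs[0])
    violations = []
    if not any(SUFFIXED in s for s in in_func):
        violations.append("function does not use the .json route")
    if any(_has_bare(s) for s in in_func):
        violations.append("function uses the suffix-less route")
    if any(_has_bare(s) for s in _strings(tree)):
        violations.append("suffix-less literal appears in the file")
    return violations


GOOD = 'def list_accounts(c):\n    return c.get("/v1/accounts/list.json")\n'
print(route_pin_violations(GOOD, "list_accounts"))  # []
```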
Bug 3 — FEED DEGRADED false alarm (wrong asset class)
The third one was a different shape, found while tracing the first two. The data-feed health chip in the Risk Pulse row had been amber/red for weeks. The session-level provider counters showed something like this:
tradier 791 ✓ / 370 ✗
polygon_quote 0 ✓ / 370 ✗
etrade 0 ✓ / 370 ✗
yfinance 369 ✓ / 0 ✗
The fingerprint (identical 370-fail counts across three different providers) was the giveaway once we looked. Three independent provider failures arriving at exactly the same count on the same poll cadence isn't coincidence. It's the same set of calls failing across all three providers because they were all being asked the same impossible question.
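That fingerprint is easy to check for mechanically. A minimal sketch, assuming the counters are available as a mapping from provider name to (successes, failures); the function name and data shape are illustrative, and the numbers are the ones from the table above.

```python
from collections import defaultdict


def shared_fail_counts(counters, min_providers=2):
    """Group providers by nonzero failure count; keep counts that are shared."""
    by_count = defaultdict(list)
    for provider, (_ok, failed) in counters.items():
        if failed > 0:
            by_count[failed].append(provider)
    return {count: sorted(names) for count, names in by_count.items()
            if len(names) >= min_providers}


counters = {
    "tradier": (791, 370),
    "polygon_quote": (0, 370),
    "etrade": (0, 370),
    "yfinance": (369, 0),
}
print(shared_fail_counts(counters))
# {370: ['etrade', 'polygon_quote', 'tradier']}
```

An exact match is a strict test; on real counters a small tolerance may be more practical, since providers can also fail for reasons of their own.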
The macro_fetcher daemon polls eight index / futures symbols every ~40 seconds: ^VIX, ^TNX, CL=F, ES=F, and so on. Those symbols are indices and futures. Tradier, Polygon Quote, and E*TRADE are US-equity / options APIs. They structurally cannot quote an index or a future. Every macro fetch returned None and incremented the fail counter on three providers that were doing their job correctly — refusing to answer a question they couldn't answer.
Fixed in v7.8.42 by gating the equity-grade tiers behind a shared _is_index_futures_forex(ticker) predicate. Macro symbols now skip the equity tiers entirely and cascade straight to yfinance (the canonical source for them). Calling Tradier with ^VIX was never an error — it's an asset-class mismatch — and shouldn't pollute the provider-health counter.
Same bug class as the v7.8.2 E*TRADE false alarm, which we'd band-aided on the dashboard side at the time (recategorizing E*TRADE as a fallback tier so a 33-fail counter wouldn't trip the badge). v7.8.42 fixes the root cause in data_providers.py: routing gets fixed at the routing layer, not papered over at the display layer.
The common thread
All three failed in ways that looked like normal noise — warning logs mixed in with other warnings, badge tints that drifted gradually rather than flipping, errors that didn't bubble all the way up to the user. The work isn't catching the bug when it screams. It's noticing when the system stops screaming about something it should be screaming about.
Three concrete things we changed in how we read logs after this week:
- Repeated identical errors are not transient. A log line that fires every five minutes for three weeks is a structural failure, not a glitch. The phrase "I've been seeing that one for a while" is the smell. Investigate immediately.
- Identical counters across independent systems are diagnostic. If three different providers have identical fail counts on the same cadence, they're not three independent failures — they're one routing bug hitting three innocent providers.
- 404 on an API endpoint is information, not noise. A 404 means the server could not match the request to a route. APIs typically return 401 for auth issues, 429 for rate limits, and 5xx for server errors. A 404 on an endpoint that used to work says that the URL, or the path the request takes to reach it, has changed. Treat it as a deprecation notice the vendor forgot to send.
Plus: pin tests on every fix. The first one (OAuth Connection: close) and the second one (/v1/accounts/list.json) both got AST-introspection regression tests so a future "tidy up" can't quietly remove them. The third one got a provider-routing regression test on the predicate.
What this changes about the product
Nothing visible. Broker reconciliation works the way it always promised to. The data-feed health chip now tells the truth. The OAuth re-auth flow finishes cleanly on the first click. From the user's perspective: things that were quietly broken are quietly fixed, and the logs are a little quieter going forward.
What it changes about how we work: a slightly higher standard for reading logs. The "noise floor" of warnings we tolerate is going down. Anything that fires more than three times a day with the same text now gets investigated as a real bug until proven otherwise. The bar is higher because the alternative is what we just lived through — three real bugs that hid in plain sight for weeks.
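The three-times-a-day rule is simple enough to automate. A minimal sketch, assuming log entries have already been parsed into (day, message) pairs with the timestamp stripped from the message; the function name and the sample lines are hypothetical.

```python
from collections import Counter


def repeated_messages(entries, threshold=3):
    """Return {(day, message): count} for messages seen more than threshold times in a day."""
    counts = Counter((day, message.strip()) for day, message in entries)
    return {key: n for key, n in counts.items() if n > threshold}


# Sample data for illustration only.
sample = [("day-1", "Broker reconcile failed: GET /v1/accounts/list -> 404")] * 4
sample += [("day-1", "Quote cache miss")] * 3

for (day, message), n in repeated_messages(sample).items():
    print(f"{day}: {n}x {message}")
# day-1: 4x Broker reconcile failed: GET /v1/accounts/list -> 404
```

Matching on exact text keeps the rule as stated; messages that embed changing values such as request IDs would need normalizing first, or they will never repeat.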
The pitch (unchanged)
Swing Deck sells discipline applied to a process. The framework forces you to read 11 sub-scores before you trade. The risk pillars force you to acknowledge cap exposure before you size up. The Reconcile pill, after v7.8 ships, forces you to acknowledge broker-vs-dashboard drift before you act. This kind of work — discipline applied to the infrastructure itself — is what makes the rest credible. If we let the broker reconciliation 404 silently for weeks, we don't get to claim our broker integration is robust. So we treat the boring infrastructure bug-class lessons with the same seriousness as the trader-facing features.
Claim a Founding Slot — $14.50/mo
Want to see the actual fix commits? They're in the v7.8 stack: 1965bdc (compute_daily guard), ceecda8 (/v1/accounts/list .json), 37f2a33 (OAuth Connection: close), f2647c9 (FEED DEGRADED gate). Source at github.com/pinoy81/swing-audit. Public push happens June 1 when GitHub Actions quota refreshes — 16 commits stacked locally until then.
Disclosure: Swing Deck is built and operated by one person. The product is local-first; positions, broker tokens, and journal entries never leave your machine. AI features use your own API keys (BYOK). We don't proxy your data through our servers. Pricing as of 2026-05-16. Past performance is not indicative of future results. Nothing in this post is investment advice; it's a description of what software does.