Embedded Systems

1. The "Transmits but never receives" LR1121 Bug

A T3-S3 LR1121 relay transmitted perfectly but never received anything. Every sign pointed at dead hardware.

The Investigation & Fix

I mapped the entire RX chain interrupts instead of just RX_DONE. Zero partial detections meant the state machine never engaged. Root cause: RadioLib's startReceive() only arms continuous RX cleanly from STANDBY. Post-boot, it silently failed to arm. The fix had two halves: radio.standby() before the boot-time arm, plus a re-arm after every transmit (the chip returns to standby after TX). RX reliability went from 0% to 100%. Lesson: Map the whole signal chain, not just the success interrupt.

C++ Undefined Behavior

2. SX1302 `-1` GPIO Sentinel.

SPI reads worked but writes to the SX1302 front-end were ignored. Tracing the reset path into the ESP32 port revealed that the reset mask was computed using a `-1` sentinel for an unconnected pin.

The Fix: (1 << -1) is undefined behavior in C. The corrupted mask meant the radio was never actually reset. I added explicit guards before shifting the mask. The board became production-ready for receive use.

Full-Stack Networking

3. HTTP/1.1 Socket Starvation.

Over the field Pi, the portal hung with an empty map—yet `curl` answered the endpoints in under 0.1s.

The Fix: DevTools showed exactly 6 ESTABLISHED sockets. HTTP/1.1 browsers cap at 6 connections per host, and I had wired two permanent Server-Sent-Events streams onto every page, eating the entire pool. I made the data-stream SSE lazy, opening it only when needed.

Spatial Algorithms

4. The Fiji → Hawaii Dateline Bug.

The global A* ocean router sent a 2,700-nm Pacific crossing 13,000 nm around the world.

The Fix: The cost landscape was a single -180 to +180 grid that clipped at the boundary. I solved it by searching a wrapped longitude space—replicating water cells at L ± 360° so the geodesically short path across the dateline became reachable. Verified at 2,729 nm, with non-dateline routes (NYC→Lisbon, CA→HI) provably unchanged.

Performance

5. The GZip Compression Storm.

Cold loads took 6.4 seconds because GZipMiddleware recompressed multi-MB static bundles on the event loop for every request.

The Fix: I pre-compressed *.gz siblings once at startup and served them directly. Cold-boot DOMContentLoaded dropped from 6,400 ms to 516 ms.

Sensor Data Integrity

6. Phantom GPS — a Sensor 100 km From Where It Sat.

A Supreme node showed a wandering position up to ~100 km from its actual bench location, and had quietly archived ~370 bogus coordinates that contaminated any historical query. The module reported 0–2 satellites and an HDOP of 25+—no real lock—yet TinyGPS++ flagged the coordinates "valid," so the firmware transmitted them unconditionally and the gateway stored every one.

The Fix: "Valid" from the GPS library is an internal flag, not a quality guarantee. I gated position at two layers. In firmware, a gpsHasRealFix() check rides latitude/longitude only when the fix is valid and fresh (< 5 s) and ≥ 4 satellites and HDOP in range; otherwise the node reports its satellite count and HDOP but no position. At the gateway, an ingest-side quality filter so last-known-fix only counts frames with real fixes. I deliberately chose non-destructive filtering over a destructive DB cleanup. Lesson: validate real data at the point of origin and the point of entry—once phantom data is archived it is nearly impossible to remove cleanly.

Subprocess & Environment

7. The Wrong Interpreter in a Fallback Chain.

Copernicus wind / current / SST overlays rendered as "degraded, empty frames." The gateway logs showed the rendering subprocess exiting with ModuleNotFoundError. The overlay subprocess runs copernicusmarine.subset under a chosen interpreter, and the selection logic blindly trusted the first candidate that existed.

The Fix: The fallback chain was (1) an env var, (2) a cached bundled runtime, (3) sys.executable. The bundled runtime existed but lacked the copernicusmarine package—so the subprocess picked it, failed the import, and returned an empty placeholder, while the system Python that did have the package sat unused. I capability-check before committing: test each candidate with importlib.util.find_spec(...) and use the first that actually has it. Lesson: capability-test before you trust—the "better" cached option may be incomplete.

Concurrency

8. Fail-Fast Lock Contention.

Toggling a Copernicus field on would intermittently return 0 features; toggling off and on sometimes fixed it. Maddeningly timing-dependent. The field endpoint guarded an expensive remote download with a non-blocking lock—so whenever a background cache-warmer or any concurrent request held it, an interactive toggle instantly fell back to a degraded empty response instead of waiting.

The Fix: Interactive endpoints now wait up to 20 s for the lock, and additionally fall back to reading the disk-cached cube without the lock (the lock only needs to serialize remote downloads; reading an existing cube is safe and concurrent). Six simultaneous cold requests—the case that degraded 8/8 to empty—now all return real data. Lesson: prefer blocking-with-timeout plus a read-only fallback so the system degrades to stale rather than empty.