All Systems Go: Creating your own startup checks with Nerves

By: Chris Freeze

Today I’ll be talking about a fun adventure I had with a simple project I did to familiarize myself with Nerves. My exploration introduced me to a whole new world of technical challenges I had never had to deal with before, but also led to me finding great solutions to said problems with the help of the Nerves team!

Getting Started

Nerves is a library and framework for crafting and deploying bulletproof embedded software in Elixir. It allows you to write Elixir and deploy it to devices such as a Raspberry Pi. It also has many libraries for working with things such as servos and sensors, allowing you to use Elixir to interact with the real world. Alternatively, you can just use it to run normal applications like bots inexpensively, which is what I settled on for my first independent Nerves project.

Everything was going well. I was using a library for my targeted application which took care of most of the heavy lifting. I also added in a Phoenix app for a “control panel” of sorts that could let me tweak settings or even impersonate my bot. Fun! It all worked great on my dev machine, so surely it would work the same on my Pi. Feeling confident, I ran mix firmware.burn and started up the Pi.

I was met by a fast-scrolling wall of text, signifying something had crashed the VM. I had little time to read what the cause was before the Pi restarted. A few more attempts let me conclude it was not a one-time error. Something was preventing the app from starting on the Pi but was not preventing it from starting in dev. Eventually I was able to find a helpful section in the avalanche of error text:

[info] Application discord_bot exited: Bot.Application.start(:normal, []) returned an error: shutdown: failed to start child: BotLib.Client
    ** (EXIT) exited in: GenServer.call(#PID<0.1173.0>, {:resource, :get, "users/@me", nil}, 5000)
        ** (EXIT) an exception was raised:
            ** (HTTPoison.Error) :nxdomain
            ...

The library I was using was making HTTP requests during its startup, and the Pi had no internet access yet when those calls went out, so the lookup failed with :nxdomain and the app failed to start. I wanted to avoid writing my own bot API since this was an exercise to get familiar with Nerves. I had the Pi plugged into my switch during its boot, but acquiring an IP address isn’t instantaneous. Further debugging revealed yet another problem. Even if I disabled automatic starting of the bot library and manually started it through a TTY session, it still failed to start due to an SSL error. The bot library used HTTPS to send certain requests to the remote API, and the Pi had no means of keeping time when turned off, so it booted with the wrong time and certificate validation failed. I needed to figure out a way to pause the loading of my application until two things were confirmed:

  1. I needed to have an internet connection.
  2. I needed to set the system time via NTP.
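
The second requirement can be made concrete with a small sanity check. This is only a sketch: the module name, the function name, and the 2018 cutoff are all assumptions for illustration, not part of Nerves or the bot library. The idea is that a board without a battery-backed clock boots with a time far in the past, so any reading before a known-good cutoff can be treated as "not set yet".

```elixir
defmodule ClockSketch do
  # Hypothetical helper, not part of Nerves. The cutoff is an arbitrary
  # date that is safely in the past for any real deployment.
  @cutoff ~U[2018-01-01 00:00:00Z]

  # Returns true when `now` is at or after the cutoff, meaning the clock
  # has probably been set. `now` is injectable so the logic can be tried
  # without touching the system clock.
  def clock_plausible?(now \\ DateTime.utc_now(), cutoff \\ @cutoff) do
    DateTime.compare(now, cutoff) != :lt
  end
end
```

A check like this only tells you whether the clock looks wrong; it does not fix it, which is why the time still has to come from NTP.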

Enter Shoehorn

shoehorn (previously known as bootloader) is a library that allows you to control the order in which certain apps start up, and ensure they start before everything else. If you use nerves_init_gadget, it comes with shoehorn by default. With shoehorn added, you can set up your configuration like so:

config :shoehorn,
  init: [:nerves_runtime, :nerves_network, :nerves_init_gadget, {SystemCheck, :ensure_environment, []}],
  app: :pi_bot_fw

The trick to this whole process and the subject of this blog post is the final item in our shoehorn list, which you may have recognized as an MFA (Module, Function, Arguments) tuple.
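
If MFA tuples are new to you, here is a minimal sketch of how one is invoked in plain Elixir. It says nothing about how shoehorn handles the tuple internally; SystemCheckSketch is a hypothetical stand-in for the real SystemCheck module, and apply/3 is the standard Kernel function for this kind of call.

```elixir
defmodule SystemCheckSketch do
  # Stand-in for the real SystemCheck module; it only reports success.
  def ensure_environment, do: :ok
end

# An MFA tuple names a module, a function, and a list of arguments.
{mod, fun, args} = {SystemCheckSketch, :ensure_environment, []}

# apply/3 performs the call the tuple describes, which here is the same
# as writing SystemCheckSketch.ensure_environment().
result = apply(mod, fun, args)
IO.inspect(result)
```

The empty list in the third position means the function is called with no arguments.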

Using Shoehorn

To solve my Pi startup problems, I created a function which will block the startup of our app until it sees fit. I needed someplace to put this function, so I created an extremely simple app called SystemCheck. It has two functions even though one is sufficient (I left it at two to indicate the potential of expansion upon its features in the future). The function referenced in the MFA tuple above is intended to call any number of functions to verify the integrity of the environment before starting the main application, though right now it just calls set_time/0. set_time tries to set the time up to four times (tries 0 through 3 in the code below), waiting a longer amount of time in-between each try until it finally gives up and prevents the app from starting by triggering a device reboot.

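  require Logger

  # Function head declaring the default: start counting at attempt 0.
  def set_time(tries \\ 0)

  def set_time(tries) when tries < 4 do
    # Wait a little longer before each retry (0s, 1s, 2s, 3s).
    Process.sleep(1000 * tries)

    # A successful DNS lookup doubles as the "do we have internet?" check.
    case :inet_res.gethostbyname('0.pool.ntp.org') do
      {:ok, {:hostent, _url, _, _, _, _}} ->
        do_try_set_time(tries)

      {:error, err} ->
        Logger.error("Failed to set time (#{tries}): DNS Lookup: #{inspect(err)}")
        set_time(tries + 1)
    end
  end

  # Out of attempts: reboot the device so the whole sequence starts over.
  def set_time(_tries), do: Nerves.Runtime.reboot()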

do_try_set_time is a slightly more complex function that calls out to ntpd. If you would like to see its implementation, the source of this code is available here. I used FarmBot’s NTP module as a close guide when writing this code, so big thanks to them and the help they gave me on Slack when I was figuring this out for the first time!

set_time is able to simultaneously ensure the time is set and check for internet because if it succeeds in setting the time, that means it managed to make a successful connection to the NTP server. A more advanced version of ensure_environment could check specific hosts important to the app instead of just an NTP server for extra validation, but I was satisfied with relying on ntpd for this project.
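
As a rough idea of what checking app-specific hosts might look like, here is a sketch that is not taken from the project. HostCheckSketch and check_hosts are hypothetical names. The default resolver is the same :inet_res.gethostbyname/1 call used in set_time, and it is passed in as an argument so the logic can be tried without a network.

```elixir
defmodule HostCheckSketch do
  # Hypothetical helper: returns :ok when every host resolves, otherwise
  # {:error, hosts_that_failed}. Only DNS resolution is checked here; a
  # host that resolves may still be unreachable.
  def check_hosts(hosts, resolver \\ &:inet_res.gethostbyname/1) do
    failed =
      Enum.reject(hosts, fn host ->
        # :inet_res expects a charlist, so convert the binary host name.
        match?({:ok, _}, resolver.(String.to_charlist(host)))
      end)

    case failed do
      [] -> :ok
      _ -> {:error, failed}
    end
  end
end
```

On a device you would call it with the real host names your app depends on, for example the bot API's domain, and treat an {:error, hosts} result the same way set_time treats a failed DNS lookup.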

When the function succeeds, it returns :ok, which allows the app to start successfully. If it fails, it calls Nerves.Runtime.reboot, which prevents the app from starting. Once the device has rebooted, the whole check runs again and gets another chance to connect to the internet.
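
The success and failure paths can also be written as a generic helper. The following is a sketch of the same retry-then-give-up pattern, not the project's code: RetrySketch and its parameters are hypothetical, and the give-up action is passed in as a function so the example runs on a dev machine. On a device that function could be one that calls Nerves.Runtime.reboot.

```elixir
defmodule RetrySketch do
  # Runs `check` up to `max_tries` times, sleeping `delay_ms * attempt`
  # before each attempt (so the first attempt runs immediately).
  # Returns :ok on the first success; otherwise returns whatever
  # `on_give_up` returns.
  def run(check, on_give_up, max_tries \\ 4, delay_ms \\ 1000) do
    attempt(check, on_give_up, max_tries, delay_ms, 0)
  end

  # All attempts used up: hand control to the give-up action.
  defp attempt(_check, on_give_up, max_tries, _delay_ms, n) when n >= max_tries do
    on_give_up.()
  end

  defp attempt(check, on_give_up, max_tries, delay_ms, n) do
    Process.sleep(delay_ms * n)

    case check.() do
      :ok -> :ok
      _error -> attempt(check, on_give_up, max_tries, delay_ms, n + 1)
    end
  end
end
```

Keeping the check and the give-up action as arguments makes it easy to reuse the same loop for other startup checks.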

Additional notes

With additional functions being called in ensure_environment, you could use a with statement (or something similar to Plug.Conn’s :halted key, with a map passed through the different checks) to stop running checks once one has failed. Alternatively, you may prefer to run every check regardless, so that each failed start gives you a list of everything that went wrong. There are many ways to approach this problem based on your needs. Now that you have full control over your application’s startup sequence, you need fear no embedded device problem that comes your way, at least on the software side.
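
Here is a small sketch of both approaches side by side. Everything in it is hypothetical (the module, the function names, and the convention that each check is a zero-argument function returning :ok or {:error, reason}); it only illustrates the trade-off, not how SystemCheck is actually written.

```elixir
defmodule ChecksSketch do
  # Fail fast: `with` stops at the first check that does not return :ok
  # and returns that value unchanged, so later checks never run.
  def fail_fast(network_check, clock_check) do
    with :ok <- network_check.(),
         :ok <- clock_check.() do
      :ok
    end
  end

  # Collect everything: run every check and gather the reasons from all
  # that failed, so one boot attempt reports every problem at once.
  def collect_failures(checks) do
    reasons = for check <- checks, {:error, reason} <- [check.()], do: reason

    if reasons == [], do: :ok, else: {:error, reasons}
  end
end
```

With a failing network check and a failing clock check, fail_fast returns only the network error, while collect_failures returns both reasons in order.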

Wrapping up

I’d like to give additional thanks to the developers at FarmBot and the Nerves Slack channel in the Elixir Slack (#nerves) for helping me out with this and all my other Nerves questions. They are without question one of the most helpful groups of developers I’ve ever had the pleasure of asking questions to. If you have any questions about Nerves or Shoehorn, check out the Shoehorn docs or Nerves docs or ask a question on Slack.