📝 10 Nov 2024
Within Two Weeks: We squashed our GitHub Actions spending from $4,900 (weekly) down to $890…
Previously: Our developers waited 2.5 Hours for a Pull Request to be checked. Now we wait at most 1.5 Hours! (Pic below)
This article explains everything we did in the (Semi-Chaotic) Two Weeks for Apache NuttX RTOS…
Shut down the macOS and Windows Builds, revive them in a different form
Merge Jobs are super costly, we moved them to the NuttX Mirror Repo
We Halved the CI Checks for Complex PRs. (Continuous Integration)
Simple PRs are already quite fast. (Sometimes 12 Mins!)
Coding the Build Rules for our CI Workflow, monitoring our CI Servers 24 x 7
We can’t run All CI Checks, but we NuttX Devs can help ourselves!
We had an ultimatum to reduce (drastically) our usage of GitHub Actions. Or our Continuous Integration would Halt Totally in Two Weeks!
After deliberating overnight: We swiftly activated our rescue plan…
Submit / Update a Complex PR:
CI Workflow shall trigger only Half the Jobs for CI Checks.
(A Complex PR affects All Architectures: Arm32, Arm64, RISC-V, Xtensa, etc. Will reduce GitHub Cost by 32%)
Merge a Complex PR:
CI Workflow shall Run All Jobs like before.
(arm-01 … arm-14, risc-v, xtensa, etc)
Simple PRs:
No change. Thus Simple Arm32 PRs shall build only arm-01 … arm-14.
(A Simple PR concerns only One Single Architecture: Arm32 OR Arm64 OR RISC-V etc)
After Merging Any PR:
Merge Jobs shall run at NuttX Mirror Repo.
(Instead of OG Repo apache/nuttx)
Two Scheduled Merge Jobs:
Daily at 00:00 UTC and 12:00 UTC.
(No more On-Demand Merge Jobs)
macOS and Windows Jobs:
Shall be Totally Disabled.
(Until we find a way to manage their costs)
We have good reasons for doing all this, backed by solid data…
We studied the CI Jobs for the previous day…
Many CI Jobs were Incomplete: We wasted GitHub Runners on Merge Jobs that were eventually superseded and cancelled (pic above, we’ll come back to this)
Scheduled Merge Jobs will reduce wastage of GitHub Runners, since most Merge Jobs didn’t complete. Only One Merge Job completed on that day…
When we Halve the CI Jobs: We reduce the wastage of GitHub Runners…
This analysis was super helpful for complying with the ASF Policy for GitHub Actions! Next we follow through…
Quitting the macOS Builds? That’s horribly drastic!
Yeah sorry we can’t enable macOS Builds in NuttX Repo right now…
macOS Runners cost 10 times as much as Linux Runners.
To enable One macOS Job: We need to disable 10 Linux Jobs! Which is not feasible.
Our macOS Jobs are in an untidy state right now, showing many many warnings.
We need someone familiar with Intel Macs to clean up the macOS Jobs.
(See the macOS Log)
That’s why we moved the macOS Builds to the NuttX Mirror Repo, which won’t be charged to NuttX Project.
Can we still prevent breakage of ALL Builds? Linux, macOS AND Windows?
Nope this is simply impossible…
In the good old days: We were using far too many GitHub Runners.
This is not sustainable, we don’t have the budget to do all the CI Checks we used to.
Hence we should expect some breakage.
We should be prepared to backtrack and figure out which PR broke the build.
That’s why we have tools like NuttX Dashboard (pic above), to detect breakage earlier.
(Without depending on GitHub CI)
Remember to show Love and Respect for NuttX Devs!
Previously we waited 2.5 Hours for All CI Checks. Now we wait at most 1.5 Hours, let’s stick to this.
What about the Windows Builds?
Recently we re-enabled the Windows Builds, because they’re not as costly as macOS Builds.
We’ll continue to monitor our GitHub Costs. And shut down the Windows Builds if necessary.
(Windows Runners are twice the cost of Linux Runners)
What are Merge Jobs? Why move them?
Suppose our NuttX Admin Merges a PR. (Pic above)
Normally our CI Workflow will trigger a Merge Job, to verify that everything compiles OK after Merging the PR.
Which means ploughing through 34 Sub-Jobs (2.5 elapsed hours) across All Architectures: Arm32, Arm64, RISC-V, Xtensa, macOS, Windows, …
This is extremely costly, hence we decided to trigger them as Scheduled Merge Jobs. I trigger them Twice Daily: 00:00 UTC and 12:00 UTC.
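How does the CI Workflow know that a PR was merged? Roughly speaking: a merge is a push to the master branch, so the workflow fires on the push trigger. (A sketch of the assumed shape; build.yml in apache/nuttx is the real thing)

```yaml
# Sketch: a Merge Job is the CI Workflow triggered by a push to master
# (assumed trigger shape; see build.yml for the actual triggers)
on:
  push:
    branches: [master]
```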
Is there a problem?
We spent One-Third of our GitHub Runner Minutes on Scheduled Merge Jobs! (Pic above)
Our CI Data shows that the Scheduled Merge Job kept getting disrupted by Newer Merged PRs. (Pic below)
And when we restart a Scheduled Merge Job, we waste precious GitHub Minutes.
(101 GitHub Hours for one single Scheduled Merge Job!)
Our Merge Jobs are overwhelming!
Yep this is clearly not sustainable. We moved the Scheduled Merge Jobs to a new NuttX Mirror Repo. (Pic below)
Where the Merge Jobs can run free without disruption.
(In an Unpaid GitHub Org Account, not charged to NuttX Project)
What about the Old Merge Jobs?
Initially I ran a script that would quickly Cancel any Merge Jobs appearing in NuttX Repo and NuttX Apps. (Sketch below)
Eventually we disabled the Merge Jobs for NuttX Repo.
(Restoring Auto-Build on Sync)
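Here’s a minimal sketch of how such a Cancel Script might work with GitHub CLI. (Not the actual kill-push-master.sh, just the idea)

```bash
## Sketch: cancel the running Merge Jobs (push-triggered runs) in one repo
## (not the actual kill-push-master.sh; repeat for apache/nuttx-apps)
repo=apache/nuttx
for run_id in $(
  gh run list --repo $repo --workflow build.yml \
    --event push --status in_progress \
    --json databaseId --jq '.[].databaseId'
); do
  echo "Cancelling Merge Job $run_id"
  gh run cancel $run_id --repo $repo
done
```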
How to trigger the Scheduled Merge Job?
Every Day at 00:00 UTC and 12:00 UTC: I do this…
Browse to the NuttX Mirror Repo
Click “Sync Fork > Discard Commits”
Which will Sync our Mirror Repo based on the Upstream NuttX Repo
Run this script to enable the macOS Builds: enable-macos-windows.sh
Which will also Disable Fail-Fast and grind through all builds. (Regardless of error, pic below)
And Remove Max Parallel to use unlimited concurrent runners. (Because it’s free! Pic below, sketch after these steps)
If the Merge Job fails with a Mystifying Network Timeout: I restart the Failed Sub-Jobs. (CI Test might overrun)
Wait for the Merge Job to complete. Then Ingest the GitHub Logs (like an Amoeba) into our NuttX Dashboard. (Next article)
Track down any bugs that Fail the Merge Job.
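Those two tweaks (Fail-Fast and Max Parallel) might look like this inside build.yml on the Mirror Repo. (A sketch of the assumed shape; enable-macos-windows.sh does the actual patching)

```yaml
# Sketch: strategy tweaks for the Mirror Repo (assumed shape)
Linux:
  strategy:
    # Disable Fail-Fast: one failed board won't cancel the sibling builds
    fail-fast: false
    # max-parallel is removed: run unlimited concurrent runners (free here!)
    matrix:
      boards: ${{ fromJSON(needs.Linux-Arch.outputs.selected_builds) }}
```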
Is it really OK to Disable the Merge Jobs? What about Docs and Docker Builds?
Docker Builds: When Dockerfile is updated, it will trigger the CI Workflow docker_linux.yml. Which is not affected by this new setup, and will continue to execute. (Exactly like before)
Documentation: When the docs are updated, they are published to NuttX Website via the CI Workflow main.yml from the NuttX Website repo (scheduled daily). Which is not affected by our grand plan.
Release Branch: Merging a PR to the Release Branch will still run the PR Merge Job (exactly like before). Release Branch shall always be verified through Complete CI Checks.
Isn’t this cheating? Offloading to a Free GitHub Account?
Yeah that’s why we need a NuttX Build Farm. (Details below)
One-Third of our GitHub Runner Minutes was spent on Merge Jobs. What about the rest?
Two-Thirds of our GitHub Runner Minutes were spent on validating New and Updated PRs.
Hence we’re skipping Half the CI Checks for Complex PRs.
(A Complex PR affects All Architectures: Arm32, Arm64, RISC-V, Xtensa, etc)
Which CI Checks did we select?
Today we start only these CI Checks when submitting or updating a Complex PR (pic above)
Why did we choose these CI Checks?
We selected the CI Checks above because they validate NuttX Builds on Popular Boards (and for special tests)
| Target Group | Board / Test |
|---|---|
| arm-01 | Sony Spresense (TODO) |
| arm-05 | Nordic nRF52 |
| arm-06 | Raspberry Pi RP2040 |
| arm-07 | Microchip SAMD |
| arm-08, 10, 13 | STM32 |
| risc-v-02, 03 | ESP32-C3, C6, H2 |
| sim-01, 02 | CI Test, Matter |
We might rotate the list above to get better CI Coverage.
(See the Complete List of CI Builds)
(Sorry we can’t run xtensa-02 and arm-01)
What about Simple PRs?
A Simple PR concerns only One Single Architecture: Arm32 OR Arm64 OR RISC-V OR Xtensa etc.
When we create a Simple PR for Arm32: It will trigger only the CI Checks for arm-01 … arm-14.
Which will complete earlier than a Complex PR.
(x86_64 Devs are the happiest. Their PRs complete in 10 Mins!)
Sounds awfully complicated. How did we code the rules?
Indeed! The Build Rules are explained here…
Hitting the Target Metrics in 2 weeks… Everyone needs to help out, right?
Our quota is 25 Full-Time GitHub Runners per day.
We published our own Live Metric for Full-Time Runners, for everyone to track…
Date: We compute the Full-Time Runners only for Today’s Date (UTC)
Elapsed Hours: Number of hours elapsed since 00:00 UTC
GitHub Job Hours: Elapsed Duration of all GitHub Jobs at NuttX Repo and NuttX Apps. (Cancelled / Completed / Failed)
This data is available only AFTER the job has been Cancelled / Completed / Failed. (Might have lagged by 1.5 hours)
But this is the Elapsed Job Duration. It doesn’t say that we’re running 8 Sub-Jobs in parallel. That’s why we need…
GitHub Runner Hours: Number of GitHub Runners * Job Duration. Effectively the Chargeable Minutes by GitHub.
We compute this as 8 * GitHub Job Hours. This is averaged from past data.
(Remember: One GitHub Runner will run One Single Sub-Job, like arm-01)
Full-Time GitHub Runners: Equals GitHub Runner Hours / Elapsed Hours.
It means: “How many GitHub Runners, running Full-Time, would it take to consume the GitHub Runner Hours?”
(We should keep this below 25 per day, per week, per month)
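(Quick example: at 12:00 UTC we have 12 Elapsed Hours. Suppose we’ve racked up 37.5 GitHub Job Hours. Then GitHub Runner Hours = 8 × 37.5 = 300. And Full-Time Runners = 300 / 12 = 25. Smack at our daily quota!)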
We publish the data every 15 minutes…
compute-github-runners.sh calls GitHub API to add up the Elapsed Duration of All Completed GitHub Jobs for today.
Then it extrapolates the Number of Full-Time GitHub Runners.
(1 GitHub Job Hour roughly equals 8 GitHub Runner Hours, which equals 8 Full-Time Runners Per Hour)
run.sh calls the script above and renders the Full-Time GitHub Runners as a PNG.
(Thanks to ImageMagick)
compute-github-runners2.sh: Is the Linux Version of the above macOS Script.
(But less accurate, due to BC Rounding)
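Curious how the computation might look? Here’s a minimal sketch (not the actual compute-github-runners.sh) that adds up today’s job durations via the GitHub API and extrapolates the Full-Time Runners…

```bash
#!/usr/bin/env bash
## Sketch only (not the actual compute-github-runners.sh):
## sum today's GitHub Job Hours, extrapolate the Full-Time Runners.
## (Repeat for apache/nuttx-apps and add them up)
set -e
repo=apache/nuttx
today=$(date -u +%Y-%m-%d)

## Add up the elapsed seconds of every completed job in today's workflow runs
total_secs=0
for run_id in $(
  gh api "repos/$repo/actions/runs?created=$today&per_page=100" \
    --jq '.workflow_runs[].id'
); do
  secs=$(
    gh api "repos/$repo/actions/runs/$run_id/jobs?per_page=100" \
      --jq '[ .jobs[]
              | select(.completed_at != null and .started_at != null)
              | (.completed_at | fromdateiso8601) - (.started_at | fromdateiso8601)
            ] | add // 0'
  )
  total_secs=$(( total_secs + secs ))
done

## 1 GitHub Job Hour roughly equals 8 GitHub Runner Hours (averaged from past data)
job_hours=$(( total_secs / 3600 ))
runner_hours=$(( 8 * job_hours ))

## Full-Time Runners = GitHub Runner Hours / Elapsed Hours since 00:00 UTC
elapsed_hours=$(( $(date -u +%H) + 1 ))  ## +1 to avoid dividing by zero
echo "Full-Time GitHub Runners today: $(( runner_hours / elapsed_hours ))"
```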
Next comes the Watchmen…
(Can we run All CI Checks for All PRs?)
Doesn’t sound right that an Unpaid Volunteer is monitoring our CI Servers 24 x 7 … But someone’s gotta do it! 👍
This runs on a 4K TV (Xiaomi 65-inch) all day, all night…
On Overnight Hikes: I check my phone at every water break…
If something goes wrong?
We have GitHub Scripts for Termux Android. Remember to “pkg install gh” and set GITHUB_TOKEN…
enable-macos-windows2.sh: Enable the macOS Builds in the NuttX Mirror Repo
compute-github-runners2.sh: Compute the number of Full-Time GitHub Runners for the day (less accurately than macOS version)
kill-push-master.sh: Cancel all Merge Jobs in NuttX Repo and NuttX Apps
It’s past Diwali and Halloween and Elections… Our CI Servers are still alive. We made it yay! 🎉
Within Two Weeks: We squashed our GitHub Actions spending from $4,900 (weekly) down to $890…
“Monthly Bill” for GitHub Actions used to be $18K…
Presently our Monthly Bill is $9.8K. Slashed by half (almost) and still dropping! Thank you everyone for making this happen! 🙏
(At Mid Nov 2024: Monthly Bill is now $3.1K 🎉)
Bonus Love & Respect: Previously our devs waited 2.5 Hours for a Pull Request to be checked. Now we wait at most 1.5 Hours!
Everything is hunky dory?
Trusting a Single Provider for Continuous Integration is a terrible thing. We got plenty more to do…
Become more resilient and self-sufficient with Our Own Build Farm
(Away from GitHub)
Analyse our Build Logs with Our Own Tools
(Instead of GitHub)
Excellent Initiative by Mateusz Szafoni: We Merge Multiple Targets into One Target
(And cut the Build Time)
🙏🙏🙏 Please join Your Ubuntu PC to our Build Farm! 🙏🙏🙏
But our Merge Jobs are still running in a Free Account?
We learnt a Painful Lesson today: Freebies Won’t Last Forever!
We should probably maintain an official Paid GitHub Org Account to execute our Merge Jobs…
New GitHub Org shall be sponsored by our generous Stakeholder Companies
(Espressif, Sony, Xiaomi, …)
New GitHub Org shall be maintained by a Paid Employee of our Stakeholder Companies
(Instead of an Unpaid Volunteer)
Which means clicking Twice Per Day to trigger the Scheduled Merge Jobs
(My fingers are tired, pic above)
And restarting the Failed Merge Jobs
Track down any bugs that Fail the Merge Job
New GitHub Org shall host the Official Downloads of NuttX Compiled Binaries
(For upcoming Board Testing Farm)
New GitHub Org will eventually Offload CI Checks from our NuttX Repos
(Maybe do macOS CI Checks for PRs)
Next Article: We’ll chat about NuttX Dashboard. And how we made it with Grafana and Prometheus.
Many Thanks to the awesome NuttX Admins and NuttX Devs! I couldn’t have survived the two chaotic and stressful weeks without your help. And my GitHub Sponsors, for sticking with me all these years.
Got a question, comment or suggestion? Create an Issue or submit a Pull Request here…
To run the Complete Suite of CI Checks on every PR… We could use Self-Hosted GitHub Runners?
Yep I tested Self-Hosted GitHub Runners, I wrote about my experience here: “Continuous Integration for Apache NuttX RTOS”
Self-Hosted GitHub Runners are actually quite complex to set up. And the machine needs to be properly secured, in case any unauthorised code is pushed down from GitHub.
We don’t have the budget to set up Virtual Machines (maintained by IT Security Professionals) for GitHub Runners anyway.
NuttX Project might be a little too dependent on GitHub. Even if we had the funds, the ASF contract with GitHub won’t allow us to pay more for extra usage. So we’re trying alternatives.
Right now we’re testing a Community-Hosted Build Farm based on Ubuntu PCs and macOS: “Your very own Build Farm for Apache NuttX RTOS”
Before submitting a PR to NuttX: How to check our PR thoroughly?
Yep it’s super important to thoroughly test our PRs before submitting to NuttX.
But NuttX Project doesn’t have the budget to run all CI Checks for New PRs. The onus is on us to test our PRs (without depending on the CI Workflow)
(1) Run the CI Builds ourselves with Docker Engine
(2) Run the CI Builds with GitHub Actions
(1) might be slower, depending on our PC. With (2) we don’t need to worry about Wasting GitHub Runners, so long as the CI Workflow runs entirely in our own personal repo, before submitting to NuttX Repo.
Here are the instructions…
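For (1), here’s a rough sketch of the local Docker run, based on the CI Workflow. (The linked instructions are authoritative; the exact cibuild.sh flags may differ)

```bash
## Sketch: build one CI Target Group locally in the NuttX CI Docker Image
## (assumes the ghcr.io CI image and testlist layout used by the CI Workflow)
sudo docker run -it \
  ghcr.io/apache/nuttx/apache-nuttx-ci-linux:latest \
  /bin/bash -c '
    cd /root &&
    git clone https://github.com/apache/nuttx &&
    git clone https://github.com/apache/nuttx-apps apps &&
    cd nuttx/tools/ci &&
    ## Build the risc-v-01 Target Group, like the CI Workflow does
    ./cibuild.sh -c -A -N -R testlist/risc-v-01.dat
  '
```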
What if our PR fails the check, caused by Another PR?
We wait for the Other PR to be patched…
Set our PR to Draft Mode
Keep checking the NuttX Dashboard (above)
Wait patiently for the Red Error Boxes to disappear
Rebase our PR with the Master Branch
Our PR should pass the CI Check. Set our PR to Ready for Review.
Otherwise we might miss a Serious Bug.
When NuttX merges our PR, the Merge Job won’t run until 00:00 UTC and 12:00 UTC. How can we be really sure that our PR was merged correctly?
Let’s create a GitHub Org (at no cost), fork the NuttX Repo and trigger the CI Workflow. (Which won’t charge any extra GitHub Runner Minutes to NuttX Project!)
This will probably work if our CI Servers ever go dark.
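A possible flow with GitHub CLI, assuming a hypothetical Org named my-nuttx-org…

```bash
## Sketch: fork NuttX into our own (free) GitHub Org, so the CI Workflow
## runs there ("my-nuttx-org" is a hypothetical name)
gh repo fork apache/nuttx --org my-nuttx-org --clone=false

## Workflows are disabled by default on forks: enable the Build Workflow
gh workflow enable build.yml --repo my-nuttx-org/nuttx

## After syncing the fork with upstream: watch the CI Workflow runs
gh run list --repo my-nuttx-org/nuttx --workflow build.yml
```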
Something super strange about Network Timeouts (pic above) in our CI Docker Workflows at GitHub Actions. Here’s an example…
First Run fails while downloading something from GitHub…
Configuration/Tool: imxrt1050-evk/libcxxtest,CONFIG_ARM_TOOLCHAIN_GNU_EABI
curl: (28) Failed to connect to github.com port 443 after 134188 ms: Connection timed out
make[1]: *** [libcxx.defs:28: libcxx-17.0.6.src.tar.xz] Error 28
Second Run fails again, while downloading NimBLE from GitHub…
Configuration/Tool: nucleo-wb55rg/nimble,CONFIG_ARM_TOOLCHAIN_GNU_EABI
curl: (28) Failed to connect to github.com port 443 after 134619 ms: Connection timed out
make[2]: *** [Makefile:55: /github/workspace/sources/apps/wireless/bluetooth/nimble_context] Error 2
Third Run succeeds. Why do we keep seeing these errors? GitHub Actions (running Docker) failing to connect to GitHub itself?
Is there a Concurrent Connection Limit for GitHub HTTPS Connections?
We see 4 Concurrent Connections to GitHub HTTPS…
The Fifth Connection failed: arm-02 at 00:42:52
Should we use a Caching Proxy Server for curl?
$ export https_proxy=https://1.2.3.4:1234
$ curl https://github.com/...
Is something misconfigured in our Docker Image?
But the exact same Docker Image runs fine on our own Build Farm. It doesn’t show any errors.
Is GitHub Actions starting our Docker Container with the wrong MTU (Network Packet Size)? 🤔
Meanwhile I’m running a script to Restart Failed Jobs on our NuttX Mirror Repo: restart-failed-job.sh
These Timeout Errors will cost us precious GitHub Minutes. The remaining jobs get killed, and restarting these killed jobs from scratch will consume extra GitHub Minutes. (The restart below costs us 6 extra GitHub Runner Hours)
How do we Retry these Timeout Errors?
Can we have Restartable Builds?
Doesn’t quite make sense to kill everything and rebuild from scratch (arm6, arm7, riscv7) just because one job failed (xtensa2)
Or xtensa2 should wait for others to finish, before it declares a timeout and croaks?
Configuration/Tool: esp32s2-kaluga-1/lvgl_st7789
curl: Failed to connect to github.com port 443 after 133994 ms:
Connection timed out
Initially we created the Build Rules for CI Workflow to solve these problems that we observed in Sep 2024…
NuttX Devs need to wait (2.5 hours) for the CI Build to complete Across all Architectures (Arm32, Arm64, RISC-V, Xtensa)…
Even though we’re modifying a Single Architecture.
We’re using too many GitHub Runners and Build Minutes, exceeding the ASF Policy for GitHub Actions
Our usage of GitHub Runners is going up ($12K per month)
We need to stay within the ASF Budget for GitHub Runners ($8.2K per month)
What if CI could build only the Modified Architecture?
Right now most of our CI Builds are taking 2.5 hours.
Can we complete the build within 1 hour, when we Create / Modify a Simple PR?
This section explains how we coded the Build Rules. Which were mighty helpful for cutting costs in Nov 2024.
We propose a Partial Solution, based on the Arch and Board Labels (recently added to CI)…
We target only the Simple PRs: One Arch Label + One Board Label + One Size Label.
Like “Arch: risc-v, Board: risc-v, Size: XS”
If “Arch: arm” is the only non-size label, then we build only arm-01, arm-02, …
Same for “Board: arm”
If Arch and Board Labels are both present: They must be the same
Similar rules for RISC-V, Simulator, x86_64 and Xtensa
Simple PR + Docs is still considered a Simple PR (so devs won’t be penalised for adding docs)
In our Build Rules: This is how we fetch the Arch Labels from a PR. And identify the PR as Arm, Arm64, RISC-V or Xtensa: arch.yml
# Get the Arch for the PR: arm, arm64, risc-v, xtensa, ...
- name: Get arch
  id: get-arch
  run: |
    # If PR is Not Created or Modified: Build all targets
    pr=${{github.event.pull_request.number}}
    if [[ "$pr" == "" ]]; then
      echo "Not a Created or Modified PR, will build all targets"
      exit
    fi

    # Ignore the Label "Area: Documentation", because it won't affect the Build Targets
    query='.labels | map(select(.name != "Area: Documentation")) | '
    select_name='.[].name'
    select_length='length'

    # Get the Labels for the PR: "Arch: risc-v \n Board: risc-v \n Size: XS"
    # If GitHub CLI Fails: Build all targets
    labels=$(gh pr view $pr --repo $GITHUB_REPOSITORY --json labels --jq $query$select_name || echo "")
    numlabels=$(gh pr view $pr --repo $GITHUB_REPOSITORY --json labels --jq $query$select_length || echo "")
    echo "numlabels=$numlabels" | tee -a $GITHUB_OUTPUT

    # Identify the Size, Arch and Board Labels
    if [[ "$labels" == *"Size: "* ]]; then
      echo 'labels_contain_size=1' | tee -a $GITHUB_OUTPUT
    fi
    if [[ "$labels" == *"Arch: "* ]]; then
      echo 'labels_contain_arch=1' | tee -a $GITHUB_OUTPUT
    fi
    if [[ "$labels" == *"Board: "* ]]; then
      echo 'labels_contain_board=1' | tee -a $GITHUB_OUTPUT
    fi

    # Get the Arch Label
    if [[ "$labels" == *"Arch: arm64"* ]]; then
      echo 'arch_contains_arm64=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Arch: arm"* ]]; then
      echo 'arch_contains_arm=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Arch: risc-v"* ]]; then
      echo 'arch_contains_riscv=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Arch: simulator"* ]]; then
      echo 'arch_contains_sim=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Arch: x86_64"* ]]; then
      echo 'arch_contains_x86_64=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Arch: xtensa"* ]]; then
      echo 'arch_contains_xtensa=1' | tee -a $GITHUB_OUTPUT
    fi

    # Get the Board Label
    if [[ "$labels" == *"Board: arm64"* ]]; then
      echo 'board_contains_arm64=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Board: arm"* ]]; then
      echo 'board_contains_arm=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Board: risc-v"* ]]; then
      echo 'board_contains_riscv=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Board: simulator"* ]]; then
      echo 'board_contains_sim=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Board: x86_64"* ]]; then
      echo 'board_contains_x86_64=1' | tee -a $GITHUB_OUTPUT
    elif [[ "$labels" == *"Board: xtensa"* ]]; then
      echo 'board_contains_xtensa=1' | tee -a $GITHUB_OUTPUT
    fi
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Why “ || echo ""
”? That’s because if the GitHub CLI gh fails for any reason, we shall build all targets.
This ensures that our CI Workflow won’t get disrupted due to errors in GitHub CLI.
We handle only Simple PRs: One Arch Label + One Board Label + One Size Label.
Like “Arch: risc-v, Board: risc-v, Size: XS”.
If it’s Not a Simple PR: We build everything. Like so: arch.yml
# inputs.boards is a JSON Array: ["arm-01", "risc-v-01", "xtensa-01", ...]
# We compact and remove the newlines
boards=$( echo '${{ inputs.boards }}' | jq --compact-output "." )
numboards=$( echo "$boards" | jq "length" )

# We consider only Simple PRs with:
#   Arch + Size Labels Only
#   Board + Size Labels Only
#   Arch + Board + Size Labels Only
if [[ "$labels_contain_size" != "1" ]]; then
  echo "Size Label Missing, will build all targets"
  quit=1
elif [[ "$numlabels" == "2" && "$labels_contain_arch" == "1" ]]; then
  echo "Arch + Size Labels Only"
elif [[ "$numlabels" == "2" && "$labels_contain_board" == "1" ]]; then
  echo "Board + Size Labels Only"
elif [[ "$numlabels" == "3" && "$labels_contain_arch" == "1" && "$labels_contain_board" == "1" ]]; then
  # Arch and Board must be the same
  if [[
    "$arch_contains_arm" != "$board_contains_arm" ||
    "$arch_contains_arm64" != "$board_contains_arm64" ||
    "$arch_contains_riscv" != "$board_contains_riscv" ||
    "$arch_contains_sim" != "$board_contains_sim" ||
    "$arch_contains_x86_64" != "$board_contains_x86_64" ||
    "$arch_contains_xtensa" != "$board_contains_xtensa"
  ]]; then
    echo "Arch and Board are not the same, will build all targets"
    quit=1
  else
    echo "Arch + Board + Size Labels Only"
  fi
else
  echo "Not a Simple PR, will build all targets"
  quit=1
fi

# If Not a Simple PR: Build all targets
if [[ "$quit" == "1" ]]; then
  # If PR was Created or Modified: Exclude some boards
  pr=${{github.event.pull_request.number}}
  if [[ "$pr" != "" ]]; then
    echo "Excluding arm-0[1249], arm-1[124-9], risc-v-04..06, sim-03, xtensa-02"
    boards=$(
      echo '${{ inputs.boards }}' |
      jq --compact-output \
        'map(
          select(
            test("arm-0[1249]") == false and test("arm-1[124-9]") == false and
            test("risc-v-0[4-9]") == false and
            test("sim-0[3-9]") == false and
            test("xtensa-0[2-9]") == false
          )
        )'
    )
  fi
  echo "selected_builds=$boards" | tee -a $GITHUB_OUTPUT
  exit
fi
Suppose the PR says “Arch: arm” or “Board: arm”.
We filter out the builds that should be skipped (RISC-V, Xtensa, etc): arch.yml
# For every board
for (( i=0; i<numboards; i++ )); do
  # Fetch the board
  board=$( echo "$boards" | jq ".[$i]" )
  skip_build=0

  # For "Arch / Board: arm": Build arm-01, arm-02, ...
  if [[ "$arch_contains_arm" == "1" || "$board_contains_arm" == "1" ]]; then
    if [[ "$board" != *"arm"* ]]; then
      skip_build=1
    fi

  # Omitted: Arm64, RISC-V, Simulator, x86_64, Xtensa
  ...

  # For Other Arch: Allow the build
  else
    echo Build by default: $board
  fi

  # Add the board to the selected builds
  if [[ "$skip_build" == "0" ]]; then
    echo Add $board to selected_builds
    if [[ "$selected_builds" == "" ]]; then
      selected_builds=$board
    else
      selected_builds=$selected_builds,$board
    fi
  fi
done

# Return the selected builds as JSON Array
# If Selected Builds is empty: Skip all builds
echo "selected_builds=[$selected_builds]" | tee -a $GITHUB_OUTPUT
if [[ "$selected_builds" == "" ]]; then
  echo "skip_all_builds=1" | tee -a $GITHUB_OUTPUT
fi
Earlier we saw the code in the arch.yml Reusable Workflow that identifies the builds to be skipped.
The code above is called by build.yml (Build Workflow). Which will actually skip the builds: build.yml
# Select the Linux Builds based on PR Arch Label
Linux-Arch:
  uses: apache/nuttx/.github/workflows/arch.yml@master
  needs: Fetch-Source
  with:
    os: Linux
    boards: |
      [
        "arm-01", "risc-v-01", "sim-01", "xtensa-01", "arm64-01", "x86_64-01", "other",
        "arm-02", "risc-v-02", "sim-02", "xtensa-02",
        "arm-03", "risc-v-03", "sim-03",
        "arm-04", "risc-v-04",
        "arm-05", "risc-v-05",
        "arm-06", "risc-v-06",
        "arm-07", "arm-08", "arm-09", "arm-10", "arm-11", "arm-12", "arm-13", "arm-14"
      ]

# Run the selected Linux Builds
Linux:
  needs: Linux-Arch
  if: ${{ needs.Linux-Arch.outputs.skip_all_builds != '1' }}
  runs-on: ubuntu-latest
  env:
    DOCKER_BUILDKIT: 1
  strategy:
    max-parallel: 12
    matrix:
      boards: ${{ fromJSON(needs.Linux-Arch.outputs.selected_builds) }}
  steps:
    ## Omitted: Run cibuild.sh on Linux
Why “needs: Fetch-Source”? That’s because the PR Labeler runs concurrently in the background.
When we add Fetch-Source as a Job Dependency: We give the PR Labeler sufficient time to run (1 min), before we read the PR Label in arch.yml.
We do the same for Arm64, RISC-V, Simulator, x86_64 and Xtensa: arch.yml
# For "Arch / Board: arm64": Build arm64-01
elif [[ "$arch_contains_arm64" == "1" || "$board_contains_arm64" == "1" ]]; then
if [[ "$board" != *"arm64-"* ]]; then
skip_build=1
fi
# For "Arch / Board: risc-v": Build risc-v-01, risc-v-02, ...
elif [[ "$arch_contains_riscv" == "1" || "$board_contains_riscv" == "1" ]]; then
if [[ "$board" != *"risc-v-"* ]]; then
skip_build=1
fi
# For "Arch / Board: simulator": Build sim-01, sim-02
elif [[ "$arch_contains_sim" == "1" || "$board_contains_sim" == "1" ]]; then
if [[ "$board" != *"sim-"* ]]; then
skip_build=1
fi
# For "Arch / Board: x86_64": Build x86_64-01
elif [[ "$arch_contains_x86_64" == "1" || "$board_contains_x86_64" == "1" ]]; then
if [[ "$board" != *"x86_64-"* ]]; then
skip_build=1
fi
# For "Arch / Board: xtensa": Build xtensa-01, xtensa-02
elif [[ "$arch_contains_xtensa" == "1" || "$board_contains_xtensa" == "1" ]]; then
if [[ "$board" != *"xtensa-"* ]]; then
skip_build=1
fi
For Simple PRs and Complex PRs: We skip the macOS builds (macos, macos/sim-*) since these builds are costly: build.yml
(macOS builds will take 2 hours to complete due to the queueing for macOS Runners)
# Select the macOS Builds based on PR Arch Label
macOS-Arch:
  uses: apache/nuttx/.github/workflows/arch.yml@master
  needs: Fetch-Source
  with:
    os: macOS
    boards: |
      ["macos", "sim-01", "sim-02", "sim-03"]

# Run the selected macOS Builds
macOS:
  permissions:
    contents: none
  runs-on: macos-13
  needs: macOS-Arch
  if: ${{ needs.macOS-Arch.outputs.skip_all_builds != '1' }}
  strategy:
    max-parallel: 2
    matrix:
      boards: ${{ fromJSON(needs.macOS-Arch.outputs.selected_builds) }}
  steps:
    ## Omitted: Run cibuild.sh on macOS
skip_all_builds for macOS will be set to 1: arch.yml
# Select the Builds for the PR: arm-01, risc-v-01, xtensa-01, ...
- name: Select builds
  id: select-builds
  run: |
    # Skip all macOS Builds
    if [[ "${{ inputs.os }}" == "macOS" ]]; then
      echo "Skipping all macOS Builds"
      echo "skip_all_builds=1" | tee -a $GITHUB_OUTPUT
      exit
    fi
NuttX Devs shouldn’t be penalised for adding docs!
That’s why we ignore the label “Area: Documentation”. Which means that Simple PR + Docs is still a Simple PR.
And will skip the unnecessary builds: arch.yml
# Ignore the Label "Area: Documentation", because it won't affect the Build Targets
query='.labels | map(select(.name != "Area: Documentation")) | '
select_name='.[].name'
select_length='length'
# Get the Labels for the PR: "Arch: risc-v \n Board: risc-v \n Size: XS"
# If GitHub CLI Fails: Build all targets
labels=$(gh pr view $pr --repo $GITHUB_REPOSITORY --json labels --jq $query$select_name || echo "")
numlabels=$(gh pr view $pr --repo $GITHUB_REPOSITORY --json labels --jq $query$select_length || echo "")
echo "numlabels=$numlabels" | tee -a $GITHUB_OUTPUT
Remember to sync build.yml and arch.yml from NuttX Repo to NuttX Apps!
How are they connected?
build.yml points to arch.yml for the Build Rules.
When we sync build.yml from NuttX Repo to NuttX Apps, we won’t need to remove the references to arch.yml.
We could make nuttx-apps/build.yml point to nuttx/arch.yml.
But that would make the CI Fragile: Changes to nuttx/arch.yml might cause nuttx-apps/build.yml to break.
That’s why we point nuttx-apps/build.yml to nuttx-apps/arch.yml instead.
But NuttX Apps don’t need Build Rules?
arch.yml is kinda redundant in NuttX Apps. Everything is a Complex PR!
I have difficulty keeping nuttx/build.yml and nuttx-apps/build.yml in sync. That’s why I simply copied over arch.yml as-is.
In future we could extend arch.yml with App-Specific Build Rules.
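Here’s a rough sketch of the sync (assuming nuttx and nuttx-apps are cloned side by side)…

```bash
## Sketch: sync the Build Rules from NuttX Repo to NuttX Apps
## (assumes nuttx and nuttx-apps are cloned side by side)
cp nuttx/.github/workflows/arch.yml nuttx-apps/.github/workflows/arch.yml

## build.yml differs between the two repos (different Target Groups),
## so eyeball the diff instead of copying blindly
diff nuttx/.github/workflows/build.yml nuttx-apps/.github/workflows/build.yml
```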
CI Build Workflow looks very different now?
Yeah our CI Build Workflow used to be simpler: build.yml
Linux:
  needs: Fetch-Source
  strategy:
    matrix:
      boards: [arm-01, arm-02, arm-03, arm-04, arm-05, arm-06, arm-07, arm-08, arm-09, arm-10, arm-11, arm-12, arm-13, other, risc-v-01, risc-v-02, sim-01, sim-02, xtensa-01, xtensa-02]
Now with Build Rules, it becomes more complicated: build.yml
# Select the Linux Builds based on PR Arch Label
Linux-Arch:
  uses: apache/nuttx-apps/.github/workflows/arch.yml@master
  needs: Fetch-Source
  with:
    boards: |
      [
        "arm-01", "other", "risc-v-01", "sim-01", "xtensa-01", ...
      ]

# Run the selected Linux Builds
Linux:
  needs: Linux-Arch
  if: ${{ needs.Linux-Arch.outputs.skip_all_builds != '1' }}
  strategy:
    matrix:
      boards: ${{ fromJSON(needs.Linux-Arch.outputs.selected_builds) }}
One thing remains the same: We configure the Target Groups in build.yml. (Instead of arch.yml)
For our Initial Implementation of Build Rules: We recorded the CI Build Performance for Simple PRs.
Then we made the Simple PRs faster…
| Architecture | Build Time (Before) | Build Time (After) |
|---|---|---|
| Arm32 | 2 hours | 1.5 hours |
| Arm64 | 2.2 hours | 30 mins |
| RISC-V | 1.8 hours | 50 mins |
| Xtensa | 2.2 hours | 1.5 hours |
| x86_64 | 2.2 hours | 10 mins |
| Simulator | 2.2 hours | 1 hour |
How did we make the Simple PRs faster?
We broke up Big Jobs (arm-05, riscv-01, riscv-02) into Multiple Smaller Jobs.
Small Jobs will really fly! (See the Build Job Details)
(We moved the RP2040 jobs from arm-05 to arm-06, then added arm-14. Followed by jobs riscv-03 … riscv-06)
We saw a 27% Reduction in GitHub Runner Hours! From 15 Runner Hours down to 11 Runner Hours per Arm32 Build.
We split the Board Labels according to Arch, like “Board: arm”.
Thus “Board: arm” should build the exact same way as “Arch: arm”.
Same for “Board: arm, Arch: arm”. We updated the Build Rules to use the Board Labels.
We split the others job into arm64 and x86_64
Up Next: Reorg and rename the CI Build Jobs, for better performance and easier maintenance. But how?
I have a hunch that CI works better when we pack the jobs into One-Hour Time Slices
Kinda like packing yummy goodies into Bento Boxes, making sure they don’t overflow the Time Boxes :-)
We should probably shift the Riskiest / Most Failure Prone builds into the First Build Job (arm-00, risc-v-00, sim-00).
And we shall Fail Faster (in case of problems), skipping the rest of the jobs.
Recently we’ve seen many builds for Arm32 Goldfish.
Can we limit the builds to the Goldfish Boards only?
To identify Goldfish PRs, we can label the PRs like this: “Arch: arm, SubArch: goldfish” and “Board: arm, SubBoard: goldfish”
Instead of Building an Entire Arch (arm-01)…
Can we build One Single SubArch (stm32)?
How will we Filter the Build Jobs (e.g. arm-01) that should be built for a SubArch (e.g. stm32)? (Maybe like this)
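Perhaps with a hypothetical extension of the Build Rules in arch.yml, something like this. (Sketch only: the stm32-to-Target-Group mapping below is illustrative, not verified)

```bash
## Hypothetical sketch: map a SubArch Label to its Target Groups
## (the stm32 mapping is illustrative; derive it from the Complete List of CI Builds)
if [[ "$labels" == *"SubArch: stm32"* ]]; then
  subarch_groups=",arm-08,arm-10,arm-13,"
fi

## Inside the board loop: skip any Target Group outside the SubArch's groups
## (assumes $board is unquoted, like arm-08)
if [[ "$subarch_groups" != "" && "$subarch_groups" != *",$board,"* ]]; then
  skip_build=1
fi
```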
Spot the exact knotty moment that we were told about the CI Shutdown