Overview
- Buildkite orchestrates each build.
- Multiple Linux and Windows agents are connected to Buildkite; the agents run on Google Cloud Platform.
- A small proxy service takes build requests from reviews.llvm.org and converts them into Buildkite build requests. Buildkite jobs send build results directly to Phabricator.
- Every review creates a new branch in a fork of llvm-project.
Buildkite agents
Agents are deployed in two clusters: `llvm-premerge-checks` and `windows-cluster`.
The latter is for Windows machines (whether machines can run Windows containers
is controlled at the cluster level).
Container configurations are in `./containers` and deployment configurations are in `./kubernetes`. The most important ones are:
- Windows agents: container `containers/buildkite-windows`, deployment `kubernetes/buildkite/windows.yaml`. TODO: at the moment the Docker image is created and uploaded from a Windows machine (e.g. win-dev); it would be great to set up a Cloud Build.
- Linux agents: run tests for the Linux x86 config; container `containers/buildkite-linux`, deployment `kubernetes/buildkite/linux.yaml`.
- Service agents: run low-CPU build steps (e.g. generating pipeline steps); container `containers/buildkite-linux`, deployment `kubernetes/buildkite/service.yaml`.
Every deployment has a `..-test` copy that serves as a test playground for checking
container / deployment changes before modifying the "prod" setup.
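For example, rolling out a container change might look like the following sketch; the image name/tag and the test manifest path are assumptions for illustration, not the exact names used in this repo.

```bash
# Build and push an updated agent image (image name and tag are assumed for illustration).
docker build -t gcr.io/llvm-premerge-checks/buildkite-linux:test containers/buildkite-linux
docker push gcr.io/llvm-premerge-checks/buildkite-linux:test

# Apply the change to the "-test" deployment first, then to "prod" once it looks healthy.
kubectl apply -f kubernetes/buildkite/linux-test.yaml   # hypothetical test manifest
kubectl apply -f kubernetes/buildkite/linux.yaml
```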
Build steps
Buildkite allows pipelines to be defined dynamically as the output of a command, and most of our pipelines use this pattern: run a script and upload the resulting YAML. For example, the script that runs pull-request checks is `.ci/generate-buildkite-pipeline-premerge` in llvm-project. Any change to the steps that should run therefore belongs in that script.
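A minimal sketch of that pattern is shown below; the step contents are made up for illustration, the real steps come from the script above.

```bash
#!/usr/bin/env bash
# Print a Buildkite pipeline on stdout. The build's bootstrap step pipes this
# output into `buildkite-agent pipeline upload`, which appends the steps to
# the running build.
cat <<EOF
steps:
  - label: "linux"
    agents: {queue: "linux"}
    commands:
      - echo "configure, build and test the affected projects here"
EOF
```

The bootstrap step in Buildkite then runs something like `./generate-pipeline.sh | buildkite-agent pipeline upload` (the script name here is illustrative).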
We have a legacy set of scripts in `/scripts` in this repo, but we discourage new
use and development of them; they are mostly kept to keep the Phabricator
integration functioning.
Phabricator integration
Note: this section describes the Phabricator integration, which is now discouraged; some things might already be renamed or outright broken as we move to Pull Requests.
On the Phabricator side these things were configured:

- a Harbormaster build plan;
- Herald rules for everyone and for beta testers (note that right now there is no difference between beta and "normal" builds);
- the merge_guards_bot user account for writing comments.
Life of a pre-merge check
When a new diff arrives for review, it triggers a Herald rule ("everyone" or "beta testers").
That rule in turn sends an HTTP POST request to phab-proxy,
which submits a new Buildkite build of the diff-checks pipeline. All parameters from the
original request are put into the build's environment with a `ph_` prefix (to avoid
shadowing any Buildkite environment variable). The `ph_scripts_refspec` parameter
defines which refspec of llvm-premerge-checks to use ("main" by default).
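For illustration, such a build submission against the Buildkite REST API roughly looks like the sketch below; the pipeline slug, the token, and the exact set of `ph_` variables the proxy forwards are assumptions.

```bash
curl -X POST "https://api.buildkite.com/v2/organizations/llvm-project/pipelines/diff-checks/builds" \
  -H "Authorization: Bearer $BUILDKITE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "commit": "HEAD",
        "branch": "main",
        "message": "D12345 diff 288211",
        "env": {
          "ph_buildable_diff": "288211",
          "ph_scripts_refspec": "main"
        }
      }'
```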
The diff-checks pipeline (create_branch_pipeline.py) downloads a patch (or a series of patches) and applies it to a fork of the llvm-project repository. It then pushes the result as a new branch (e.g. "phab-diff-288211") and triggers "premerge-checks" on it (all "ph_" env variables are passed along). This branch can then be used to reproduce the build or by other tooling. The periodic cleanup-branches pipeline deletes branches older than 30 days.
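For example, to reproduce a build locally you can fetch such a branch from the fork; the fork URL and diff ID below are illustrative.

```bash
# Fetch the branch that diff-checks pushed and check it out locally.
git remote add premerge https://github.com/llvm-premerge-tests/llvm-project.git
git fetch premerge phab-diff-288211
git checkout -b phab-diff-288211 FETCH_HEAD
```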
The premerge-checks pipeline (build_branch_pipeline.py) builds and tests the changes on Linux and Windows agents, then uploads a combined result to Phabricator.
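The upload goes through Phabricator's Conduit API; below is a hedged sketch of reporting a passing build target (the token and PHID are placeholders, and the real payload carries more detail, e.g. unit test results).

```bash
# Report a "pass" result for a Harbormaster build target via Conduit.
curl https://reviews.llvm.org/api/harbormaster.sendmessage \
  -d api.token="$CONDUIT_TOKEN" \
  -d buildTargetPHID="PHID-HMBT-xxxxxxxxxxxx" \
  -d type="pass"
```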
Cluster parts
Ingress and public addresses
We use the NGINX ingress for Kubernetes. Right now it is only used to provide basic HTTP authentication and to forward all requests from the load balancer to the phabricator proxy application.
Follow the up-to-date docs to install the reverse proxy.
cert-manager is installed with Helm (https://cert-manager.io/docs/installation/helm/):

```bash
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.9.1 \
  --set installCRDs=true
```
We also have cert-manager and Let's Encrypt configuration in place, but they are not used at the moment and should be removed if we decide to live with a static IP.
HTTP auth is configured with the k8s secret 'http-auth' in the 'buildkite' namespace (see how to update auth).
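A hedged sketch of (re)creating that secret, assuming the standard basic-auth `auth` file format expected by the NGINX ingress controller:

```bash
# Generate an htpasswd file named 'auth' (user name is a placeholder; prompts for a password).
htpasswd -c auth some-user

# Replace the 'http-auth' secret in the 'buildkite' namespace with the new file.
kubectl delete secret http-auth -n buildkite --ignore-not-found
kubectl create secret generic http-auth -n buildkite --from-file=auth
```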
Enabled projects and project detection
To reduce build times and mask unrelated problems, we only build and test the projects that were modified by a patch. choose_projects.py uses a manually maintained config file to define inter-project dependencies and to exclude projects (see the sketch after this list):
- Get the prefix (e.g. "llvm", "clang") of all paths modified by a patch.
- Add all dependent projects.
- Add all projects that this extended list depends on, completing the dependency subtree.
- Remove all disabled projects.
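As a rough illustration of the first step, the set of top-level project prefixes touched by a patch can be computed as below; the dependency expansion and exclusion logic lives in choose_projects.py.

```bash
# List the top-level directories (project prefixes) modified relative to main.
git diff --name-only origin/main...HEAD | cut -d/ -f1 | sort -u
```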
Agent machines
All build machines run from Docker containers so that they can be debugged, updated, and scaled easily:

- Linux: we use a Kubernetes deployment to manage these agents.
- Windows: at the moment they run as multiple individual VM instances.

See the playbooks for how to manage and set up machines.
Compilation caching
Each build is performed on a clean copy of the git repository. To speed up the builds, ccache is used on Linux and sccache on Windows.
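A hedged sketch of how a compiler cache is typically wired into the CMake configure step; the flags below are illustrative, not the pipeline's exact invocation.

```bash
# Linux: route compiler invocations through ccache.
cmake -G Ninja ../llvm \
  -DCMAKE_C_COMPILER_LAUNCHER=ccache \
  -DCMAKE_CXX_COMPILER_LAUNCHER=ccache

# Windows: the same idea, with sccache as the launcher:
#   -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_CXX_COMPILER_LAUNCHER=sccache
```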
Buildkite monitoring
FIXME: does not work as of 2023-09-11. These metrics could allow us to set up auto-scaling of machines to match the current demand.
The VM instance `buildkite-monitoring` exposes Buildkite metrics to GCP.
To set up a new instance:
- Create a small Linux VM with full access to the Stackdriver Monitoring API.
- Follow the instructions to install the monitoring agent and enable the statsd plugin.
- Download a recent release of buildkite-agent-metrics.
- Run in an SSH session:

```bash
chmod +x buildkite-agent-metrics-linux-amd64
nohup ./buildkite-agent-metrics-linux-amd64 -token XXXX -interval 30s -backend statsd &
```
Metrics are exported as "custom/statsd/gauge".
TODO: update the "Testing scripts locally" playbook on how to run a Linux build locally with Docker.
TODO: migrate 'buildkite-monitoring' to a k8s deployment.