.. _debusine-concepts:

=================
Debusine concepts
=================

.. _explanation-artifacts:

Artifacts
=========

Artifacts are at the heart of Debusine. Artifacts are both inputs
(submitted by users) and outputs (generated by tasks). An artifact
combines:

* an arbitrary set of files
* arbitrary key-value data (stored as a JSON-encoded dictionary)
* a category

The category is just a string identifier used to recognize artifacts
sharing the same structure. You can create and use categories as you see
fit, but we have defined a basic :ref:`ontology <artifacts>` suited to the
case of a Debian-based distribution.
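
As an illustration, an artifact can be pictured as the following structure
(a hypothetical sketch; the field names are illustrative and do not match
the exact API serialization):

.. code-block:: python

   # Hypothetical sketch of an artifact's conceptual structure.
   artifact = {
       "category": "debian:source-package",
       "data": {  # arbitrary key-value data, stored as JSON
           "name": "hello",
           "version": "2.10-3",
       },
       "files": {  # arbitrary set of files, keyed by name
           "hello_2.10-3.dsc": {"sha256": "..."},
           "hello_2.10-3.debian.tar.xz": {"sha256": "..."},
       },
   }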

Artifacts can have relations with other artifacts:

* *built-using*: indicates that the build of the artifact used the target
  artifact (e.g. "binary-packages" artifacts are built using
  "source-package" artifacts)
* *extends*: indicates that the artifact extends the target artifact
  in some way (e.g. a "source-upload" artifact extends a "source-package"
  artifact with target distribution information)
* *relates-to*: indicates that the artifact relates to another one in
  some way (e.g. a "binary-upload" artifact relates to a "binary-package",
  or a "package-build-log" artifact relates to a "binary-package")

Artifacts are not deleted:

* as long as they are referenced by another artifact (through one of the
  above relationships)
* as long as their expiration date has not passed
* as long as they have not been manually deleted (when they have no
  expiration date)
* as long as they are referenced by items of a collection

Artifacts can have additional properties:

* immutable: when set to True, nothing can be changed in the artifact
  through the API
* creation timestamp: indicates when the artifact was created
* last updated timestamp: indicates when the artifact was last
  modified/updated

The following operations are possible on artifacts:

* create a new artifact
* upload the content of one of its files
* set key-value data
* attach/remove a file
* add/remove a relationship
* delete an artifact

Files in artifacts are content-addressed (stored by hash) in the
database, so a single file can be referenced in multiple places without
unnecessary data duplication.
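
The idea can be sketched as follows (a minimal illustration of
content-addressed storage, not debusine's actual storage code):

.. code-block:: python

   import hashlib
   from pathlib import Path

   def store_file(store_root: Path, source: Path) -> str:
       """Store source under its content hash and return that hash."""
       content = source.read_bytes()
       digest = hashlib.sha256(content).hexdigest()
       target = store_root / digest[:2] / digest
       if not target.exists():  # identical content is stored only once
           target.parent.mkdir(parents=True, exist_ok=True)
           target.write_bytes(content)
       return digest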

.. _explanation-collections:

Collections
===========

A Collection is a set of artifacts or other collections that are intended to
be used together. The following are some example use cases:

* A suite in the Debian archive (e.g. "Debian bookworm")
* A Debian archive (a.k.a. repository) containing multiple suites
* For a source package name, the latest version in each suite in Debian
  (compare ``https://tracker.debian.org/pkg/foo``)
* Results of a QA scan across all packages in unstable and experimental
* Buildd-suitable ``debian:system-tarball`` artifacts for all Debian suites
* Extracted ``.desktop`` files for each package name in a suite

.. todo::

   Another possible idea is to use collections for the output of each task,
   either automatically or via a parameter to the task.

Collections have the following properties:

* ``category``: a string identifier indicating the structure of additional
  data; see the :ref:`ontology <collections>`
* ``name``: the name of the collection
* ``workspace``: defines access control and file storage for this collection; at
  present, all artifacts in the collection must be in the same workspace
* ``full_history_retention_period``, ``metadata_only_retention_period``:
  optional time intervals to configure the retention of items in the
  collection after removal; see :ref:`explanation-collection-item-retention`
  for details

Collections are unique by category and name.  They may be looked up by
category and name, providing starting points for further lookups within
collections.

Each item in a collection is a combination of some metadata and an optional
reference to an artifact or another collection. The permitted categories for
the artifact or collection are limited depending on the category of the
containing collection. The metadata is as follows:

* ``category``: the category of the referenced artifact or collection,
  copied for ease of lookup and to preserve history
* ``name``: a name identifying the item, which will normally be derived
  automatically from some of its properties; only one item with a given
  name and an unset removal timestamp (i.e. an active item) may exist in any
  given collection
* key-value data indicating additional properties of the item in the
  collection, stored as a JSON-encoded dictionary with a structure
  :ref:`depending on the category of the collection <collections>`; this
  data can:

  * provide additional data related to the item itself
  * provide additional data related to the associated artifact in the
    context of the collection (e.g. overrides for packages in suites)
  * override some artifact metadata in the context of the collection (e.g.
    vendor/codename of system tarballs)
  * duplicate some artifact metadata, to make querying easier and to
    preserve it as history even after the associated artifact has been
    expired (e.g. architecture of system tarballs)

* audit log fields for changes in the item's state:

  * timestamp (``created_at``), user (``created_by_user``),
    and workflow (``created_by_workflow``) for when it was created
  * timestamp (``removed_at``), user (``removed_by_user``),
    and workflow (``removed_by_workflow``) for when it was removed

This metadata may be retained even after a linked artifact has been expired
(see :ref:`explanation-collection-item-retention`). This means that it is
sometimes useful to design collection items to copy some basic information,
such as package names and versions, from their linked artifacts for use when
inspecting history.
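
For example, an active item for a source package in a suite might
conceptually look like this (a hypothetical sketch; the exact per-category
data structure is defined in the :ref:`ontology <collections>`):

.. code-block:: python

   # Hypothetical collection item for a source package in a suite.
   item = {
       "category": "debian:source-package",  # copied from the artifact
       "name": "hello_2.10-3",  # normally derived automatically
       "data": {
           "package": "hello",  # duplicated for querying and history
           "version": "2.10-3",
           "component": "main",  # per-suite override data
           "section": "devel",
       },
       "created_at": "2024-01-01T12:00:00Z",
       "removed_at": None,  # unset: this is an active item
   }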

The same artifact or collection may be present more than once in the same
containing collection, with different properties. For example, this is
useful when debusine needs to use the same artifact in more than one similar
situation, such as a single system tarball that should be used for builds
for more than one suite.

A collection may impose additional constraints on the items it contains,
depending on its category. Some constraints may apply only to active items,
while some may apply to all items. If a collection contains another
collection, all relevant constraints are applied recursively.

Collections can be compared: for example, a collection of outputs of QA
tasks can be compared with the collection of inputs to those tasks, making
it easy to see which new tasks need to be scheduled to stay up to date.

.. _explanation-collection-item-retention:

Retention of collection items
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Collection items and the artifacts they refer to may be retained in
debusine's database for some time after the item is removed from the
collection, depending on the values of ``full_history_retention_period`` and
``metadata_only_retention_period``.  The sequence of events is as follows:

* item is removed from collection: metadata and artifact are both still
  present
* after ``full_history_retention_period``, the link between the collection
  item and the artifact is removed: metadata is still present, but the
  artifact may be expired if nothing else prevents that from happening
* after ``full_history_retention_period`` +
  ``metadata_only_retention_period``, the collection item itself is deleted
  from the database: metadata is no longer present, so the history of the
  collection no longer records that the item in question was ever in the
  collection

If ``full_history_retention_period`` is not set, then artifacts in the
collection and the files they contain are never expired.  If
``metadata_only_retention_period`` is not set, then metadata-level history
of items in the collection is never expired.
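
As a worked example (a sketch with assumed retention values, not debusine
code), the two thresholds are measured from the removal time:

.. code-block:: python

   from datetime import datetime, timedelta, timezone

   full_history_retention_period = timedelta(days=30)    # assumed value
   metadata_only_retention_period = timedelta(days=335)  # assumed value

   removed_at = datetime(2024, 1, 1, tzinfo=timezone.utc)

   # After this, the item no longer keeps the artifact alive:
   artifact_unlinked_at = removed_at + full_history_retention_period
   # After this, even the metadata-level history is deleted:
   item_deleted_at = artifact_unlinked_at + metadata_only_retention_period

   print(artifact_unlinked_at.date())  # 2024-01-31
   print(item_deleted_at.date())       # 2024-12-31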

.. _explanation-collection-updates:

Updating collections
~~~~~~~~~~~~~~~~~~~~

The purpose of some tasks is to update a collection.  Those tasks must
ensure that anything else looking at the collection always sees a consistent
state, satisfying whatever invariants are defined for that collection.  In
most cases it is sufficient to ensure that the task does all its updates
within a single database transaction.  This may be impractical for some
long-running tasks, and they might need to break up the updates into chunks
instead; in such cases they must still be careful that the state of the
collection at each transaction boundary is consistent.
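
In Django terms, that usually means wrapping the whole update in
``transaction.atomic`` (a minimal sketch; the model methods used here are
illustrative, not debusine's actual API):

.. code-block:: python

   from django.db import transaction

   def update_suite(suite, additions, removals):
       """Apply a batch of changes so readers never see a partial state."""
       with transaction.atomic():
           for item in removals:
               item.mark_removed()  # illustrative helper
           for artifact in additions:
               suite.add_artifact(artifact)  # illustrative helper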

To support automated QA at the scale of a distribution, some collections are
derived automatically from other collections, and there are special
arrangements for keeping those collections up to date.  See
:ref:`collection-derived`.

.. _explanation-workspaces:

Workspaces
==========

A Workspace is a concept tying together a set of Artifacts and
a set of Users. Since Artifacts have to be stored somewhere, Workspaces
also tie together the set of FileStores where files can be stored.

Workspaces have the following important properties:

* public: a boolean indicating whether the Artifacts are publicly
  accessible or restricted to the users belonging to the workspace
* default_expiration_delay: the minimum time (in days) that a new
  artifact is kept in the workspace before being expired. This value
  can be overridden per artifact afterwards. If it is 0, then
  Artifacts are never expired until they are manually removed (see the
  sketch below).
* default_file_store: the default FileStore where newly uploaded files
  are stored.
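
The resulting expiry rule can be sketched as follows (illustrative Python
with assumed field names, not debusine's actual code):

.. code-block:: python

   from datetime import datetime, timedelta
   from typing import Optional

   def expires_at(
       created_at: datetime,
       default_expiration_delay: int,  # workspace default, in days
       artifact_override: Optional[int] = None,  # per-artifact override
   ) -> Optional[datetime]:
       """Return the expiry time, or None if only manual removal applies."""
       delay = (
           artifact_override
           if artifact_override is not None
           else default_expiration_delay
       )
       if delay == 0:  # 0 means: never expire automatically
           return None
       return created_at + timedelta(days=delay)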

.. _explanation-workers:

Workers
=======

Workers are services that run :ref:`tasks <explanation-tasks>` on behalf of
a Debusine server.  There are two types of worker.

External workers
~~~~~~~~~~~~~~~~

Most workers are external workers, running an instance of
``debusine-worker``.  This is a daemon that runs untrusted tasks using some
form of containerization or virtualization.  It has no direct access to the
Debusine database; instead, it interacts with the server using the HTTP API
and WebSockets.

External workers process one task at a time, and only process ``Worker``
tasks.

Celery workers
~~~~~~~~~~~~~~

A Debusine instance normally has an associated Celery worker, which is used
to run tasks that require direct access to the Debusine database.  These
tasks are necessarily trusted, so they must not involve running
user-controlled code.

Celery workers have a concurrency level, normally set to the number of
logical CPUs in the system (:py:func:`os.cpu_count`).

.. todo::

   Document (and possibly fix) what happens when workers are restarted while
   running a task.

.. _explanation-tasks:

Tasks
=====

Tasks are time-consuming operations that are typically offloaded to
dedicated workers. They consume artifacts as input and generate artifacts
as output. The generated artifacts automatically have *built-using*
relationships linking to the artifacts used as input.

Tasks can require specific features from the workers on which they will
run. This is used to ensure things like:

* architecture selection (when managing builders on different
  architectures)
* required amount of memory
* required amount of free disk space
* availability of a specific build chroot

There are four types of tasks:

* ``Worker`` tasks are the type of tasks most people will use, running on
  external workers.  They may execute untrusted code, such as building a
  source package uploaded by a user.
* ``Server`` tasks perform operations that require direct database access
  and that may take some time to run.  They run on Celery workers, and must
  not execute any user-controlled code.
* ``Internal`` tasks are used to coordinate details of the scheduler's
  behaviour.  They are normally hidden from view.
* ``Workflow`` tasks represent a collection of other tasks; see
  :ref:`explanation-workflows`.

Tasks that run on ``debusine-worker`` instances are required to use the
public API to interact with artifacts. They are passed a dedicated token
that has the proper permissions to retrieve the required artifacts and to
upload the generated artifacts.
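
As an illustration, the ``task_data`` for a typical ``Worker`` task might
look like this (hypothetical field names and values; the exact schema of
each task is documented in the task reference):

.. code-block:: python

   # Hypothetical task_data for a package build (sbuild) Worker task.
   task_data = {
       "input": {"source_artifact": 123},  # ID of the source artifact
       "host_architecture": "amd64",
       "environment": 456,  # artifact providing the build environment
       "backend": "unshare",  # executor backend to use
   }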

Executor Backends
~~~~~~~~~~~~~~~~~

Debusine supports multiple virtualisation backends for executing
``Worker`` tasks, from lightweight containers (e.g. ``unshare``) to VMs
(e.g. ``incus-vm``).

When a task is executed in an executor backend, one of its inputs is an
environment: an artifact containing the system image in which the task
runs. These image artifacts are downloaded by the worker and cached
locally. For some backends (e.g. Incus) they are converted and/or
imported into an image store.

The worker maintains an LRU cache of up to 10 images. When images are
cleaned up, they are also removed from any relevant image stores.
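
A minimal sketch of that eviction policy (illustrative, not the worker's
actual code):

.. code-block:: python

   from collections import OrderedDict

   class ImageCache:
       """Keep the most recently used images; evict the rest."""

       def __init__(self, max_images: int = 10):
           self._images: OrderedDict[int, str] = OrderedDict()
           self._max_images = max_images

       def use(self, artifact_id: int, path: str) -> None:
           self._images[artifact_id] = path
           self._images.move_to_end(artifact_id)  # most recently used
           while len(self._images) > self._max_images:
               old_id, old_path = self._images.popitem(last=False)
               self._remove(old_id, old_path)

       def _remove(self, artifact_id: int, path: str) -> None:
           # Here the worker would delete the cached file and remove the
           # image from any backend image store (e.g. Incus).
           ...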

.. _explanation-work-requests:

Work Requests
=============

Work Requests are the way Debusine schedules tasks to workers and monitors
their progress and success.

Work Requests have the following important properties:

* task_type: the type of the task (``Worker``, ``Server``, ``Internal``, or
  ``Workflow``; see :ref:`explanation-tasks`)
* task_name: the name of the task to execute (used to figure out the
  Python class implementing the logic)
* task_data: a JSON dict representing the input parameters for the task
* status: the processing status of the work request. Allowed values are:

  * blocked: the task is not ready to be executed
  * pending: the task is ready to be executed and can be picked up by a
    worker
  * running: the task is currently being executed by a worker
  * aborted: the task has been cancelled/aborted
  * completed: the task has been completed

* result: the processing result. Allowed values are:

  * success: the task completed and succeeded
  * failure: the task completed and failed
  * error: an unexpected error happened during execution

* workspace: foreign key to the workspace where the task is executed
* worker: a foreign key to the assigned worker (NULL while the
  work request is pending or blocked)
* unblock_strategy: a field specifying how the work request can move from
  ``blocked`` to ``pending`` status. Supported values are:

  * ``deps``: the work request can be unblocked once all the work
    requests it depends on have completed
  * ``manual``: the work request must be manually unblocked

* dependencies: ManyToMany relation with other ``WorkRequest`` objects
  that need to complete before this one can be unblocked (when using the
  ``deps`` unblock_strategy)
* parent: foreign key to the containing WorkRequest (or NULL when scheduled
  outside of a workflow). The parent hierarchy eventually reaches a node of
  task type ``Workflow``, which manages this ``WorkRequest`` hierarchy. See
  :ref:`Workflows <explanation-workflows>`.
* workflow_data: JSON dict controlling some workflow-specific behaviour
* event_reactions: JSON dict describing actions to perform in response to
  specific events.
* internal_collection (only for workflow work requests): reference to a
  ``debusine:workflow-internal`` collection (see
  :ref:`collection-workflow-internal`) that holds artifacts produced during
  this workflow
* expiration_delay: retention time (in days) for this work request in the
  database
* supersedes: optional reference to the work request that this one
  supersedes; used to track previous attempts when retrying tasks

Blocked work requests using the ``deps`` unblock strategy may have
dependencies on other work requests. Those dependencies are only used to
control the order of execution of work requests inside workflows: the
scheduler ignores ``blocked`` work requests and only considers ``pending``
ones. The ``deps`` unblock strategy changes the status of a work request
to ``pending`` once all the work requests it depends on have completed.
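
A sketch of that strategy (illustrative logic, not the scheduler's actual
code):

.. code-block:: python

   def maybe_unblock(work_request) -> None:
       """Move a blocked work request to pending once its deps finish."""
       if work_request.status != "blocked":
           return
       if work_request.unblock_strategy != "deps":
           return  # e.g. "manual": someone must unblock it explicitly
       if all(dep.status == "completed"
              for dep in work_request.dependencies):
           work_request.status = "pending"  # now visible to the scheduler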

.. _explanation-workflows:

Workflows
=========

Workflows are advanced server-side logic that can schedule and combine
server tasks and worker tasks: outputs of some work requests can become
the input of other work requests, and the flow of execution can be
influenced by the results of already executed work requests.

Workflows are powerful operations, in particular due to their ability
to run server tasks. Until finer-grained access control is implemented,
users can only start the subset of workflows that have been made available
by the workspace administrator (by creating *workflow templates*). This
process:

* grants a unique name to the workflow so that it can be easily identified
  and started by users
* defines all the input parameters that cannot be overridden when a user
  starts the workflow

Those workflow templates can then be turned into actual running workflows
by users or by external events, through the web interface or through the
API.

The input parameters that are not set in the workflow template are called
run-time parameters, and they have to be provided by the user who starts
the workflow. Those parameters are stored in a WorkRequest with task_type
``Workflow``, which is used as the root of a WorkRequest hierarchy covering
the whole duration of the process controlled by the workflow.
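
Conceptually, starting a workflow merges the template's fixed parameters
with the user's run-time parameters (a hypothetical sketch; the actual
merging happens server-side):

.. code-block:: python

   # Parameters fixed by the workflow template cannot be overridden by
   # the run-time parameters supplied when the workflow is started.
   template_data = {"target_distribution": "debian:bookworm"}  # illustrative
   runtime_parameters = {"source_artifact": 123}  # illustrative

   task_data = {**runtime_parameters, **template_data}  # template wins
   root_work_request = {
       "task_type": "Workflow",
       "task_name": "sbuild",  # illustrative workflow name
       "task_data": task_data,
   }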

Once they have completed, the remaining lifetime of workflow instances is
controlled by their expiration date and the expiration of some associated
artifacts.

To begin with, available workflows are limited to those that are fully
implemented in Debusine. In the future, we expect to add a more flexible
approach where administrators can submit fully customized logic combining
various building blocks.

Here are some examples of possible workflows:

 * Package build: it would take a source package and a target distribution
   as input parameters, and the workflow would automate the following
   steps:
   { sbuild on all architectures supported in the target distribution }
   → add source and binary packages to target distribution.

   See :ref:`sbuild workflow <workflow-sbuild>`.

 * Package review: it would take a source package and associated binary
   packages and a target distribution, and the workflow would control
   the following steps:
   { generating debdiff between source packages, lintian, autopkgtest,
   autopkgtests of reverse-dependencies } → manual validation by reviewer
   → add source and binary packages to target distribution.

 * Both build and review could be combined in a larger workflow.

   In that case, the reverse-dependencies whose autopkgtests should be run
   cannot be identified until the sbuild task has completed, so the
   workflow would be expanded/reconfigured after that step completes.

 * Update a collection of lintian analyses of the latest packages in a
   given distribution, based on changes to the collection representing
   that distribution.

   Here again the set of lintian analyses to run depends on a :ref:`first
   step of comparison between the two collections <collection-derived>`.

See :ref:`Workflows <workflows>` for a list of available workflows.
