This commit replaces the previous deployment mechanism with a blue-green strategy to lay the groundwork for zero-downtime deployments.
Key changes:
- Introduces a `deploy-blue-green.sh` script to manage "blue" and "green" container sets, creating versioned releases.
- Updates the Anubis gatekeeper template to dynamically route traffic based on the active deployment color, allowing for seamless traffic switching.
- Modifies Docker Compose files to include color-specific labels and environment variables.
- Adapts the GitHub Actions workflow to execute the new blue-green deployment process.
- Removes the old, now-obsolete deployment and health check scripts.
Note: Automated rollback on health check failure is not yet implemented. Downgrades can be performed manually by switching the active color.
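The manual switch described in the note amounts to flipping the active color. A minimal sketch of that step, assuming the two colors are the literal strings `blue` and `green`; the helper name `next_color` is hypothetical and is not taken from `deploy-blue-green.sh`:

```shell
# Hypothetical helper (not from deploy-blue-green.sh):
# given the currently active color, print the idle one.
next_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)
      # Refuse anything else so a typo never selects a nonexistent stack.
      echo "unknown color: $1" >&2
      return 1
      ;;
  esac
}
```

A deploy would bring up the `next_color` stack and repoint traffic at it; a manual rollback is the same flip applied in the other direction.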
This commit improves the gatekeeper system's robustness and monitoring capabilities, and simplifies host header management for backend services.
Key changes include:
**Gatekeeper Health, Management & Resilience:**
- Implemented active health checking for individual gatekeeper containers within the `gatekeeper-manager` service.
  - The manager now periodically curls the `/metrics` endpoint of each gatekeeper container.
  - Reports health status to a new Gatus `services_gatekeeper` endpoint.
  - Automatically attempts to restart the gatekeeper stack if any gatekeeper instance is unhealthy or if the expected number of gatekeepers is not running.
- Refactored the `gatekeeper-manager` shell script for improved state management and signal handling:
  - Introduced `STARTED`, `RESTARTING`, `TERMINATING` state flags for more controlled operations.
  - Enhanced SIGTERM and SIGHUP handling to gracefully manage gatekeeper lifecycles.
- Added `apk add curl` to ensure `curl` is available in the manager container.
- Renamed the gatekeeper Docker Compose template from `docker-compose_gatekeeper.template.yml` to `gatekeepers.template.yml` and its output to `gatekeepers.yml`.
- Updated `dockergen-gatekeeper` to watch the new template file and notify the correct `gatekeeper-manager` service instance (e.g., `pkmntrade-club-gatekeeper-manager-1`).
- Services that should be protected are now discovered by looking for a `gatekeeper=true` label.
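The health rule described above (every gatekeeper must answer on `/metrics`, and the expected number must be running) can be sketched roughly as follows. This is an illustration under assumptions, not the actual `gatekeeper-manager` code: the function names, the `PROBE` override, the 5-second timeout, and passing each gatekeeper as a `host:port` argument are all hypothetical.

```shell
# Hypothetical sketch; names and timeout are illustrative only.
# PROBE can be overridden (e.g. in tests) with any command that
# takes a host:port argument and returns 0 when it is healthy.
PROBE=${PROBE:-probe_metrics}

probe_metrics() {
  # -f: treat HTTP errors as failures; --max-time bounds a hung container.
  curl -fsS --max-time 5 "http://$1/metrics" >/dev/null 2>&1
}

# Usage: stack_healthy EXPECTED_COUNT host:port...
# Returns 0 only if the count matches and every probe succeeds.
stack_healthy() {
  expected=$1
  shift
  [ "$#" -eq "$expected" ] || return 1
  for host in "$@"; do
    "$PROBE" "$host" || return 1
  done
  return 0
}
```

A non-zero return from `stack_healthy` is the point at which a manager loop would report the failure to Gatus and attempt a restart of the stack.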
**Host Header Management & `ALLOWED_HOSTS` Simplification:**
- HAProxy configuration (`haproxy.cfg`) now consistently sets the `Host` HTTP header (e.g., `pkmntrade.club`, `staging.pkmntrade.club`) on requests to all backend services. This centralizes and standardizes host information.
- Consequently, explicit `ALLOWED_HOSTS` environment variables have been removed from the `web` and `celery` service definitions in `docker-compose_web.yml` and `docker-compose_staging.yml`. Backend Django applications should now rely on the `Host` header set by HAProxy for request validation.
- `gatekeepers.template.yml` now defines a `TARGET_HOST` environment variable for proxied services (e.g., `web`, `web-staging`). Its value matches the `ALLOWED_HOSTS` of the target service, so proxied requests aren't rejected.
**Gatus Monitoring & Configuration Updates:**
- In Gatus configuration (`gatus/config.template.yaml`):
  - The "Redis" external service endpoint has been renamed to "Cache" for better clarity and to fit the theme of simple names.
  - A new external service endpoint "Gatekeeper" has been added to monitor the overall health reported by the `gatekeeper-manager`.
  - Health checks for "Web Worker" endpoints (both main and staging) now include the appropriate `Host` header (e.g., `Host: pkmntrade.club`) to ensure accurate health assessments by Django.
- In `docker-compose_core.yml`, the `curl` commands used by `db-redis-healthcheck` for database and cache health now append `|| true`. This prevents the script from exiting on a curl error (e.g., timeout, connection refused), ensuring that the failure is still reported to Gatus via the `success=false` parameter rather than the script terminating prematurely.
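The effect of `|| true` can be shown in isolation. The sketch below assumes the script runs under `set -e` and decides success from whether the probe produced any output; both are assumptions made for illustration, and `run_check` is a hypothetical name, not the actual `db-redis-healthcheck` code.

```shell
# Hypothetical sketch; run_check is not part of docker-compose_core.yml.
# Usage: run_check CMD [ARGS...]
# Runs in a subshell so `set -e` does not leak into the caller.
run_check() (
  set -e
  # Without `|| true`, a failing probe would abort the subshell here
  # under `set -e`, and no result would be reported at all.
  output=$("$@" 2>/dev/null || true)
  if [ -n "$output" ]; then
    echo "success=true"
  else
    echo "success=false"
  fi
)
```

For example, `run_check false` prints `success=false` instead of terminating, which is the behavior that lets the failure reach Gatus.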
These changes collectively make the gatekeeper system more fault-tolerant, provide better visibility into its status, and streamline the configuration of backend applications by standardizing how they receive host information.
- **Implemented Dynamic Gatekeeper (Anubis) Proxy:**
  - Introduced Anubis as a Gatekeeper proxy layer for services (`web`, `web-staging`, `feedback`, `health`).
  - Added `docker-gen` setup (`docker-compose_gatekeeper.template.yml`, `gatekeeper-manager`) to dynamically configure Anubis instances based on container labels (`enable_gatekeeper=true`).
  - Updated HAProxy to route traffic through the respective Gatekeeper services.
- **Enhanced Service Health Monitoring & Checks:**
  - Integrated `django-health-check` into the Django application, providing detailed health endpoints (e.g., `/health/`).
  - Replaced the custom health check view with `django-health-check` URLs.
  - Added `psutil` for system metrics in health checks.
  - Made Gatus configuration dynamic using `docker-gen` (`config.template.yaml`), allowing automatic discovery and monitoring of service instances (e.g., web workers).
  - Externalized Gatus SMTP credentials to environment variables.
  - Strengthened `docker-compose_core.yml` with a combined `db-redis-healthcheck` service reporting to Gatus.
  - Added explicit health checks for `db` and `redis` services in `docker-compose.yml`.
- **Improved Docker & Compose Configuration:**
  - Added `depends_on` conditions in `docker-compose.yml` for `web` and `celery` services to wait for the database.
  - Updated `ALLOWED_HOSTS` in `docker-compose_staging.yml` and `docker-compose_web.yml` to include internal container names for Gatekeeper communication.
  - Set `DEBUG=False` for staging services.
  - Removed `.env.production` from `.gitignore` (standardized to `.env`).
  - Streamlined `scripts/entrypoint.sh` by removing the call to the no-longer-present `/deploy.sh`.
- **Dependency Updates:**
  - Added `django-health-check>=3.18.3` and `psutil>=7.0.0` to `pyproject.toml` and `uv.lock`.
  - Updated `settings.py` to include `health_check` apps, configuration, and use `REDIS_URL` consistently.
- **Streamlined Deployment Script Used in GHA:**
  - Updated the workflow to copy new server files and create a new `.env` file in the temporary directory before moving them into place.
  - Consolidated the stopping and removal of old containers into a single step for better clarity and efficiency.
  - Reduced container downtime by rearranging the stop/start steps.