Miscweb

miscweb is a new service on kubernetes.

Since 2022-01-20 it has served production traffic for static-bugzilla. Other miscweb releases in the same namespace host different services, such as annual.wikimedia.org and transparency.wikimedia.org.

It was requested in task T281538 to replace the legacy service "miscweb" running on Ganeti VMs in production.

Also see: miscweb1002, miscweb2002 for the legacy machines still serving other microsites.


Sites running on miscweb k8s

static-bugzilla was the first site hosted on miscweb-k8s: static-bugzilla.wikimedia.org has been served from k8s since 2022-01-20. Other microsites run on miscweb as well. All services in miscweb:

Where does the code live?

Most code for the HTML files and Blubber container images is hosted under https://gitlab.wikimedia.org/repos/sre/miscweb. Each service has its own project there. Each project contains four important files and folders:

  • The .pipeline folder contains the Blubber configuration for the image build.
  • The html folder contains the static HTML served by the container image. Note: this folder is called html-compressed for big sites, which use Git LFS and compressed storage of HTML files.
  • The production folder contains the Apache configuration for the apache2 process in the container.
  • .gitlab-ci.yml contains the CI configuration for building the image and publishing it to our image registry.
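
As a quick sanity check, the layout described above can be verified in a local checkout. This is only a sketch: the function name check_miscweb_layout is made up for illustration, and it checks nothing beyond the four entries listed above.

```shell
# Hypothetical helper: report which of the expected entries are missing
# from a miscweb project checkout. Returns 0 if the layout is complete.
check_miscweb_layout() {
  repo=$1
  missing=0
  for path in .pipeline production .gitlab-ci.yml; do
    if [ ! -e "$repo/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  # Big sites use html-compressed instead of html, so accept either one.
  if [ ! -d "$repo/html" ] && [ ! -d "$repo/html-compressed" ]; then
    echo "missing: html or html-compressed"
    missing=1
  fi
  return "$missing"
}
```

Usage: check_miscweb_layout /path/to/your/checkout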

The helm charts for kubernetes can be found in operations/deployment-charts.

How to deploy changes

Most service changes consist of an update to the container image and an update to the Kubernetes helm deployment.

Update the container image

To update the container image, check out the project from https://gitlab.wikimedia.org/repos/sre/miscweb/. Make your changes to the container image and get them reviewed in a merge request. Once the merge request is merged to the main branch, a dedicated publish pipeline job is executed. Find the job in GitLab in your project under CI/CD -> Jobs. The last lines of the job log contain the new image tag. For example:

#13 pushing layers
#13 pushing layers 4.9s done
#13 pushing manifest for docker-registry.discovery.wmnet/repos/sre/miscweb/annualreport:2023-08-14-083827
77bf6661c95d2ee40d007b0e9 0.6s done
#13 DONE 10.2s

Copy the new image tag 2023-08-14-083827 for the next step. Note: if you browse https://docker-registry.wikimedia.org/ it can take some time for your image to show up due to caching.
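
If you save the job log to a file, the tag can also be pulled out mechanically instead of copied by hand. A minimal sketch, assuming the tag always has the YYYY-MM-DD-HHMMSS shape shown in the example above; extract_image_tag is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: read a publish-job log on stdin and print the image
# tag from the last "pushing manifest for <image>:<tag>" line.
extract_image_tag() {
  sed -n 's/.*pushing manifest for [^ ]*:\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}-[0-9]\{6\}\).*/\1/p' |
    tail -n 1
}
```

Usage: extract_image_tag < job.log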

Deploy to Kubernetes/wikikube

Deployment workflow follows the standard process described in Kubernetes/Deployments#Code deployment/configuration changes.

Check out /operations/deployment-charts. Update the image tag in the matching miscweb values file with the tag from the previous step (for example in /helmfile.d/services/miscweb/values-annualreport.yaml#3), or make any other change to the Kubernetes service if needed. Get a review and a +2 from the SRE team.
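
The tag bump itself is a one-line change and could be scripted roughly as follows. This sketch assumes the values file stores the tag under a key named version (an assumption, check the actual file first); set_image_tag is a hypothetical helper.

```shell
# Hypothetical helper: set the value of every "version:" key in a values
# file to the given tag, keeping the original indentation.
# ASSUMPTION: the image tag lives under a key called "version".
set_image_tag() {
  values_file=$1
  new_tag=$2
  # Only accept tags made of digits and hyphens, e.g. 2023-08-14-083827.
  case $new_tag in
    '' | *[!0-9-]*) echo "unexpected tag format: $new_tag" >&2; return 1 ;;
  esac
  tmp=$(mktemp) || return 1
  sed "s/^\([[:space:]]*version:\).*/\1 ${new_tag}/" "$values_file" > "$tmp" &&
    mv "$tmp" "$values_file"
}
```

Usage: set_image_tag helmfile.d/services/miscweb/values-annualreport.yaml 2023-08-14-083827, then review the result with git diff before uploading the change.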

Once the change to the deployment-charts repo is merged, log in to a deployment server and cd to /srv/deployment-charts/helmfile.d/services/miscweb

Staging

  • helmfile -e staging diff --context 5
  • helmfile -e staging -i apply --context 5

Production

  • helmfile -e codfw diff --context 5
  • helmfile -e codfw -i apply --context 5
  • helmfile -e eqiad diff --context 5
  • helmfile -e eqiad -i apply --context 5

All helm deployments are atomic: a deployment either succeeds or is rolled back automatically after 5 minutes.
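
The staging and production steps above can be strung together in a small wrapper. The sketch below only prints the commands unless DRY_RUN=0 is set; run and rollout_miscweb are hypothetical names, and the environment order simply mirrors the lists above.

```shell
# Print each command by default; set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = 0 ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Diff, then interactively apply, in staging first and production after.
# Stops at the first command that fails.
rollout_miscweb() {
  for env in staging codfw eqiad; do
    run helmfile -e "$env" diff --context 5 || return 1
    run helmfile -e "$env" -i apply --context 5 || return 1
  done
}
```

Run rollout_miscweb from /srv/deployment-charts/helmfile.d/services/miscweb. Because apply is called with -i, helmfile still asks for confirmation before each change.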

Deploy a single release/service

To deploy just one release, use the --selector flag in helmfile. Replace annualreport with the correct release name:

  • helmfile -e eqiad -i --selector name=annualreport apply
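
It can help to look at the diff for that one release before applying it. The following sketch just prints the two commands for a given environment and release so they can be reviewed or pasted; release_commands is a hypothetical helper.

```shell
# Hypothetical helper: print the diff and apply commands for one release.
release_commands() {
  env=$1
  release=$2
  echo "helmfile -e $env --selector name=$release diff"
  echo "helmfile -e $env -i --selector name=$release apply"
}
```

Usage: release_commands eqiad annualreport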

Troubleshooting

For troubleshooting, use kube_env and kubectl:

  • kube_env miscweb staging
  • kubectl get pods
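
When there are many pods, a small filter makes the unhealthy ones easier to spot. A minimal sketch, assuming the default kubectl get pods column layout (NAME, READY, STATUS, ...); not_running_pods is a hypothetical helper.

```shell
# Hypothetical helper: read "kubectl get pods" output on stdin and print
# the name of every pod whose STATUS is neither Running nor Completed.
not_running_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

Usage: kubectl get pods | not_running_pods. For any pod it prints, kubectl describe pod and kubectl logs with that pod name usually show why it is not running.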

Service names

miscweb.svc.eqiad.wmnet has address 10.2.2.58  (eqiad)
miscweb.svc.codfw.wmnet has address 10.2.1.58  (codfw)
miscweb.discovery.wmnet has address 10.2.2.58  (DNS/Discovery)
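
The lines above are in the format printed by the host command. To see which data center the discovery name currently resolves to from where you are, its address can be compared against the per-DC names. A sketch, with discovery_dc as a hypothetical helper; it relies only on the three names listed above.

```shell
# Hypothetical helper: read "host" output lines on stdin and print the data
# center whose svc address matches the miscweb discovery address.
discovery_dc() {
  awk '
    $2 == "has" && $3 == "address" { addr[$1] = $4 }
    END {
      disc = addr["miscweb.discovery.wmnet"]
      if (disc == "") exit 1
      for (name in addr) {
        if (name != "miscweb.discovery.wmnet" && addr[name] == disc) {
          # name looks like miscweb.svc.<dc>.wmnet; the DC is field 3.
          split(name, part, "[.]")
          print part[3]
        }
      }
    }
  '
}
```

Usage: for n in miscweb.svc.eqiad.wmnet miscweb.svc.codfw.wmnet miscweb.discovery.wmnet; do host "$n"; done | discovery_dc. With the addresses listed above this prints eqiad.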

LVS / discovery

https://config-master.wikimedia.org/pybal/eqiad/miscweb

https://config-master.wikimedia.org/pybal/codfw/miscweb

https://config-master.wikimedia.org/discovery/

Metrics and dashboards

https://grafana.wikimedia.org/d/exdTE7kSk/miscweb?orgId=1

How this service was made

The table below lists the changes made to bring this service from scratch into WMF production, in the chronological order in which they were merged.

Steps for miscweb

# | action | link
1 | created a new service request ticket | https://phabricator.wikimedia.org/project/profile/1305/
2 | read docs | https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service
3 | reserved a service port | https://wikitech.wikimedia.org/wiki/Kubernetes/Service_ports
4 | added tokens in private repo to CI::master and deployment_server in private repo | cd /srv/private/.. on the puppetmaster (ask an SRE with root access if needed)
5 | added dummy tokens in the labs/private repo | https://gerrit.wikimedia.org/r/684000
6 | created a new namespace in kubernetes, use helmfile apply on deployment servers | https://gerrit.wikimedia.org/r/683743
7 | added new namespace to CI and deployment_server | https://gerrit.wikimedia.org/r/681500/ , https://gerrit.wikimedia.org/r/685116
8 | requested a new Gerrit repo to host your (Blubber) code | https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests
9 | read about deployment pipeline | https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial
10 | added initial config stub for pipeline lib | https://gerrit.wikimedia.org/r/690678
11 | read about Blubber | https://wikitech.wikimedia.org/wiki/Blubber , https://wikitech.wikimedia.org/wiki/Blubber/Pipeline
12 | added initial Blubber file | https://gerrit.wikimedia.org/r/690768
13 | added pipelines and config in integration/config | https://gerrit.wikimedia.org/r/690788 (asked releng)
14 | added bespoke pipeline in integration/config if needed | https://gerrit.wikimedia.org/r/690794 (asked releng)
15 | added LVS service IPs | https://gerrit.wikimedia.org/r/693966
16 | added entrypoint.sh in Blubber | https://gerrit.wikimedia.org/r/697140
17 | tried staging/test variants | https://gerrit.wikimedia.org/r/697142
18 | simplified apache config | https://gerrit.wikimedia.org/r/697654/ , https://gerrit.wikimedia.org/r/697663 , https://gerrit.wikimedia.org/r/697691
19 | installed vim, curl in container for testing | https://gerrit.wikimedia.org/r/697655/ , https://gerrit.wikimedia.org/r/697666
20 | dropped/merged unused pipeline | https://gerrit.wikimedia.org/r/697657
21 | switched service to not run 'insecurely' (as a separate user) | https://gerrit.wikimedia.org/r/697662/
22 | added virtual site inside webserver | https://gerrit.wikimedia.org/r/697695
23 | tested cloning from repo, letting Blubber generate a Dockerfile and got shell inside container | https://phabricator.wikimedia.org/T281538#7128132
24 | stopped loading modules not used | https://gerrit.wikimedia.org/r/698079
25 | reserved a public port for LVS | https://wikitech.wikimedia.org/w/index.php?title=Service_ports&type=revision&diff=1914806&oldid=1913236
26 | opened firewall on deployment server to dump data from pre-k8s service | https://gerrit.wikimedia.org/r/699064
27 | rsynced data over to deployment server | https://phabricator.wikimedia.org/T281538#7147262
28 | added config to serve data gzipped to reduce image size, installed browser in container to test | https://gerrit.wikimedia.org/r/698079 , https://gerrit.wikimedia.org/r/699320
29 | load mod_rewrite and mod_headers, add headers/encoding settings for gzipped content | https://gerrit.wikimedia.org/r/699319
30 | read about helm and deployments on kubernetes | https://wikitech.wikimedia.org/wiki/Helm , https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments
30 | cloned the repo 'operations/deployment-charts' where the helm files live | https://gerrit.wikimedia.org/r/admin/repos/operations/deployment-charts
31 | read README in the repo about how to create charts | https://gerrit.wikimedia.org/g/operations/deployment-charts
32 | read and ran 'create_new_service.sh' | https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/create_new_service.sh , https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/Rakefile
33 | adjusted values in new files generated by script and uploaded to the repo | https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/698895/
34 | created a new app type for a httpd without php-fpm, added a prometheus (metrics) exporter | https://gerrit.wikimedia.org/r/700522
35 | added helmfile.yaml and values under services.d, copying from another service | https://gerrit.wikimedia.org/r/713441
36 | set the docker registry name specifically to use the discovery name | https://gerrit.wikimedia.org/r/714014/
37 | added uncompressed content of the first 1000 Bugzilla bugs | https://gerrit.wikimedia.org/r/714460
38 | cleaned up and added comments for others to delete files they don't use | https://gerrit.wikimedia.org/r/713639
39 | set a main_app version and added some CPU/RAM limits | https://gerrit.wikimedia.org/r/714022
40 | added reserved port as nodePort | https://gerrit.wikimedia.org/r/714053
41 | added version tags for staging and production | https://gerrit.wikimedia.org/r/714368
42 | linked staging httpd config to prod httpd config | https://gerrit.wikimedia.org/r/714458
43 | added httpd rewrite rules from pre-k8s config | https://gerrit.wikimedia.org/r/714459
44 | set service deployment to production, not minikube | https://gerrit.wikimedia.org/r/714034
45 | bumped staging version to latest build created by CI | https://gerrit.wikimedia.org/r/714755 , https://gerrit.wikimedia.org/r/715236 etc .. (skipping these in the future, needed after every change)
46 | loaded missing mod_alias for Redirect directive | https://gerrit.wikimedia.org/r/715727
47 | added HTML content for the first 10000 bugs, checked image size | https://gerrit.wikimedia.org/r/717347
48 | compressed content with gzip and added more bug HTML | https://gerrit.wikimedia.org/r/728668
49 | various changes to add all the content in batches of 10k bugs, then the same for activities HTML files | https://gerrit.wikimedia.org/r/730275 , https://gerrit.wikimedia.org/r/730281 and various others up to https://gerrit.wikimedia.org/r/730334
50 | added and gzipped index and "all" pages | https://gerrit.wikimedia.org/r/730336
51 | added old Bugzilla Wikimedia skin directory | https://gerrit.wikimedia.org/r/730339
52 | read about adding a new service to LVS | https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service
53 | added service IPs in DNS | https://netbox.wikimedia.org/ (ask infra foundations)
53 | added LVS config and had it merged | https://gerrit.wikimedia.org/r/694625 (coordinate with serviceops/traffic for this step)
54 | switched service_state from service_setup to lvs_setup | https://gerrit.wikimedia.org/r/694628
55 | enabled TLS in helm chart | https://gerrit.wikimedia.org/r/739675
56 | removed nodePort, added public_port, enabled TLS, multiple attempts to get the order right, then TLS worked | https://gerrit.wikimedia.org/r/739810 , https://gerrit.wikimedia.org/r/739848 , https://gerrit.wikimedia.org/r/739945 , https://gerrit.wikimedia.org/r/742819
57 | switched service_state from lvs_setup to monitoring_setup, checked new Icinga monitoring being added, further testing to confirm it works at all | https://gerrit.wikimedia.org/r/694629 , https://phabricator.wikimedia.org/T281538#7578691
58 | debugged gzip encoding issue in cloud VPS, confirmed can pull and run directly from prod docker registry | https://phabricator.wikimedia.org/T281538#7606684
59 | fixed content type for HTML, which was set to CSS, service now working in cloud | https://gerrit.wikimedia.org/r/752235 , https://staticbz.wmcloud.org/bug10001.html
60 | further version bumping / deploying / testing | https://gerrit.wikimedia.org/r/752750
61 | confirmed working with curl directly from production service names with right content-type and content-encoding | https://phabricator.wikimedia.org/T281538#7620703
62 | switched service_state from monitoring_setup to production (make it page) but only very carefully after checking confd templates on DNS servers, downtiming services in Icinga | https://gerrit.wikimedia.org/r/694630 , https://phabricator.wikimedia.org/T281538#7620961
63 | read about discovery DNS | https://wikitech.wikimedia.org/wiki/DNS/Discovery
64 | added discovery DNS as an active-active service, confirmed could now curl from discovery name | https://gerrit.wikimedia.org/r/693968 , https://phabricator.wikimedia.org/T281538#7620995
65 | switched ATS (traffic servers/caching layer) from old backend to new backend, the discovery name on our reserved service port | https://gerrit.wikimedia.org/r/753813
66 | added service to disc_desired_state.py | https://gerrit.wikimedia.org/r/753846
67 | ATS servers got 502, did not work, reverted, turned out the reason was a missing SAN on the TLS cert |
68 | added SAN to cert, created new cert, checked it | https://phabricator.wikimedia.org/T281538#7635115
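
Several of the steps above involved checking that the gzipped HTML is served with the right content-type and content-encoding headers. A check of that kind could look roughly like the sketch below; has_gzip_html is a hypothetical helper, and the URL in the usage line is only an example built from names mentioned on this page.

```shell
# Hypothetical helper: read HTTP response headers on stdin and succeed only
# if the content type is text/html and the content encoding is gzip.
has_gzip_html() {
  awk '
    { sub(/\r$/, ""); line = tolower($0) }
    line ~ /^content-type:[ \t]*text\/html/ { type_ok = 1 }
    line ~ /^content-encoding:[ \t]*gzip/   { enc_ok = 1 }
    END { exit !(type_ok && enc_ok) }
  '
}
```

Usage: curl -sI --compressed https://static-bugzilla.wikimedia.org/bug10001.html | has_gzip_html && echo ok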

How to switch microsites hosted on legacy VMs from one datacenter to another

This section is not about the Kubernetes service but about the Ganeti VMs hosting microsites that have not moved yet.

This is how to switch those from one DC to another:

  1. Check the Google doc "sre collab service matrix" to see which services are hosted on miscweb* and pick one to switch.
  2. Go to the puppet repo and find your service in the file common/profile/trafficserver/backend.yaml.
  3. Confirm that it points to the name "https://webserver-misc-sites.discovery.wmnet".
  4. Look in the DNS repository in the file ./templates/wmnet and find this section:
     ;webserver-misc-sites  300 IN CNAME miscweb1003.eqiad.wmnet.
     webserver-misc-sites 300 IN CNAME miscweb2003.codfw.wmnet.
  5. Decide whether to switch ALL sites at once, by swapping which of the two records is commented out,
    1. or to introduce a different name, like webserver-misc-sites but separate, that points to miscweb1003 instead of miscweb2003 (or vice versa).
    2. If you choose the second option, go back to the ATS config (backend.yaml) mentioned above and switch just one of the sites to your new DNS discovery name.
  6. Run puppet on all cp* hosts via cumin, or just on the hosts for your data center (e.g. "cp4*" for ulsfo), or wait half an hour.
  7. On the old and the new miscweb machine, run tail -f /var/log/apache2/*.log and watch the access logs. Each site has its own access and error log file.
  8. Open the site in your browser and confirm that the request hits your new backend and that the site looks as it did before.
  9. Run an httpbb test against your new (or old and new) server and confirm there are no errors or differences. Example, from a deployment server:
    1. [deploy1002:~] $ httpbb --hosts miscweb1003.eqiad.wmnet /srv/deployment/httpbb-tests/miscweb/test_miscweb.yaml
  10. Mark the site as switched in the doc/ticket and !log it in SAL.
  11. Double check sites that have external deployers (deploy method column in the service matrix doc).
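
For the "switch ALL sites at once" option, the edit to the DNS template amounts to flipping which of the two CNAME records is commented out. A minimal sketch of that flip, assuming the records start at the beginning of the line exactly as shown in the section quoted above; toggle_misc_sites_cname is a hypothetical helper.

```shell
# Hypothetical helper: read a zone template on stdin and swap which
# webserver-misc-sites record is commented out (";" starts a comment).
toggle_misc_sites_cname() {
  awk '
    /^;webserver-misc-sites[ \t]/ { sub(/^;/, ""); print; next }
    /^webserver-misc-sites[ \t]/  { print ";" $0; next }
    { print }
  '
}
```

Usage: toggle_misc_sites_cname < templates/wmnet > wmnet.new, then compare the two files with diff before replacing the original and sending the change for review.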
This article is issued from Wikimedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.