** STEP 0: OFFICIAL PIPELINE REQUIREMENTS ** If you want to operate a pipeline for ArchiveTeam's official ArchiveBot instance, please read https://wiki.archiveteam.org/index.php/ArchiveBot#Volunteer_to_run_a_Pipeline before continuing. ** STEP 1: INSTALL EVERYTHING ** To run the pipeline, you will need: - git 2.6.0+ - a Python 3.3+ installation - Pip (for Python 3.3+) - seesaw (automatically installed by Pip) - rsync - wpull (automatically installed by Pip) - youtube-dl Quick install, for Debian and Debian-esque systems like Ubuntu: sudo apt-get update sudo apt-get install build-essential python3-dev python3-pip \ libxml2-dev libxslt-dev zlib1g-dev libssl-dev libsqlite3-dev \ libffi-dev git tmux fontconfig-config fonts-dejavu-core \ libfontconfig1 libjpeg-turbo8 libjpeg8 lsof ffmpeg youtube-dl \ autossh rsync pip3 install --upgrade pip ** STEP 2: CREATE THE ARCHIVEBOT USER ** After you've installed all the software, set up a dedicated account for ArchiveBot: adduser archivebot (You may also want to add the user to the sudo'ers group.) Log out of the server and log back in as user archivebot. Then do: ssh-keygen [keep hitting Enter] cat ~/.ssh/id_rsa.pub At this point, copy the public key output from your screen (it should start with "ssh-rsa" followed by a bunch of letters and numbers), and give it to the person operating the control node (aka backend). They will then set things up so your new pipeline can coordinate with the others, and they will give you back the username and hostname you will need for step 4. ** STEP 3: FINALLY, IT'S TIME TO INSTALL ARCHIVEBOT CODE ** Okay, back to the server stuff: cd ~/ git clone https://github.com/ArchiveTeam/ArchiveBot cd ArchiveBot git submodule update --init pip3 install --user -r pipeline/requirements.txt If you get any error messages at this point, you should try to fix them before continuing on, as there may be incompatibilities between the things that ArchiveBot is expecting and what your server actually has. ** STEP 4: START IT UP ** As user archivebot, in the FIRST tmux session: autossh -C -L 127.0.0.1:16379:127.0.0.1:6379 \ YOUR-USERNAME-GOES-HERE@CONTROL-NODE-GOES-HERE -N As user archivebot, in the SECOND tmux session: cd ~/ArchiveBot/pipeline mkdir -p ~/warcs4fos export REDIS_URL=redis://127.0.0.1:16379/0 export FINISHED_WARCS_DIR=$HOME/warcs4fos If you run the pipeline on a system with a modern OpenSSL version (e.g. Debian Buster and later), which comes with a more secure default configuration, additionally set the OPENSSL_CONF environment variable: export OPENSSL_CONF=/home/archivebot/ArchiveBot/ops/openssl-less-secure.cnf Now, think up a name for this new ArchiveBot pipeline. It will appear on the publicly available pipeline status dashboard. It will go in the command you enter next: ~/.local/bin/run-pipeline3 pipeline.py --disable-web-server \ --concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE 2>&1 | \ tee "pipeline-$(date -u +"%Y-%m-%dT%H_%M_%SZ").log" You can adjust the number of jobs your server can handle in --concurrent as needed. If you want your pipeline to only handle !ao/!archiveonly jobs, run it with the AO_ONLY environment variable set: AO_ONLY=1 ~/.local/bin/run-pipeline3 pipeline.py \ --disable-web-server --concurrent 2 \ YOUR-PIPELINE-NAME-GOES-HERE or export AO_ONLY=1 ~/.local/bin/run-pipeline3 pipeline.py --disable-web-server \ --concurrent 2 YOUR-PIPELINE-NAME-GOES-HERE If your pipeline has large amounts of disk space (at least 100GB dedicated to ArchiveBot's processing), set the LARGE environment variable in the same way as AO_ONLY above. Your pipeline will accept jobs queued with the --large option. If you are getting errors about wpull, you may need to create a symbolic link to it, like this: ln -s /usr/bin/wpull /home/archivebot/ArchiveBot/pipeline/wpull (You may need to edit that /home/YOUR_USER_HERE/YOUR-DIRECTORY/ path as needed.) As user archivebot, in the THIRD tmux session: export RSYNC_URL=UPLOAD-URL-GOES-HERE ~/ArchiveBot/uploader/uploader.py $HOME/warcs4fos If you start multiple pipelines, you can safely point them to the same FINISHED_WARCS_DIR and run just one uploader. Check out the ArchiveBot dashboard to make sure everything is working like it ought to. ** STEP 5: MISCELLANEOUS ** To gracefully stop the pipeline: touch ~/ArchiveBot/pipeline/STOP To gracefully stop the uploader, hit ctrl-c in its tmux session. To upgrade, run: pip3 install --user --upgrade -r pipeline/requirements.txt ** STEP 5a: TROUBLESHOOTING YOUTUBE-DL INSTALLATION ** youtube-dl is is a command line program that is used for downloading videos from YouTube, Vimeo, and other websites that feature embedded videos. It is supposed to be installed on your system automatically through requirements.txt, but just in case that doesn't work, here's how you can get it installed: sudo apt-get install python3-pip pip3 install --upgrade youtube_dl Or, for older versions of Python: sudo apt-get install python-pip pip install --upgrade youtube_dl ** STEP 6: Operate the Pipeline ** Some pointers for pipeline operators: You can find the process ID for a job by ps aux | grep $jobid. That job has a job directory, which you can find in the data directory; you can also get it out of ps. The job directory is wpull's scratch space, where it puts files it's downloading and where it assembles its WARC. It will move the WARC into the uploader folder when it reaches the designated size. If a job becomes stuck, find the process ID of its wpull instance and kill it with kill -9. The pipeline will move the completed WARC and upload it, and complete the job (you may want to note in #archivebot that you did this). The job may be re-queued. If you stop the pipeline or it crashes, you should remove the job directories under pipeline/data, and clean out /tmp. The WARCs in the pipeline directory are almost certainly incomplete and should not be uploaded. The jobs cannot currently be resumed, and so the data dir and /tmp are just consuming space. If the pipeline runs out of disk, it will be unable to do any useful work and jobs will lock up or fail. In this case, check that the uploader is functioning; if it is, use the du command in the data directory to see what is taking up space. If it is wpull.log, truncate it (not rm, but rather ftruncate) to 0 to free up a little space if needed. If the pipeline runs out of RAM, you will likely have to kill the job that is consuming all the RAM; wpull instances will pause to avoid the OOM killer being run. Consider creating a small swap file if your VM does not have any swap.