Jenkins and Machine Learning Plugins for Data Science
This site is the new docs site currently being tested. For the actual docs in use please go to https://www.jenkins.io/doc. |
Project goal: Create a new plugin for integrating Jenkins with one of Machine Learning tools (e.g. Jupyter Python, TensorBoard, or Sacred)
Skills to study/improve: Java, Jenkins plugin, Apache Zeppelin, Jupyter Notebooks, Python, Machine Learning, Data Science
Background
Machine learning (ML) has established itself as a key data science (DS) technology in finance, retail, marketing, science, and many other fields. Machine learning algorithms learn by analyzing features of training data sets that can then be applied to make predictions, estimations, and classifications in new test cases. The power of machine learning comes at the cost of complex development and production environment to support the data, algorithms, and models generated from machine learning [1, see figure below]
Project Proposal
It is envisioned that Jenkins and its plugin ecosystem can play a key role in supporting, facilitating and accelerating the integration of ML workflow components and orchestrating their reproducible execution. Currently, Python and interactive computational notebooks (such as Jupyter and Zeppelin) are the dominant software tools for teaching, composing and executing machine learning workflows. Interactive notebooks have proven valuable data exploration and teaching tools as they adopt a ‘literal programming’ paradigm where code fragments, results, instructions and documentation are integrated into a single UI, resembling a familiar paper ‘notebook’. However, interactive notebooks present significant limitations to the adoption of good software engineering principles (test-driven development, code versioning, reproducibility, etc) as well as scaling-up to data sizes typically used in DS production environments [2]. We propose the development of a Jenkins plugin that will integrate machine learning tasks and algorithms to build ML data systems that are easier to develop, test, deploy and maintain. The hidden power of interactive computational notebooks comes from back-end ‘polyglot’ interpreters (a.k.a kernels) that interpret the code fragments, and return the results of the computation is a useful display (graph, table or other). We propose to build a Jenkins plugin that will allow the integration of existing polyglot notebook kernels to support notebook - like computations as Jenkins build tasks. The goal is to allow data scientists to define production-grade ML workflows as Jenkins builds, integrating typical Jenkins build tasks with notebook code fragments as build steps. The plugin will support a new build task, to execute code via an appropriate language ‘kernel’ as currently supported by existing notebooks such as Zeppelin and Jupyter.
Technical Quickstart
The project involves interaction between a Jenkins node and an IPython Kernel. Jenkins must be able to connect to a running IPython Kernel - or start one if necessary - to be used to evaluate code used in a Jenkins plugin. Apache Zeppelin will be used for this interaction, as they provide existing Java code to interact with an IPython Kernel [3]. In Jenkins, we will use the simple Builder plugin from the Jenkins tutorial [4], modifying its original code to call the IPython instead of printing Hello World.
Here’s a step-by-step of what the student will be expected to achieve.
-
Use Apache Maven to create a Java project
-
Learn the API of the Python Maven module of Apache Zeppelin project
-
Use the Python Maven module of Apache Zeppelin (item 2) in the Java project created in item 1 to send Python code, and receive its output
-
Follow the Jenkins Plugin Tutorial [4] and create a simple Builder plug-in, that prints Hello World
-
Modify the plug-in (which is also a Maven project) to call the code created in item 3
-
Prove that we are able to call an IPython Kernel from a Jenkins job
Extra steps could involve investigate and document for future developers (which is a good practice in Software Engineering) what would be required to use other Kernels (e.g. Perl, R, Groovy, etc), where the code would have to be modified, and perhaps also see if we could use Apache Zeppelin Interpreters [5] from Jenkins.
-
Models in Minutes not Months: Data Science as Microservices
-
https://towardsdatascience.com/5-reasons-why-jupyter-notebooks-suck-4dc201e27086
-
https://github.com/apache/zeppelin/blob/master/python/README.md
-
xref:dev-docs:plugin-tutorial:index.adoc
-
https://zeppelin.apache.org/docs/0.7.0/manual/interpreters.html
Links
-
Apply to Google Summer of Code xref:projects:gsoc:index.adoc
-
Discussion in the Developer mailing list: https://groups.google.com/forum/m/#!topic/jenkinsci-dev/VrBoOWVsZ-o
-
Creating a Jenkins Data Science Platform: http://biouno.org/2016/03/07/Creating-a-Jenkins-Data-Science-Platform.html
Inspiration or Evaluation
Other tools that researchers use for experiment tracking (should not be seen as a direct alternative! Some can be used together with Jenkins!) * https://colab.research.google.com/ * https://neptune.ml/ ( $99/user/mo) * https://www.comet.ml/ ( $39/user/mo) * https://polyaxon.com/ * https://github.com/ModelChimp/modelchimp * https://github.com/datmo/datmo * https://github.com/IDSIA/sacred + https://github.com/vivekratnavel/omniboard / https://github.com/chovanecm/sacredboard * Airflow * studio.ml * https://medium.com/polyaxon * Version Control System for Machine Learning Projects https://dvc.org/
Jupyter-Zeppelin
-
https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
-
https://nextjournal.com/schmudde/how-to-version-control-jupyter
-
Existing Java integration with IPython Kernel (Jupyter), via Apache Zeppelin
Example Jenkins Pipeline
-
An example pipeline launching Tensorboard from a Jenkins pipeline can be found here
Newbie-friendly issues
We currently do not have any issues but can readily answer questions via the gitter channel here
Organization Links
-
Jenkins GSoC page - documentation, application guidelines
-
Participate and contribute to Jenkins - landing page for newcomer contributors
-
Newbie-friendly issues - list of organization-wide newbie-friendly issues (use them if there is no links in the project idea) > Go back to other GSoC 2020 project ideas