Contributor Guide | SynapseML

Interested in contributing to SynapseML? We're excited to work with you. You can contribute in several ways:

Use the library and give feedback: report bugs, request features.
Add sample Jupyter notebooks, Python or Scala code examples, or documentation pages.
Fix bugs and issues.
Add new features, such as data transformations or machine learning algorithms.
Review pull requests from other contributors.

You can give feedback, report bugs, and request new features at any time by opening an issue. You can also upvote or comment on existing issues.

If you want to add code, examples or documentation to the repository, follow this process:

Preferably, start by tackling existing issues to get acquainted with the library source and the process.
To ensure your contribution is a good fit and doesn't duplicate ongoing work, open an issue or comment on an existing one, and use it to discuss your contribution and its design.
Any algorithm you're planning to contribute should be well known and accepted for production use, and backed by research papers.
Algorithms should be highly scalable and suitable for massive datasets.
All contributions need to comply with the MIT License. Contributors external to Microsoft need to sign a Contributor License Agreement (CLA).

Fork the SynapseML repository.
Implement your algorithm in Scala, using our wrapper generation mechanism to produce PySpark bindings.
Use SparkML PipelineStages so your algorithm can be used as part of a pipeline.
For parameters, use MMLParams.
Implement model saving and loading by extending SparkML MLReadable.
Use good Scala style.
Binary dependencies should be on Maven Central.
See this pull request for an example contribution.
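
The implementation steps above can be sketched with a minimal pipeline stage. This is an illustration only, not SynapseML code: the class name `UpperCaseColumn` and its parameters are hypothetical, and it uses only stock Spark ML traits (`Transformer`, `Param`, `DefaultParamsWritable`, `DefaultParamsReadable`) in place of the project's own helpers such as MMLParams, whose exact API is not shown here.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

/** Hypothetical stage that writes an upper-cased copy of a string column. */
class UpperCaseColumn(override val uid: String)
    extends Transformer with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("UpperCaseColumn"))

  // Parameters are declared as Spark ML Params so they can be saved and loaded.
  val inputCol: Param[String] = new Param[String](this, "inputCol", "name of the input column")
  val outputCol: Param[String] = new Param[String](this, "outputCol", "name of the output column")

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema) // validate the input before doing any work
    dataset.withColumn($(outputCol), upper(col($(inputCol))))
  }

  override def transformSchema(schema: StructType): StructType = {
    require(schema($(inputCol)).dataType == StringType,
      s"Column ${$(inputCol)} must be of type string")
    schema.add(StructField($(outputCol), StringType, nullable = true))
  }

  override def copy(extra: ParamMap): UpperCaseColumn = defaultCopy(extra)
}

// The companion object supplies `load`, which gives the stage model loading.
object UpperCaseColumn extends DefaultParamsReadable[UpperCaseColumn]
```

Because the stage extends `Transformer`, it can be placed in a Spark ML `Pipeline` next to other stages, and `write.save(path)` together with `UpperCaseColumn.load(path)` round-trips its parameters.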

Set up the build environment. Use a Linux machine or VM (we use Ubuntu, but other distros should work too).
Test your code locally.
Add tests using ScalaTest. Unit tests are required.
A sample notebook is required as an end-to-end test.
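
As a rough idea of what a unit test for a pipeline stage might look like, here is a sketch written against plain ScalaTest (assuming version 3.1 or later for `AnyFunSuite`) and a local `SparkSession`. It exercises Spark's built-in `Binarizer` as a stand-in for your own stage; the suite name is hypothetical, and the SynapseML repository may provide its own test base classes that you should prefer.

```scala
import java.nio.file.Files

import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class BinarizerSuite extends AnyFunSuite with BeforeAndAfterAll {

  // A small local session is enough for unit tests.
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("BinarizerSuite")
    .getOrCreate()

  override def afterAll(): Unit = {
    spark.stop()
    super.afterAll()
  }

  // Stand-in for the stage under test; replace with your own stage.
  private def newStage(): Binarizer = new Binarizer()
    .setInputCol("value")
    .setOutputCol("flag")
    .setThreshold(0.5)

  test("transform adds the expected output column") {
    import spark.implicits._
    val input = Seq(0.1, 0.9).toDF("value")
    val flags = newStage().transform(input).select("flag").as[Double].collect()
    assert(flags.toSeq == Seq(0.0, 1.0))
  }

  test("stage survives a save/load round trip") {
    val path = Files.createTempDirectory("binarizer").resolve("stage").toString
    newStage().write.overwrite().save(path)
    val loaded = Binarizer.load(path)
    assert(loaded.getThreshold == 0.5)
    assert(loaded.getOutputCol == "flag")
  }
}
```

Testing both the transformation and the save/load round trip covers the two requirements listed earlier: the stage works inside a pipeline, and it can be persisted.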

Add a sample Jupyter notebook that shows the intended use case of your algorithm, with step-by-step instructions. (The same notebook could be used for testing the code.)
Add in-line ScalaDoc comments to your source code, to generate the API reference documentation.
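
For reference, in-line ScalaDoc comments generally look like the sketch below. The `Clipper` class is a hypothetical, self-contained example chosen only to show the comment layout and the `@param` and `@return` tags; it is not part of SynapseML.

```scala
/** Clips numeric values to a closed interval.
  *
  * @param lower inclusive lower bound
  * @param upper inclusive upper bound; must not be less than `lower`
  */
final case class Clipper(lower: Double, upper: Double) {
  require(lower <= upper, s"lower ($lower) must not exceed upper ($upper)")

  /** Limits a value to the interval `[lower, upper]`.
    *
    * @param value the number to clip
    * @return `lower` if `value` is below the interval, `upper` if it is above,
    *         and `value` itself otherwise
    */
  def clip(value: Double): Double = math.max(lower, math.min(upper, value))
}
```

Documenting every public class, parameter, and return value in this way lets the generated API reference describe your feature without readers having to open the source.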

In most cases, you should squash your commits into one.
Open a pull request, and link it to the discussion issue you created earlier.
A SynapseML core team member will trigger a build to test your changes.
Fix any build failures. (The pull request will have comments from the build with useful links.)
Wait for code reviews from core team members and others.
Fix issues found in code review and iterate.

Wait for a core team member to merge your code in.
Your feature will be available through a Docker image and script installation in the next release, which typically happens about once a month. You can try out your feature sooner by using build artifacts for the version that has your changes merged in (such versions end in .devN).

If in doubt about how to do something, see how it was done in existing code or pull requests, and don't hesitate to ask.