MLOps.toys

A curated list of MLOps projects

by Aporia
Experiment Tracking

Compare 1000s of AI experiments at once.

  • Open-source: Community-driven. Self-hosted and full metadata access.
  • Explore & Compare: Easily search, group and aggregate metrics by any hyperparameter.
  • Dashboard: Activity view and full experiments dashboard for all experiments.
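
Grouping and aggregating metrics by a hyperparameter can be sketched in plain Python as follows. The run records, field names (`lr`, `batch_size`, `val_loss`), and the `aggregate_by` helper are hypothetical and are not the API of any tool listed here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: hyperparameters plus one final metric per run.
runs = [
    {"lr": 0.01, "batch_size": 32, "val_loss": 0.42},
    {"lr": 0.01, "batch_size": 64, "val_loss": 0.40},
    {"lr": 0.10, "batch_size": 32, "val_loss": 0.55},
    {"lr": 0.10, "batch_size": 64, "val_loss": 0.61},
]


def aggregate_by(runs, hparam, metric):
    """Group runs by one hyperparameter and average a metric within each group."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[hparam]].append(run[metric])
    return {value: mean(scores) for value, scores in groups.items()}


# Mean validation loss per learning rate, then per batch size.
by_lr = aggregate_by(runs, "lr", "val_loss")
by_batch = aggregate_by(runs, "batch_size", "val_loss")
```

A real tracker would store many metrics per run and per step; the grouping idea stays the same.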
Model Serving
Works well with:
AWS Lambda
Azure Functions
Google Cloud Run
Competes with:
Seldon Core to some extent
KFServing to some extent

BentoML is a flexible, high-performance framework for serving, managing, and deploying machine learning models.

  • Supports multiple ML frameworks, including TensorFlow, PyTorch, Keras, XGBoost and more
  • Cloud native deployment with Docker, Kubernetes, AWS, Azure and many more
  • High-Performance online API serving and offline batch serving
  • Web dashboards and APIs for model registry and deployment management

BentoML tries to bridge the gap between Data Science and DevOps. By providing a standard interface for describing a prediction service, BentoML abstracts away how to run model inference efficiently and how model serving workloads can integrate with cloud infrastructures.

Feature Store

A tool for building feature stores. Transform your raw data into beautiful features.

The library is centered on the following concepts:

  • ETL: central framework to create data pipelines. Spark-based Extract, Transform and Load modules ready to use.
  • Declarative Feature Engineering: care about what you want to compute and not how to code it.
  • Feature Store Modeling: the library easily provides everything you need to process and load data to your Feature Store.
Experiment Tracking

Open-source library for implementing CI/CD in machine learning projects.

On every pull request, CML helps you automatically train and evaluate models, then generates a visual report with results and metrics.

  • GitFlow for data science. Use GitLab or GitHub to manage ML experiments, track who trained ML models or modified data and when. Codify data and models with DVC instead of pushing to a Git repo.
  • Auto reports for ML experiments. Auto-generate reports with metrics and plots in each Git Pull Request. Rigorous engineering practices help your team make informed, data-driven decisions.
  • No additional services. Build your own ML platform using just GitHub or GitLab and your favourite cloud services: AWS, Azure, GCP. No databases, services or complex setup needed.
Experiment Tracking

Comet enables data scientists and teams to track, compare, explain and optimize experiments and models across the model’s entire lifecycle. From training to production.

Data Versioning

DAGsHub enables data scientists and ML engineers to work together, effectively. Integrating open-source tools like Git, DVC, MLflow, and Jenkins so that you can track and version code, data, models, pipelines, and experiments in one place.

  • Your project in one place: Manage your code, notebooks, data, models, pipelines, and experiments and easily connect to plugins for automation, all with open source tools and open formats.
  • Zero configuration: Don't waste time on DevOps heavy lifting. Each DAGsHub project comes with a free, built-in DVC data storage and MLflow server, with team access controls, so you can just add the URL, and get to work.
  • Diff, compare, and review anything: DAGsHub lets you diff Jupyter notebooks, tables, images, experiments, and even MRI data, so you can compare apples to apples, review, and make sense of your work.
  • Reproducibility is a click away: Get all components of an experiment on your system. It's as easy as git checkout.
Model Monitoring

Validate and monitor your data and models during training, production and new version releases.

Features:

  • ML Validation of training data and ML model
  • Observability of ML in production
  • Alerting about various issues in live ML systems
  • Detecting Mismatches between research and production environments
  • Quick Querying of problematic production data
Explainability

ELI5 is a Python library that lets you visualize and debug various machine learning models using a unified API. It has built-in support for several ML frameworks and provides a way to explain black-box models.

Feature Store
Competes with:
Hopsworks

Feast is an operational data system for managing and serving machine learning features to models in production.

  • Feast decouples your models from your data infrastructure by providing a single data access layer that abstracts feature storage from feature retrieval.
  • Feast provides both a centralized registry to which data scientists can publish features, and a battle-hardened serving layer. Together, these enable non-engineering teams to ship features into production with minimal oversight.
  • Feast solves the challenge of data leakage by providing point-in-time correct feature retrieval when exporting feature datasets for model training.
  • With Feast, data scientists can start new ML projects by selecting previously engineered features from a centralized registry, and are no longer required to develop new features for each project.
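
Point-in-time correct retrieval means each training row only sees feature values that were known at or before the label's event time. The sketch below illustrates the idea with the standard library only. It does not use the Feast API, and the data and the `point_in_time_value` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature log for one entity: (timestamp, value), sorted by time.
feature_log = [(1, 10.0), (5, 12.5), (9, 20.0)]

# Hypothetical training labels: (event_timestamp, label).
label_events = [(4, 0), (5, 1), (8, 0), (0, 1)]


def point_in_time_value(log, event_ts):
    """Return the latest feature value observed at or before event_ts, else None."""
    timestamps = [ts for ts, _ in log]
    idx = bisect_right(timestamps, event_ts)
    return log[idx - 1][1] if idx > 0 else None


# Each row joins a label with the feature value valid at that moment,
# so values recorded after the event never leak into training data.
training_rows = [
    (ts, point_in_time_value(feature_log, ts), label) for ts, label in label_events
]
```

The event at time 8 gets the value written at time 5, not the later value from time 9, and the event at time 0 gets no value at all.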
Training Orchestration

Flyte makes it easy to create concurrent, scalable, and maintainable workflows for machine learning and data processing.

  • Kubernetes-Native Workflow Automation Platform
  • Ergonomic SDKs in Python, Java & Scala
  • Versioned & Auditable
  • Reproducible Pipelines
  • Strong Data Typing
Model Testing
Works well with:
GitHub Actions

Provides API-powered synthetic data test fixtures for your tabular data-enabled features, enabling regression and integration tests to be easily built and deployed for your production ML models and data science components.

  • Construct synthetic data covering important behaviours, edge cases, and under-represented areas of your input domain
  • Retrieve via dedicated endpoint to your builds & continuous integration steps. No more data in version control.
  • Benchmark model output to quickly detect regressions, or compare behaviours for model selection
Explainability

InterpretML is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof. With this package, you can train interpretable glassbox models and explain blackbox systems. InterpretML helps you understand your model's global behavior, or understand the reasons behind individual predictions.

Interpretability is essential for:

  • Model debugging - Why did my model make this mistake?
  • Feature Engineering - How can I improve my model?
  • Detecting fairness issues - Does my model discriminate?
  • Human-AI cooperation - How can I understand and trust the model's decisions?
  • Regulatory compliance - Does my model satisfy legal requirements?
  • High-risk applications - Healthcare, finance, judicial, ...
Training Orchestration
Works well with:
Jupyter Notebooks
Cloud
Python
R-Studio
TensorFlow
Spark

Katonic MLOps Platform is a collaborative platform with a Unified UI to manage all data science activities in one place and introduce MLOps practice into the production systems of customers and developers. It is a collection of cloud-native tools for all of these stages of MLOps:

  • Data exploration
  • Feature preparation
  • Model training/tuning
  • Model serving, testing and versioning

Katonic is for both data scientists and data engineers looking to build production-grade machine learning implementations and can be run either locally in your development environment or on a production cluster. Katonic provides a unified system—leveraging Kubernetes for containerization and scalability for the portability and repeatability of its pipelines.

Explainability

This project is about explaining what machine learning classifiers (or models) are doing. At the moment, we support explaining individual predictions for text classifiers or classifiers that act on tables (numpy arrays of numerical or categorical data) or images, with a package called lime (short for local interpretable model-agnostic explanations). Lime is based on the work presented in this paper (bibtex here for citation).

Model Monitoring

MLRun is an end-to-end open-source MLOps orchestration framework to manage and automate your entire analytics and machine learning lifecycle, from data ingestion, through model development to full pipeline deployment. MLRun eases the development of machine learning pipelines at scale and helps ML teams build a robust process for moving from the research phase to fully operational production deployments.

  • Feature and Artifact Store: Handles the ingestion, processing, metadata, and storage of data and features across multiple repositories and technologies.
  • Elastic Serverless Runtimes: Converts simple code to scalable and managed microservices with workload-specific runtime engines (such as Kubernetes jobs, Nuclio, Dask, Spark, and Horovod).
  • ML Pipeline Automation: Automates data preparation, model training and testing, deployment of real-time production pipelines, and end-to-end model and feature monitoring.
  • Central Management: Provides a unified portal for managing the entire MLOps workflow. The portal includes a UI, a CLI, and an SDK, which are accessible from anywhere.
Training Orchestration

OpenPAI is an open-source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premise, cloud, and hybrid environments at various scales.

Training Orchestration
Works well with:
Jupyter Notebooks
Self-Hosted
Cloud

Build data pipelines, the easy way!

No framework. No YAML. Just write Python and R code in Notebooks.

Features:

  • Visually construct pipelines through our user-friendly UI
  • Code in Notebooks
  • Run any subset of a pipeline directly or periodically
  • Easily define your dependencies to run on any machine
Training Orchestration
Works well with:
Kubernetes
AWS Batch
Airflow

Develop and test workflows locally, seamlessly execute them in a distributed environment.

Features:

  • Cloud-agnostic. Runs in Kubernetes, AWS Batch, and Airflow.
  • Integrates with Jupyter. Develop interactively, deploy to the cloud without code changes.
  • Incremental builds. Speed up execution by skipping tasks whose source code has not changed.
  • Flexible. Supports functions, scripts, notebooks, and SQL scripts as tasks.
  • Parallelization. Automatically parallelize independent tasks.
  • Interactive console. Helps you debug workflows quickly.
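
The incremental-build idea (skip tasks whose source code has not changed) can be sketched by hashing each task's source and comparing it with the hash from the last run. This is a minimal illustration, not the tool's implementation. The `run_incremental` helper and the task layout are hypothetical, and a fuller version would also rerun tasks downstream of a changed task.

```python
import hashlib


def source_hash(source: str) -> str:
    """Fingerprint a task's source code."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def run_incremental(tasks, cache):
    """Run only tasks whose source changed since the last run.

    tasks: dict mapping task name -> (source text, callable)
    cache: dict mapping task name -> hash stored by the previous run
    Returns the names of the tasks that were executed.
    """
    executed = []
    for name, (source, func) in tasks.items():
        digest = source_hash(source)
        if cache.get(name) == digest:
            continue  # source unchanged: skip the task
        func()
        cache[name] = digest
        executed.append(name)
    return executed


cache = {}
tasks = {
    "load": ("select * from raw", lambda: None),
    "clean": ("drop nulls", lambda: None),
}
first_run = run_incremental(tasks, cache)   # nothing cached yet, both tasks run
tasks["clean"] = ("drop nulls and duplicates", lambda: None)
second_run = run_incremental(tasks, cache)  # only the edited task runs
```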
Explainability

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.
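
To make the Shapley value idea concrete, the sketch below computes exact Shapley values for a toy three-feature "model" by averaging each feature's marginal contribution over every ordering. It does not use the SHAP library, and the toy model and feature names are hypothetical. Exact enumeration is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(value_fn, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value_fn(frozenset(coalition))
            coalition.add(p)
            after = value_fn(frozenset(coalition))
            totals[p] += after - before
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_model(present):
    """Hypothetical model output given the set of features that are 'present'."""
    out = 0.0
    if "age" in present:
        out += 2.0
    if "income" in present:
        out += 1.0
    if "age" in present and "income" in present:
        out += 1.0  # interaction term, shared between the two features
    return out


phi = shapley_values(toy_model, ["age", "income", "zip"])
```

The attributions sum to the difference between the full model output and the empty-coalition output, which is the additivity property the name refers to. The unused `zip` feature receives zero credit.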

Model Serving
Competes with:
KFServing
BentoML to some extent

Seldon Core makes it easier and faster to deploy your machine learning models and experiments at scale on Kubernetes.

  • Runs anywhere: Built on Kubernetes, runs on any cloud and on premises
  • Agnostic and independent: Framework agnostic, supports top ML libraries, toolkits and languages
  • Runtime inference graphs: Advanced deployments with experiments, ensembles and transformers

Seldon handles scaling to thousands of production machine learning models and provides advanced machine learning capabilities out of the box including Advanced Metrics, Request Logging, Explainers, Outlier Detectors, A/B Tests, Canaries and more.

Training Orchestration
Competes with:
Hydra

spock is a framework that helps manage complex parameter configurations during research and development of Python applications. spock lets you focus on the code you need to write instead of re-implementing boilerplate code like creating ArgParsers, reading configuration files, implementing traceability etc. In short, spock configurations are defined by simple and familiar class-based structures. This allows spock to support inheritance, read from multiple configuration file formats, and allow hierarchical configuration by composition.
Features:

  • Simple Declaration: Type checked parameters are defined within a @spock decorated class. Supports required/optional and automatic defaults.
  • Easily Managed Parameter Groups: Each class automatically generates its own object within a single namespace.
  • Parameter Inheritance: Classes support inheritance allowing for complex configurations derived from a common base set of parameters.
  • Complex Types: Nested Lists/Tuples, List/Tuples of Enum of @spock classes, List of repeated @spock classes
  • Multiple Configuration File Types: Configurations are specified from YAML, TOML, or JSON files.
  • Hierarchical Configuration: Compose from multiple configuration files via simple include statements.
  • Command-Line Overrides: Quickly experiment by overriding a value with automatically generated command line arguments.
  • Immutable: All classes are frozen preventing any misuse or accidental overwrites (to the extent they can be in Python).
  • Traceability and Reproducibility: Save runtime parameter configuration to YAML, TOML, or JSON with a single chained command (with extra runtime info such as Git info, Python version, machine FQDN, etc). The saved file can be used as the configuration input to reproduce prior runtime configurations.
  • S3 Addon: Automatically detects s3:// URI(s) and handles loading and saving spock configuration files when an active boto3.Session is passed in (plus any additional S3Transfer configurations)
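
The general pattern of class-based, type-checked, immutable configuration with overrides can be sketched with standard-library dataclasses. This is not the spock API: `TrainConfig`, `load_config`, and the field names are hypothetical stand-ins that only illustrate the ideas listed above.

```python
import json
from dataclasses import FrozenInstanceError, dataclass, fields


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical parameter group with typed fields and defaults."""
    lr: float = 0.01
    epochs: int = 10
    optimizer: str = "sgd"


def load_config(cls, json_text, overrides=None):
    """Build a frozen config from JSON text plus optional overrides."""
    values = json.loads(json_text)
    values.update(overrides or {})  # overrides win, like command-line arguments
    for f in fields(cls):
        if f.name in values and not isinstance(values[f.name], f.type):
            raise TypeError(f"{f.name} must be {f.type.__name__}")
    return cls(**values)


cfg = load_config(TrainConfig, '{"lr": 0.1, "epochs": 5}', overrides={"epochs": 20})

# Frozen dataclasses reject attribute assignment after construction.
try:
    cfg.lr = 0.5
    is_frozen = False
except FrozenInstanceError:
    is_frozen = True
```

Here `cfg.epochs` comes from the override, `cfg.lr` from the file, and `cfg.optimizer` from the class default.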
Model Serving
Works well with:
Kubernetes
Grafana
Nvidia Triton
Intel OpenVINO
MLflow
ClearML

Syndicai is a cloud platform that deploys, manages, and scales any trained AI model in minutes with no configuration & infrastructure setup.

  • Easy to use - You don't need to know or understand Docker & Kubernetes. The platform provides production-ready deployments from day one.
  • Highly flexible - Customize every single step of the whole AI model deployment workflow (from model wrappers to Kubernetes configuration manifests), and integrate with tools you love.
  • Optimized for ML - Run workload on highly optimized, cost-efficient, and secure infrastructure built specifically for high-performance ML models.
  • Cloud, Framework agnostic - Deploy ML models written in frameworks you love and run them on the cloud you want with no extensive and time-consuming setup.
Model Serving
Competes with:
Triton Inference Server

TorchServe is a flexible and easy to use tool for serving PyTorch models.

  • With TorchServe, PyTorch users can bring their models to production quicker, without having to write custom code: on top of providing a low latency prediction API, TorchServe embeds default handlers for the most common applications such as object detection and text classification.
  • TorchServe includes multi-model serving, model versioning for A/B testing, monitoring metrics, and RESTful endpoints for application integration.
  • TorchServe supports any machine learning environment, including Amazon SageMaker, container services, and Amazon Elastic Compute Cloud (EC2).
Training Orchestration

Valohai is an MLOps platform that handles machine orchestration, automatic reproducibility and deployment.

  • Technology agnostic: Valohai runs everything in Docker containers so that you can run almost anything on it.
  • Runs on any cloud: Valohai natively supports Azure, AWS, GCP and OpenStack.
  • API, CLI, GUI and Jupyter integration: Valohai integrates to almost any workflow through its many interfaces.
  • Managed service: Seasoned DevOps engineers manage Valohai – so you don’t have to be one.
Model Monitoring

The WhyLabs Observability Platform enables any AI practitioner to set up AI monitoring in three easy steps. It follows the standard DevOps model of installing a lightweight logging agent (whylogs) alongside your model and sending data profiles to a fully self-service SaaS platform (WhyLabs). On the platform, you can analyze your profiles to see how your model is performing and automatically get alerted on changes. The platform includes:

  • An easy setup flow so that you can start getting value right away
  • Automatic data drift detection and alerting to prevent model performance degradation
  • Industry standard for data profiling enabled by the open source "whylogs" library
Model Monitoring

With Aporia, data scientists and ML engineers can easily build monitoring for their ML models running in production.

Features:

  • Build your own monitors: Easily define monitoring logic.
  • Concept drift & Data integrity detections: Built-in monitors and alerts for prediction drift, data drift, data integrity issues and more.
  • Runs on your VPC: Natively supports on-prem and cloud deployments.
  • User-friendly & flexible: A simple, intuitive dashboard for all your models in production.
  • Data segments: Define & monitor slices of data based on selected features.
Model Serving

Bodywork deploys machine learning projects developed in Python to Kubernetes. It helps you:

  • serve models as microservices
  • execute batch jobs
  • run reproducible pipelines

On demand, or on a schedule. It automates repetitive DevOps tasks and frees machine learning engineers to focus on what they do best - solving data problems with machine learning.

Feature Store

An easy-to-use feature store.

The Bytehub Feature Store is designed to:

  • Be simple to use, with a Pandas-like API;
  • Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
  • Be optimised towards timeseries operations, making it highly suited to applications such as those in finance, energy, forecasting; and
  • Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

Experiment Tracking

ClearML is an open source suite of tools that automates preparing, executing, and analyzing machine learning experiments.

Features:

  • ClearML Experiment: A complete experiment management toolset. Keep track of parameters, jobs, artifacts, metrics, debug data, metadata, and log it all in one clear interface.
  • ClearML Orchestrate: The easiest way to manage scheduling and orchestration for GPU / CPU resources and to auto-scale on cloud & on-prem machines. Replicate your dev environment for training anywhere or develop on remote VMs.
  • ClearML Feature Store: Data analysis versioning & lineage for full reproducibility. Build and automate data pipelines for R&D and production. Rebalance, debias and mix & match datasets for fine grain control of your data.
Model Serving

Cortex makes it simple to deploy machine learning models in production.

Deploy

  • Deploy TensorFlow, PyTorch, ONNX, scikit-learn, and other models.
  • Define preprocessing and postprocessing steps in Python.
  • Configure APIs as realtime or batch.
  • Deploy multiple models per API.

Manage

  • Monitor API performance and track predictions.
  • Update APIs with no downtime.
  • Stream logs from APIs.
  • Perform A/B tests.

Scale

  • Test locally, scale on your AWS account.
  • Autoscale to handle production traffic.
  • Reduce cost with spot instances.
Data Versioning

DVC is an open-source tool for data science and machine learning projects.

Key features:

  1. Simple command line Git-like experience. Does not require installing and maintaining any databases. Does not depend on any proprietary online services.
  2. Management and versioning of datasets and machine learning models. Data is saved in S3, Google cloud, Azure, Alibaba cloud, SSH server, HDFS, or even local HDD RAID.
  3. Makes projects reproducible and shareable; helping to answer questions about how a model was built.
  4. Helps manage experiments with Git tags/branches and metrics tracking.

DVC aims to replace spreadsheet and document sharing tools (such as Excel or Google Docs) which are frequently used as both knowledge repositories and team ledgers. DVC also replaces ad-hoc scripts to track, move, and deploy different model versions, as well as ad-hoc data file suffixes and prefixes.

Training Orchestration

Determined is an open-source deep learning training platform that makes building models fast and easy.

  • Train models faster using state-of-the-art distributed training, without changing your model code
  • Automatically find high-quality models with advanced hyperparameter tuning from the creators of Hyperband
  • Get more from your GPUs with smart scheduling and cut cloud GPU costs by seamlessly using preemptible instances
  • Track and reproduce your work with experiment tracking that works out-of-the-box, covering code versions, metrics, checkpoints, and hyperparameters

Determined integrates these features into an easy-to-use, high-performance deep learning environment — which means you can spend your time building models instead of managing infrastructure.

Model Monitoring

Evidently helps analyze machine learning models during validation or production monitoring. It generates interactive reports from pandas DataFrames or CSV files.

Features:

  • Model Health: Quickly visualize model performance and important metrics. Get a prioritized list of issues to debug.
  • Data Drift: Compare recent data with the past. Learn which features changed and if key model drivers shifted. Visually explore and understand drift.
  • Target Drift: Understand how model predictions and target change over time. If the ground truth is delayed, catch the model decay in advance.
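
Comparing recent data with the past can be reduced to a single drift score per feature. The sketch below uses the Population Stability Index (PSI), chosen here purely for illustration. The source does not say which statistics the tool computes, and the `psi` helper, bin count, and sample data are assumptions.

```python
import math


def psi(reference, current, bins=4):
    """Population Stability Index between two samples of one numeric feature.

    Bin edges come from the reference sample; current values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def fractions(data):
        counts = [0] * bins
        for x in data:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        eps = 1e-6  # avoid log(0) for empty bins
        return [max(c / len(data), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference = [i / 10 for i in range(100)]   # hypothetical training-time values
shifted = [x + 3.0 for x in reference]     # hypothetical production values

score_same = psi(reference, reference)
score_shifted = psi(reference, shifted)
```

Identical samples score zero, and the score grows as the distributions diverge. Any alerting threshold is a judgment call that depends on the feature and the use case.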
Explainability

Continuously monitor, explain, and analyze AI systems at scale. With actionable insights, build trustworthy, fair, and responsible AI monitoring.

  • Complex AI systems are inherently black boxes with minimal insight into their operation.
  • Explainable AI or XAI makes these AI black boxes more like AI glass-boxes by enabling users to always understand the ‘why’ behind their decisions.
  • Identify, address, and share performance gaps and biases quickly for AI validation and debugging
Feature Store
Competes with:
Tecton
Feast

The Hopsworks Feature Store manages your features for training and serving models.

  • Provides scale-out storage for training and batch inference as well as low-latency storage for online applications that need to build feature vectors to make real-time predictions.
  • Provides Python and Java/Scala APIs to enable Batch and Online applications to manage and use features for machine learning.
  • Integrates seamlessly with popular platforms for Data Science, such as AWS Sagemaker and Databricks. It also integrates with backend datalakes, such as S3 and Hadoop.
  • Supports both cloud and on-prem deployments.

The Iguazio Data Science Platform accelerates and scales development, deployment and management of your AI applications with MLOps and end-to-end automation of machine learning pipelines. The platform includes an online and offline feature store, fully integrated with automated model monitoring and drift detection, model serving and dynamic scaling capabilities, all packaged in an open and managed platform.

  • Ingest Data from Any Source and Build Reusable Online and Offline Features: Ingest and unify unstructured and structured data in real-time and create online and offline features using Iguazio’s Integrated Feature Store.
  • Continuously Train and Evaluate Models at Scale: Run experimentation over scalable serverless ML/DL runtimes with automated tracking, data versioning, and continuous integration/delivery (CI/CD) support.
  • Deploy Models to Production in Seconds: Deploy models and APIs from a Jupyter notebook or IDE to production in just a few clicks and continuously monitor model performance and mitigate model drift.
  • Monitor Your Models and Data on the Fly: Manage, govern and monitor your models and real-time features in production with a simple dashboard integrated with Iguazio’s Feature Store.
Model Serving
Competes with:
Seldon Core
BentoML to some extent

KFServing enables serverless inferencing on Kubernetes to solve production model serving use cases.

  • Provides performant, high abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.
  • Provides a Kubernetes Custom Resource Definition (CRD) for serving ML models.
  • Encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU autoscaling, scale to zero, and canary rollouts to your ML deployments.
  • Enables a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability out of the box.
Training Orchestration

The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

Kubeflow's goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

Anywhere you are running Kubernetes, you should be able to run Kubeflow.

Experiment Tracking

MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.

It offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc.), wherever you currently run ML code (e.g. your notebook).

Features:

  • MLflow Tracking: An API to log parameters, code, and results in machine learning experiments and compare them using an interactive UI.
  • MLflow Projects: A code packaging format for reproducible runs using Conda and Docker, so you can share your ML code with others.
  • MLflow Models: A model packaging format and tools that let you easily deploy the same model (from any ML library) to batch and real-time scoring on platforms such as Docker, Apache Spark, Azure ML and AWS SageMaker.
  • MLflow Model Registry: A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of MLflow Models.
Experiment Tracking

Neptune is a lightweight experiment logging/tracking tool that helps you with your machine learning experiments.

Features:

  • Rich experiment logging and tracking capabilities
  • Python and R clients
  • Experiments dashboards, views and comparison features
  • Team management
  • 25+ integrations with popular data science stack libraries
  • Fast, reliable UI
Model Serving
Competes with:
Triton Inference Server
TensorFlow Serving
TorchServe

OpenVINO™ Model Server (OVMS) is a scalable, high-performance solution for serving machine learning models optimized for Intel® architectures.

  • Simultaneous serving of any model trained in a framework that is supported by OpenVINO
  • The server implements gRPC and REST API framework with data serialization and deserialization using TensorFlow Serving API
  • Uses OpenVINO™ as the inference execution provider
  • Supports different file systems: local (e.g. NFS), Google Cloud Storage (GCS), Amazon S3, Minio or Azure Blob Storage
Data Versioning

Pachyderm is a tool for version-controlled, automated, end-to-end data pipelines for data science.

Features:

  • Containerized: Pachyderm is built on Docker and Kubernetes. Whatever languages or libraries your pipeline needs, they can run on Pachyderm which can easily be deployed on any cloud provider or on prem.
  • Version Control: Pachyderm version controls your data as it's processed. You can always ask the system how data has changed, see a diff, and, if something doesn't look right, revert.
  • Provenance (aka data lineage): Pachyderm tracks where data comes from. Pachyderm keeps track of all the code and data that created a result.
  • Parallelization: Pachyderm can efficiently schedule massively parallel workloads.
  • Incremental Processing: Pachyderm understands how your data has changed and is smart enough to only process the new data.
Training Orchestration

PrimeHub is an open-source pluggable MLOps platform on top of Kubernetes for teams of data scientists and administrators. PrimeHub equips enterprises with consistent yet flexible tools to develop, train, and deploy ML models at scale. By improving the iterative process of data science, data teams can collaborate closely and innovate fast.

  • Cluster Computing with multi-tenancy
  • One-Click Notebook Environments
  • Group-centric Datasets Management / Resources Management / Access-control Management
  • Custom Machine Learning Environments with Image Builder
  • Model Tracking and Deployment
  • Capability Augmentation with 3rd-party Apps Store
Model Serving

A command-line utility to train and deploy Machine Learning and Deep Learning models on AWS SageMaker in a few simple steps.

Key features:

  1. Turn on ML superpowers: Train, tune and deploy hundreds of ML models by implementing just 2 functions
  2. Focus 100% on Machine Learning: Manage your ML models from one place without dealing with low level engineering tasks
  3. 100% reliable: No more flaky ML pipelines. Sagify offers 100% reliable training and deployment on AWS.
Training Orchestration
Works well with:
Jupyter Notebooks
Kubernetes
Grafana
Weights and Biases
Arize

Spell is an end-to-end deep learning platform that automates complex ML infrastructure and operational work required to train and deploy AI models. Spell is fully hybrid-cloud, and can deploy easily into any cloud or on-prem hardware.

  • Run Orchestration: Automate cloud training execution from a user's local CLI as a tracked and reproducible experiment, capturing all outputs and comprehensive metrics.
  • Model Serving: Serve models directly into production from a model registry, complete with lineage metadata, backed by a managed Kubernetes cluster for maximum scalability and robustness.
  • Experiment Management: Manage, organize, collaborate on, and visualize your entire ML training portfolio in the cloud, under one centralized control pane.
Training Orchestration
Competes with:
HuggingFace Accelerate
PyTorch Lightning (Accelerate)

stoke is a lightweight wrapper for PyTorch that provides a simple declarative API for context switching between devices (e.g. CPU, GPU), distributed modes, mixed-precision, and PyTorch extensions. This allows you to switch from local full-precision CPU to mixed-precision distributed multi-GPU with extensions (like optimizer state sharding) by simply changing a few declarative flags. Additionally, stoke exposes configuration settings for every underlying backend for those that want configurability and raw access to the underlying libraries. In short, stoke is the best of PyTorch Lightning Accelerators disconnected from the rest of PyTorch Lightning. Write whatever PyTorch code you want, but leave device and backend context switching to stoke.
Supports:

  • Devices: CPU, GPU, multi-GPU
  • Distributed: DDP, Horovod, deepspeed (via DDP)
  • Mixed-Precision: AMP, Nvidia Apex, deepspeed (custom Apex-like backend)
  • Extensions: fairscale (Optimizer State Sharding and Sharded DDP), deepspeed (ZeRO Stage 0-3, etc.)
Model Serving
Competes with:
Triton Inference Server

TensorFlow Serving is a flexible, high-performance serving system for TF models, designed for production environments.

  • Can serve multiple models, or multiple versions of the same model simultaneously.
  • Exposes both gRPC as well as HTTP inference endpoints.
  • Allows deployment of new model versions without changing any client code.
  • Supports canarying new versions and A/B testing experimental models.
  • Adds minimal latency to inference time due to efficient, low-overhead implementation.
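
Canarying a new model version usually means sending a small, fixed share of traffic to it while the rest stays on the stable version. The sketch below shows one common way to do that split deterministically by hashing a request identifier. It is a generic illustration, not how any server listed here implements it, and the `route` helper and identifiers are hypothetical.

```python
import hashlib


def route(request_id: str, canary_fraction: float) -> str:
    """Deterministically send a fixed fraction of requests to the canary version."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"


# Roughly 10% of hypothetical request ids land on the canary version.
decisions = [route(f"user-{i}", 0.10) for i in range(10_000)]
canary_share = decisions.count("canary") / len(decisions)
```

Hashing keeps each caller on the same version across requests, which makes A/B comparisons easier to interpret than a per-request random draw.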
Model Serving
Works well with:
Azure ML
Google CAIP
Competes with:
TensorFlow Serving
TorchServe

Triton Inference Server simplifies the deployment of AI models at scale in production.

  • Supports TensorFlow, TensorRT, PyTorch, ONNX Runtime, and custom framework backends.
  • Triton runs models concurrently on GPUs to maximize utilization, supports CPU-based inferencing, and offers advanced features like model ensemble and streaming inferencing.
  • Available as a Docker container, Triton integrates with Kubernetes for orchestration and scaling.
  • Can be used with cloud AI platforms like Azure ML and Google CAIP.
  • Triton exports Prometheus metrics for monitoring.
Experiment Tracking

Track and visualize all the pieces of your machine learning pipeline, from datasets to production models.

  • Quickly identify model regressions. Use W&B to visualize results in real time, all in a central dashboard.
  • Focus on the interesting ML. Spend less time manually tracking results in spreadsheets and text files.
  • Capture dataset versions with W&B Artifacts to identify how changing data affects your resulting models.
  • Reproduce any model, with saved code, hyperparameters, launch commands, input data, and resulting model weights.
Data Versioning

lakeFS is an open-source data lake management platform that transforms your object storage into a Git-like repository. lakeFS enables you to manage your data lake the way you manage your code. Run parallel pipelines for experimentation and CI/CD for your data.
Features:

  • Scalable: Version control data at exabyte scale.
  • Flexible: Run git operations like branch, commit, and merge over your data in any storage service.
  • Develop Faster: Zero copy branching for frictionless experimentation, easy collaboration.
  • Enable Clean Workflows: Use pre-commit & merge hooks for CI/CD workflows.
  • Resilient: Recover from data issues faster with revert capability.
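
The Git-like model over object storage can be pictured as branches that are just pointers to immutable snapshots of path-to-object mappings. The toy `Repo` class below is a hypothetical sketch of that idea, not the lakeFS API or its storage format. Creating a branch copies a pointer, and a commit copies only metadata, never the stored objects.

```python
class Repo:
    """Toy versioned object store: branches point at immutable snapshots."""

    def __init__(self):
        self.commits = {0: {}}        # commit id -> {path: object key}
        self.branches = {"main": 0}   # branch name -> commit id
        self._next_id = 1

    def branch(self, name, source):
        """Zero-copy branch: only the commit pointer is duplicated."""
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        """Record a new snapshot on a branch and return its commit id."""
        parent = self.commits[self.branches[branch]]
        commit_id = self._next_id
        self._next_id += 1
        self.commits[commit_id] = {**parent, **changes}
        self.branches[branch] = commit_id
        return commit_id

    def revert(self, branch, commit_id):
        """Move a branch back to an earlier snapshot."""
        self.branches[branch] = commit_id

    def read(self, branch):
        return self.commits[self.branches[branch]]


repo = Repo()
first = repo.commit("main", {"data/a.csv": "obj-1"})
repo.branch("experiment", "main")
repo.commit("experiment", {"data/a.csv": "obj-2"})
```

After these steps `main` still reads `obj-1` while `experiment` reads `obj-2`, so an experiment can change data without touching the main line.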