Training models
Overview
Training AI models on ICP provides developers with a decentralized infrastructure and simplifies the setup process. ICP’s built-in computation and storage capabilities remove the need for external hardware. With GPU support on the roadmap, ICP will soon further enhance its AI capabilities and allow developers to train complex models, including large-scale AI systems.
MotokoLearn
MotokoLearn is a machine learning toolkit designed for the Motoko programming language. It enables developers to train classical machine learning models such as classification trees, regression trees, and random forest classifiers. These models are well-suited for applications involving small- to medium-sized datasets and can handle common AI tasks efficiently without requiring high-end GPU hardware.
Benefits of onchain model training
MotokoLearn enables onchain training and inference of machine learning models, providing advantages for Web3 services. Onchain training inherits the security and verification guarantees of the Internet Computer protocol, eliminating the need for external Web2 providers or pre-compiled Wasm modules of pre-trained models. This also simplifies the overall dapp architecture, ensuring that the model and training process are fully decentralized.
Training example: CART classifier tree
MotokoLearn allows developers to implement and train a Classification and Regression Tree (CART) model. Below is an example of how a classification tree can be trained using the toolkit:
import Iter "mo:base/Iter"; // needed for Iter.range/Iter.toArray below
import mtkl "../motokolearn/src/Mtklearn/Mtklearn";
import data "../motokolearn/src/Mtklearn/Datasets";

actor {
  let seed = 123456789;
  let nsamples : Nat = 100;

  // Load the bundled wine dataset and pick a random subset of row indices.
  let alldata = data.wine_data;
  let pos_vec = mtkl.randomSample(0, alldata.size() - 1, nsamples, false, seed);

  // Split the rows into a training set and a held-out test set.
  let train = mtkl.rows(pos_vec, alldata);
  let test = mtkl.removeRows(pos_vec, alldata);

  // All columns except the last are features; the last column is the label.
  let xcols = Iter.toArray(Iter.range(0, mtkl.transpose(train).size() - 2));
  let ycol = mtkl.transpose(train).size() - 1;

  // Extract features and labels for the training set...
  let xtrain = mtkl.cols(xcols, train);
  let yaux = mtkl.transpose(mtkl.cols([ycol], train))[0];
  let ytrain = mtkl.dataMemberVectorToTextVector(yaux);

  // ...and for the test set, kept for evaluating the trained tree.
  let xtest = mtkl.cols(xcols, test);
  let yauxtest = mtkl.transpose(mtkl.cols([ycol], test))[0];
  let ytest = mtkl.dataMemberVectorToTextVector(yauxtest);

  switch (ytrain) {
    case (#ok(yvec)) {
      let y_uniques = mtkl.uniquesText(yvec);
      let col_ids = Iter.toArray(Iter.range(0, xcols.size() - 1));
      // Fit the classification tree on the training features and labels.
      let ret_tree = mtkl.fitClassification(xtrain, yvec, 0, y_uniques, 3, 10, col_ids, seed);
    };
    // Label conversion can fail; handle the error case as appropriate.
    case (#err(_)) {};
  };
};
More examples can be found in the MotokoLearn GitHub repository, including a CART regression tree, a random forest classifier, and a random forest regressor, which are effective algorithms for many machine learning problems.
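For illustration only, a random forest classifier might be fit as a continuation of the example above, reusing the prepared xtrain, yvec, y_uniques, col_ids, and seed values. Note that fitRandomForestClassification and the ntrees parameter below are assumptions modeled on the fitClassification call, not a confirmed MotokoLearn API; consult the repository for the actual signature.

  // Hypothetical sketch: the function name and the ntrees parameter are
  // assumptions; see the MotokoLearn repository for the real API.
  let ntrees : Nat = 50; // assumed ensemble size
  let forest = mtkl.fitRandomForestClassification(
    xtrain, yvec, 0, y_uniques, ntrees, 3, 10, col_ids, seed
  );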
Benefits of using classical algorithms
Classical algorithms, like those available in MotokoLearn, are effective for many AI tasks. They require less data and fewer computational resources than complex models such as neural networks, which makes them accessible to a broader range of developers. For example, random forest models and regression trees can handle predictive tasks without needing massive datasets or specialized hardware. This makes the Internet Computer a suitable environment for training these models.
Data considerations for training models
The success of AI model training depends heavily on the quality of the data used. High-quality datasets are essential, and the number of samples required varies with the algorithm: simpler models may need as few as 100 samples, while more advanced AI models can require terabytes of data. The Internet Computer facilitates secure data storage in canisters, allowing developers to maintain full control over their datasets while benefiting from the platform's security and decentralization.
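As a minimal sketch of this pattern, a dataset can be kept in a stable variable inside a canister so that it persists across upgrades. The DatasetStore actor, the Row type, and the method names below are illustrative choices for this sketch, not part of MotokoLearn:

import Array "mo:base/Array";

actor DatasetStore {
  // Illustrative row type: a feature vector plus a text label.
  public type Row = { features : [Float]; label : Text };

  // Stable storage keeps the dataset across canister upgrades.
  stable var dataset : [Row] = [];

  // Append a batch of rows uploaded by the data owner.
  public func addRows(rows : [Row]) : async Nat {
    dataset := Array.append(dataset, rows);
    dataset.size();
  };

  // Return the stored samples, e.g., for onchain training.
  public query func getRows() : async [Row] {
    dataset;
  };
};

In practice, an upload method like addRows would also check the caller's identity before accepting data, so that only the dataset owner can modify the stored samples.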
Why focus on smaller data problems?
Many real-world machine learning applications involve small to medium datasets that do not require complex models like neural networks or large GPU clusters. Problems involving heterogeneous "tabular" data are often best solved using ensemble methods such as boosted trees. For example, several Kaggle challenges, as well as many consulting projects involving datasets under 100MiB, demonstrate that simpler algorithms like random forests or decision trees are often more effective than deep neural networks on such data. MotokoLearn targets these smaller-scale use cases and enables efficient model training directly onchain.