Classification
- PyRapidML.classification.initializer(data: pandas.core.frame.DataFrame, target: str, train_size: float = 0.7, test_data: Optional[pandas.core.frame.DataFrame] = None, preprocess: bool = True, imputation_type: str = 'simple', iterative_imputation_iters: int = 5, categorical_features: Optional[List[str]] = None, categorical_imputation: str = 'constant', categorical_iterative_imputer: Union[str, Any] = 'lightgbm', ordinal_features: Optional[Dict[str, list]] = None, high_cardinality_features: Optional[List[str]] = None, high_cardinality_method: str = 'frequency', numeric_features: Optional[List[str]] = None, numeric_imputation: str = 'mean', numeric_iterative_imputer: Union[str, Any] = 'lightgbm', date_features: Optional[List[str]] = None, ignore_features: Optional[List[str]] = None, normalize: bool = False, normalize_method: str = 'zscore', transformation: bool = False, transformation_method: str = 'yeo-johnson', handle_unknown_categorical: bool = True, unknown_categorical_method: str = 'least_frequent', pca: bool = False, pca_method: str = 'linear', pca_components: Optional[float] = None, ignore_low_variance: bool = False, combine_rare_levels: bool = False, rare_level_threshold: float = 0.1, bin_numeric_features: Optional[List[str]] = None, remove_outliers: bool = False, outliers_threshold: float = 0.05, remove_multicollinearity: bool = False, multicollinearity_threshold: float = 0.9, remove_perfect_collinearity: bool = True, create_clusters: bool = False, cluster_iter: int = 20, polynomial_features: bool = False, polynomial_degree: int = 2, trigonometry_features: bool = False, polynomial_threshold: float = 0.1, group_features: Optional[List[str]] = None, group_names: Optional[List[str]] = None, feature_selection: bool = False, feature_selection_threshold: float = 0.8, feature_selection_method: str = 'classic', feature_interaction: bool = False, feature_ratio: bool = False, interaction_threshold: float = 0.01, fix_imbalance: bool = False, fix_imbalance_method: Optional[Any] = None, data_split_shuffle: bool = True, data_split_stratify: Union[bool, List[str]] = False, fold_strategy: Union[str, Any] = 'stratifiedkfold', fold: int = 10, fold_shuffle: bool = False, fold_groups: Optional[Union[str, pandas.core.frame.DataFrame]] = None, n_jobs: Optional[int] = -1, use_gpu: bool = False, custom_pipeline: Optional[Union[Any, Tuple[str, Any], List[Any], List[Tuple[str, Any]]]] = None, html: bool = True, session_id: Optional[int] = None, log_experiment: bool = False, experiment_name: Optional[str] = None, log_plots: Union[bool, list] = False, log_profile: bool = False, log_data: bool = False, silent: bool = False, verbose: bool = True, profile: bool = False, profile_kwargs: Optional[Dict[str, Any]] = None)
This function initializes the training environment and creates the transformation pipeline. The initializer function must be called before executing any other function. It takes two mandatory parameters: data and target. All other parameters are optional.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
- data: pandas.DataFrame
Shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.
- target: str
Name of the target column to be passed in as a string. The target variable can be either binary or multiclass.
- train_size: float, default = 0.7
Proportion of the dataset to be used for training and validation. Should be between 0.0 and 1.0.
- test_data: pandas.DataFrame, default = None
If not None, test_data is used as a hold-out set and the train_size parameter is ignored. test_data must be labelled, and the shape of data and test_data must match.
- preprocess: bool, default = True
When set to False, no transformations are applied except for train_test_split and custom transformations passed in the custom_pipeline param. Data must be ready for modeling (no missing values, no dates, categorical data encoding) when preprocess is set to False.
- imputation_type: str, default = ‘simple’
The type of imputation to use. Can be either ‘simple’ or ‘iterative’.
- iterative_imputation_iters: int, default = 5
Number of iterations. Ignored when imputation_type is not ‘iterative’.
- categorical_features: list of str, default = None
If the inferred data types are not correct or the silent param is set to True, categorical_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are categorical.
- categorical_imputation: str, default = ‘constant’
Missing values in categorical features are imputed with a constant ‘not_available’ value. The other available option is ‘mode’.
- categorical_iterative_imputer: str, default = ‘lightgbm’
Estimator for iterative imputation of missing values in categorical features. Ignored when imputation_type is not ‘iterative’.
- ordinal_features: dict, default = None
Encode categorical features as ordinal. For example, a categorical feature with ‘low’, ‘medium’, ‘high’ values where low < medium < high can be passed as ordinal_features = { ‘column_name’ : [‘low’, ‘medium’, ‘high’] }.
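As a rough illustration of what ordinal encoding means here, the sketch below maps each level to its position in the ordered list. The helper name encode_ordinal and the sample column are hypothetical, and this is not necessarily how PyRapidML encodes the values internally.

```python
def encode_ordinal(values, ordered_levels):
    """Map each categorical value to its rank in ordered_levels."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]

# Hypothetical column where low < medium < high
sizes = ["low", "high", "medium", "low"]
encoded = encode_ordinal(sizes, ["low", "medium", "high"])
print(encoded)
```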
- high_cardinality_features: list of str, default = None
When categorical features contain many levels, they can be compressed into fewer levels using this parameter. It takes a list of strings with column names that are categorical.
- high_cardinality_method: str, default = ‘frequency’
Categorical features with high cardinality are replaced with the frequency of values in each level occurring in the training dataset. The other available method is ‘clustering’, which trains the K-Means clustering algorithm on the statistical attributes of the training data and replaces the original value of the feature with the cluster label. The number of clusters is determined by optimizing the Calinski-Harabasz and Silhouette criteria.
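The ‘frequency’ method can be pictured with a small pure-Python sketch. It assumes raw counts learned from the training data and a count of 0 for levels never seen in training; both choices, along with the helper name frequency_encode, are assumptions made for illustration rather than documented PyRapidML behaviour.

```python
from collections import Counter

def frequency_encode(train_values, new_values):
    """Replace each level with its count in the training data (0 if unseen)."""
    counts = Counter(train_values)
    return [counts.get(v, 0) for v in new_values]

# Hypothetical high-cardinality column
train_city = ["NY", "LA", "NY", "SF", "NY", "LA"]
encoded_train = frequency_encode(train_city, train_city)
encoded_new = frequency_encode(train_city, ["LA", "Boston"])
print(encoded_train, encoded_new)
```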
- numeric_features: list of str, default = None
If the inferred data types are not correct or the silent param is set to True, numeric_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are numeric.
- numeric_imputation: str, default = ‘mean’
Missing values in numeric features are imputed with the ‘mean’ value of the feature in the training dataset. The other available options are ‘median’ and ‘zero’.
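The three simple strategies can be sketched as follows. The helper impute_numeric is hypothetical and represents missing values as None; the point is only that the fill value is learned from the training data and then reused.

```python
import statistics

def impute_numeric(train_values, values, method="mean"):
    """Fill None entries using a statistic learned from the training values."""
    observed = [v for v in train_values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    elif method == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

# Hypothetical numeric column with one missing value
train_age = [20, None, 30, 100]
for method in ("mean", "median", "zero"):
    print(method, impute_numeric(train_age, train_age, method))
```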
- numeric_iterative_imputer: str, default = ‘lightgbm’
Estimator for iterative imputation of missing values in numeric features. Ignored when imputation_type is set to ‘simple’.
- date_features: list of str, default = None
If the inferred data types are not correct or the silent param is set to True, date_features param can be used to overwrite or define the data types. It takes a list of strings with column names that are DateTime.
- ignore_features: list of str, default = None
ignore_features param can be used to ignore features during model training. It takes a list of strings with column names that are to be ignored.
- normalize: bool, default = False
When set to True, it transforms the numeric features by scaling them to a given range. The type of scaling is defined by the normalize_method parameter.
- normalize_method: str, default = ‘zscore’
Defines the method for scaling. By default, the normalize method is set to ‘zscore’. The standard zscore is calculated as z = (x - u) / s, where u is the mean and s is the standard deviation of the feature. Ignored when normalize is not True. The other options are:
minmax: scales and translates each feature individually such that it is in the range of 0 - 1.
maxabs: scales and translates each feature individually such that the maximal absolute value of each feature will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
robust: scales and translates each feature according to the Interquartile range. When the dataset contains outliers, robust scaler often gives better results.
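The four scaling methods can be compared on a single feature with the standard library alone. This is a sketch under assumptions: the population standard deviation is used for zscore, and the robust variant centres on the median before dividing by the interquartile range, which mirrors common scikit-learn scalers but is not confirmed by the text above.

```python
import statistics

def zscore(xs):
    u = statistics.fmean(xs)
    s = statistics.pstdev(xs)  # population standard deviation (assumption)
    return [(x - u) / s for x in xs]

def minmax(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def maxabs(xs):
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]  # no centring, so zeros stay zero

def robust(xs):
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - q2) / (q3 - q1) for x in xs]  # centre on median, scale by IQR

feature = [1.0, 2.0, 3.0, 4.0, 5.0]
for scaler in (zscore, minmax, maxabs, robust):
    print(scaler.__name__, scaler(feature))
```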
- transformation: bool, default = False
When set to True, it applies a power transform to make the data more Gaussian-like. The type of transformation is defined by the transformation_method parameter.
- transformation_method: str, default = ‘yeo-johnson’
Defines the method for transformation. By default, the transformation method is set to ‘yeo-johnson’. The other available option for transformation is ‘quantile’. Ignored when transformation is not True.
- handle_unknown_categorical: bool, default = True
When set to True, unknown categorical levels in unseen data are replaced by the most or least frequent level as learned in the training dataset.
- unknown_categorical_method: str, default = ‘least_frequent’
Method used to replace unknown categorical levels in unseen data. Method can be set to ‘least_frequent’ or ‘most_frequent’.
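A minimal sketch of the two replacement methods, assuming level frequencies are counted on the training data only. The helper replace_unknown and the sample column are hypothetical.

```python
from collections import Counter

def replace_unknown(train_values, new_values, method="least_frequent"):
    """Swap levels unseen in training for the least or most frequent known level."""
    if method not in ("least_frequent", "most_frequent"):
        raise ValueError(f"unknown method: {method}")
    ranked = Counter(train_values).most_common()  # (level, count), most frequent first
    fallback = ranked[-1][0] if method == "least_frequent" else ranked[0][0]
    known = {level for level, _ in ranked}
    return [v if v in known else fallback for v in new_values]

# Hypothetical training column; 'purple' never appears in it
train_color = ["red", "red", "red", "blue", "blue", "green"]
print(replace_unknown(train_color, ["red", "purple"]))
print(replace_unknown(train_color, ["red", "purple"], method="most_frequent"))
```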
- pca: bool, default = False
When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the pca_method parameter.
- pca_method: str, default = ‘linear’
The ‘linear’ method uses Singular Value Decomposition. Other options are:
kernel: dimensionality reduction through the use of an RBF kernel.
incremental: replacement for ‘linear’ pca when the dataset is too large.
- pca_components: int or float, default = None
Number of components to keep. If pca_components is a float, it is treated as a target percentage for information retention. When pca_components is an integer, it is treated as the number of features to be kept. pca_components must be less than the original number of features. Ignored when pca is not True.
- ignore_low_variance: bool, default = False
When set to True, all categorical features with insignificant variances are removed from the data. The variance is calculated using the ratio of unique values to the number of samples, and the ratio of the most common value to the frequency of the second most common value.
- combine_rare_levels: bool, default = False
When set to True, levels in categorical features whose frequency percentile falls below a certain threshold are combined into a single level.
- rare_level_threshold: float, default = 0.1
Percentile distribution below which rare categories are combined. Ignored when combine_rare_levels is not True.
- bin_numeric_features: list of str, default = None
To convert numeric features into categorical, the bin_numeric_features parameter can be used. It takes a list of strings with column names to be discretized. It does so by using the ‘sturges’ rule to determine the number of clusters and then applying the KMeans algorithm. Original values of the feature are then replaced by the cluster label.
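Sturges' rule is commonly stated as k = ceil(log2(n)) + 1 for n samples. The sketch below computes only that bin count; whether PyRapidML uses exactly this form is an assumption, and the subsequent KMeans step is not shown.

```python
import math

def sturges_bins(n_samples):
    """Sturges' rule: k = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n_samples)) + 1

for n in (64, 100, 1000):
    print(n, sturges_bins(n))
```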
- remove_outliers: bool, default = False
When set to True, outliers from the training data are removed using the Singular Value Decomposition.
- outliers_threshold: float, default = 0.05
The percentage of outliers to be removed from the training dataset. Ignored when remove_outliers is not True.
- remove_multicollinearity: bool, default = False
When set to True, features with inter-correlations higher than the defined threshold are removed. When two features are highly correlated with each other, the feature that is less correlated with the target variable is removed. Only considers numeric features.
- multicollinearity_threshold: float, default = 0.9
Threshold for correlated features. Ignored when remove_multicollinearity is not True.
- remove_perfect_collinearity: bool, default = True
When set to True, perfect collinearity (features with correlation = 1) is removed from the dataset. When two features are 100% correlated, one of them is randomly removed from the dataset.
- create_clusters: bool, default = False
When set to True, an additional feature is created in training dataset where each instance is assigned to a cluster. The number of clusters is determined by optimizing Calinski-Harabasz and Silhouette criterion.
- cluster_iter: int, default = 20
Number of iterations for creating clusters. Each iteration represents a cluster size. Ignored when create_clusters is not True.
- polynomial_features: bool, default = False
When set to True, new features are derived using existing numeric features.
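For a two-dimensional sample [a, b], degree-2 polynomial features are [1, a, b, a^2, ab, b^2]. The sketch below reproduces that expansion with the standard library; the helper name polynomial_expand is hypothetical and the code only illustrates the idea.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_expand(sample, degree=2):
    """Return all monomials of the inputs up to the given degree, bias term first."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(sample, d):
            features.append(prod(combo))  # prod(()) == 1 gives the bias term
    return features

# With a = 2 and b = 3 the order is [1, a, b, a^2, ab, b^2]
print(polynomial_expand([2, 3], degree=2))
```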
- polynomial_degree: int, default = 2
Degree of polynomial features. For example, if an input sample is two dimensional and of the form [a, b], the polynomial features with degree = 2 are: [1, a, b, a^2, ab, b^2]. Ignored when polynomial_features is not True.
- trigonometry_features: bool, default = False
When set to True, new features are derived using existing numeric features.
- polynomial_threshold: float, default = 0.1
When polynomial_features or trigonometry_features is True, new features are derived from the existing numeric features. This may sometimes result in too large a feature space. The polynomial_threshold parameter can be used to deal with this problem. It does so by using a combination of Random Forest, AdaBoost and Linear correlation. All derived features that fall within the percentile distribution are kept and the rest of the features are removed.
- group_features: list or list of list, default = None
When the dataset contains features with related characteristics, group_features parameter can be used for feature extraction. It takes a list of strings with column names that are related.
- group_names: list, default = None
Group names to be used in naming new features. When the length of group_names does not match the length of group_features, new features are named sequentially group_1, group_2, etc. It is ignored when group_features is None.
- feature_selection: bool, default = False
When set to True, a subset of features is selected using a combination of various permutation importance techniques including Random Forest, Adaboost and Linear correlation with the target variable. The size of the subset is dependent on the feature_selection_threshold parameter.
- feature_selection_threshold: float, default = 0.8
Threshold value used for feature selection. When polynomial_features or feature_interaction is True, it is recommended to keep the threshold low to avoid large feature spaces. Setting a very low value may be efficient but could result in under-fitting.
- feature_selection_method: str, default = ‘classic’
Algorithm for feature selection. The ‘classic’ method uses permutation feature importance techniques. The other possible value is ‘boruta’, which uses the Boruta algorithm for feature selection.
- feature_interaction: bool, default = False
When set to True, new features are created by interacting (a * b) all the numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.
- feature_ratio: bool, default = False
When set to True, new features are created by calculating the ratios (a / b) between all numeric variables in the dataset. This feature is not scalable and may not work as expected on datasets with large feature space.
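To see why interaction and ratio features do not scale well, the sketch below builds them for a single row: n numeric columns yield n(n-1)/2 products and up to n(n-1) ratios. The generated feature names and the rule of skipping zero denominators are assumptions for illustration.

```python
from itertools import combinations, permutations

def interaction_features(row):
    """Pairwise products a * b for every unordered pair of numeric columns."""
    return {f"{a}_multiply_{b}": row[a] * row[b] for a, b in combinations(row, 2)}

def ratio_features(row):
    """Pairwise ratios a / b for every ordered pair, skipping zero denominators."""
    return {
        f"{a}_divide_{b}": row[a] / row[b]
        for a, b in permutations(row, 2)
        if row[b] != 0
    }

row = {"a": 2.0, "b": 4.0, "c": 8.0}
interactions = interaction_features(row)
ratios = ratio_features(row)
print(len(interactions), len(ratios))
```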
- interaction_threshold: float, default = 0.01
Similar to polynomial_threshold, it is used to compress a sparse matrix of newly created features through interaction. Features whose importance based on the combination of Random Forest, AdaBoost and Linear correlation falls within the percentile of the defined threshold are kept in the dataset. Remaining features are dropped before further processing.
- fix_imbalance: bool, default = False
When training dataset has unequal distribution of target class it can be balanced using this parameter. When set to True, SMOTE (Synthetic Minority Over-sampling Technique) is applied by default to create synthetic datapoints for minority class.
- fix_imbalance_method: obj, default = None
When fix_imbalance is True, an ‘imblearn’ compatible object with a ‘fit_resample’ method can be passed. When set to None, ‘imblearn.over_sampling.SMOTE’ is used.
- data_split_shuffle: bool, default = True
When set to False, prevents shuffling of rows during ‘train_test_split’.
- data_split_stratify: bool or list, default = False
Controls stratification during ‘train_test_split’. When set to True, will stratify by target column. To stratify on any other columns, pass a list of column names. Ignored when data_split_shuffle is False.
- fold_strategy: str or sklearn CV generator object, default = ‘stratifiedkfold’
Choice of cross validation strategy. Possible values are:
‘kfold’
‘stratifiedkfold’
‘groupkfold’
‘timeseries’
a custom CV generator object compatible with scikit-learn.
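The default ‘stratifiedkfold’ strategy keeps class proportions roughly equal in every fold. A simplified pure-Python sketch of that idea follows; it deals each class's rows round-robin across the folds without shuffling and is not the scikit-learn implementation.

```python
from collections import defaultdict

def stratified_kfold(labels, n_splits=3):
    """Yield (train_idx, test_idx) pairs with class proportions kept roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal each class's rows round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx

# Hypothetical imbalanced target: six rows of class 0, three of class 1
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
splits = list(stratified_kfold(labels, n_splits=3))
for train_idx, test_idx in splits:
    print(test_idx, [labels[i] for i in test_idx])
```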
- fold: int, default = 10
Number of folds to be used in cross validation. Must be at least 2. This is a global setting that can be overwritten at function level by using the fold parameter. Ignored when fold_strategy is a custom object.
- fold_shuffle: bool, default = False
Controls the shuffle parameter of CV. Only applicable when fold_strategy is ‘kfold’ or ‘stratifiedkfold’. Ignored when fold_strategy is a custom object.
- fold_groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when ‘GroupKFold’ is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- n_jobs: int, default = -1
The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs to None.
- use_gpu: bool or str, default = False
When set to True, it will use GPU for training with algorithms that support it, and fall back to CPU if they are unavailable. When set to ‘force’, it will only use GPU-enabled algorithms and raise exceptions when they are unavailable. When False, all algorithms are trained using CPU only.
GPU enabled algorithms:
Extreme Gradient Boosting, requires no further installation
CatBoost Classifier, requires no further installation (GPU is only enabled when data > 50,000 rows)
Light Gradient Boosting Machine, requires GPU installation https://lightgbm.readthedocs.io/en/latest/GPU-Tutorial.html
Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, Support Vector Machine, requires cuML >= 0.15 https://github.com/rapidsai/cuml
- custom_pipeline: (str, transformer) or list of (str, transformer), default = None
When passed, will append the custom transformers in the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All the custom transformations are applied after ‘train_test_split’ and before PyRapidML’s internal transformations.
- html: bool, default = True
When set to False, prevents runtime display of monitor. This must be set to False when the environment does not support IPython. For example, command line terminal, Databricks Notebook, Spyder and other similar IDEs.
- session_id: int, default = None
Controls the randomness of experiment. It is equivalent to ‘random_state’ in scikit-learn. When None, a pseudo random number is generated. This can be used for later reproducibility of the entire experiment.
- log_experiment: bool, default = False
When set to True, all metrics and parameters are logged on the MLflow server.
- experiment_name: str, default = None
Name of the experiment for logging. Ignored when log_experiment is not True.
- log_plots: bool or list, default = False
When set to True, certain plots are logged automatically in the MLflow server. To change the type of plots to be logged, pass a list containing plot IDs. Refer to the documentation of plot_model. Ignored when log_experiment is not True.
- log_profile: bool, default = False
When set to True, the data profile is logged on the MLflow server as an html file. Ignored when log_experiment is not True.
- log_data: bool, default = False
When set to True, the dataset is logged on the MLflow server as a csv file. Ignored when log_experiment is not True.
- silent: bool, default = False
Controls the confirmation input of data types when initializer is executed. When executing in completely automated mode or on a remote kernel, this must be True.
- verbose: bool, default = True
When set to False, the information grid is not printed.
- profile: bool, default = False
When set to True, an interactive EDA report is displayed.
- profile_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the ProfileReport method used to create the EDA report. Ignored if profile is False.
- Returns
Global variables that can be changed using the set_config function.
- PyRapidML.classification.comparing_models(include: Optional[List[Union[str, Any]]] = None, exclude: Optional[List[str]] = None, fold: Optional[Union[int, Any]] = None, round: int = 4, cross_validation: bool = True, sort: str = 'Accuracy', n_select: int = 1, budget_time: Optional[float] = None, turbo: bool = True, errors: str = 'ignore', fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True) -> Union[Any, List[Any]]
This function trains and evaluates the performance of all estimators available in the model library using cross validation. The output of this function is a score grid with average cross validated scores. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> best_model = comparing_models()
- include: list of str or scikit-learn compatible object, default = None
To train and evaluate select models, a list containing model IDs or scikit-learn compatible objects can be passed in the include param. To see a list of all models available in the model library use the models function.
- exclude: list of str, default = None
To omit certain models from training and evaluation, pass a list containing model IDs in the exclude parameter. To see a list of all models available in the model library use the models function.
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- cross_validation: bool, default = True
When set to False, metrics are evaluated on the holdout set. The fold param is ignored when cross_validation is set to False.
- sort: str, default = ‘Accuracy’
The sort order of the score grid. It also accepts custom metrics that are added through the add_metric function.
- n_select: int, default = 1
Number of top_n models to return. For example, to select top 3 models use n_select = 3.
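How sort and n_select interact can be sketched as ranking a score grid and keeping the first n entries. The scores below are made-up illustrative numbers and select_top is a hypothetical helper; it returns a single item when n_select is 1 and a list otherwise, matching the return description of this function.

```python
def select_top(scores, sort="Accuracy", n_select=1):
    """Rank model IDs by the chosen metric (higher is better) and keep the top n."""
    ranked = sorted(scores, key=lambda name: scores[name][sort], reverse=True)
    top = ranked[:n_select]
    return top[0] if n_select == 1 else top

# Made-up score grid for illustration only
scores = {
    "lr": {"Accuracy": 0.82, "AUC": 0.90},
    "dt": {"Accuracy": 0.78, "AUC": 0.75},
    "rf": {"Accuracy": 0.85, "AUC": 0.88},
}
print(select_top(scores))
print(select_top(scores, sort="AUC", n_select=2))
```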
- budget_time: int or float, default = None
If not None, will terminate execution of the function after budget_time minutes have passed and return results up to that point.
- turbo: bool, default = True
When set to True, it excludes estimators with longer training times. To see which algorithms are excluded use the models function.
- errors: str, default = ‘ignore’
When set to ‘ignore’, will skip the model with exceptions and continue. If ‘raise’, will break the function when exceptions are raised.
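The difference between the two error modes can be sketched as a training loop that either skips a failing model or re-raises its exception. The names train_all, trainers, and failing are hypothetical stand-ins, not part of PyRapidML.

```python
def train_all(trainers, errors="ignore"):
    """Run each training callable; skip or re-raise failures depending on errors."""
    trained = {}
    for name, train in trainers.items():
        try:
            trained[name] = train()
        except Exception:
            if errors == "raise":
                raise
            # errors == "ignore": skip this model and continue with the next one
    return trained

def failing():
    raise RuntimeError("model failed to fit")

trainers = {"ok_model": lambda: "fitted", "bad_model": failing}
result = train_all(trainers, errors="ignore")
print(result)
```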
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when ‘GroupKFold’ is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- verbose: bool, default = True
Score grid is not printed when verbose is set to False.
- Returns
Trained model or list of trained models, depending on the n_select param.
Warning
Changing turbo parameter to False may result in very high training times with datasets exceeding 10,000 rows.
AUC for estimators that do not support ‘predict_proba’ is shown as 0.0000.
No models are logged in MLflow when the cross_validation parameter is False.
- PyRapidML.classification.creating_model(estimator: Union[str, Any], fold: Optional[Union[int, Any]] = None, round: int = 4, cross_validation: bool = True, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True, **kwargs) -> Any
This function trains and evaluates the performance of a given estimator using cross validation. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions. All the available models can be accessed using the models function.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
- estimator: str or scikit-learn compatible object
ID of an estimator available in model library or pass an untrained model object consistent with scikit-learn API. Estimators available in the model library (ID - Name):
‘lr’ - Logistic Regression
‘knn’ - K Neighbors Classifier
‘nb’ - Naive Bayes
‘dt’ - Decision Tree Classifier
‘svm’ - SVM - Linear Kernel
‘rbfsvm’ - SVM - Radial Kernel
‘gpc’ - Gaussian Process Classifier
‘mlp’ - MLP Classifier
‘ridge’ - Ridge Classifier
‘rf’ - Random Forest Classifier
‘qda’ - Quadratic Discriminant Analysis
‘ada’ - Ada Boost Classifier
‘gbc’ - Gradient Boosting Classifier
‘lda’ - Linear Discriminant Analysis
‘et’ - Extra Trees Classifier
‘xgboost’ - Extreme Gradient Boosting
‘lightgbm’ - Light Gradient Boosting Machine
‘catboost’ - CatBoost Classifier
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- cross_validation: bool, default = True
When set to False, metrics are evaluated on the holdout set. The fold param is ignored when cross_validation is set to False.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- verbose: bool, default = True
Score grid is not printed when verbose is set to False.
- kwargs:
Additional keyword arguments to pass to the estimator.
- Returns
Trained Model
Warning
AUC for estimators that do not support ‘predict_proba’ is shown as 0.0000.
Models are not logged on the MLflow server when the cross_validation param is set to False.
- PyRapidML.classification.tuning_model(estimator, fold: Optional[Union[int, Any]] = None, round: int = 4, n_iter: int = 10, custom_grid: Optional[Union[Dict[str, list], Any]] = None, optimize: str = 'Accuracy', custom_scorer=None, search_library: str = 'scikit-learn', search_algorithm: Optional[str] = None, early_stopping: Any = False, early_stopping_max_iters: int = 10, choose_better: bool = False, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, return_tuner: bool = False, verbose: bool = True, tuner_verbose: Union[int, bool] = True, **kwargs) -> Any
This function tunes the hyperparameters of a given estimator. The output of this function is a score grid with CV scores by fold of the best selected model based on the optimize parameter. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> tuned_lr = tuning_model(lr)
- estimator: scikit-learn compatible object
Trained model object
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- n_iter: int, default = 10
Number of iterations in the grid search. Increasing ‘n_iter’ may improve model performance but also increases the training time.
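The effect of n_iter can be illustrated with a small random search: each iteration samples one configuration from the grid and the best score seen so far is kept. The grid, the toy_score objective, and the helper random_search are hypothetical stand-ins for a cross-validated metric and do not reflect the tuner PyRapidML actually runs.

```python
import random

def random_search(grid, score_fn, n_iter=10, seed=0):
    """Sample n_iter configurations from the grid and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}

def toy_score(params):
    # Toy objective standing in for a cross-validated metric
    return -abs(params["C"] - 1) + (0.1 if params["penalty"] == "l2" else 0.0)

best_params, best_score = random_search(grid, toy_score, n_iter=10)
print(best_params, best_score)
```

With a fixed seed, a larger n_iter evaluates a superset of the configurations tried by a smaller one, so the best score can only stay the same or improve, at the cost of more evaluations.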
- custom_grid: dictionary, default = None
To define a custom search space for hyperparameters, pass a dictionary with parameter names and values to be iterated. Custom grids must be in a format supported by the defined search_library.
- optimize: str, default = ‘Accuracy’
Metric name to be evaluated for hyperparameter tuning. It also accepts custom metrics that are added through the add_metric function.
- custom_scorer: object, default = None
A custom scoring strategy can be passed to tune hyperparameters of the model. It must be created using sklearn.metrics.make_scorer. It is equivalent to adding a custom metric using the add_metric function and passing the name of the custom metric in the optimize parameter. Will be deprecated in future.
- search_library: str, default = ‘scikit-learn’
The search library used for tuning hyperparameters. Possible values:
- ‘scikit-learn’ - default, requires no further installation
- ‘scikit-optimize’ - pip install scikit-optimize
- ‘tune-sklearn’ - pip install tune-sklearn ray[tune]
- ‘optuna’ - pip install optuna
- search_algorithm: str, default = None
The search algorithm depends on the search_library parameter. Some search algorithms require additional libraries to be installed. If None, will use the search library-specific default algorithm.
- ‘scikit-learn’ possible values:
‘random’ : random grid search (default)
‘grid’ : grid search
- ‘scikit-optimize’ possible values:
‘bayesian’ : Bayesian search (default)
- ‘tune-sklearn’ possible values:
‘random’ : random grid search (default)
‘grid’ : grid search
‘bayesian’ : pip install scikit-optimize
‘hyperopt’ : pip install hyperopt
‘optuna’ : pip install optuna
‘bohb’ : pip install hpbandster ConfigSpace
- ‘optuna’ possible values:
‘random’ : randomized search
‘tpe’ : Tree-structured Parzen Estimator search (default)
- early_stopping: bool or str or object, default = False
Use early stopping to stop fitting to a hyperparameter configuration if it performs poorly. Ignored when search_library is scikit-learn, or if the estimator does not have a ‘partial_fit’ attribute. If False or None, early stopping will not be used. Can be either an object accepted by the search library or one of the following:
‘asha’ for Asynchronous Successive Halving Algorithm
‘hyperband’ for Hyperband
‘median’ for Median Stopping Rule
- early_stopping_max_iters: int, default = 10
Maximum number of epochs to run for each sampled configuration. Ignored if early_stopping is False or None.
- choose_better: bool, default = False
When set to True, the returned object is always better performing. The metric used for comparison is defined by the optimize parameter.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the tuner.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- return_tuner: bool, default = False
When set to True, will return a tuple of (model, tuner_object).
- verbose: bool, default = True
Score grid is not printed when verbose is set to False.
- tuner_verbose: bool or int, default = True
If True or above 0, will print messages from the tuner. Higher values print more messages. Ignored when the verbose parameter is False.
- kwargs:
Additional keyword arguments to pass to the optimizer.
- Returns
Trained Model and Optional Tuner Object when return_tuner is True.
Warning
Using ‘grid’ as search_algorithm may result in very long computation. Only recommended with smaller search spaces that can be defined in the custom_grid parameter.
search_library ‘tune-sklearn’ does not support GPU models.
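To see why ‘grid’ can be slow, the sketch below counts the combinations in a small search space and contrasts exhaustive enumeration with a fixed number of random draws. The grid contents and the helper names `grid_configs` and `random_configs` are illustrative assumptions, not PyRapidML defaults or API.

```python
import itertools
import random

# Hypothetical search space; keys and values are illustrative only.
example_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}


def grid_configs(grid):
    """Enumerate every combination, as an exhaustive grid search would."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]


def random_configs(grid, n_iter, seed=0):
    """Draw n_iter configurations, choosing each value independently at random."""
    rng = random.Random(seed)
    return [{key: rng.choice(values) for key, values in grid.items()} for _ in range(n_iter)]


all_configs = grid_configs(example_grid)
sampled_configs = random_configs(example_grid, n_iter=10)
print(len(all_configs), "grid configurations vs", len(sampled_configs), "random draws")
```

Each configuration is also fitted once per CV fold, so the cost of a grid grows multiplicatively with every value added to any parameter.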
- PyRapidML.classification.ensemble_model(estimator, method: str = 'Bagging', fold: Optional[Union[int, Any]] = None, n_estimators: int = 10, round: int = 4, choose_better: bool = False, optimize: str = 'Accuracy', fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True) Any¶
This function ensembles a given estimator. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> dt = creating_model('dt')
>>> bagged_dt = ensemble_model(dt, method = 'Bagging')
- estimator: scikit-learn compatible object
Trained model object
- method: str, default = ‘Bagging’
Method for ensembling base estimator. It can be ‘Bagging’ or ‘Boosting’.
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- n_estimators: int, default = 10
The number of base estimators in the ensemble. In case of perfect fit, the learning procedure is stopped early.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- choose_better: bool, default = False
When set to True, the returned object is always the better-performing of the input estimator and the ensembled model. The metric used for comparison is defined by the optimize parameter.
- optimize: str, default = ‘Accuracy’
Metric to compare for model selection when choose_better is True.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- verbose: bool, default = True
Score grid is not printed when verbose is set to False.
- Returns
Trained Model
Warning
Method ‘Boosting’ is not supported for estimators that do not have ‘class_weights’ or ‘predict_proba’ attributes.
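As a rough illustration of what ‘Bagging’ means, the sketch below fits several copies of a toy base learner on bootstrap samples and combines them by majority vote. It is a pure-Python stand-in under simplifying assumptions (a one-feature threshold learner, binary labels); `fit_stump`, `bagging_fit` and `bagging_predict` are hypothetical names and do not reflect how ensemble_model is implemented.

```python
import random


def fit_stump(rows):
    """Toy base learner: pick the threshold on a 1-D feature with the fewest errors."""
    best_errors, best_threshold = None, None
    for threshold, _ in rows:
        errors = sum(int(x >= threshold) != y for x, y in rows)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return lambda x: int(x >= best_threshold)


def bagging_fit(rows, n_estimators=10, seed=0):
    """Fit each base learner on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        sample = [rng.choice(rows) for _ in rows]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    """Majority vote over the base learners (ties go to class 1)."""
    votes = sum(model(x) for model in models)
    return int(2 * votes >= len(models))


train_rows = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
bagged = bagging_fit(train_rows, n_estimators=10)
print(bagging_predict(bagged, 0.05), bagging_predict(bagged, 0.95))
```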
- PyRapidML.classification.blend_models(estimator_list: list, fold: Optional[Union[int, Any]] = None, round: int = 4, choose_better: bool = False, optimize: str = 'Accuracy', method: str = 'auto', weights: Optional[List[float]] = None, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True) Any¶
This function trains a Soft Voting / Majority Rule classifier for select models passed in the estimator_list parameter. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> top3 = comparing_models(n_select = 3)
>>> blender = blend_models(top3)
- estimator_list: list of scikit-learn compatible objects
List of trained model objects
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- choose_better: bool, default = False
When set to True, the returned object is always the better-performing of the input estimators and the blended model. The metric used for comparison is defined by the optimize parameter.
- optimize: str, default = ‘Accuracy’
Metric to compare for model selection when choose_better is True.
- method: str, default = ‘auto’
‘hard’ uses predicted class labels for majority rule voting. ‘soft’ predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers. The default value, ‘auto’, will try to use ‘soft’ and fall back to ‘hard’ if the former is not supported.
- weights: list, default = None
Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights when None.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- verbose: bool, default = True
Score grid is not printed when verbose is set to False.
- Returns
Trained Model
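The difference between ‘hard’ and ‘soft’ voting, and the effect of weights, can be sketched in a few lines of plain Python. The helper names `hard_vote` and `soft_vote`, the class labels and the probabilities below are illustrative assumptions; this is not the code blend_models runs.

```python
def hard_vote(labels, weights=None):
    """Weighted majority rule over predicted class labels."""
    weights = weights or [1.0] * len(labels)
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def soft_vote(probas, classes, weights=None):
    """Argmax of the (weighted) sums of predicted class probabilities."""
    weights = weights or [1.0] * len(probas)
    sums = [
        sum(weight * proba[i] for proba, weight in zip(probas, weights))
        for i in range(len(classes))
    ]
    return classes[sums.index(max(sums))]


classes = ["CH", "MM"]
# One row per estimator: [P(CH), P(MM)] for a single observation.
probas = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
labels = [classes[row.index(max(row))] for row in probas]

print(hard_vote(labels))                             # two of three estimators say MM
print(soft_vote(probas, classes))                    # the confident first estimator wins
print(soft_vote(probas, classes, weights=[1, 3, 3]))
```

The two schemes disagree here because one estimator is very confident while the other two are close to undecided, which is why soft voting is only advisable when the probabilities are well calibrated.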
- PyRapidML.classification.stack_models(estimator_list: list, meta_model=None, fold: Optional[Union[int, Any]] = None, round: int = 4, method: str = 'auto', restack: bool = True, choose_better: bool = False, optimize: str = 'Accuracy', fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True) Any¶
This function trains a meta model over select estimators passed in the estimator_list parameter. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> top3 = comparing_models(n_select = 3)
>>> stacker = stack_models(top3)
- estimator_list: list of scikit-learn compatible objects
List of trained model objects
- meta_model: scikit-learn compatible object, default = None
When None, Logistic Regression is trained as a meta model.
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- method: str, default = ‘auto’
When set to ‘auto’, it will invoke, for each estimator, ‘predict_proba’, ‘decision_function’ or ‘predict’ in that order. Otherwise, manually pass one of ‘predict_proba’, ‘decision_function’ or ‘predict’.
- restack: bool, default = True
When set to False, only the predictions of estimators will be used as training data for the meta_model.
- choose_better: bool, default = False
When set to True, the returned object is always the better-performing of the input estimators and the stacked model. The metric used for comparison is defined by the optimize parameter.
- optimize: str, default = ‘Accuracy’
Metric to compare for model selection when choose_better is True.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- verbose: bool, default = True
Score grid is not printed when verbose is set to False.
- Returns
Trained Model
Warning
When method is not set to ‘auto’, it will check if the defined method is available for all estimators passed in estimator_list. If any estimator does not implement the method, it will raise an error.
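The ‘auto’ lookup order and the check described in the warning can be pictured as follows. This is a minimal sketch with hypothetical stand-in classes (`ProbaModel`, `MarginModel`) and a hypothetical helper `resolve_stack_method`; the real lookup inside stack_models may differ.

```python
STACK_METHODS = ("predict_proba", "decision_function", "predict")


def resolve_stack_method(estimator, method="auto"):
    """Return the name of the prediction method that would feed the meta model."""
    if method == "auto":
        # Try the candidates in the documented order and take the first one available.
        for name in STACK_METHODS:
            if callable(getattr(estimator, name, None)):
                return name
        raise AttributeError(f"{type(estimator).__name__} has no usable prediction method")
    if not callable(getattr(estimator, method, None)):
        raise AttributeError(f"{type(estimator).__name__} does not implement {method!r}")
    return method


class ProbaModel:
    """Stand-in for an estimator that exposes class probabilities."""

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

    def predict(self, X):
        return [0 for _ in X]


class MarginModel:
    """Stand-in for an estimator that only exposes a decision margin."""

    def decision_function(self, X):
        return [0.0 for _ in X]

    def predict(self, X):
        return [0 for _ in X]


print(resolve_stack_method(ProbaModel()))
print(resolve_stack_method(MarginModel()))
```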
- PyRapidML.classification.plot_model(estimator, plot: str = 'auc', scale: float = 1, save: bool = False, fold: Optional[Union[int, Any]] = None, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, use_train_data: bool = False, verbose: bool = True, display_format: Optional[str] = None) str¶
This function analyzes the performance of a trained model on the holdout set. It may require re-training the model in certain cases.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> plot_model(lr, plot = 'auc')
- estimator: scikit-learn compatible object
Trained model object
- plot: str, default = ‘auc’
List of available plots (ID - Name):
‘auc’ - Area Under the Curve
‘threshold’ - Discrimination Threshold
‘pr’ - Precision Recall Curve
‘confusion_matrix’ - Confusion Matrix
‘error’ - Class Prediction Error
‘class_report’ - Classification Report
‘boundary’ - Decision Boundary
‘rfe’ - Recursive Feature Selection
‘learning’ - Learning Curve
‘manifold’ - Manifold Learning
‘calibration’ - Calibration Curve
‘vc’ - Validation Curve
‘dimension’ - Dimension Learning
‘feature’ - Feature Importance
‘feature_all’ - Feature Importance (All)
‘parameter’ - Model Hyperparameter
‘lift’ - Lift Curve
‘gain’ - Gain Chart
‘tree’ - Decision Tree
- scale: float, default = 1
The resolution scale of the figure.
- save: bool, default = False
When set to True, plot is saved in the current working directory.
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- use_train_data: bool, default = False
When set to True, train data will be used for plots, instead of test data.
- verbose: bool, default = True
When set to False, progress bar is not displayed.
- display_format: str, default = None
To display plots in Streamlit (https://www.streamlit.io/), set this to ‘streamlit’. Currently, not all plots are supported.
- Returns
None
Warning
Estimators that do not support the ‘predict_proba’ attribute cannot be used for ‘auc’ and ‘calibration’ plots.
When the target is multiclass, ‘calibration’, ‘threshold’, ‘manifold’ and ‘rfe’ plots are not available.
When the ‘max_features’ parameter of a trained model object is not equal to the number of samples in training set, the ‘rfe’ plot is not available.
- PyRapidML.classification.evaluate_model(estimator, fold: Optional[Union[int, Any]] = None, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, use_train_data: bool = False)¶
This function displays a user interface for analyzing the performance of a trained model. It calls the plot_model function internally.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> evaluate_model(lr)
- estimator: scikit-learn compatible object
Trained model object
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- use_train_data: bool, default = False
When set to True, train data will be used for plots, instead of test data.
- Returns
None
Warning
This function only works in IPython enabled Notebook.
- PyRapidML.classification.interpret_model(estimator, plot: str = 'summary', feature: Optional[str] = None, observation: Optional[int] = None, use_train_data: bool = False, X_new_sample: Optional[pandas.core.frame.DataFrame] = None, save: bool = False, **kwargs)¶
This function analyzes the predictions generated from a tree-based model. It is based on SHAP (SHapley Additive exPlanations). For more information, see https://shap.readthedocs.io/en/latest/
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> xgboost = creating_model('xgboost')
>>> interpret_model(xgboost)
- estimator: scikit-learn compatible object
Trained model object
- plot: str, default = ‘summary’
Type of plot. Available options are: ‘summary’, ‘correlation’, and ‘reason’.
- feature: str, default = None
Feature to check correlation with. This parameter is only required when plot type is ‘correlation’. When set to None, it uses the first column in the train dataset.
- observation: int, default = None
Observation index number in holdout set to explain. When plot is not ‘reason’, this parameter is ignored.
- use_train_data: bool, default = False
When set to True, train data will be used for plots, instead of test data.
- X_new_sample: pd.DataFrame, default = None
Row from an out-of-sample dataframe (neither train nor test data) to be plotted. The sample must have the same columns as the raw input data, and it is transformed by the preprocessing pipeline automatically before plotting.
- save: bool, default = False
When set to True, the plot is saved as a ‘png’ file in the current working directory.
- kwargs:
Additional keyword arguments to pass to the plot.
- Returns
None
- PyRapidML.classification.calibrate_model(estimator, method: str = 'sigmoid', fold: Optional[Union[int, Any]] = None, round: int = 4, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, verbose: bool = True) Any¶
This function calibrates the probability of a given estimator using isotonic or logistic regression. The output of this function is a score grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> dt = creating_model('dt')
>>> calibrated_dt = calibrate_model(dt)
- estimator: scikit-learn compatible object
Trained model object
- method: str, default = ‘sigmoid’
The method to use for calibration. Can be ‘sigmoid’ which corresponds to Platt’s method or ‘isotonic’ which is a non-parametric approach.
- fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the initializer function is used. When an integer is passed, it is interpreted as the ‘n_splits’ parameter of the CV generator in the initializer function.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- verbose: bool, default = True
Score grid is not printed when verbose is set to False.
- Returns
Trained Model
Warning
Avoid isotonic calibration with too few calibration samples (< 1000) since it tends to overfit.
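To give a feel for why ‘isotonic’ calibration overfits on small samples, here is a minimal pool-adjacent-violators sketch in plain Python. It fits a non-decreasing step function from raw scores to observed labels; the function name `isotonic_fit` and the toy data are assumptions for illustration, not the routine calibrate_model uses.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function to (score, label) pairs."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [label_sum, count]; its fitted value is the mean
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, count = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
    fitted = []
    for label_sum, count in blocks:
        fitted.extend([label_sum / count] * count)
    return [score for score, _ in pairs], fitted


sorted_scores, calibrated = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(calibrated)
```

With only four samples the fitted values snap to 0, 0.5 and 1, following the labels almost exactly. A sigmoid has just two parameters and cannot bend this freely, which is the intuition behind the sample-size warning.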
- PyRapidML.classification.optimize_threshold(estimator, true_positive: int = 0, true_negative: int = 0, false_positive: int = 0, false_negative: int = 0)¶
This function optimizes the probability threshold for a given estimator using a custom cost function. The function displays a plot of optimized cost as a function of probability threshold between 0.0 and 1.0 and returns the optimized threshold value as a numpy float.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> optimize_threshold(lr, true_negative = 10, false_negative = -100)
- estimator: scikit-learn compatible object
Trained model object
- true_positive: int, default = 0
Cost function or returns for true positive.
- true_negative: int, default = 0
Cost function or returns for true negative.
- false_positive: int, default = 0
Cost function or returns for false positive.
- false_negative: int, default = 0
Cost function or returns for false negative.
- Returns
numpy.float64
Warning
This function is not supported when target is multiclass.
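The idea behind the cost-based search can be sketched as a sweep over candidate thresholds, scoring each with the four cost/return values. The helper names `total_cost` and `best_threshold`, the 0.01 step and the toy data are assumptions; the actual function also draws a plot, which is omitted here.

```python
def total_cost(y_true, scores, threshold, true_positive=0, true_negative=0,
               false_positive=0, false_negative=0):
    """Sum the cost/return of every prediction at the given probability threshold."""
    total = 0
    for actual, score in zip(y_true, scores):
        predicted = int(score >= threshold)
        if predicted and actual:
            total += true_positive
        elif not predicted and not actual:
            total += true_negative
        elif predicted and not actual:
            total += false_positive
        else:
            total += false_negative
    return total


def best_threshold(y_true, scores, **costs):
    """Sweep thresholds 0.00, 0.01, ..., 1.00 and keep the first one with the highest total."""
    grid = [step / 100 for step in range(101)]
    return max(grid, key=lambda t: total_cost(y_true, scores, t, **costs))


y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.60, 0.55]
threshold = best_threshold(y_true, scores, true_negative=10, false_negative=-100)
print(threshold, total_cost(y_true, scores, threshold, true_negative=10, false_negative=-100))
```

Because a false negative is penalised ten times more heavily than a true negative is rewarded, the chosen threshold lands below the lowest-scoring positive example rather than near 0.5.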
- PyRapidML.classification.predict_model(estimator, data: Optional[pandas.core.frame.DataFrame] = None, probability_threshold: Optional[float] = None, encoded_labels: bool = False, raw_score: bool = False, round: int = 4, verbose: bool = True) pandas.core.frame.DataFrame¶
This function predicts Label and Score (probability of predicted class) using a trained model. When data is None, it predicts label and score on the holdout set.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> pred_holdout = predict_model(lr)
>>> pred_unseen = predict_model(lr, data = unseen_dataframe)
- estimator: scikit-learn compatible object
Trained model object
- data: pandas.DataFrame
Shape (n_samples, n_features). All features used during training must be available in the unseen dataset.
- probability_threshold: float, default = None
Threshold for converting predicted probability to class label. It defaults to 0.5 for all classifiers unless explicitly defined in this parameter.
- encoded_labels: bool, default = False
When set to True, will return labels encoded as an integer.
- raw_score: bool, default = False
When set to True, scores for all labels will be returned.
- round: int, default = 4
Number of decimal places the metrics in the score grid will be rounded to.
- verbose: bool, default = True
When set to False, holdout score grid is not printed.
- Returns
pandas.DataFrame
Warning
The behavior of predict_model changed in version 2.1 without backward compatibility. As such, pipelines trained using an older version (<= 2.0) may not work for inference with version >= 2.1. You can either retrain your models with a newer version or downgrade the version for inference.
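How probability_threshold and round could turn a positive-class probability into the Label and Score columns is sketched below for the binary case. The helper `label_and_score` is hypothetical, and the assumption that Score is the probability of whichever class ends up predicted is taken from the description above, not from the library source.

```python
def label_and_score(prob_positive, threshold=0.5, classes=(0, 1), ndigits=4):
    """Map P(positive class) to (Label, Score) for a binary target."""
    is_positive = prob_positive >= threshold
    label = classes[1] if is_positive else classes[0]
    # Score is reported for the predicted class, so flip it for negatives.
    score = prob_positive if is_positive else 1.0 - prob_positive
    return label, round(score, ndigits)


print(label_and_score(0.3))                   # default threshold of 0.5
print(label_and_score(0.3, threshold=0.25))   # a lower threshold flips the label
```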
- PyRapidML.classification.finalize_model(estimator, fit_kwargs: Optional[dict] = None, groups: Optional[Union[str, Any]] = None, model_only: bool = True) Any¶
This function trains a given estimator on the entire dataset including the holdout set.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> final_lr = finalize_model(lr)
- estimator: scikit-learn compatible object
Trained model object
- fit_kwargs: dict, default = {} (empty dict)
Dictionary of arguments passed to the fit method of the model.
- groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when GroupKFold is used for the cross validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in training dataset. When string is passed, it is interpreted as the column name in the dataset containing group labels.
- model_only: bool, default = True
When set to False, only model object is re-trained and all the transformations in Pipeline are ignored.
- Returns
Trained Model
- PyRapidML.classification.deploy_model(model, model_name: str, authentication: dict, platform: str = 'aws')¶
This function deploys the transformation pipeline and trained model on cloud.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> deploy_model(model = lr, model_name = 'lr-for-deployment', platform = 'aws', authentication = {'bucket' : 'S3-bucket-name'})
- Amazon Web Services (AWS) users:
To deploy a model on AWS S3 (‘aws’), environment variables must be set in your local environment. To configure AWS environment variables, type aws configure in the command line. The following information from the IAM portal of your Amazon console account is required:
AWS Access Key ID
AWS Secret Access Key
Default Region Name (can be seen under Global settings on your AWS console)
More info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html
- Google Cloud Platform (GCP) users:
To deploy a model on Google Cloud Platform (‘gcp’), a project must be created using the command line or the GCP console. Once the project is created, you must create a service account and download the service account key as a JSON file to set environment variables in your local environment.
More info: https://cloud.google.com/docs/authentication/production
- Microsoft Azure (Azure) users:
To deploy a model on Microsoft Azure (‘azure’), environment variables for the connection string must be set in your local environment. Go to the settings of the storage account on the Azure portal to access the required connection string.
- model: scikit-learn compatible object
Trained model object
- model_name: str
Name of model.
- authentication: dict
Dictionary of applicable authentication tokens.
When platform = ‘aws’: {‘bucket’ : ‘S3-bucket-name’}
When platform = ‘gcp’: {‘project’: ‘gcp-project-name’, ‘bucket’ : ‘gcp-bucket-name’}
When platform = ‘azure’: {‘container’: ‘azure-container-name’}
- platform: str, default = ‘aws’
Name of the cloud platform. Currently supported platforms: ‘aws’, ‘gcp’ and ‘azure’.
- Returns
None
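Since the expected keys of the authentication dictionary depend on the platform, it can help to validate the dictionary before attempting a deployment. The checker below is a hypothetical convenience sketch built only from the key names listed above; `check_authentication` is not part of PyRapidML.

```python
# Required keys per platform, as listed in the parameter description above.
REQUIRED_AUTH_KEYS = {
    "aws": {"bucket"},
    "gcp": {"project", "bucket"},
    "azure": {"container"},
}


def check_authentication(platform, authentication):
    """Raise early if the platform is unknown or a required key is missing."""
    if platform not in REQUIRED_AUTH_KEYS:
        raise ValueError(f"unsupported platform: {platform!r}")
    missing = REQUIRED_AUTH_KEYS[platform] - set(authentication)
    if missing:
        raise KeyError(f"missing authentication keys for {platform!r}: {sorted(missing)}")
    return True


print(check_authentication("aws", {"bucket": "S3-bucket-name"}))
```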
- PyRapidML.classification.save_model(model, model_name: str, model_only: bool = False, verbose: bool = True, **kwargs)¶
This function saves the transformation pipeline and trained model object into the current working directory as a pickle file for later use.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> lr = creating_model('lr')
>>> save_model(lr, 'saved_lr_model')
- model: scikit-learn compatible object
Trained model object
- model_name: str
Name of the model.
- model_only: bool, default = False
When set to True, only trained model object is saved instead of the entire pipeline.
- verbose: bool, default = True
Success message is not printed when verbose is set to False.
- kwargs:
Additional keyword arguments to pass to joblib.dump().
- Returns
Tuple of the model object and the filename.
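The save-then-load round trip can be pictured with the standard library alone. The sketch below uses pickle as a stand-in for joblib, a plain dictionary as a stand-in for a fitted pipeline, and assumes a ‘.pkl’ suffix is appended to model_name; `save_object` and `load_object` are hypothetical helpers, not the library functions.

```python
import os
import pickle
import tempfile


def save_object(obj, model_name, directory="."):
    """Pickle obj to <directory>/<model_name>.pkl and return (obj, filename)."""
    filename = os.path.join(directory, model_name + ".pkl")
    with open(filename, "wb") as handle:
        pickle.dump(obj, handle)
    return obj, filename


def load_object(model_name, directory="."):
    """Load the object previously written by save_object."""
    with open(os.path.join(directory, model_name + ".pkl"), "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    toy_model = {"kind": "toy-model", "coef": [0.5, -1.2]}  # stand-in for a pipeline
    _, path = save_object(toy_model, "saved_lr_model", directory=tmp)
    restored = load_object("saved_lr_model", directory=tmp)

print(os.path.basename(path), restored == toy_model)
```

As with any pickle-based format, only load files from sources you trust.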
- PyRapidML.classification.load_model(model_name, platform: Optional[str] = None, authentication: Optional[Dict[str, str]] = None, verbose: bool = True)¶
This function loads a previously saved pipeline.
Example
>>> from PyRapidML.classification import load_model
>>> saved_lr = load_model('saved_lr_model')
- model_name: str
Name of the model.
- platform: str, default = None
Name of the cloud platform. Currently supported platforms: ‘aws’, ‘gcp’ and ‘azure’.
- authentication: dict, default = None
Dictionary of applicable authentication tokens.
When platform = ‘aws’: {‘bucket’ : ‘S3-bucket-name’}
When platform = ‘gcp’: {‘project’: ‘gcp-project-name’, ‘bucket’ : ‘gcp-bucket-name’}
When platform = ‘azure’: {‘container’: ‘azure-container-name’}
- verbose: bool, default = True
Success message is not printed when verbose is set to False.
- Returns
Trained Model
- PyRapidML.classification.automl(optimize: str = 'Accuracy', use_holdout: bool = False) Any¶
This function returns the best model out of all trained models in the current session based on the optimize parameter. Metrics evaluated can be accessed using the get_metrics function.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> top3 = comparing_models(n_select = 3)
>>> tuned_top3 = [tuning_model(i) for i in top3]
>>> blender = blend_models(tuned_top3)
>>> stacker = stack_models(tuned_top3)
>>> best_auc_model = automl(optimize = 'AUC')
- optimize: str, default = ‘Accuracy’
Metric to use for model selection. It also accepts custom metrics added using the add_metric function.
- use_holdout: bool, default = False
When set to True, metrics are evaluated on holdout set instead of CV.
- Returns
Trained Model
- PyRapidML.classification.pull(pop: bool = False) pandas.core.frame.DataFrame¶
Returns the last printed score grid. Use the pull function after any training function to store the score grid in a pandas.DataFrame.
- pop: bool, default = False
If True, will pop (remove) the returned dataframe from the display container.
- Returns
pandas.DataFrame
- PyRapidML.classification.models(type: Optional[str] = None, internal: bool = False, raise_errors: bool = True) pandas.core.frame.DataFrame¶
Returns table of models available in the model library.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> all_models = models()
- type: str, default = None
linear : filters and only return linear models
tree : filters and only return tree based models
ensemble : filters and only return ensemble models
- internal: bool, default = False
When True, will return extra columns and rows used internally.
- raise_errors: bool, default = True
When False, will suppress all exceptions, ignoring models that couldn’t be created.
- Returns
pandas.DataFrame
- PyRapidML.classification.get_metrics(reset: bool = False, include_custom: bool = True, raise_errors: bool = True) pandas.core.frame.DataFrame¶
Returns table of available metrics used for CV.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> all_metrics = get_metrics()
- reset: bool, default = False
When True, will reset all changes made using the add_metric and remove_metric functions.
- include_custom: bool, default = True
Whether to include user added (custom) metrics or not.
- raise_errors: bool, default = True
If False, will suppress all exceptions, ignoring metrics that couldn’t be created.
- Returns
pandas.DataFrame
- PyRapidML.classification.add_metric(id: str, name: str, score_func: type, target: str = 'pred', greater_is_better: bool = True, multiclass: bool = True, **kwargs) pandas.core.series.Series¶
Adds a custom metric to be used for CV.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> from sklearn.metrics import log_loss
>>> add_metric('logloss', 'Log Loss', log_loss, greater_is_better = False)
- id: str
Unique id for the metric.
- name: str
Display name of the metric.
- score_func: type
Score function (or loss function) with signature score_func(y, y_pred, **kwargs).
- target: str, default = ‘pred’
The target of the score function.
‘pred’ for the prediction table
‘pred_proba’ for pred_proba
‘threshold’ for decision_function or predict_proba
- greater_is_better: bool, default = True
Whether a higher value of score_func is better or not.
- multiclass: bool, default = True
Whether the metric supports multiclass target.
- kwargs:
Arguments to be passed to score function.
- Returns
pandas.Series
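Any callable that follows the score_func(y, y_pred, **kwargs) signature can serve as a custom metric. The sketch below writes a binary log loss by hand to show the expected shape; `binary_log_loss` and its `eps` argument are illustrative assumptions. A loss like this consumes probabilities and should be registered with greater_is_better = False, presumably together with target = ‘pred_proba’.

```python
import math


def binary_log_loss(y, y_pred, eps=1e-15, **kwargs):
    """Mean negative log-likelihood of binary labels y under probabilities y_pred."""
    total = 0.0
    for actual, prob in zip(y, y_pred):
        prob = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
        total -= actual * math.log(prob) + (1 - actual) * math.log(1 - prob)
    return total / len(y)


confident = binary_log_loss([1, 0, 1], [0.9, 0.1, 0.8])
uninformed = binary_log_loss([1, 0, 1], [0.5, 0.5, 0.5])
print(confident < uninformed)
```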
- PyRapidML.classification.remove_metric(name_or_id: str)¶
Removes a metric from CV.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> remove_metric('MCC')
- name_or_id: str
Display name or ID of the metric.
- Returns
None
- PyRapidML.classification.get_logs(experiment_name: Optional[str] = None, save: bool = False) pandas.core.frame.DataFrame¶
Returns a table of experiment logs. Only works when log_experiment is set to True in the initializer function.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase', log_experiment = True)
>>> best = comparing_models()
>>> exp_logs = get_logs()
- experiment_name: str, default = None
When None current active run is used.
- save: bool, default = False
When set to True, csv file is saved in current working directory.
- Returns
pandas.DataFrame
- PyRapidML.classification.get_config(variable: str)¶
This function retrieves the global variables created by the initializer function. The following variables are accessible:
X: Transformed dataset (X)
y: Transformed dataset (y)
X_train: Transformed train dataset (X)
X_test: Transformed test/holdout dataset (X)
y_train: Transformed train dataset (y)
y_test: Transformed test/holdout dataset (y)
seed: random state set through session_id
prep_pipe: Transformation pipeline
fold_shuffle_param: shuffle parameter used in Kfolds
n_jobs_param: n_jobs parameter used in model training
html_param: html_param configured through setup
create_model_container: results grid storage container
master_model_container: model storage container
display_container: results display container
exp_name_log: Name of experiment
logging_param: log_experiment param
log_plots_param: log_plots param
USI: Unique session ID parameter
fix_imbalance_param: fix_imbalance param
fix_imbalance_method_param: fix_imbalance_method param
data_before_preprocess: data before preprocessing
target_param: name of target variable
gpu_param: use_gpu param configured through setup
fold_generator: CV splitter configured in fold_strategy
fold_param: fold params defined in the setup
fold_groups_param: fold groups defined in the setup
stratify_param: stratify parameter defined in the setup
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> X_train = get_config('X_train')
- Returns
Global variable
- PyRapidML.classification.set_config(variable: str, value)
This function resets the global variables. The following variables are accessible:
X: Transformed dataset (X)
y: Transformed dataset (y)
X_train: Transformed train dataset (X)
X_test: Transformed test/holdout dataset (X)
y_train: Transformed train dataset (y)
y_test: Transformed test/holdout dataset (y)
seed: random state set through session_id
prep_pipe: Transformation pipeline
fold_shuffle_param: shuffle parameter used in Kfolds
n_jobs_param: n_jobs parameter used in model training
html_param: html_param configured through setup
create_model_container: results grid storage container
master_model_container: model storage container
display_container: results display container
exp_name_log: Name of experiment
logging_param: log_experiment param
log_plots_param: log_plots param
USI: Unique session ID parameter
fix_imbalance_param: fix_imbalance param
fix_imbalance_method_param: fix_imbalance_method param
data_before_preprocess: data before preprocessing
target_param: name of target variable
gpu_param: use_gpu param configured through setup
fold_generator: CV splitter configured in fold_strategy
fold_param: fold params defined in the setup
fold_groups_param: fold groups defined in the setup
stratify_param: stratify parameter defined in the setup
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> set_config('seed', 123)
- Returns
None
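The relationship between get_config and set_config can be pictured as reads and writes against one shared store of named variables. The sketch below illustrates that pattern only. It does not reproduce the library's internals, and the _GLOBALS dictionary, its sample values, and both helper functions are hypothetical.

```python
# Hypothetical module-level store standing in for the library's globals.
# The values are made up for illustration.
_GLOBALS = {"seed": 42, "target_param": "Purchase", "fold_param": 10}


def get_config_sketch(variable):
    """Return the stored value for a known variable name."""
    if variable not in _GLOBALS:
        raise ValueError(f"Variable {variable!r} not found")
    return _GLOBALS[variable]


def set_config_sketch(variable, value):
    """Overwrite a known variable in place and return None."""
    if variable not in _GLOBALS:
        raise ValueError(f"Variable {variable!r} not found")
    _GLOBALS[variable] = value


set_config_sketch("seed", 123)
seed_after = get_config_sketch("seed")
```

Restricting writes to names that already exist is a design choice of this sketch; it keeps a typo in the variable name from silently creating a new entry.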
- PyRapidML.classification.save_config(file_name: str)
This function saves all global variables to a pickle file, so that you can resume later without rerunning
setup.
Example
>>> from PyRapidML.datasets import extract_data
>>> juice = extract_data('juice')
>>> from PyRapidML.classification import *
>>> exp_name = initializer(data = juice, target = 'Purchase')
>>> save_config('myvars.pkl')
- Returns
None
- PyRapidML.classification.load_config(file_name: str)
This function loads global variables from a pickle file into the Python environment.
Example
>>> from PyRapidML.classification import load_config
>>> load_config('myvars.pkl')
- Returns
Global variables
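The save_config and load_config pair amounts to a pickle round trip: a set of variables is serialized to a file and later read back. A minimal sketch of that round trip with the standard library follows. The helper names and the sample dictionary are hypothetical, and the real functions may store more than is shown here.

```python
import os
import pickle
import tempfile


def save_config_sketch(variables, file_name):
    """Serialize a dictionary of variables to a pickle file."""
    with open(file_name, "wb") as f:
        pickle.dump(variables, f)


def load_config_sketch(file_name):
    """Read the dictionary of variables back from a pickle file."""
    with open(file_name, "rb") as f:
        return pickle.load(f)


# Made-up values standing in for the saved global variables.
original = {"seed": 123, "target_param": "Purchase"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myvars.pkl")
    save_config_sketch(original, path)
    restored = load_config_sketch(path)
```

The restored object is an equal copy, not the same object. Because unpickling can execute arbitrary code, pickle files should only be loaded from sources you trust.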