Modules

buildml.automate module

Automate the supervised learning process of training a model with easy-to-use functionality that makes building models a simple task.

class buildml.automate.SupervisedLearning(dataset, show_warnings: bool = False)[source]

Bases: object

Automated Supervised Learning module designed for end-to-end data handling, preprocessing, model development, and evaluation in the context of supervised machine learning.

Parameters

dataset : pd.DataFrame

The input dataset for supervised learning.

show_warnings : bool, optional

If True, display warnings. Default is False.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.svm import SVC
>>> from buildml import SupervisedLearning
>>>
>>>
>>> dataset = pd.read_csv("Your_file_path")  # Load your dataset (e.g., a pandas DataFrame)
>>> data = SupervisedLearning(dataset)
>>>
>>> # Exploratory Data Analysis
>>> eda = data.eda()
>>> 
>>> # Build and Evaluate Classifier
>>> classifiers = ["LogisticRegression(random_state = 0)", 
>>>                "RandomForestClassifier(random_state = 0)", 
>>>                "DecisionTreeClassifier(random_state = 0)", 
>>>                "SVC()"]
>>> build_model = data.build_multiple_classifiers(classifiers)

Notes

This class encapsulates various functionalities for data handling and model development. It leverages popular libraries such as pandas, numpy, matplotlib, seaborn, ydata_profiling, sweetviz, imbalanced-learn, scikit-learn, warnings, feature-engine, and datatable.

The workflow involves steps like loading and handling data, cleaning and manipulation, formatting and transformation, exploratory data analysis, feature engineering, data preprocessing, model building and evaluation, data aggregation and summarization, and data type handling.
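
A minimal end-to-end sketch of that workflow, using methods documented below (the column name "target_column" is an illustrative placeholder):

>>> from sklearn.linear_model import LinearRegression
>>>
>>> data = SupervisedLearning(dataset)
>>> data.fix_missing_values()
>>> data.select_dependent_and_independent(predict = "target_column")
>>> training = data.train_model_regressor(LinearRegression())
>>> predictions = data.regressor_predict()
>>> evaluation = data.regressor_evaluation()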

build_multiple_classifiers(classifiers: list, kfold: int = None, cross_validation: bool = False, graph: bool = False, length: int = 20, width: int = 10, linestyle: str = 'dashed', marker: str = 'o', markersize: int = 12, fontweight: int = 80, fontstretch: int = 50)[source]

Build and evaluate multiple classifiers.

Parameters

classifiers : list or tuple

A list or tuple of classifier objects to be trained and evaluated.

kfold : int, optional, default=None

Number of folds for cross-validation. If None, cross-validation is not performed.

cross_validation : bool, optional, default=False

Perform cross-validation if True.

graph : bool, optional, default=False

Whether to display performance metrics as graphs.

length : int, optional, default=20

Length of the graph (if graph=True).

width : int, optional, default=10

Width of the graph (if graph=True).

Returns

dict

A dictionary containing classifier metrics and additional information.

Notes

This method builds and evaluates multiple classifiers on the provided dataset. It supports both traditional training/testing evaluation and cross-validation.

If graph is True, the method also displays graphs showing performance metrics for training and testing datasets.

Example:

>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.svm import SVC
>>>
>>> data = SupervisedLearning(dataset)
>>> classifiers = [LogisticRegression(random_state = 0), 
>>>                RandomForestClassifier(random_state = 0), 
>>>                SVC(random_state = 0)]
>>> results = data.build_multiple_classifiers(classifiers, 
>>>                                           kfold=5, 
>>>                                           cross_validation=True, 
>>>                                           graph=True, 
>>>                                           length=8, 
>>>                                           width=12)

Note: Ensure that the classifiers provided are compatible with scikit-learn’s classification API.

build_multiple_classifiers_from_features(strategy: str, estimator, classifiers: list, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build multiple classifiers using different feature selection strategies and machine learning algorithms.

This method performs feature selection, trains multiple classifiers, and evaluates their performance.

Parameters

strategy : str

Feature selection strategy. Should be one of 'selectkbest', 'selectpercentile', 'rfe', or 'selectfrommodel'.

estimator

Estimator or score function used for feature selection. For the "rfe" and "selectfrommodel" strategies, set this to a classifier that implements 'fit'. For the "selectkbest" and "selectpercentile" strategies, use a score function such as "f_classif", "chi2", or "mutual_info_classif".

classifiers : list or tuple

List of classifier instances to be trained and evaluated.

max_num_features : int, optional

Maximum number of features to consider. If None, all features are considered.

min_num_features : int, optional

Minimum number of features to consider. If None, the process starts with max_num_features and decreases the count until 1.

kfold : int, optional

Number of folds for cross-validation. If None, a regular train-test split is used.

cv : bool, optional

If True, perform cross-validation. If False, use a train-test split.

Returns

dict

A dictionary containing feature metrics and additional information about the trained models.

Example:

>>> from buildml import SupervisedLearning
>>> from sklearn.feature_selection import f_classif
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>>
>>>
>>> data = SupervisedLearning(dataset)
>>> classifiers = [RandomForestClassifier(random_state = 0), 
>>>                DecisionTreeClassifier(random_state = 0)]
>>> result = data.build_multiple_classifiers_from_features(strategy='selectkbest', 
>>>                                                        estimator=f_classif, 
>>>                                                        classifiers=classifiers, 
>>>                                                        max_num_features=10, 
>>>                                                        kfold=5)
build_multiple_regressors(regressors: list, kfold: int = None, cross_validation: bool = False, graph: bool = False, length: int = 20, width: int = 10, linestyle: str = 'dashed', marker: str = 'o', markersize: int = 12, fontweight: int = 80, fontstretch: int = 50)[source]

Build, evaluate, and optionally graph multiple regression models.

This method facilitates the construction and assessment of multiple regression models using a variety of algorithms. It supports both single train-test split and k-fold cross-validation approaches. The generated models are evaluated based on key regression metrics, providing insights into their performance on both training and test datasets.

Parameters

regressors : list or tuple

List of regression models to build and evaluate.

kfold : int, optional

Number of folds for cross-validation. Default is None.

cross_validation : bool, default False

If True, perform cross-validation; otherwise, use a simple train-test split.

graph : bool, default False

If True, plot evaluation metrics for each regression model.

length : int, optional, default=20

Length of the graph (if graph=True).

width : int, optional, default=10

Width of the graph (if graph=True).

Returns

dict

A dictionary containing regression metrics and additional information.

Notes

This method uses the following libraries:

  • sklearn.model_selection for train-test splitting and cross-validation.

  • matplotlib.pyplot for plotting if graph=True.

Examples

>>> # Build and evaluate multiple regression models
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
>>>
>>> df = SupervisedLearning(dataset)
>>> models = [LinearRegression(), 
>>>           RandomForestRegressor(), 
>>>           GradientBoostingRegressor()]
>>> results = df.build_multiple_regressors(regressors=models, 
>>>                                        kfold=5, 
>>>                                        cross_validation=True, 
>>>                                        graph=True)

See Also

  • SupervisedLearning.train_model_regressor : Train a single regression model.

  • SupervisedLearning.regressor_predict : Make predictions using a trained regression model.

  • SupervisedLearning.regressor_evaluation : Evaluate the performance of a regression model.

build_multiple_regressors_from_features(strategy: str, estimator, regressors: list, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build and evaluate multiple regression models with varying numbers of features.

Parameters

strategy : str

The feature selection strategy. Supported values are 'selectkbest', 'selectpercentile', 'rfe', and 'selectfrommodel'.

estimator

Estimator or score function used for feature selection. For the "rfe" and "selectfrommodel" strategies, set this to a regressor that implements 'fit'. For the "selectkbest" and "selectpercentile" strategies, use a score function such as "f_regression" or "chi2".

regressors : list or tuple

List of regression models to build and evaluate.

max_num_features : int, optional

Maximum number of features to consider during the feature selection process.

min_num_features : int, optional

Minimum number of features to consider during the feature selection process.

kfold : int, optional

Number of folds for cross-validation. If provided, cross-validation metrics will be calculated.

cv : bool, default False

If True, perform cross-validation; otherwise, use a single train-test split.

Returns

dict

A dictionary containing feature metrics and additional information for each model.

Example:

>>> from buildml import SupervisedLearning
>>> from sklearn.feature_selection import f_regression
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.tree import DecisionTreeRegressor
>>> from sklearn.linear_model import LinearRegression
>>>
>>>
>>> data = SupervisedLearning(dataset)
>>> results = data.build_multiple_regressors_from_features(
>>>        strategy='selectkbest',
>>>        estimator=f_regression,
>>>        regressors=[LinearRegression(), 
>>>                    RandomForestRegressor(random_state = 0), 
>>>                    DecisionTreeRegressor(random_state = 0)],
>>>        max_num_features=10,
>>>        kfold=5,
>>>        cv=True
>>>        )

See Also

  • sklearn.feature_selection.SelectKBest

  • sklearn.feature_selection.SelectPercentile

  • sklearn.feature_selection.RFE

  • sklearn.feature_selection.SelectFromModel

  • sklearn.linear_model

  • sklearn.ensemble.RandomForestRegressor

  • sklearn.model_selection.cross_val_score

  • sklearn.metrics.mean_squared_error

  • sklearn.metrics.r2_score

build_single_classifier_from_features(strategy: str, estimator, classifier, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build and evaluate a single classification model using feature selection.

Parameters

strategy : str

Feature selection strategy. Should be one of ["selectkbest", "selectpercentile", "rfe", "selectfrommodel"].

estimator

Estimator or score function used for feature selection. For the "rfe" and "selectfrommodel" strategies, set this to a classifier that implements 'fit'. For the "selectkbest" and "selectpercentile" strategies, use a score function such as "f_classif", "chi2", or "mutual_info_classif".

classifier : object

Classification model object to be trained.

max_num_features : int, optional

Maximum number of features to consider, by default None.

min_num_features : int, optional

Minimum number of features to consider, by default None.

kfold : int, optional

Number of folds for cross-validation, by default None. Requires cv to be set to True to take effect.

cv : bool, optional

Whether to perform cross-validation, by default False.

Returns

dict

A dictionary containing feature metrics and additional information about the models.

Notes

  • This method builds a classification model using feature selection techniques and evaluates its performance.

  • The feature selection strategies include “selectkbest”, “selectpercentile”, “rfe”, and “selectfrommodel”.

  • The estimator parameter is required for “rfe” and “selectfrommodel” strategies.

  • This method assumes that the dataset and labels are already set in the class instance.

See Also

  • sklearn.feature_selection for feature selection techniques.

  • sklearn.linear_model for classification models.

  • sklearn.model_selection for cross-validation techniques.

  • sklearn.metrics for classification performance metrics.

  • Other libraries used in this method: numpy, pandas, matplotlib, seaborn.

Example

>>> from sklearn.feature_selection import f_classif
>>> from buildml import SupervisedLearning
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> learn = SupervisedLearning(dataset)
>>> results = learn.build_single_classifier_from_features(strategy='selectkbest', 
>>>                                                       estimator=f_classif, 
>>>                                                       classifier=RandomForestClassifier(random_state = 0))
>>> print(results)
build_single_regressor_from_features(strategy: str, estimator, regressor, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build and evaluate a single regression model using feature selection.

Parameters

strategy : str

Feature selection strategy. Should be one of ["selectkbest", "selectpercentile", "rfe", "selectfrommodel"].

estimator

Estimator or score function used for feature selection. For the "rfe" and "selectfrommodel" strategies, set this to a regressor that implements 'fit'. For the "selectkbest" and "selectpercentile" strategies, use a score function such as "f_regression", "f_oneway", or "chi2".

regressor : object

Regression model object to be trained.

max_num_features : int, optional

Maximum number of features to consider, by default None.

min_num_features : int, optional

Minimum number of features to consider, by default None.

kfold : int, optional

Number of folds for cross-validation, by default None. Requires cv to be set to True to take effect.

cv : bool, optional

Whether to perform cross-validation, by default False.

Returns

dict

A dictionary containing feature metrics and additional information about the models.

Notes

  • This method builds a regression model using feature selection techniques and evaluates its performance.

  • The feature selection strategies include “selectkbest”, “selectpercentile”, “rfe”, and “selectfrommodel”.

  • The estimator parameter is required for “rfe” and “selectfrommodel” strategies.

  • This method assumes that the dataset and labels are already set in the class instance.

See Also

  • sklearn.feature_selection for feature selection techniques.

  • sklearn.linear_model for regression models.

  • sklearn.model_selection for cross-validation techniques.

  • sklearn.metrics for regression performance metrics.

  • Other libraries used in this method: numpy, pandas, matplotlib, seaborn, ydata_profiling, sweetviz, imblearn, sklearn, warnings, datatable.

Example

>>> from sklearn.feature_selection import f_regression
>>> from buildml import SupervisedLearning
>>> from sklearn.linear_model import LinearRegression
>>>
>>> learn = SupervisedLearning(dataset)
>>> results = learn.build_single_regressor_from_features(strategy='selectkbest', 
>>>                                                      estimator=f_regression, 
>>>                                                      regressor=LinearRegression())
>>> print(results)
categorical_to_datetime(column)[source]

Convert specified categorical columns to datetime format.

Parameters

column : str, list, or tuple

The column or columns to be converted to datetime format.

Returns

DataFrame

The DataFrame with specified columns converted to datetime.

Notes

This method allows for the conversion of categorical columns containing date or time information to the datetime format.

Examples

>>> # Create a supervised learning instance and convert a single column
>>> model = SupervisedLearning(dataset)
>>> model.categorical_to_datetime('date_column')
>>> # Convert multiple columns
>>> model.categorical_to_datetime(['start_date', 'end_date'])
>>> # Convert a combination of columns using a tuple
>>> model.categorical_to_datetime(('start_date', 'end_date'))

See Also

  • pandas.to_datetime : Convert argument to datetime.

categorical_to_numerical(columns: list = None)[source]

Convert categorical columns to numerical using one-hot encoding.

Parameters

columns : list, optional

A list of column names to apply one-hot encoding. If not provided, one-hot encoding is applied to all categorical columns.

Returns

pd.DataFrame

Transformed DataFrame with categorical columns converted to numerical using one-hot encoding.

Notes

This method uses the pandas library for one-hot encoding.

See Also

pandas.get_dummies : Convert categorical variable(s) into dummy/indicator variables.

Examples

>>> # Convert all categorical columns to numerical using one-hot encoding
>>> df = SupervisedLearning(dataset)
>>> df.categorical_to_numerical()
>>> # Convert specific columns to numerical using one-hot encoding
>>> df.categorical_to_numerical(columns=['Category1', 'Category2'])
classifier_evaluation(kfold: int = None, cross_validation: bool = False)[source]

Evaluate the performance of a classification model.

Parameters

kfold : int, optional

Number of folds for cross-validation. If not provided, default is None.

cross_validation : bool, default False

Flag to indicate whether to perform cross-validation.

Returns

dict

A dictionary containing evaluation metrics for both training and test sets.

Raises

TypeError

If kfold is provided without enabling cross-validation.

AssertionError

If called for a regression problem.

Notes

  • This method evaluates the performance of a classification model using metrics such as confusion matrix, classification report, accuracy, precision, recall, and F1 score.

  • If kfold is not provided, it evaluates the model on the training and test sets.

  • If cross_validation is set to True, cross-validation scores are also included in the result.

Examples

>>> # Create a supervised learning instance and train a classification model
>>> from sklearn.ensemble import RandomForestClassifier
>>> model = SupervisedLearning(dataset)
>>> model.train_model_classifier(RandomForestClassifier(random_state = 0))
>>>
>>> # Evaluate the model
>>> evaluation_results = model.classifier_evaluation(kfold=5, cross_validation=True)
>>> print(evaluation_results)

See Also

  • sklearn.metrics.confusion_matrix : Compute confusion matrix.

  • sklearn.metrics.classification_report : Build a text report showing the main classification metrics.

  • sklearn.metrics.accuracy_score : Accuracy classification score.

  • sklearn.metrics.precision_score : Compute the precision.

  • sklearn.metrics.recall_score : Compute the recall.

  • sklearn.metrics.f1_score : Compute the F1 score.

  • sklearn.model_selection.cross_val_score : Evaluate a score by cross-validation.

classifier_graph(classifier, cmap_train='viridis', cmap_test='viridis', size_train_marker: float = 10, size_test_marker: float = 10, resolution=100)[source]

Visualize the decision boundaries of a classification model.

Parameters

classifier : scikit-learn classifier object

The trained classification model.

cmap_train : str, default "viridis"

Colormap for the training set.

cmap_test : str, default "viridis"

Colormap for the test set.

size_train_marker : float, default 10

Marker size for training set points.

size_test_marker : float, default 10

Marker size for test set points.

resolution : int, default 100

Resolution of the decision boundary plot.

Raises

AssertionError

If called for a regression problem.

TypeError

If the number of features is not 2.

Notes

  • This method visualizes the decision boundaries of a classification model by plotting the regions where the model predicts different classes.

  • It supports both training and test sets, with different markers and colormaps for each.

Examples

>>> # Create a supervised learning instance and train a classification model
>>> from sklearn.ensemble import RandomForestClassifier
>>> model = SupervisedLearning(dataset)
>>> model.train_model_classifier(RandomForestClassifier(random_state = 0))
>>>
>>> # Visualize the decision boundaries
>>> model.classifier_graph(classifier=model.model_classifier)

See Also

  • sklearn.preprocessing.LabelEncoder : Encode target labels.

  • sklearn.linear_model.LogisticRegression : Logistic Regression classifier.

  • sklearn.svm.SVC : Support Vector Classification.

  • sklearn.tree.DecisionTreeClassifier : Decision Tree classifier.

  • sklearn.ensemble.RandomForestClassifier : Random Forest classifier.

  • sklearn.neighbors.KNeighborsClassifier : K-Nearest Neighbors classifier.

  • matplotlib.pyplot.scatter : Plot scatter plots.

classifier_model_testing(variables_values: list, scaling: bool = False)[source]

Test a classification model with given input variables.

Parameters

variables_values : list

A list containing values for input variables used to make predictions.

scaling : bool, default False

Flag to indicate whether to scale input variables. If True, the method assumes that the model was trained on scaled data.

Returns

array

Predicted labels for the given input variables.

Raises

AssertionError

If called for a regression problem.

Notes

  • This method is used to test a classification model by providing values for the input variables and obtaining predicted labels.

  • If scaling is required, it is important to set the scaling parameter to True.

Examples

>>> # Create a supervised learning instance and train a classification model
>>> from sklearn.ensemble import RandomForestClassifier
>>> model = SupervisedLearning(dataset)
>>> model.train_model_classifier(RandomForestClassifier(random_state = 0))
>>>
>>> # Provide input variables for testing
>>> input_data = [value1, value2, value3]
>>>
>>> # Test the model
>>> predicted_labels = model.classifier_model_testing(input_data, scaling=True)
>>> print(predicted_labels)

See Also

  • sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

  • sklearn.neighbors.KNeighborsClassifier : K-nearest neighbors classifier.

  • sklearn.ensemble.RandomForestClassifier : A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset.

classifier_predict()[source]

Predict the target variable using the trained classifier.

Returns

Dict[str, np.ndarray]

A dictionary containing the actual and predicted values for training and test sets. Keys include ‘Actual Training Y’, ‘Actual Test Y’, ‘Predicted Training Y’, and ‘Predicted Test Y’.

Raises

AssertionError

If the model is set for regression, not classification.

Notes

This method uses the sklearn library for classification model prediction.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # Train a classification model
>>> df = SupervisedLearning(dataset)
>>> classifier = RandomForestClassifier(random_state = 0)
>>> trained_classifier = df.train_model_classifier(classifier)
>>>
>>> # Predict with the classification model
>>> predictions = df.classifier_predict()
>>>
>>> print(predictions)
{'Actual Training Y': array([...]), 'Actual Test Y': array([...]),
 'Predicted Training Y': array([...]), 'Predicted Test Y': array([...])}

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

column_binning(column, number_of_bins: int = 10, labels: list = None)[source]

Apply binning to specified columns in the dataset.

Parameters

column : str or list or tuple

The column(s) to apply binning to.

number_of_bins : int, default 10

The number of bins to use.

labels : list or tuple, default None

Labels to assign to the categorical bins that are created.

Returns

DataFrame

The dataset with specified column(s) binned.

Notes

  • This method uses the pd.cut function to apply binning to the specified column(s).

  • Binning is a process of converting numerical data into categorical data.

Examples

>>> # Create a supervised learning instance and perform column binning
>>> model = SupervisedLearning(dataset)
>>> model.column_binning(column="Age", number_of_bins=5)
>>> model.column_binning(column=["Salary", "Experience"], number_of_bins=10)

See Also

  • pandas.cut : Bin values into discrete intervals.

  • pandas.DataFrame : Data structure for handling the dataset.

count_column_categories(column: str, reset_index: bool = False, inplace: bool = False, test_data: bool = False)[source]

Count the occurrences of categories in a categorical column.

Parameters

column : str or list or tuple

Categorical column or columns to count categories.

reset_index : bool, default False

Whether to reset the index after counting.

inplace : bool, default False

Replace the original dataset with the result of this operation.

test_data : bool, default False

Include the categories count for the test data.

Returns

DataFrame

Count of occurrences of each category in the specified column.

Raises

TypeError

If the column type is not recognized.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>>
>>> # Count the occurrences of each category in the 'Category' column
>>> category_counts = model.count_column_categories(column='Category')
>>>
>>> print(category_counts)

See Also

  • pandas.Series.value_counts : Return a Series containing counts of unique values.

  • pandas.DataFrame.reset_index : Reset the index of a DataFrame.

drop_columns(columns: list)[source]

Drop specified columns from the dataset.

Parameters

columns : str or list of str

A single column name (string) or a list of column names to be dropped.

Returns

pd.DataFrame

A new DataFrame with the specified columns dropped.

Notes

This method utilizes the pandas library for DataFrame manipulation.

See Also

  • pandas.DataFrame.drop : Drop specified labels from rows or columns.

Examples

>>> # Drop a single column
>>> df = SupervisedLearning(dataset)
>>> df.drop_columns('column_name')
>>> # Drop multiple columns
>>> df = SupervisedLearning(dataset)
>>> df.drop_columns(['column1', 'column2'])
eda()[source]

Perform Exploratory Data Analysis (EDA) on the dataset.

Returns

Dict

A dictionary containing various EDA results, including data head, data tail, descriptive statistics, mode, distinct count, null count, total null count, and correlation matrix.

Notes

This method utilizes functionalities from pandas for data analysis.

Examples

>>> # Perform Exploratory Data Analysis
>>> df = SupervisedLearning(dataset)
>>> eda_results = df.eda()

See Also

  • pandas.DataFrame.info : Get a concise summary of a DataFrame.

  • pandas.DataFrame.head : Return the first n rows.

  • pandas.DataFrame.tail : Return the last n rows.

  • pandas.DataFrame.describe : Generate descriptive statistics.

  • pandas.DataFrame.mode : Get the mode(s) of each element.

  • pandas.DataFrame.nunique : Count distinct observations.

  • pandas.DataFrame.isnull : Detect missing values.

  • pandas.DataFrame.corr : Compute pairwise correlation of columns.

eda_visual(histogram_bins: int = 10, figsize_heatmap: tuple = (15, 10), figsize_histogram: tuple = (15, 10), figsize_barchart: tuple = (15, 10), before_data_cleaning: bool = True)[source]

Generate visualizations for exploratory data analysis (EDA).

Parameters

histogram_bins : int

The number of bins for each histogram.

figsize_heatmap : tuple

The length and breadth for the frame of the heatmap.

figsize_histogram : tuple

The length and breadth for the frame of the histogram.

figsize_barchart : tuple

The length and breadth for the frame of the bar chart.

before_data_cleaning : bool, default True

If True, visualizes data before cleaning. If False, visualizes cleaned data.

Returns

None

The method generates and displays various visualizations based on the data distribution and correlation.

Notes

This method utilizes the following libraries for visualization:

  • matplotlib.pyplot for creating histograms and heatmaps.

  • seaborn for creating count plots and box plots.

Examples

>>> # Generate EDA visualizations before data cleaning
>>> df = SupervisedLearning(dataset)
>>> df.eda_visual(before_data_cleaning=True)
>>> # Generate EDA visualizations after data cleaning
>>> df.eda_visual(before_data_cleaning=False)
extract_date_features(datetime_column, hrs_mins_sec: bool = False)[source]

Extracts date-related features from a datetime column.

Parameters

datetime_column : str, list, or tuple

The name of the datetime column or a list/tuple of datetime columns.

hrs_mins_sec : bool, default False

Flag indicating whether to include hour, minute, and second features.

Returns

DataFrame

A DataFrame with additional columns containing extracted date features.

Notes

  • This method extracts date-related features such as day, month, year, quarter, and day of the week from the specified datetime column(s).

  • If hrs_mins_sec is set to True, it also includes hour, minute, and second features.

Examples

>>> # Create a supervised learning instance and extract date features
>>> model = SupervisedLearning(dataset)
>>> date_columns = ['DateOfBirth', 'TransactionDate']
>>> model.extract_date_features(date_columns, hrs_mins_sec=True)
>>>
>>> # Access the DataFrame with additional date-related columns
>>> original_data, processed_data = model.get_dataset()

See Also

  • pandas.Series.dt : Accessor object for datetime properties.

  • sklearn.preprocessing.OneHotEncoder : Encode categorical integer features using a one-hot encoding.

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

  • matplotlib.pyplot : Plotting library for creating visualizations.

filter_data(column: str, operation: str = None, value: int = None)[source]

Filter data based on specified conditions.

Parameters

column : str or list or tuple

The column or columns to filter.

operation : str or list or tuple, optional

The operation or list of operations to perform. Supported operations: 'greater than', 'less than', 'equal to', 'greater than or equal to', 'less than or equal to', 'not equal to', '>', '<', '==', '>=', '<=', '!='. Default is None.

value : int or float or str or list or tuple, optional

The value or list of values to compare against. Default is None.

Returns

pandas.DataFrame

The filtered DataFrame.

Raises

TypeError

If input parameters are invalid or inconsistent.

Example

>>> # Create a supervised learning instance and sort the dataset
>>> data = SupervisedLearning(dataset)
>>>
>>> # Filter data where 'column' is greater than 5
>>> filter_data = data.filter_data(column='column', 
>>>                                operation='>', 
>>>                                value=5)
>>>
>>> # Filter data where 'column1' is less than or equal to 10 and 'column2' is not equal to 'value'
>>> filter_data = data.filter_data(column=['column1', 'column2'], 
>>>                                operation=['<=', '!='], 
>>>                                value=[10, 'value'])

fix_missing_values(strategy: str = None)[source]

Fix missing values in the dataset.

Parameters

strategy : str, optional

The strategy to use for imputation. If not specified, it defaults to “mean”. Options: “mean”, “median”, “mode”.

Returns

pd.DataFrame

The dataset with missing values imputed.

Notes

This method uses the sklearn.impute library for handling missing values.

See Also

sklearn.impute.SimpleImputer : Imputation transformer for completing missing values.

Examples

>>> # Fix missing values using the default strategy ("mean")
>>> df = SupervisedLearning(dataset)
>>> df.fix_missing_values()
>>> # Fix missing values using a specific strategy (e.g., "median")
>>> df.fix_missing_values(strategy="median")
fix_unbalanced_dataset(sampler: str, k_neighbors: int = None, random_state: int = None)[source]

Apply techniques to address class imbalance in the dataset.

Parameters

sampler : str

The resampling technique. Options: "SMOTE", "RandomOverSampler", "RandomUnderSampler".

k_neighbors : int, optional

The number of nearest neighbors to use in the SMOTE algorithm.

random_state : int, optional

Seed for reproducibility.

Returns

dict

A dictionary containing the resampled training data.

Raises

TypeError

If k_neighbors is specified for a sampler other than "SMOTE".

Notes

  • This method addresses class imbalance in the dataset using various resampling techniques.

  • Supported samplers include SMOTE, RandomOverSampler, and RandomUnderSampler.

Examples

>>> # Create a supervised learning instance and fix unbalanced dataset
>>> model = SupervisedLearning(dataset)
>>> model.fix_unbalanced_dataset(sampler="SMOTE", k_neighbors=5)

See Also

  • imblearn.over_sampling.SMOTE : Synthetic Minority Over-sampling Technique.

  • imblearn.over_sampling.RandomOverSampler : Random over-sampling.

  • imblearn.under_sampling.RandomUnderSampler : Random under-sampling.

  • sklearn.impute.SimpleImputer : Simple imputation for handling missing values.

get_bestK_KNNclassifier(weight='uniform', algorithm='auto', metric='minkowski', max_k_range: int = 31, figsize: tuple = (15, 10))[source]

Find the best value of k for K-Nearest Neighbors (KNN) classifier.

Parameters

weight : str, default 'uniform'

Weight function used in prediction. Possible values: 'uniform' or 'distance'.

algorithm : str, default 'auto'

Algorithm used to compute the nearest neighbors. Possible values: 'auto', 'ball_tree', 'kd_tree', 'brute'.

metric : str, default 'minkowski'

Distance metric for the tree. Refer to the documentation of sklearn.neighbors.DistanceMetric for more options.

max_k_range : int, default 31

Maximum range of k values to consider.

figsize : tuple

A tuple containing the frame length and breadth for the graph to be plotted.

Returns

Int

An integer indicating the best k value for the KNN Classifier.

Raises

TypeError

If invalid values are provided for ‘algorithm’ or ‘weight’.

Notes

This method evaluates the KNN classifier with different values of k and plots a graph to help identify the best k. The best k-value is determined based on the highest accuracy score.

Examples

>>> data = SupervisedLearning(dataset)
>>> data.get_bestK_KNNclassifier(weight='distance', 
>>>                              algorithm='kd_tree')

See Also

  • sklearn.neighbors.KNeighborsClassifier : K-nearest neighbors classifier.

get_bestK_KNNregressor(weight='uniform', algorithm='auto', metric='minkowski', max_k_range: int = 31, figsize: tuple = (15, 10))[source]

Find the best value of k for K-Nearest Neighbors (KNN) regressor.

Parameters

weight : str, default 'uniform'

Weight function used in prediction. Possible values: 'uniform' or 'distance'.

algorithm : str, default 'auto'

Algorithm used to compute the nearest neighbors. Possible values: 'auto', 'ball_tree', 'kd_tree', 'brute'.

metric : str, default 'minkowski'

Distance metric for the tree. Refer to the documentation of sklearn.neighbors.DistanceMetric for more options.

max_k_range : int, default 31

Maximum range of k values to consider.

figsize : tuple

A tuple containing the frame length and breadth for the graph to be plotted.

Returns

Int

An integer indicating the best k value for the KNN Regressor.

Raises

TypeError

If invalid values are provided for ‘algorithm’ or ‘weight’.

Notes

This method evaluates the KNN regressor with different values of k and plots a graph to help identify the best k. The best k-value is determined based on the highest R-squared score.

Examples

>>> data = SupervisedLearning(dataset)
>>> data.get_bestK_KNNregressor(weight='distance', algorithm='kd_tree')

See Also

  • sklearn.neighbors.KNeighborsRegressor : K-nearest neighbors regressor.

get_dataset()[source]

Retrieve the original dataset and the processed data.

Returns

Tuple

A tuple containing the original dataset and the processed data.

Notes

This method provides access to both the original and processed datasets.

See Also

pandas.DataFrame : Data structure for handling tabular data.

Examples

>>> # Get the original and processed datasets
>>> df = SupervisedLearning(dataset)
>>> original_data, processed_data = df.get_dataset()
get_training_test_data()[source]

Get the training and test data splits.

Returns

Tuple

A tuple containing X_train, X_test, y_train, and y_test.

Notes

This method uses the sklearn.model_selection library for splitting the data into training and test sets.

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

Examples

>>> # Get training and test data splits
>>> df = SupervisedLearning(dataset)
>>> X_train, X_test, y_train, y_test = df.get_training_test_data()
group_data(columns: list, column_to_groupby: str, aggregate_function: str, reset_index: bool = False, inplace: bool = False)[source]

Group data by specified columns and apply an aggregate function.

Parameters

columns : list or tuple

Columns to be grouped and aggregated.

column_to_groupby : str or list or tuple

Column or columns to be used for grouping.

aggregate_function : str

The aggregate function to apply (e.g., 'mean', 'count', 'min', 'max', 'std', 'var', 'median').

reset_index : bool, default False

Whether to reset the index after grouping.

inplace : bool, default False

Replace the original dataset with this groupby operation.

Returns

DataFrame

Grouped and aggregated data.

Raises

TypeError

If the column types or aggregate function are not recognized.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>>
>>> # Group data by 'Category' and calculate the mean for 'Value'
>>> grouped_data = model.group_data(columns=['Value'], 
>>>                                 column_to_groupby='Category', 
>>>                                 aggregate_function='mean')
>>>
>>> print(grouped_data)

See Also

  • pandas.DataFrame.groupby : Group DataFrame using a mapper or by a Series of columns.

  • pandas.DataFrame.agg : Aggregate using one or more operations over the specified axis.

load_large_dataset(dataset)[source]

Load a large dataset using the Datatable library.

Parameters

dataset : str

The path or URL of the dataset.

Returns

DataFrame

Pandas DataFrame containing the loaded data.

See Also

  • datatable.fread : Read a DataTable from a file.
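
Examples

A minimal usage sketch, assuming an existing SupervisedLearning instance; the file path is an illustrative placeholder.

>>> model = SupervisedLearning(dataset)
>>> large_data = model.load_large_dataset("large_file.csv")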

numerical_to_categorical(column)[source]

Convert numerical columns to categorical in the dataset.

Parameters

column : str, list, or tuple

The name of the column or a list/tuple of column names to be converted.

Returns

DataFrame

A new DataFrame with specified numerical columns converted to categorical.

Notes

  • This method converts numerical columns in the dataset to categorical type.

  • It is useful when dealing with features that represent categories or labels but are encoded as numerical values.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> data = SupervisedLearning(dataset)
>>>
>>> # Convert a single numerical column to categorical
>>> data.numerical_to_categorical("numeric_column")
>>>
>>> # Convert multiple numerical columns to categorical
>>> data.numerical_to_categorical(["numeric_col1", "numeric_col2"])

pandas_profiling(output_file: str = 'Pandas Profile Report.html', dark_mode: bool = False, title: str = 'Report')[source]

Generate a Pandas profiling report for the dataset.

Parameters

output_file : str, default "Pandas Profile Report.html"

The name of the HTML file to save the Pandas profiling report.

dark_mode : bool, default False

If True, use a dark mode theme for the generated report.

title : str, default "Report"

The title of the Pandas profiling report.

Returns

None

See Also

  • pandas_profiling.ProfileReport : Generate a profile report from a DataFrame.
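
Examples

A minimal usage sketch; the output file name and report title are illustrative.

>>> model = SupervisedLearning(dataset)
>>> model.pandas_profiling(output_file = "EDA Report.html", dark_mode = True, title = "EDA Report")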

poly_get_optimal_degree(max_degree: int = 10, whole_dataset: bool = False, test_size: float = 0.2, random_state: int = 0, include_bias: bool = True, cross_validation: bool = False, kfold: int = 5)[source]

This method is designed to determine the optimal degree for polynomial regression. It evaluates the performance of polynomial regression models with degrees ranging from 1 to a specified maximum degree. The evaluation includes training and testing the models, as well as optional cross-validation metrics.

Parameters

max_degree : int, optional, default=10

The maximum degree of the polynomial to evaluate.

whole_dataset : bool, optional, default=False

If True, the model is trained on the entire dataset without splitting into training and testing sets.

test_size : float, optional, default=0.2

The proportion of the dataset to include in the test split if not using the entire dataset.

random_state : int, optional, default=0

Seed for the random number generator.

include_bias : bool, optional, default=True

Whether to include a bias column in the polynomial features.

cross_validation : bool, optional, default=False

If True, includes cross-validation metrics in the output.

kfold : int, optional, default=5

Number of folds for cross-validation.

Returns

  • If cross_validation is False:

    DataFrame: A DataFrame containing metrics for each degree, including training R2, training RMSE, test R2, and test RMSE.

  • If cross_validation is True:

    Dictionary: A dictionary containing two keys:

    • "Degree Metrics": A DataFrame with metrics for each degree, including training R2, training RMSE, test R2, test RMSE, cross-validation mean, and cross-validation standard deviation.

    • "Cross Validation Info": An array containing cross-validation scores.

Example

>>> # Import Libraries
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from buildml import SupervisedLearning
>>> 
>>> # Get the Dataset
>>> dataset = pd.read_csv("Your dataset")
>>> 
>>> # Using BuildML
>>> automate = SupervisedLearning(dataset)
>>> 
>>> # EDA
>>> eda = automate.eda()
>>> 
>>> # Further Data Preparation and Segregation
>>> select_variables = automate.select_dependent_and_independent(predict = "Salary")
>>> best_degree = automate.poly_get_optimal_degree(max_degree=5, 
>>>                                                whole_dataset=False, 
>>>                                                test_size=0.2, 
>>>                                                random_state=42, 
>>>                                                include_bias=True, 
>>>                                                cross_validation=True)

Notes

  • Cross-validation scores are only available in the output when cross_validation is True.

  • This method uses polynomial regression models and linear regression as the base algorithm.

  • The output provides insights into model performance with various degrees, aiding in selecting the optimal degree for polynomial regression.

polyreg_graph(title: str, xlabel: str, ylabel: str, figsize: tuple = (15, 10), line_style: str = 'dashed', line_width: float = 2, line_marker: str = 'o', line_marker_size: float = 12, train_color_marker: str = 'red', test_color_marker: str = 'red', line_color: str = 'green', size_train_marker: float = 10, size_test_marker: float = 10, whole_dataset: bool = False)[source]

Generate a polynomial regression graph for visualization.

Parameters

title : str

The title of the graph.

xlabel : str

A title for the x-axis.

ylabel : str

A title for the y-axis.

figsize : tuple, optional, default: (15, 10)

The size (length, breadth) of the figure frame in which the graph is plotted.

line_style : str, optional, default: "dashed"

Style of the regression line ("solid", "dashed", "dashdot", etc.).

line_width : float, optional, default: 2

Width of the regression line.

line_marker : str, optional, default: "o"

Marker style for data points on the regression line.

line_marker_size : float, optional, default: 12

Size of the marker on the regression line.

train_color_marker : str, optional, default: "red"

Color of markers for training data.

test_color_marker : str, optional, default: "red"

Color of markers for test data.

line_color : str, optional, default: "green"

Color of the regression line.

size_train_marker : float, optional, default: 10

Size of markers for training data.

size_test_marker : float, optional, default: 10

Size of markers for test data.

whole_dataset : bool, optional, default: False

If True, visualize the regression line on the entire dataset. If False, visualize on training and test datasets separately.

Returns

None

Displays the polynomial regression graph.

See Also

  • matplotlib.pyplot.scatter: Plot a scatter plot using Matplotlib.

  • matplotlib.pyplot.plot: Plot lines and/or markers using Matplotlib.

  • numpy: Fundamental package for scientific computing with Python.

  • scikit-learn: Simple and efficient tools for predictive data analysis.

Example

>>> import pandas as pd
>>> import numpy as np
>>> from buildml import SupervisedLearning
>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Get the Dataset
>>> dataset = pd.read_csv("Your dataset/path")
>>> 
>>> # Assuming `automate` is an instance of the SupervisedLearning class
>>> automate = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>>
>>> # Further Data Preparation and Segregation
>>> select_variables = automate.select_dependent_and_independent(predict = "Salary")
>>> poly_x = automate.polyreg_x(degree = 5)
>>>
>>> # Model Building
>>> training = automate.train_model_regressor(regressor)
>>> prediction = automate.regressor_predict()
>>> evaluation = automate.regressor_evaluation()
>>> poly_reg = automate.polyreg_graph(title = "Analyzing salary across different levels",  
>>>                                   xlabel = "Levels", 
>>>                                   ylabel = "Salary", 
>>>                                   whole_dataset = True, 
>>>                                   line_marker = None, 
>>>                                   line_style = "solid")
polyreg_x(degree: int, include_bias: bool = False)[source]

Polynomial Regression Feature Expansion.

This method performs polynomial regression feature expansion on the independent variables (features). It uses scikit-learn’s PolynomialFeatures to generate polynomial and interaction features up to a specified degree.

Parameters

degree : int

The degree of the polynomial features.

include_bias : bool, optional, default=False

If True, the polynomial features include a bias column (intercept).

Returns

pd.DataFrame

DataFrame with polynomial features.

Notes

This method utilizes scikit-learn’s PolynomialFeatures for feature expansion.

Examples

>>> from buildml import SupervisedLearning
>>> model = SupervisedLearning(dataset)
>>> model.polyreg_x(degree=2, include_bias=True)

reduce_data_memory_useage(verbose: bool = True)[source]

Reduce memory usage of the dataset by converting data types.

Parameters

verbose : bool, default True

If True, print information about the memory reduction.

Returns

DataFrame

Pandas DataFrame with reduced memory usage.

See Also

  • pandas.DataFrame.memory_usage : Return the memory usage of each column.
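
Examples

A minimal usage sketch; the method downcasts column data types on the working dataset to reduce its memory footprint.

>>> model = SupervisedLearning(dataset)
>>> compact_data = model.reduce_data_memory_useage(verbose = True)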

regressor_evaluation(kfold: int = None, cross_validation: bool = False)[source]

Evaluate the performance of the regression model.

Parameters

kfold : int, optional

Number of folds for cross-validation. If provided, cross-validation will be performed.

cross_validation : bool, default False

If True, perform cross-validation; otherwise, perform a simple train-test split evaluation.

Returns

dict

Dictionary containing evaluation metrics.

Raises

TypeError

If an invalid combination of parameters is provided.

Notes

This method uses the sklearn.metrics and sklearn.model_selection libraries for regression evaluation.

Examples

>>> # Evaluate regression model performance using simple train-test split
>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)
>>>
>>> # Predict for regression model
>>> predictions = df.regressor_predict()
>>> evaluation_results = df.regressor_evaluation()
>>> # Evaluate regression model performance using 10-fold cross-validation
>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)
>>>
>>> # Predict for regression model
>>> predictions = df.regressor_predict()
>>> evaluation_results = df.regressor_evaluation(kfold=10, cross_validation=True)

See Also

  • sklearn.metrics.r2_score : R-squared (coefficient of determination) regression score function.

  • sklearn.metrics.mean_squared_error : Mean squared error regression loss.

  • sklearn.model_selection.cross_val_score : Evaluate a score by cross-validation.

regressor_model_testing(variables_values: list, scaling: bool = False)[source]

Test the trained regressor model with given input variables.

Parameters

variables_values : list

A list containing values for each independent variable.

scaling : bool, default False

Whether to scale the input variables. If True, the method expects scaled input using the same scaler used during training.

Returns

np.ndarray

The predicted values from the regressor model.

Raises

AssertionError

If the problem type is not regression.

Notes

  • This method tests a pre-trained regressor model.

  • If scaling is set to True, the input variables are expected to be scaled using the same scaler used during training.

Examples

>>> # Assuming df is an instance of SupervisedLearning class with a trained regressor model
>>> df.regressor_model_testing([1.5, 0.7, 2.0], scaling=True)
array([42.0])

See Also

  • sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

regressor_predict()[source]

Predict the target variable for regression models.

Returns

Dict[str, np.ndarray]

A dictionary containing actual training and test targets along with predicted values, or None if the model is set for classification.

Raises

AssertionError

If the training phase of the model is set to classification, since a regression prediction cannot be made with a classification model.

Notes

This method uses the sklearn library for regression model prediction.

Examples

>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)
>>>
>>> # Predict for regression model
>>> predictions = df.regressor_predict()
>>>
>>> print(predictions)
{'Actual Training Y': array([...]), 'Actual Test Y': array([...]),
 'Predicted Training Y': array([...]), 'Predicted Test Y': array([...])}

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

remove_duplicates(which_columns: str = None)[source]

Remove duplicate rows from the dataset based on specified columns.

Parameters

which_columns : str or list or tuple, optional

Column(s) to consider when identifying duplicate rows. If specified, the method will drop rows that have the same values in the specified column(s).

Returns

DataFrame

A new DataFrame with duplicate rows removed.

Raises

TypeError

If which_columns is not a valid string, list, or tuple.

Notes

  • This method removes duplicate rows from the dataset based on the specified column(s).

  • If no columns are specified, it considers all columns when identifying duplicates.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>> # Remove duplicate rows based on a specific column
>>> model.remove_duplicates(which_columns='column_name')

See Also

  • pandas.DataFrame.drop_duplicates : Drop duplicate rows.

  • pandas.DataFrame : Pandas DataFrame class for handling tabular data.

  • ydata_profiling : Data profiling library for understanding and analyzing datasets.

  • sweetviz : Visualize and compare datasets for exploratory data analysis.

remove_outlier(drop_na: bool)[source]

Remove outliers from the dataset.

This method uses the sklearn.preprocessing library for standard scaling and outlier removal.

Parameters

drop_na : bool

If False, outliers are replaced with NaN values. If True, rows with NaN values are dropped.

Returns

pd.DataFrame

The dataset with outliers removed.

Notes

The method applies standard scaling using sklearn.preprocessing.StandardScaler and removes outliers based on the range of -3 to 3 standard deviations.

See Also

sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

Examples

>>> # Remove outliers, replace with NaN
>>> df = SupervisedLearning(dataset)
>>> df.remove_outlier(drop_na=False)
>>> # Remove outliers and drop rows with NaN values
>>> df.remove_outlier(drop_na=True)
rename_columns(old_column: str, new_column: str)[source]

Rename columns in the dataset.

Parameters

old_column : str or list

The old column name(s) to be renamed.

new_column : str or list

The new column name(s).

Returns

pd.DataFrame

A dataframe containing the modified dataset with the column name(s) changed.

Examples

>>> # Create a supervised learning instance and rename columns
>>> data = SupervisedLearning(dataset)
>>> renamed_data = data.rename_columns("old_column_name", "new_column_name")
>>> print(renamed_data)

replace_values(replace: int, new_value: int)[source]

Replace specified values in the dataset.

Parameters

replace : int or float or str or list or tuple or dict

The value or set of values to be replaced.

new_value : int or float or str or list or tuple

The new value or set of values to replace the existing ones.

Returns

pd.DataFrame

A dataframe containing the modified dataset with replaced values.

Raises

TypeError

  • If replace is a string, integer, or float, and new_value is not a string, integer, or float.

  • If replace is a list or tuple, and new_value is not a string, integer, float, list, or tuple.

  • If replace is a dictionary, and new_value is not specified.

Notes

This method replaces specified values in the dataset. The replacement can be done for a single value, a list of values, or using a dictionary for multiple replacements.

Examples

>>> # Create a supervised learning instance and replace values
>>> data = SupervisedLearning(dataset)
>>>
>>> # Replace a single value
>>> replaced_values = data.replace_values(0, -1)
>>>
>>> # Replace multiple values using a dictionary
>>> replaced_values = data.replace_values({'Male': 1, 'Female': 0})

reset_index(drop_index_after_reset: bool = False)[source]

Reset the index of the dataset.

Parameters

drop_index_after_reset : bool, optional

Whether to drop the old index after resetting. Default is False.

Returns

pd.DataFrame

A dataframe containing the modified dataset with the index reset.

Examples

>>> # Create a supervised learning instance and reset the index
>>> data = SupervisedLearning(dataset)
>>> reset_index_data = data.reset_index(drop_index_after_reset=True)
>>> print(reset_index_data)

scale_independent_variables()[source]

Standardize independent variables using sklearn.preprocessing.StandardScaler.

Returns

pd.DataFrame

A DataFrame with scaled independent variables.

Notes

This method uses the sklearn.preprocessing library for standardization.

Examples

>>> # Create an instance of SupervisedLearning
>>> df = SupervisedLearning(dataset)
>>> # Scale independent variables
>>> scaled_data = df.scale_independent_variables()

See Also

sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

select_datatype(datatype_to_select: str = None, datatype_to_exclude: str = None, inplace: bool = False)[source]

Select columns of specific data types from the dataset.

Parameters

datatype_to_select : str, optional

Data type(s) to include. All data types are included by default.

datatype_to_exclude : str, optional

Data type(s) to exclude. None are excluded by default.

inplace : bool, default False

Replace the original dataset with the result of this operation.

Returns

DataFrame

Subset of the dataset containing columns of the specified data types.

See Also

  • pandas.DataFrame.select_dtypes : Select columns based on data type.
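
Examples

A minimal usage sketch; the dtype strings follow pandas.DataFrame.select_dtypes conventions.

>>> model = SupervisedLearning(dataset)
>>> # Keep only numeric columns
>>> numeric_data = model.select_datatype(datatype_to_select = "number")
>>> # Exclude object (string) columns instead
>>> non_object_data = model.select_datatype(datatype_to_exclude = "object")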

select_dependent_and_independent(predict: str)[source]

Select the dependent and independent variables for the supervised learning model.

Parameters

predict : str

The name of the column to be used as the dependent variable.

Returns

Dict

A dictionary containing the dependent variable and independent variables.

Notes

This method uses the pandas library for data manipulation.

Examples

>>> # Select dependent and independent variables
>>> df = SupervisedLearning(dataset)
>>> variables = df.select_dependent_and_independent("target_column")

See Also

  • pandas.DataFrame.drop : Drop specified labels from rows or columns.

  • pandas.Series : One-dimensional ndarray with axis labels.

select_features(strategy: str, estimator, number_of_features: int)[source]

Select features using different techniques.

Parameters

strategystr

The feature selection strategy. Options include “rfe”, “selectkbest”, “selectfrommodel”, and “selectpercentile”.

estimator

The estimator or score function used for feature selection.

number_of_featuresint

The number of features to select.

Returns

DataFrame or dict

DataFrame with selected features or a dictionary with selected features and selection metrics.

Raises

TypeError

If the strategy or estimator is not recognized.

Notes

  • This method allows feature selection using different techniques such as Recursive Feature Elimination (RFE), SelectKBest, SelectFromModel, and SelectPercentile.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>>
>>> # Select features using Recursive Feature Elimination (RFE)
>>> selected_features = model.select_features(strategy='rfe', 
>>>                                           estimator=RandomForestRegressor(), 
>>>                                           number_of_features=5)
>>>
>>> print(selected_features)

See Also

  • sklearn.feature_selection.RFE : Recursive Feature Elimination.

  • sklearn.feature_selection.SelectKBest : Select features based on k highest scores.

  • sklearn.feature_selection.SelectFromModel : Feature selection using an external estimator.

  • sklearn.feature_selection.SelectPercentile : Select features based on a percentile of the highest scores.

set_index(column: str)[source]

Set the index of the dataset.

Parameters

columnstr or list

The column(s) to set as the index.

Returns

pd.DataFrame

The dataset with the index set to the specified column or columns.

Examples

>>> # Create a supervised learning instance and set the index
>>> data = SupervisedLearning(dataset)
>>> index_set_data = data.set_index("column_name")
>>> print(index_set_data)

simple_linregres_graph(regressor, title: str, xlabel: str, ylabel: str, figsize: tuple = (15, 10), line_style: str = 'dashed', line_width: float = 2, line_marker: str = 'o', line_marker_size: float = 12, train_color_marker: str = 'red', test_color_marker: str = 'red', line_color: str = 'green', size_train_marker: float = 10, size_test_marker: float = 10, whole_dataset: bool = False)[source]

Generate a simple linear regression graph with optional visualization of training and test datasets.

Parameters:

regressor (object or list):

Single or list of regression models (e.g., sklearn.linear_model.LinearRegression) to visualize.

title (str):

Title of the graph.

xlabelstr

A title for the x-axis.

ylabelstr

A title for the y-axis.

figsizetuple, optional, default: (15, 10)

The size (length, breadth) of the figure frame for the plot.

line_stylestr, optional, default: “dashed”

Style of the regression line (“solid”, “dashed”, “dashdot”, etc.).

line_width (float, optional):

Width of the regression line. Default is 2.

line_marker (str, optional):

Marker style for data points on the regression line. Default is “o”.

line_marker_size (float, optional):

Size of the marker for data points on the regression line. Default is 12.

train_color_marker (str, optional):

Color of markers for the training dataset. Default is “red”.

test_color_marker (str, optional):

Color of markers for the test dataset. Default is “red”.

line_color (str, optional):

Color of the regression line. Default is “green”.

size_train_marker (float, optional):

Size of markers for the training dataset. Default is 10.

size_test_marker (float, optional):

Size of markers for the test dataset. Default is 10.

whole_dataset (bool, optional):

If True, visualize the entire dataset with the regression line. If False, visualize training and test datasets separately. Default is False.

Returns:

None

Displays a simple linear regression graph.

Examples:

>>> # Example 1: Visualize a simple linear regression model
>>> model.simple_linregres_graph(regressor=LinearRegression(), 
>>>                              title="Simple Linear Regression",
>>>                              xlabel="Your x-axis title",
>>>                              ylabel="Your y-axis title")
>>> # Example 2: Visualize multiple linear regression models
>>> regressors = [LinearRegression(), Ridge(), Lasso()]
>>> model.simple_linregres_graph(regressor=regressors, 
>>>                              title="Analyzing Impact of Expenditure on Growth",
>>>                              xlabel="Expenditure",
>>>                              ylabel="Growth")

sort_index(column: str, ascending: bool = True, reset_index: bool = False)[source]

Sort the dataset based on the index.

Parameters

columnstr or list

The index column(s) to sort the dataset.

ascendingbool, optional

Whether to sort in ascending order. Default is True.

reset_indexbool, optional

Whether to reset the index after sorting. Default is False.

Returns

pd.DataFrame

A dataframe containing the sorted dataset.

Examples

>>> # Create a supervised learning instance and sort the dataset based on index
>>> data = SupervisedLearning(dataset)
>>> sorted_index_data = data.sort_index("index_column", 
>>>                                     ascending=False, 
>>>                                     reset_index=True)
>>> print(sorted_index_data)

sort_values(column: str, ascending: bool = True, reset_index: bool = False)[source]

Sort the dataset based on specified columns.

Parameters

columnstr or list

The column(s) to sort the dataset.

ascendingbool, optional

Whether to sort in ascending order. Default is True.

reset_indexbool, optional

Whether to reset the index after sorting. Default is False.

Returns

pd.DataFrame

The dataset sorted according to the specified column or columns.

Examples

>>> # Create a supervised learning instance and sort the dataset
>>> data = SupervisedLearning(dataset)
>>> sorted_data = data.sort_values("column_name", 
>>>                                ascending=False, 
>>>                                reset_index=True)
>>> print(sorted_data)

split_data(test_size: float = 0.2)[source]

Split the dataset into training and test sets.

Parameters

test_size: float, optional, default=0.2

The proportion of the dataset to include in the test split.

Returns

Dict

A dictionary containing the training and test sets for independent (X) and dependent (y) variables.

Notes

This method uses the sklearn.model_selection.train_test_split function for data splitting.

Examples

>>> # Split the data into training and test sets
>>> df = SupervisedLearning(dataset)
>>> data_splits = df.split_data()
>>>
>>> X_train = data_splits["Training X"]
>>> X_test = data_splits["Test X"]
>>> y_train = data_splits["Training Y"]
>>> y_test = data_splits["Test Y"]

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

sweetviz_profile_report(filename: str = 'Pandas Profile Report.html', auto_open: bool = False)[source]

Generate a Sweetviz profile report for the dataset.

Parameters

filenamestr, default “Pandas Profile Report.html”

The name of the HTML file to save the Sweetviz report.

auto_openbool, default False

If True, open the generated HTML report in a web browser.

Returns

None
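
Examples

A minimal sketch; the file name is illustrative:

>>> # Create a supervised learning instance and generate a Sweetviz report
>>> data = SupervisedLearning(dataset)
>>> data.sweetviz_profile_report(filename="Sweetviz Report.html", auto_open=True)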

See Also

  • sweetviz.analyze : Generate and analyze a Sweetviz data comparison report.

train_model_classifier(classifier)[source]

Train a classifier on the provided data.

Parameters

classifierAny

The classifier object to be trained.

Returns

Any

The trained classifier.

Notes

This method uses the sklearn.model_selection and sklearn.metrics libraries for training and evaluating the classifier.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # Train a classifier
>>> df = SupervisedLearning(dataset)
>>> classifier = RandomForestClassifier(random_state = 0)
>>> trained_classifier = df.train_model_classifier(classifier)

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

  • sklearn.metrics.accuracy_score : Accuracy classification score.

train_model_regressor(regressor)[source]

Train a regressor model.

Parameters

regressorAny

A regressor model object compatible with scikit-learn’s regressor interface.

Returns

Any

The trained regressor model.

Notes

  • This method uses the sklearn.model_selection and sklearn.metrics libraries for training and evaluation.

  • All required preprocessing steps (such as selecting the dependent and independent variables and splitting the data) should be completed before calling this method.

Examples

>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

  • sklearn.metrics.r2_score : R^2 (coefficient of determination) regression score function.

unique_elements_in_columns(count: bool = False)[source]

Extracts unique elements in each column of the dataset.

This method generates a DataFrame containing unique elements in each column. If specified, it can also provide the count of unique elements in each column.

Parameters

countbool, optional, default=False

If True, returns the count of unique elements in each column.

Returns

pd.DataFrame or pd.Series

If count is False, a DataFrame with unique elements. If count is True, a Series with counts.

Notes

This method utilizes Pandas for extracting unique elements and their counts.

Examples

>>> from buildml import SupervisedLearning
>>> model = SupervisedLearning(dataset)
>>> unique_elements = model.unique_elements_in_columns(count=True)

buildml.date_features module

Handles common date-related tasks in your data, such as extracting date features and converting categorical columns to datetime.

buildml.date_features.categorical_to_datetime(data, column)[source]
buildml.date_features.extract_date_features(data, datetime_column, hrs_mins_sec: bool = False)[source]
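
A minimal usage sketch of these functions, assuming a DataFrame with a date column stored as strings and that each function returns the transformed DataFrame:

>>> import pandas as pd
>>> from buildml.date_features import categorical_to_datetime, extract_date_features
>>>
>>> df = pd.read_csv("Your_file_path")
>>> df = categorical_to_datetime(df, "Date")  # "Date" is an illustrative column name
>>> df = extract_date_features(df, "Date", hrs_mins_sec=False)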

buildml.eda module

Perform Exploratory Data Analysis with the BuildML library.

buildml.eda.eda(data)[source]
buildml.eda.eda_visual(data, y: str, histogram_bins: int = 10, figsize_heatmap: tuple = (15, 10), figsize_histogram: tuple = (15, 10), figsize_barchart: tuple = (15, 10), before_data_cleaning: bool = True)[source]
buildml.eda.pandas_profiling(dataset, output_file: str = 'Pandas Profile Report.html', dark_mode: bool = False, title: str = 'Report')[source]
buildml.eda.sweet_viz(dataset, filename: str, auto_open: bool = False)[source]
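
A minimal usage sketch, assuming a loaded pandas DataFrame whose target column is named "Target" (the column name is illustrative):

>>> import pandas as pd
>>> from buildml.eda import eda, eda_visual
>>>
>>> df = pd.read_csv("Your_file_path")
>>> summary = eda(df)
>>> eda_visual(df, y="Target", histogram_bins=15)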

buildml.build_model module

Create multiple regressors and classifiers based on your approach to the problem. This module provides functions for data splitting, model training, prediction, evaluation, and more.

buildml.build_model.FindK_KNN_Classifier(x_train, y_train, weight='uniform', algorithm='auto', metric='minkowski', max_k_range: int = 31, warning: bool = False)[source]
buildml.build_model.FindK_KNN_Regressor(x_train, y_train, weight='uniform', algorithm='auto', metric='minkowski', max_k_range: int = 31, warning: bool = False)[source]
buildml.build_model.build_classifier_model(classifier, x_train, y_train, x_test, y_test, kfold: int = None, cross_validation: bool = False, warning: bool = False)[source]
buildml.build_model.build_multiple_classifiers(classifiers: list, x_train, y_train, x_test, y_test, kfold: int = None, cross_validation: bool = False, warning: bool = False)[source]
buildml.build_model.build_multiple_classifiers_from_features(x, y, classifiers: list, test_size: float, random_state: int, strategy: str, estimator: str, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False, warning: bool = False)[source]
buildml.build_model.build_multiple_regressors(regressors: list, x_train, y_train, x_test, y_test, kfold: int = None, cross_validation: bool = False, warning: bool = False)[source]
buildml.build_model.build_multiple_regressors_from_features(x, y, regressors: list, test_size: float, random_state: int, strategy: str, estimator: str, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False, warning: bool = False)[source]
buildml.build_model.build_regressor_model(regressor, x_train, y_train, x_test, y_test, kfold: int = None, cross_validation: bool = False, warning: bool = False)[source]
buildml.build_model.build_single_classifier_from_features(x, y, classifier, test_size: float, random_state: int, strategy: str, estimator: str, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False, warning: bool = False)[source]
buildml.build_model.build_single_regressor_from_features(x, y, regressor, test_size: float, random_state: int, strategy: str, estimator: str, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False, warning: bool = False)[source]
buildml.build_model.classifier_graph(classifier, x_train, y_train, cmap_train='viridis', cmap_test='viridis', size_train_marker: float = 10, size_test_marker: float = 10, x_test=None, y_test=None, resolution=100, plot_title='Decision Boundary', warning: bool = False)[source]
buildml.build_model.classifier_model_testing(classifier_model, variables_values: list, scaling: bool = False, warning: bool = False)[source]
buildml.build_model.regressor_model_testing(regressor_model, variables_values: list, scaling: bool = False, warning: bool = False)[source]
buildml.build_model.select_features(x, y, strategy: str, estimator: str, number_of_features: int, warning: bool = False)[source]
buildml.build_model.simple_linregres_graph(x, y, regressor, title: str, line_style: str = 'dashed', line_width: float = 2, line_marker: str = 'o', line_marker_size: float = 12, train_color_marker: str = 'red', test_color_marker: str = 'red', line_color: str = 'green', size_train_marker: float = 10, size_test_marker: float = 10, whole_dataset: bool = False, test_size: float = 0.2)[source]
buildml.build_model.split_data(x, y, test_size, random_state, warning: bool = False)[source]
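
A minimal usage sketch of this functional API, assuming x and y are already preprocessed arrays or DataFrames; the return order of split_data is an assumption here:

>>> from sklearn.linear_model import LinearRegression
>>> from buildml.build_model import split_data, build_regressor_model
>>>
>>> # Split, then train and evaluate a single regressor
>>> x_train, x_test, y_train, y_test = split_data(x, y, test_size=0.2, random_state=0)  # return order assumed
>>> results = build_regressor_model(LinearRegression(), x_train, y_train, x_test, y_test)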

buildml.output_dataset module

Export your pandas DataFrame as a CSV or Excel file.

buildml.output_dataset.output_dataset_as_csv(dataset, file_name: str, file_path: str = None)[source]
buildml.output_dataset.output_dataset_as_excel(dataset, file_name: str, file_path: str = None)[source]
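
A minimal usage sketch; the file name is illustrative, and whether the “.csv” extension is appended automatically is an assumption:

>>> import pandas as pd
>>> from buildml.output_dataset import output_dataset_as_csv
>>>
>>> df = pd.read_csv("Your_file_path")
>>> output_dataset_as_csv(df, file_name="cleaned_data")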

buildml.preprocessing module

BuildML’s preprocessing module for data cleaning, transformation, handling data types and more.

buildml.preprocessing.categorical_to_numerical(dataset, columns: list = None, warning: bool = False)[source]
buildml.preprocessing.column_binning(data, column, number_of_bins: int = 10, warning: bool = False)[source]
buildml.preprocessing.count_column_categories(dataset, column: str, reset_index: bool = False)[source]
buildml.preprocessing.drop_columns(dataset, columns: list, warning: bool = False)[source]
buildml.preprocessing.filter_data(dataset, column: str, operation: str = None, value: int = None)[source]
buildml.preprocessing.fix_missing_values(dataset, strategy: str = None, warning: bool = False)[source]
buildml.preprocessing.fix_unbalanced_dataset(x_train, y_train, sampler: str, k_neighbors: int = None, warning: bool = False)[source]
buildml.preprocessing.group_data(dataset, columns: list, column_to_groupby: str, aggregate_function: str, reset_index: bool = False)[source]
buildml.preprocessing.load_large_dataset(dataset: str)[source]
buildml.preprocessing.numerical_to_categorical(dataset, column, warning: bool = False)[source]
buildml.preprocessing.remove_duplicates(dataset, which_columns: str = None)[source]
buildml.preprocessing.remove_outlier(dataset, warning: bool = False)[source]
buildml.preprocessing.rename_columns(dataset, old_column: str, new_column: str)[source]
buildml.preprocessing.replace_values(dataset, replace: int, new_value: int)[source]
buildml.preprocessing.reset_index(dataset, drop_index_after_reset: bool = False)[source]
buildml.preprocessing.scale_independent_variables(x)[source]
buildml.preprocessing.select_datatype(dataset, datatype_to_select: str = None, datatype_to_exclude: str = None, warning: bool = False)[source]
buildml.preprocessing.set_index(dataset, column: str)[source]
buildml.preprocessing.sort_index(dataset, column: str, ascending: bool = True, reset_index: bool = False)[source]
buildml.preprocessing.sort_values(dataset, column: str, ascending: bool = True, reset_index: bool = False)[source]
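
A minimal usage sketch chaining a few of these helpers, assuming each function returns the transformed DataFrame; the column names are illustrative:

>>> import pandas as pd
>>> from buildml.preprocessing import fix_missing_values, drop_columns, categorical_to_numerical
>>>
>>> df = pd.read_csv("Your_file_path")
>>> df = fix_missing_values(df, strategy="mean")
>>> df = drop_columns(df, columns=["Unused1", "Unused2"])
>>> df = categorical_to_numerical(df)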

Module contents

Machine Learning Toolkit for Python

BuildML is a comprehensive Python toolkit designed to simplify and streamline the machine learning workflow. It provides a set of tools and utilities that cover various aspects of building machine learning models.

Key Features:

  • Data Exploration and Analysis: Perform exploratory data analysis to gain insights into your datasets.

  • Data Preprocessing and Cleaning: Easily handle data preprocessing and cleaning tasks to ensure high-quality input for your models.

  • Model Training and Prediction: Train machine learning models effortlessly and make predictions with ease.

  • Regression and Classification: Support for both regression and classification tasks to address diverse machine learning needs.

  • Supervised Learning: Built to support various supervised learning scenarios, making it versatile for different use cases.

  • Model Evaluation: Evaluate the performance of your models using comprehensive metrics.

BuildML is built on top of popular scientific Python packages such as numpy, scipy, and matplotlib, ensuring seamless integration with the broader Python ecosystem.

Visit our documentation at https://buildml.readthedocs.io/ for detailed information on how to use BuildML and unleash the power of machine learning in your projects.

class buildml.SupervisedLearning(dataset, show_warnings: bool = False)[source]

Bases: object

Automated Supervised Learning module designed for end-to-end data handling, preprocessing, model development, and evaluation in the context of supervised machine learning.

Parameters

datasetpd.DataFrame

The input dataset for supervised learning.

show_warningsbool, optional

If True, display warnings. Default is False.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.svm import SVC
>>> from buildml import SupervisedLearning
>>>
>>>
>>> dataset = pd.read_csv("Your_file_path")  # Load your dataset (e.g., a pandas DataFrame)
>>> data = SupervisedLearning(dataset)
>>>
>>> # Exploratory Data Analysis
>>> eda = data.eda()
>>> 
>>> # Build and Evaluate Classifier
>>> classifiers = [LogisticRegression(random_state = 0), 
>>>                RandomForestClassifier(random_state = 0), 
>>>                DecisionTreeClassifier(random_state = 0), 
>>>                SVC()]
>>> build_model = data.build_multiple_classifiers(classifiers)

Notes

This class encapsulates various functionalities for data handling and model development. It leverages popular libraries such as pandas, numpy, matplotlib, seaborn, ydata_profiling, sweetviz, imbalanced-learn, scikit-learn, warnings, feature-engine, and datatable.

The workflow involves steps like loading and handling data, cleaning and manipulation, formatting and transformation, exploratory data analysis, feature engineering, data preprocessing, model building and evaluation, data aggregation and summarization, and data type handling.

build_multiple_classifiers(classifiers: list, kfold: int = None, cross_validation: bool = False, graph: bool = False, length: int = 20, width: int = 10, linestyle: str = 'dashed', marker: str = 'o', markersize: int = 12, fontweight: int = 80, fontstretch: int = 50)[source]

Build and evaluate multiple classifiers.

Parameters

classifierslist or tuple

A list or tuple of classifier objects to be trained and evaluated.

kfoldint, optional, default=None

Number of folds for cross-validation. If None, cross-validation is not performed.

cross_validationbool, optional, default=False

Perform cross-validation if True.

graphbool, optional, default=False

Whether to display performance metrics as graphs.

lengthint, optional, default=20

Length of the graph (if graph=True).

widthint, optional, default=10

Width of the graph (if graph=True).

Returns

dict

A dictionary containing classifier metrics and additional information.

Notes

This method builds and evaluates multiple classifiers on the provided dataset. It supports both traditional training/testing evaluation and cross-validation.

If graph is True, the method also displays graphs showing performance metrics for training and testing datasets.

Example:

>>> data = SupervisedLearning(dataset)
>>> classifiers = [LogisticRegression(random_state = 0), 
>>>                RandomForestClassifier(random_state = 0), 
>>>                SVC(random_state = 0)]
>>> results = data.build_multiple_classifiers(classifiers, 
>>>                                           kfold=5, 
>>>                                           cross_validation=True, 
>>>                                           graph=True, 
>>>                                           length=8, 
>>>                                           width=12)

Note: Ensure that the classifiers provided are compatible with scikit-learn’s classification API.

build_multiple_classifiers_from_features(strategy: str, estimator, classifiers: list, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build multiple classifiers using different feature selection strategies and machine learning algorithms.

This method performs feature selection, trains multiple classifiers, and evaluates their performance.

Parameters:

strategystr

Feature selection strategy. Should be one of ‘selectkbest’, ‘selectpercentile’, ‘rfe’, or ‘selectfrommodel’.

estimator

Estimator used for feature selection, applicable for the “rfe” and “selectfrommodel” strategies; set it to a classifier that implements ‘fit’. For the “selectkbest” and “selectpercentile” strategies, set it to a score function such as one of [“f_classif”, “chi2”, “mutual_info_classif”].

classifierslist or tuple

List of classifier instances to be trained and evaluated.

max_num_featuresint, optional

Maximum number of features to consider. If None, all features are considered.

min_num_featuresint, optional

Minimum number of features to consider. If None, the process starts with max_num_features and decreases the count until 1.

kfoldint, optional

Number of folds for cross-validation. If None, regular train-test split is used.

cvbool, optional

If True, perform cross-validation. If False, use a train-test split.

Returns:

dict: A dictionary containing feature metrics and additional information about the trained models.

Example:

>>> from buildml import SupervisedLearning
>>> from sklearn.feature_selection import f_classif
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>>
>>>
>>> data = SupervisedLearning(dataset)
>>> classifiers = [RandomForestClassifier(random_state = 0), 
>>>                DecisionTreeClassifier(random_state = 0)]
>>> result = data.build_multiple_classifiers_from_features(strategy='selectkbest', 
>>>                                                        estimator=f_classif, 
>>>                                                        classifiers=classifiers, 
>>>                                                        max_num_features=10, 
>>>                                                        kfold=5)
build_multiple_regressors(regressors: list, kfold: int = None, cross_validation: bool = False, graph: bool = False, length: int = 20, width: int = 10, linestyle: str = 'dashed', marker: str = 'o', markersize: int = 12, fontweight: int = 80, fontstretch: int = 50)[source]

Build, evaluate, and optionally graph multiple regression models.

This method facilitates the construction and assessment of multiple regression models using a variety of algorithms. It supports both single train-test split and k-fold cross-validation approaches. The generated models are evaluated based on key regression metrics, providing insights into their performance on both training and test datasets.

Parameters

regressorslist or tuple

List of regression models to build and evaluate.

kfoldint, optional

Number of folds for cross-validation. Default is None.

cross_validationbool, default False

If True, perform cross-validation; otherwise, use a simple train-test split.

graphbool, default False

If True, plot evaluation metrics for each regression model.

lengthint, optional, default=20

Length of the graph (if graph=True).

widthint, optional, default=10

Width of the graph (if graph=True).

Returns

dict

A dictionary containing regression metrics and additional information.

Notes

This method uses the following libraries:

  • sklearn.model_selection for train-test splitting and cross-validation.

  • matplotlib.pyplot for plotting if graph=True.

Examples

>>> # Build and evaluate multiple regression models
>>> df = SupervisedLearning(dataset)
>>> models = [LinearRegression(), 
>>>           RandomForestRegressor(), 
>>>           GradientBoostingRegressor()]
>>> results = df.build_multiple_regressors(regressors=models, 
>>>                                        kfold=5, 
>>>                                        cross_validation=True, 
>>>                                        graph=True)

See Also

  • SupervisedLearning.train_model_regressor : Train a single regression model.

  • SupervisedLearning.regressor_predict : Make predictions using a trained regression model.

  • SupervisedLearning.regressor_evaluation : Evaluate the performance of a regression model.

build_multiple_regressors_from_features(strategy: str, estimator, regressors: list, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build and evaluate multiple regression models with varying numbers of features.

Parameters:

strategystr

The feature selection strategy. Supported values are ‘selectkbest’, ‘selectpercentile’, ‘rfe’, and ‘selectfrommodel’.

estimator

Estimator used for feature selection, applicable for the “rfe” and “selectfrommodel” strategies; set it to a regressor that implements ‘fit’. For the “selectkbest” and “selectpercentile” strategies, set it to a score function such as one of [“f_regression”, “chi2”].

regressorslist or tuple

List of regression models to build and evaluate.

max_num_featuresint, optional

Maximum number of features to consider during the feature selection process.

min_num_featuresint, optional

Minimum number of features to consider during the feature selection process.

kfoldint, optional

Number of folds for cross-validation. If provided, cross-validation metrics will be calculated.

cvbool, default False

If True, perform cross-validation; otherwise, use a single train-test split.

Returns:

dict

A dictionary containing feature metrics and additional information for each model.

Example:

>>> from buildml import SupervisedLearning
>>> from sklearn.feature_selection import f_regression
>>> from sklearn.ensemble import RandomForestRegressor, DecisionTreeRegressor
>>> from sklearn.linear_model import LinearRegression
>>>
>>>
>>> data = SupervisedLearning(dataset)
>>> results = data.build_multiple_regressors_from_features(
>>>        strategy='selectkbest',
>>>        estimator=f_regression,
>>>        regressors=[LinearRegression(), 
>>>                    RandomForestRegressor(random_state = 0), 
>>>                    DecisionTreeRegressor(random_state = 0)],
>>>        max_num_features=10,
>>>        kfold=5,
>>>        cv=True
>>>        )

See Also:

  • sklearn.feature_selection.SelectKBest

  • sklearn.feature_selection.SelectPercentile

  • sklearn.feature_selection.RFE

  • sklearn.feature_selection.SelectFromModel

  • sklearn.linear_model

  • sklearn.ensemble.RandomForestRegressor

  • sklearn.model_selection.cross_val_score

  • sklearn.metrics.mean_squared_error

  • sklearn.metrics.r2_score

build_single_classifier_from_features(strategy: str, estimator, classifier, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build and evaluate a single classification model using feature selection.

Parameters

strategystr

Feature selection strategy. Should be one of [“selectkbest”, “selectpercentile”, “rfe”, “selectfrommodel”].

estimator

Estimator used for feature selection, applicable for the “rfe” and “selectfrommodel” strategies; set it to a classifier that implements ‘fit’. For the “selectkbest” and “selectpercentile” strategies, set it to a score function such as one of [“f_classif”, “chi2”, “mutual_info_classif”].

classifierobject

Classification model object to be trained.

max_num_featuresint, optional

Maximum number of features to consider, by default None.

min_num_featuresint, optional

Minimum number of features to consider, by default None.

kfoldint, optional

Number of folds for cross-validation, by default None. Takes effect only when cv is set to True.

cvbool, optional

Whether to perform cross-validation, by default False.

Returns

dict

A dictionary containing feature metrics and additional information about the models.

Notes

  • This method builds a classification model using feature selection techniques and evaluates its performance.

  • The feature selection strategies include “selectkbest”, “selectpercentile”, “rfe”, and “selectfrommodel”.

  • The estimator parameter is required for “rfe” and “selectfrommodel” strategies.

  • This method assumes that the dataset and labels are already set in the class instance.

See Also

  • sklearn.feature_selection for feature selection techniques.

  • sklearn.linear_model for classification models.

  • sklearn.model_selection for cross-validation techniques.

  • sklearn.metrics for classification performance metrics.

  • Other libraries used in this method: numpy, pandas, matplotlib, seaborn.

Example

>>> from sklearn.feature_selection import f_classif
>>> from buildml import SupervisedLearning
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> learn = SupervisedLearning(dataset)
>>> results = learn.build_single_classifier_from_features(strategy='selectkbest', 
>>>                                                       estimator=f_classif, 
>>>                                                       classifier=RandomForestClassifier(random_state = 0))
>>> print(results)
build_single_regressor_from_features(strategy: str, estimator, regressor, max_num_features: int = None, min_num_features: int = None, kfold: int = None, cv: bool = False)[source]

Build and evaluate a single regression model using feature selection.

Parameters

strategystr

Feature selection strategy. Should be one of [“selectkbest”, “selectpercentile”, “rfe”, “selectfrommodel”].

estimator

Estimator used for feature selection, applicable for the “rfe” and “selectfrommodel” strategies; set it to a regressor that implements ‘fit’. For the “selectkbest” and “selectpercentile” strategies, set it to a score function such as one of [“f_regression”, “f_oneway”, “chi2”].

regressorobject

Regression model object to be trained.

max_num_featuresint, optional

Maximum number of features to consider, by default None.

min_num_featuresint, optional

Minimum number of features to consider, by default None.

kfoldint, optional

Number of folds for cross-validation, by default None. Takes effect only when cv is set to True.

cvbool, optional

Whether to perform cross-validation, by default False.

Returns

dict

A dictionary containing feature metrics and additional information about the models.

Notes

  • This method builds a regression model using feature selection techniques and evaluates its performance.

  • The feature selection strategies include “selectkbest”, “selectpercentile”, “rfe”, and “selectfrommodel”.

  • The estimator parameter is required for “rfe” and “selectfrommodel” strategies.

  • This method assumes that the dataset and labels are already set in the class instance.

See Also

  • sklearn.feature_selection for feature selection techniques.

  • sklearn.linear_model for regression models.

  • sklearn.model_selection for cross-validation techniques.

  • sklearn.metrics for regression performance metrics.

  • Other libraries used in this method: numpy, pandas, matplotlib, seaborn, ydata_profiling, sweetviz, imblearn, sklearn, warnings, datatable.

Example

>>> from sklearn.feature_selection import f_regression
>>> from buildml import SupervisedLearning
>>> from sklearn.linear_model import LinearRegression
>>>
>>> learn = SupervisedLearning(dataset)
>>> results = learn.build_single_regressor_from_features(strategy='selectkbest', 
>>>                                                      estimator=f_regression, 
>>>                                                      regressor=LinearRegression())
>>> print(results)
categorical_to_datetime(column)[source]

Convert specified categorical columns to datetime format.

Parameters

columnstr, list, or tuple

The column or columns to be converted to datetime format.

Returns

DataFrame

The DataFrame with specified columns converted to datetime.

Notes

This method allows for the conversion of categorical columns containing date or time information to the datetime format.

Examples

>>> # Create a supervised learning instance and convert a single column
>>> model = SupervisedLearning(dataset)
>>> model.categorical_to_datetime('date_column')
>>> # Convert multiple columns
>>> model.categorical_to_datetime(['start_date', 'end_date'])
>>> # Convert a combination of columns using a tuple
>>> model.categorical_to_datetime(('start_date', 'end_date'))

See Also

  • pandas.to_datetime : Convert argument to datetime.

categorical_to_numerical(columns: list = None)[source]

Convert categorical columns to numerical using one-hot encoding.

Parameters

columnslist, optional

A list of column names to apply one-hot encoding. If not provided, one-hot encoding is applied to all categorical columns.

Returns

pd.DataFrame

Transformed DataFrame with categorical columns converted to numerical using one-hot encoding.

Notes

This method uses the pandas library for one-hot encoding.

See Also

pandas.get_dummies : Convert categorical variable(s) into dummy/indicator variables.

Examples

>>> # Convert all categorical columns to numerical using one-hot encoding
>>> df = SupervisedLearning(dataset)
>>> df.categorical_to_numerical()
>>> # Convert specific columns to numerical using one-hot encoding
>>> df.categorical_to_numerical(columns=['Category1', 'Category2'])
classifier_evaluation(kfold: int = None, cross_validation: bool = False)[source]

Evaluate the performance of a classification model.

Parameters

kfoldint, optional

Number of folds for cross-validation. If not provided, default is None.

cross_validationbool, default False

Flag to indicate whether to perform cross-validation.

Returns

dict

A dictionary containing evaluation metrics for both training and test sets.

Raises

TypeError
  • If kfold is provided without enabling cross-validation.

AssertionError
  • If called for a regression problem.

Notes

  • This method evaluates the performance of a classification model using metrics such as confusion matrix, classification report, accuracy, precision, recall, and F1 score.

  • If kfold is not provided, it evaluates the model on the training and test sets.

  • If cross_validation is set to True, cross-validation scores are also included in the result.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # Create a supervised learning instance and train a classification model
>>> model = SupervisedLearning(dataset)
>>> model.train_model_classifier(RandomForestClassifier(random_state = 0))
>>>
>>> # Evaluate the model
>>> evaluation_results = model.classifier_evaluation(kfold=5, cross_validation=True)
>>> print(evaluation_results)

See Also

  • sklearn.metrics.confusion_matrix : Compute confusion matrix.

  • sklearn.metrics.classification_report : Build a text report showing the main classification metrics.

  • sklearn.metrics.accuracy_score : Accuracy classification score.

  • sklearn.metrics.precision_score : Compute the precision.

  • sklearn.metrics.recall_score : Compute the recall.

  • sklearn.metrics.f1_score : Compute the F1 score.

  • sklearn.model_selection.cross_val_score : Evaluate a score by cross-validation.

classifier_graph(classifier, cmap_train='viridis', cmap_test='viridis', size_train_marker: float = 10, size_test_marker: float = 10, resolution=100)[source]

Visualize the decision boundaries of a classification model.

Parameters

classifierscikit-learn classifier object

The trained classification model.

cmap_trainstr, default “viridis”

Colormap for the training set.

cmap_teststr, default “viridis”

Colormap for the test set.

size_train_markerfloat, default 10

Marker size for training set points.

size_test_markerfloat, default 10

Marker size for test set points.

resolutionint, default 100

Resolution of the decision boundary plot.

Raises

AssertionError

If called for a regression problem.

TypeError

If the number of features is not 2.

Notes

  • This method visualizes the decision boundaries of a classification model by plotting the regions where the model predicts different classes.

  • It supports both training and test sets, with different markers and colormaps for each.

Examples

>>> from sklearn.linear_model import LogisticRegression
>>>
>>> # Create a supervised learning instance and train a classification model
>>> model = SupervisedLearning(dataset)
>>> trained_classifier = model.train_model_classifier(LogisticRegression(random_state = 0))
>>>
>>> # Visualize the decision boundaries
>>> model.classifier_graph(classifier=trained_classifier)

See Also

  • sklearn.preprocessing.LabelEncoder : Encode target labels.

  • sklearn.linear_model.LogisticRegression : Logistic Regression classifier.

  • sklearn.svm.SVC : Support Vector Classification.

  • sklearn.tree.DecisionTreeClassifier : Decision Tree classifier.

  • sklearn.ensemble.RandomForestClassifier : Random Forest classifier.

  • sklearn.neighbors.KNeighborsClassifier : K-Nearest Neighbors classifier.

  • matplotlib.pyplot.scatter : Plot scatter plots.

classifier_model_testing(variables_values: list, scaling: bool = False)[source]

Test a classification model with given input variables.

Parameters

variables_valueslist

A list containing values for input variables used to make predictions.

scalingbool, default False

Flag to indicate whether to scale input variables. If True, the method assumes that the model was trained on scaled data.

Returns

array

Predicted labels for the given input variables.

Raises

AssertionError

If called for a regression problem.

Notes

  • This method is used to test a classification model by providing values for the input variables and obtaining predicted labels.

  • If scaling is required, it is important to set the scaling parameter to True.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # Create a supervised learning instance and train a classification model
>>> model = SupervisedLearning(dataset)
>>> model.train_model_classifier(RandomForestClassifier(random_state = 0))
>>>
>>> # Provide input variables for testing
>>> input_data = [value1, value2, value3]
>>>
>>> # Test the model
>>> predicted_labels = model.classifier_model_testing(input_data, scaling=True)
>>> print(predicted_labels)

See Also

  • sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

  • sklearn.neighbors.KNeighborsClassifier : K-nearest neighbors classifier.

  • sklearn.ensemble.RandomForestClassifier : A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset.

classifier_predict()[source]

Predict the target variable using the trained classifier.

Returns

Dict[str, np.ndarray]

A dictionary containing the actual and predicted values for training and test sets. Keys include ‘Actual Training Y’, ‘Actual Test Y’, ‘Predicted Training Y’, and ‘Predicted Test Y’.

Raises

AssertionError

If the model is set for regression, not classification.

Notes

This method uses the sklearn library for classification model prediction.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # Train a classifier
>>> df = SupervisedLearning(dataset)
>>> classifier = RandomForestClassifier(random_state = 0)
>>> trained_classifier = df.train_model_classifier(classifier)
>>>
>>> # Predict with the classification model
>>> predictions = df.classifier_predict()
>>>
>>> print(predictions)
{'Actual Training Y': array([...]), 'Actual Test Y': array([...]),
 'Predicted Training Y': array([...]), 'Predicted Test Y': array([...])}

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

column_binning(column, number_of_bins: int = 10, labels: list = None)[source]

Apply binning to specified columns in the dataset.

Parameters

columnstr or list or tuple

The column(s) to apply binning to.

number_of_binsint, default 10

The number of bins to use.

labels: list or tuple, default = None

Name the categorical columns created by giving them labels.

Returns

DataFrame

The dataset with specified column(s) binned.

Notes

  • This method uses the pd.cut function to apply binning to the specified column(s).

  • Binning is a process of converting numerical data into categorical data.

Examples

>>> # Create a supervised learning instance and perform column binning
>>> model = SupervisedLearning(dataset)
>>> model.column_binning(column="Age", number_of_bins=5)
>>> model.column_binning(column=["Salary", "Experience"], number_of_bins=10)

See Also

  • pandas.cut : Bin values into discrete intervals.

  • pandas.DataFrame : Data structure for handling the dataset.

count_column_categories(column: str, reset_index: bool = False, inplace: bool = False, test_data: bool = False)[source]

Count the occurrences of categories in a categorical column.

Parameters

columnstr or list or tuple

Categorical column or columns to count categories.

reset_indexbool, default False

Whether to reset the index after counting.

inplacebool, default False

Replace the original dataset with this groupby operation.

test_databool, default False

Include the categories count for the test data.

Returns

DataFrame

Count of occurrences of each category in the specified column.

Raises

TypeError

If the column type is not recognized.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>>
>>> # Count the occurrences of each category in the 'Category' column
>>> category_counts = model.count_column_categories(column='Category')
>>>
>>> print(category_counts)

See Also

  • pandas.Series.value_counts : Return a Series containing counts of unique values.

  • pandas.DataFrame.reset_index : Reset the index of a DataFrame.

drop_columns(columns: list)[source]

Drop specified columns from the dataset.

Parameters

columnsstr or list of str

A single column name (string) or a list of column names to be dropped.

Returns

pd.DataFrame

A new DataFrame with the specified columns dropped.

Notes

This method utilizes the pandas library for DataFrame manipulation.

See Also

  • pandas.DataFrame.drop : Drop specified labels from rows or columns.

Examples

>>> # Drop a single column
>>> df = SupervisedLearning(dataset)
>>> df.drop_columns('column_name')
>>> # Drop multiple columns
>>> df = SupervisedLearning(dataset)
>>> df.drop_columns(['column1', 'column2'])
eda()[source]

Perform Exploratory Data Analysis (EDA) on the dataset.

Returns

Dict

A dictionary containing various EDA results, including data head, data tail, descriptive statistics, mode, distinct count, null count, total null count, and correlation matrix.

Notes

This method utilizes functionalities from pandas for data analysis.

Examples

>>> # Perform Exploratory Data Analysis
>>> df = SupervisedLearning(dataset)
>>> eda_results = df.eda()

See Also

  • pandas.DataFrame.info : Get a concise summary of a DataFrame.

  • pandas.DataFrame.head : Return the first n rows.

  • pandas.DataFrame.tail : Return the last n rows.

  • pandas.DataFrame.describe : Generate descriptive statistics.

  • pandas.DataFrame.mode : Get the mode(s) of each element.

  • pandas.DataFrame.nunique : Count distinct observations.

  • pandas.DataFrame.isnull : Detect missing values.

  • pandas.DataFrame.corr : Compute pairwise correlation of columns.

eda_visual(histogram_bins: int = 10, figsize_heatmap: tuple = (15, 10), figsize_histogram: tuple = (15, 10), figsize_barchart: tuple = (15, 10), before_data_cleaning: bool = True)[source]

Generate visualizations for exploratory data analysis (EDA).

Parameters

histogram_bins: int

The number of bins for each histogram.

figsize_heatmap: tuple

The length and breadth for the frame of the heatmap.

figsize_histogram: tuple

The length and breadth for the frame of the histogram.

figsize_barchart: tuple

The length and breadth for the frame of the barchart.

before_data_cleaningbool, default True

If True, visualizes data before cleaning. If False, visualizes cleaned data.

Returns

None

The method generates and displays various visualizations based on the data distribution and correlation.

Notes

This method utilizes the following libraries for visualization:

  • matplotlib.pyplot for creating histograms and heatmaps.

  • seaborn for creating count plots and box plots.

Examples

>>> # Generate EDA visualizations before data cleaning
>>> df = SupervisedLearning(dataset)
>>> df.eda_visual(before_data_cleaning=True)
>>> # Generate EDA visualizations after data cleaning
>>> df.eda_visual(before_data_cleaning=False)
extract_date_features(datetime_column, hrs_mins_sec: bool = False)[source]

Extracts date-related features from a datetime column.

Parameters

datetime_columnstr, list, or tuple

The name of the datetime column or a list/tuple of datetime columns.

hrs_mins_secbool, default False

Flag indicating whether to include hour, minute, and second features.

Returns

DataFrame

A DataFrame with additional columns containing extracted date features.

Notes

  • This method extracts date-related features such as day, month, year, quarter, and day of the week from the specified datetime column(s).

  • If hrs_mins_sec is set to True, it also includes hour, minute, and second features.

Examples

>>> # Create a supervised learning instance and extract date features
>>> model = SupervisedLearning(dataset)
>>> date_columns = ['DateOfBirth', 'TransactionDate']
>>> model.extract_date_features(date_columns, hrs_mins_sec=True)
>>>
>>> # Access the DataFrame with additional date-related columns
>>> original_data, processed_data = model.get_dataset()

See Also

  • pandas.Series.dt : Accessor object for datetime properties.

  • sklearn.preprocessing.OneHotEncoder : Encode categorical integer features using a one-hot encoding.

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

  • matplotlib.pyplot : Plotting library for creating visualizations.

filter_data(column: str, operation: str = None, value: int = None)[source]

Filter data based on specified conditions.

Parameters

  • column: str or list or tuple

    The column or columns to filter.

  • operation: str or list or tuple, optional

    The operation or list of operations to perform. Supported operations: ‘greater than’, ‘less than’, ‘equal to’, ‘greater than or equal to’, ‘less than or equal to’, ‘not equal to’, ‘>’, ‘<’, ‘==’, ‘>=’, ‘<=’, ‘!=’. Default is None.

  • value: int or float or str or list or tuple, optional

    The value or list of values to compare against. Default is None.

Returns

pandas.DataFrame

The filtered DataFrame.

Raises

  • TypeError: If input parameters are invalid or inconsistent.

Example

>>> # Create a supervised learning instance and sort the dataset
>>> data = SupervisedLearning(dataset)
>>>
>>> # Filter data where 'column' is greater than 5
>>> filter_data = data.filter_data(column='column', 
>>>                                operation='>', 
>>>                                value=5)
>>>
>>> # Filter data where 'column1' is less than or equal to 10 and 'column2' is not equal to 'value'
>>> filter_data = data.filter_data(column=['column1', 'column2'], 
>>>                                operation=['<=', '!='], 
>>>                                value=[10, 'value'])

fix_missing_values(strategy: str = None)[source]

Fix missing values in the dataset.

Parameters

strategystr, optional

The strategy to use for imputation. If not specified, it defaults to “mean”. Options: “mean”, “median”, “mode”.

Returns

pd.DataFrame

The dataset with missing values imputed.

Notes

This method uses the sklearn.impute library for handling missing values.

See Also

sklearn.impute.SimpleImputer : Imputation transformer for completing missing values.

Examples

>>> # Fix missing values using the default strategy ("mean")
>>> df = SupervisedLearning(dataset)
>>> df.fix_missing_values()
>>> # Fix missing values using a specific strategy (e.g., "median")
>>> df.fix_missing_values(strategy="median")
fix_unbalanced_dataset(sampler: str, k_neighbors: int = None, random_state: int = None)[source]

Apply techniques to address class imbalance in the dataset.

Parameters

samplerstr

The resampling technique. Options: “SMOTE”, “RandomOverSampler”, “RandomUnderSampler”.

k_neighborsint, optional

The number of nearest neighbors to use in the SMOTE algorithm.

random_stateint, optional

Seed for reproducibility.

Returns

dict

A dictionary containing the resampled training data.

Raises

TypeError
  • If k_neighbors is specified for a sampler other than “SMOTE”.

Notes

  • This method addresses class imbalance in the dataset using various resampling techniques.

  • Supported samplers include SMOTE, RandomOverSampler, and RandomUnderSampler.

Examples

>>> # Create a supervised learning instance and fix unbalanced dataset
>>> model = SupervisedLearning(dataset)
>>> model.fix_unbalanced_dataset(sampler="SMOTE", k_neighbors=5)

See Also

  • imblearn.over_sampling.SMOTE : Synthetic Minority Over-sampling Technique.

  • imblearn.over_sampling.RandomOverSampler : Random over-sampling.

  • imblearn.under_sampling.RandomUnderSampler : Random under-sampling.

  • sklearn.impute.SimpleImputer : Simple imputation for handling missing values.

get_bestK_KNNclassifier(weight='uniform', algorithm='auto', metric='minkowski', max_k_range: int = 31, figsize: tuple = (15, 10))[source]

Find the best value of k for K-Nearest Neighbors (KNN) classifier.

Parameters

weightstr, default ‘uniform’

Weight function used in prediction. Possible values: ‘uniform’ or ‘distance’.

algorithmstr, default ‘auto’

Algorithm used to compute the nearest neighbors. Possible values: ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’.

metricstr, default ‘minkowski’

Distance metric for the tree. Refer to the documentation of sklearn.neighbors.DistanceMetric for more options.

max_k_rangeint, default 31

Maximum range of k values to consider.

figsize: tuple

A tuple containing the frame length and breadth for the graph to be plotted.

Returns

Int

An integer indicating the best k value for the KNN Classifier.

Raises

TypeError

If invalid values are provided for ‘algorithm’ or ‘weight’.

Notes

This method evaluates the KNN classifier with different values of k and plots a graph to help identify the best k. The best k-value is determined based on the highest accuracy score.

Examples

>>> data = SupervisedLearning(dataset)
>>> data.get_bestK_KNNclassifier(weight='distance', 
>>>                              algorithm='kd_tree')

See Also

  • sklearn.neighbors.KNeighborsClassifier : K-nearest neighbors classifier.

get_bestK_KNNregressor(weight='uniform', algorithm='auto', metric='minkowski', max_k_range: int = 31, figsize: tuple = (15, 10))[source]

Find the best value of k for K-Nearest Neighbors (KNN) regressor.

Parameters

weightstr, default ‘uniform’

Weight function used in prediction. Possible values: ‘uniform’ or ‘distance’.

algorithmstr, default ‘auto’

Algorithm used to compute the nearest neighbors. Possible values: ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’.

metricstr, default ‘minkowski’

Distance metric for the tree. Refer to the documentation of sklearn.neighbors.DistanceMetric for more options.

max_k_rangeint, default 31

Maximum range of k values to consider.

figsize: tuple

A tuple containing the frame length and breadth for the graph to be plotted.

Returns

Int

An integer indicating the best k value for the KNN Regressor.

Raises

TypeError

If invalid values are provided for ‘algorithm’ or ‘weight’.

Notes

This method evaluates the KNN regressor with different values of k and plots a graph to help identify the best k. The best k-value is determined based on the highest R-squared score.

Examples

>>> data = SupervisedLearning(dataset)
>>> data.get_bestK_KNNregressor(weight='distance', algorithm='kd_tree')

See Also

  • sklearn.neighbors.KNeighborsRegressor : K-nearest neighbors regressor.

get_dataset()[source]

Retrieve the original dataset and the processed data.

Returns

Tuple

A tuple containing the original dataset and the processed data.

Notes

This method provides access to both the original and processed datasets.

See Also

pandas.DataFrame : Data structure for handling tabular data.

Examples

>>> # Get the original and processed datasets
>>> df = SupervisedLearning(dataset)
>>> original_data, processed_data = df.get_dataset()
get_training_test_data()[source]

Get the training and test data splits.

Returns

Tuple

A tuple containing X_train, X_test, y_train, and y_test.

Notes

This method uses the sklearn.model_selection library for splitting the data into training and test sets.

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

Examples

>>> # Get training and test data splits
>>> df = SupervisedLearning(dataset)
>>> X_train, X_test, y_train, y_test = df.get_training_test_data()
group_data(columns: list, column_to_groupby: str, aggregate_function: str, reset_index: bool = False, inplace: bool = False)[source]

Group data by specified columns and apply an aggregate function.

Parameters

columnslist or tuple

Columns to be grouped and aggregated.

column_to_groupbystr or list or tuple

Column or columns to be used for grouping.

aggregate_functionstr

The aggregate function to apply (e.g., ‘mean’, ‘count’, ‘min’, ‘max’, ‘std’, ‘var’, ‘median’).

reset_indexbool, default False

Whether to reset the index after grouping.

inplacebool, default False

Replace the original dataset with this groupby operation.

Returns

DataFrame

Grouped and aggregated data.

Raises

TypeError

If the column types or aggregate function are not recognized.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>>
>>> # Group data by 'Category' and calculate the mean for 'Value'
>>> grouped_data = model.group_data(columns=['Value'], 
>>>                                 column_to_groupby='Category', 
>>>                                 aggregate_function='mean')
>>>
>>> print(grouped_data)

See Also

  • pandas.DataFrame.groupby : Group DataFrame using a mapper or by a Series of columns.

  • pandas.DataFrame.agg : Aggregate using one or more operations over the specified axis.

load_large_dataset(dataset)[source]

Load a large dataset using the Datatable library.

Parameters

datasetstr

The path or URL of the dataset.

Returns

DataFrame

Pandas DataFrame containing the loaded data.
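
Examples

A minimal sketch; the file path is illustrative:

>>> # Create a supervised learning instance and load a large CSV with datatable
>>> data = SupervisedLearning(dataset)
>>> large_df = data.load_large_dataset("path/to/large_file.csv")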

See Also

  • datatable.fread : Read a DataTable from a file.

numerical_to_categorical(column)[source]

Convert numerical columns to categorical in the dataset.

Parameters

columnstr, list, or tuple

The name of the column or a list/tuple of column names to be converted.

Returns

DataFrame

A new DataFrame with specified numerical columns converted to categorical.

Notes

  • This method converts numerical columns in the dataset to categorical type.

  • It is useful when dealing with features that represent categories or labels but are encoded as numerical values.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> data = SupervisedLearning(dataset)
>>>
>>> # Convert a single numerical column to categorical
>>> data.numerical_to_categorical("numeric_column")
>>>
>>> # Convert multiple numerical columns to categorical
>>> data.numerical_to_categorical(["numeric_col1", "numeric_col2"])

pandas_profiling(output_file: str = 'Pandas Profile Report.html', dark_mode: bool = False, title: str = 'Report')[source]

Generate a Pandas profiling report for the dataset.

Parameters

output_file : str, default “Pandas Profile Report.html”

The name of the HTML file to save the Pandas profiling report.

dark_mode : bool, default False

If True, use a dark mode theme for the generated report.

title : str, default “Report”

The title of the Pandas profiling report.

Returns

None

See Also

  • pandas_profiling.ProfileReport : Generate a profile report from a DataFrame.
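
Examples

A minimal usage sketch; the report is saved to the HTML file named by output_file:

>>> data = SupervisedLearning(dataset)
>>> data.pandas_profiling(output_file="EDA Report.html", title="EDA Report")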

poly_get_optimal_degree(max_degree: int = 10, whole_dataset: bool = False, test_size: float = 0.2, random_state: int = 0, include_bias: bool = True, cross_validation: bool = False, kfold: int = 5)[source]

This method is designed to determine the optimal degree for polynomial regression. It evaluates the performance of polynomial regression models with degrees ranging from 1 to a specified maximum degree. The evaluation includes training and testing the models, as well as optional cross-validation metrics.

Parameters

max_degree : int, optional, default=10

The maximum degree of the polynomial to evaluate.

whole_dataset : bool, optional, default=False

If True, the model is trained on the entire dataset without splitting into training and testing sets.

test_size : float, optional, default=0.2

The proportion of the dataset to include in the test split if not using the entire dataset.

random_state : int, optional, default=0

Seed for the random number generator.

include_bias : bool, optional, default=True

Whether to include a bias column in the polynomial features.

cross_validation : bool, optional, default=False

If True, includes cross-validation metrics in the output.

kfold : int, optional, default=5

Number of folds for cross-validation.

Returns

  • If cross_validation is False:

    DataFrame: A DataFrame containing metrics for each degree, including training R2, training RMSE, test R2, and test RMSE.

  • If cross_validation is True:

    Dictionary: A dictionary containing two keys:

    • “Degree Metrics”: A DataFrame with metrics for each degree, including training R2, training RMSE, test R2, test RMSE, cross-validation mean, and cross-validation standard deviation.

    • “Cross Validation Info”: An array containing cross-validation scores.

Example

>>> # Import Libraries
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from buildml import SupervisedLearning
>>> 
>>> # Get the Dataset
>>> dataset = pd.read_csv("Your dataset")
>>> 
>>> # Using BuildML
>>> automate = SupervisedLearning(dataset)
>>> 
>>> # EDA
>>> eda = automate.eda()
>>> 
>>> # Further Data Preparation and Segregation
>>> select_variables = automate.select_dependent_and_independent(predict = "Salary")
>>> best_degree = automate.poly_get_optimal_degree(max_degree=5, 
>>>                                                whole_dataset=False, 
>>>                                                test_size=0.2, 
>>>                                                random_state=42, 
>>>                                                include_bias=True, 
>>>                                                cross_validation=True)
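>>> 
>>> # With cross_validation=True, the result is a dictionary whose keys
>>> # follow the structure described under Returns
>>> degree_metrics = best_degree["Degree Metrics"]
>>> cv_scores = best_degree["Cross Validation Info"]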

Notes

  • Cross-validation scores are only available in the output when cross_validation is True.

  • This method uses polynomial regression models and linear regression as the base algorithm.

  • The output provides insights into model performance with various degrees, aiding in selecting the optimal degree for polynomial regression.

polyreg_graph(title: str, xlabel: str, ylabel: str, figsize: tuple = (15, 10), line_style: str = 'dashed', line_width: float = 2, line_marker: str = 'o', line_marker_size: float = 12, train_color_marker: str = 'red', test_color_marker: str = 'red', line_color: str = 'green', size_train_marker: float = 10, size_test_marker: float = 10, whole_dataset: bool = False)[source]

Generate a polynomial regression graph for visualization.

Parameters

title : str

The title of the graph.

xlabel : str

A title for the x-axis.

ylabel : str

A title for the y-axis.

figsize : tuple, optional, default: (15, 10)

The size (length, breadth) of the figure frame for the plot.

line_style : str, optional, default: “dashed”

Style of the regression line (“solid”, “dashed”, “dashdot”, etc.).

line_width : float, optional, default: 2

Width of the regression line.

line_marker : str, optional, default: “o”

Marker style for data points on the regression line.

line_marker_size : float, optional, default: 12

Size of the marker on the regression line.

train_color_marker : str, optional, default: “red”

Color of markers for training data.

test_color_marker : str, optional, default: “red”

Color of markers for test data.

line_color : str, optional, default: “green”

Color of the regression line.

size_train_marker : float, optional, default: 10

Size of markers for training data.

size_test_marker : float, optional, default: 10

Size of markers for test data.

whole_dataset : bool, optional, default: False

If True, visualize the regression line on the entire dataset. If False, visualize on training and test datasets separately.

Returns

None

Displays the polynomial regression graph.

See Also

  • matplotlib.pyplot.scatter: Plot a scatter plot using Matplotlib.

  • matplotlib.pyplot.plot: Plot lines and/or markers using Matplotlib.

  • numpy: Fundamental package for scientific computing with Python.

  • scikit-learn: Simple and efficient tools for predictive data analysis.

Example

>>> import pandas as pd
>>> import numpy as np
>>> from buildml import SupervisedLearning
>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Get the Dataset
>>> dataset = pd.read_csv("Your dataset/path")
>>> 
>>> # Assuming `automate` is an instance of the SupervisedLearning class
>>> automate = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>>
>>> # Further Data Preparation and Segregation
>>> select_variables = automate.select_dependent_and_independent(predict = "Salary")
>>> poly_x = automate.polyreg_x(degree = 5)
>>>
>>> # Model Building
>>> training = automate.train_model_regressor(regressor)
>>> prediction = automate.regressor_predict()
>>> evaluation = automate.regressor_evaluation()
>>> poly_reg = automate.polyreg_graph(title = "Analyzing salary across different levels",  
>>>                                   xlabel = "Levels", 
>>>                                   ylabel = "Salary", 
>>>                                   whole_dataset = True, 
>>>                                   line_marker = None, 
>>>                                   line_style = "solid")
polyreg_x(degree: int, include_bias: bool = False)[source]

Polynomial Regression Feature Expansion.

This method performs polynomial regression feature expansion on the independent variables (features). It uses scikit-learn’s PolynomialFeatures to generate polynomial and interaction features up to a specified degree.

Parameters

degree : int

The degree of the polynomial features.

include_bias : bool, optional, default=False

If True, the polynomial features include a bias column (intercept).

Returns

pd.DataFrame

DataFrame with polynomial features.

Notes

This method utilizes scikit-learn’s PolynomialFeatures for feature expansion.

Examples

>>> from buildml import SupervisedLearning
>>> model = SupervisedLearning(dataset)
>>> model.polyreg_x(degree=2, include_bias=True)

reduce_data_memory_useage(verbose: bool = True)[source]

Reduce memory usage of the dataset by converting data types.

Parameters

verbose : bool, default True

If True, print information about the memory reduction.

Returns

DataFrame

Pandas DataFrame with reduced memory usage.

See Also

  • pandas.DataFrame.memory_usage : Return the memory usage of each column.
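
Examples

A minimal usage sketch; with verbose=True the method prints information about the memory reduction:

>>> df = SupervisedLearning(dataset)
>>> reduced_data = df.reduce_data_memory_useage(verbose=True)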

regressor_evaluation(kfold: int = None, cross_validation: bool = False)[source]

Evaluate the performance of the regression model.

Parameters

kfold : int, optional

Number of folds for cross-validation. If provided, cross-validation will be performed.

cross_validation : bool, default False

If True, perform cross-validation; otherwise, perform a simple train-test split evaluation.

Returns

dict

Dictionary containing evaluation metrics.

Raises

TypeError

If invalid combination of parameters is provided.

Notes

This method uses the sklearn.metrics and sklearn.model_selection libraries for regression evaluation.

Examples

>>> # Evaluate regression model performance using simple train-test split
>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)
>>>
>>> # Predict for regression model
>>> predictions = df.regressor_predict()
>>> evaluation_results = df.regressor_evaluation()
>>> # Evaluate regression model performance using 10-fold cross-validation
>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)
>>>
>>> # Predict for regression model
>>> predictions = df.regressor_predict()
>>> evaluation_results = df.regressor_evaluation(kfold=10, cross_validation=True)

See Also

  • sklearn.metrics.r2_score : R-squared (coefficient of determination) regression score function.

  • sklearn.metrics.mean_squared_error : Mean squared error regression loss.

  • sklearn.model_selection.cross_val_score : Evaluate a score by cross-validation.

regressor_model_testing(variables_values: list, scaling: bool = False)[source]

Test the trained regressor model with given input variables.

Parameters

variables_values : list

A list containing values for each independent variable.

scaling : bool, default False

Whether to scale the input variables. If True, the method expects scaled input using the same scaler used during training.

Returns

np.ndarray

The predicted values from the regressor model.

Raises

AssertionError

If the problem type is not regression.

Notes

  • This method tests a pre-trained regressor model.

  • If scaling is set to True, the input variables are expected to be scaled using the same scaler used during training.

Examples

>>> # Assuming df is an instance of SupervisedLearning class with a trained regressor model
>>> df.regressor_model_testing([1.5, 0.7, 2.0], scaling=True)
array([42.0])

See Also

  • sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

regressor_predict()[source]

Predict the target variable for regression models.

Returns

Dict[str, np.ndarray]

A dictionary containing actual training and test targets along with predicted values, or None if the model is set for classification.

Raises

AssertionError

If the model was trained for classification, since a regressor cannot be used to make predictions for a classification problem.

Notes

This method uses the sklearn library for regression model prediction.

Examples

>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)
>>>
>>> # Predict for regression model
>>> predictions = df.regressor_predict()
>>>
>>> print(predictions)
{'Actual Training Y': array([...]), 'Actual Test Y': array([...]),
 'Predicted Training Y': array([...]), 'Predicted Test Y': array([...])}

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

remove_duplicates(which_columns: str = None)[source]

Remove duplicate rows from the dataset based on specified columns.

Parameters

which_columns : str or list or tuple, optional

Column(s) to consider when identifying duplicate rows. If specified, the method will drop rows that have the same values in the specified column(s).

Returns

DataFrame

A new DataFrame with duplicate rows removed.

Raises

TypeError

If which_columns is not a valid string, list, or tuple.

Notes

  • This method removes duplicate rows from the dataset based on the specified column(s).

  • If no columns are specified, it considers all columns when identifying duplicates.

Examples

>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>> # Remove duplicate rows based on a specific column
>>> model.remove_duplicates(which_columns='column_name')

See Also

  • pandas.DataFrame.drop_duplicates : Drop duplicate rows.

  • pandas.DataFrame : Pandas DataFrame class for handling tabular data.

  • ydata_profiling : Data profiling library for understanding and analyzing datasets.

  • sweetviz : Visualize and compare datasets for exploratory data analysis.

remove_outlier(drop_na: bool)[source]

Remove outliers from the dataset.

This method uses the sklearn.preprocessing library for standard scaling and outlier removal.

Parameters

drop_na : bool

If False, outliers are replaced with NaN values. If True, rows with NaN values are dropped.

Returns

pd.DataFrame

The dataset with outliers removed.

Notes

The method applies standard scaling using sklearn.preprocessing.StandardScaler and removes outliers based on the range of -3 to 3 standard deviations.
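
The standalone sketch below illustrates the same idea using pandas and scikit-learn directly; it assumes a purely numeric dataset and is not the library's exact implementation:

>>> import pandas as pd
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> numeric = dataset.select_dtypes("number")
>>> scaled = pd.DataFrame(StandardScaler().fit_transform(numeric), 
>>>                       columns=numeric.columns, 
>>>                       index=numeric.index)
>>> cleaned = numeric.where(scaled.abs() <= 3)  # out-of-range cells become NaN
>>> cleaned = cleaned.dropna()                  # equivalent of drop_na=True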

See Also

sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

Examples

>>> # Remove outliers, replace with NaN
>>> df = SupervisedLearning(dataset)
>>> df.remove_outlier(drop_na=False)
>>> # Remove outliers and drop rows with NaN values
>>> df.remove_outlier(drop_na=True)
rename_columns(old_column: str, new_column: str)[source]

Rename columns in the dataset.

Parameters

old_column : str or list

The old column name(s) to be renamed.

new_column : str or list

The new column name(s).

Returns

pd.DataFrame

A dataframe containing the modified dataset with the column name(s) changed.

Examples

>>> # Create a supervised learning instance and rename columns
>>> data = SupervisedLearning(dataset)
>>> renamed_data = data.rename_columns("old_column_name", "new_column_name")
>>> print(renamed_data)

replace_values(replace: int, new_value: int)[source]

Replace specified values in the dataset.

Parameters

replace : int or float or str or list or tuple or dict

The value or set of values to be replaced.

new_value : int or float or str or list or tuple

The new value or set of values to replace the existing ones.

Returns

pd.DataFrame

A dataframe containing the modified dataset with replaced values.

Raises

TypeError
  • If replace is a string, integer, or float, and new_value is not a string, integer, or float.

  • If replace is a list or tuple, and new_value is not a string, integer, float, list, or tuple.

  • If replace is a dictionary and new_value is also specified (the dictionary already maps old values to new ones).

Notes

This method replaces specified values in the dataset. The replacement can be done for a single value, a list of values, or using a dictionary for multiple replacements.

Examples

>>> # Create a supervised learning instance and replace values
>>> data = SupervisedLearning(dataset)
>>>
>>> # Replace a single value
>>> replaced_values = data.replace_values(0, -1)
>>>
>>> # Replace multiple values using a dictionary
>>> replaced_values = data.replace_values({'Male': 1, 'Female': 0})

reset_index(drop_index_after_reset: bool = False)[source]

Reset the index of the dataset.

Parameters

drop_index_after_reset : bool, optional

Whether to drop the old index after resetting. Default is False.

Returns

pd.DataFrame

A dataframe containing the modified dataset with the index reset.

Examples

>>> # Create a supervised learning instance and reset the index
>>> data = SupervisedLearning(dataset)
>>> reset_index_data = data.reset_index(drop_index_after_reset=True)
>>> print(reset_index_data)

scale_independent_variables()[source]

Standardize independent variables using sklearn.preprocessing.StandardScaler.

Returns

pd.DataFrame

A DataFrame with scaled independent variables.

Notes

This method uses the sklearn.preprocessing library for standardization.

Examples

>>> # Create an instance of SupervisedLearning
>>> df = SupervisedLearning(dataset)
>>> # Scale independent variables
>>> scaled_data = df.scale_independent_variables()

See Also

sklearn.preprocessing.StandardScaler : Standardize features by removing the mean and scaling to unit variance.

select_datatype(datatype_to_select: str = None, datatype_to_exclude: str = None, inplace: bool = False)[source]

Select columns of specific data types from the dataset.

Parameters

datatype_to_select : str, optional

Data type(s) to include. All data types are included by default.

datatype_to_exclude : str, optional

Data type(s) to exclude. None are excluded by default.

inplace : bool, default False

Replace the original dataset with the selected subset.

Returns

DataFrame

Subset of the dataset containing columns of the specified data types.

See Also

  • pandas.DataFrame.select_dtypes : Select columns based on data type.
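
Examples

A minimal usage sketch; the “number” selector follows pandas.DataFrame.select_dtypes naming:

>>> model = SupervisedLearning(dataset)
>>> numeric_subset = model.select_datatype(datatype_to_select="number")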

select_dependent_and_independent(predict: str)[source]

Select the dependent and independent variables for the supervised learning model.

Parameters

predict : str

The name of the column to be used as the dependent variable.

Returns

Dict

A dictionary containing the dependent variable and independent variables.

Notes

This method uses the pandas library for data manipulation.

Examples

>>> # Select dependent and independent variables
>>> df = SupervisedLearning(dataset)
>>> variables = df.select_dependent_and_independent("target_column")

See Also

  • pandas.DataFrame.drop : Drop specified labels from rows or columns.

  • pandas.Series : One-dimensional ndarray with axis labels.

select_features(strategy: str, estimator, number_of_features: int)[source]

Select features using different techniques.

Parameters

strategy : str

The feature selection strategy. Options include “rfe”, “selectkbest”, “selectfrommodel”, and “selectpercentile”.

estimator

The estimator or score function used for feature selection.

number_of_features : int

The number of features to select.

Returns

DataFrame or dict

DataFrame with selected features or a dictionary with selected features and selection metrics.

Raises

TypeError

If the strategy or estimator is not recognized.

Notes

  • This method allows feature selection using different techniques such as Recursive Feature Elimination (RFE), SelectKBest, SelectFromModel, and SelectPercentile.

Examples

>>> from sklearn.ensemble import RandomForestRegressor
>>>
>>> # Create a supervised learning instance and load a dataset
>>> model = SupervisedLearning(dataset)
>>>
>>> # Select features using Recursive Feature Elimination (RFE)
>>> selected_features = model.select_features(strategy='rfe', 
>>>                                           estimator=RandomForestRegressor(), 
>>>                                           number_of_features=5)
>>>
>>> print(selected_features)
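>>>
>>> # A second sketch, assuming the 'selectkbest' strategy accepts a
>>> # scikit-learn score function (e.g., f_regression) as its estimator
>>> from sklearn.feature_selection import f_regression
>>> selected_features = model.select_features(strategy='selectkbest', 
>>>                                           estimator=f_regression, 
>>>                                           number_of_features=5)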

See Also

  • sklearn.feature_selection.RFE : Recursive Feature Elimination.

  • sklearn.feature_selection.SelectKBest : Select features based on k highest scores.

  • sklearn.feature_selection.SelectFromModel : Feature selection using an external estimator.

  • sklearn.feature_selection.SelectPercentile : Select features based on a percentile of the highest scores.

set_index(column: str)[source]

Set the index of the dataset.

Parameters

column : str or list

The column(s) to set as the index.

Returns

pd.DataFrame

The dataset with the index set to the specified column or columns.

Examples

>>> # Create a supervised learning instance and set the index
>>> data = SupervisedLearning(dataset)
>>> index_set_data = data.set_index("column_name")
>>> print(index_set_data)

simple_linregres_graph(regressor, title: str, xlabel: str, ylabel: str, figsize: tuple = (15, 10), line_style: str = 'dashed', line_width: float = 2, line_marker: str = 'o', line_marker_size: float = 12, train_color_marker: str = 'red', test_color_marker: str = 'red', line_color: str = 'green', size_train_marker: float = 10, size_test_marker: float = 10, whole_dataset: bool = False)[source]

Generate a simple linear regression graph with optional visualization of training and test datasets.

Parameters

regressor : object or list

A single regression model or a list of regression models (e.g., sklearn.linear_model.LinearRegression) to visualize.

title : str

The title of the graph.

xlabel : str

A title for the x-axis.

ylabel : str

A title for the y-axis.

figsize : tuple, optional, default: (15, 10)

The size (length, breadth) of the figure frame for the plot.

line_style : str, optional, default: “dashed”

Style of the regression line (“solid”, “dashed”, “dashdot”, etc.).

line_width : float, optional, default: 2

Width of the regression line.

line_marker : str, optional, default: “o”

Marker style for data points on the regression line.

line_marker_size : float, optional, default: 12

Size of the marker for data points on the regression line.

train_color_marker : str, optional, default: “red”

Color of markers for the training dataset.

test_color_marker : str, optional, default: “red”

Color of markers for the test dataset.

line_color : str, optional, default: “green”

Color of the regression line.

size_train_marker : float, optional, default: 10

Size of markers for the training dataset.

size_test_marker : float, optional, default: 10

Size of markers for the test dataset.

whole_dataset : bool, optional, default: False

If True, visualize the entire dataset with the regression line. If False, visualize training and test datasets separately.

Returns

None

Displays a simple linear regression graph.

Examples

>>> from sklearn.linear_model import LinearRegression, Ridge, Lasso
>>>
>>> # Example 1: Visualize a simple linear regression model
>>> model.simple_linregres_graph(regressor=LinearRegression(), 
>>>                              title="Simple Linear Regression",
>>>                              xlabel="Specify your title for the x-axis",
>>>                              ylabel="Specify your title for the y-axis")
>>>
>>> # Example 2: Visualize multiple linear regression models
>>> regressors = [LinearRegression(), Ridge(), Lasso()]
>>> model.simple_linregres_graph(regressor=regressors, 
>>>                              title="Analyzing Impact of Expenditure on Growth",
>>>                              xlabel="Expenditure",
>>>                              ylabel="Growth")

sort_index(column: str, ascending: bool = True, reset_index: bool = False)[source]

Sort the dataset based on the index.

Parameters

column : str or list

The index column(s) by which to sort the dataset.

ascending : bool, optional

Whether to sort in ascending order. Default is True.

reset_index : bool, optional

Whether to reset the index after sorting. Default is False.

Returns

pd.DataFrame

A dataframe containing the sorted dataset.

Examples

>>> # Create a supervised learning instance and sort the dataset based on index
>>> data = SupervisedLearning(dataset)
>>> sorted_index_data = data.sort_index("index_column", 
>>>                                     ascending=False, 
>>>                                     reset_index=True)
>>> print(sorted_index_data)

sort_values(column: str, ascending: bool = True, reset_index: bool = False)[source]

Sort the dataset based on specified columns.

Parameters

column : str or list

The column(s) by which to sort the dataset.

ascending : bool, optional

Whether to sort in ascending order. Default is True.

reset_index : bool, optional

Whether to reset the index after sorting. Default is False.

Returns

pd.DataFrame

The dataset sorted according to the specified column or columns.

Examples

>>> # Create a supervised learning instance and sort the dataset
>>> data = SupervisedLearning(dataset)
>>> sorted_data = data.sort_values("column_name", 
>>>                                ascending=False, 
>>>                                reset_index=True)
>>> print(sorted_data)

split_data(test_size: float = 0.2)[source]

Split the dataset into training and test sets.

Parameters

test_size : float, optional, default=0.2

Specifies the size of the data to split as test data.

Returns

Dict

A dictionary containing the training and test sets for independent (X) and dependent (y) variables.

Notes

This method uses the sklearn.model_selection.train_test_split function for data splitting.

Examples

>>> # Split the data into training and test sets
>>> df = SupervisedLearning(dataset)
>>> data_splits = df.split_data()
>>>
>>> X_train = data_splits["Training X"]
>>> X_test = data_splits["Test X"]
>>> y_train = data_splits["Training Y"]
>>> y_test = data_splits["Test Y"]

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

sweetviz_profile_report(filename: str = 'Pandas Profile Report.html', auto_open: bool = False)[source]

Generate a Sweetviz profile report for the dataset.

Parameters

filename : str, default “Pandas Profile Report.html”

The name of the HTML file to save the Sweetviz report.

auto_open : bool, default False

If True, open the generated HTML report in a web browser.

Returns

None

See Also

  • sweetviz.analyze : Generate and analyze a Sweetviz data comparison report.
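
Examples

A minimal usage sketch; auto_open=True opens the generated report in a browser:

>>> data = SupervisedLearning(dataset)
>>> data.sweetviz_profile_report(filename="Sweetviz Report.html", auto_open=True)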

train_model_classifier(classifier)[source]

Train a classifier on the provided data.

Parameters

classifier : Any

The classifier object to be trained.

Returns

Any

The trained classifier.

Notes

This method uses the sklearn.model_selection and sklearn.metrics libraries for training and evaluating the classifier.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # Train a classifier
>>> df = SupervisedLearning(dataset)
>>> classifier = RandomForestClassifier(random_state = 0)
>>> trained_classifier = df.train_model_classifier(classifier)

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

  • sklearn.metrics.accuracy_score : Accuracy classification score.

train_model_regressor(regressor)[source]

Train a regressor model.

Parameters

regressor : Any

A regressor model object compatible with scikit-learn’s regressor interface.

Returns

Any

The trained regressor model.

Notes

  • This method uses the sklearn.model_selection and sklearn.metrics libraries for training and evaluation.

  • All required steps before model training should have been completed before running this function.

Examples

>>> from sklearn.linear_model import LinearRegression
>>>
>>> # Train a regressor model
>>> df = SupervisedLearning(dataset)
>>> regressor = LinearRegression()
>>> trained_regressor = df.train_model_regressor(regressor)

See Also

  • sklearn.model_selection.train_test_split : Split arrays or matrices into random train and test subsets.

  • sklearn.metrics.r2_score : R^2 (coefficient of determination) regression score function.

unique_elements_in_columns(count: bool = False)[source]

Extracts unique elements in each column of the dataset.

This method generates a DataFrame containing unique elements in each column. If specified, it can also provide the count of unique elements in each column.

Parameters

count : bool, optional, default=False

If True, returns the count of unique elements in each column.

Returns

pd.DataFrame or pd.Series

If count is False, a DataFrame with unique elements. If count is True, a Series with counts.

Notes

This method utilizes Pandas for extracting unique elements and their counts.

Examples

>>> from buildml import SupervisedLearning
>>> model = SupervisedLearning(dataset)
>>> unique_elements = model.unique_elements_in_columns(count=True)
