dataSharp

PreprocessResource.dataSharp(request, sub_analysis_id, **kwargs)

Prior mandatory steps: 1) Upload dataset 2) Create analysis 3) Create sub analysis

dataSharp performs the following preprocessing on the clean data generated by Intuceo. The data preparation functions run in the order described below.

  1. fastFill: Intuceo's proprietary imputation algorithm, which fills in missing values in the dataset. It first removes every attribute whose share of missing values exceeds a specified percentage (default 10%). The remaining attributes are then processed by the fastFill algorithm, which substitutes a value for each missing value. Imputation (auto fill) is performed on both categorical and numeric attributes. Note that the data after fastFill may have fewer attributes than the uploaded dataset.

    To get the data after fastFill, use the GET method of the fastFill function below. The next version of the API spec will include an option to bypass fastFill.

  2. autoMerge: After missing values have been filled, the categorical attributes are processed. Any categorical attribute with more than 29 unique levels is merged automatically, ignoring the “Merge” setting in the data definitions for that attribute. For attributes with fewer than 29 levels, levels are merged only if the Merge setting is set to Y.

    To get the data after merging, use the GET method of the autoMerge function below.

  3. Check data sufficiency: As part of dataSharp, Intuceo verifies whether the loaded data is sufficient for generating predictive models. This function takes the cleaned, merged data as input and returns a recommendation string: Good, Inadequate or Inconclusive. In all cases it also returns a dataframe called learningcurvedata, which contains the test error, train error and sample size for multiple samples of the learning curves.

    To get this information, use the GET method of the runLearningcurves function below.

  4. dNoise: This function bins the numerical attributes. It supports two types of binning, equal width and equal frequency.

    1. Equal width designs the bins so that every bin covers the same range of values.
    2. Equal frequency designs the bins so that every bin holds the same number of records.

    To get the binned data, use the GET method of the dNoise function below.

  5. oDetect: Outlier detection. Intuceo uses a proprietary technique to identify outliers in the data. The attributes it flags are called too good to be true [TGTBT] attributes. They are excluded from the later engines that generate patterns or predictive models.

    To get the TGTBT attributes, use the GET method of the oDetect function below.

  6. nReduce: This function estimates the relative importance of each attribute other than the TGTBTs and, using a proprietary method, tags each attribute as important or un-important. Regardless of these tags, further analysis for patterns, predictive models and suggestions uses all attributes unless the user specifically sets an attribute as un-important using setAttributes.

    To get the tags for the attributes, use the GET method of the nReduce function below.
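The two binning modes named under dNoise can be illustrated with a minimal sketch. It uses the textbook definitions of equal-width and equal-frequency binning, not Intuceo's implementation, and the function names `equal_width_bins` and `equal_frequency_bins` are hypothetical and not part of the API.

```python
def equal_width_bins(values, n_bins):
    """Group values into n_bins intervals that each span the same range."""
    lo, hi = min(values), max(values)
    bins = [[] for _ in range(n_bins)]
    if hi == lo:
        # All values are identical, so they all fall into the first bin.
        bins[0] = list(values)
        return bins
    width = (hi - lo) / n_bins
    for v in values:
        # The maximum value would index one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Group sorted values into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = equal_width_bins(values, 3)
frequency_bins = equal_frequency_bins(values, 3)
print([len(b) for b in width_bins])      # record counts differ per bin
print([len(b) for b in frequency_bins])  # record counts match per bin
```

With a skewed attribute like the one above, equal-width bins leave most records in a single bin, while equal-frequency bins spread the records evenly at the cost of uneven value ranges.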

Arguments

sub_analysis_id: ID of the sub analysis to preprocess

Possible errors

Error message
Invalid sub analysis id

GET Request Example

curl -u username:password {url_prefix}/data_sharp/{sub_analysis_id}/
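
The same request can be prepared from Python. This is a sketch using only the standard library; it assumes the endpoint accepts HTTP Basic authentication, which is what `curl -u` sends by default. The helper name `build_data_sharp_request` and the host used in the usage lines are hypothetical.

```python
import base64
import urllib.request


def build_data_sharp_request(url_prefix, sub_analysis_id, username, password):
    """Build (but do not send) the GET request for the data_sharp endpoint."""
    url = f"{url_prefix.rstrip('/')}/data_sharp/{sub_analysis_id}/"
    # HTTP Basic auth: base64 of "username:password" in the Authorization header.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}"},
        method="GET",
    )


# Hypothetical values; replace them with your own url_prefix and credentials.
req = build_data_sharp_request("https://example.invalid/api", 42, "username", "password")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send the request; that step is left out here so the sketch runs without network access.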

Response Example

{
    "error": false,
    "error_msg": "",
    "result": {
        "msg": "Please wait, auto process is running in background"
    }
}
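
A client might read this envelope as sketched below. It is an assumption that error responses reuse the same envelope with "error" set to true and the text from "Possible errors" in "error_msg"; the names `DataSharpError` and `read_data_sharp_response` are hypothetical.

```python
import json


class DataSharpError(Exception):
    """Raised when the response envelope reports an error."""


def read_data_sharp_response(body):
    """Return the status message from a data_sharp response body (a JSON string)."""
    payload = json.loads(body)
    if payload.get("error"):
        # Assumed shape: the failure reason is carried in "error_msg".
        raise DataSharpError(payload.get("error_msg") or "unknown error")
    return payload.get("result", {}).get("msg", "")


body = """
{
    "error": false,
    "error_msg": "",
    "result": {"msg": "Please wait, auto process is running in background"}
}
"""
print(read_data_sharp_response(body))
```

Because the message indicates that processing continues in the background, the outputs of the individual steps are retrieved afterwards through the GET methods of the functions listed above.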