Using the Datonis Analytics Feature (Notebooks)
Introduction
The Datonis Platform provides a Python notebook feature for performing interactive, exploratory data analysis and running machine-learning algorithms on your sensor data. This feature is most useful once you have collected a sizeable amount of data (a few days to a few months).
A notebook provides an integrated, containerized big-data computing environment. It lets the platform user run plain Python programs, or distributed programs using the PySpark framework, which can be scaled out by provisioning a separate Hadoop cluster based on the processing needs. The main advantage of the notebook feature is the ability to perform data analysis close to where the thing/sensor data is stored in the platform, eliminating the need for expensive and time-consuming ETL to move the data to an environment where analytics can be performed. This is a premium feature and is available only with enterprise accounts.
Getting Started
A notebook features an interactive UI editor in which code is organized into paragraphs. A notebook can have multiple paragraphs that logically separate the steps of a program. Paragraphs can be run individually or all together, in order. When performing interactive analysis of data, each paragraph can optionally print output or display graphical elements such as charts.
Datonis Provider
The DatonisProvider is a Python library that provides APIs related to the Datonis environment. A DatonisProvider instance can be created as follows:
```
%pyspark
from datonis import DatonisProvider
dp = DatonisProvider(sc)
```
This provides the 'dp' object that will be used to call the API methods. Here 'sc' is a special variable automatically created at the start of the notebook program, representing the Spark execution context (the SparkContext). This is typical of all Spark/PySpark programs.
The Datonis provider has three main APIs:
- get_thing_data: This API lets you load data for your things/sensors from the time-series store. It returns a distributed collection of data points grouped into named columns, called a DataFrame (see DataFrames). This API takes the following inputs:
  - time range: The start time and end time between which to fetch the sensor data. These should be Python datetime objects.
  - thing_template_key: A string representing the thing template key for which to fetch data.
  - thing_keys: An array of strings representing the thing keys for which to fetch data.
  - metrics/properties: An array of the thing properties you are interested in.
  - timegroup (optional): The aggregation/grouping level to apply in the time domain before returning data.
    - raw: The default option. Returns all data for the thing/sensor without applying any aggregation or grouping.
    - minute: Data is aggregated at the minute level, so you get up to 60 data points if you query for an hour of data.
    - n_minute: Data is aggregated at an 'n'-minute level. E.g. a value of '5_minute' returns up to 12 data points if you query for an hour of data.
    - hour: Data is aggregated at the hourly level.
    - day: Data is aggregated at the day level.
    - month: Data is aggregated at the monthly level.
  - timezone: The timezone in which to return data. By default this is UTC.
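The examples that follow assume start_time and end_time have already been defined. A minimal sketch of building such a time range with Python's datetime module (the 24-hour window is an arbitrary choice for illustration):

```python
from datetime import datetime, timedelta

# Fetch the most recent 24 hours of data (window size chosen for illustration)
end_time = datetime.utcnow()
start_time = end_time - timedelta(hours=24)
```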
A few examples of how to call the API are given below:
```
# Getting raw data
thing_template_key = "972adct84d"                        # Thing template key of the things
thing_keys = ["t7f7b9a3ea", "4bc27857f1", "c241314e1a"]  # Thing keys whose data we wish to fetch
metrics = ["pressure", "forging_temp", "job.value"]      # List of metrics that we want to fetch
thing_data_frames = dp.get_thing_data(start_time, end_time, thing_template_key, thing_keys, metrics)
thing_data_frames.show()
```

```
+--------------------+----------+----------------+--------+------------+-----+
|                  ts| thing_key|      thing_name|pressure|forging_temp|value|
+--------------------+----------+----------------+--------+------------+-----+
|2018-05-16 01:38:...|c241314e1a|Forging Hammer-1|    28.0|      1196.0|  2.0|
|2018-05-16 01:38:...|t7f7b9a3ea|Forging Hammer-2|    14.0|      1023.0|  3.0|
|2018-05-16 01:38:...|4bc27857f1|Forging Hammer-3|    21.0|       815.0|  4.0|
|2018-05-16 01:39:...|c241314e1a|Forging Hammer-1|    11.0|      1065.0|  2.0|
|2018-05-16 01:39:...|t7f7b9a3ea|Forging Hammer-2|    12.0|       870.0|  2.0|
|2018-05-16 01:39:...|4bc27857f1|Forging Hammer-3|    22.0|       786.0|  2.0|
|2018-05-16 01:39:...|c241314e1a|Forging Hammer-1|     7.0|      1133.0|  2.0|
|2018-05-16 01:39:...|t7f7b9a3ea|Forging Hammer-2|     8.0|       940.0|  4.0|
...
```
```
# Getting minute-level data
metrics = ["pressure", "forging_temp"]
thing_data_frames = dp.get_thing_data(start_time, end_time, thing_template_key, thing_keys, metrics, timegroup="minute")
thing_data_frames.show()
```

```
+--------------------+----------+----------------+-----------------+-------------------+-----------------+-----------------+-----------------+-------------+---------------+-------------+-------------+-------------+
|                  ts| thing_key|      thing_name|forging_temp::sum|forging_temp::count|forging_temp::max|forging_temp::min|forging_temp::avg|pressure::sum|pressure::count|pressure::max|pressure::min|pressure::avg|
+--------------------+----------+----------------+-----------------+-------------------+-----------------+-----------------+-----------------+-------------+---------------+-------------+-------------+-------------+
|2018-05-16 01:38:...|t7f7b9a3ea|Forging Hammer-2|           1023.0|                1.0|           1023.0|           1023.0|           1023.0|         14.0|            1.0|         14.0|         14.0|         14.0|
|2018-05-16 01:38:...|c241314e1a|Forging Hammer-1|           1196.0|                1.0|           1196.0|           1196.0|           1196.0|         28.0|            1.0|         28.0|         28.0|         28.0|
|2018-05-16 01:38:...|4bc27857f1|Forging Hammer-3|            815.0|                1.0|            815.0|            815.0|            815.0|         21.0|            1.0|         21.0|         21.0|         21.0|
|2018-05-16 01:39:...|t7f7b9a3ea|Forging Hammer-2|           1810.0|                2.0|            940.0|            870.0|            905.0|         20.0|            2.0|         12.0|          8.0|         10.0|
|2018-05-16 01:39:...|4bc27857f1|Forging Hammer-3|            786.0|                1.0|            786.0|            786.0|            786.0|         22.0|            1.0|         22.0|         22.0|         22.0|
|2018-05-16 01:39:...|c241314e1a|Forging Hammer-1|           2198.0|                2.0|           1133.0|           1065.0|           1099.0|         18.0|            2.0|         11.0|          7.0|          9.0|
|2018-05-16 01:40:...|c241314e1a|Forging Hammer-1|           1148.0|                1.0|           1148.0|           1148.0|           1148.0|         25.0|            1.0|         25.0|         25.0|         25.0|
|2018-05-16 01:40:...|4bc27857f1|Forging Hammer-3|           1695.0|                2.0|            874.0|            821.0|            847.5|         31.0|            2.0|         19.0|         12.0|         15.5|
...
```
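Note that each aggregated row carries both ::sum and ::count columns. These let you recombine averages correctly over coarser windows: averaging the per-minute ::avg values directly would weight every minute equally, regardless of how many readings it contains. A plain-Python sketch of this recombination, using collected rows represented as dicts with illustrative values:

```python
# Each dict mimics a collected minute-level row for one thing (values are illustrative)
rows = [
    {"forging_temp::sum": 1023.0, "forging_temp::count": 1.0},
    {"forging_temp::sum": 1810.0, "forging_temp::count": 2.0},
]

# Overall average = total sum / total count, not the mean of the per-minute averages
total_sum = sum(r["forging_temp::sum"] for r in rows)
total_count = sum(r["forging_temp::count"] for r in rows)
overall_avg = total_sum / total_count
```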
DataFrames can also be accessed as SQL tables by performing the following operation:

```
%pyspark
table_name = "workcenter_hierarchy"
dataframe.createOrReplaceTempView(table_name)
```

```
%sql
select * from workcenter_hierarchy
```
- save_and_publish_ml_model: This API lets you save a trained Spark machine-learning model on the file system for later use. It also publishes the model as a first-class model in the Datonis ML Models repository. The API takes the following parameters:
- model: The python object representing the machine learning model.
- name: Name given to the model. The model is referred to by this name in Datonis.
- description: A description of what the model represents.
- input_fields: An array of independent input variables used to create the model.
- output_fields: An array of dependent output variables.
An example of saving a trained model:

```
%pyspark
dp.save_and_publish_ml_model(model, "solar-panel-dc-current-predictor",
                             "Predicts Solar Panel's Ideal DC Output",
                             ["AmbientTemperature", "DcVoltage", "Irradiation", "ModuleTemperature"],
                             ["DcCurrent"])
```
- load_ml_model_by_name(name): This API lets you load a previously trained and saved model from the file system into the PySpark execution context. A loaded model can then be used to predict the values of the output variables from the input variable values fed to it. The API takes the following parameter:
- name: The name given to the model when it was saved and published.
An example of loading a previously saved model:

```
%pyspark
model = dp.load_ml_model_by_name("solar-panel-dc-current-predictor")
```