Managing IoT data streams in real-time and geographical space
Our R&D is generating solid evidence that speed of execution, accuracy, and effectiveness are the main factors in managing a data stream lifecycle. In particular, finding the trade-off among these factors is essential for avoiding the risk of overflowing an automated analytical workflow with useless data streams that are continually transmitted from IoT devices to edge, fog, and cloud resources. We are exploring a data stream lifecycle using Cisco Kinetic in order to meet the security, geographical distribution, and transmission needs of data lifecycle tasks such as data ingestion, data transportation, data storage, data leverage, and data control flow.
We are using the Petri Net (PN) model as a process mining technique to determine the stream behaviour of the automated analytical tasks. A PN model is a bipartite directed graph consisting of two types of nodes: places and transitions. The relations between the nodes are defined by arcs. In our PN model, transitions represent the sequence of events that take place when the analytical tasks are executed by the algorithms running at the edge, fog, and cloud resources; that is, they represent the state changes of the data streams being transported along the analytical tasks. Places model the resources available at the edge, fog, and cloud that are needed to execute our streaming analytical workflow.
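To make the firing semantics concrete, the following minimal sketch moves a token through a three-step lifecycle fragment. The place and transition names (e.g. stream_at_edge, ingest) are a toy example of our own for illustration, not the project's actual PN model.

```python
# Minimal Petri net sketch: places hold tokens (available resources),
# transitions fire when every input place holds at least one token,
# consuming input tokens and producing output tokens.
class PetriNet:
    def __init__(self):
        self.marking = {}        # place -> token count
        self.transitions = {}    # name -> (input places, output places)

    def add_place(self, name, tokens=0):
        self.marking[name] = tokens

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking[p] > 0 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise RuntimeError(f"{name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] += 1

# Toy data-lifecycle fragment: ingest -> transport -> store.
net = PetriNet()
for place in ("edge_ready", "fog_ready", "cloud_ready"):
    net.add_place(place)
net.add_place("stream_at_edge", tokens=1)
net.add_transition("ingest", ["stream_at_edge"], ["edge_ready"])
net.add_transition("transport", ["edge_ready"], ["fog_ready"])
net.add_transition("store", ["fog_ready"], ["cloud_ready"])

net.fire("ingest")
net.fire("transport")
net.fire("store")
print(net.marking["cloud_ready"])  # 1
```

Firing each transition models one state change of the data stream as it progresses through the workflow; a transition that is not enabled (its input place is empty) represents an analytical task waiting on an unavailable resource.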
We are using two process mining tools: ProM 6 and Fluxicon Disco.
We are members of the Fluxicon Academic Initiative for process mining research and education. We are working closely with a number of forward-thinking organizations around the world led by Professor Wil van der Aalst, who invented process mining at Eindhoven University of Technology. Under this initiative, Dr. Wachowicz’s students have access to the latest process mining technology and practices available in the market, as well as to a discussion forum.
Analytics Everywhere Ecosystem
Our major breakthrough is our early prototype of an “analytics everywhere” ecosystem, which has provided us with an iterative learning experience on how to advance our research towards automated analytical tasks for the Internet of Things. Combining analytical tasks using distributed resources (i.e. edge, fog, and cloud) is not a trivial task, and the deployment of different IoT use cases has been crucial in identifying the limitations and benefits of our ecosystem. The four main building blocks of our proposed ecosystem are:
Autonomous Analytical Workflows
We are developing a set of autonomous tasks to process and analyze data streams coming from IoT devices. The data streams are usually an unbounded sequence of tuples generated at high data rates and containing exceptionally noisy data. From a conceptual perspective, the design of an automated analytical workflow depends on the integration of complementary mobility contexts for processing massive data streams without human intervention. We have introduced the notions of trips, networks of trips, and mobility neighborhoods to represent a mobility context across geographical scales. For example, a single trip taken by a transit vehicle, or the mobility neighborhood of a moving autonomous car on a highway, can be aggregated into a network of trips or a cluster of synchronized mobility neighborhoods. Our research will continue to focus on identifying new conceptual artifacts that can be used to support the automation of analytical tasks. Despite the scientific evidence that context plays an important role in analytics, it still lacks careful examination.
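The three conceptual artifacts above can be sketched as nested data structures. This is an illustrative sketch only: the field names, the sample format, and the aggregation rule are our assumptions for exposition, not the project's actual representation.

```python
from dataclasses import dataclass, field

@dataclass
class Trip:
    # One vehicle's journey: a sequence of (timestamp, lat, lon) samples
    # taken from its data stream.
    vehicle_id: str
    samples: list = field(default_factory=list)

@dataclass
class MobilityNeighborhood:
    # Vehicles moving together within some spatial window, e.g. a group
    # of cars on the same highway segment.
    vehicle_ids: set = field(default_factory=set)

@dataclass
class NetworkOfTrips:
    # A coarser geographical scale: individual trips aggregated together.
    trips: list = field(default_factory=list)

    def vehicles(self):
        return {t.vehicle_id for t in self.trips}

# A transit-vehicle trip and a car trip aggregated into a network of trips.
bus = Trip("bus-12", [(0, 45.96, -66.64), (60, 45.97, -66.63)])
car = Trip("car-7", [(0, 45.96, -66.64)])
network = NetworkOfTrips(trips=[bus, car])
print(sorted(network.vehicles()))  # ['bus-12', 'car-7']
```

The point of the sketch is the scale hierarchy: a trip describes one vehicle, a mobility neighborhood describes co-moving vehicles, and a network of trips describes aggregated behaviour across a region.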
Algorithm Transparency
This research direction has emerged from our concerns about the inherent high risk of bias in the data and algorithms of automated analytical workflows. Resolving these issues will require broader steps, but our R&D outcomes are pointing to the need to understand how the analytical tasks, data lifecycles, and geographic spaces used to collect training IoT data can influence the behavior of predictive models. In the United States, there are already initiatives such as the AI Now Institute at New York University and the Algorithmic Justice League, with the help of the MIT Media Lab, warning about the power of biased algorithms and their social implications. There is no similar initiative in Canada.
Machine Learning on Graphs
Anticipatory Learning: We are interested in investigating anticipatory learning to address two challenges. The first is labeling training IoT data, which is currently done manually. The second is creating event logs for the Petri Nets, which is currently done in batch processing mode. We are investigating a feedback mechanism within the analytical workflow that will help us address both challenges.
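One simple form such a feedback mechanism could take is self-labeling: high-confidence model output on the incoming stream is fed back as training labels, reducing the manual labeling burden. The sketch below is a placeholder assumption, not our implemented mechanism; the toy scoring function and the 0.9 confidence threshold are invented for illustration.

```python
def toy_model(x):
    # Placeholder scorer standing in for a trained model: returns the
    # probability that a sensor reading is anomalous.
    return min(1.0, abs(x) / 10.0)

def pseudo_label(stream, threshold=0.9):
    # Feedback step: only readings the model is confident about (score
    # near 0 or near 1) are labeled and fed back into the training set;
    # ambiguous readings are left for manual review.
    labeled = []
    for x in stream:
        score = toy_model(x)
        if score >= threshold or score <= 1 - threshold:
            labeled.append((x, score >= threshold))
    return labeled

print(pseudo_label([0.2, 12.0, 5.0]))  # [(0.2, False), (12.0, True)]
```

Here the reading 5.0 scores 0.5 and is deliberately skipped: an anticipatory loop should only trust its own output where confidence is high.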
Cognitive Mapping for Machine Learning
We are exploring Dervin’s sense-making model due to its history of wide applicability to communication settings from both procedural and cognitive perspectives, as well as at individual and group levels.
We have built an edge/fog/cloud pipeline that consists of a distributed resource infrastructure. We chose the Cisco Kinetic platform because it is a single platform that can be deployed at both edge and fog nodes and can manage the data lifecycle of different types of IoT devices (e.g. lighting, parking, traffic, water management). Using the Cisco Kinetic platform was critical for initiating a data lifecycle at any time and any place. We also have the support of Compute Canada for the implementation of Hadoop cloud clusters, giving us access to the East Cloud and West Cloud resources. This work has led to a scientific breakthrough: we realized that the next challenge is not how fast to transport the different data streams produced by IoT devices, but rather the actual stream behavior of an automated analytical workflow. Our research is raising awareness of how crucial it is to understand whether the data streams, which are generated under different mobility contexts belonging to different IoT systems, actually conform to the expected data lifecycle needed to execute the sequence of tasks of an automated analytical workflow running at the edge, fog, and cloud.
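One minimal way to make this conformance question concrete is to replay an observed sequence of lifecycle events against the expected task order. The sketch below uses a simple in-order matching rule as a stand-in for full process-mining replay techniques; the event names follow the lifecycle tasks named earlier, and the fitness measure is an assumption for illustration.

```python
# Expected lifecycle order for a well-behaved stream.
EXPECTED = ["ingest", "transport", "store", "leverage"]

def fitness(observed, expected=EXPECTED):
    # Fraction of expected lifecycle steps matched, in order and without
    # skipping, by the observed event sequence. 1.0 means the stream
    # conforms fully to the expected lifecycle.
    i = 0
    for event in observed:
        if i < len(expected) and event == expected[i]:
            i += 1
    return i / len(expected)

print(fitness(["ingest", "transport", "store", "leverage"]))  # 1.0
print(fitness(["ingest", "store"]))                           # 0.25
```

A stream that jumps straight from ingestion to storage scores poorly because the missing transport step breaks the expected order, which is exactly the kind of deviation the PN-based analysis above is meant to surface.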