Tutorials organized by experts in particular areas constitute a great addition to the conference.

Please check the list of accepted tutorials below:

- Symbolic Data Analysis
- ~~Evolutionary Multivariate Approximation Models~~ (CANCELED)
- Learning Through Physiological Signals: from sensors for data acquisition to data processing for knowledge discovery

### Symbolic Data Analysis

**Contacts:**

**Paula Brito**, Universidade do Porto & LIAAD-INESC TEC, Portugal, email

**Sónia Dias**, Polytechnic Institute of Viana do Castelo & LIAAD-INESC TEC, Portugal, email

**Description:**

Symbolic Data Analysis is concerned with analysing data with intrinsic variability, which is to be taken into account. In Data Mining, Multivariate Data Analysis and classical Statistics, the elements under analysis are generally individual entities for which a single value is recorded for each variable (e.g., individuals described by age, salary, education level, etc.). But when the elements of interest are classes or groups of some kind (the citizens living in given towns; car models, rather than specific vehicles), there is variability inherent to the data.

Symbolic data goes beyond the usual data representation model, considering variables whose observed values for each element are no longer necessarily single real values or categories, but may assume the form of sets, intervals, or, more generally, distributions. In this tutorial, we shall introduce Symbolic Data Analysis with motivating examples. We then proceed to the definition of the new variable types, and introduce alternative symbolic data representations. Multivariate analysis of interval or histogram-valued data will then be addressed, focusing on clustering and regression approaches.
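As a minimal sketch of the idea, the snippet below aggregates individual records into one symbolic description per group: an interval `(min, max)` for a numeric variable and a relative-frequency distribution for a categorical one. The records, the field names, and the helper `to_symbolic` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical microdata: one record per individual (town, age, education level).
records = [
    ("Porto", 23, "secondary"),
    ("Porto", 41, "higher"),
    ("Porto", 35, "higher"),
    ("Viana", 52, "primary"),
    ("Viana", 30, "secondary"),
]

def to_symbolic(rows):
    """Aggregate individual rows into one symbolic description per group."""
    groups = defaultdict(list)
    for town, age, education in rows:
        groups[town].append((age, education))

    symbolic = {}
    for town, members in groups.items():
        ages = [age for age, _ in members]
        counts = Counter(education for _, education in members)
        symbolic[town] = {
            # Interval-valued variable: the range of ages observed in the group.
            "age": (min(ages), max(ages)),
            # Distribution-valued variable: relative frequency of each category.
            "education": {k: n / len(members) for k, n in counts.items()},
        }
    return symbolic

table = to_symbolic(records)
print(table["Porto"]["age"])
print(table["Porto"]["education"])
```

Each town is now a single row whose cells hold an interval or a distribution rather than a single value, which is the kind of data table the tutorial takes as its starting point.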

**Tutorial Planning**

- Symbolic data: Introduction and Motivation
- Symbolic Variables
- Interval-Valued Variables
- Distribution-Valued Variables
- Parametric models for interval and distributional-valued data
- Quantile-function representation of interval and distributional-valued data
- Clustering approaches
- Linear Regression models
- Software and References

### ~~Evolutionary Multivariate Approximation Models~~ (CANCELED)

**Contacts:**

**Angel Fernando Kuri-Morales**, Instituto Tecnológico Autónomo de México, México, email

**Description:**

We will address the problem of finding closed supervised models of multivariate phenomena. “Closed” refers to the fact that the “black-box” characteristic of models based on heuristics (e.g., Neural Networks) will be superseded. The purpose of multivariate approximation is to model the behavior of a dependent variable (*y*) as a function of a set of n independent variables (v_{1}, v_{2}, …, v_{n}), denoted collectively by v. A pair (v, y) is called a *tuple*. Since the form of the model is unknown, we select a general polynomial form for the approximant. This choice is justified by the Weierstrass Approximation Theorem [1]. Accordingly, the experimental model is defined to have the form y = c_{1}X_{1} + … + c_{m}X_{m}, where every product c_{i}X_{i} is called a term, c_{i} is the coefficient associated with the i-th term, *m* is the number of desired terms, and X_{i} is a monomial: the product of powers of the n independent variables, each raised to a power up to a maximum positive degree. We assume that there is a sample of size N such that for every set of values of the independent variables v there is a known value of the dependent variable y. Given the (assumed) appropriateness of the data in the sample, our basic aim is to find a mathematical expression y = f(v) (the model) such that, when the values of the independent variables are input to the model, the values of y obtained from the model adequately approximate those of the sample. We stress that, in this regard, the “multivariate analysis” stemming from our model should not be confused with the typical paradigm in which the form of the model is assumed (linear, exponential, etc.) and the data are then shown to fit (or not) the purported form. Here the form of the model is the most general one and, in principle, may fit any distribution. We further point out that, when arriving at the algebraic model, the relations between the variables are exposed and the underlying patterns in the sample are made mathematically explicit. Cases of this method are discussed and shown to explicitly exhibit the underlying patterns.
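The model form y = c_{1}X_{1} + … + c_{m}X_{m} can be sketched in a few lines. The snippet below only evaluates such a polynomial and measures how far it is from a sample. It does not implement the fitting procedure discussed in the tutorial. The function names, the coefficients, the exponents, and the sample are hypothetical, and the maximum absolute error is used here as one possible error measure, which is an assumption.

```python
from math import prod

def monomial(v, exponents):
    """X_i: product of the independent variables raised to the given powers."""
    return prod(x ** e for x, e in zip(v, exponents))

def model(v, coefficients, exponent_sets):
    """y = c_1 X_1 + ... + c_m X_m for one vector v of independent variables."""
    return sum(c * monomial(v, e) for c, e in zip(coefficients, exponent_sets))

def max_abs_error(sample, coefficients, exponent_sets):
    """Largest absolute deviation between the model and the sampled y values."""
    return max(abs(y - model(v, coefficients, exponent_sets)) for v, y in sample)

# Hypothetical model with n = 2 variables and m = 3 terms:
#   y = 2 + 3*v1*v2 - 0.5*v2**2
coefficients = [2.0, 3.0, -0.5]
exponent_sets = [(0, 0), (1, 1), (0, 2)]

# Hypothetical sample of tuples (v, y), generated from the same expression.
sample = [((1.0, 2.0), 6.0), ((0.0, 1.0), 1.5), ((2.0, 2.0), 12.0)]

print(max_abs_error(sample, coefficients, exponent_sets))
```

Because every term is an explicit monomial, the fitted expression can be read directly, which is the sense in which the model is "closed" rather than a black box.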

Several questions arise: a) How does one determine the number of terms in the model? b) Given the number of terms, how do we find the coefficients of the monomials? c) How do we determine the particular degree of every variable in the monomials? d) How do we ensure the adequateness of the model? In the tutorial we answer all of these questions. We first address (b) and show that the so-called **Ascent Algorithm (AA)** allows us to find the values of the c_{i}'s given a clearly defined form of any approximant. Many heuristic algorithms have been proposed and analyzed [2]. Here we show that the AA does not need any heuristic considerations and may be proven to find the minimum approximation error given any arbitrary sample [3].

Next we address (c) by showing that a particular kind of genetic algorithm (the so-called **Eclectic GA**, or EGA) is the optimization method that targets the best values of the degrees d. We discuss the EGA, how it compares with other possible variations, and how it was proven to be the best [4].

Regarding (d) we rely on **Cybenko's Universal Approximation Theorem (UAT)**. This theorem [5] proves that an artificial neural network (ANN) needs no more than one hidden layer to achieve the smallest possible approximation error for an arbitrary sample. Its architecture may be directly calculated from a closed formula discussed in [6].

To avoid the black-box nature of ANNs, we algebraically approximate the activation function *1/(1+e^{-x})* with Chebyshev polynomials [7] and show that the algebraic expansion of both layers delimits the powers in the terms in such a way that no more than 20 predetermined powers are needed. These are imposed on the individuals of the EGA.
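To make the Chebyshev step concrete, the sketch below interpolates the logistic function 1/(1+e^{-x}) at Chebyshev nodes and evaluates the result with the Clenshaw recurrence. The interval [-4, 4] and the degree 11 are assumptions chosen for illustration. The text does not state which interval or degree the tutorial uses, and the helper names are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def chebyshev_coefficients(f, degree, half_width):
    """Coefficients of the Chebyshev interpolant of f on [-half_width, half_width]."""
    n = degree + 1
    # Chebyshev nodes on [-1, 1], expressed through their angles.
    angles = [math.pi * (k + 0.5) / n for k in range(n)]
    values = [f(half_width * math.cos(a)) for a in angles]
    coeffs = []
    for j in range(n):
        s = sum(fv * math.cos(j * a) for fv, a in zip(values, angles))
        coeffs.append((1.0 if j == 0 else 2.0) * s / n)
    return coeffs

def evaluate(coeffs, x, half_width):
    """Evaluate sum_j c_j T_j(t) with t = x / half_width (Clenshaw recurrence)."""
    t = x / half_width
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):
        b1, b2 = 2.0 * t * b1 - b2 + c, b1
    return t * b1 - b2 + coeffs[0]

HALF_WIDTH = 4.0  # assumed approximation interval [-4, 4]
DEGREE = 11       # assumed polynomial degree

coeffs = chebyshev_coefficients(logistic, DEGREE, HALF_WIDTH)
grid = [-HALF_WIDTH + 0.01 * i for i in range(801)]
max_error = max(abs(logistic(x) - evaluate(coeffs, x, HALF_WIDTH)) for x in grid)
print(f"max |error| on the grid: {max_error:.2e}")
```

Once the activation is replaced by a polynomial, the network output becomes a polynomial in the inputs, so its terms can be expanded and inspected algebraically.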

Finally, we discuss the way in which the number of terms *m* may be determined from **a previously trained ANN to find *m* from experimentally selected data sets** (46 datasets from the University of California Machine Learning dataset repository and the Knowledge Extraction Evolutionary Learning dataset repository were collected) [8]. A total of 352 polynomials were calculated using our method. In every case the best values of *m* were recorded. An ANN was trained to identify *m* in the general case.

### Learning Through Physiological Signals: from sensors for data acquisition to data processing for knowledge discovery

**Contacts:**

**Raquel Sebastião**, University of Aveiro, Portugal, email

**Vitor Sencadas**, University of Aveiro, Portugal, email

**Rita Paula Ribeiro**, University of Porto & INESC-TEC, Portugal, email

**Description:**

The autonomic nervous system (ANS) regulates fundamental physiological states, upregulating and downregulating various functions within our body. As the ANS maintains the equilibrium of the body’s systems in response to both internal and external stimuli, many physiological signals reflect its activity.

Biomedical sensors, which are usually minimally invasive equipment and often wireless, can continuously stream to common devices (e.g., smartphones), offering an excellent opportunity to monitor the physiological correlates of several psychophysiological states of human subjects. Relevant information and meaningful characteristics from physiological signals can be obtained through the application of data mining methods.

This course will present the analysis of physiological reactions to different induced conditions, from the design of soft sensors for data acquisition to methodologies for extracting relevant information from the gathered signals. It will show the design and development of soft sensors for wearable applications and the collection of physiological data under different induced conditions, present raw data gathered in different experiments, highlight the importance of signal pre-processing, and disclose relevant features extracted from those signals together with approaches for recognizing patterns hidden in the data.

By presenting soft sensors for data acquisition in wearable applications, signal pre-processing techniques, and the extraction of relevant information and meaningful characteristics from those signals, along with approaches to analyze them, we expect this course to capture the interest of the intended audience.

Moreover, this course includes a hands-on session on the topics described above: attendees will have the opportunity to explore signal collection and processing for data analysis, which will help them understand the topics covered. Throughout the course, the audience will also be engaged through questions and insights to promote discussion on the approaches that could be used in each part. Different approaches will also be shown in order to compare the outputs obtained.

The format of the course serves the following objectives:

– Familiarization with wearables and data collection;

– Importance of signal pre-processing;

– Relevance of data preparation, feature extraction, and feature selection;

– Approaches to analyzing data, including data pre-processing techniques to deal with data imbalance, and disclosing knowledge hidden in the collected data.

**1. Wearables for Data Monitoring and Collection**

Wearables for physiological monitoring are devices that can be worn on the body and monitor various aspects of health and well-being. These devices use sensors to collect data about the body's physiological parameters and provide valuable insights into overall health.

Current wearable technologies are based on traditional bulky, heavy, and rigid electronics, making them incompatible with the elastic properties of human skin. Small portable electronic devices such as fitness bands, smartwatches, or smart rings cannot reliably gather physiological data continuously over days or weeks with the accuracy physicians need to improve diagnosis and treatment efficiency. Despite the technological advances in wearable devices, there is an increasing need to collect more, and more reliable, information from the human body to understand the physiological parameters of normal health and disease status, and to miniaturize these systems for integration into portable and skin-compliant devices.

Wearable devices can directly impact decision-making by recording the user’s health status 24 hours a day, proactively improving individuals' health, with the potential to reduce costs to the healthcare system, especially in demographic groups with high healthcare needs, such as elderly people, and in remote locations with small populations and scarce access to medical care.

**Key Insights that attendees will take away from this presentation:**

Attendees will receive a brief introduction to polymer science, processing, and characterization, and to its impact on the properties of materials for sensor and actuator applications. The presentation will focus on the synthesis of polymeric materials for the next generation of sensors for wearable applications, their main requirements, and their performance.

**2. Pre-processing Physiological Signals**

The ECG represents cardiac activity, and this signal can be contaminated by powerline interference, interference from skeletal muscle contractions, loss of electrode-skin contact, baseline wander (0.15 to 0.3 Hz) resulting from the subject's breathing, and other instrumentation noise. Commonly, filtering is used to attenuate these artifacts, mainly to remove the baseline wander and high-frequency noise.

EMG measures the electrical activity at the skin’s surface caused by muscle contraction, and this signal is often affected by noise sources, including electronic component noise, powerline interference, cable motion artifacts, electrode motion artifacts, and “crosstalk” (due to potentials from other muscles). Since the power density of motion artifacts is mostly below 20 Hz, a high-pass filter is often used to suppress this effect, and powerline interference is often removed with a narrow, fixed notch filter. Depending on the type of information sought, other methods of denoising the EMG, such as Empirical Mode Decomposition and Wavelet Decomposition, may be needed.
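The notch filter mentioned above can be illustrated with a small standard-library sketch: a second-order IIR notch in the usual biquad form, applied to synthetic sine waves. The 50 Hz powerline frequency, the 1000 Hz sampling rate, the quality factor of 30, and the 100 Hz stand-in for muscle activity are all assumptions made for the example. In practice a library routine such as SciPy's `scipy.signal.iirnotch` would normally be preferred over hand-written coefficients.

```python
import math

def notch_coefficients(f0, fs, q):
    """Second-order IIR notch (standard biquad form) centred on f0 Hz."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def biquad_filter(b, a, x):
    """Apply y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

fs = 1000.0  # assumed sampling rate in Hz
t = [n / fs for n in range(2000)]
powerline = [math.sin(2 * math.pi * 50.0 * ti) for ti in t]  # assumed 50 Hz interference
muscle = [math.sin(2 * math.pi * 100.0 * ti) for ti in t]    # synthetic stand-in for EMG content

b, a = notch_coefficients(f0=50.0, fs=fs, q=30.0)
# Compare the second half only, after the filter's start-up transient has decayed.
powerline_gain = rms(biquad_filter(b, a, powerline)[1000:]) / rms(powerline[1000:])
muscle_gain = rms(biquad_filter(b, a, muscle)[1000:]) / rms(muscle[1000:])
print(powerline_gain, muscle_gain)
```

The narrow stop band is the point of the design: the interference frequency is strongly attenuated while content only 50 Hz away passes almost unchanged.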

EDA refers to the variation of the electrical properties of the skin in response to sweat secretion, and this signal is mainly corrupted by the patient’s movement and temperature fluctuations. This noise is often removed with a low-pass filter, with exponential smoothing, or by removing the affected signal segments.
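Exponential smoothing, one of the options listed for EDA, is short enough to sketch directly. The sample values and the smoothing factor `alpha = 0.2` below are hypothetical; a suitable factor depends on the sampling rate and on how fast the signal of interest changes.

```python
def exponential_smoothing(signal, alpha):
    """Return s[n] = alpha * x[n] + (1 - alpha) * s[n-1], with s[0] = x[0]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in signal:
        previous = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1.0 - alpha) * previous)
    return smoothed

# Hypothetical EDA samples (microsiemens) with a short movement spike at index 3.
eda = [2.0, 2.1, 2.0, 6.0, 2.1, 2.0, 2.2]
print(exponential_smoothing(eda, alpha=0.2))
```

A smaller `alpha` suppresses spikes more strongly but also delays genuine changes in skin conductance, so the choice is a trade-off rather than a fixed rule.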

In respiratory signals, noise is mostly due to body movements and talking (speech breathing). Thus, filters can be used to reduce high-frequency noise while maintaining the signal’s structure. Given that the normal respiratory rate is roughly 12 to 20 bpm, a frequency band of approximately 0.1 to 0.35 Hz can be considered to lower the overall noise level in the signals.
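The link between the respiratory rate and the pass band is a unit conversion, shown here as a small check. Reading "bpm" as breaths per minute is an assumption, and `bpm_to_hz` is a hypothetical helper.

```python
def bpm_to_hz(breaths_per_minute):
    """Convert a respiratory rate in breaths per minute to a frequency in Hz."""
    return breaths_per_minute / 60.0

low_hz, high_hz = bpm_to_hz(12), bpm_to_hz(20)
band = (0.1, 0.35)  # pass band quoted in the text, in Hz

print(round(low_hz, 3), round(high_hz, 3))
print(band[0] <= low_hz and high_hz <= band[1])
```

The normal range of 12 to 20 breaths per minute maps to about 0.2 to 0.33 Hz, so the quoted band covers it with some margin on both sides.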

**Key Insights that attendees will take away from this presentation:**

This presentation will show the different raw signals collected and the main demands and requirements of signal pre-processing, exhibiting the frequency bands of interest for each signal and the most appropriate filters (with their advantages and disadvantages). Attendees will understand the need for pre-processing signals and the importance of this step for the success of subsequent tasks.

**3. Univariate and Multivariate Analysis of Extracted Physiological Features**

This presentation will show the significance of some of the several features that can be extracted from the collected physiological signals.

However, it is unlikely that a single signal can disclose all the information regarding an induced condition or behavior. A multivariate perspective is therefore of utmost relevance for a more comprehensive approach and a more effective understanding of the physiological correlates of induced conditions and behaviors. Thus, multivariate analysis of physiological features will also be a focus of this presentation.
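As a rough sketch of what moving from univariate to multivariate features can look like, the snippet below computes simple per-window statistics for two signals and then combines them into one feature vector per window. The signal values, the window length, and the chosen statistics are hypothetical. The tutorial may use different, signal-specific features.

```python
from statistics import mean, pstdev

def window_features(signal, window, step):
    """Mean, standard deviation, and range for each sliding window of a signal."""
    features = []
    for start in range(0, len(signal) - window + 1, step):
        segment = signal[start:start + window]
        features.append({
            "mean": mean(segment),
            "std": pstdev(segment),
            "range": max(segment) - min(segment),
        })
    return features

def pearson(x, y):
    """Pearson correlation between two equally long feature sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / norm

# Hypothetical, already pre-processed signals sampled at the same instants.
heart_rate = [62, 63, 61, 64, 70, 75, 78, 80, 79, 81, 77, 76]
eda = [2.0, 2.1, 2.0, 2.1, 2.6, 3.0, 3.3, 3.4, 3.5, 3.4, 3.3, 3.2]

hr_feats = window_features(heart_rate, window=4, step=4)
eda_feats = window_features(eda, window=4, step=4)

# Multivariate view: one feature vector per window, combining both signals.
vectors = [(h["mean"], e["mean"]) for h, e in zip(hr_feats, eda_feats)]
correlation = pearson([h["mean"] for h in hr_feats], [e["mean"] for e in eda_feats])
print(vectors)
print(correlation)
```

Looking at the two features jointly shows that they rise together in this toy data, a relationship that neither signal reveals when inspected alone.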

**Key Insights that attendees will take away from this presentation:**

Attendees will understand the meaning of different physiological features, how to compute them, and the importance of selecting the most relevant in the domain context.

**4. Supervised learning in imbalanced domains**

When analyzing ANS reactions, we have a variety of data science techniques that help uncover meaningful physiological signal characteristics through data preparation (e.g., feature extraction and selection) and modeling stages. This modeling typically relies on supervised learning methods. However, although clinical domains are commonly imbalanced, standard classification tasks do not account for that. Going beyond traditional classification approaches, this presentation will address this problem and put forward strategies to overcome such drawbacks. Standard machine learning models aim to maximize the performance across all the observations and thus neglect the performance over the rare ones, which are the most important ones from the domain perspective. Several techniques have been proposed to tackle this problem of learning in imbalanced domains. Data-level approaches are the most commonly used and consist of changing the data distribution through undersampling, oversampling, and synthetic data generation techniques. Additionally, the choice of appropriate performance metrics is of crucial importance.
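Random oversampling is the simplest of the data-level approaches mentioned above and can be sketched with the standard library. The function `random_oversample`, the feature vectors, and the class names are hypothetical. Synthetic data generation methods are more involved and are not shown here.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority examples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())

    out_x, out_y = [], []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical feature vectors: 8 "baseline" windows and 2 "stress" windows.
X = [[0.1 * i, 1.0] for i in range(8)] + [[5.0, 2.0], [5.5, 2.1]]
y = ["baseline"] * 8 + ["stress"] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y), Counter(y_bal))
```

Resampling of this kind should be applied to the training data only, so that the evaluation still reflects the original class distribution.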

**Key Insights that attendees will take away from this presentation:**

Attendees will understand the drawbacks of neglecting the imbalanced nature of the domain, as well as effective strategies for mitigating this issue. This presentation will also focus on identifying suitable performance metrics to assess the effectiveness of machine learning models.
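A small example shows why the choice of metric matters in an imbalanced domain. The labels and the always-majority "model" below are hypothetical. Recall and balanced accuracy are shown as two common alternatives to plain accuracy, not as the specific metrics the presentation will recommend.

```python
def recall(y_true, y_pred, positive):
    """Fraction of the examples of class `positive` that were predicted as such."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)

def accuracy(y_true, y_pred):
    """Fraction of all examples that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical test set: 95 common cases and 5 rare ones.
y_true = ["common"] * 95 + ["rare"] * 5
# A model that always predicts the majority class.
y_pred = ["common"] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(recall(y_true, y_pred, "rare"))     # 0.0
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The model looks strong by accuracy yet never detects a rare case, which is exactly the failure that accuracy alone hides.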

**5. Hands-on**

The attendees will be challenged to address the topics covered by this course: after a brief data collection experiment, they will be given a prepared dataset with physiological signals and features (and also contextual data) from a specific protocol, and they will work through the topics, from signal pre-processing, through feature extraction and selection, to data analysis for knowledge discovery.

We will rely on the publicly available WESAD dataset (https://ubicomp.eti.uni-siegen.de/home/datasets/icmi18/), a multimodal dataset for wearable stress and affect detection. It contains physiological data (ECG, EMG, EDA, Resp) from 15 healthy participants recorded during a lab study designed for stress and affect detection, in which the subjects were exposed to different affective stimuli (neutral, stress, and amusement) and two meditation periods (to de-excite the participants).

Besides the physiological signals collected during these conditions, the dataset also includes context notes about the participants and self-assessment report results.

**Description of and/or links to any planned materials or resources to be distributed to attendees**

This course will use the publicly available WESAD dataset (https://ubicomp.eti.uni-siegen.de/home/datasets/icmi18/). Attendees are expected to bring personal laptops. Further details on software will be given to the attendees prior to the course. Consent from (some of) the attendees to monitor their own signals will be required (data will not be recorded).