#### Data Collections Used in Our First Evaluation

The experimental evaluation had several goals. The first was to ensure that the QualiMaster infrastructure can process real data from the financial domain, as provided by our application partner. This concerns not only the quality of the resulting financial analysis but also the ability to process the data within the desired time frames. However, to serve the ultimate purposes of the project, we also needed to test the behavior of the infrastructure in more demanding situations, for example with a larger number of market players or a higher number of ticks per second. To cover these goals, we used real data sets from the financial domain as well as synthetic data sets. The following paragraphs describe them in detail, while Table 1 summarizes their main characteristics.

#### (I) Real data sets from the Financial Domain

Our first data sets are composed of real data, which SPRING provides through a custom API. SPRING obtains the data from established financial data providers, two of which have been used. For an initial small data set, data was gathered from “TeleTrader Software GmbH”. At a later stage of the project, the source switched to “TaiPan”, a product of “Lenz und Partner AG”, which provides much larger, real-time data sets. In each case we collected the data of a whole day. In particular, we have two data sets: **SRD-A**, which contains the data of 03/18/2014, and **SRD-B**, which contains the data of 07/08/2015. As shown in the overview in Table 1, SRD-A has 125 market players and an average of 29.2 ticks per second, whereas SRD-B has 2830 market players and an average of 422.7 ticks per second.

Given these characteristics, SRD-A is the smallest among all real and synthetic data sets, since it has the smallest number of market players as well as the lowest average number of ticks per second. We additionally consider SRD-B the representative data set for the financial domain, since it covers all important segments of the financial markets and provides high-accuracy data.

#### (II) Synthetic Financial Data Sets

As explained above, we also created synthetic data sets in order to investigate the infrastructure’s behavior in more complex situations and under various characteristics. For this, we used a data simulator provided by SPRING. The simulator allows us to create data sets with a custom number of market players, data rate (ticks per second), and data length, as well as specific behavior of the market players’ data in terms of the correlation between market players.

Using the simulator, we created three collections that contain data sets with different characteristics. In particular, we varied the following characteristics: (a) the number of market players that provide stock information; (b) the number of ticks provided by all market players within a second; and (c) the overall number of comparisons that the correlation algorithms need to compute. Table 1 provides an overview of the created collections. As shown, the first collection, named **Increasing-MP&TS**, contains data sets with an increasing number of market players and ticks per second (and thus also correlations). In the second collection, named **Increasing-MP**, we fixed the number of ticks per second and varied only the number of market players. The overall number of correlations that must be computed increases along with the number of players. In the third collection, named **Increasing-TS**, we fixed the number of market players and varied the ticks per second. In this collection, the number of correlations is the same for all data sets, since it depends only on the number of market players.

Figure 1 provides a graphical illustration of the three collections. Each collection, represented by a different color, contains a small number of data sets with a varying number of market players and ticks per second (corresponding to the two axes of the plot), as already explained. The number of correlations that must be computed is denoted by the size of the circles as well as by the number shown inside each circle.
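To make the relation between market players and correlations concrete, the following minimal sketch assumes that one correlation is computed per unordered pair of market players. The function name and the data-set settings are hypothetical and are not the values from Table 1; they only illustrate why the count grows in Increasing-MP and stays constant in the collection with a fixed number of players.

```python
def pairwise_correlations(market_players: int) -> int:
    """Number of unordered pairs among `market_players` players.

    Assumption: one correlation is computed per pair of market players.
    """
    return market_players * (market_players - 1) // 2


# Hypothetical settings as (market players, ticks per second);
# these are NOT the values from Table 1.
increasing_mp = [(100, 50), (200, 50), (400, 50)]    # ticks fixed, players vary
increasing_ts = [(200, 50), (200, 100), (200, 200)]  # players fixed, ticks vary

mp_counts = [pairwise_correlations(mp) for mp, _ in increasing_mp]
ts_counts = [pairwise_correlations(mp) for mp, _ in increasing_ts]

print(mp_counts)  # grows roughly quadratically with the number of players
print(ts_counts)  # constant, because the number of players is fixed
```

Under this assumption, doubling the number of market players roughly quadruples the number of correlations, whereas changing only the tick rate leaves it unchanged.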

One important aspect of the generated data sets is that for some of them we know the behavior of the market correlations. When executing these data sets, we should observe the correlations between market players alternating sequentially between uncorrelated and correlated; that is, the market players are initially uncorrelated, then become correlated, then uncorrelated again, and so on. We therefore use these data sets to validate the quality of the corresponding correlation algorithms.
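The alternating pattern can be illustrated with a toy generator. This is only a sketch under our own assumptions, not the SPRING simulator: it produces two random series whose phases alternate between independent noise and closely coupled values, and then measures the Pearson correlation per phase. All names and parameters (`alternating_pair`, `phase_len`, `noise`, `seed`) are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def alternating_pair(phases=4, phase_len=500, noise=0.1, seed=42):
    """Two series that alternate between uncorrelated and correlated phases.

    Even-numbered phases are independent noise; in odd-numbered phases the
    second series follows the first one up to a small noise term.
    """
    rng = random.Random(seed)
    a, b = [], []
    for phase in range(phases):
        for _ in range(phase_len):
            x = rng.gauss(0.0, 1.0)
            if phase % 2 == 1:
                y = x + rng.gauss(0.0, noise)  # correlated phase
            else:
                y = rng.gauss(0.0, 1.0)        # uncorrelated phase
            a.append(x)
            b.append(y)
    return a, b


phase_len = 500
a, b = alternating_pair(phase_len=phase_len)
per_phase = [
    pearson(a[i:i + phase_len], b[i:i + phase_len])
    for i in range(0, len(a), phase_len)
]
print([round(r, 2) for r in per_phase])
```

A correlation algorithm run over such data should report values near zero in the first and third phases and values close to one in the second and fourth, which is the kind of known behavior that makes validation possible.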

As we explained, our goal is to enhance these data sets in the upcoming months. In addition to the aspects discussed above, we are considering accompanying the collections with other data aspects, or even with configuration attributes of the QualiMaster infrastructure execution environment, that might influence the overall processing performance. In this respect, we are currently investigating the following:

- Allow market players to provide more than one tick per second (currently we consider one tick per market player per second as the maximum).
- Modify the size of the streaming window.
- Check the behavior of the infrastructure when processing the data sets on a cluster with a varying number of nodes.

#### (I) Validation Methodology

One methodology for validating the results is to use data sets whose behavior is known a priori, i.e., data sets that are either composed of real data or generated by a simulator. Thus, when executing a financial processing algorithm over these data collections, we should see the expected behavior. As already explained and illustrated in Table 1, for data sets I-ALL-A and I-ALL-E (included in the Increasing-MP&TS collection), the expected behavior is a sequential alternation between uncorrelated and correlated market players. This can be used for validating the evaluation results.

For example, we executed one of our implementations over these data sets and then computed the average of the correlation values returned in each second. These per-second average correlation values were as expected. Figure 19 shows the results of this evaluation over the I-ALL-A data set. This data set spans one hour, and thus the plot shows the average of all returned correlation values for each second during this hour. The results are as expected, i.e., market players are uncorrelated and become correlated, then become uncorrelated again, and so on.
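The per-second averaging can be sketched as follows. This is a minimal illustration, assuming that the algorithm output is available as `(second, correlation)` tuples; this record layout, the function name, and the sample values are assumptions made for the example and are not taken from the QualiMaster implementation.

```python
from collections import defaultdict


def average_correlation_per_second(results):
    """Map each second to the mean of the correlation values returned in it.

    `results` is an iterable of (second, correlation) tuples (assumed layout).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for second, corr in results:
        totals[second] += corr
        counts[second] += 1
    return {s: totals[s] / counts[s] for s in sorted(totals)}


# Made-up sample values, for illustration only.
results = [(0, 0.0), (0, 0.5), (1, 0.75), (1, 1.0), (1, 0.5)]
averages = average_correlation_per_second(results)
print(averages)  # {0: 0.25, 1: 0.75}
```

Plotting such a mapping over the 3600 seconds of a one-hour data set would give a curve of the kind described above, with low averages in uncorrelated periods and high averages in correlated ones.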

#### (I) Real-life Situations

An interesting aspect of our data sets is that they include real-life situations. For example, we found that the SRD-B data set has a large number of active market players during some hours, as shown in Figure 3. For instance, around 16:00 we see 1684 market players within 10 minutes, which corresponds to about 1.4 million correlations. Algorithms developed for processing financial data should be able to handle such situations.
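The figure of 1.4 million is consistent with counting one correlation per unordered pair of market players. Assuming this pairwise counting (the text does not state the formula explicitly), the arithmetic is:

$$
\binom{1684}{2} = \frac{1684 \cdot 1683}{2} = 1\,417\,086 \approx 1.4 \times 10^{6}
$$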

### Data Collection