Data Collections

Data Collections Used in Our First Evaluation

The experimental evaluation we performed had various goals. The first was to ensure that the QualiMaster infrastructure is able to process real data from the financial domain as provided by our application partner. This refers not only to the quality of the resulting financial analysis but also to the ability to process the data within the desired time frames. However, in order to reach the ultimate purposes of the project, we also needed to test the behavior of the infrastructure in more complex situations, for example when a larger number of market players or a higher number of ticks per second is observed. To capture these goals we used real data sets from the financial domain as well as synthetic data sets. The following paragraphs provide their detailed description, while Table 1 summarizes their main characteristics.

 (I) Real Data Sets from the Financial Domain

Our first data sets are composed of real data, provided by SPRING through a custom API. SPRING obtains the data from established financial data providers; two providers have been used for this. For an initial small data set, data was gathered from “TeleTrader Software GmbH”. In a later stage of the project, the source of data switched to “TaiPan”, a product of “Lenz und Partner AG”, which provides much larger, real-time data sets. We collected the data of a whole day. In particular, we have two data sets: SRD-A, which contains the data of 03/18/2014, and SRD-B, which contains the data of 07/08/2015. As shown by the overview in Table 1, SRD-A has 125 market players and 29.2 (average) ticks per second, whereas SRD-B has 2830 market players and 422.7 (average) ticks per second.
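The per-data-set statistics quoted above (number of market players, average ticks per second) can be derived directly from a tick log. The following sketch assumes a simplified log format of (timestamp-in-seconds, market player) pairs; the actual format delivered through the SPRING API differs.

```python
from collections import Counter

def stream_stats(ticks):
    """Compute the two Table 1 statistics from a tick log:
    the number of distinct market players and the average
    number of ticks per second."""
    players = set()
    ticks_per_second = Counter()
    for timestamp, player in ticks:
        players.add(player)
        ticks_per_second[timestamp] += 1
    if not ticks_per_second:
        return 0, 0.0
    avg = sum(ticks_per_second.values()) / len(ticks_per_second)
    return len(players), avg

# Tiny illustrative log: two players, three ticks over two seconds.
print(stream_stats([(0, "A"), (0, "B"), (1, "A")]))  # (2, 1.5)
```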

Given these characteristics, SRD-A is the smallest among all real and synthetic data sets, since it has the smallest number of market players as well as the lowest average ticks per second. In addition, we consider SRD-B to be the representative data set for the financial domain, since it covers all important segments of the financial markets and provides high-accuracy data.


Table 1: An overview of the collections of data sets used in our current evaluation: (a) real-world financial data streams; and (b-d) collections of synthetic data sets.



(II) Synthetic Financial Data Sets

As we explained, we also created synthetic data sets in order to investigate the infrastructure’s behavior in more complex situations and with varying characteristics. For this, we used a data simulator provided by SPRING. This simulator allows creating data sets with a custom number of market players, data rate (ticks per second), data length, and a specific behavior of the market players’ data in terms of their correlation.
Using the simulator, we created three collections that contain data sets with different characteristics. In particular, we varied the following characteristics: (a) the number of market players that provide stock information; (b) the number of ticks provided by all market players within a second; and (c) the overall number of comparisons that the correlation algorithms need to compute. Table 1 provides an overview of the created collections. As shown, the first collection, named Increasing-MP&TS, contains data sets with an increasing number of market players and ticks per second (and thus also correlations). In the second collection, named Increasing-MP, we fixed the number of ticks per second and modified only the number of players; the overall number of correlations that must be computed increases along with the number of players. In the third collection, named Increasing-TS, we fixed the number of market players and varied the ticks per second. In this collection the number of correlations is the same for all data sets, since the correlations depend only on the number of market players.
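Since the correlation algorithms compare every pair of market players, the number of comparisons follows directly from the player count as n(n-1)/2, which is why Increasing-MP also increases the correlations while Increasing-TS keeps them fixed. A small sketch (the player counts below are illustrative, not the actual collection values):

```python
def pairwise_correlations(n_players: int) -> int:
    """Number of distinct market-player pairs the correlation
    algorithms must compare: n * (n - 1) / 2."""
    return n_players * (n_players - 1) // 2

# Doubling the players roughly quadruples the comparisons:
for n in (100, 200, 400, 800):
    print(n, pairwise_correlations(n))
```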
Figure 1 provides a graphical illustration of the three collections. Each collection, represented by a different color, contains a small number of data sets with a varying number of market players and ticks per second (corresponding to the two axes of the plot), as we already explained. The number of correlations that must be computed is denoted by the size of the circles as well as by the number shown inside each circle.
One important aspect of the generated data sets is that for some of them we know the behavior of the market correlations in advance. When executing these particular data sets, we should observe a sequential interchange of the correlations between market players from uncorrelated to correlated: the market players are initially uncorrelated, they then become correlated, then uncorrelated again, and so on. Thus, we use these data sets for validating the quality of the corresponding correlation algorithms.
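This alternating behavior can be mimicked in a few lines of code. The sketch below is an assumed stand-in for SPRING's simulator, not its actual implementation: during "correlated" phases all players follow one shared signal plus small noise, while during "uncorrelated" phases each player emits independent noise.

```python
import math
import random

def synthetic_tick_series(n_players, n_seconds, phase_len):
    """Generate one value per player per second, alternating between
    uncorrelated phases (independent noise) and correlated phases
    (a shared signal plus small per-player noise)."""
    series = [[] for _ in range(n_players)]
    for t in range(n_seconds):
        correlated = (t // phase_len) % 2 == 1
        shared = math.sin(t / 5.0)  # slow common signal
        for p in range(n_players):
            if correlated:
                series[p].append(shared + random.gauss(0.0, 0.05))
            else:
                series[p].append(random.gauss(0.0, 1.0))
    return series

random.seed(42)  # reproducible illustration
data = synthetic_tick_series(n_players=4, n_seconds=60, phase_len=30)
```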


Figure 1: An illustration of the collections of data sets used in our current experimental evaluation, which investigates the influence of market players, ticks per second, and required comparisons (denoted by the size of the circles as well as the shown numbers).


Table 2: Statistics of the four data sets with Social data.

As we explained, our goal is to extend these data sets in the upcoming months. In addition to the aspects discussed above, we are currently considering accompanying the collections with other data aspects, or even configuration attributes of the QualiMaster infrastructure execution environment, that might influence the overall processing performance. In particular, we are currently investigating the following:

  • Allow market players to provide more than one tick per second (currently we consider one tick per market player per second as the maximum).
  • Modify the size of the streaming window.
  • Check the behavior of the infrastructure when processing the data sets on a cluster with a varying number of nodes.
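Regarding the streaming window in particular, the sketch below shows one simple way a sliding window over the tick stream could feed the pairwise correlation computation. It assumes a plain per-player window holding the last k values; it is not the actual QualiMaster pipeline.

```python
from collections import deque
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / (vx * vy) ** 0.5

class SlidingWindowCorrelator:
    """Keep the last `window` ticks per player and recompute all
    pairwise correlations once a full window is available."""

    def __init__(self, n_players, window):
        self.window = window
        self.windows = [deque(maxlen=window) for _ in range(n_players)]

    def push(self, tick_values):
        # tick_values: one value per market player for this second
        for w, v in zip(self.windows, tick_values):
            w.append(v)

    def correlations(self):
        if len(self.windows[0]) < self.window:
            return {}  # window not yet full
        return {(i, j): pearson(self.windows[i], self.windows[j])
                for i, j in combinations(range(len(self.windows)), 2)}

# Usage: three players, window of four seconds.
c = SlidingWindowCorrelator(n_players=3, window=4)
for t in range(4):
    c.push([t + 1.0, 2.0 * (t + 1), 4.0 - t])
result = c.correlations()  # (0,1) fully correlated, (0,2) anti-correlated
```

Note that enlarging the window increases both the memory kept per player and the cost of each correlation, which is one reason the window size is worth varying in the evaluation.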

(III) Validation Methodology

One methodology for validating the results is to use data sets in which the behavior is known a priori, i.e., data sets that are either composed of real data or generated by a simulator. Thus, when executing a financial processing algorithm over these particular data collections, we should see the expected behavior. As we already explained and illustrated in Table 1, for the data sets I-ALL-A and I-ALL-E (included in the Increasing-MP&TS collection), the expected behavior is a sequential interchange between uncorrelated and correlated market players. This can be used for validating the evaluation results.
For example, we executed one of our implementations over these specific data sets and then computed the average of the correlation values returned in each second. These average correlation values per second were indeed as expected. Figure 2 shows the results of this evaluation over the I-ALL-A data set. This data set covers one hour, and thus the plot shows the average of all returned correlation values for each second during this hour. It is easy to see that the results are as expected, i.e., market players are initially uncorrelated and become correlated, then become uncorrelated again, and so on.
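The check itself can be automated: split the per-second average correlation values into phases and verify that the phase means alternate between a low (uncorrelated) and a high (correlated) level. A minimal sketch, with illustrative thresholds (0.2 and 0.8) rather than the ones actually used:

```python
def phase_means(avg_corr_per_sec, phase_len):
    """Mean of the per-second average correlation values within each
    consecutive phase of `phase_len` seconds."""
    phases = [avg_corr_per_sec[i:i + phase_len]
              for i in range(0, len(avg_corr_per_sec), phase_len)]
    return [sum(p) / len(p) for p in phases]

def alternates(means, low=0.2, high=0.8):
    """True if phases alternate: uncorrelated first, then correlated."""
    return all(m < low if k % 2 == 0 else m > high
               for k, m in enumerate(means))

# Illustrative per-second averages: two uncorrelated/correlated cycles.
values = [0.05] * 3 + [0.95] * 3 + [0.10] * 3 + [0.90] * 3
print(alternates(phase_means(values, 3)))  # True
```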


Figure 2: The average correlation value generated for the I-ALL-A data set (i.e., from the Increasing-MP&TS collection) over all market players in each second. The correlation algorithm behaves correctly, since the sequential interchange between uncorrelated and correlated market players is the behavior that we expect to see when processing this particular data set.

(IV) Real-life Situations

An interesting aspect of our data sets is that they include real-life situations. For example, we detected that the SRD-B data set has a large number of market players during some hours, as shown in Figure 3. For instance, around 16:00 we see 1684 market players within 10 minutes, which means that roughly 1.4 million correlations must be computed. Algorithms developed for processing financial data should be able to handle such situations.


Figure 3: The number of ticks and market players in the SRD-B data set.
