Normalizing Streaming Data & Piecewise Aggregate Approximation

Ok, so you’ve read the last post, downloaded and read the papers on SAX, and you’re ready to get going!  Wonderful.  First, you’ll need some data, which I’ve thoughtfully included for download here: SAX Prep (an Excel file with some trades in it).  Download the data, and then follow along below.

WHAT ARE WE DOING?

What we want to do is take a whole bunch of numeric data, reduce its dimension, and then convert it into some type of symbolic representation.  This is so we can do some other interesting things with it later that are much easier when the data is represented this way.  Currently, the data in the Excel spreadsheet that I’ve toiled for hours on, just for you, looks like this:

What we see in this chart is a day’s worth of trade prices for a make-believe symbol.  Actually, I know the symbol, but I can’t tell you that because you didn’t buy the data!

In the next step, we want to normalize the data to a mean of 0 and a standard deviation of 1.  So, compute the average and the standard deviation for the day, and then for each price, subtract the average and divide by the standard deviation.  Or just use some Excel functions, which I have done in the spreadsheet for you.
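If you’d rather see that step outside of Excel, here is a minimal Python sketch of the same arithmetic.  The function name `z_normalize` and the sample prices are made up for illustration; they are not from the spreadsheet.  It also assumes the population standard deviation, since the post doesn’t say which flavor the spreadsheet uses; swap in `statistics.stdev` if you want the sample version.

```python
from statistics import fmean, pstdev


def z_normalize(prices):
    """Rescale a list of prices to mean 0 and standard deviation 1."""
    mu = fmean(prices)
    sigma = pstdev(prices)  # population std dev: an assumption, see the note above
    if sigma == 0:
        # A perfectly flat series has nothing to divide by.
        raise ValueError("constant series: standard deviation is zero")
    return [(p - mu) / sigma for p in prices]


# Made-up prices, purely for illustration.
sample_prices = [10.00, 10.05, 10.02, 9.98, 9.95]
normalized = z_normalize(sample_prices)
```

After this, `normalized` has the same shape as the original prices, just expressed in standard deviations away from the day’s average.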

Piecewise Aggregate Approximation (PAA)

Once we’ve normalized the data, we can apply PAA.  I picked time divisions of an hour, and averaged the normalized price information within each one.  You can see the normalized price data and the resulting buckets, as computed via PAA, in the chart at the right.  There’s something important to notice here: although I didn’t pick many divisions, and more of them might have given more specificity to the resulting PAA analysis, you can see that the shape of the PAA looks like the underlying data.  This is important when we then use symbols to describe the patterns: because we’re using PAA underneath, we can calculate the distance between observed SAX patterns.  Also, you can see that some of the statistically irrelevant spikes have been ignored.  Super Good!

[Chart: With Applied PAA]
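The hourly bucketing described above can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the name `paa_by_hour`, the `(timestamp, value)` input shape, and the sample ticks are all invented here, and the values are taken to be already normalized.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def paa_by_hour(ticks):
    """Average normalized values per clock hour.

    ticks is an iterable of (datetime, normalized_value) pairs.
    Returns a list of (hour_start, mean_value), ordered by time.
    """
    buckets = defaultdict(list)
    for ts, value in ticks:
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    return [(hour, fmean(values)) for hour, values in sorted(buckets.items())]


# Made-up normalized ticks, purely for illustration.
ticks = [
    (datetime(2010, 1, 4, 9, 30), 1.5),
    (datetime(2010, 1, 4, 9, 45), 1.9),
    (datetime(2010, 1, 4, 10, 5), 0.2),
    (datetime(2010, 1, 4, 10, 50), 0.4),
]
segments = paa_by_hour(ticks)
```

Each entry in `segments` is one PAA bucket: the start of the hour and the average of the normalized prices that fell inside it.  Shorter divisions would mean more buckets and a closer fit to the underlying data.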

EVERYTHING’S A SYMBOL, MAN…

So, how do we go from normalized PAA to symbol?  Easy; if you look in the spreadsheet, you’ll see the values -1.28, -.84, -.52, -.25, 0, .25, .52, .84, and 1.28.  And I’ve associated letters with those #’s.  So, the first PAA is 1.68, which is greater than 1.28, so our word begins with I.
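Here is a rough Python sketch of that lookup.  Two things in it are my assumptions, not something the spreadsheet confirms.  First, the nine cut points are regenerated as the values that split a standard normal curve into ten equal-area slices, which happens to reproduce the numbers listed above.  Second, each letter A through I is tied to one cut point, and a value takes the letter of the largest cut point it reaches; that reading is consistent with 1.68 mapping to I, and anything below -1.28 is simply clamped to A.  The names `breakpoints`, `letters` and `to_symbol` are invented for the example.

```python
from bisect import bisect_right
from statistics import NormalDist

# Cut points that split the standard normal curve into ten equal-area slices.
breakpoints = [round(NormalDist().inv_cdf(k / 10), 2) for k in range(1, 10)]

# Assumption: one letter per cut point, A for -1.28 up through I for 1.28.
letters = "ABCDEFGHI"


def to_symbol(value):
    """Return the letter of the largest cut point that value reaches."""
    idx = bisect_right(breakpoints, value) - 1
    return letters[max(idx, 0)]  # clamp values below -1.28 to "A" (assumption)


first_symbol = to_symbol(1.68)
```

Note that the SAX papers treat nine cut points as the boundaries of ten regions, so a strict implementation would use a ten-letter alphabet; the sketch sticks with the nine letters used in this post.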

MAY I HAVE THE ENVELOPE PLEASE

So after all of this analysis, our SAX word that represents a whole lot of trade data is “IFGDBAB.”  How cool is that?  A whole day’s worth of data expressed as a few symbols.  Think of how much easier it would be to look up a nearest neighbor to this pattern, or maybe classify it given some cluster analysis, or detect something that we haven’t seen before using suffix trees.  All much easier to do with symbolic vs. continuous numeric data.

TAKE ME TO THE ‘B’ SECTION

If you read the papers I recommended, and have paid attention, you might notice a potential problem with the methodology outlined and applied here so far.  What is it?  Also, this has been a lot of fun to do using Excel, but I think we could actually get this done easier and faster using some good old sliding windows and aggregation (CEP).

AND AS ALWAYS

Thanks for reading – I’ll be showing how to do this using DarkStar next.  Because chances are if we’re doing this in real time, we’re doing it for *a bunch* of data, and Excel, although wonderful, just ain’t going to cut it.

Comments

  1. Normalizing Streaming Data & Piecewise Aggregate Approximation http://blog.cloudeventprocessing.com/201

  6. Can you find the ‘gotcha’ in this data mining technique for streaming data? http://bit.ly/byFXza #cep #analytics #sax
    This comment was originally posted on Twitter

  7. eicg says:

    Can you find the ‘gotcha’ in this data mining technique for streaming data? http://bit.ly/byFXza #cep #analytics #sax
    This comment was originally posted on Twitter

  8. ITBlogNet says:

    #CEP #Blogs Normalizing Streaming Data & Piecewise Aggregate Approximation: Twitter When building a system like Da… http://bit.ly/96f4na
    This comment was originally posted on Twitter

  9. CEPBlogs says:

    #CEP #Blogs Normalizing Streaming Data & Piecewise Aggregate Approximation: Twitter When building a system like Da… http://bit.ly/96f4na
    This comment was originally posted on Twitter

  10. datapeddler says:

    Normalizing Streaming Data & Piecewise Aggregate Approximation http://bit.ly/9AuRLz
    This comment was originally posted on Twitter

  11. No 1 has found flaw in normalizing streaming data 4 data mining B4 applying piecewise aggregate approximation at http://bit.ly/byFXza #CEP
    This comment was originally posted on Twitter

  12. Matt Rosen says:

    Great stuff.
    As the paper was about the iSAX implementation, I gather that your challenge question is with regard to iSAX’s ability to flexibly change resolutions so that you can smooth out the number of data items assigned to a given SAX “word”. In your example, the SAX value “B” has 2,240 assigned data points while “C”, “E”, and “H” have none. Your distribution approach seems to assume a normal distribution, which as the paper describes is rarely the case. The paper describes a technique for raising and lowering resolution so that an index can be constructed to limit the number of data points that need to be loaded and checked for a query match. I am guessing that this is what you are hinting at.

    I also have a question as to how you decided on / determined the values to use in your distribution (why the distribution step sizes of +/- 0.25, 0.27, 0.32, and 0.44?).

    • Colin says:

      Matt,

      Good observations. The distribution may present an issue based upon the type of data we’re analyzing. We’ll save that for later. But you do need to have a normalized data set for the PAA to work. We’re looking for patterns regardless of ‘scale.’ In regards to the distribution step sizes, each value represents an integral of equal size such that the area under the distribution curve is equal for consecutive terms and is dependent upon how many characters we’ll use in the SAX word.

      But the thing to watch out for here is when there’s relatively no activity – in this specific case, it would be for a thinly traded stock that stayed at the same price for an extended period of time. Probably not too probable, but it will occur. It may occur outside of Capital Markets though much more frequently. And then when you were normalizing the data, you’d be dividing by a very small standard deviation, blowing up the results into something meaningless. So time series with little variation have to be fudged a bit – we set them to the midrange of SAX word for that time slice.

  13. How do you compute mean and variance / standard deviation on a stream as it flies by? Even if you approximate those (compute them one day, and use them the next), then you can’t compare the code words you compute (“IFGDBAB” in the example) from one day to the next, unless there is no change in the statistics?

  14. An entire stream to a 7 character string. Cool! :)

    • admin says:

      Thank you, but not exactly. We’ll get a little more precise in upcoming posts. This was to just get the idea across. We’ve some work to do before we can say we’ve really accomplished anything yet.

  15. Robert Hall says:

    I can see the value and potential for mining the data stream. Good job.

    Perhaps I’m jumping the gun here a bit but I see room for using an algorithm to automatically adjust the standard deviation, mean and period to always keep the data sequence word in a meaningful state. Would/could this also be applied to predictive states with any accuracy?

    • admin says:

      Robert,

      Yes, it can and we will – in the next blog post I’ll show how to apply this to sliding windows of continuous data.

  16. jason_trost says:

    Interesting article. Normalizing Streaming Data & Piecewise Aggregate Approximation. http://bit.ly/9SKqtw #CEP #cloud #analytics
    This comment was originally posted on Twitter

  17. Kevin says:

    Interesting post. It makes me wonder if the DIST standard for storing and sharing probability distributions could be of use here. http://probabilitymanagement.org/Dist.htm

    While DISTs are often used to store thousands of Monte Carlo trials there is no problem using them to store historical data. The streaming part has me wondering as there could be too much overhead in creating the DISTs if it were on sub second (tick) data.

    I have on my ‘wish list’ to create a DIST generator (maybe as part of R Project or NinjaTrader) but until then there are 3 or so desktop apps that create DISTs. They are listed on site referenced above.

    Disclosure – I have a ‘relationship’ with XLSim.

    • admin says:

      I don’t know if it could or not – I don’t know enough about DIST to make any informed comment on this. Perhaps you’d care to comment further, or provide an example or two, to accompany your plug?

  18. Kevin says:

    Sure thing, and just for the record the ‘plug’ is for an emerging open standard for storing probability distributions or DISTs. It was developed in conjunction with many of these companies http://probabilitymanagement.org/Membership.htm

    So what is a DIST? It’s a compact and efficient representation of scenarios. It can be useful when modeling interrelated uncertainties be that standard distributions (e.g. normal, log normal), simple historical data, or more exotic (e.g.) fat-tailed distributions.

    A Distribution String (DIST) is comprised of thousands of realizations (Monte Carlo trials e.g.) encapsulated into a single data element.

    The advantages include a smaller storage size, increased speed (unpacking is typically done in memory), and they are additive, not to mention portable as they are represented in XML
    e.g. Header may look like this

    Body
    zPMzP//mZn//2ZmAAD//…//8AAJmZzMzMzJmZMzNmZgAAAACZmczMzMwAAGZmzMwAAJmZ//+Zmf//Zmb/////MzMzMzMzzMwAAJmZmZlmZv//MzP//zMzAAAAAMzM//+ZmTMz//9mZpmZ////////mZkzM8zMZmZmZpmZzMzMzGZmAADMzDMzzMwzM8zMzMwAADMzZmaZmWZmzMwAAJmZAADMzJmZAACZmQAAzMz//5mZMzPMzAAAAABmZmZmZmYzM2ZmmZkAAAAA
    The body contains data from 1000 ‘trades’ for a stock
    Clearly my example has fictitious data. Your SAX Prep XLS sparked the thought that DISTs may be useful.

    I’ll tweet more on this in the coming weeks (vacation next week *smiles).

    Kevin – @i001962

  19. Mike says:

    Keep in mind the formulas are peeking forward which is a no-no for out of sample trading. All out of sample data must use the same parameters as used in the original data. Thanks, for the concept. Very clever and I suspect very useful.

    • zadmin says:

      I’m not sure if I understand your “peeking forward” comment. Using this approach, one is able to identify historical patterns. Using this approach, one could implement machine learning algo’s that, given a stream’s pattern, could classify the behavior, and potentially project the next most probable course of events. But the underlying message here is that you can make decisions in real time and in time to take advantage of the opportunity.

      This is very difficult to do however with present day CEP platforms. And one of the key drivers in the research and subsequent realization of DarkStar.
