# Data to Knowledge Day 1

## Scaling Time Series Data Mining to trillions of series

Eamonn Keogh

1. Similarity search is the fundamental problem for any task
2. This problem is now solved.
3. As effective as possible, as efficient as possible.

What is similarity search? All about required invariances: Looking at ancient manuscripts of fish – identify multiple drawings of the same species. Invariant to color, scale, and orientation.
(Note that rotation becomes phase, easy to handle). Of course database to compare to might be very large, so this must be fast.

Time-series space: Larson transform. Key is often changing between data formats. DNA to time-series: G = step up 1, A = step up two, C = step down 1, etc.
Produces accurate phylogenies Locke et al., Nature.

Identify mosquito species based on activity time series pattern.
Now doing in real time identification of sex and species by flight path from 100m away.

Beet leafhopper, \$10 billion damage by spreading disease between plants.
measure voltage on leafhoppers. Time-series motifs identify common behaviors.
See paper

Finding rules in time series for predicting patterns.
Rules work across Zebra finch songs. Knockout genes, see how this changes.

What are the distance measures? Invariant to: noise, uniform scaling, offset, amplitude scaling, dropouts, phase, warping.

Euclidean distance between ith pt in the time-series.
Much better: distance to i+1 etc. “Dynamic Time Warping”.

Claim to have tested every publicly available dataset in the world.
Got code/advice from every author and asked them how to make it better.
All other methods are slower, more parameter-rich than anything else. Not on average, but on every dataset we’ve tested. (While not “every dataset in the world”, Eamonn probably has assembled one of the largest collections of datasets used in clustering, which he makes conditionally available on his website).

But isn’t DTW is O(N^2) to minimize the path in distance space between any pt in set 1 and any set in pt 2.
LB Keogh method makes it O(N). But, can now do in O(1).

How about high dimensions. answers: argues that low-dimensions is all that you need, which needs domain expertise.

## Random forests

more memory efficient, scalable solution for random forests.
code on github

## coffee break

Nice chatting with

• Josh Bloom, Astrophysics.
• Michael Last, applied math, US Gov’t.

• Excellent discussion with Keogh about applications of time series analysis to early warning

## Sparse GEV Yan Liu.

• EM algorithm approach.

## Classifying signals with missing data

i.e. before we’ve seen the future of the data.
Estimate reliability of the data, decide when to act.
(Reliable != accurate. Just means same prediction as full data) Model random features parameterized from data so far.

## Astro-statistical challenges (Alex Szalay)

limitations of poor sampling give way to systematic errors unknown unknowns when no models available

Sparse signal, large dimensions, much noise, rare events, Large SVD perfect for randomized algorithms

## Jeff Scargle, Bayesian Blocks: Segmented Models for Detecting and Characterizing Transients in Streaming Time Series

Construct a good representation of the variability of the underlying process from time series data.

A Bayesian dynamic programming solution for selecting change-points.