The possibilities for multivariate analysis with a single visualization tool like Parallel Coordinates are endless. It might appear absurd to begin with, but once you start playing around you will find the tool easy to use, and it can turn out to be one of your best friends for data analysis.
Here are a couple of scenarios where you can apply Parallel Coordinates (PC).
- Start with a hypothesis and try to prove or disprove the hypothesis based on the correlation between the different variables.
- Or, the classic Unknown Unknowns, where you often end up with the moment ‘Hmm. I didn’t even realize such a relationship existed between these variables.’ PC creates scenarios for such aha moments.
Here is a sample data analysis workflow that I have often used with PC.
1. Start with a sample set of variables related to the problem domain or your field of research.
2. Do the data munging and number crunching in batch, most likely with a MapReduce job for very large data sets or a basic script in Python. A couple of points to remember:
   - PC works fine with quantitative and ordinal data.
   - When the dataset is highly dispersed or continuous, especially for a measure, use appropriate segmentation techniques or bin ranges.
3. Produce a simple CSV with all the required variables for analysis.
4. Play around with your PC and do your multivariate analysis.
5. Keep refining your datasets by adding or removing the variables relevant to the equation. Repeat from Step 2.
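As a minimal sketch of the binning and CSV steps, the snippet below maps a continuous measure to ordinal bins and writes the result as CSV text. The records, the column names (`customer`, `visits`, `spend`), and the bin edges are all made up for illustration; your own variables and cut-offs will differ.

```python
import csv
import io
from bisect import bisect_right

# Hypothetical records, as a batch job might emit them (illustrative only).
records = [
    {"customer": "a", "visits": 3, "spend": 12.50},
    {"customer": "b", "visits": 11, "spend": 240.00},
    {"customer": "c", "visits": 7, "spend": 75.25},
    {"customer": "d", "visits": 1, "spend": 999.99},
]

# Assumed bin edges and labels for the continuous "spend" measure.
SPEND_EDGES = [50, 100, 500]
SPEND_LABELS = ["<50", "50-99", "100-499", "500+"]


def spend_bin(value):
    """Map a continuous value to an ordinal bin label."""
    return SPEND_LABELS[bisect_right(SPEND_EDGES, value)]


def to_csv(rows):
    """Return CSV text with the continuous measure replaced by its bin."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["customer", "visits", "spend_bin"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "customer": row["customer"],
            "visits": row["visits"],
            "spend_bin": spend_bin(row["spend"]),
        })
    return out.getvalue()


csv_text = to_csv(records)
print(csv_text)
```

In practice you would write `csv_text` to a file and load that file into whichever PC tool you use.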
Remember, while visualizations like PC can be powerful tools, they can be counter-productive for the following reasons.
- The number of data points is limited by browser memory, and a large number slows down the user experience. The last thing you want during an engaging storytelling session is to lose your users’ attention while they stare at a blank screen. One way to overcome this issue with very large datasets is sampling.
- Large data sets result in visual clutter.
- The ordering of the axes is important for identifying relationships between variables. It is always better to have tightly correlated variables adjacent to each other.
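One possible way to pick an axis order is a simple greedy heuristic: start from the most strongly correlated pair, then keep appending the unused variable most correlated with the last axis. This is only a sketch, not the method of any particular PC tool; the function names and the sample columns are hypothetical, and it assumes numeric columns with no constant values (a constant column would cause a division by zero).

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def order_axes(columns):
    """Greedy ordering that places strongly correlated variables side by side."""
    names = list(columns)
    # Absolute correlation, so strong negative relationships count too.
    strength = {
        frozenset(pair): abs(pearson(columns[pair[0]], columns[pair[1]]))
        for pair in combinations(names, 2)
    }
    first, second = max(
        combinations(names, 2), key=lambda pair: strength[frozenset(pair)]
    )
    order = [first, second]
    remaining = [name for name in names if name not in order]
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda name: strength[frozenset((last, name))])
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical columns: one list of values per variable.
columns = {
    "price": [1, 2, 3, 4, 5],
    "noise": [2, 9, 1, 7, 3],
    "cost": [2, 4, 6, 8, 11],
    "discount": [5, 4, 3, 2, 1],
}
axis_order = order_axes(columns)
print(axis_order)
```

A greedy chain is not guaranteed to be the best possible ordering, but it is cheap and usually gives a reasonable starting point that you can then adjust by hand.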
My guideline has been that roughly 12 variables and 10,000 data points work best for a PC story.
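When a dataset exceeds that size, a simple random sample is one way to bring it back down. The sketch below assumes the rows fit in memory as a list; `sample_rows`, `MAX_POINTS`, and the seed are illustrative names and values, not part of any PC library.

```python
import random

MAX_POINTS = 10_000  # the rule-of-thumb limit mentioned above


def sample_rows(rows, limit=MAX_POINTS, seed=42):
    """Return at most `limit` rows, drawn uniformly without replacement.

    A fixed seed keeps the sample reproducible between sessions.
    """
    if len(rows) <= limit:
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, limit)


# Hypothetical large dataset: 50,000 row identifiers.
all_rows = list(range(50_000))
subset = sample_rows(all_rows)
print(len(subset))
```

Uniform sampling can thin out rare but interesting cases, so if the exceptions are what you are after, consider sampling each category separately instead.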
Some of the use case patterns where I have used PC are:
- Searching for unknown correlations between variables.
- Looking for similar behavior across data elements.
- Behavioral Analysis. Segregating different categories or clusters of similar data that exhibit a common behavior. Use these clusters to define or benchmark attributes.
- Exception Processing. This is especially useful when you want to identify conditions that correlate with a particular outcome, either at the top or bottom of the data spectrum.
- Root cause analysis.
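The exception-processing pattern amounts to brushing one axis and watching what happens on the others. A rough equivalent in code, assuming rows are dictionaries of numeric values, is to select the top slice of an outcome and compare each variable's mean in that slice with its overall mean. The function names, the variables (`load`, `latency`, `cache_hits`), and the data are hypothetical.

```python
from statistics import mean


def brush_top(rows, outcome, fraction=0.1):
    """Return the rows in the top `fraction` of `outcome` (at least one row)."""
    ranked = sorted(rows, key=lambda row: row[outcome], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def compare_means(rows, selected, variables):
    """For each variable, return (mean in the selection, mean overall)."""
    return {
        name: (
            mean(row[name] for row in selected),
            mean(row[name] for row in rows),
        )
        for name in variables
    }


# Hypothetical data: latency rises with load while cache hits fall.
rows = [
    {"load": i, "latency": i * 10, "cache_hits": 100 - i}
    for i in range(1, 21)
]
top = brush_top(rows, "latency", fraction=0.1)
summary = compare_means(rows, top, ["load", "cache_hits"])
print(summary)
```

Variables whose mean in the selection sits far from the overall mean are the ones worth inspecting on the PC itself. To look at the bottom of the spectrum, sort in ascending order instead.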
The ground rules for any visualization apply to Parallel Coordinates as well. Always remember to ask the WHY (the purpose, the context, and the message to convey), the WHO (who is going to consume your insights, and how well they can interpret them), and the HOW (how is this insight going to be consumed?).
It is extremely important to stick to these ground rules to make your visual insights effective and to reduce frustration.
Here are some good examples of Parallel Coordinates from the web.
From the master: http://web.stanford.edu/group/mmds/slides/inselberg-mmds.pdf