Tableau Software produces first-rate visualization software. Their products are among the many open-source and commercial visualization products (and services) listed by KDnuggets: http://www.kdnuggets.com/software/visualization.html.
As regular readers may remember, in February 2011 I posted on DevExpress and its marketing of the phrase “data mining” for its visualization software. That post led to a productive exchange with one of their product experts, who conceded the point. Further, as you may remember from that post, my intention was to challenge commercial vendors to put machine learning algorithms even in the “View” layer (of the MVC design pattern).
Today, I report on heavy marketing by Tableau Software associating itself with “data mining” largely absent of machine learning (along the way, we also catch similar Microsoft marketing for PowerPivot). Both Tableau Software and Microsoft produce visualizations with trend lines, which arguably might be calculated regressions. However, trend lines alone do not encompass the rich science behind machine learning algorithms, even those available in SQL Server Data Mining since 2005. That difference presents a competitive opportunity for visualization vendors.
Visualization alone is not data mining. If visualization were data mining, then Excel 2010 alone, with all its fancy built-in graphs, would be considered “data mining” (but read on, since Excel 2010 does do nifty linear regression visualization, and Tableau Software has nice trend lines too). Under such a loose definition of “data mining,” every spreadsheet going back to my earlier favorites, Lotus 1-2-3 and VisiCalc, would be “data mining” software. I liked Lotus 1-2-3 graphs, and seeing how they changed along with source data. But stopping at VisiCalc circa 1983 does NOT promote the incredible machine learning science developed since then. C-level executives and venture capitalists looking to invest in the next big “data mining” systems should not be paying for 1985 technology. It’s 2011; invest your money more wisely.
In this blog post:
- A demonstration of how Tableau Software is marketing their “data mining” visualizations
- An example of how someone used Tableau Software to connect to SQL Server Data Mining
- A challenge to visualization entrepreneurs to incorporate machine learning into their software
- My own gasoline data example discussing how to see the known and unknown
I have a variety of people reading this blog post, including:
- Analysts who use data mining to produce models
- C-level executives and venture capitalists wanting to know what to look for in visual analytics software
- Visualization developers looking for that next competitive edge in the growing business intelligence industry
You might belong to more than one of these groups, but hopefully my comments will help you explain this “data mining” issue to the others.
I was tipped off on this point by promotional links I saw from bing.com:
From this ad, we see Tableau Software marketing “Easy Data Mining Software”. To be fair, someone from Microsoft also decided that PowerPivot is a “Free Data Mining Tool”. You may have missed the many videos I have uploaded to YouTube showing how to combine PowerPivot with the actual data mining power of SQL Server Analysis Services. So, yes, Microsoft should know better. If PowerPivot qualifies as “data mining”, then we could argue that the pivot table feature (connecting Excel to SQL Server Analysis Services) is also “data mining”, but why stop there? Go further and call Excel graphing “data mining” too.
I would have written a separate post arguing that “PowerPivot alone is not data mining,” but when I look at Microsoft’s website, I do not see that claim on the Excel product page or on the PowerPivot page. Perhaps some Microsoft people reading this blog post will be surprised that PowerPivot is being marketed as “data mining”, so my evidence is worth considering when they have SQL Server Analysis Services to sell.
By contrast, Tableau Software uses “data mining” as a key marketing phrase:
I like my reference URL because it links to the specific ad I saw, which will now get more “hits” on Tableau’s back-end.
I took a picture of their demo video for Tableau Desktop, which described a process of discovering outliers on your own (meaning that the user provides the “data mining”).
In this case, the color gradient from red to green applies to the gross profit sum, and we see a total of four dimensions in this visualization. The red circles represent data points manually selected for further analysis. I like the ability to choose data points and put them into another visualization table or graph. However, using only four dimensions for categorization falls mathematically short of the 2^31 theoretical limit or the 64K practical limit that Microsoft Data Mining supports. You could get lucky manually navigating toward the dimensions that make the most mathematical difference, or you could use a machine learning clustering algorithm like Microsoft Clustering to determine those critical dimensions for you automatically:
- Wouldn’t you like to have your visualization software not just describe outliers, but also recommend dimensional combinations to try?
- Do you trust the analysts in your organization to find critical dimensions through manual navigation alone?
- Would your subject matter experts be better empowered by leveraging technology on top of their industry experience?
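To make the contrast concrete, here is a minimal sketch of what a clustering algorithm automates: grouping rows so that the records which behave differently separate on their own, with no manual navigation. This is plain k-means in pure Python, not Microsoft Clustering, and the sample (sales, gross profit) rows are invented for illustration.

```python
# A minimal k-means sketch (NOT Microsoft Clustering); sample data is invented.

def kmeans(points, k, iterations=20):
    """Cluster points (lists of floats) into k groups; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]  # seed with the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical (sales, gross profit) rows: most are profitable, two are not.
rows = [[100, 20], [110, 25], [95, 18], [105, 22], [300, -40], [310, -35]]
labels, centers = kmeans(rows, k=2)
print(labels)  # the two unprofitable outliers land in their own cluster
```

Nobody had to choose which dimension mattered; the algorithm separated the losing rows by itself. That is the kind of recommendation a visualization product could surface automatically.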
Again, I believe that Tableau Software produces excellent software, and you can read that their products often receive awards from those who know the difference. You should download their trial software and see what it does for yourself. Given my strong belief in their technology, I want to showcase an example of how to match high-quality visualization with probabilistic data. I move now to how someone used Tableau Software to visually display data mining projections from SQL Server Analysis Services.
Blogger Richard Christopher posted a sample in February 2011 showing how he put Tableau Software in front of Microsoft Time Series (one of several machine learning algorithms available in SQL Server Data Mining). He changed the color of the predicted values to distinguish them from the historical data. I believe that Time Series is one of the most frequently used, if not the most frequently used, data mining algorithms. Thus, this example is pertinent to many people.
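Microsoft Time Series uses far richer models than what follows, but even a minimal least-squares trend extrapolation makes the same point Christopher’s coloring makes: the forecast values exist nowhere in the source data. The monthly figures below are invented for illustration.

```python
# Hypothetical monthly sales history; the forecast rows do NOT exist in the data.
history = [120, 132, 128, 141, 150, 158, 155, 167]

n = len(history)
xs = list(range(n))
x_mean = sum(xs) / n
y_mean = sum(history) / n

# Ordinary least-squares slope and intercept for the trend.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history)) / \
        sum((x - x_mean) ** 2 for x in xs)
intercept = y_mean - slope * x_mean

# Predict three months beyond the data -- pure "science fiction" until realized.
forecast = [intercept + slope * x for x in range(n, n + 3)]
for month, value in zip(range(n, n + 3), forecast):
    print(f"month {month}: predicted {value:.1f}")
```

A visualization layer can then render `history` in one color and `forecast` in another, exactly the distinction Christopher drew.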
I will make another comparison about what makes data mining different from mere drill-down. Actual drill-down and partitioning will only reveal combinations, subsets, and filters which already exist. As in Christopher’s time series example, the predicted values from a time series do not exist. Predicted values do not live in a data warehouse or in an OLAP cube or a PowerPivot table until someone puts them there. Predicted values amount to science fiction. And, as I often say, science needs science fiction.
While SQL Server Data Mining can produce both descriptive and predictive analytics, the more compelling examples are predictive. Microsoft Data Mining can directly mine from an OLAP cube and provide insight into the values (measures) for dimension combinations which do NOT exist in the source data. From a data warehouse, Microsoft Data Mining provides that same insight, not only peering into the space which exists but that which does not.
Christopher showed how to color the difference between actual data and predicted data. Now, put on your venture capitalist hat and ask yourself which type of “data mining” visualization software you would put millions of dollars into. You should be paying for visualizations which not only show combinations which exist, but probabilistically display those combinations and future values which do NOT exist. In other words, you should be paying for not only points which exist but also those which do not. Don’t just pay for some of it; invest in all of it.
If you think you are paying for “data mining” software, and all you get is what already exists, then I think you should ask for a refund. What exists provides the scientific evidence base for probabilistically knowing what does not exist. Yes, we should have excellent visualizations for what we do know, but the known can provide insight into the unknown. This argument therefore comprises the challenge for entrepreneurs to advance visual analytics for data mining to the next level. Let’s not just “take the ride” of “data mining” marketing and assume that looking at the past is enough. People who live in machine learning are not just interested in the past but what impact the past might have on the future (a sentence which essentially describes Bayesian philosophy).
Gasoline Data Visualization
To illustrate my point, I will work through an example using Excel 2010 and the free data mining add-in Microsoft provides (32-bit, written for Excel 2007 and SQL Server 2008, though I am using it with Excel 2010 and SQL Server 2008 R2).
My problem today is that gas prices have increased in the United States. Now, my world readers may not have much sympathy, because many countries pay much more for gasoline. However, many other countries do not have the rich oil reserves that the United States has, so I believe those prices should be adjusted for the available oil supply (as well as the oil technology). In this example, I have two columns: one for miles driven (sorry, world: the United States has not converted to metric, which at least this scientist prefers), and a second for the cost of gasoline (assuming about US$4.00 per gallon and 25 miles per gallon mileage).
Now I can use the “data mining” visualizations in Excel to produce a scatter plot:
Note that this scatter plot will NOT deliver the cost for 5 or 15 or 25 miles because data do NOT exist for that combination.
Today, I learned that Excel 2010 has a cool function overlay showing the linear relationship, which you access through the choice box:
Linear regression is one of the algorithms available in SQL Server Data Mining, so I would argue that this feature does qualify. Pressing this single button, my graph changes to the following:
The graph represents what Excel produces, except that I added the red arrow and the red text with SnagIt (which is also not “data mining” software, even though the words “linear regression” came out of it). People have seen such graphs so many times that it might not be obvious that the line does not represent actual fact, but science fiction. In fact, the line might not pass through any of the actual data. However, the projection gives an indication of what might happen if someone were to drive 15 or 18 or 22.6 miles and wanted to estimate their costs. As mentioned earlier, some of Tableau Software’s visualizations include trend lines, but I did not see where they displayed an equation as Excel does.
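The trend line Excel draws is an ordinary least-squares fit. My actual spreadsheet figures are not reproduced here, so the three (miles, cost) points below are reconstructed from the stated assumptions of US$4.00 per gallon at 25 miles per gallon, i.e. about $0.16 per mile; treat this as an illustrative sketch rather than the exact worksheet.

```python
# Reconstructed sample points from the post's assumptions ($4.00/gal, 25 mpg).
miles = [10.0, 20.0, 30.0]
cost = [m * 4.00 / 25.0 for m in miles]  # $1.60, $3.20, $4.80

n = len(miles)
x_mean = sum(miles) / n
y_mean = sum(cost) / n

# Ordinary least squares -- the same fit behind Excel's linear trend line.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(miles, cost)) / \
        sum((x - x_mean) ** 2 for x in miles)
intercept = y_mean - slope * x_mean
print(f"y = {slope:.2f}x + {intercept:.2f}")  # the equation Excel can display

# The fitted line answers questions the scatter plot cannot:
# estimated costs for mileages that appear nowhere in the data.
for m in (15.0, 18.0, 22.6):
    print(f"{m} miles -> estimated ${slope * m + intercept:.2f}")
```

None of the three estimated points exists in the source data; each is a projection from the fitted line.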
What I like about this regression line from Excel is that it runs through the range of the data and does not project to low values such as 5 miles or high values such as 36 miles. Operationally, people use data mining to predict outside the original attribute (dimensional) space, but predictions are more robust when they fall within the range of known data values.
Now, imagine that we were talking about an OLAP cube or PowerPivot. My simple cube from this data has one dimension (miles driven) with three distinct values (and Microsoft SQL Server Analysis Services adds the value “missing”). When someone queries Analysis Services for values which do not exist, they get the “missing” result. In this case, there were no “missing” entries, and therefore the result for querying “missing” is null.
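The cube behavior just described can be sketched as a toy lookup: the one-dimension cube only holds measures for the mileage values that exist, and any other query falls through to the “missing” member, which in this case holds nothing. The member values are illustrative, not my actual cube.

```python
# Toy one-dimension "cube": measures keyed by miles driven, plus a
# "missing" member that holds nothing because no source rows were missing.
cube = {10.0: 1.60, 20.0: 3.20, 30.0: 4.80, "missing": None}

def query(miles):
    """Return the stored measure, or the 'missing' member's value for absent keys."""
    return cube.get(miles, cube["missing"])

print(query(20.0))   # an existing member returns its measure
print(query(15.0))   # falls through to "missing", which is null here
```

No amount of querying produces a value for 15 miles; the cube simply has nothing there, which is exactly why prediction requires something beyond drill-down.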
Here is my data in PowerPivot:
Here is a PowerPivot table, and as before, there is no data for 15 or 26 miles.
PowerPivot also makes charts, but again no data for the 15 or 26 mile estimates:
Now, someone might say: well, MarkTab, you can look at the PowerPivot chart and imagine what the output value would be for 12 or 24 miles. My response is that if we are relying on my imagination, then we should make a more direct claim: “data mining is all in your head”. Why not draw that conclusion? After all, I do not see those results on the PowerPivot chart.
Why use “data mining” in software advertisements?
- Possibly people want to leverage the interest in “data mining” and “predictive analytics”
- The phrase “data mining” does not have a universally negative connotation (but it can under certain conditions)
- Perhaps some people are literally applying the metaphor of someone going into a physical mine in the ground and discovering something new
Consider the data mining analogy of actually going into a mine. I am claiming in this post that “data mining” is more than going in and discovering patterns among existing data (descriptive analysis, which machine learning can assist and produce). I am saying that by studying data from the 1,000,000 known mines, we can have a probabilistic view into a brand-new mine, without even using a shovel. Ask someone in the oil exploration business: that is the type of research and analysis scientists have been producing for decades. Instead of drilling a large well outright, scientists infer where oil may be based on other known, visible features. We gain insight into the unknown from what is known.
In this post, I hope I have explained the vision of what “data mining” can produce. I am not primarily interested in chasing every marketing claim from highly respected vendors. I am more interested in what many people believe is a major barrier to using actual data mining: developing sophisticated and savvy viewers so that people can make decisions based on probabilistic information.
If you are a venture capitalist or an entrepreneur working in this space, you should be funding more than someone’s imagination. You should be paying for actual visualization and depiction of values which may not exist in any database. If you want to do some thorough market research, take a look at KD Nuggets’ list of open-source and commercial visualization products, and make sure that when you market your products, that you show organizations what makes you different from what is already on the market:
- Why is your software so much better than what native Excel and PowerPivot (both excellent products which people should use) can produce?
- Can you communicate probabilities for unknown subsets and never-before-seen filtering combinations?
- Finally, how do people enjoy using your software?
I am hoping a new generation of developers trained in graphical arts and media technologies will leverage their skills toward this emerging art and science of visual analytics. Server-side analytics (such as SQL Server Data Mining) can complement advanced visualization. And, as I said at the outset, don’t count out the view layer having some machine learning tricks up its sleeve too.