Microsoft Program Manager Eric Ligman has been posting lists of free eBooks on his blog. I have not mentioned these books in past blog postings, so this posting is a catch-up on the three lists he has already posted and subsequently summarized. I will be commenting on the free books that help with Microsoft data analytics and data science.
Continue reading “Free eBooks from Microsoft” »
Data Mining with Microsoft SQL Server 2008 Review Chapter 15
As the authors state, if you are only interested in data mining analysis you could skip this chapter. I believe this book leans toward the Microsoft technology, so its goal is not necessarily to make someone a better data mining analyst in the generic sense. It logically includes this chapter on architecture because the topics help make SQL Server Data Mining successful in a production environment. My outline tracks the major chapter sections:
- Analysis Services Architecture
- Using XML for Analysis (XMLA)
- Data Mining Processing
- Data Mining Predictions
- Data Mining Administration
Analysis Services Architecture
Discussing Analysis Services in detail is beyond the scope of this book. I did my own review on this blog, consisting of 41 chapter-by-chapter postings which break down Analysis Services into topical sections. SQL Server Data Mining shares a strong conceptual similarity with OLAP Cubes: Continue reading “SQL Server Data Mining Architecture” »
Data Mining with Microsoft SQL Server 2008 Review Chapter 12
Neural networks continue to fascinate people because of their history of trying to model the brain. The longer history ties to the Artificial Intelligence community, where people continue serious work on mimicking intelligence in machines. Some say the effort has died, and perhaps modeling a machine to one user is not the end goal for everyone. Rather, I believe the Internet itself is a form of collective intelligence, and therefore the ideas may transcend a single processor, user, or application.
Jeff Hawkins, developer of the Palm Pilot, has been personally interested in modeling the brain, as he describes in his book On Intelligence. In this book, Hawkins describes the brain as a six-layered network of neurons, and he provides some speculation on how the layers interact with one another. I use the word speculation intentionally because the scientific method requires science fiction. Speculations and hypotheses are essentially science fiction, and researchers test their fictional but plausible ideas in an attempt to learn new scientific facts. Continue reading “Microsoft Neural Network and Logistic Regression” »
Data Mining with Microsoft SQL Server 2008 Review Chapter 11
This chapter talks about the association machine learning algorithm. The term “market basket analysis” is often used to characterize a common application, where a retailer may physically arrange items which are often purchased together (as in general merchandiser Walmart), may provide recommendations for similar products while a customer shops (as in online merchandiser Amazon), or may provide coupons for future purchases (as in supermarket retailer Kroger). The application can cover not just products but also a combination of products and services (as with a health care facility or an automobile repair shop).
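To make the market basket idea concrete, here is a minimal Python sketch of the two measures usually attached to an association rule: support (how often the items appear together) and confidence (how often the right-hand item appears given the left-hand item). The baskets, the thresholds, and the function name `pair_rules` are all made up for illustration. This is a generic pairwise calculation, not the Microsoft Association Rules implementation and not the authors' DMX code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for item pairs passing both thresholds."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

A retailer would read a rule such as butter -> bread as a candidate for shelf placement, a recommendation, or a coupon, which are the three applications described above.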
The outline for this blog post:
- Recap of the Authors’ Solution Example
- The Authors’ DMX Code
Recap of the Authors’ Solution Example
The authors provided an example using the Movie Click data using BIDS (Business Intelligence Development Studio). The intention of the mining structure and single mining model is to completely represent the demographic and output movie selection data. A possible use for this solution would be to make live recommendations to movie shoppers based on their demographics and based on movies (as they select them with their shopping cart).
Continue reading “Microsoft Association Rules” »
Data Mining with Microsoft SQL Server 2008 Review Chapter 10
This chapter pays homage to Andrey Markov, the 19th century Russian mathematician who proposed what we today call Markov chains (page 334). As the title suggests, this machine learning algorithm combines sequence analysis with clustering. The outline for this blog post:
- Revisiting Natural Groups
- Recap of the Authors’ Web Click Solution
- Further Discussion on Web Click Analysis
- Authors’ DMX Code
- Expanding the Applications for Microsoft Sequence Clustering
- Persons Produce Wisdom
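Before going through the outline, a minimal Python sketch of the Markov chain idea behind the sequencing half of the algorithm may help. It estimates first-order transition probabilities (the chance of moving from one page to the next) from click sequences. The sessions and the function name `transition_probabilities` are made up for illustration, and the sketch covers only the Markov chain part. It does not perform the clustering step, and it is not the Microsoft Sequence Clustering implementation.

```python
from collections import Counter, defaultdict

# Hypothetical web click sessions, invented purely for illustration.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "weather"],
    ["home", "sports", "news"],
    ["news", "sports"],
]


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from observed sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each state with the state that immediately follows it.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    # Normalize each row of counts so its probabilities sum to 1.
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }


probs = transition_probabilities(sessions)
for state, nexts in probs.items():
    for nxt, p in sorted(nexts.items()):
        print(f"{state} -> {nxt}: {p:.2f}")
```

Only the current state is used to predict the next one, which is the defining assumption of a first-order Markov chain.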
Revisiting Natural Groups
As with the last chapter, this chapter also refers to natural groups, “The number of natural groups in a sequence clustering model” (page 339), and an example on page 335 explicitly mentions genetic sequencing. Data mining is an applied science, and therefore I recommend avoiding the word natural to describe grouping, since this word refers to a specific presupposed world view often attached to genetic and biological sciences. Continue reading “Microsoft Sequence Clustering” »
Data Mining with Microsoft SQL Server 2008 Review Chapter 8
I have commented several times that time series was an entire class when I was in graduate school. It was an appropriate topic for that stage (either graduate school or late in an undergraduate program) because calculus is required to communicate the mathematics. If I had to bet on a single data mining algorithm used across all situations and companies and countries and industries, this one would be it. For the 2008 version, Microsoft has made good improvements to this algorithm, allowing analysts to tune parameters depending on the situation. Among all the available Microsoft data mining algorithms, I believe the parameter choices affect results for this algorithm the most, and therefore might justify building multiple models for comparison (since only empirical results can demonstrate which settings work best).
Time series was a big topic for W. Edwards Deming. He used this subject to demonstrate what variance is, and whether a system was in control. Continue reading “Microsoft Time Series Algorithm” »
Data Mining with Microsoft SQL Server 2008 Review Chapter 7
Decision Trees is one of the most useful algorithms. This algorithm conceptually extends modeling into a tree of nested models where each branch provides tailored understanding of the training data. This blog posting will track the DMX code which substantially provides the discussion framework for the chapter. You can get this code for free from the authors’ (actually the publisher’s) website, but if you want to be a data mining professional you should also have the book. This same single algorithm encompasses both Microsoft Decision Trees and Microsoft Linear Regression.
The sample DMX code refers to the ASSprocs stored procedure. That code is available from http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742,descCd-DOWNLOAD.html. While I was looking for the code, I discovered that this book is available from Wiley in e-Book format (see the previous link), and optionally you can see the eBook as part of Safari Books Online (subscription service): http://my.safaribooksonline.com/9780470277744.
Continue reading “Microsoft Decision Trees Algorithm” »
Data Mining with Microsoft SQL Server 2008 Review Chapter 6
The book now goes into a series of chapters, six through twelve, that take an in-depth look at the individual algorithms. I will repeat a comment from earlier in this series: this book was authored by the technology gurus who developed this software. The text supplements and extends what is free through MSDN Product Documentation (separately downloadable as SQL Server Books Online). The book has two important features:
- Detailed how-to tutorials and instructions of how to use the technology
- Behind-the-scenes technical tips which, though authoritative, cannot and should not be in the product documentation because Microsoft wants to promise functionality not implementation. In other words, how a product is implemented may change, though the functions should be consistent with the Microsoft documentation.
Now, let’s talk about the use of Microsoft in the chapter title (this chapter and subsequent chapters) to describe the algorithms. The Naïve Bayes machine learning algorithm is well known in the literature. Microsoft has made tweaks, ranging from minor to major, to each algorithm, allowing them to rightfully claim the implementation as theirs. I do not have personal knowledge of whether these changes amount to a patent level of unique creation, but they are certainly enough to qualify for a copyright. Later, chapter 17 will talk about extending this technology and developing your own algorithms. Thus, it’s fair for Microsoft to sign their name on their algorithms, and that name persists through the data mining wizards and interfaces. Some future third-party developers might choose to make their own implementation of these same algorithms, and add their own names. If you choose to make one, I encourage you to share it, or at least a free version of it, on the open-source community site codeplex.com.
Continue reading “Microsoft Naïve Bayes” »
Microsoft SQL Server 2008 Analysis Services Unleashed Book Review Chapter 14
This chapter shows how to extend the native commands within Analysis Services using either managed code assemblies or COM assemblies. I will assume knowledge of assembly creation with either COM or .NET languages, since that assumption follows how this chapter is presented. I realize that this assumption will leave out some people from understanding this chapter.
As has been true for years, COM assemblies are (as a rule) less secure than managed assemblies, and therefore the wisdom is to rewrite any COM assemblies in .NET. I concede that there are still COM developers who can write effective code, but going forward, I recommend using one of the many .NET languages to write any code. COM support is turned off by default (page 245) as an extra security precaution.
Continue reading “Extending MDX with Stored Procedures” »
Data Mining with Microsoft SQL Server 2008 Book Review Chapter 3
DMX stands for Data Mining Extensions, though it was originally called OLE DB for Data Mining, a name from the pre-.NET days. The book recalls how people on the data mining product team used “guerilla tactics” to encourage Microsoft’s official SQL marketing department to use their preferred acronym. I guess marketing had to have something to do. I’m wondering if someone has a study proving whether DMX is a better marketing phrase than OLE DB for Data Mining. I know for sure I can track how many people will read this article based on my decision to review this book and its technology.
The chapter starts from the premise that “the field is relatively immature”, meaning that “there are no standard concepts of mining models, training or predictions”. I have believed that there is a difference between the research-based approach that I term machine learning and the applied science I call data mining (I would even use the phrase data mining engineering to be consistent with other engineering disciplines). When the authors talk about the immaturity, they refer to the potential capacity of business intelligence professionals, software architects, and database developers to integrate this technology across industries and even across software vendors. By contrast, I have deep respect for the decades of proven mathematical research behind the machine learning algorithms, some of which have become time-tested, production-level, essential elements of some of the world’s most sophisticated systems.
Continue reading “Data Mining Concepts and DMX” »