SQL Server Data Mining Architecture


    Data Mining with Microsoft SQL Server 2008 Review Chapter 15

    As the authors state, if you are only interested in data mining analysis you could skip this chapter. I believe this book leans toward the Microsoft technology, and therefore the goal of the book is not necessarily to make someone a better data mining analyst (generically). The book therefore logically includes this chapter on architecture, since its topics speak to making SQL Server Data Mining successful in a production environment. I follow an outline that tracks the major chapter sections:

    • Analysis Services Architecture
    • Using XML for Analysis (XMLA)
    • Data Mining Processing
    • Data Mining Predictions
    • Data Mining Administration

    Analysis Services Architecture

    Discussing Analysis Services in detail is beyond the scope of this book. I did my own review on this blog, comprising 41 chapter-by-chapter postings which break down Analysis Services into topical sections. SQL Server Data Mining shares a strong conceptual similarity with OLAP cubes: the need to summarize or aggregate source data into a more compact form. Commonly, an OLAP structure aggregates data using sums and counts (and therefore averages). OLAP allows people to take a step in the direction of parsimony (or simplicity), the goal statisticians share of modeling large datasets onto smaller mathematical set spaces. Data mining also provides structures and models, where a structure column serves as an Analysis Services attribute, and training data are stored in temporary OLAP cubes to build the resulting data mining model (page 482).

    For SQL Server Data Mining, the main message I have is that the Analysis Service engine is always required. The Microsoft Excel add-in is client software and requires an Analysis Server instance behind it. If you build an application for SQL Server Data Mining, you will need to connect to Analysis Services.
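    To make the connection requirement concrete, here is a minimal sketch of an ADOMD.NET-style connection string; the server and database names are my placeholders, not the book's:

```text
Data Source=localhost;Catalog=Chapter15;Integrated Security=SSPI
```

    Whether the client is Excel, a custom application, or Management Studio, some connection of this form ultimately reaches an Analysis Services instance.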

    Using XML for Analysis (XMLA)

    Start with my previously posted statement on XMLA. XMLA has gained industry support; it provides a protocol for discovering information or executing commands in other OLAP systems. In this case, Analysis Services is the XMLA provider, which can supply information and execute commands (including DMX and MDX). For convenience, third-party vendors provide software development kits (SDKs) for XMLA programming; otherwise, someone could use the XMLA provider’s documentation to see how to structure the XMLA.

    The authors include a Customers solution with this chapter to allow running two blocks of XMLA code. The first block of code uses XMLA to discover mining model columns.
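    For orientation, a Discover request of this kind has a standard shape. The sketch below is my own illustration, not the book's listing; the catalog and model names are placeholders. It asks for the DMSCHEMA_MINING_COLUMNS schema rowset, restricted to one mining model:

```xml
<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DMSCHEMA_MINING_COLUMNS</RequestType>
  <Restrictions>
    <RestrictionList>
      <MODEL_NAME>Customers</MODEL_NAME>
    </RestrictionList>
  </Restrictions>
  <Properties>
    <PropertyList>
      <Catalog>Chapter15</Catalog>
    </PropertyList>
  </Properties>
</Discover>
```

    The RequestType names the schema rowset to return, the Restrictions section filters the rows, and the Properties section sets context such as the database.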

    I like the idea of sending XML among computers, since people can find a way to send the text-based commands through corporate firewalls. This code produces XML output, and I decided to share the first part of this long output because one of the initial lines embeds regular expressions as the method of communicating and enforcing a restriction.

    The second block of XMLA code includes a Statement command, which could be used to send DMX or MDX.
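    As a sketch of what such a command looks like (the query text here is my own illustration, not the book's), an Execute element wraps a Statement that can carry DMX or MDX:

```xml
<Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
  <Command>
    <Statement>SELECT MODEL_NAME FROM $system.DMSCHEMA_MINING_MODELS</Statement>
  </Command>
  <Properties>
    <PropertyList>
      <Catalog>Chapter15</Catalog>
      <Format>Tabular</Format>
    </PropertyList>
  </Properties>
</Execute>
```

    Swapping the Statement text is all it takes to send a different DMX or MDX query through the same XMLA wrapper.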

    On the results, I am again seeing the regular expressions embedded in the XML.

    The authors have a graphic showing how XMLA scripting can be generated from Management Studio when connected to an Analysis Services database. I find the menu feature helpful because it creates complete XMLA commands. Functionally, I believe developers using XMLA would find ways to automate XMLA code generation, since it is easy to make coding mistakes.

    Data Mining Processing

    People who write competing data mining software products would find this section interesting. As I said earlier, this technology leverages features in Analysis Services and therefore inherits all the strengths and shortcomings of the parent product. As Microsoft improves Analysis Services, those improvements carry over to SQL Server Data Mining. Thinking about the name, it would be more appropriate to call this technology Analysis Services Data Mining, signaling that data mining can operate on cubes and that data mining processing functionally requires Analysis Services. The term in wider use is SQL Server Data Mining, which points to SQL Server as a data source and perhaps, internally at Microsoft, to the parent budget that developed this technology in the first place. If Microsoft were to develop more user-friendly desktop and SharePoint clients for data mining, we might see them launch Microsoft Office Data Mining.

    No matter what the name, the technology requires Analysis Services. Linking to this technology also explains why it can readily handle nested cases. Many competing products require flattened (denormalized) tables as input. I repeat a comment I often make: as a SAS analyst, most of my experience has been in the denormalized world (as compared to the transactional, normalized world). With Microsoft Data Mining, I can be more normal. The diagram on page 485 includes a product mention of VISA (did they pay for that product placement?) and shows how you can use data mining to analyze your nested VISA transactions at a grocery store. Coke also got mentioned, but data mining can analyze Pepsi too.

    Data Mining Predictions

    This section starts with a description of how the data mining algorithms see data. It is important to know that the algorithms already see a tokenized version of the information. Therefore, during the ETL process, people do NOT need to recode values (for example, “False” = 0 and “True” = 1) as a necessary precondition, because this technology already does it. I believe tokenization comes easily because of Analysis Services, which already tokenizes attributes that access values in cubes. I mention this point because some books on preparing data will talk about changing categorical fields to numeric constants as a necessary precondition for data mining. That step is not required in this technology.

    Enumerated constants come up as a topic in this section. The authors state that enumerated constants are not as efficient as named parameters for making queries. The reason is the conversion process between original data values and their tokenized replacements (which must be converted back at the end).
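    To illustrate, here is a hedged sketch of a singleton prediction query using named parameters; the model and column names are hypothetical, not from the book. Client code such as ADOMD.NET supplies the @ values at execution time, and the server handles the token conversion:

```sql
SELECT Predict([Bike Buyer]) AS [Prediction]
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT @Age AS [Age],
        @Gender AS [Gender]) AS [t]
```

    Because the parameter values arrive typed rather than embedded as literal constants in the query text, the server avoids repeated parsing and value conversion.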

    Data Mining Administration

    This section covers security and server parameters. The server parameters apply to the entire instance, not just one database, and can be set through the Properties dialog in Management Studio. Security is granted per database and, as a best practice, should be managed through roles. Yes, you can add individuals, but using roles simplifies security initialization and maintenance for enterprises. To create models, an individual needs administrative permissions at the database level. You can create accounts with such administrative abilities and impersonate that user through an application (see the discussion on page 492). I believe the security options align well with Windows Integrated Security, and though the book offers many helpful suggestions, feel free to read about this topic elsewhere, since tips on configuring Windows security also apply here.
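    As a sketch of the role-based approach (the database ID, role name, and member account below are my placeholders), an XMLA Create command can define a database role; permissions for that role are then granted separately:

```xml
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>Chapter15</DatabaseID>
  </ParentObject>
  <ObjectDefinition>
    <Role>
      <ID>MiningAnalysts</ID>
      <Name>MiningAnalysts</Name>
      <Members>
        <Member>
          <Name>DOMAIN\analyst</Name>
        </Member>
      </Members>
    </Role>
  </ObjectDefinition>
</Create>
```

    Adding or removing a Windows account then means editing the role's membership, not touching each database object's permissions.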

    I must mention the authors’ citation of the Data Mining Moratorium Act of 2003 (page 489). This political bill was introduced in January 2003 and would have applied only to the American Department of Defense and Department of Homeland Security. The book authors read this bill as reflecting mistrust toward technology. I view the issue differently, since the bill did not apply just to data mining but to any data access through any technology (put into Microsoft programming terms, any results from a T-SQL query, an MDX query, or a DMX query). The term data mining, as I read the bill, has larger implications for the privacy of information, particularly in light of other Federal legislation, notably the Freedom of Information Act (FOIA). When the government conducts such activities, FOIA may make the contents or the process of the investigations available to the general public.

    I have two props that I often take to my data mining presentations: a costume miner’s hat and a rock hammer. The props symbolize data mining, and I have many analogies to draw with real mines and the stories we know, everything from the California Gold Rush to diamond mining in Africa. People wear helmets because they can get hurt, and I am not beyond believing that data miners might have personal liability for connecting information, perhaps with ill intent. The rock hammer I carry shows the damage we can cause through sharing data. Overall, data sharing is part of a larger discussion of information sharing, and as experienced politicians know, holding information helps build power. Releasing information does not necessarily reduce power, but allows it to flow to another location. In more recent years, people have been more concerned about data mining by Facebook and Google.

    My point is that when we talk about Data Mining Administration, the topic will likely include overall organizational policies on who should have access to data and what they can do with any resulting analysis.

    MacLennan, J., Tang, Z., & Crivat, B. (2009). Data Mining with Microsoft SQL Server 2008. Indianapolis, IN: Wiley Publishing Inc.

    ISBN: 978-0-470-27774-4
