Using SQL Server Data Mining

Be Sociable, Share!

    Data Mining with Microsoft SQL Server 2008 Review Chapter 4

    This chapter covers a complete look at how to develop data mining structures and data mining models using Business Intelligence Development Studio (BIDS). The authors’ outline Pages 127-128):

    • Using BIDS
    • Understanding immediate mode and offline mode
    • Creating and modifying data sources, data source view, and data mining objects
    • Exploring data and evaluating models

    Let’s start with a fact: SQL Server Data Mining is a technology bound to SQL Server generally and Analysis Services specifically. This technology is neither a desktop nor a web application. It was created to be part of desktop and web applications, and was intended to allow people to modify and extend using XMLA and DMX. BIDS (an implementation of the extended Visual Studio) is the best way to create the DDL required for production (and enterprise level) data mining, and this chapter shows how to do that.

    Part of the discussion involves mapping this technology onto other commercial or open-source data mining software alternatives, which are either desktop applications or web applications or both. I return to the fact that SQL Server Data Mining already comes with SQL Server (standard or enterprise editions) and therefore does not require additional licensing. That compelling fact should make people look at this technology for production uses.

    Immediate mode refers to interactively creating data mining structures and data mining models. A data mining structure is one or more data mining models, and defines the attributes which could be used in any of its models. I think of a structure like a Windows folder. Immediate mode only allows interactive access on one database, where databases together comprise an Analysis Server instance (multiple instances can be physically housed on a single server, and that commonly happens, but sometimes server instances could be physically housed on multiple servers too). Immediate mode is synchronous, and therefore more similar to working on a data mining desktop or web application.

    By contrast, offline mode allows asynchronous access to an entire Analysis Services instance. I view offline mode as more consistent with Microsoft BI development (not just for Analysis Services, but also Integration Services or Reporting Services) and more typical of how developers use Visual Studio. Development this way proceeds through intentional design stages, and the book talks about the process. If permissions are set (from the Analysis Services database through SSMS) then someone can switch from immediate mode to offline mode and back (page 135).

    The book comes with a good sample project called Movie Click. My apologies to my friends around the world since I believe most all these movies are American movies. However, just consider the project an illustration. I believe the bulk of this rich chapter does an excellent job in systematically talking about how to work with BIDS and data mining using this sample Movie Click project. However, I believe the explanation would be better if you could see a video and not just photographs, since we are talking about a dynamic process which extends beyond the static photographs. I don’t have time today to make this movie, but watch for that posting (should be before the end of this week).

    I am running another book review on SSAS technology, and I would like people to see what I am saying about data sources and data source views, because the same insights apply for data mining. But wait — if you have been reading this blog you realize I have not talked about those topics yet. And that’s because they are scheduled for the future. Still, you can read them now:

    My main point for now is about what Microsoft calls the Unified Dimensional Model. The intention is to not just draw from relational (and non-relational, like flat table) data sources. The data source view can align (or “model”) newly defined dimensions (the “dimensional” part) against multiple data sources (the “unified” part). Essentially, as I argue, the data source view allows a way to denormalize normalized data. And yet, the goal is not necessarily total denormalization since unlike competing data mining products, SQL Server Data Mining can work against sources with nested tables. The Unified Dimensional Model allows source data to either be natively housed in Analysis Services (MOLAP) or directly connected to the source (ROLAP). Having a direct live connection to an enterprise-level relational source is an option possible with this technology.

    I will end with a final comment: if someone is insistent in working in immediate mode, then I would instead recommend a local installation of SQL Server on a 64-bit machine (which therefore could break through the 4 GB RAM limitation of 32 bit machines). You can afford this machine with the money you save from having had to pay software purchase or licensing fees from competitive projects. And you can even afford to get a good large screen or two. If you are shopping now for a new machine, you can find lots of good alternatives to allow this power user setup to work, and you need only to run Windows 7 64-bit (not the server operating system). If you have a group, then get them all similar machines too. I know many people repeat the “client-server” catchphrase, but as interpreted, many people assume “client” means multiple. I’m saying “server” can mean multiple too. Instead of worrying about switching from immediate to offline mode on a single production server, I’m saying that full-time data mining analysts and developers are justified in having a machine of this power level individually. Having a machine with 8 GB or 16 GB or even 32 GB RAM allows for more powerful analytic projects.

    MacLennan, J., Tang, Z., & Crivat, B. (2009). Data Mining with Microsoft SQL Server 2008. Indianapolis, IN: Wiley Publishing Inc.

    ISBN: 978-0-470-27774-4

    Be Sociable, Share!