PMML 2.1, XML Notepad 2007 and Contoso Retail 2.1


    PMML (Predictive Model Markup Language) promises to provide a way to share data mining models in XML. The standard is published by the Data Mining Group, and the most recent PMML version is currently 4.0, released in June 2009. SQL Server Analysis Services (as of SQL Server 2008 R2) supports PMML only through version 2.1, which was released in March 2003. My opinion is that Microsoft needs to keep current with PMML to make this data mining technology a viable option.

    I decided that it was time to investigate this PMML topic, and this blog post shares my observations. Even PMML 2.1 support is limited in SQL Server 2008 R2 Analysis Services: SQL Server Data Mining does not provide PMML model creation for most of its algorithms. The following table has clickable links to the MSDN Documentation.

    PMML 2.1 Model Creation support in SQL Server 2008 R2 Analysis Services
    Does Support      Does Not Support
    Clustering        Association
    Decision Trees    Linear Regression
                      Logistic Regression
                      Naive Bayes (MSDN Documentation may claim that there is support, but this claim is not true)
                      Neural Network
                      Sequence Clustering
                      Time Series

    Looking at the Wikipedia entry for PMML, I can see that R supports PMML through version 3.2. The same page says that SAS Enterprise Miner supports versions 2.1 and 3.1, and that Oracle supports 3.1. Some packages offer complete PMML support through version 4.0. Wikipedia (as of November 10, 2010) does not list Naive Bayes among the algorithms for which SQL Server Analysis Services supports PMML. That omission is correct, even though the MSDN Documentation claims that there is Naive Bayes support.

    XML Notepad 2007

    This free Windows-based open-source XML editor provides a good way to look at PMML. I will be using this software in this blog post. You can obtain a copy by clicking this link.

    Contoso Retail 2.1

    Contoso Retail is a sample SQL Server dataset produced by Microsoft’s Office team to showcase the Business Intelligence and Data Warehousing features of their software. The sample data includes some data mining models that I will be mentioning in this blog post. You can obtain a copy by clicking this link.

    Contoso Retail Mining Models and PMML 2.1

    [Screenshot dm1]

    Contoso Retail contains four mining models, as I highlighted in yellow. So I asked myself what I can learn about the PMML behind these models. I start with the most straightforward case, the decision tree. Never mind that the word “Forecast” is misspelled in Contoso Retail.

    SELECT model_pmml FROM [Promotion Forcast Decision Tree].pmml

    The output from SSMS (SQL Server Management Studio) shows the PMML:

    [Screenshot dm2]

    If you “copy” from SSMS, you can then paste into XML Notepad 2007:

    [Screenshots dm4 and dm5]
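
    If you would rather inspect the pasted PMML with a script than by eye, here is a minimal Python sketch. The sample document is hypothetical and heavily trimmed; it is not the actual output for the Contoso Retail model, and the helper names (local_name, summarize_pmml) are my own. It reads the version attribute on the root element, the field names in the data dictionary, and the top-level model elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML 2.1 document for illustration only.
# It is NOT the real output for [Promotion Forcast Decision Tree].
SAMPLE_PMML = """<PMML version="2.1">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="Age" optype="continuous"/>
    <DataField name="Promotion" optype="categorical"/>
  </DataDictionary>
  <TreeModel modelName="Example Tree" functionName="classification">
    <MiningSchema>
      <MiningField name="Age"/>
      <MiningField name="Promotion" usageType="predicted"/>
    </MiningSchema>
    <Node score="None"><True/></Node>
  </TreeModel>
</PMML>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}TreeModel' if present."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(pmml_text):
    """Return the PMML version, data field names, and top-level model tags."""
    root = ET.fromstring(pmml_text)
    fields = [
        el.get("name") for el in root.iter() if local_name(el.tag) == "DataField"
    ]
    # Model elements are direct children of the root whose tag ends in "Model".
    models = [
        local_name(child.tag)
        for child in root
        if local_name(child.tag).endswith("Model")
    ]
    return {"version": root.get("version"), "fields": fields, "models": models}


if __name__ == "__main__":
    print(summarize_pmml(SAMPLE_PMML))
```

    To try it on real output, replace SAMPLE_PMML with the text copied from SSMS.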

    So moving on to the Clustering model, I then issued this DMX query:

    SELECT model_pmml FROM [Cluster Customer].pmml

    In SSMS, the grid output does not show anything. I suspect that some nonprinting characters were inserted at the beginning. Whether the cause lies in the data mining technology or in the SSMS interface, the net result is that something is missing.

    [Screenshot dm3]

    However, something really is there, and I know because I can copy the result and see it in XML Notepad 2007:

    [Screenshots dm6 and dm8]
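
    I did not confirm which characters hide the Clustering output, so the following Python sketch is only one way to test my suspicion. It assumes the copied text contains some junk before the first "<"; the pasted string below is a made-up example (a byte order mark plus a NUL), not what SSMS actually returned. The helper names are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET


def split_leading_junk(text):
    """Return (junk, xml_text), where junk is everything before the first '<'."""
    start = text.find("<")
    if start == -1:
        raise ValueError("no XML markup found")
    return text[:start], text[start:]


def describe(chars):
    """Name each character so that invisible ones become visible."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, 'UNNAMED')}" for c in chars]


# Hypothetical clipboard content: a byte order mark and a NUL before the markup.
pasted = '\ufeff\x00<PMML version="2.1"><Header/></PMML>'

junk, xml_text = split_leading_junk(pasted)
root = ET.fromstring(xml_text)  # parses once the leading junk is gone

if __name__ == "__main__":
    print(describe(junk))
    print(root.tag, root.get("version"))
```

    If the junk list comes back empty for the real clipboard text, then my nonprinting-character theory is wrong and the problem lies elsewhere.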

    The final two Contoso Retail models are Association Rules and Time Series. As I outlined in the earlier table, Analysis Services does not support PMML for either algorithm. However, I decided to run the PMML DMX query anyway to see what happens.

    For the Association Rules command, the result is:

    Executing the query ...
    Obtained object of type: Microsoft.AnalysisServices.AdomdClient.AdomdDataReader
    Error (Data mining): The algorithm does not support the functionality requested by the '' model.
    
    Execution complete

    For the Time Series command, the result is:

    Executing the query ...
    Obtained object of type: Microsoft.AnalysisServices.AdomdClient.AdomdDataReader
    Error (Data mining): The 'V Product Forecast' mining model has a column that uses the DATE data type. This data type is not supported by the PMML 2.1 specification. The DATE data type can typically be converted to the DOUBLE data type as a workaround, which is supported by the PMML 2.1 specification.
    
    Execution complete

    I decided to post these errors so that they would show up in search engines should someone be curious. It would be more accurate for the engine to return a single, uniform message stating that PMML 2.1 is not supported for this algorithm type. That uniform message would apply to seven of the nine bundled algorithms in Analysis Services.
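
    To make the "uniform message" idea concrete, here is a small Python sketch. The support matrix is transcribed from the table earlier in this post; the message wording and the function name are my own invention and do not reflect anything Analysis Services actually returns.

```python
# Support matrix transcribed from the table earlier in this post.
# True means PMML 2.1 model creation works in SQL Server 2008 R2
# Analysis Services; False means it does not.
PMML_21_SUPPORT = {
    "Clustering": True,
    "Decision Trees": True,
    "Association": False,
    "Linear Regression": False,
    "Logistic Regression": False,
    "Naive Bayes": False,
    "Neural Network": False,
    "Sequence Clustering": False,
    "Time Series": False,
}


def pmml_request_message(algorithm):
    """Return None if PMML is supported, else one uniform error message.

    The wording is hypothetical; it is what I wish the engine returned.
    """
    if algorithm not in PMML_21_SUPPORT:
        raise KeyError(f"unknown algorithm: {algorithm}")
    if PMML_21_SUPPORT[algorithm]:
        return None
    return f"PMML 2.1 is not supported for the {algorithm} algorithm."


unsupported = [name for name, ok in PMML_21_SUPPORT.items() if not ok]

if __name__ == "__main__":
    print(len(unsupported), "of", len(PMML_21_SUPPORT), "algorithms lack PMML support")
    print(pmml_request_message("Time Series"))
```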

    Recommendation

    My message to Microsoft is to support the latest PMML going forward. Microsoft's choice to uniformly support XML standards has been a good one, and we now see that support across the Microsoft Office products. Inclusion of this XML standard (PMML) predated many other projects at Microsoft. There will always be commercial vendors who support the latest PMML in an effort to further distinguish their products. I hope PMML’s ancestry within Analysis Services does not lead to falling behind even open source data mining competitors.
