Data-Centric Automated Data Mining


Marcos M. Campos, Peter J. Stengard, Boriana L. Milenova
Data Mining Technologies

Overview
• Data mining complexity
• Proposed design solution
• Application I: Oracle Predictive Analytics
• Application II: Spreadsheet Add-In for PA

Data Mining Complexity
• Knowledge of data mining techniques
  – Which algorithm do I use?
• Algorithm specific data preparation
  – How should I prepare my data?
• Model parameter tuning
  – What kernel function should I use?
• Deployment
  – I’ve deployed a model, now what data can I score with it?

Industry Future
“Predictive analytics builds on the data mining multistep process and statistical modeling techniques to add a layer of automation and self-directed built-in intelligence. Business users (and not just Ph.D. statisticians) can now analyze large amounts of customer, supplier, employee and product data for patterns and trends.”
- Kent Bauer, DM Review Magazine, Dec. 2005

Design Approach
• Goal: “Good results with minimum effort”
• Data-centric focus
  – familiar to database and business intelligence communities
• Process automation
  – ease-of-use for non-expert users

Data-centric Design
• Eliminates concepts of models or complex methodologies
• Requires only knowledge of the data source
• Supporting objects are either removed or linked to the data source
• Users see only predictive or descriptive results
• Goal-oriented tasks

Goal-oriented Tasks
• Explain - attribute importance
• Predict - classification or regression
• Group - clustering/segmentation
• Detect - anomaly/outlier detection
• Map - project data to lower dimensionality
• Profile - supervised segmentation

Process Automation
• Statistics computation
• Sampling
• Attribute type identification
• Attribute selection
• Algorithm selection
• Data transformation
• Model creation and selection
• Output generation

Process Automation (cont.)
• Statistics computation
  – number of records, number of attributes, attribute ranges and cardinality
  – used to make decisions about target and attribute type and guide data transformations
• Sampling (random & stratified)
  – improve training times for large datasets
  – ensure sufficient rare target value/range representation

Process Automation (cont.)
• Attribute type identification
  – categorical versus numeric types
  – essential for correct data preparation and algorithm performance
• Attribute selection
  – enhance performance (speed and accuracy)
  – improve explanatory power
  – filter methods preferable over wrapper methods in the context of automation

Process Automation (cont.)
• Algorithm selection
  – data-driven (e.g., classification vs. regression)
• Data transformation
  – choice and sequence of transformations based on
    • selected algorithm
    • data characteristics (e.g., attribute type, attribute range/cardinality, percentage of missing values)
  – common transformations: binning, normalization, missing value imputation, outlier treatment

Process Automation (cont.)
• Model creation and selection
  – model creation across different algorithms or via parameter tuning on a single algorithm
  – quality assessment and selection
  – figure of merit provided to the user
• Output generation
  – scoring (e.g., prediction)
  – descriptive information (e.g., cluster description)
  – explanations must be compatible with original data values and ranges (transformation reversal)

Oracle Predictive Analytics
• PL/SQL API
• Targets database users
• Emphasis on ease-of-use
• Data in database (table/view)
• Results presented in tables
• Actionable results

Explain
• Embedded data preparation
• No intermediary objects persisted
• Produce figure of merit and rank for each attribute

    DBMS_PREDICTIVE_ANALYTICS.EXPLAIN (
        data_table_name      IN VARCHAR2,
        explain_column_name  IN VARCHAR2,
        result_table_name    IN VARCHAR2,
        data_schema_name     IN VARCHAR2 DEFAULT NULL);

Explain Methodology
[Flowchart. Boxes: Compute statistics; Large data? (Yes: Sample / No); Data preparation (Remove outliers, Treat missing values, Normalize / Discretize); Build attribute importance model; Analyze model; Produce key attributes]

Predict
• Automatically determine problem type
• Embedded data preparation
• No intermediary objects persisted
• Produce prediction for each record in data

    DBMS_PREDICTIVE_ANALYTICS.PREDICT (
        accuracy             OUT NUMBER,
        data_table_name      IN VARCHAR2,
        case_id_column_name  IN VARCHAR2,
        target_column_name   IN VARCHAR2,
        result_table_name    IN VARCHAR2,
        data_schema_name     IN VARCHAR2 DEFAULT NULL);

Predict Methodology
[Flowchart. Boxes: Compute statistics; Large data? (Yes: Sample / No); Data preparation (Remove outliers, Treat missing values, Normalize / Discretize); Split the data; Target type? (Categorical: Build classification model / Numeric: Build regression model); Measure performance; Score the whole data]

Spreadsheet Add-In for Predictive Analytics
• Excel front-end
• Targets business analysts
• Emphasis on ease-of-use
• Data in database or Excel spreadsheets
• Results presented in Excel
• Familiar environment for evaluation and presentation

Explain Results

Explain Results – CoIL 2000
• Task: determine which attributes are useful in predicting the target (caravan insurance policy buyer)
• The four most important attributes found were in agreement with submissions to the CoIL 2000 Challenge, including the competition’s winner

Predict Results – Classification

Predict Results – CoIL 2000
• Data: CoIL 2000
• Task: predict which customers are interested in buying a caravan insurance policy
• Results compare favorably to those submitted to the original competition

    Actual\Predicted       0       1
    0                   2658    1104
    1                    100     138

Predict Results – Regression

Predict Results – Boston Housing
• Data: Boston Housing (US Census Service)
• Task: predict the median value of owner-occupied homes (MEDV)
• The results also compare well against other published predictive outcomes

                         MAE    RMSE
    PA                  2.55    3.89
    Excel Regression    3.99    5.96

Conclusions
• New design approach offers many benefits
  – ease-of-use with minimum user input
  – high productivity
• Potential to bring data mining to a wider audience
  – database community
  – business intelligence users
• Applications provide out of the box competitive results on challenging data
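
Appendix: sketch of the automated problem-type decision

The Process Automation slides describe computing column statistics, identifying attribute types (categorical versus numeric), and choosing classification or regression from the data. The following is a minimal Python sketch of that decision chain, not the Oracle implementation. The names ColumnStats, compute_stats, attribute_type, and problem_type are hypothetical, and the cardinality cutoff of 10 is an assumption chosen only for illustration; the slides do not state how the decision is made.

```python
from dataclasses import dataclass
from numbers import Number
from typing import Any, Optional, Sequence

# Hypothetical cutoff (not from the slides): a numeric column with this many
# distinct values or fewer is treated as categorical.
MAX_CATEGORICAL_CARDINALITY = 10


@dataclass
class ColumnStats:
    count: int                 # number of non-missing values
    missing: int               # number of missing (None) values
    cardinality: int           # number of distinct non-missing values
    minimum: Optional[float]   # range is only defined for numeric columns
    maximum: Optional[float]
    is_numeric: bool


def compute_stats(values: Sequence[Any]) -> ColumnStats:
    """Statistics computation: record count, cardinality, and range."""
    present = [v for v in values if v is not None]
    is_numeric = bool(present) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )
    return ColumnStats(
        count=len(present),
        missing=len(values) - len(present),
        cardinality=len(set(present)),
        minimum=min(present) if is_numeric else None,
        maximum=max(present) if is_numeric else None,
        is_numeric=is_numeric,
    )


def attribute_type(stats: ColumnStats) -> str:
    """Attribute type identification: 'numeric' or 'categorical'."""
    if stats.is_numeric and stats.cardinality > MAX_CATEGORICAL_CARDINALITY:
        return "numeric"
    return "categorical"


def problem_type(target_stats: ColumnStats) -> str:
    """Data-driven algorithm selection based on the target column."""
    if attribute_type(target_stats) == "numeric":
        return "regression"
    return "classification"


if __name__ == "__main__":
    # A continuous target with many distinct values, and a 0/1 flag target.
    continuous_target = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7,
                         22.9, 27.1, 16.5, 18.9, 15.0, 18.9]
    flag_target = [0, 0, 1, 0, 0, 1, None, 0]
    print(problem_type(compute_stats(continuous_target)))  # regression
    print(problem_type(compute_stats(flag_target)))        # classification
```

A 0/1 target is stored as a number but has only two distinct values, so the sketch routes it to classification. This is why the statistics step runs before the type decision: the storage type alone is not enough to choose between the two model families.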