• 1. DataStage 入门培训 讲师:邱明伟 日期:2010-03-01 东南融通 版权所有
  • 2. AgendaDataStage介绍 DataStage开发 DataStage四个客户端的使用 DataStage常用组件使用 DataStage常用命令 练习 2
  • 3. DataStage 介绍
  • 4. Ascential Platform4
  • 5. What is DataStage?Design jobs for Extraction, Transformation, and Loading (ETL) Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations Import, export, create, and managed metadata for use within jobs Schedule, run, and monitor jobs all within DataStage Administer your DataStage development and execution environments5
  • 6. DataStage开发
  • 7. DataStage Server and Clients7
  • 8. DataStage Server and ClientsAdministrator Administers DataStage projects and conducts housekeeping on the server Designer Creates DataStage jobs that are compiled into executable programs Director Used to run and monitor the DataStage jobs Manager Allows you to view and edit the contents of the repository8
  • 9. DataStage Administrator
  • 10. DataStage AdministratorIn DataStage all development work is done within a project. Projects are created during installation and after installation using Administrator. Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project. Before you can work in a project you must attach to it (open it). You can set the default properties of a project using DataStage Administrator10
  • 11. DataStage AdministratorUse the Administrator to specify general server defaults, add and delete projects, and to set project properties. Use the Administrator Project Properties window to: ·        Set job monitoring limits and other Director defaults on the General tab. ·        Set user group privileges on the Permissions tab. ·        Enable or disable server-side tracing on the Tracing tab. ·        Specify a user name and password for scheduling jobs on the Schedule tab. ·        Specify hashed file stage read and write cache sizes on the Tunables tab11
  • 12. DataStage Manager
  • 13. DataStage ManagerDataStage Manager manages two different types of objects: ·        Metadata describing sources and targets: -        Called table definitions in Manager. These are not to be confused with relational tables. DataStage table definitions are used to describe the format and column definitions of any type of source: sequential, relational, hashed file, etc. -        Table definitions can be created in Manager or Designer and they can also be imported from the sources or targets they describe.13
  • 14. DataStage Manager ·        DataStage components -        Every object in DataStage (jobs, routines, table definitions, etc.) is stored in the DataStage repository. Manager is the interface to this repository. -        DataStage components, including whole projects, can be exported from and imported into Manager. 14
  • 15. DataStage ManagerAny object in Manager can be exported to a file Can export whole projects Use for backup Sometimes used for version control Can be used to move DataStage objects from one project to another Use to share DataStage jobs and projects with other developers 15
  • 16. DataStage ManagerImport Procedure In Manager, click “Import>DataStage Components” Select DataStage objects for import16
  • 17. DataStage ManagerExport Procedure In Manager, click “Export>DataStage Components” Select DataStage objects for export Specified type of export: DSX, XML Specify file path on client machine17
  • 18. DataStage Director
  • 19. DataStage DirectorCan schedule, validating, and run jobs Can be invoked from DataStage Manager or Designer Clear job log Set Director options Row limits Abort after x warnings19
  • 20. Director Log ViewClick the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job. These events include control events, such as the starting, finishing, and aborting of a job; informational messages; warning messages; error messages; and program-generated messages. 20
  • 21. DataStage Director21
  • 22. DataStage Desinger
  • 23. What Is a Job?Executable DataStage program Created in DataStage Designer, but can use components from Manager Built using a graphical user interface Compiles into Orchestrate shell language (OSH)23
  • 24. Create New JobSeveral types of DataStage jobs: Parallel – this course will concentrate on parallel jobs. Job Sequence – used to create jobs that control execution of other jobs.24
  • 25. Create New Job25
  • 26. Components IntroduceSequential file 功能特点:适用于一般顺序文件(定长或不定长),可识别文本文件或IBM大机ebcdic文件。 使用要点: 按照命名规范命名 点住文件,双击鼠标,在general说明此文件内容,格式,存储目录等 修改文件属性,文件名称,reject方式 26
  • 27. Sequential file27
  • 28. Sequential file修改文件格式,比如记录结束符是什么,字段分隔符,字符串是用什么区别等 28
  • 29. Sequential file29
  • 30. Sequential file输入此文件字段内容30
  • 31. Annotation 功能特点:一般用于注释,可利用其背景颜色在job中分颜色区别不同功能块 31
  • 32. Annotation32
  • 33. Copy Stage 功能说明:Copy Stage可以有一个输入,多个输出。它可以在输出时改变字段的顺序,但是不能改变字段类型。 33
  • 34. Copy Stage34
  • 35. Filter Stage功能说明:Filter Stage只有一个输入,可以有多个输出。根据不同的筛选条件,可以将数据输出到不同的output link 35
  • 36. Filter Stage36
  • 37. Sort Stage 功能说明:只能有一个输入及一个输出,按照指定的Key值进行排列。可以选择升序还是降序,是否去除重复的数据等等 37
  • 38. Sort Stage38
  • 39. Sort Stage Option具体说明: Allow Duplicates: 是否去除重复数据。为False时,只选取一条数据,当Stable Sort为True时,选取第一条数据。当Sort Unility为UNIX时此选项无效。 Sort Utility: 选择排序时执行应用程序,可以选择DataStage内建的命令或者Unix的Sort命令 Output Statistics: 是否输出排序统计信息到job日志 Stable Sort: 是否对数据进行二次整理 39
  • 40. Sort StageCreate Cluster Key Change Column:是否为每条记录创建一个新的字段:clusterKeyChange。当Sort Key Mode为Don’t Sort(Previously Sorted) 或 Don’t Sort (Previously Grouped)时,对于第一条记录该字段被设置为1,其余的记录设置为0。 Create Key Change Column:是否为每一条记录创建一个新的字段KeyChange40
  • 41. Remove Duplicates Stage 功能说明: 输入根据关键字分好类的有序数据,去除所有记录中关键字重复的记录,通常与sort stage配合使用 41
  • 42. Remove Duplicates Stage42
  • 43. Tansformer Stage 功能说明:一个功能极为强大的Stage。有一个input link,多个output link,可以将字段进行转换,也可以通过条件来指定数据输出到那个output link。在开发过程中可以使用拖拽 43
  • 44. Tansformer Stage44
  • 45. Tansformer StageConstraint及Derivation的区别: Constraint通过限定条件使符合条件的数据输出到这个output link。 Derivation通过定义表达式来转换字段值。 在Constraint及Derivation中可以使用Job parameters及Stage Variables。 注意:Transformer Stage功能强大,但在运行过程中是以牺牲速度为代价的。在只有简单的变换,拷贝等操作时,最好用Modify Stage,Copy Stage,Filter Stage等来替换Transformer Stage 45
  • 46. LookUp Stage功能说明:LookUp Stage把数据读入内存执行查询操作,将匹配的字段输出,或者在在符合条件的记录中修改或加入新的字段。 46
  • 47. LookUp Stage47
  • 48. Join Stage 功能说明:将多个表连接后输出 48
  • 49. Aggregator Stage 功能说明: 将输入的数据分组,计算各组数据的总和或者按组进行其他的操作,最后将结果数据输出到其他的stage 49
  • 50. Aggregator Stage50
  • 51. Aggregator Stage51
  • 52. Change Capture Stage 功能特点:Change Capture Stage有两个输入,分别标记为before link 及 after link。输出的数据表示before link和after link的区别,我们称作change set。Change Capture Stage可以和Change Apply Stage配合使用来计算after set 52
  • 53. Change Capture Stage53
  • 54. Change Capture Stagekey及value的说明 key值是比较的关键值,value是当key值相同是作进一步比较用的。 change mode选项说明: All keys,Explicit Values 需要指定value,其余字段为key Explicit Keys&Values key及value都需要指定 Explicit Keys,All Values 需要指定key,其余的字段为value 54
  • 55. Funnel Stage功能说明:将多个字段相同的数据文件合并为一个单独的文件输出 55
  • 56. Funnel Stage56
  • 57. Funnel Stage合并策略说明 Continuous Funnel:从每一个input link中循环取一条记录 Sort Funnel:按照Key值排序合并输出 Sequence:先输出第一个input link的数据,输出完毕后再输出第二个input link的数据,依此类推,直到结束。(此时可以通过调整link Ordering调整输出顺序) 57
  • 58. DataStage 常用命令介绍
  • 59. dsjob执行Job dsjob -run [-mode ] 指定状态,默认为NORMAL [-param =] 指定参数运行,不指定使用默认值 [-warn ] 限制warning的日志行数 [-rows ] 限制日志行数 [-wait] 等待作业运行完 [-opmetadata ] 产生metadata [-disableprjhandler] [-disablejobhandler] [-jobstatus] 等待作业返回运行状态 [-userstatus] 等待作业返回用户定义的状态 [-local] 使用本地脚本调起job,环境变量使用脚本里面定义的环境变量。 [-useid] 是否使用jobid(使用dsjob -jobid定义别名) 59
  • 60. dsjob停止Job dsjob -stop [-useid] 如果为作业定了了别名(使用dsjob -jobid),就使用-useid告诉系统后面跟的是作业的别名。 列出全部工程 dsjob –lprojects 列出project下的全部Job dsjob –ljobs project60
  • 61. dsjob列出某个Job的实例调用情况 dsjob –linvocations project job 列出某个Job的所有stage dsjob -lstages [-useid] 列出没个Stage的LINK信息 dsjob -llinks [-useid] 列出工程信息 dsjob -projectinfo 列出Job信息 dsjob -jobinfo [-useid] 列出某个Stage信息 dsjob -stageinfo [-useid] 61
  • 62. dsadmindsadmin dsadmin主要用于创建、删除Project、设置环境变量等等62
  • 63. dsadmin dsadmin [-file | [-server ][-user ][-password ]] [] Valid primary command options are: -createproject -deleteproject -oshvisible -enablercp -enablejobadmin -envadd -envdelete -envset -advancedruntime -basedirectory -deploymentdirectory -customdeployment -listprojects -listproperties -listenv -enablegeneratexml 63
  • 64. dsadmin创建Project dsadmin -createproject [-location ] 建立Project必须用root或dsadm用户 dsadmin –createproject test –location /tmp/test 删除Project dsadmin -deleteproject 同样必须用root或dsadm用户 设置OSH是否可见 dsadmin -oshvisible TRUE | FALSE 效果与在Administrator中设置相同 设置RCP是否可用 dsadmin -enablercp TRUE | FALSE 设置Project是否支持RCP 效果与Adminintrator中设置相同64
  • 65. DataStage 练习