I have a series of Spark jobs, all scheduled through OOZIE on Yarn (HDP). My Big data stack is HDFS, Spark, OOZIE. The spark jars after building through sbt are uploaded to HDFS and fetched by oozie to read and orchestrate the jobs. All of my oozie scripts have the following files: coordinator.xml, workflow.xml, job.properties, run-oozie.sh
I am trying to keep my Spark jobs the same and not trying to use any advance features of Airflow i,e keep the code base the same. I am merely trying to write a corresponding Airflow Python representation from oozie.
These are how some of my files look like:
coordinator.xml
<coordinator-app xmlns = "uri:oozie:coordinator:0.2" name = "SampleCoordinator" frequency = "${coordFrequency}" start ="${startTime}" end ="2099-09-19T21:00Z" timezone = "UTC"> <controls> <timeout>1</timeout> <concurrency>1</concurrency> <execution>LIFO</execution> <throttle>1</throttle> </controls> <action> <workflow> <app-path>${workflowDir}</app-path> <configuration> <property> <name>SCHEDULED_TIME</name> <value>${coord:nominalTime()}</value> </property> </configuration> </workflow> </action> </coordinator-app>
workflow.xml
<workflow-app xmlns = "uri:oozie:workflow:0.4" name = "Sample-Workflow"> <start to = "Sample-Workflow"/> <!--> Multiple Upload Workflows will be used for all the hourly upload jobs instead of a single one. Jobs should be independent of each other and not wait for another job to complete before executing <--> <action name="Sample-Workflow"> <shell xmlns="uri:oozie:shell-action:0.1"> <job-tracker>${jobTracker}</job-tracker> <name-node>${nameNode}</name-node> <configuration> <property> <name>oozie.launcher.mapreduce.map.memory.mb</name> <value>26384</value> </property> <property> <name>oozie.launcher.mapreduce.map.java.opts</name> <value>-Xmx4096m</value> </property> <property> <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name> <value>16384</value> </property> <property> <name>oozie.launcher.mapreduce.map.java.opts</name> <value>-Xmx4096m</value> </property> </configuration> <exec>${scriptDir}/${scriptFile}</exec> <argument>${sampleJar}</argument> <argument>${databaseFile}</argument> <argument>${wf:id()}</argument> <argument>${env}</argument> <argument>${APIKeyFile}</argument> <argument>${JobName}</argument> <argument>${TableId}</argument> <argument>${InputDir}</argument> <env-var>USER_NAME=${user}</env-var> <file>${scriptDir}/${scriptFile}#${scriptFile}</file> <file>${databaseFile}#${databaseFile}</file> <capture-output/> </shell> <ok to="job_complete"/> <error to="job_fail"/> </action> <kill name = "job_fail"> <message>Job failed, the error is [${wf:errorMessage(wf:lastErrorNode())}]</message> </kill> <end name = "job_complete" /> </workflow-app>
I have a job.properties file which I'm not sharing which defines these workflow params. The coordFrequency in oozie workflow is defined in the properties file as a cron: coordFrequency=5 * * * *
And a run-oozie.sh script which looks like this:
#!/usr/bin/env bash oozie job -config ./job.properties -run
Since I am a complete newbie to Airflow and couldn't find a good resource for this migration I thought of reaching out to the community. Hoping for a solution, some relevant OOZIE -> Airflow doc where steps are clearly outlined, or any kind of help/ suggestions.
https://stackoverflow.com/questions/65820968/best-way-to-migrate-from-oozie-to-airflow-for-on-prem-hadoop-cluster-spark-jobs January 21, 2021 at 11:56AM
没有评论:
发表评论