Changes between Version 68 and Version 69 of Tutorial/JobSubm


Timestamp:
Mar 3, 2009, 1:35:15 PM
Author:
/O=GRID-FR/C=FR/O=CNRS/OU=LAL/CN=Michel Jouvin
Comment:

--

  • Tutorial/JobSubm

    v68 v69  
    135135=== Controlling Retries done by WMS ===
    136136
    137 After the job has been submitted to WMS, WMS goes through several phases until final submission of the job to the CE. For various reasons, errors can occured at each stage and a user can control the number of retries WMS must do through 2 JDL attributes:
    138  * `RetryCount`: defines the maximum number the WMS try to process a job in case of an error occured inside the WMS. Default is site specific and there is a maximum defined on the WMS itself.
    139  * `ShallowRetryCount`: defines the maximum number the WMS try to submit a job to the selected CE in case of an error occured on the CE. Default is site specific and there is a maximum defined on the WMS itself.
    140 
    141 Until the maximum attempt has been reached the job never reached the state 'Done'.
     137After the job has been submitted to the WMS, the WMS goes through several phases until final submission of the job to the CE and its execution. For various reasons, errors can occur at each stage, and a user can control the number of retries the WMS performs through two JDL attributes:
     138 * `ShallowRetryCount`: defines the maximum number of times the WMS tries to submit a job to a CE when an error occurs at submission time, before the job has actually started. This is called a ''shallow resubmission'': at each attempt, a different CE is selected. The default is site-specific and there is a maximum defined on the WMS itself.
     139 * `RetryCount`: defines the maximum number of times the WMS tries to resubmit a job when an error occurs after the job has started to run on the CE. This is called a ''deep resubmission'': specific actions may be required to clean up files left by the previous run attempt. The default is site-specific and there is a maximum defined on the WMS itself.
     140
     141In addition to resubmission, the WMS can retry the first phase of job processing, called ''match making'', which is responsible for selecting a CE. If an error occurs during this phase, the ''match making'' is retried at an interval defined by the site, for a maximum period also defined by the site. The user has no control over this. During this period the job status is `Waiting`. If the ''match making'' still fails once the maximum period has elapsed, the job fails with one of two status reasons: `No compatible resource` (the match making process failed to find a resource matching the job requirements) or `Request expired` (the maximum period allowed was reached before `ShallowRetryCount` was exhausted).
    142142
    143143''Note: when debugging jobs, it is often desirable to set `RetryCount` and `ShallowRetryCount` to 0 to get quick feedback in case of errors.''
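
As an illustration of the note above, a minimal JDL sketch that disables both kinds of resubmission while debugging might look like the following. The executable and the file names are placeholders chosen for this example, not values taken from the tutorial. Only `RetryCount` and `ShallowRetryCount` are the attributes discussed here.

{{{
[
  // Placeholder job: replace with the executable being debugged
  Executable = "/bin/hostname";
  StdOutput = "std.out";
  StdError = "std.err";
  OutputSandbox = {"std.out", "std.err"};

  // No deep resubmission: do not rerun the job after it has started on a CE
  RetryCount = 0;
  // No shallow resubmission: do not try another CE if submission fails
  ShallowRetryCount = 0;
]
}}}

With both attributes set to 0, the first error should be reported directly instead of being hidden behind further attempts. For production jobs, non-zero values remain subject to the maximum defined on the WMS.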