Troubleshooting and Debugging

Attached Files Size Limitations

The combined size of all attached files for a job is limited to 4 GB.
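As a rough pre-submission check, you can add up the sizes of the files you plan to attach. The following is a minimal sketch, not an official utility: it creates two small placeholder files so that it runs on its own, and it assumes the 4 GB limit means 4 × 1024^3 bytes. Replace filesToAttach with your own file list.

```matlab
% Sketch: estimate the combined size of files before attaching them to a job.
% Assumption: "4 GB" is treated as 4 * 1024^3 bytes.
limitBytes = 4 * 1024^3;

% Placeholder files, created only so that this sketch is self-contained.
tmpDir = tempname;
mkdir(tmpDir);
filesToAttach = {fullfile(tmpDir, 'a.bin'), fullfile(tmpDir, 'b.bin')};
for k = 1:numel(filesToAttach)
    fid = fopen(filesToAttach{k}, 'w');
    fwrite(fid, zeros(1, 1024, 'uint8'));   % 1024 bytes per file
    fclose(fid);
end

% Sum the sizes reported by dir for each file.
totalBytes = 0;
for k = 1:numel(filesToAttach)
    info = dir(filesToAttach{k});           % struct with a bytes field
    totalBytes = totalBytes + sum([info.bytes]);
end
withinLimit = totalBytes <= limitBytes;
fprintf('Combined size: %d bytes (within limit: %d)\n', totalBytes, withinLimit);

rmdir(tmpDir, 's');                         % remove the placeholder files
```

This sketch handles individual files only; folders listed in AttachedFiles would need a recursive listing.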

File Access and Permissions

Ensuring That Workers on Windows Operating Systems Can Access Files

By default, a worker on a Windows® operating system is installed as a service running as LocalSystem, so it does not have access to mapped network drives.

Often a network is configured to not allow services running as LocalSystem to access UNC or mapped network shares. In this case, you must run the mjs service under a different user with rights to log on as a service. See the section Set the User (MATLAB Parallel Server) in the MATLAB® Parallel Server™ System Administrator's Guide.

Task Function Is Unavailable

If a worker cannot find the task function, it returns an error message such as:

Error using ==> feval
      Undefined command/function 'function_name'.

The worker that ran the task did not have access to the function function_name. One solution is to make sure the location of the function's file, function_name.m, is included in the job's AdditionalPaths property. Another solution is to transfer the function file to the worker by adding function_name.m to the AttachedFiles property of the job.
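The two solutions can be sketched as follows. This is an illustration under stated assumptions: it uses the default cluster profile returned by parcluster, and it writes a tiny stand-in function_name.m to a temporary folder so that the file really exists. In practice you would normally pick one of the two options, depending on whether the workers can see the folder.

```matlab
% Stand-in for function_name.m, created only to keep the sketch self-contained.
codeDir = tempname;
mkdir(codeDir);
fid = fopen(fullfile(codeDir, 'function_name.m'), 'w');
fprintf(fid, 'function y = function_name(x)\ny = 2*x;\nend\n');
fclose(fid);

c = parcluster();        % assumption: default profile; substitute your own
job = createJob(c);

% Option 1: the folder is visible to the workers (for example, a shared
% file system), so add it to the workers' search path.
job.AdditionalPaths = {codeDir};

% Option 2: the workers cannot see the folder, so send the file with the job.
job.AttachedFiles = {fullfile(codeDir, 'function_name.m')};

createTask(job, @function_name, 1, {21});

% Uncomment to run the job and collect the result:
% submit(job); wait(job); out = fetchOutputs(job);
```

When you no longer need the job, remove it with delete(job).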

Load and Save Errors

If a worker cannot save or load a file, you might see error messages such as:

??? Error using ==> save
Unable to write file myfile.mat: permission denied.
??? Error using ==> load
Unable to read file myfile.mat: No such file or directory.

In determining the cause of this error, consider the following questions:

  • What is the worker's current folder?

  • Can the worker find the file or folder?

  • What user is the worker running as?

  • Does the worker have permission to read or write the file in question?
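One way to answer the first three questions is to run a small diagnostic as a task and compare its report with what the client sees. The sketch below is an assumption about how you might do this, not a documented procedure; describeAccess is a hypothetical helper name and myfile.mat is the file from the error messages above.

```matlab
% Hypothetical diagnostic: reports the current folder, whether the file can
% be found, and the user-related environment variables.
% Note: exist(f, 'file') also searches the MATLAB path, not only the
% current folder.
describeAccess = @(f) struct( ...
    'CurrentFolder', pwd, ...
    'FileFound',     exist(f, 'file') == 2, ...
    'USER',          getenv('USER'), ...       % usually set on UNIX
    'USERNAME',      getenv('USERNAME'));      % usually set on Windows

% Run it on the client first, as a baseline.
clientReport = describeAccess('myfile.mat');
disp(clientReport)

% To run the same check on a worker (requires a working cluster profile):
%   c = parcluster();
%   job = createJob(c);
%   createTask(job, describeAccess, 1, {'myfile.mat'});
%   submit(job); wait(job);
%   workerReport = fetchOutputs(job);   % cell array holding the struct
```

Differences between the two reports, such as a different current folder or a different user, usually point to the cause. To check permissions, try the save or load itself inside a try/catch block in the task.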

Tasks or Jobs Remain in Queued State

A job or task might get stuck in the queued state. To investigate the cause of this problem, look for the scheduler's logs:

  • Spectrum LSF® schedulers might send emails with error messages.

  • Microsoft® Windows HPC Server (including CCS), LSF®, PBS Pro®, and TORQUE save output messages in a debug log. See the getDebugLog reference page.

  • If using a generic scheduler, make sure the submit function redirects error messages to a log file.

Possible causes of the problem are:

  • The MATLAB worker failed to start because of licensing errors, because the executable is not on the default path on the worker machine, or because MATLAB is not installed in the location where the scheduler expected it to be.

  • MATLAB could not read or write the job input and output files in the scheduler's job storage location. The storage location might not be accessible to all the worker nodes, or the user that MATLAB runs as might not have permission to read or write the job files.

  • If using a generic scheduler:

    • The environment variable PARALLEL_SERVER_DECODE_FUNCTION was not defined before the MATLAB worker started.

    • The decode function was not on the worker's path.

No Results or Failed Job

Task Errors

If your job returned no results (that is, fetchOutputs(job) returns an empty cell array), the job probably failed and some of its tasks have their Error properties set.

You can use the following code to identify tasks with error messages:

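% Collect the ErrorMessage property of every task in the job.
errmsgs = get(yourjob.Tasks, {'ErrorMessage'});
% Keep and display only the messages that are not empty.
nonempty = ~cellfun(@isempty, errmsgs);
celldisp(errmsgs(nonempty));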

This code displays the nonempty error messages of the tasks found in the job object yourjob.
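If you also want to know which tasks failed, you can loop over the tasks and print each task ID next to its message. In the sketch below, a struct array stands in for yourjob.Tasks so that the code runs without a cluster; the IDs and the message are made-up sample values. With a real job, use tasks = yourjob.Tasks instead.

```matlab
% Stand-in for yourjob.Tasks (sample values, not real output).
% With a real job, replace this with: tasks = yourjob.Tasks;
tasks = struct( ...
    'ID',           {1, 2, 3}, ...
    'ErrorMessage', {'', 'Undefined function ''function_name''.', ''});

failedIDs = [];
for k = 1:numel(tasks)
    msg = tasks(k).ErrorMessage;
    if ~isempty(msg)
        failedIDs(end+1) = tasks(k).ID; %#ok<AGROW>
        fprintf('Task %d: %s\n', tasks(k).ID, msg);
    end
end
```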

Debug Logs

If you are using a supported third-party scheduler, you can use the getDebugLog function to read the debug log from the scheduler for a particular job or task.

For example, find the failed job on your LSF scheduler, and read its debug log:

c = parcluster('my_lsf_profile')
failedjob = findJob(c, 'State', 'failed');
message = getDebugLog(c, failedjob(1))

Connection Problems Between the Client and MATLAB Job Scheduler

For testing connectivity between the client machine and the machines of your compute cluster, you can use Admin Center. For more information about Admin Center, including how to start it and how to test connectivity, see Start Admin Center (MATLAB Parallel Server) and Test MATLAB Job Scheduler Cluster Connectivity in Admin Center (MATLAB Parallel Server).

Detailed instructions for other methods of diagnosing connection problems between the client and MATLAB Job Scheduler can be found in some of the Bug Reports listed on the MathWorks Web site.

The following sections can help you identify the general nature of some connection problems.

Client Cannot See the MATLAB Job Scheduler

If you cannot locate or connect to your MATLAB Job Scheduler with parcluster, the most likely reasons for this failure are:

  • The MATLAB Job Scheduler is currently not running.

  • Firewalls do not allow traffic from the client to the MATLAB Job Scheduler.

  • The client and the MATLAB Job Scheduler are not running the same version of the software.

  • The client and the MATLAB Job Scheduler cannot resolve each other's short hostnames.

  • The MATLAB Job Scheduler is using a nondefault BASE_PORT setting as defined in the mjs_def file, and the Host property in the cluster profile does not specify this port.

MATLAB Job Scheduler Cannot See the Client

If a warning message says that the MATLAB Job Scheduler cannot open a TCP connection to the client computer, the most likely reasons for this are:

  • Firewalls do not allow traffic from the MATLAB Job Scheduler to the client.

  • The MATLAB Job Scheduler cannot resolve the short hostname of the client computer. Use pctconfig to change the hostname that the MATLAB Job Scheduler will use for contacting the client.
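A minimal sketch of the pctconfig call follows. The hostname client.example.com is a placeholder, not a real setting; use a name or address that the MATLAB Job Scheduler machine can resolve. The sketch assumes you make this call in the client session before you connect to the cluster with parcluster.

```matlab
% Placeholder hostname: replace with a name the scheduler can resolve.
clientName = 'client.example.com';

% Set the hostname that the scheduler uses to contact this client, and
% return the resulting configuration as a structure.
cfg = pctconfig('hostname', clientName);
disp(cfg.hostname)
```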

"One of your shell's init files contains a command that is writing to stdout..."

The example code for generic schedulers with non-shared file systems contacts an sftp server to handle the file transfer to and from the cluster's file system. This use of sftp is subject to all the normal sftp vulnerabilities. One problem that can occur results in an error message similar to this:

One of your shell's init files contains a command that is writing to stdout,
interfering with RemoteClusterAccess.
The stdout read was:
<some output>

Find and wrap the command with a conditional test, such as

	if ($?TERM != 0) then
		if ("$TERM" != "dumb") then
			<your command>
		endif
	endif

The sftp server starts a shell, usually bash or tcsh, to set your standard read and write permissions appropriately before transferring files. The server initializes the shell in the standard way, reading startup files such as .bashrc and .cshrc. The problem occurs if your shell writes text to standard output when it starts. That text is sent back to the sftp client running inside MATLAB, which interprets it as the size of the sftp server's response message.

To work around this error, locate the code in your shell startup file that emits the text, and either remove it or wrap it in if statements that test whether the sftp server is starting the shell (the example uses csh/tcsh syntax):

if ($?TERM != 0) then
    if ("$TERM" != "dumb") then
        /your command/
    endif
endif

You can test this outside of MATLAB with a standard UNIX or Windows sftp command-line client before trying again in MATLAB. If the problem is not fixed, an error message persists:

> sftp yourSubmitMachine
Connecting to yourSubmitMachine...
Received message too long 1718579042

If the problem is fixed, you should see:

> sftp yourSubmitMachine
Connecting to yourSubmitMachine...
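The number in a "Received message too long" error can help you find the offending command. The following sketch assumes that the sftp client read the first four bytes of the stray shell output as a big-endian 32-bit length, which is how SSH-family protocols encode packet lengths. Decoding the number back into characters shows how that output begins.

```matlab
% Length reported in the sftp error message above.
reportedLength = uint32(1718579042);

% Split the 32-bit value into four bytes, most significant byte first.
bytes = uint8([ ...
    bitshift(reportedLength, -24), ...
    bitand(bitshift(reportedLength, -16), 255), ...
    bitand(bitshift(reportedLength, -8), 255), ...
    bitand(reportedLength, 255)]);

% Interpret the bytes as ASCII characters.
firstChars = char(bytes);
disp(firstChars)   % displays: foob
```

Here the shell startup file printed text that begins with "foob", so you would search your startup files for a command that prints such a string.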