Workflow Automation in Spectrum Mill


Introduction to Workflow Automation

Spectrum Mill B.04.00 introduced workflow automation via a Service Request Manager (SRM), which allows you to place a series of tasks into a queue that the SRM then executes in the proper order. You can automate the tasks involved in a typical data analysis for MS/MS data files from protein digests.

The SRM Workflow automation was designed and developed with the following goals in mind:

[Figure: Overview of Spectrum Mill modules and workflow]

Parameter Files

The first step in setting up an automated workflow is to save and name the parameter sets you want to use for each task in your workflow. Most Spectrum Mill forms allow you to save and load parameter files associated with the form. For example, Data Extractor, MS/MS Search, Autovalidation, Quality Metrics, de novo Sequencing, and Protein/Peptide Summary all include buttons to save settings in parameter files and to load settings from parameter files. In nearly all cases, you use the parameter files to build and execute workflows. Peptide Selector and MRM Selector are exceptions. For these programs, it is convenient to save parameter files, but you cannot use them in a workflow.

Parameter files are stored on the Spectrum Mill server under the folder \SpectrumMill\millauto. Only one level of subfolder is allowed. We recommend that you assign subfolders to users or types of experiments or studies.  Each task type stores (in a cookie) the last parameter file that you selected for that task. The program shows the selected parameter file in red in the title portion of the form:

Parameter file name 
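
The one-level subfolder rule for \SpectrumMill\millauto can be checked before saving a parameter file. The following is a minimal sketch, not part of Spectrum Mill: the function name is hypothetical, and it assumes the path is given relative to the millauto folder.

```python
from pathlib import PureWindowsPath


def is_valid_params_location(relative_path: str) -> bool:
    """Return True if a path relative to millauto uses at most one subfolder level.

    Hypothetical helper for illustration only.
    """
    path = PureWindowsPath(relative_path)
    # Reject absolute or rooted paths: the input must be relative to millauto.
    if path.anchor:
        return False
    # Valid shapes are "file.params" (1 part) or "subfolder\file.params" (2 parts).
    return 1 <= len(path.parts) <= 2


if __name__ == "__main__":
    for candidate in (r"mstag.name.params",
                      r"userA\mstag.name.params",
                      r"userA\study1\mstag.name.params"):
        print(candidate, is_valid_params_location(candidate))
```

A path with two nested subfolders is rejected, which matches the recommendation to assign a single subfolder per user, experiment type, or study.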

Workflow Automation

Workflow automation user interface (UI) elements consist of a Workflows form, an Edit Workflow form, and a Request Queue / Completion Log viewer. You access these forms through the Process Automation Tools section of the Spectrum Mill home page.

Workflow automation allows the Service Request Manager (SRM) to execute a workflow consisting of an ordered list of tasks on one or more data folders, using the same set of parameters for all data folders. Each task in the workflow must be accompanied by a parameter file that specifies the settings for the task. Consequently, all data folders processed together should contain the same data types for each task (for example, for Data Extraction, all .raw files, all .d files, etc.).

When the SRM executes a workflow, it adds the tasks to a Request Queue and executes them in order for a particular data folder. If you select multiple data folders before you start the workflow, the SRM performs the Data Extraction and MS/MS Search tasks for each folder in parallel. With a typical multi-core CPU configuration, the SRM processes one task on each logical processor, up to the number of available processors.

Certain tasks (Data Extraction, MS/MS Search, and de novo Sequencing) are launched for a data directory but execute in a series of batches consisting of subsets of LC-MS/MS runs (Data Extraction) or subsets of MS/MS spectra (MS/MS Search and de novo Sequencing) within the directory. These batches are processed in parallel as sub-tasks on available processors when the max CPU option is checked on the workflow form or the individual task's form.

After a search is complete, autovalidation is performed independently for each folder unless "Group proteins across ..." is marked. After autovalidation, results for all folders are summarized together by P/P Summary.

Note: If you want an autovalidation task that uses protein grouping to process multiple directories independently, you must check and execute each directory separately in the Data Directories section.

The Workflows and Edit Workflow forms allow you to view an individual task’s form in read-only mode, with the parameters shown for the parameter file you select.

Service Request Manager (SRM)

To process workflows, Spectrum Mill uses a Service Request Manager (SRM), implemented as a Windows service, to maintain and operate a Request Queue, as illustrated below.

[Figure: Overview of the SRM functionality]

Together, a Windows service, 2 programs, 4 scripts, the SRM configuration file, a log file, and UI-generated parameter files for workflows and tasks constitute the infrastructure of the SRM portion of Spectrum Mill. They are listed below by folder.

\SpectrumMill\millsrm\
  • SRMHostSvc.exe
  • submitRequest.exe
  • queueStatus.exe
  • SMSRM.Config
  • requests.log

\SpectrumMill\millscripts\
  • workflow.pl
  • submitRequest.pl
  • queueStatus.pl
  • viewRequestsLog.pl

\SpectrumMill\millauto\
  • workflow.*.tsv
  • xtractor.*.params
  • mstag.*.params
  • autovalidation.*.params
  • ppsummary.*.params
  • qualityMetrics.*.params
  • sherenga.*.params
  • specsummary.*.params
  • specMatcher.*.params

Processor Utilization

The number of parallel tasks is limited by the number of logical processors on the server. By default, the Spectrum Mill installer sets the SRM limit to one less than the number of logical processors. For example, a single-CPU quad-core system with hyperthreading enabled has 8 logical processors and is configured to let the SRM use 7 of them concurrently. You can edit a configuration file to change the maximum number of parallel processes. As a general rule, you should have enough RAM to support each parallel process, at 2 GB per process.
For more information, see Multicore (Maximize CPUs) Data Extraction.
If you reconfigure the amount of system memory or the number of CPUs, see Stop and restart the SRM service.
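
The sizing rule above can be worked through in a few lines. This is an illustrative sketch only, not how the installer computes its value: the function names are hypothetical, and the floor of one process on a single-processor machine is an assumption.

```python
import os

RAM_PER_PROCESS_GB = 2  # rule of thumb stated in the text above


def default_srm_limit(logical_processors: int) -> int:
    """Default described above: one less than the number of logical processors."""
    # Assumption of this sketch: never go below one parallel process.
    return max(1, logical_processors - 1)


def recommended_ram_gb(parallel_processes: int) -> int:
    """RAM needed to support the given number of parallel processes."""
    return parallel_processes * RAM_PER_PROCESS_GB


if __name__ == "__main__":
    logical = os.cpu_count() or 1
    limit = default_srm_limit(logical)
    print(f"{logical} logical processors: SRM limit {limit}, "
          f"about {recommended_ram_gb(limit)} GB RAM")
```

For the quad-core hyperthreaded example, 8 logical processors give a limit of 7 parallel processes and a RAM guideline of 14 GB.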

How/where to find the Results

Results for most Spectrum Mill modules will be placed in the same individual data directory as the original LC-MS/MS data files that were selected for analysis. For tasks like Protein/Peptide Summary that combine results from multiple directories, the results will be placed in the first directory selected in the input form.

SRM troubleshooting aids

When a task is executed through the SRM, the output that was written to a Results pane (typically to the right of or below the form for the individual task) in Spectrum Mill versions prior to 4.0 is now redirected to a task-generated HTML log file located in the data folder. If the task (for example, Autovalidation or P/P Summary) uses multiple data directories, then the redirected output HTML is located in the first data directory in the list. This is the same behavior as with the result tables produced for the Excel export option. These redirected task-generated log files are particularly useful when troubleshooting errant performance.

Redirected task-generated output logs produced during task execution. The files have names like:
   xtractorFinnigan.190101135721.1901P.htm
   mstag.190101135723.1907P.htm
   validate.190101135724.1909.htm
   sherenga.190101135722.1903P.htm
   qualityMetrics.190101135725.1910.htm
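
When scanning a data folder for these logs, it can help to split the file names into their parts. The following sketch is based only on the shape of the example names above: the field names are hypothetical, and the meaning of the 12-digit field, the trailing number, and the optional P suffix is not stated in this document.

```python
import re
from typing import NamedTuple, Optional

# Shape inferred from the example names: task.<12 digits>.<digits>[P].htm
LOG_NAME = re.compile(
    r"^(?P<task>[A-Za-z]+)\.(?P<stamp>\d{12})\.(?P<number>\d+)(?P<suffix>P?)\.htm$"
)


class LogName(NamedTuple):
    task: str           # e.g. "mstag"
    stamp: str          # 12-digit field (meaning assumed, not documented here)
    number: int         # trailing number (meaning not documented here)
    has_p_suffix: bool  # True when the number is followed by "P"


def parse_log_name(filename: str) -> Optional[LogName]:
    """Split a redirected-log file name into its parts, or return None."""
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    return LogName(match["task"], match["stamp"],
                   int(match["number"]), match["suffix"] == "P")


if __name__ == "__main__":
    print(parse_log_name("mstag.190101135723.1907P.htm"))
    print(parse_log_name("validate.190101135724.1909.htm"))
```

Names that do not follow this shape return None, so the function can be used as a filter over a directory listing.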

When tasks are launched from an individual task form for execution via the SRM, the links shown below are present in the SRM acknowledgment output written to the lower pane. That output is redirected to an HTML file when the tasks are executed as part of a workflow.

Redirected SRM acknowledgment of task submission to the SRM request queue. The files have names like:
   xtractorFinnigan.190101135721.1902.htm
   mstag.190101135723.1908.htm
   sherenga.190101135722.1904.htm

To access the task-generated output log results via a web browser, look for these links in the SRM acknowledgment lower pane or redirected HTML output file:

See details about the Request Queue and Completion Log, and how you can use them to view parameters and results output.

Workflow mode vs Individual Task mode

You can execute portions of a workflow one task at a time by launching requests directly from an individual task’s form.

In individual task mode, SRM Workflow automation remains dependency-aware and ensures that all tasks in a data folder are processed in the proper order. As long as no dependencies exist, the tasks for different data folders execute in parallel (up to the number of available CPUs) for extraction and search. When dependencies do exist (for example, an MS/MS search cannot start until data extraction is complete), then the SRM ensures that the tasks execute in the correct order and at the proper time.
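
The dependency rule described above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the SRM's implementation: it treats every data folder as independent, assumes each task depends only on the previous task for the same folder, and uses hypothetical task names.

```python
from typing import List, Set, Tuple

Task = Tuple[str, str]  # (data folder, task name)


def runnable_tasks(workflow: List[str], folders: List[str],
                   done: Set[Task], running: Set[Task],
                   cpu_limit: int) -> List[Task]:
    """Return the tasks that may start now, in folder order.

    For each folder, only the first unfinished task is a candidate,
    so a search never starts before its extraction has completed.
    """
    free_slots = cpu_limit - len(running)
    ready: List[Task] = []
    for folder in folders:
        for name in workflow:
            task = (folder, name)
            if task in done:
                continue
            # First unfinished task for this folder: start it if it is not
            # already running and a processor slot is still free.
            if task not in running and len(ready) < free_slots:
                ready.append(task)
            break  # later tasks in this folder must wait
    return ready


if __name__ == "__main__":
    steps = ["extract", "search", "validate"]
    print(runnable_tasks(steps, ["A", "B"],
                         done={("A", "extract")},
                         running={("B", "extract")},
                         cpu_limit=7))
```

In the usage example, folder A has finished extraction, so its search may start, while folder B is still extracting and contributes nothing new.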

In practice, certain combinations of tasks are well suited to workflow mode execution (Data Extraction, MS/MS Search, and Autovalidation) and might most often be run that way. When all individual tasks are performed on a single directory, the combination of them tends to be workflow friendly.

For some tasks, it tends to be more convenient to simply run them in individual task mode. Examples include Autovalidation in Protein polishing mode, and Protein/Peptide Summary in Protein Comparison mode when run across multiple data directories, particularly for projects where the data in each directory is generated days or weeks apart.

When you develop data analysis methods, it is often more convenient to run in individual task mode.

Bypassing the SRM

Before the SRM was developed, all Spectrum Mill tasks could be executed by a user via a direct connection between an individual task's form and its script. This continues to be the case, and amounts to bypassing the SRM. For tasks where users generate tables of output in Excel Export format, to be viewed primarily with external tools, bypassing the SRM is unlikely to be beneficial.

However, for tasks such as Protein/Peptide Summary and Spectrum Summary that can produce HTML output with interactive links, bypassing the SRM can be preferable. To enable this, check neither the Queue request checkbox nor the Excel Export checkbox on the form. Example use cases include when one's primary goal is to:

Note: Although the forms for Data Extractor, MS/MS Search, and de novo Sequencing no longer allow the SRM to be bypassed, the accompanying PERL scripts can still be run without the SRM. Instead of a Queue request checkbox, these forms have a hidden variable that forces the feature to always be enabled when using the form.

Bypassing the User Interface with command line execution

The main PERL scripts in Spectrum Mill can receive their parameters in two different ways: by CGI or by command line (CLI).

The command line syntax for launching Spectrum Mill PERL scripts is shown below. Each command has the form:
   >scriptName.pl paramsFilePath dataDirectoryPaths (space-delimited)
Pay close attention to the paths to the working directory, script, params file, and data directories.

E:\SpectrumMill>millscripts\runXtractor.pl millauto\xtractor.name.params dataDir1 dataDir2 ... > msdataSM\dataDir1\xtractor.CLItest.htm
E:\SpectrumMill>millscripts\batchTagPara.pl millauto\mstag.name.params dataDir1 dataDir2 ... > msdataSM\dataDir1\mstag.CLItest.htm
E:\SpectrumMill>millscripts\validateTable.pl millauto\autovalidation.name.params dataDir1 dataDir2 ... > msdataSM\dataDir1\validate.CLItest.htm
E:\SpectrumMill>millscripts\batchSherenga.pl millauto\sherenga.name.params dataDir1 dataDir2 ... > msdataSM\dataDir1\sherenga.CLItest.htm
E:\SpectrumMill>millscripts\meterMaid.pl millauto\qualityMetrics.name.params dataDir1 dataDir2 ... > msdataSM\dataDir1\qualityMetrics.CLItest.htm

To get the max CPU behavior, you must use the SRM rather than launching the individual scripts directly:
E:\SpectrumMill>millscripts\submitRequest.pl batchTagPara.pl millauto\mstag.name.params dataDir1 > msdataSM\dataDir1\mstag.CLItest.htm

You can also run an entire workflow via the SRM:
E:\SpectrumMill>millscripts\workflow.pl workflow.name.tsv dataDir1 dataDir2 ... > msdataSM\dataDir1\workflow.CLItest.htm
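
If you launch these commands from another program rather than typing them, the command lines above can be assembled as argument lists. The following is a minimal sketch, not a supported Spectrum Mill interface: the helper names are hypothetical, the E:\SpectrumMill working directory is taken from the examples, and the use of a `perl` executable on PATH is an assumption (the examples rely on the Windows .pl file association instead).

```python
import subprocess
from pathlib import PureWindowsPath
from typing import List, Tuple

# Working directory used in the examples above.
INSTALL_ROOT = PureWindowsPath(r"E:\SpectrumMill")


def build_cli(script: str, params_file: str, data_dirs: List[str],
              tag: str, via_srm: bool = False) -> Tuple[List[str], str]:
    """Return (argv, log_path) mirroring the command lines above."""
    if not data_dirs:
        raise ValueError("at least one data directory is required")
    params = str(PureWindowsPath("millauto") / params_file)
    if via_srm:
        # SRM form: submitRequest.pl receives the target script name first.
        argv = [str(PureWindowsPath("millscripts") / "submitRequest.pl"),
                script, params]
    else:
        argv = [str(PureWindowsPath("millscripts") / script), params]
    argv += data_dirs
    # Output is redirected into the first data directory, as in the examples.
    log_path = str(INSTALL_ROOT / "msdataSM" / data_dirs[0] / f"{tag}.CLItest.htm")
    return argv, log_path


def run(argv: List[str], log_path: str) -> int:
    """Run the command from the install root, redirecting stdout to log_path."""
    with open(log_path, "w") as out:
        # Assumption: `perl` is on PATH (see the note above this block).
        return subprocess.run(["perl", *argv], cwd=str(INSTALL_ROOT),
                              stdout=out).returncode


if __name__ == "__main__":
    print(build_cli("runXtractor.pl", "xtractor.name.params",
                    ["dataDir1", "dataDir2"], tag="xtractor"))
```

Only `build_cli` is exercised in the usage example; `run` requires an actual Spectrum Mill installation and should be tried on test data first.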

More information is available on creation of Parameter Files (*.params). The simple name=value formatted lines in a .params file are derived from the use of the PERL CGI package command: $queryCGI->save(\*PARAMSFILE)
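
Because the .params format is plain name=value lines, it can be read outside of PERL. The following is a minimal sketch assuming the usual output of the CGI package's save method: URL-escaped names and values, repeated lines for multi-valued parameters, and a lone "=" line marking the end of the record. The parameter names in the usage example are hypothetical and are not actual Spectrum Mill settings.

```python
from typing import Dict, Iterable, List
from urllib.parse import unquote


def parse_params(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse name=value lines into a dict of name -> list of values."""
    params: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "=":
            # Assumed end-of-record marker written by the save method.
            break
        if not line:
            continue
        name, _, value = line.partition("=")
        # Names and values are assumed to be URL-escaped (%XX sequences).
        params.setdefault(unquote(name), []).append(unquote(value))
    return params


if __name__ == "__main__":
    sample = ["exampleName=value%20one\n", "exampleName=two\n",
              "other=3\n", "=\n"]
    print(parse_params(sample))
```

To read a real file, pass an open file handle, for example `parse_params(open(path, encoding="utf-8"))`. Values are kept as lists because a form field can carry more than one value.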

Some of these PERL scripts (runXtractor.pl, batchTagPara.pl) launch C++ programs that process batches of LC-MS/MS runs (xtractorAgilent.cgi, xtractorFinnigan.cgi) or batches of MS/MS spectra (mstagpara.cgi). The batches are created from items located in a single directory. The C++ programs can receive their params via either CGI or CLI. The PERL script batchSherenga.pl similarly processes batches of MS/MS spectra by running a Java program (sherenga.jar). Sherenga receives parameters only by CLI.

In max CPU mode, each PERL script is run once to create all the batches. Then the SRM directly launches the corresponding C++ program (?via CLI or CGI?) or the Java program Sherenga (via CLI) for each batch. For MS/MS Search and de novo Sequencing, the SRM relies on the BatchTasks.txt and SherengaBatchTasks.txt files produced by the PERL scripts.

The SRM receives data directories by CLI and parameters by CGI. The SRM sends all parameters (including data directories) to the PERL scripts by CGI, except for batchSherenga.pl (by CLI).


To Use the Workflows Form

The Workflows form is the entry point to execute and edit workflows. You can also view the parameters associated with a particular parameter file. The following topics describe options available on the Workflows form.

Data Directories

Workflow


To Use the Edit Workflow Form

This form allows you to edit a workflow or to create new workflows. To create a new workflow, edit an existing one and save the changes under a new name. See Chapter 3, Automating Workflows, of the Application Guide to learn how to create new workflows.

To get to this form:

  1. In the Workflows form, click the name of the workflow you wish to edit.
  2. Click the Edit Workflow button.
  3. Verify that the Edit Workflow form shows the name of the workflow in the red title bar, and its list of (ordered) tasks in the Workflow tasks list.
    Workflow tasks are parameter files, which you create in each form.

The following topic describes options available on the Edit Workflow form:

Edit Workflow Tasks

After you edit a workflow:

  1. Click the Save As button to save it.
  2. To see the changes in the Workflows form, click the Refresh button within that form.

To Use the Request Queue Viewer

The Request Queue Viewer shows a list of all tasks that are currently executing and those that are queued for execution. It lists the tasks in the order they were queued. Because some tasks depend upon earlier tasks, the tasks that are currently executing do not always appear at the top of the list.

Notes:


To Use the Completion Log Viewer

The Completion Log viewer shows a list of all tasks that have completed, with the most recent shown at the top. This log includes all queued requests, whether you queued them interactively or via a workflow. Two of the columns allow you to view additional information:


To Use the Open Workflow Dialog Box

The Open Workflow dialog box allows you to select a workflow in the Edit Workflow form.


To Use the Save Workflow Dialog Box

The Save Workflow dialog box allows you to save a workflow that you have created in the Edit Workflow form.


To Use the New Folder Dialog Box

The New Folder dialog box allows you to create a new folder where you save a workflow.