pfasta

pfasta is a command-line tool for filtering and sanitizing FASTA files against a range of criteria. It is installed automatically with protfasta and is invoked as pfasta from the command line.

At its simplest, pfasta takes a single FASTA file as input and writes a sanitized FASTA file as output. It can:

  • Filter out (or convert, or fail on) sequences containing non-standard amino-acid characters

  • Remove, ignore, or fail on duplicate FASTA records and duplicate sequences

  • Filter sequences by minimum and/or maximum length

  • Randomly sub-sample a set of sequences (useful for building a small test set from a large FASTA file)

  • Print summary statistics (count, median / quartile / min / max length)

  • Replace commas in FASTA headers with semicolons (helpful when the downstream pipeline treats the header as part of a CSV)

Usage

pfasta <flags> filename.fasta

Command-line options

filename
    Positional argument: path to the input FASTA file.

-o <output filename>                    (default: output.fasta)
    Output FASTA file.

--non-unique-header
    If set, multiple FASTA records are allowed to share the same header.
    By default, duplicate headers cause pfasta to fail.

--duplicate-record {ignore,fail,remove} (default: fail)
    How to deal with duplicate records (same header AND same sequence):
        fail   - raise an exception and exit
        ignore - keep all duplicate records
        remove - keep only the first occurrence

--duplicate-sequence {ignore,fail,remove} (default: ignore)
    How to deal with duplicate sequences (same sequence, any header):
        fail   - raise an exception and exit
        ignore - keep all duplicate sequences
        remove - keep only the first occurrence of each sequence
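
The difference between a duplicate record and a duplicate sequence can be illustrated with a short Python sketch. This is not protfasta's implementation: the function name ``drop_duplicates`` and the list-of-tuples representation are assumptions made for illustration, and only the ``remove`` behaviour (keep the first occurrence) is shown.

```python
def drop_duplicates(records, by="record"):
    """Keep the first occurrence of each duplicate.

    records: list of (header, sequence) tuples in file order.
    by="record"   -> duplicates share header AND sequence
    by="sequence" -> duplicates share the sequence, whatever the header
    """
    if by not in ("record", "sequence"):
        raise ValueError("by must be 'record' or 'sequence'")
    seen = set()
    kept = []
    for header, seq in records:
        key = (header, seq) if by == "record" else seq
        if key in seen:
            continue  # a later duplicate: skip it
        seen.add(key)
        kept.append((header, seq))
    return kept


# Hypothetical example data
records = [
    ("sp|A", "MKTAY"),
    ("sp|A", "MKTAY"),  # duplicate record (same header, same sequence)
    ("sp|B", "MKTAY"),  # duplicate sequence only (different header)
    ("sp|C", "GSHMA"),
]

by_record = drop_duplicates(records, by="record")
by_sequence = drop_duplicates(records, by="sequence")
print(len(by_record), len(by_sequence))
```

Removing by record keeps ``sp|B`` because its header differs, whereas removing by sequence drops it.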

--invalid-sequence <mode>                (default: fail)
    How to deal with non-standard amino-acid characters. Available
    modes:

        ignore
            Accept invalid residues without changes.

        fail
            Raise an exception on the first invalid residue.

        remove
            Discard any sequence that contains invalid residues.

        convert-all
            Apply the standard conversion table
            ``B->N, U->C, X->G, Z->Q, '*'->'', '-'->''``
            and raise an exception if any residues remain invalid
            afterwards.

        convert-res
            Same as ``convert-all`` but keeps the alignment gap
            character ``'-'`` untouched.

        convert-all-ignore
            Same as ``convert-all`` but silently keeps any residues
            that remain invalid after conversion.

        convert-res-ignore
            Same as ``convert-res`` but silently keeps any residues
            that remain invalid after conversion.

        convert-all-remove
            Same as ``convert-all`` but removes any sequence that
            still contains invalid residues after conversion.

        convert-res-remove
            Same as ``convert-res`` but removes any sequence that
            still contains invalid residues after conversion.
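
As a rough illustration of how the two conversion tables differ, the Python sketch below applies them to a made-up sequence. It is a minimal sketch rather than protfasta's own code: the names ``convert`` and ``invalid_residues`` are hypothetical, and the set of valid residues is assumed to be the 20 standard amino acids.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # assumed: the 20 standard amino acids

# Conversion table as listed above; '*' and '-' map to the empty string.
CONVERT_ALL = {"B": "N", "U": "C", "X": "G", "Z": "Q", "*": "", "-": ""}
# convert-res: same table, but the gap character '-' is left untouched.
CONVERT_RES = {k: v for k, v in CONVERT_ALL.items() if k != "-"}


def convert(seq, table):
    """Replace every character found in `table`; keep all others as they are."""
    return "".join(table.get(ch, ch) for ch in seq)


def invalid_residues(seq, allowed=STANDARD):
    """Return the sorted list of characters in `seq` that are not allowed."""
    return sorted(set(seq) - allowed)


raw = "MB-XZU*J"  # hypothetical input; 'J' is not covered by the table
all_converted = convert(raw, CONVERT_ALL)
res_converted = convert(raw, CONVERT_RES)
leftover = invalid_residues(all_converted)
print(all_converted, res_converted, leftover)
```

In this example ``J`` survives conversion, so the mode suffix decides the outcome: ``convert-all`` would raise an exception, ``convert-all-ignore`` would keep the sequence as it is, and ``convert-all-remove`` would drop it.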

--number-lines <int>                     (default: 60)
    Number of residues per line in the output FASTA file. Must be
    at least 5.

--shortest-seq <int>                     (default: none)
    Minimum sequence length to include. Sequences shorter than this
    length are discarded.

--longest-seq <int>                      (default: none)
    Maximum sequence length to include. Sequences longer than this
    length are discarded. If both ``--longest-seq`` and
    ``--shortest-seq`` are given, ``--longest-seq`` must be larger.

--random-subsample <int>                 (default: none)
    Randomly sub-sample this many sequences from the final set.
    Useful for generating small test FASTA files from large inputs.
    If the input contains fewer sequences than requested, all
    sequences are returned.
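
The sub-sampling rule described above, including the fallback when too few sequences are available, can be sketched in Python as follows. The function name ``subsample`` and the ``seed`` parameter are hypothetical, and sampling without replacement is an assumption; pfasta's internals may differ.

```python
import random


def subsample(records, n, seed=None):
    """Return n randomly chosen records, or all of them if fewer than n exist."""
    if n >= len(records):
        return list(records)
    rng = random.Random(seed)  # seed is only here to make the sketch repeatable
    return rng.sample(records, n)  # sampling without replacement (assumed)


# Hypothetical example data: 20 records of increasing length
records = [(f"seq{i}", "M" * (10 + i)) for i in range(20)]
small = subsample(records, 5, seed=0)
everything = subsample(records, 100)
print(len(small), len(everything))
```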

--print-statistics
    Print length statistics (count, 25th / 50th / 75th percentile,
    longest, shortest) for the final set of sequences.
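
For readers who want to reproduce these numbers outside pfasta, a comparable summary can be computed with Python's standard library. This is a sketch under assumptions: the function name ``length_summary`` is hypothetical, and the percentile interpolation method used here may not match the one pfasta uses.

```python
import statistics


def length_summary(sequences):
    """Summarise sequence lengths: count, quartiles, shortest and longest."""
    lengths = sorted(len(s) for s in sequences)
    # quantiles(n=4) returns the three cut points: 25th, 50th, 75th percentile.
    # It needs at least two sequences on older Python versions.
    q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "count": len(lengths),
        "p25": q25,
        "p50": q50,
        "p75": q75,
        "shortest": lengths[0],
        "longest": lengths[-1],
    }


# Hypothetical example data: five sequences of length 10, 20, 30, 40, 50
summary = length_summary(["M" * n for n in (10, 20, 30, 40, 50)])
print(summary)
```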

--no-outputfile
    Do not write an output FASTA file. Useful together with
    ``--print-statistics`` for pure summary runs.

--remove-comma-from-header
    Replace ``,`` with ``;`` in every FASTA header on read. Useful
    when downstream tools parse FASTA headers as CSV fields.

--silent
    Suppress all ``[INFO]`` output to stdout.

--version
    Print the installed **protfasta** version and exit.

Examples

Clean up a FASTA file by removing duplicate records and converting non-standard residues, writing the result to clean.fasta:

pfasta --duplicate-record remove --invalid-sequence convert-all \
       -o clean.fasta input.fasta

Keep only sequences between 50 and 500 residues long and randomly sub-sample 1000 of them:

pfasta --shortest-seq 50 --longest-seq 500 \
       --random-subsample 1000 \
       -o subset.fasta input.fasta

Just print length statistics, without writing a file:

pfasta --print-statistics --no-outputfile input.fasta