read_fasta

read_fasta is the primary entry point to protfasta. It reads a FASTA file, optionally sanitizes its contents, and returns the records either as a dictionary (header -> sequence) or as a list of [header, sequence] pairs.

At its simplest:

import protfasta
sequences = protfasta.read_fasta('proteins.fasta')

Optional keyword arguments customize the behaviour, including duplicate handling, invalid-residue correction, alignment-gap support, custom header parsing, and automatic writing of the sanitized output.

What read_fasta can do

  • Ignore, remove, convert, or fail on sequences containing non-standard amino-acid characters (B, U, X, Z, *, -, ' ', …).

  • Ignore, remove, or fail on duplicate FASTA records (same header and same sequence).

  • Ignore, remove, or fail on duplicate sequences (same sequence, different headers).

  • Preserve alignment gap characters (-) when alignment=True.

  • Apply a caller-supplied header_parser function to every raw header (useful for extracting accession IDs).

  • Optionally write the sanitized result to a new FASTA file via output_filename.

  • Override the built-in invalid-character conversion table via a custom correction_dictionary.
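As an illustration of the header_parser option, the sketch below extracts the accession from UniProt-style headers. The names uniprot_accession and read_by_accession are hypothetical helpers, not part of protfasta, and the header layout (db|accession|entry-name) is an assumption about the input file. The parser falls back to the unchanged header so that it also tolerates arbitrary strings, such as the dummy string used by the check_header_parser pre-check.

```python
def uniprot_accession(header: str) -> str:
    """Return the accession from a header such as 'sp|P04637|P53_HUMAN ...'."""
    parts = header.split("|")
    # Fall back to the unchanged header when the expected layout is absent,
    # so the parser never raises on input it does not recognise.
    return parts[1] if len(parts) >= 3 else header


def read_by_accession(path: str) -> dict:
    """Read a FASTA file keyed by accession instead of by full header."""
    # Imported lazily so the parser above can be exercised without protfasta.
    import protfasta

    return protfasta.read_fasta(path, header_parser=uniprot_accession)
```

Because the parser runs before any uniqueness check, two records whose full headers differ but whose accessions match would count as duplicate headers.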

Processing pipeline

Sanitization happens in a fixed order:

  1. File is streamed from disk, headers are parsed with header_parser (if provided), and header uniqueness is checked (when expect_unique_header=True).

  2. Duplicate records are processed according to duplicate_record_action.

  3. Duplicate sequences are processed according to duplicate_sequence_action.

  4. Invalid residues are processed according to invalid_sequence_action.

  5. The sanitized set is optionally written to output_filename.

  6. The result is returned as a dict (default) or a list of [header, sequence] pairs (when return_list=True).
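The order of steps 2 and 3 matters: records are de-duplicated before sequences are. The following is a minimal pure-Python sketch of the 'remove' policy for both steps. It is not protfasta's implementation, and the function names are made up for illustration.

```python
def remove_duplicate_records(records):
    """Keep the first occurrence of each (header, sequence) pair."""
    seen = set()
    kept = []
    for header, sequence in records:
        if (header, sequence) not in seen:
            seen.add((header, sequence))
            kept.append([header, sequence])
    return kept


def remove_duplicate_sequences(records):
    """Keep the first record for each distinct sequence, whatever its header."""
    seen = set()
    kept = []
    for header, sequence in records:
        if sequence not in seen:
            seen.add(sequence)
            kept.append([header, sequence])
    return kept


records = [["a", "MKV"], ["a", "MKV"], ["b", "MKV"], ["c", "MAA"]]
after_step_2 = remove_duplicate_records(records)         # drops the second 'a'
after_step_3 = remove_duplicate_sequences(after_step_2)  # then drops 'b'
```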

Incompatible option combinations (for example, expect_unique_header=True together with duplicate_record_action='ignore') are caught before the file is read.
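A pre-flight check of this kind might look like the sketch below. It covers only the one combination named above, uses ValueError as a stand-in for the library's own exception type, and check_options is a hypothetical name rather than a protfasta function.

```python
VALID_DUPLICATE_ACTIONS = ("ignore", "fail", "remove")


def check_options(expect_unique_header: bool, duplicate_record_action: str) -> None:
    """Raise ValueError for option combinations that cannot both be honoured."""
    if duplicate_record_action not in VALID_DUPLICATE_ACTIONS:
        raise ValueError(
            f"unknown duplicate_record_action: {duplicate_record_action!r}"
        )
    # Keeping every duplicate record implies repeated headers, which
    # contradicts the expectation that all headers are unique.
    if expect_unique_header and duplicate_record_action == "ignore":
        raise ValueError(
            "duplicate_record_action='ignore' requires expect_unique_header=False"
        )
```

Running such checks before opening the file means a misconfigured call fails immediately instead of after a long parse.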

Default conversion table

When invalid_sequence_action includes conversion and no custom correction_dictionary is supplied, these replacements are applied:

  • B -> N

  • U -> C

  • X -> G

  • Z -> Q

  • * -> '' (removed)

  • - -> '' (removed; preserved if alignment=True)

  • ' ' -> '' (whitespace removed)
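The effect of the table above can be reproduced with Python's built-in str.translate, as in this sketch. It is an illustration of the mapping, not the library's code. convert_residues is a hypothetical helper, and skipping '-' for a custom table when alignment=True is an assumption based on the description of the alignment option.

```python
DEFAULT_TABLE = {"B": "N", "U": "C", "X": "G", "Z": "Q", "*": "", "-": "", " ": ""}


def convert_residues(sequence: str, alignment: bool = False, table=None) -> str:
    """Apply a character-to-replacement table to a sequence."""
    # Copy so that neither the default table nor the caller's dict is mutated.
    mapping = dict(DEFAULT_TABLE if table is None else table)
    if alignment:
        # Gap characters are valid in alignments, so leave them untouched.
        mapping.pop("-", None)
    # An empty replacement string deletes the character.
    return sequence.translate(str.maketrans(mapping))


converted = convert_residues("MBXZ*U- A")                  # 'MNGQCA'
aligned = convert_residues("MBXZ*U- A", alignment=True)    # 'MNGQC-A'
```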

Large files

For files that do not fit comfortably in memory, consider the streaming parser protfasta.iter_fasta() instead. read_fasta streams the file from disk, so it never holds the whole file as a single string, but it still builds an in-memory data structure containing every record; iter_fasta avoids that.
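To show what record-by-record streaming means in practice, here is a generic generator sketch. It is not protfasta's iter_fasta, whose signature is not described here, and the name iter_records is made up for illustration. Only one record is held in memory at a time.

```python
import io
from typing import Iterable, Iterator, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one at a time from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Any iterable of lines works, including an open file handle.
handle = io.StringIO(">seq1\nMKV\nLAA\n>seq2\nMAA\n")
longest = max(iter_records(handle), key=lambda record: len(record[1]))
```

Aggregations such as the longest-sequence search above never need the full record set in memory.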

For usage examples see the Examples page. Full API documentation is shown below.

Documentation

protfasta.read_fasta(filename: str, expect_unique_header: bool = True, header_parser: Callable[[str], str] | None = None, check_header_parser: bool = True, duplicate_sequence_action: str = 'ignore', duplicate_record_action: str = 'fail', invalid_sequence_action: str = 'fail', alignment: bool = False, return_list: bool = False, output_filename: str | None = None, correction_dictionary: dict[str, str] | None = None, verbose: bool = False) -> dict[str, str] | list[list[str]]

Read a FASTA file, sanitize sequences, and return a dict or list.

This is the primary entry point for protfasta. At its simplest:

sequences = read_fasta('proteins.fasta')

returns a dictionary whose keys are FASTA headers and whose values are amino-acid sequences. Many optional parameters allow automatic handling of duplicates, invalid residues, alignment gap characters, and more.

Sanitization is applied in the following order:

  1. File is read, custom headers are parsed, and header uniqueness is checked (when expect_unique_header is True).

  2. Duplicate records are processed (duplicate_record_action).

  3. Duplicate sequences are processed (duplicate_sequence_action).

  4. Invalid residues are processed (invalid_sequence_action).

  5. Final sequences are optionally written to output_filename.

  6. A dictionary or list is returned to the caller.

Incompatible option combinations are caught before the file is read.

Parameters:
  • filename (str) – Path to the FASTA file to read.

  • expect_unique_header (bool, optional) – If True (default), an exception is raised when a duplicate header is encountered during parsing. Set to False when the file is known to contain duplicate headers. In that case, return_list should typically be True as well, so that no entries are silently lost to dictionary-key overwriting.

  • header_parser (callable or None, optional) – A function (str) -> str applied to every raw header before any uniqueness checks. Useful for extracting accession IDs from structured headers. When check_header_parser is True (the default) the function is smoke-tested with the string 'this test string should work' before parsing begins.

  • check_header_parser (bool, optional) – If True (default), header_parser is tested with a dummy string before the file is read to catch obvious problems early. Set to False to skip this pre-check.

  • duplicate_record_action (str, optional) – How to handle records that are identical in both header and sequence. Default 'fail'.

    • 'ignore' – keep all occurrences (requires expect_unique_header=False).

    • 'fail' – raise an exception.

    • 'remove' – keep only the first occurrence.

  • duplicate_sequence_action (str, optional) – How to handle entries that share the same sequence regardless of header. Default 'ignore'.

    • 'ignore' – keep all occurrences.

    • 'fail' – raise an exception.

    • 'remove' – keep only the first occurrence.

  • invalid_sequence_action (str, optional) – How to handle sequences containing non-standard amino-acid characters. Default 'fail'.

    • 'ignore' – silently accept invalid residues.

    • 'fail' – raise an exception.

    • 'remove' – discard the entire sequence.

    • 'convert' – convert non-standard residues using correction_dictionary (or built-in defaults); fail if any unconvertible residues remain.

    • 'convert-ignore' – convert what can be converted, then ignore any remaining invalid residues.

    • 'convert-remove' – convert what can be converted, then discard sequences that still contain invalid residues.

  • alignment (bool, optional) – If True, dash ('-') characters are treated as valid gap characters and are neither flagged as invalid nor converted. Default False.

  • return_list (bool, optional) – If True, return a list of [header, sequence] pairs instead of a dictionary. Required when duplicate headers are present and you want to keep all of them. Default False.

  • output_filename (str or None, optional) – If provided, the final (sanitized) set of sequences is written to a new FASTA file at this path before the function returns.

  • correction_dictionary (dict or None, optional) – A mapping of non-standard characters to replacement strings used when invalid_sequence_action involves conversion. When None, the built-in table is used:

    • B -> N, U -> C, X -> G, Z -> Q

    • * -> '', - -> '', ' ' -> ''

    A custom dictionary replaces the built-in table entirely.

  • verbose (bool, optional) – If True, informational messages are printed to stdout during each processing step. Default False.
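The six invalid_sequence_action modes described above can be summarized in a small decision function. This is a sketch of the documented semantics, not the library's implementation. handle_invalid is a hypothetical name, the 20-letter STANDARD alphabet is an assumption about what counts as valid, and ValueError stands in for the library's exception.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
ACTIONS = {"ignore", "fail", "remove", "convert", "convert-ignore", "convert-remove"}


def handle_invalid(sequence: str, action: str, table: dict):
    """Return the sequence to keep, or None to drop it; raise ValueError to fail."""
    if action not in ACTIONS:
        raise ValueError(f"unknown invalid_sequence_action: {action!r}")
    if action == "ignore":
        return sequence
    if action.startswith("convert"):
        sequence = sequence.translate(str.maketrans(table))
    # Valid sequences always pass; 'convert-ignore' also accepts leftovers.
    if set(sequence) <= STANDARD or action == "convert-ignore":
        return sequence
    if action in ("remove", "convert-remove"):
        return None
    # Remaining cases are 'fail' and 'convert' with residues still invalid.
    raise ValueError(f"invalid residues in sequence: {sequence}")
```

For example, with table = {'B': 'N'} the sequence 'MKBJ' becomes 'MKNJ' under 'convert-ignore', is dropped under 'convert-remove', and raises under 'convert', because J has no replacement.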

Returns:

When return_list is False (default), a dictionary mapping headers to sequences. When True, a list of two-element lists [header, sequence]. In both cases, the returned records appear in the same order as in the original file.
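The difference between the two return shapes matters most when headers repeat. The snippet below uses plain Python data, with no protfasta call, to show how converting [header, sequence] pairs into a dictionary silently drops an entry. It demonstrates generic dict behaviour only; which occurrence the library itself would keep is not specified here.

```python
# Records as they might appear in list form, with 'seq1' occurring twice.
records = [["seq1", "MKV"], ["seq1", "MAA"], ["seq2", "MLL"]]

# Building a dict keeps one value per key: later entries overwrite earlier ones.
as_dict = {header: sequence for header, sequence in records}

lost = len(records) - len(as_dict)  # number of records that vanished
```

The list form keeps all three records, which is why return_list=True is the safe choice for files with repeated headers.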

Return type:

dict[str, str] or list[list[str]]

Raises:

ProtfastaException – If any validation check fails or incompatible options are provided.