read_fasta
read_fasta is the primary entry point to protfasta. It reads a
FASTA file, optionally sanitizes its contents, and returns the records
either as a dictionary (header -> sequence) or as a list of
[header, sequence] pairs.
At its simplest:
import protfasta
sequences = protfasta.read_fasta('proteins.fasta')
Many optional keyword arguments customize the behaviour, including duplicate handling, invalid-residue correction, alignment-gap support, custom header parsing, and automatic writing of the sanitized output.
What read_fasta can do
- Ignore, remove, convert, or fail on sequences containing non-standard amino-acid characters (B, U, X, Z, *, -, ' ', …).
- Ignore, remove, or fail on duplicate FASTA records (same header and same sequence).
- Ignore, remove, or fail on duplicate sequences (same sequence, different headers).
- Preserve alignment gap characters (-) when alignment=True.
- Apply a caller-supplied header_parser function to every raw header (useful for extracting accession IDs).
- Optionally write the sanitized result to a new FASTA file via output_filename.
- Override the built-in invalid-character conversion table via a custom correction_dictionary.
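As an illustration of the header_parser option, here is a minimal sketch of a parser that pulls the accession out of a pipe-delimited header. The header layout ('db|ACCESSION|name ...') and the names accession_from_header and read_with_accessions are assumptions made for this example, not part of protfasta. The parser returns headers without that layout unchanged, because the parameter documentation states that the function is first called with the string 'this test string should work' unless check_header_parser=False.

```python
def accession_from_header(header: str) -> str:
    """Return the accession from a 'db|ACCESSION|name ...' header.

    Headers that do not have at least two '|' separators are returned
    unchanged, so arbitrary strings do not raise an IndexError.
    """
    parts = header.split("|")
    if len(parts) >= 3:
        return parts[1]
    return header


def read_with_accessions(path: str):
    """Hypothetical wrapper: read a FASTA file keyed by accession."""
    # Imported lazily so the parser above can be tested without protfasta.
    import protfasta

    return protfasta.read_fasta(path, header_parser=accession_from_header)
```

A stricter parser that indexes the split result directly would fail the pre-check on a string with no '|' characters, which is why the fallback branch is there.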
Processing pipeline
Sanitization happens in a fixed order:
1. File is streamed from disk, headers are parsed with header_parser (if provided), and header uniqueness is checked (when expect_unique_header=True).
2. Duplicate records are processed according to duplicate_record_action.
3. Duplicate sequences are processed according to duplicate_sequence_action.
4. Invalid residues are processed according to invalid_sequence_action.
5. The sanitized set is optionally written to output_filename.
6. The result is returned as a dict (default) or a list of [header, sequence] pairs (when return_list=True).
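The ordering matters because each stage sees only what the previous stage kept. The toy function below is a sketch of that ordering in plain Python, with every stage behaving like the 'remove' action. It is not protfasta's implementation, and the name sanitize and the 20-letter valid alphabet are assumptions made for illustration.

```python
def sanitize(records, valid="ACDEFGHIKLMNPQRSTVWY"):
    """Apply the three filtering stages in order to [header, sequence] pairs."""
    # First: drop repeated (header, sequence) records, keeping the first.
    seen_records = set()
    unique_records = []
    for header, seq in records:
        if (header, seq) not in seen_records:
            seen_records.add((header, seq))
            unique_records.append([header, seq])

    # Then: drop repeated sequences regardless of header, keeping the first.
    seen_sequences = set()
    unique_sequences = []
    for header, seq in unique_records:
        if seq not in seen_sequences:
            seen_sequences.add(seq)
            unique_sequences.append([header, seq])

    # Finally: discard sequences containing characters outside `valid`.
    allowed = set(valid)
    return [[h, s] for h, s in unique_sequences if set(s) <= allowed]


example = [["a", "MKV"], ["a", "MKV"], ["b", "MKV"], ["c", "MXV"], ["d", "ACD"]]
cleaned = sanitize(example)
```

Here the second "a" record is dropped as a duplicate record, "b" is dropped as a duplicate sequence, and "c" is dropped for containing X.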
Incompatible option combinations (for example,
expect_unique_header=True together with
duplicate_record_action='ignore') are caught before the file is
read.
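Checking options before any I/O means a bad call fails immediately rather than after a long parse. The sketch below shows the idea for the one incompatible pair named above. The function name check_options and the use of ValueError are assumptions for this example; protfasta itself raises ProtfastaException.

```python
def check_options(expect_unique_header: bool, duplicate_record_action: str) -> None:
    """Reject invalid or incompatible settings before touching the file."""
    allowed = {"ignore", "fail", "remove"}
    if duplicate_record_action not in allowed:
        raise ValueError(
            f"duplicate_record_action must be one of {sorted(allowed)}"
        )
    # Keeping every duplicate record is impossible if headers must be unique.
    if expect_unique_header and duplicate_record_action == "ignore":
        raise ValueError(
            "duplicate_record_action='ignore' requires expect_unique_header=False"
        )
```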
Default conversion table
When invalid_sequence_action includes conversion and no custom
correction_dictionary is supplied, these replacements are applied:
B -> N
U -> C
X -> G
Z -> Q
* -> '' (removed)
- -> '' (removed; preserved if alignment=True)
' ' -> '' (whitespace removed)
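The effect of the table can be reproduced with a character-by-character translation. This is a minimal sketch using the replacements listed above. The names DEFAULT_TABLE and convert are hypothetical, and the code is not taken from protfasta.

```python
# Replacements copied from the default conversion table in this section.
DEFAULT_TABLE = {"B": "N", "U": "C", "X": "G", "Z": "Q", "*": "", "-": "", " ": ""}


def convert(sequence: str, alignment: bool = False) -> str:
    """Apply the default replacements; keep '-' when alignment is True."""
    table = dict(DEFAULT_TABLE)
    if alignment:
        # Gap characters are treated as valid, so they are left untouched.
        table.pop("-")
    return sequence.translate(str.maketrans(table))


plain = convert("ABU*X-Z K")
aligned = convert("ABU*X-Z K", alignment=True)
```

With this input, plain is "ANCGQK" and aligned is "ANCG-QK".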
Large files
For files that do not fit comfortably in memory, consider using the
streaming parser protfasta.iter_fasta() instead. read_fasta
itself streams the file from disk (so it will not load the entire file
as a single string), but it still builds an in-memory data structure
of all records; iter_fasta avoids that.
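To make the distinction concrete, the generator below sketches what record-at-a-time parsing looks like in plain Python: only the current record is held in memory. It is not the implementation of protfasta.iter_fasta, whose signature is not described in this section; the name iter_records is an assumption for this example.

```python
from typing import Iterable, Iterator, Tuple


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one at a time from FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)
```

Because an open file object is itself an iterable of lines, it can be passed directly, for example `for header, seq in iter_records(open_file): ...`, and each record can be processed and discarded before the next is read.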
For usage examples see the Examples page. Full API documentation is shown below.
Documentation
- protfasta.read_fasta(filename: str, expect_unique_header: bool = True, header_parser: Callable[[str], str] | None = None, check_header_parser: bool = True, duplicate_sequence_action: str = 'ignore', duplicate_record_action: str = 'fail', invalid_sequence_action: str = 'fail', alignment: bool = False, return_list: bool = False, output_filename: str | None = None, correction_dictionary: dict[str, str] | None = None, verbose: bool = False) -> dict[str, str] | list[list[str]]
Read a FASTA file, sanitize sequences, and return a dict or list.
This is the primary entry point for protfasta. At its simplest:
sequences = read_fasta('proteins.fasta')
returns a dictionary whose keys are FASTA headers and whose values are amino-acid sequences. Many optional parameters allow automatic handling of duplicates, invalid residues, alignment gap characters, and more.
Sanitization is applied in the following order:
1. File is read, custom headers are parsed, and header uniqueness is checked (when expect_unique_header is True).
2. Duplicate records are processed (duplicate_record_action).
3. Duplicate sequences are processed (duplicate_sequence_action).
4. Invalid residues are processed (invalid_sequence_action).
5. Final sequences are optionally written to output_filename.
6. A dictionary or list is returned to the caller.
Incompatible option combinations are caught before the file is read.
- Parameters:
filename (str) – Path to the FASTA file to read.
expect_unique_header (bool, optional) – If True (default), an exception is raised when a duplicate header is encountered during parsing. Set to False when the file is known to contain duplicate headers – in that case return_list should typically be True as well so that no entries are silently lost via dictionary-key overwriting.
header_parser (callable or None, optional) – A function (str) -> str applied to every raw header before any uniqueness checks. Useful for extracting accession IDs from structured headers. When check_header_parser is True (the default) the function is smoke-tested with the string 'this test string should work' before parsing begins.
check_header_parser (bool, optional) – If True (default), header_parser is tested with a dummy string before the file is read to catch obvious problems early. Set to False to skip this pre-check.
duplicate_record_action (str, optional) – How to handle records that are identical in both header and sequence. Default 'fail'.
'ignore' – keep all occurrences (requires expect_unique_header=False).
'fail' – raise an exception.
'remove' – keep only the first occurrence.
duplicate_sequence_action (str, optional) – How to handle entries that share the same sequence regardless of header. Default 'ignore'.
'ignore' – keep all occurrences.
'fail' – raise an exception.
'remove' – keep only the first occurrence.
invalid_sequence_action (str, optional) – How to handle sequences containing non-standard amino-acid characters. Default 'fail'.
'ignore' – silently accept invalid residues.
'fail' – raise an exception.
'remove' – discard the entire sequence.
'convert' – convert non-standard residues using correction_dictionary (or built-in defaults); fail if any unconvertible residues remain.
'convert-ignore' – convert what can be converted, then ignore any remaining invalid residues.
'convert-remove' – convert what can be converted, then discard sequences that still contain invalid residues.
alignment (bool, optional) – If True, dash ('-') characters are treated as valid gap characters and are neither flagged as invalid nor converted. Default False.
return_list (bool, optional) – If True, return a list of [header, sequence] pairs instead of a dictionary. Required when duplicate headers are present and you want to keep all of them. Default False.
output_filename (str or None, optional) – If provided, the final (sanitized) set of sequences is written to a new FASTA file at this path before the function returns.
correction_dictionary (dict or None, optional) – A mapping of non-standard characters to replacement strings used when invalid_sequence_action involves conversion. When None, the built-in table is used: B -> N, U -> C, X -> G, Z -> Q, * -> '', - -> '', ' ' -> ''. A custom dictionary replaces the built-in table entirely.
verbose (bool, optional) – If True, informational messages are printed to stdout during each processing step. Default False.
- Returns:
When return_list is False (default), a dictionary mapping headers to sequences. When True, a list of two-element lists [header, sequence]. Ordering always matches the original file.
- Return type:
dict[str, str] or list[list[str]]
- Raises:
ProtfastaException – If any validation check fails or incompatible options are provided.
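The advice to pair expect_unique_header=False with return_list=True follows from how Python dictionaries work, which the short sketch below demonstrates with plain data. The record values are invented for illustration and no protfasta call is made.

```python
# Three records in file order; the first two share a header.
records = [["seq1", "MKV"], ["seq1", "ACD"], ["seq2", "GHI"]]

# Building a dictionary keeps only the last value stored under each key.
as_dict = {header: sequence for header, sequence in records}

kept_in_list = len(records)   # every record survives in list form
kept_in_dict = len(as_dict)   # one "seq1" entry has been overwritten
```

The list keeps all three records, while the dictionary keeps two and silently loses the "MKV" entry for seq1.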