iter_fasta
iter_fasta is a memory-efficient streaming parser for very large
FASTA files (tens or hundreds of millions of sequences) where loading
the entire dataset into memory with protfasta.read_fasta() is not
feasible.
Unlike read_fasta, iter_fasta:
- Yields (header, sequence) tuples one at a time instead of returning a fully materialised dict/list.
- Keeps peak memory usage to O(single record), regardless of how large the input file is.
- Performs no duplicate detection, invalid-residue handling, or alignment-gap logic. Callers who need those features should apply them record-by-record as they consume the iterator.
If you are working with a FASTA file that fits comfortably in memory,
prefer protfasta.read_fasta() - it offers full sanitization and
is only marginally slower per record.
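To make the streaming behaviour described above concrete, here is a minimal pure-Python sketch of a generator that holds only one record at a time. It is an illustration under stated assumptions, not protfasta's actual implementation: the name stream_fasta is hypothetical, headers are assumed to be yielded without the leading '>', and blank lines are assumed to be skipped. Upper-casing follows the behaviour documented for iter_fasta.

```python
import io
from typing import Callable, Iterator, Optional, TextIO, Tuple


def stream_fasta(handle: TextIO,
                 header_parser: Optional[Callable[[str], str]] = None
                 ) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open text handle, one at a time.

    Hypothetical helper for illustration; not part of protfasta.
    """
    header = None
    chunks = []  # sequence lines of the current record only
    for line in handle:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith('>'):
            if header is not None:
                # a new header closes the previous record
                yield header, ''.join(chunks).upper()
            header = line[1:]  # assumption: drop the leading '>'
            if header_parser is not None:
                header = header_parser(header)
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # emit the final record once the file is exhausted
        yield header, ''.join(chunks).upper()


# Small in-memory demonstration instead of a real file
demo = io.StringIO(">seq1 first\nacde\nFGHI\n>seq2\nKLMN\n")
records = list(stream_fasta(demo))
```

Because the generator only ever stores the lines of the record currently being read, peak memory depends on the longest single record rather than on the size of the file.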
Basic usage
import protfasta
for header, sequence in protfasta.iter_fasta('huge.fasta'):
    # process each sequence on the fly
    if len(sequence) > 1000:
        print(header, len(sequence))
Using a header parser
The optional header_parser argument is a callable (str) -> str
applied to every raw header before it is yielded. This is commonly
used to extract a UniProt accession from a structured header:
import protfasta
def uniprot_id(header):
    # '>sp|P12345|NAME_HUMAN ...' -> 'P12345'
    return header.split('|')[1]

for acc, seq in protfasta.iter_fasta('uniprot.fasta',
                                     header_parser=uniprot_id):
    ...
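The parser above raises IndexError on any header that contains no '|' character, which can happen when a file mixes UniProt-style and free-form headers. A more defensive variant might look like the sketch below. The function name uniprot_id_or_header is hypothetical, and falling back to the unchanged header is only one possible policy.

```python
def uniprot_id_or_header(header: str) -> str:
    """Return the accession from 'db|ACCESSION|NAME ...' headers.

    Headers that do not follow that layout are returned unchanged.
    Hypothetical helper for illustration; not part of protfasta.
    """
    # Tolerate a leading '>' in case the raw header still carries it
    parts = header.lstrip('>').split('|')
    if len(parts) >= 3 and parts[1]:
        return parts[1]
    return header


parsed = [uniprot_id_or_header(h) for h in (
    'sp|P12345|NAME_HUMAN Some protein',
    'my_custom_sequence 1',
)]
```

Since it is still a plain (str) -> str callable, it can be passed as header_parser in the same way as uniprot_id.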
Streaming filtering + writing
Because write_fasta accepts a list of [header, sequence] pairs,
you can stream through an input file, keep only the records you care
about, and write them out. Memory use then scales with the number of
retained records rather than with the size of the input file:
import protfasta
keep = []
for header, seq in protfasta.iter_fasta('huge.fasta'):
    if 100 <= len(seq) <= 500:
        keep.append([header, seq])
        # optionally flush every N records to bound memory further
protfasta.write_fasta(keep, 'filtered.fasta')
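The example above collects every retained record in keep before writing, so memory still grows with the number of matches. The sketch below shows one way the "flush every N records" idea could be done. It deliberately uses plain file writes rather than write_fasta, since it makes no assumption about whether write_fasta supports appending. The function name write_filtered and the batch_size parameter are hypothetical, and sequences are written on a single line without wrapping.

```python
import os
import tempfile
from typing import Iterable, Tuple


def write_filtered(records: Iterable[Tuple[str, str]], out_path: str,
                   min_len: int = 100, max_len: int = 500,
                   batch_size: int = 10_000) -> int:
    """Write records with min_len <= len(seq) <= max_len; return how many were kept.

    Hypothetical helper for illustration; not part of protfasta.
    """
    kept = 0
    batch = []  # holds at most batch_size formatted records
    with open(out_path, 'w') as out:
        for header, seq in records:
            if min_len <= len(seq) <= max_len:
                batch.append(f'>{header}\n{seq}\n')
                kept += 1
                if len(batch) >= batch_size:
                    out.writelines(batch)
                    batch.clear()
        out.writelines(batch)  # flush the final partial batch
    return kept


# Tiny demonstration with in-memory records; in practice `records`
# would be the iterator returned by protfasta.iter_fasta(...)
demo_records = [('short', 'ACD'), ('ok', 'A' * 120), ('long', 'A' * 900)]
out_file = os.path.join(tempfile.mkdtemp(), 'filtered.fasta')
n_kept = write_filtered(demo_records, out_file, batch_size=1)
with open(out_file) as fh:
    written = fh.read()
```

With this layout, memory is bounded by batch_size records regardless of how many records pass the filter.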
Documentation
- protfasta.iter_fasta(filename: str, header_parser: Callable[[str], str] | None = None)
Yield (header, sequence) pairs from a FASTA file, streaming.
This is a memory-efficient alternative to protfasta.read_fasta() designed for very large files (hundreds of millions of sequences) where holding the entire dataset in memory is not feasible.
No duplicate detection, invalid-residue handling, or alignment-gap logic is performed – each record is yielded as parsed. Callers that need those features should consume the iterator and apply their own filtering.
- Parameters:
filename (str) – Path to a FASTA file.
header_parser (callable or None, optional) – Optional (str) -> str transform applied to every raw header.
- Yields:
tuple[str, str] – (header, sequence) pairs in the order they appear in the file. Sequences are upper-cased.
- Raises:
ProtfastaException – If the file cannot be opened.
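Since iter_fasta leaves duplicate detection, invalid-residue handling, and alignment-gap logic to the caller, those checks can be layered on top of the iterator as a second generator. The sketch below is one possible policy, not a reproduction of what read_fasta does: the name checked_records is hypothetical, duplicates are assumed to raise, gaps ('-') are assumed to be stripped, and records containing anything outside the 20 standard amino acids are assumed to be skipped.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: only the 20 standard amino acids count as valid
VALID_RESIDUES = frozenset('ACDEFGHIKLMNPQRSTVWY')


def checked_records(records: Iterable[Tuple[str, str]]
                    ) -> Iterator[Tuple[str, str]]:
    """Apply per-record checks to a stream of (header, sequence) pairs.

    Hypothetical helper for illustration; not part of protfasta.
    """
    seen = set()  # grows with the number of distinct headers
    for header, seq in records:
        if header in seen:
            raise ValueError(f'duplicate header: {header}')
        seen.add(header)
        seq = seq.replace('-', '')  # strip alignment gaps
        if not set(seq) <= VALID_RESIDUES:
            continue  # skip records with non-standard residues
        yield header, seq


# In practice `demo` would be protfasta.iter_fasta('huge.fasta')
demo = [('a', 'AC-DE'), ('b', 'ACXZ'), ('c', 'MKV')]
clean = list(checked_records(demo))
```

Note that tracking seen headers costs memory proportional to the number of records, so duplicate detection is the one check here that is not O(single record).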