iter_fasta

iter_fasta is a memory-efficient streaming parser for very large FASTA files (tens or hundreds of millions of sequences) where loading the entire dataset into memory with protfasta.read_fasta() is not feasible.

Unlike read_fasta, iter_fasta:

  • Yields (header, sequence) tuples one at a time instead of returning a fully materialised dict / list.

  • Keeps peak memory usage to O(single record), regardless of how large the input file is.

  • Performs no duplicate detection, invalid-residue handling, or alignment-gap logic. Callers who need those features should apply them record-by-record as they consume the iterator.
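
Because iter_fasta leaves validation to the caller, such checks can be layered on as a generator that wraps the iterator. The following is a minimal sketch rather than part of protfasta: the helper name clean_records and the VALID_RESIDUES set are hypothetical, and the wrapper accepts any iterable of (header, sequence) pairs, so a small list stands in for the output of iter_fasta here.

```python
# Hypothetical helper, not part of protfasta.
# Assumption: only the 20 standard amino acids count as valid residues.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def clean_records(records):
    """Yield (header, sequence) pairs, skipping repeated headers and
    sequences that contain anything outside VALID_RESIDUES."""
    seen = set()
    for header, sequence in records:
        if header in seen:
            continue  # duplicate header: keep only the first occurrence
        seen.add(header)
        if not set(sequence) <= VALID_RESIDUES:
            continue  # invalid residue or alignment gap ('-') present
        yield header, sequence

# Stand-in for protfasta.iter_fasta('huge.fasta')
example = [("a", "ACDE"), ("a", "ACDE"), ("b", "AC-DE"), ("c", "MKV")]
cleaned = list(clean_records(example))
```

Note that duplicate detection has to remember every header seen so far, so this wrapper's memory grows with the number of distinct headers rather than staying at O(single record).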

If you are working with a FASTA file that fits comfortably in memory, prefer protfasta.read_fasta(): it offers full sanitization and is only marginally slower per record.

Basic usage

import protfasta

for header, sequence in protfasta.iter_fasta('huge.fasta'):
    # process each sequence on the fly
    if len(sequence) > 1000:
        print(header, len(sequence))

Using a header parser

The optional header_parser argument is a callable (str) -> str applied to every raw header before it is yielded. This is commonly used to extract a UniProt accession from a structured header:

import protfasta

def uniprot_id(header):
    # '>sp|P12345|NAME_HUMAN ...' -> 'P12345'
    return header.split('|')[1]

for acc, seq in protfasta.iter_fasta('uniprot.fasta',
                                     header_parser=uniprot_id):
    ...
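
A parser like uniprot_id raises IndexError on any header that contains no '|' character, which would stop the iteration partway through a mixed file. One possible way to guard against this is a parser with a fallback. This is a sketch under assumptions: the name safe_uniprot_id is hypothetical, and falling back to the first whitespace-delimited token is just one reasonable choice.

```python
def safe_uniprot_id(header):
    """Return the accession from 'sp|P12345|NAME_HUMAN ...' style headers.

    Falls back to the first whitespace-delimited token when the header
    is not pipe-delimited (hypothetical behaviour chosen for this sketch).
    """
    header = header.lstrip('>')  # tolerate a leading '>' if one is present
    parts = header.split('|')
    if len(parts) >= 3 and parts[1]:
        return parts[1]
    tokens = header.split()
    return tokens[0] if tokens else header
```

It can be passed as header_parser=safe_uniprot_id in the same way as uniprot_id above.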

Streaming filtering + writing

Because write_fasta accepts a list of [header, sequence] pairs, you can stream through an input file, keep only the records you care about, and write them out. Memory then scales with the number of records kept rather than with the size of the input file:

import protfasta

keep = []
for header, seq in protfasta.iter_fasta('huge.fasta'):
    if 100 <= len(seq) <= 500:
        keep.append([header, seq])
        # optionally flush every N records to bound memory further

protfasta.write_fasta(keep, 'filtered.fasta')
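
The example above collects every kept record in keep, so its memory use grows with the number of matches. When even the filtered subset is too large to hold, one option is to write each record as soon as it passes the filter. The sketch below uses plain Python file I/O rather than write_fasta; the helper name stream_filter and the 60-character line width are assumptions made for illustration, and a small list stands in for the output of iter_fasta.

```python
import os
import tempfile

def stream_filter(records, out_path, min_len=100, max_len=500, width=60):
    """Write records whose length is in [min_len, max_len] to out_path,
    one at a time, and return how many were written."""
    written = 0
    with open(out_path, "w") as out:
        for header, seq in records:
            if min_len <= len(seq) <= max_len:
                out.write(f">{header}\n")
                # wrap the sequence at `width` characters per line
                for i in range(0, len(seq), width):
                    out.write(seq[i:i + width] + "\n")
                written += 1
    return written

# Stand-in for protfasta.iter_fasta('huge.fasta')
records = [("short", "A" * 50), ("ok", "C" * 130), ("long", "D" * 900)]
out_path = os.path.join(tempfile.mkdtemp(), "filtered.fasta")
n_written = stream_filter(records, out_path)
```

Only one record is held at a time, so memory stays bounded regardless of how many records pass the filter.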

Documentation

protfasta.iter_fasta(filename: str, header_parser: Callable[[str], str] | None = None)

Yield (header, sequence) pairs from a FASTA file, streaming.

This is a memory-efficient alternative to protfasta.read_fasta() designed for very large files (tens or hundreds of millions of sequences) where holding the entire dataset in memory is not feasible.

No duplicate detection, invalid-residue handling, or alignment-gap logic is performed – each record is yielded as parsed. Callers that need those features should consume the iterator and apply their own filtering.

Parameters:
  • filename (str) – Path to a FASTA file.

  • header_parser (callable or None, optional) – A (str) -> str transform applied to every raw header before it is yielded. If None (the default), no transform is applied.

Yields:

tuple[str, str] – (header, sequence) pairs in the order they appear in the file. Sequences are upper-cased.

Raises:

ProtfastaException – If the file cannot be opened.
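
To make the documented contract concrete, here is a minimal sketch of the general streaming technique. It is not protfasta's actual implementation: the name iter_fasta_sketch is hypothetical, it does no error wrapping (a missing file raises Python's built-in OSError, not ProtfastaException), and it assumes headers are yielded without the leading '>'.

```python
import os
import tempfile

def iter_fasta_sketch(filename, header_parser=None):
    """Yield (header, sequence) pairs while holding one record at a time."""
    header = None
    chunks = []
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                if header is not None:
                    # a new header closes the previous record
                    yield header, "".join(chunks).upper()
                header = line[1:]
                if header_parser is not None:
                    header = header_parser(header)
                chunks = []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        # emit the final record once the file is exhausted
        yield header, "".join(chunks).upper()

# Small demonstration file
path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(path, "w") as fh:
    fh.write(">sp|P1|A\nacde\nfgh\n\n>sp|P2|B\nMKV\n")
pairs = list(iter_fasta_sketch(path, header_parser=lambda h: h.split("|")[1]))
```

Only the lines of the current record are buffered, which is what keeps peak memory independent of file size.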