read_fasta
=================

``read_fasta`` is the primary entry point to **protfasta**. It reads a
FASTA file, optionally sanitizes its contents, and returns the records
either as a dictionary (``header -> sequence``) or as a list of
``[header, sequence]`` pairs.

At its simplest::

    import protfasta
    sequences = protfasta.read_fasta('proteins.fasta')

Optional keyword arguments customize this behaviour, including
duplicate handling, invalid-residue correction, alignment-gap support,
custom header parsing, and automatic writing of the sanitized output.


What ``read_fasta`` can do
...........................

    *  Ignore, remove, convert, or fail on sequences containing
       non-standard amino-acid characters
       (``B``, ``U``, ``X``, ``Z``, ``*``, ``-``, ``' '``, ...).
    *  Ignore, remove, or fail on duplicate FASTA records
       (same header **and** same sequence).
    *  Ignore, remove, or fail on duplicate sequences (same sequence,
       different headers).
    *  Preserve alignment gap characters (``-``) when
       ``alignment=True``.
    *  Apply a caller-supplied ``header_parser`` function to every
       raw header (useful for extracting accession IDs).
    *  Optionally write the sanitized result to a new FASTA file via
       ``output_filename``.
    *  Override the built-in invalid-character conversion table via
       a custom ``correction_dictionary``.
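
As an illustration of the ``header_parser`` option, the sketch below
defines a function that reduces a header to its accession ID. The
function name ``accession_only`` and the UniProt-style
``db|ACCESSION|NAME`` header layout are assumptions made for this
example only; any callable that takes the raw header string and returns
a string should fit the same role::

    def accession_only(header):
        """Return the accession from a header like 'sp|P12345|NAME_HUMAN ...'.

        Falls back to the first whitespace-separated word when the
        header does not contain at least three '|'-separated fields.
        """
        fields = header.split('|')
        if len(fields) >= 3:
            return fields[1]
        words = header.split()
        return words[0] if words else header

    # Hypothetical call; needs protfasta installed and a real file on disk:
    # sequences = protfasta.read_fasta('proteins.fasta',
    #                                  header_parser=accession_only)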


Processing pipeline
.....................

Sanitization happens in a fixed order:

    1. The file is streamed from disk, headers are parsed with
       ``header_parser`` (if provided), and header uniqueness is
       checked (when ``expect_unique_header=True``).
    2. Duplicate **records** are processed according to
       ``duplicate_record_action``.
    3. Duplicate **sequences** are processed according to
       ``duplicate_sequence_action``.
    4. **Invalid residues** are processed according to
       ``invalid_sequence_action``.
    5. The sanitized set is optionally written to
       ``output_filename``.
    6. The result is returned as a ``dict`` (default) or a ``list``
       of ``[header, sequence]`` pairs (when ``return_list=True``).
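
To make the ordering of steps 2 and 3 concrete, here is a minimal
sketch of what a ``'remove'`` action could look like at each step. It
is not protfasta's implementation: the helper name
``remove_duplicates`` is hypothetical, and keeping the first occurrence
of each duplicate is an assumption of this example::

    def remove_duplicates(records):
        """Deduplicate a list of [header, sequence] pairs in two passes."""
        # Step 2: drop duplicate records (same header AND same sequence).
        seen_records = set()
        unique_records = []
        for header, sequence in records:
            if (header, sequence) not in seen_records:
                seen_records.add((header, sequence))
                unique_records.append([header, sequence])

        # Step 3: drop duplicate sequences (same sequence, different
        # headers). Assumption: the first header seen is the one kept.
        seen_sequences = set()
        result = []
        for header, sequence in unique_records:
            if sequence not in seen_sequences:
                seen_sequences.add(sequence)
                result.append([header, sequence])
        return result

    records = [['a', 'MK'], ['a', 'MK'], ['b', 'MK'], ['c', 'LV']]
    cleaned = remove_duplicates(records)
    # cleaned == [['a', 'MK'], ['c', 'LV']]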

Incompatible option combinations (for example,
``expect_unique_header=True`` together with
``duplicate_record_action='ignore'``) are caught before the file is
read.
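
The idea of validating options up front can be sketched as follows.
This is only an illustration: the function name ``check_options`` and
the use of ``ValueError`` are assumptions, and the exception type that
protfasta itself raises is not specified on this page. The combination
is contradictory because ignoring duplicate records would leave
repeated headers in the result while headers are expected to be
unique::

    def check_options(expect_unique_header, duplicate_record_action):
        """Raise before any file I/O if the options contradict each other."""
        allowed = {'ignore', 'remove', 'fail'}
        if duplicate_record_action not in allowed:
            raise ValueError(
                'duplicate_record_action must be one of %s' % sorted(allowed))
        if expect_unique_header and duplicate_record_action == 'ignore':
            raise ValueError(
                "duplicate_record_action='ignore' would keep repeated "
                "headers, which conflicts with expect_unique_header=True")

    # Passes silently: duplicates are ignored and uniqueness is not required.
    check_options(expect_unique_header=False, duplicate_record_action='ignore')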


Default conversion table
..........................

When ``invalid_sequence_action`` includes conversion and no custom
``correction_dictionary`` is supplied, these replacements are applied:

    *  ``B`` -> ``N``
    *  ``U`` -> ``C``
    *  ``X`` -> ``G``
    *  ``Z`` -> ``Q``
    *  ``*`` -> ``''`` (removed)
    *  ``-`` -> ``''`` (removed; preserved if ``alignment=True``)
    *  ``' '`` -> ``''`` (whitespace removed)
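
The table above can be expressed as a plain dictionary. The sketch
below applies it character by character; the names
``DEFAULT_CONVERSIONS`` and ``convert_sequence`` are hypothetical and
are not part of protfasta's public API. A custom
``correction_dictionary`` would presumably take the same
``{character: replacement}`` shape::

    # Default replacements as listed in the table above.
    DEFAULT_CONVERSIONS = {
        'B': 'N',
        'U': 'C',
        'X': 'G',
        'Z': 'Q',
        '*': '',
        '-': '',
        ' ': '',
    }

    def convert_sequence(sequence, alignment=False):
        """Apply the default replacements to a single sequence string."""
        table = dict(DEFAULT_CONVERSIONS)
        if alignment:
            # Gap characters are preserved for aligned sequences.
            table.pop('-')
        return ''.join(table.get(residue, residue) for residue in sequence)

    plain = convert_sequence('MBUXZ*- A')
    # plain == 'MNCGQA'
    aligned = convert_sequence('MBUXZ*- A', alignment=True)
    # aligned == 'MNCGQ-A'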


Large files
............

For files that do not fit comfortably in memory, consider using the
streaming parser :func:`protfasta.iter_fasta` instead. ``read_fasta``
itself streams the file from disk (so it will not load the entire file
as a single string), but it still builds an in-memory data structure
of all records; ``iter_fasta`` avoids that.
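
The difference between the two approaches can be illustrated with a
small generator that yields one record at a time instead of collecting
them all. This is a generic sketch of streaming FASTA parsing, not the
implementation or signature of ``iter_fasta``; the name
``iter_records`` is hypothetical::

    def iter_records(lines):
        """Yield (header, sequence) tuples from an iterable of FASTA lines.

        Only the record currently being assembled is held in memory.
        """
        header = None
        chunks = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, ''.join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
        # Emit the final record once the input is exhausted.
        if header is not None:
            yield header, ''.join(chunks)

    example_lines = ['>a desc\n', 'MK\n', 'LV\n', '\n', '>b\n', 'AA\n']
    parsed = list(iter_records(example_lines))
    # parsed == [('a desc', 'MKLV'), ('b', 'AA')]

Passing an open file handle as ``lines`` would process the file lazily,
since Python file objects iterate line by line.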


For usage examples see the :doc:`examples` page. Full API
documentation is shown below.


Documentation
...............

.. toctree::
   :maxdepth: 2
   :caption: Contents:


.. automodule:: protfasta

.. autofunction:: read_fasta