read_fasta
read_fasta is a one-stop shop for reading in FASTA files. Customizable keywords allow a variety of sanitizing functions, which include:
- Ignore, remove, or convert sequences with invalid amino acid characters (B/U/X/*/-)
- Ignore or remove duplicate sequences or duplicate FASTA records
- Alternatively, allow duplicate sequences, headers, and FASTA records (something most other parsers do not)
- Arbitrary conversion of amino acids via a customizable correction_dictionary
Once parsed, read_fasta returns either a dictionary of header-to-sequence values or a nested list, where each sub-list contains two elements (header, sequence).
For usage examples see the Examples page. Full documentation is shown below.
Documentation
protfasta.read_fasta(filename, expect_unique_header=True, header_parser=None, check_header_parser=True, duplicate_sequence_action='ignore', duplicate_record_action='fail', invalid_sequence_action='fail', alignment=False, return_list=False, output_filename=None, correction_dictionary=None, verbose=False)
read_fasta is the main one of only two user-facing functions associated with protfasta. It is designed as a catch-all function for reading in a FASTA file, performing sanitization, and returning a list or dictionary of sequences and their associated headers.
There are a number of parameters which can be included, but as one might expect the simplest usage is just
>>> x = read_fasta(filename)
This will read in the file associated with filename and return a dictionary, where the keys are the FASTA file headers and the values are the amino acid sequences associated with each.
Note that as of Python 3.7 dictionaries are guaranteed to preserve insertion order, so iterating over the resulting dictionary yields the records in the order in which they were read.
In addition to this simple usage, a number of keywords, described in depth below, allow additional processing to be performed.
Sanitization occurs in a fixed order:
- File is read in, custom headers are parsed, and unique headers are tested (if expect_unique_header = True)
- Check for duplicate records and respond appropriately (optional)
- Check for duplicate sequences and respond appropriately (optional)
- Invalid sequences are dealt with (optional)
- Final set of sequences/headers is written to a new FASTA file (optional)
- Dictionary/list is returned to the user.
Understanding that there is a specific order is important when considering which options to pass. If a set of options is incompatible, this is caught before the file is read.
Parameters:
expect_unique_header (bool) – [Default = True] Should the function expect each header to be unique? In general this is true for FASTA files, but it is not strictly guaranteed. If this is set to True and a duplicate header is found, an exception is thrown. If it is set to False, duplicate headers are dealt with, although for this to work return_list must also be set to True. Note that this does not happen automatically, to avoid the scenario where you expect a dictionary and actually get a list.
header_parser (function) – [Default = None] header_parser allows a user-defined function that is fed each FASTA header; whatever it returns is used as the actual header as the file is parsed. This can be useful if you know your FASTA headers have a consistent format that you want to take advantage of. A function provided here MUST (1) take a single input argument (the header string) and (2) return a single string. When this function is passed, the following test is applied, unless check_header_parser is set to False:
>>> return_string = header_parser('this test string should work')
where return_string is tested to be a string. The function raises an exception if this test fails and check_header_parser is set to True.
check_header_parser (bool) – [Default = True] Flag which, if set to False, disables the test that the header_parser function returns a valid string. This may lead to unexpected header values if the passed header_parser function is not well defined.
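As an illustration of the header_parser contract described above, the following minimal sketch defines a parser that keeps only the text before the first space and then repeats the documented string test by hand. The function name first_token_parser and the example header are hypothetical and are not part of protfasta.

```python
def first_token_parser(header):
    """Hypothetical header_parser: keep only the text before the first space."""
    return header.split(" ", 1)[0]


# Mirror the documented check: call the parser on the test string and
# confirm that what comes back is a string.
return_string = first_token_parser('this test string should work')
assert isinstance(return_string, str)

# Apply the parser to a made-up header to see what would be used as the key.
parsed = first_token_parser("sp|P12345|EXAMPLE_HUMAN Example protein")
print(parsed)
```

A function like this would be supplied through the header_parser keyword, e.g. read_fasta(filename, header_parser=first_token_parser).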
duplicate_record_action ('ignore', 'fail', 'remove') – [Default = ‘fail’] Selector that determines how to deal with duplicate entries. Note that duplicate records are entries in the FASTA file where both the sequence and the header are identical. duplicate_record_action is only relevant when expect_unique_header is False. Options are as follows:
- ignore - duplicate entries are allowed and ignored
- fail - duplicate entries cause parsing to fail and throw an exception
- remove - duplicate entries are removed, so there’s only one copy of any duplicates
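To make the definition of a duplicate record concrete, here is a minimal sketch of what the remove option amounts to, assuming records are held as [header, sequence] pairs. The helper remove_duplicate_records is hypothetical and is not protfasta's implementation.

```python
def remove_duplicate_records(records):
    """Keep one copy of each (header, sequence) pair, preserving input order."""
    seen = set()
    unique = []
    for header, sequence in records:
        key = (header, sequence)
        if key not in seen:
            seen.add(key)
            unique.append([header, sequence])
    return unique


records = [
    ["seq1", "MKVL"],
    ["seq1", "MKVL"],  # duplicate record: same header AND same sequence
    ["seq1", "MKVA"],  # same header, different sequence: not a duplicate record
    ["seq2", "MKVL"],  # same sequence, different header: not a duplicate record
]
deduplicated = remove_duplicate_records(records)
print(deduplicated)
```

Only the second entry is dropped, because a record counts as a duplicate only when header and sequence both match.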
duplicate_sequence_action ('ignore', 'fail', 'remove') – [Default = ‘ignore’] Selector that determines how to deal with duplicate sequences. This completely ignores the header and simply asks whether two sequences are duplicated (or not).
- ignore - duplicate sequences are allowed and ignored
- fail - duplicate sequences cause parsing to fail and throw an exception
- remove - duplicate sequences are removed, so there’s only one copy of any duplicates (1st instance kept)
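By contrast with duplicate records, duplicate sequences are judged on the sequence alone. The sketch below shows the "first instance kept" behaviour described for remove, again assuming [header, sequence] pairs; remove_duplicate_sequences is a hypothetical helper, not protfasta's code.

```python
def remove_duplicate_sequences(records):
    """Keep the first record seen for each distinct sequence, ignoring headers."""
    seen_sequences = set()
    kept = []
    for header, sequence in records:
        if sequence not in seen_sequences:
            seen_sequences.add(sequence)
            kept.append([header, sequence])
    return kept


records = [
    ["seq1", "MKVL"],
    ["seq2", "MKVA"],
    ["seq3", "MKVL"],  # same sequence as seq1, so this later copy is dropped
]
kept = remove_duplicate_sequences(records)
print(kept)
```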
invalid_sequence_action ('ignore', 'fail', 'remove', 'convert', 'convert-ignore', 'convert-remove') – [Default = ‘fail’] Selector that determines how to deal with invalid sequences. If convert or convert-ignore is chosen, then conversion is completed with either the standard conversion table (shown under the correction_dictionary documentation) or with a custom conversion dictionary passed to correction_dictionary. Options are as follows:
- ignore - invalid sequences are completely ignored
- fail - invalid sequences cause parsing to fail and throw an exception
- remove - invalid sequences are removed
- convert - invalid sequences are converted
- convert-ignore - invalid sequences are converted to valid sequences and any remaining invalid residues are ignored
- convert-remove - invalid sequences are converted to valid sequences where possible, and any remaining sequences with invalid residues are removed
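The difference between the ignore, fail, and remove options can be sketched as follows. This is an illustration only: it assumes that "valid" means composed solely of the 20 standard amino acid letters, the helper names are hypothetical, and the convert variants are left out.

```python
# Assumption for this sketch: a sequence is valid if it contains only the
# 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence):
    """Return True if every character is a standard amino acid letter."""
    return set(sequence) <= STANDARD_AMINO_ACIDS


def apply_invalid_action(records, action):
    """Hypothetical helper covering the 'ignore', 'fail', and 'remove' options."""
    if action == "ignore":
        # Invalid sequences are left untouched.
        return [[header, seq] for header, seq in records]
    if action == "remove":
        # Invalid sequences are dropped from the result.
        return [[header, seq] for header, seq in records if is_valid(seq)]
    if action == "fail":
        # The first invalid sequence aborts processing with an exception.
        for header, seq in records:
            if not is_valid(seq):
                raise ValueError(f"invalid residue(s) in record {header!r}")
        return [[header, seq] for header, seq in records]
    raise ValueError(f"unsupported action in this sketch: {action!r}")


records = [["ok", "MKVL"], ["bad", "MKXL"]]
print(apply_invalid_action(records, "remove"))
```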
alignment (bool) – [Default = False] Flag which, if set to True, causes the FASTA file to be treated as containing alignments (with dashes), such that ‘-’ characters are not treated as invalid or converted. Works in concert with other flags.
return_list (bool) – [Default = False] Flag that tells the function to return a list of 2-mer lists (where position 0 is the header and position 1 the sequence). If you have duplicate identical headers which you want to deal with, this is required.
output_filename (string) – [Default = None] If you are performing sanitization of the input file it is often useful to write out the actual set of sequences you’ll be analyzing, so you have a persistent copy of this data for further analysis later on. If you provide a string to output_filename, a new FASTA file is written containing the final set of sequences returned.
correction_dictionary (dict) – [Default = None] protfasta can automatically correct non-standard amino acids to standard amino acids using the invalid_sequence_action keyword. This is useful if downstream analysis assumes/requires fully standard amino acids. This is also useful for removing ‘-’ from aligned sequences. The standard conversions used are:
B -> N
U -> C
X -> G
Z -> Q
" " -> <empty string> (i.e. a whitespace character)
* -> <empty string>
- -> <empty string>
However, if alternative definitions are needed they can be passed via the correction_dictionary keyword. The correction_dictionary should be a dictionary that maps sequence characters to some other character (ideally valid amino acid characters). In principle this could be used to perform arbitrary coarse-graining if a sequence…
verbose (bool) – [Default = False] If set to True, protfasta will print out information as it works its way through reading and parsing FASTA files. This can be useful for diagnosis.
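The standard conversion table listed for correction_dictionary can be applied character by character, as in the minimal sketch below. The helper convert_sequence is hypothetical, and the sketch assumes that a custom dictionary replaces the standard table entirely rather than extending it; the source does not say which of the two protfasta does.

```python
# The standard conversions listed in the correction_dictionary documentation.
STANDARD_CONVERSIONS = {
    "B": "N",
    "U": "C",
    "X": "G",
    "Z": "Q",
    " ": "",
    "*": "",
    "-": "",
}


def convert_sequence(sequence, correction_dictionary=None):
    """Hypothetical helper: map each character through a conversion table.

    Assumption: a custom dictionary, when given, is used INSTEAD of the
    standard table. Characters without an entry are left unchanged.
    """
    table = STANDARD_CONVERSIONS if correction_dictionary is None else correction_dictionary
    return "".join(table.get(residue, residue) for residue in sequence)


converted = convert_sequence("MB-UX*Z K")
custom = convert_sequence("MKXL", {"X": "A"})
print(converted, custom)
```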
Returns: Return type is list or dict.
If return_list is set to True then the function returns a list of lists. Each sublist contains two elements, where the first is the FASTA record header and the second is the sequence. The order of FASTA records will match the order they were read in from the FASTA file. If return_list is False then the function returns a dictionary where the keys are the FASTA record headers and the values are the sequences. NOTE the order of keys will match the order that the FASTA file was read in IF the Python version is 3.7 or higher.
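The two return shapes, and the reason duplicate headers require return_list to be True, can be seen with plain Python containers. The records below are made up for illustration and no protfasta call is involved.

```python
# Illustrative records in file order; note that "seq1" appears twice as a header.
records = [["seq1", "MKVL"], ["seq1", "MKVA"], ["seq2", "GGSG"]]

# List form: every record survives, in the order it was read.
as_list = [[header, sequence] for header, sequence in records]

# Dictionary form: a key can hold only one value, so the second "seq1"
# overwrites the first and one record is silently lost.
as_dict = dict(records)

print(len(as_list), len(as_dict))
print(list(as_dict))
```

The key order of as_dict follows insertion order, which is the Python 3.7+ guarantee the note above relies on.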