AttributeError: 'numpy.ndarray' object has no attribute 'split'

I am trying to answer the following question "A colleague has produced a file with one DNA sequence on each line. Download the file and load it into Python using numpy.loadtxt(). You will need to use the optional argument dtype=str to tell loadtxt() that the data is composed of strings.

Calculate the GC content of each sequence. The GC content is the percentage of bases that are either G or C (as a percentage of total base pairs). Print the result for each sequence as "The GC content of the sequence is XX.XX%" where XX.XX is the actual GC content. Do this using "formatted strings"."

Having imported the file of DNA sequences and joined them together, I now want to split the resulting string into 5 sequences (one for each of the 5 rows) so that I can start the calculations.

NB: This is the file source: http://www.programmingforbiologists.org/data/dna_sequences_1.txt

This is my code:

import numpy
dna_data=numpy.loadtxt("dna_sequences",dtype=str)
",".join(dna_data)
seq1,seq2,seq3,seq4,seq5=dna_data.split(",",4)

I am getting this error message: AttributeError: 'numpy.ndarray' object has no attribute 'split'

Please help!!!



Solution 1:[1]

As was said in the comments: ",".join(dna_data) does not modify dna_data, it just returns a new string that you have to store in another variable. Like this:

s = ",".join(dna_data)
seq1,seq2,seq3,seq4,seq5=s.split(",",4)

Going further:

(A note, since you seem to be new to numpy: in the following I'll assume dna_data has shape (5,). If that is not the case, you can get back to that shape using very basic slicing.)
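To illustrate the slicing mentioned in that note, here is a minimal sketch. The sequences are made-up placeholders rather than the contents of the real file, and the (1, 5) and (5, 1) shapes are hypothetical cases built only for the demonstration; for a file with one string per line, loadtxt should already give you shape (5,).

```python
import numpy as np

# Made-up placeholder sequences standing in for the file contents.
flat = np.array(["ATGC", "GGCC", "ATAT", "CGCG", "TTAA"])
print(flat.shape)  # (5,)

# Hypothetical 2-D layouts, built here only to show the slicing.
row = flat.reshape(1, 5)  # one row holding five sequences
col = flat.reshape(5, 1)  # five rows holding one sequence each

# Basic indexing/slicing brings both back to shape (5,).
back_from_row = row[0]
back_from_col = col[:, 0]
print(back_from_row.shape, back_from_col.shape)  # (5,) (5,)
```

Printing dna_data.shape right after loading is a quick way to check which case you are in.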

That being said, with that code you are only turning your array into a list so that you can put its elements into 5 different variables. Going array->string->list->variables is excessive when you could go array->variables in one trivial line: seq1,seq2,seq3,seq4,seq5 = dna_data

And I would go even further: don't do it at all! What is the point of having several variables when you can just use dna_data[n] instead of any of your seq* variables? Indexing is more convenient and lets you painlessly do things such as looping over all the sequences with a for loop, e.g.:

for seq in dna_data: 
    print(seq)
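Building on that loop, the GC content asked for in the question could be computed along these lines. This is only a sketch: gc_content is a helper name I made up, and the three sequences are placeholders rather than the real file contents, so swap in your own dna_data from loadtxt.

```python
import numpy as np

# Hypothetical stand-in for the loaded file: short made-up sequences.
dna_data = np.array(["ATGC", "GGCC", "ATAT"])

def gc_content(seq):
    """Return the percentage of bases in seq that are G or C."""
    seq = seq.upper()  # count lowercase bases as well
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

for seq in dna_data:
    # {:.2f} formats the number with two decimal places.
    print("The GC content of the sequence is {:.2f}%".format(gc_content(seq)))
```

With the placeholder data this prints 50.00%, 100.00% and 0.00%. Note that an empty sequence would raise ZeroDivisionError, so you may want to guard against blank lines.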

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jadsq