Read text file and look for certain words from key word list
I am new to Python, and I am trying to build a script that imports text_file_1, which contains a body of text. I want the script to read the text and look for certain words that I have defined in a list called key_words. The list contains both capitalized words (Nation) and lowercase words (nation). After the search, the script should write the words vertically to a new text file called "List of Words", along with the number of times each word occurs in the body. If I then read text_file_2, which has its own body of text, the script should do the same, but ADD its results to the List of Words from the original file.
Example:
List of Words
File 1:
God: 5
Nation: 4
creater: 8
USA: 3
File 2:
God: 10
Nation: 14
creater: 2
USA: 1
Here is what I have so far:
from sys import argv
from string import punctuation
script = argv[0]
all_filenames = argv[1:]
print "Text file to import and read: " + all_filenames
print "\nReading file...\n"
text_file = open(all_filenames, 'r')
all_lines = text_file.readlines()
#print all_lines
text_file.close()
for all_filenames in argv[1:]:
    print "I get: " + all_filenames
print "\nFile read finished!"
#print "\nYour file contains the following text information:"
#print "\n" + text_file.read()
#~ for word, count in word_freq.items():
#~ print word, count
keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
'constitution', 'Government', 'Citizens', 'citizens']
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file = open("List_of_words.txt", "w")
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
Maybe use this code somehow?
import fileinput
for line in fileinput.input('List_of_words.txt', inplace = True):
    if line.startswith('Existing file that was read'):
        #if line starts Existing file that was read then do something here
        print "Existing file that was read"
    elif line.startswith('New file that was read'):
        #if line starts with New file that was read then do something here
        print "New file that was read"
    else:
        print line.strip()
Solution 1:[1]
This version prints the result to the screen.
from sys import argv
from collections import Counter
from string import punctuation
script, filename = argv
text_file = open(filename, 'r')
word_freq = Counter([word.strip(punctuation) for line in text_file for word in line.split()])
#~ for word, count in word_freq.items():
#~ print word, count
# note the comma after 'creater': without it Python joins
# 'creater' and 'Country' into the single string 'createrCountry'
key_words = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater',
    'Country', 'country', 'People', 'people', 'Liberty', 'liberty',
    'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage']
for word in key_words:
    if word in word_freq:
        print word, word_freq[word]
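The same counting step can be wrapped in a function so it is easy to try on a short string. This is only a sketch in Python 3 syntax (the code above is Python 2); the function name count_keywords and the sample sentence are made up for illustration and are not part of the original script.

```python
from collections import Counter
from string import punctuation


def count_keywords(text, key_words):
    """Return {keyword: count} for the keywords that occur in text (case-sensitive)."""
    # Split on whitespace and strip surrounding punctuation from each token.
    tokens = (word.strip(punctuation) for word in text.split())
    word_freq = Counter(tokens)
    # Keep only the keywords that actually occur, in key_words order.
    return {word: word_freq[word] for word in key_words if word in word_freq}


# Hypothetical sample text, used only to show the behaviour.
sample = "God bless the Nation. One nation, under God!"
result = count_keywords(sample, ["God", "Nation", "nation", "USA"])
print(result)
```

Because the comparison is case-sensitive, 'Nation' and 'nation' are counted separately, which is why the keyword list contains both spellings.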
Now you only have to save the result to a file.
To process more than one file, loop over the arguments:
for filename in argv[1:]:
    # do your job
EDIT:
With this code (my_script.py)
from sys import argv
for filename in argv[1:]:
    print( "I get", filename )
you can run the script with several file names
python my_script.py file1.txt file2.txt file3.txt
and get
I get file1.txt
I get file2.txt
I get file3.txt
You can use it to count words in many files.
-
readlines() reads all lines into memory at once, so it needs more memory; for a very, very big file that can be a problem.
The first version loops over the open file directly, so Counter() still counts all words in all lines (test it), but the lines are read one at a time.
So with readlines() you get the same word_freq, but you use more memory.
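To make the memory point concrete, here is a rough Python 3 sketch that updates one Counter line by line. It passes a generator to update(), so neither the list of lines nor the list of all words is built. The helper name count_words_streaming is an assumption, and io.StringIO only stands in for a real open file.

```python
from collections import Counter
from string import punctuation
import io


def count_words_streaming(lines):
    """Count words from any iterable of lines without building a list of them."""
    word_freq = Counter()
    for line in lines:
        # update() adds this line's word counts to the running totals.
        word_freq.update(word.strip(punctuation) for word in line.split())
    return word_freq


# A file object is itself an iterable of lines; StringIO stands in for one here.
fake_file = io.StringIO("God and nation.\nNation, God, liberty!\n")
counts = count_words_streaming(fake_file)
print(counts["God"], counts["Nation"], counts["nation"])
```

With a real file you would pass the object returned by open(filename) instead of fake_file.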
-
writelines(list_of_result) does not add "\n" after every line, and it does not add the ':' in "God: 3".
It is better to use something similar to
output_file = open("List_of_words.txt", "w")
for word in key_words:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
EDIT: new version that appends the result to the end of List_of_words.txt
from sys import argv
from string import punctuation
from collections import Counter
keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
'constitution', 'Government', 'Citizens', 'citizens']
for one_filename in argv[1:]:
    print "Text file to import and read:", one_filename
    print "\nReading file...\n"
    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()
    print "\nFile read finished!"
    word_freq = Counter([word.strip(punctuation) for line in all_lines for word in line.split()])
    print "Append result to the end of file: List_of_words.txt"
    output_file = open("List_of_words.txt", "a")
    for word in keyWords:
        if word in word_freq:
            output_file.write( "%s: %d\n" % (word, word_freq[word]) )
    output_file.close()
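The appended output above has no "File 1:" / "File 2:" labels like the example in the question. One possible way to add them is to write a label line before each block of counts. This is a sketch in Python 3 syntax; the function name append_report and its parameters are invented for illustration, and it writes into a temporary directory so no real file is touched.

```python
from collections import Counter
from string import punctuation
import os
import tempfile


def append_report(label, text, key_words, report_path):
    """Append one labelled section of keyword counts to report_path."""
    word_freq = Counter(word.strip(punctuation) for word in text.split())
    # Mode "a" creates the file if needed and keeps earlier sections.
    with open(report_path, "a") as output_file:
        output_file.write("%s:\n" % label)
        for word in key_words:
            if word in word_freq:
                output_file.write("%s: %d\n" % (word, word_freq[word]))


# Demonstration with made-up text in a temporary directory.
report = os.path.join(tempfile.mkdtemp(), "List_of_words.txt")
append_report("File 1", "God and the Nation. God!", ["God", "Nation"], report)
append_report("File 2", "One Nation, one USA.", ["God", "Nation", "USA"], report)
with open(report) as f:
    report_text = f.read()
print(report_text)
```

In the real script the label could be the file name from argv and the text could come from text_file.read().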
EDIT: write the sum of the results from all files to one file
from sys import argv
from string import punctuation
from collections import Counter
keyWords = ['God', 'Nation', 'nation', 'USA', 'Creater', 'creater', 'Country', 'Almighty',
'country', 'People', 'people', 'Liberty', 'liberty', 'America', 'Independence',
'honor', 'brave', 'Freedom', 'freedom', 'Courage', 'courage', 'Proclamation',
'proclamation', 'United States', 'Emancipation', 'emancipation', 'Constitution',
'constitution', 'Government', 'Citizens', 'citizens']
word_freq = Counter()
for one_filename in argv[1:]:
    print "Text file to import and read:", one_filename
    print "\nReading file...\n"
    text_file = open(one_filename, 'r')
    all_lines = text_file.readlines()
    text_file.close()
    print "\nFile read finished!"
    word_freq.update( [word.strip(punctuation) for line in all_lines for word in line.split()] )
print "Write sum of results: List_of_words.txt"
output_file = open("List_of_words.txt", "w")
for word in keyWords:
    if word in word_freq:
        output_file.write( "%s: %d\n" % (word, word_freq[word]) )
output_file.close()
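One thing to watch: line.split() breaks the text at whitespace, so a two-word key such as 'United States' can never equal a single token and will not be counted by the code above. A possible workaround, shown here only as a Python 3 sketch, is to count such phrases with a regular expression. The helper count_phrase is hypothetical and not part of the original answer.

```python
import re


def count_phrase(text, phrase):
    """Count case-sensitive, whole-word occurrences of a (possibly multi-word) phrase."""
    # \b keeps the phrase from matching inside longer words;
    # \s+ between the words tolerates line breaks or repeated spaces.
    pattern = r"\b" + r"\s+".join(re.escape(part) for part in phrase.split()) + r"\b"
    return len(re.findall(pattern, text))


# Made-up sample text; the second occurrence is split across a line break.
text = "The United States of America. The United\nStates, united."
phrase_count = count_phrase(text, "United States")
print(phrase_count)
```

The result could then be written with the same "%s: %d\n" format as the single-word counts.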
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow