Count words in different sections of a text file
For a project, I have to analyze a txt file containing over 200 resumes with Python. I have to search through the file and count whether a specific keyword is mentioned. This is my very simple code:
with open("CVC.txt") as file:
    data = file.read()
occurrence = data.count("Biology")
print('Number of occurrences of the word :', occurrence)
The problem is that when I search for, e.g., Engineering, it is mentioned several times in one CV, but I want to count it only once per CV. Every resume starts with the word 'contact'. My question is: how can I write an algorithm that differentiates between the resumes and counts a specific keyword only once per CV?
Thanks in advance!
Solution 1:[1]
The logic is fairly straightforward. Read the file line by line. When you see a line that starts a contact, collect all the lines that follow it until you see the next contact line. When the whole file has been read, store the remaining lines as part of the last started contact.
contacts = []
current_contact = None
with open("CVC.txt") as data:
    # a file object has no splitlines(); iterate over it line by line instead
    for line in data:
        # skip page lines (e.g. in middle of a contact)
        if line.strip().startswith("Page "):
            continue
        # start a new contact
        if line.strip() == "Contact":
            if current_contact is not None:
                # store the current contact lines, if they exist
                contacts.append('\n'.join(current_contact))
            current_contact = []
            continue
        # collect all lines for a single contact
        if current_contact is not None:
            current_contact.append(line.rstrip())
        else:
            print(f"Not seen 'Contact' yet... '{line.rstrip()}'")  # for debugging, e.g. start of the file
# store remaining data after all lines are read
if current_contact:
    contacts.append('\n'.join(current_contact))
del current_contact
I made an example file like this
Contact
https://linkedin.com/1
Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus
Page 1 of 2
Hic dignissimos consequatur error.
Contact
https://linkedin.com/2
Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.
And this test output
>>> for c in contacts:
...     print(c.splitlines())
...
['', 'https://linkedin.com/1', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus', '', '', 'Hic dignissimos consequatur error.']
['', 'https://linkedin.com/2', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.']
To count a word within a single contact, access that contact by its position in the list:
contacts[0].count("Biology")
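Since the question asks how many resumes mention a keyword at least once, a membership test per contact may be closer to what is needed than `count`. The following is only a minimal sketch: the function name `count_resumes_mentioning` and the sample list are hypothetical, and the case-insensitive matching is an assumption about what the asker wants.

```python
def count_resumes_mentioning(contacts, keyword):
    """Return how many resumes mention keyword at least once (case-insensitive)."""
    keyword = keyword.lower()
    # each resume contributes at most 1, no matter how often the keyword appears
    return sum(1 for text in contacts if keyword in text.lower())


# Hypothetical sample standing in for the `contacts` list built above
sample_contacts = [
    "Biology major\nStudied biology and chemistry",
    "Mechanical Engineering",
    "Teaches Biology",
]
print(count_resumes_mentioning(sample_contacts, "Biology"))  # 2
```

Note that this is a plain substring check, so a keyword such as "art" would also match "partner"; a word-boundary regular expression would be one way to tighten it.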
Solution 2:[2]
Here is a solution with simpler logic: keep a flag that tells whether (1) we are inside a contact and (2) we have already seen the word in this contact.
counter = 0
is_counted = True  # Initialize the flag to avoid the code breaking
word = 'engineering'  # Change this
with open('cv.txt', 'r') as file:
    line = file.readline()
    while line:
        if "contact" in line.lower():
            # a new resume starts, so the word may be counted again
            is_counted = False
        elif not is_counted and word in line.lower():
            counter += 1
            is_counted = True
        line = file.readline()
print(counter)
I have tried it successfully on a small sample; try it on your input and see if it works.
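One caveat with the check `"contact" in line.lower()` is that any line containing the word, such as "Contact me at ...", would also reset the flag. A possible variant, sketched here under the assumption that a resume starts at a line consisting only of the marker word, wraps the same flag logic in a function. The names `count_resumes_with_word` and `marker` are hypothetical, and the in-memory sample stands in for the real file.

```python
import io


def count_resumes_with_word(lines, word, marker="contact"):
    """Count resumes in which word appears at least once.

    Assumes a resume starts at a line whose stripped, lowercased text equals marker.
    """
    word = word.lower()
    counter = 0
    counted = True  # nothing is counted before the first marker line
    for line in lines:
        text = line.strip().lower()
        if text == marker:
            counted = False  # a new resume starts
        elif not counted and word in text:
            counter += 1
            counted = True  # ignore further mentions in this resume
    return counter


# Hypothetical sample; with a real file, pass the open file object instead
sample = io.StringIO(
    "Contact\n"
    "Engineering student\n"
    "Contact me about engineering roles\n"
    "Contact\n"
    "Biology teacher\n"
    "Contact\n"
    "Software Engineering\n"
)
print(count_resumes_with_word(sample, "engineering"))  # 2
```

Because the function accepts any iterable of lines, it can be called as `count_resumes_with_word(open_file, "engineering")` inside a `with open(...)` block, or with a plain list of strings for testing.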
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | OneCricketeer |
Solution 2 | Mohamed Yasser |