Count words in different sections of a text file

For a project I have to analyze a txt file containing over 200 resumes with Python. I have to search through the file and count whether a specific keyword is mentioned. This is my very simple code:

file = open("CVC.txt")

data=file.read()

occurence = data.count("Biology")

print('Number of occurrences of the word :', occurence) 

The problem is that when I search for e.g. Engineering, it is mentioned several times in one CV, but I just want to count it once per CV. Every resume starts with the word 'contact'. My question is: how can I write an algorithm that differentiates between the resumes and counts a specific keyword only once per CV?

Thanks in advance!




Solution 1:[1]

The logic is fairly straightforward. Read the file line by line; when you see a line that starts a contact, collect that contact's lines until you see the next contact line. When the whole file has been read, store the remaining lines as part of the last started contact.

contacts = []
current_contact = None

with open("CVC.txt") as data:
    # iterate over the file object directly; it yields one line at a time
    # (a file object has no .splitlines() method)
    for line in data:
        # skip page lines (e.g. in middle of a contact)
        if line.strip().startswith("Page "):
            continue
        # start a new contact
        if line.strip() == "Contact":
            if current_contact is not None:
                # store the current contact lines, if they exist
                contacts.append('\n'.join(current_contact))
            current_contact = []
            continue
        # collect all lines for a single contact
        if current_contact is not None:
            current_contact.append(line.rstrip())
        else:
            print(f"Not seen 'Contact' yet... '{line.rstrip()}'")  # for debugging, e.g. start of the file
    # store remaining data after all lines are read
    if current_contact:
        contacts.append('\n'.join(current_contact))

I made an example file like this:

Contact

https://linkedin.com/1

Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus

Page 1 of 2

Hic dignissimos consequatur error.

Contact

https://linkedin.com/2

Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.

And this is the test output:

>>> for c in contacts:
...   print(c.splitlines())
... 
['', 'https://linkedin.com/1', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus', '', '', 'Hic dignissimos consequatur error.']
['', 'https://linkedin.com/2', '', 'Fugit dicta voluptates iusto. Aut nam iste impedit. A aliquam repellendus consectetur esse vero placeat doloremque. Necessitatibus est labore provident atque possimus. Hic dignissimos consequatur error.']

To count words in a single contact, index into the list by position, e.g. the first contact:

contacts[0].count("Biology")
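
The question asks how many resumes mention a keyword at least once, not how often it appears in one resume. A minimal sketch of that step, assuming `contacts` is a list of strings like the one built above; the helper name `count_contacts_mentioning` and the sample data are made up for illustration:

```python
# Hypothetical helper, not part of the original answer.
def count_contacts_mentioning(contacts, keyword):
    """Return how many contact blocks mention keyword at least once.

    The match is a case-insensitive substring test, so "engineering"
    would also match "bioengineering".
    """
    needle = keyword.lower()
    return sum(1 for text in contacts if needle in text.lower())


# Made-up sample standing in for the `contacts` list built above.
sample_contacts = [
    "Engineering degree\nWorked in engineering for 5 years",
    "Biology major",
    "Software Engineering intern",
]

print(count_contacts_mentioning(sample_contacts, "engineering"))  # prints 2
```

The first sample resume mentions the keyword twice but still contributes only one to the total, because each contact is tested with `in` rather than `.count()`.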

Solution 2:[2]

Here is a solution with simpler logic: keep a flag that tells whether (1) we are inside a contact and (2) we have already seen the word in this contact.

counter = 0
is_counted = True  # start as True so nothing is counted before the first contact
word = 'engineering'  # Change this
with open('cv.txt', 'r') as file:
    for line in file:
        if "contact" in line.lower():
            # a new resume starts: allow the word to be counted again
            is_counted = False
        elif not is_counted and word in line.lower():
            # first mention of the word in this resume
            counter += 1
            is_counted = True

print(counter)

I have tried it successfully on a small sample; try it on your input and see if it works.
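
To try the flag approach without a file on disk, it can be wrapped in a function and fed an in-memory sample. This is a sketch of a variation, not the code above verbatim: it assumes a resume boundary is a line consisting only of the marker word, so that a line such as "Please contact me ..." is not mistaken for the start of a new resume. The function name, the `marker` parameter, and the sample text are all made up for illustration.

```python
import io


# Hypothetical wrapper around the flag logic; the name and parameters are illustrative.
def count_resumes_with_word(lines, word, marker="contact"):
    """Count resumes that mention word at least once.

    lines  : any iterable of text lines (an open file, io.StringIO, a list)
    marker : a line equal to this word (ignoring case and whitespace)
             starts a new resume; this exact-match rule is an assumption
    """
    counter = 0
    is_counted = True  # nothing is counted before the first marker line
    word = word.lower()
    for line in lines:
        lowered = line.lower()
        if lowered.strip() == marker:
            is_counted = False  # new resume: the word may be counted again
        elif not is_counted and word in lowered:
            counter += 1
            is_counted = True  # ignore further mentions in this resume
    return counter


# Made-up sample with three resumes.
sample = io.StringIO(
    "Contact\n"
    "Mechanical Engineering, BSc\n"
    "Engineering intern\n"
    "Contact\n"
    "Biology, MSc\n"
    "Contact\n"
    "Please contact me about engineering roles\n"
)

print(count_resumes_with_word(sample, "engineering"))  # prints 2
```

With a real file, the same function could be called as `count_resumes_with_word(open_file, "engineering")` inside a `with open(...)` block.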

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 OneCricketeer
Solution 2 Mohamed Yasser