How to Categorize Known Known, Known Unknown, and Unknown Unknown Data in Python?

Question

Asked: September 26, 2024 · In: Python

I’ve been pondering a little brain teaser inspired by a certain famous quote about known knowns, known unknowns, and unknown unknowns, and I can’t seem to wrap my head around it! I thought it might be fun to turn it into a coding challenge and see how creative everyone can get.

Here’s the scenario: Imagine you’re a data scientist tasked with analyzing a mysterious set of data that your company received from an unknown source. This dataset contains various categories of information, but it’s a bit of a mess. Some entries are straightforward; other entries are obscured or misfiled. You’ve got your “known knowns,” which are entries with clear data. However, you also have “known unknowns,” which are placeholders for missing data, and “unknown unknowns,” where the data is completely out of context or isn’t even in a recognizable format.

Now for the challenge! You need to write a function that categorizes and counts these three types of data points. Your function should take a list as input, where each item is a string representing a data point. Based on the content of each string, the function should return a dictionary with counts of each type. Here’s how you can define them:

1. **Known Knowns:** Any string that contains recognizable, valid data (for example, a valid number or a properly formatted email).
2. **Known Unknowns:** Items that are explicitly marked as “unknown” or similar phrases.
3. **Unknown Unknowns:** Anything else that doesn’t fit the first two categories, including random gibberish, completely empty strings, or odd characters.

For extra fun, you can get creative with how you define recognizable data. Should you use regular expressions? How sophisticated do you want your processing to be?
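To make the three definitions concrete, here is a rough starting sketch rather than a reference solution. The names `classify`, `categorize`, and `UNKNOWN_MARKERS`, the two regex patterns, and the list of placeholder words are all assumptions for illustration; feel free to define "recognizable" differently.

```python
import re

# Assumed patterns; real-world email and number validation is more involved.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
NUMBER_RE = re.compile(r"[+-]?\d+(\.\d+)?")

# Assumed set of placeholder markers; extend it to match your data.
UNKNOWN_MARKERS = {"unknown", "n/a", "na", "none", "null", "?"}


def classify(item: str) -> str:
    """Return the category name for a single data point."""
    text = item.strip()
    if text.lower() in UNKNOWN_MARKERS:
        return "known_unknown"
    if EMAIL_RE.fullmatch(text) or NUMBER_RE.fullmatch(text):
        return "known_known"
    return "unknown_unknown"


def categorize(items: list[str]) -> dict[str, int]:
    """Count how many items fall into each category."""
    counts = {"known_known": 0, "known_unknown": 0, "unknown_unknown": 0}
    for item in items:
        counts[classify(item)] += 1
    return counts


print(categorize(["a@b.com", "-3.5", "N/A", "", "@@##"]))
# {'known_known': 2, 'known_unknown': 1, 'unknown_unknown': 2}
```

Keeping the per-item decision in its own function makes it easier to try stricter or looser rules without touching the counting loop.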

I’m super curious to see how everyone approaches this challenge, especially in terms of how you handle the “unknown unknowns”! Let’s see your solutions and any edge cases you come up with. Can’t wait to see the magic unfold!

2 Answers

anonymous user · Answer 1 · 2024-09-26T15:55:43+05:30

Here’s a Python function that categorizes and counts the three types of data points described in the challenge. It uses regular expressions to recognize “known knowns”: each string in the input list is stripped of surrounding whitespace and then checked against an email pattern and a number pattern. A string that matches neither but contains the term “unknown” (in any letter case) counts as a known unknown, and everything else counts as an unknown unknown.

import re

def categorize_data(data_list):
    # Start every category at 0 so all three keys appear in the result
    counts = {'Known Knowns': 0, 'Known Unknowns': 0, 'Unknown Unknowns': 0}

    for entry in data_list:
        entry = entry.strip()

        if re.fullmatch(r'[\w.-]+@[\w.-]+\.\w+', entry):  # simple email pattern
            counts['Known Knowns'] += 1
        elif re.fullmatch(r'\d+(\.\d+)?', entry):  # unsigned integer or decimal
            counts['Known Knowns'] += 1
        elif 'unknown' in entry.lower():  # explicit placeholder for missing data
            counts['Known Unknowns'] += 1
        else:
            counts['Unknown Unknowns'] += 1

    return counts

# Example usage
data_points = ['john.doe@example.com', '12345', 'unknown data', '???', '', 'unknown', 'not@an.email', '42.0']
result = categorize_data(data_points)
print(result)  # {'Known Knowns': 4, 'Known Unknowns': 2, 'Unknown Unknowns': 2}
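
A few edge cases are worth checking with this approach, because the order of the branches decides ties. The sketch below restates the same three rules, in the same order, in a hypothetical single-item helper named `classify` (it is not part of the answer above) so that individual inputs can be inspected.

```python
import re


def classify(entry):
    """Apply the rules in order: email, number, 'unknown', everything else."""
    entry = entry.strip()
    if re.match(r'^[\w\.-]+@[\w\.-]+\.\w+$', entry):
        return 'Known Knowns'
    if re.match(r'^\d+(\.\d+)?$', entry):
        return 'Known Knowns'
    if 'unknown' in entry.lower():
        return 'Known Unknowns'
    return 'Unknown Unknowns'


edge_cases = ['unknown@example.com', 'Status: UNKNOWN', '  42  ', '-7', '1e3']
for case in edge_cases:
    print(repr(case), '->', classify(case))
```

With these rules, 'unknown@example.com' counts as a known known because the email check runs first, while '-7' and '1e3' land in the unknown unknowns because the number pattern only accepts unsigned digits with an optional decimal part.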

anonymous user · Answer 2 · 2024-09-26T15:55:42+05:30

Data Categorization Challenge

Here’s a simple function to categorize data points into known knowns, known unknowns, and unknown unknowns. I’m not super experienced, so I just did my best!


def categorize_data(data_points):
    counts = {'known_knowns': 0, 'known_unknowns': 0, 'unknown_unknowns': 0}

    for point in data_points:
        text = point.strip()  # strip surrounding whitespace once

        if text.lower() == "unknown":
            counts['known_unknowns'] += 1
        elif not any(char.isalnum() for char in text):
            # empty strings and strings with no letters or digits
            counts['unknown_unknowns'] += 1
        elif "@" in text and "." in text:  # rough check for email
            counts['known_knowns'] += 1
        elif text.replace(".", "", 1).isdigit():  # unsigned integer or decimal
            counts['known_knowns'] += 1
        else:
            counts['unknown_unknowns'] += 1

    return counts

# Example usage
data = [
    "john.doe@example.com", 
    "12345", 
    "unknown", 
    "", 
    "???", 
    "something random", 
    "12.34", 
    "not an email"
]

result = categorize_data(data)
print(result)

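This code checks each string against a few simple rules: an exact match on "unknown", a rough test for emails (an "@" and a "." somewhere in the string), and a digit check for numbers. Anything that matches none of these is counted as an unknown unknown.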
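
If signed numbers or scientific notation should count as known knowns too, one option is to let `float()` do the parsing instead of checking digits. This is only a sketch: `looks_like_number` is a hypothetical helper name, and rejecting `nan` and `inf` is an assumption about what "recognizable" should mean here.

```python
import math


def looks_like_number(text):
    """Return True if text parses as a finite float."""
    try:
        value = float(text)
    except ValueError:
        return False
    # float() also accepts 'nan' and 'inf'; treat those as not recognizable.
    return math.isfinite(value)


for sample in ["12345", "12.34", "-5", "1e3", "nan", "12,5", ""]:
    print(repr(sample), looks_like_number(sample))
```

The trade-off is that `float()` is fairly permissive, so strings like "1e3" are accepted as numbers, which may or may not be what you want for your data.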
