This is probably well outside the norm, and I apologize if it is not appropriate for this forum. I need help from someone who understands internal/external IP addresses and scraping, because I do not.
I mostly do data analytics and data modeling work. A good portion of it involves building classification models that predict which category to place data into, based on the known categorization of other data. I was recently handed a data set and asked to look at it and figure out what I can, with no known data sets to use as comparisons. I have no baseline for what "normal" looks like in anything I am looking at. Two variables in this set are the IP address and web browser scraped from each submitted online application and/or registration. In this particular case, we are looking for potentially fraudulent submittals, whether from multiple applicants, identity theft, or both. Most of the IPs are non-duplicates. Several appear two or three times, and some repeat into the hundreds. Neither of us working on this has any real-world networking experience, so we have a difference of opinion on the relevance of this.
Two schools of thought:
- The duplicates in the hundreds are simply an external IP shared through the same internet provider, neighborhood, or something similar, which is then subnetted to individual users. They are not of great concern, and we should be more concerned with 3, 4, or 5 submittals from the same IP.
- The duplicates in the hundreds are likely to be someone who has purchased stolen identities and is committing large-scale fraud.
Yes, we could do some correlating and make an educated guess, but with zero real-world knowledge and the unknown validity of the other data under either scenario, it seems better not to waste a lot of time making assumptions if I can go to someone with actual knowledge.
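For reference, here is a minimal sketch of the kind of correlating I mean, assuming the data can be reduced to (IP, browser) pairs. The function names, the bucket thresholds, and the sample addresses (taken from reserved documentation ranges) are all hypothetical and not from the real data set. It only counts submittals and distinct browsers per IP; it does not settle which school of thought is right.

```python
from collections import Counter, defaultdict


def summarize_by_ip(records):
    """Return {ip: (submittal_count, distinct_browser_count)}.

    `records` is an iterable of (ip, browser) pairs (assumed layout).
    """
    counts = Counter()
    browsers = defaultdict(set)
    for ip, browser in records:
        counts[ip] += 1
        browsers[ip].add(browser)
    return {ip: (counts[ip], len(browsers[ip])) for ip in counts}


def bucket(count):
    """Place a per-IP submittal count into a rough bucket.

    The cut-offs are arbitrary assumptions chosen to separate the
    "3, 4, 5" cases from the "hundreds" cases described above.
    """
    if count == 1:
        return "single"
    if count <= 5:
        return "2-5"
    if count < 100:
        return "6-99"
    return "100+"


# Hypothetical sample rows, not real data.
sample = [
    ("198.51.100.7", "Chrome"),
    ("198.51.100.7", "Chrome"),
    ("198.51.100.7", "Chrome"),
    ("198.51.100.7", "Firefox"),
    ("192.0.2.10", "Safari"),
]

summary = summarize_by_ip(sample)
for ip, (n_submittals, n_browsers) in sorted(summary.items()):
    print(ip, n_submittals, n_browsers, bucket(n_submittals))
```

Comparing the distinct-browser count against the submittal count for the high-volume IPs is one thing I could look at, but I do not know how to interpret it without the networking background.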