Fraudulent and Legitimate Online Shops Dataset

Published: 22 December 2023 | Version 1 | DOI: 10.17632/m7xtkx7g5m.1

Description

The dataset contains data on fake (fraudulent) and legitimate e-shops. It is roughly balanced, with 1140 records: 579 fake (fraudulent) and 561 real (legitimate) online shops. Each record contains the following fields:

1. Online shop's URL;
2. Label - {legitimate, fraudulent};
3. Domain length - number of symbols in the host domain name;
4. Top domain length - number of symbols in the top domain name;
5. Presence of the prefix “www” in the active URL of the online shop, values {0 - no, 1 - yes};
6. Number of digits in the URL;
7. Number of letters in the URL;
8. Number of dots (.) in the URL;
9. Number of hyphens (-) in the URL;
10. Presence of credit card payment, values {0 - no, 1 - yes};
11. Presence of money-back payment methods, including PayPal, Alipay, Apple Pay, Google Pay, Samsung Pay, and Amazon Pay, values {0 - no, 1 - yes};
12. Presence of cash on delivery payment, values {0 - no, 1 - yes};
13. Presence of the ability to use cryptocurrencies for payments, values {0 - no, 1 - yes};
14. Type of contact email address, values {0 - email address not found, 1 - free email address, 2 - domain email address, 3 - other email address}. Free email providers include Gmail, Hotmail, Outlook, Yahoo Mail, Zoho Mail, ProtonMail, iCloud Mail, GMX Mail, AOL Mail, mail.com, Yandex Mail, Mail2World, and Tutanota;
15. Presence of logo URL, values {0 - no, 1 - yes};
16. SSL certificate issuer name;
17. SSL certificate expiry date;
18. SSL certificate issuer organization name;
19. SSL certificate issuer organization ID, values {1 - Cloudflare, Inc., 2 - Let's Encrypt, 3 - Sectigo Limited, 4 - cPanel, Inc., 5 - GoDaddy.com, Inc., 6 - Amazon, 7 - DigiCert, Inc., 8 - GlobalSign nv-sa, 9 - Google Trust Services LLC, 10 - ZeroSSL, 11 - other organization};
20. Young domain indicator, i.e. whether the domain was registered no more than 400 days ago, values {0 - ‘old’ domain name, 1 - ‘young’ domain name, 2 - ‘hidden’};
21. Domain registration date;
22. Presence of TrustPilot reviews, values {0 - no, 1 - yes};
23. TrustPilot score, values - real number from 0 to 5, or -1 if no reviews are available;
24. Presence of SiteJabber reviews, values {0 - no, 1 - yes};
25. Presence in the standard Tranco list, values {0 - no, 1 - yes};
26. Tranco list rank, values - integer from 1 to 1000000, or -1 if the domain is not listed in the Tranco list.
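The URL-based fields (domain length, top domain length, “www” prefix, and the digit, letter, dot, and hyphen counts) can be approximated with a few lines of Python. The sketch below is illustrative only and is not the authors' extraction script: the function name `url_features`, the dictionary keys, and the choice to measure the full host name and to treat the label after the last dot as the top domain are all assumptions, so the values may differ from those stored in the dataset.

```python
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Compute simple lexical URL features (hypothetical helper)."""
    # hostname is lower-cased and stripped of port and credentials.
    host = urlparse(url).hostname or ""
    # Assumption: the "top domain" is the label after the last dot.
    top_domain = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        # Assumption: domain length is measured on the full host name.
        "domain_length": len(host),
        "top_domain_length": len(top_domain),
        "has_www": int(host.startswith("www.")),
        # The counts below are taken over the whole URL string.
        "n_digits": sum(c.isdigit() for c in url),
        "n_letters": sum(c.isalpha() for c in url),
        "n_dots": url.count("."),
        "n_hyphens": url.count("-"),
    }


if __name__ == "__main__":
    print(url_features("https://www.my-shop24.com/"))
```

Comparing the output against a handful of dataset rows would be a quick way to check whether these assumptions match the published definitions.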

Steps to reproduce

The data were created from publicly available sources (e.g. WHOIS service, SSL certificate data, Trustpilot and SiteJabber reviews, the Tranco list) and directly from e-shop source code, using a Python script. Fake shop URLs were taken from publicly available sources: https://www.watchlist-internet.at/, https://db.aa419.org, https://www.kaggle.com/datasets/wiredwith/websites-list. URLs of legitimate online shops were taken from https://www.trustedshops.eu, https://ecommercetrustmark.eu/, https://ehi-siegel.de/, https://www.retailexcellence.ie and https://www.similarweb.com/, supplemented with manually added URLs of well-known shops. Please note: since fake e-shops and their domains are short-lived, most of the fake shops are no longer reachable. The data were extracted in June-July 2023. In the future, we plan to expand the dataset, update it regularly, and share the Python code so that you can extract the data yourself.
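Because the extraction script has not been published yet, the following is only a rough sketch of how the young domain indicator (field 20) could be derived once a registration date has been obtained from WHOIS. The function name `young_domain_flag`, the constant `YOUNG_THRESHOLD_DAYS`, the use of `None` for a hidden registration date, and the choice of reference date are assumptions made for illustration; the dataset description does not state which reference date was used.

```python
from datetime import date
from typing import Optional

# Threshold taken from the dataset description (400 days).
YOUNG_THRESHOLD_DAYS = 400


def young_domain_flag(registered: Optional[date], reference: date) -> int:
    """Return 0 ('old'), 1 ('young') or 2 ('hidden') for a domain.

    Hypothetical helper: `registered` is the WHOIS registration date,
    or None when it is hidden; `reference` is the assumed date on
    which the record was collected.
    """
    if registered is None:
        return 2  # registration date not available
    age_days = (reference - registered).days
    # "Young" means registered no more than 400 days before the reference date.
    return 1 if age_days <= YOUNG_THRESHOLD_DAYS else 0


if __name__ == "__main__":
    collected = date(2023, 7, 1)  # assumed collection date
    print(young_domain_flag(date(2023, 1, 1), collected))
    print(young_domain_flag(date(2020, 1, 1), collected))
    print(young_domain_flag(None, collected))
```

Passing the reference date explicitly keeps the result reproducible, which matters here because the same domain would change from 'young' to 'old' if the flag were recomputed at a later date.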

Institutions

Kauno Technologijos Universitetas

Categories

e-Commerce
