Google, in collaboration with Stanford Medicine, has introduced SCIN, an open dataset of more than 10,000 images of dermatological conditions. Models trained on this dataset could support remote diagnosis of allergic, inflammatory, and infectious skin, nail, and hair conditions.
Medical datasets play a vital role in scientific research and medical education. However, they are difficult to collect because the data must be representative. For instance, dermatological conditions vary in appearance and severity and manifest differently depending on skin tone.
SCIN reflects the wide spectrum of conditions that people search for on the Internet. It contains images of various skin tones and body parts, so that future models can serve a broader range of patients.
The dataset comprises over 10,000 images of skin, nail, or hair conditions contributed directly by the individuals experiencing them. To provide context for retrospective dermatologist labeling, participants were asked to take pictures both up close and from slightly farther away. Most images in SCIN show early-stage conditions: over half appeared less than a week before being photographed, and 30% appeared less than a day before the image was captured. Conditions this recent are rarely seen in the healthcare system and are therefore underrepresented in existing dermatological datasets.
Google used a new crowdsourcing method to collect the dataset: an advertisement in search results invited individuals to submit images, giving them an active role in healthcare research. This approach reaches people at earlier stages of their health problems, before they seek professional care. The company deemed the experiment successful, primarily because of the low spam level: over 97.5% of the submitted photos were genuine images of skin conditions.
The dataset is available via this link.