The benchmark dataset consists of 288 video clips comprising 261,908 frames, together with 10,209 static images, captured by several drone-mounted cameras. It covers a wide range of conditions, including location (14 different cities in China, separated by thousands of kilometres), environment (urban and rural), objects (pedestrians, automobiles, bicycles, etc.), and density (sparse and crowded scenes). The data were gathered using a variety of drone platforms (i.e., different drone models), in a variety of scenarios, and under varying weather and lighting conditions.