A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.
Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.
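The data issue described above can be inspected before any modeling decision is made. The following is a minimal sketch, not a prescribed fix: it assumes the records sit in a pandas DataFrame with a hypothetical `age` column, and it uses a tiny synthetic frame rather than the study data. It marks ages recorded as 0 as missing and reports what share of rows is affected (in the scenario, 450 of 4,000 rows, or 11.25%).

```python
import pandas as pd


def flag_zero_ages(df: pd.DataFrame, age_col: str = "age") -> pd.DataFrame:
    """Return a copy of df in which ages recorded as 0 are marked missing (NaN).

    The column name `age` is an assumption made for illustration.
    """
    out = df.copy()
    # Series.where keeps values where the condition holds and sets NaN elsewhere.
    out[age_col] = out[age_col].where(out[age_col] != 0)
    return out


# Synthetic example rows (hypothetical values, not the study data).
patients = pd.DataFrame(
    {"age": [72, 0, 81, 0, 68], "dose_mg": [10, 12, 9, 11, 10]}
)

flagged = flag_zero_ages(patients)
missing_count = int(flagged["age"].isna().sum())
missing_share = float(flagged["age"].isna().mean())
print(f"{missing_count} rows ({missing_share:.1%}) have an invalid age")
```

Marking the values as missing rather than dropping the rows keeps the other features, which the scenario describes as normal, available for whatever handling is chosen next.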
A geospatial analysis company processes thousands of new satellite images each day to produce vessel detection data for commercial shipping. The company stores the training data in Amazon S3. The training data incrementally increases in size with new images each day.
The company has configured an Amazon SageMaker training job to use a single ml.p2.xlarge instance with File input mode to train the built-in Object Detection algorithm. The training process was successful last month but is now failing because of a lack of storage. Aside from the addition of training data, nothing has changed in the model training process.
A machine learning (ML) specialist needs to change the training configuration to fix the problem. The solution must optimize performance and must minimize the cost of training.
Which solution will meet these requirements?
Given the following confusion matrix for a movie classification model, what is the true class frequency for Romance and the predicted class frequency for Adventure?
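The two quantities asked about can be read off any confusion matrix by summing along one axis. The sketch below uses made-up counts and class names purely for illustration (the matrix from the question is not reproduced here), and it assumes the common convention that rows are true classes and columns are predicted classes. Under that assumption, a class's true frequency is its row sum divided by the total count, and its predicted frequency is its column sum divided by the total count.

```python
import numpy as np

# Hypothetical counts for illustration only; rows = true class, columns = predicted class.
labels = ["Romance", "Adventure", "Comedy"]
cm = np.array(
    [
        [50, 10, 5],   # true Romance
        [8, 60, 12],   # true Adventure
        [2, 15, 38],   # true Comedy
    ]
)


def class_frequencies(matrix: np.ndarray, names: list) -> tuple:
    """Return (true_freq, pred_freq) dicts of class shares in [0, 1]."""
    total = matrix.sum()
    true_share = matrix.sum(axis=1) / total  # row sums: how often each class truly occurs
    pred_share = matrix.sum(axis=0) / total  # column sums: how often each class is predicted
    true_freq = {name: float(v) for name, v in zip(names, true_share)}
    pred_freq = {name: float(v) for name, v in zip(names, pred_share)}
    return true_freq, pred_freq


true_freq, pred_freq = class_frequencies(cm, labels)
print("True frequency of Romance:", true_freq["Romance"])
print("Predicted frequency of Adventure:", pred_freq["Adventure"])
```

If a matrix is laid out with predicted classes on the rows instead, the two axes simply swap roles.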
A Machine Learning Specialist needs to create a data repository to hold a large amount of time-based training data for a new model. In the source system, new files are added every hour. Throughout a single 24-hour period, the volume of hourly updates will change significantly. The Specialist always wants to train on the last 24 hours of the data.
Which type of data repository is the MOST cost-effective solution?