by jsendak | Mar 4, 2024 | DS Articles
[This article was first published on Mark H. White II, PhD, and kindly contributed to R-bloggers].
The Academy Awards are a week away, and I’m sharing my
machine-learning-based predictions for Best Picture as well as some
insights I took away from the process (particularly XGBoost’s
sparsity-aware split finding). Oppenheimer is a heavy favorite
at 97% likely to win—but major surprises are not uncommon, as we’ll
see.
I pulled data from three sources. First, industry awards. Most unions
and guilds for filmmakers—producers, directors, actors,
cinematographers, editors, production designers—have their own awards.
Second, critical awards. I collected as widely as possible, from the
Golden Globes to the Georgia Film Critics Association. More or less: if
an organization had a Wikipedia page showing a historical list of
nominees and/or winners, I scraped it. Third, miscellaneous information
like the Metacritic score and keywords taken from synopses to learn
whether a film was adapted from a book, what genre it is, the topics it
covers, and so on. Combining all of these was a pain, especially for
films that have bonkers names like BİRDMAN or (The Unexpected Virtue of
Ignorance).
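The post doesn't show the merging code, but here is a minimal sketch of one way to reconcile differing title spellings, using edit distances from base R's adist(). The helper, its threshold, and the example titles are hypothetical, not the author's actual pipeline:

```r
# Hypothetical helper: match each title in one table to its closest
# counterpart in another, tolerating small spelling differences.
match_titles <- function(titles_a, titles_b, max_dist = 3) {
  norm <- function(x) tolower(gsub("[[:punct:]]", "", x))
  d   <- adist(norm(titles_a), norm(titles_b))  # edit-distance matrix
  idx <- apply(d, 1, which.min)                 # closest candidate per title
  data.frame(
    title_a  = titles_a,
    title_b  = titles_b[idx],
    distance = d[cbind(seq_along(idx), idx)],
    matched  = d[cbind(seq_along(idx), idx)] <= max_dist
  )
}

match_titles(c("The Kings Speech", "Les Miserables"),
             c("the king's speech", "les misérables", "argo"))
```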
The source data generally aligns with what
FiveThirtyEight used to do, except I cast a far wider net in
collecting awards. Other differences include FiveThirtyEight choosing a
closed-form solution for weighting the importance of awards and then
rating films in terms of “points” they accrued (out of the potential
pool of points) throughout the season. I chose to build a machine
learning model, which was tricky.
To make the merging of data feasible (e.g., different tables had
different spellings of the film or different years associated with the
film), I only looked at the movies that received a nomination for Best
Picture, making for a tiny dataset of 591 rows for the first 95
ceremonies. The wildly small N presents a challenge for building a
machine learning model, as do sparsity and missing data.
Sparsity and Missing Data
There are a ton of zeroes in the data, creating sparsity. Every
variable (save for the Metacritic score) is binary. Nomination variables
(i.e., was the film nominated for the award?) may have multiple films
for a given year with a 1, but winning variables (i.e., did the film win
the award?) only have a single 1 each year.
There is also the challenge of missing data. Not every award in the
model goes back to the late 1920s, meaning that each film has an NA
if it was released in a year before a given award existed. For
example, I only included Metacritic scores for contemporaneous releases,
and the site launched in 2001, while the Screen Actors Guild started
their awards in 1995.
My first thought was an ensemble model. Segment each group of awards,
based on their start date, into different models. Get predicted
probabilities from these, and combine them weighted on the inverse of
out-of-sample error. After experimenting a bit, I came to the conclusion
so many of us do when building models: Use XGBoost. With so little data
to use for tuning, I simply stuck with model defaults for
hyper-parameters.
Beyond its reputation for being accurate out of the box, XGBoost
handles missing data. The docs simply
state: “XGBoost supports missing values by default. In tree algorithms,
branch directions for missing values are learned during training.” This
is discussed in deeper detail in the “sparsity-aware split finding”
section of the paper
introducing XGBoost. The full algorithm is shown in that paper, but the
general idea is that an optimal default direction at each split in a
tree is learned from the data, and missing values follow that
default.
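The post doesn't include the fitting code, but a minimal sketch of fitting XGBoost in R with default hyper-parameters and NAs left in the feature matrix might look like the following. The objects features and labels are hypothetical placeholders for the merged dataset, and nrounds is an assumed value rather than the author's setting:

```r
library(xgboost)

# Feature matrix: mostly 0/1 award indicators plus the Metacritic score;
# NAs mark awards that did not yet exist in a film's year.
X <- as.matrix(features)
y <- labels$won_best_picture   # 1 if the film won Best Picture, else 0

dtrain <- xgb.DMatrix(data = X, label = y, missing = NA)

fit <- xgboost(
  data      = dtrain,
  objective = "binary:logistic",  # predicted probability of winning
  nrounds   = 100,                # other hyper-parameters left at defaults
  verbose   = 0
)
```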
Backtesting
To assess performance, I backtested on the last thirty years of
Academy Awards. I believe scikit-learn would call this group
k-fold cross-validation. I removed a given year from the dataset,
fit the model, and then made predictions on the held-out year. The last
hiccup is that the model does not know that if Movie A from Year X wins
Best Picture, it means Movies B – E from Year X cannot. It also does not
know that one of the films from Year X must win. My cheat
around this is to re-scale each year’s predicted probabilities to sum to
one.
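A minimal sketch of that leave-one-year-out backtest, with the within-year re-scaling, is below. The data frame dat, the vector feature_cols, and nrounds are hypothetical placeholders, not the author's actual code:

```r
library(xgboost)

years <- 1993:2022
preds <- lapply(years, function(yr) {
  train <- dat[dat$year != yr, ]
  test  <- dat[dat$year == yr, ]

  fit <- xgboost(
    data = xgb.DMatrix(as.matrix(train[, feature_cols]),
                       label = train$won, missing = NA),
    objective = "binary:logistic", nrounds = 100, verbose = 0
  )

  p <- predict(fit, xgb.DMatrix(as.matrix(test[, feature_cols]), missing = NA))

  # The model doesn't know exactly one nominee per year wins,
  # so re-scale the held-out year's probabilities to sum to one.
  data.frame(year = yr, film = test$film, prob = p / sum(p))
})
preds <- do.call(rbind, preds)
```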
The predictions for the last thirty years:
| Year | Predicted winner | Probability | Correct (1 = yes) | Actual winner |
|------|------------------|-------------|-------------------|---------------|
| 1993 | schindler’s list | 0.996 | 1 | schindler’s list |
| 1994 | forrest gump | 0.990 | 1 | forrest gump |
| 1995 | apollo 13 | 0.987 | 0 | braveheart |
| 1996 | the english patient | 0.923 | 1 | the english patient |
| 1997 | titanic | 0.980 | 1 | titanic |
| 1998 | saving private ryan | 0.938 | 0 | shakespeare in love |
| 1999 | american beauty | 0.995 | 1 | american beauty |
| 2000 | gladiator | 0.586 | 1 | gladiator |
| 2001 | a beautiful mind | 0.554 | 1 | a beautiful mind |
| 2002 | chicago | 0.963 | 1 | chicago |
| 2003 | the lord of the rings: the return of the king | 0.986 | 1 | the lord of the rings: the return of the king |
| 2004 | the aviator | 0.713 | 0 | million dollar baby |
| 2005 | brokeback mountain | 0.681 | 0 | crash |
| 2006 | the departed | 0.680 | 1 | the departed |
| 2007 | no country for old men | 0.997 | 1 | no country for old men |
| 2008 | slumdog millionaire | 0.886 | 1 | slumdog millionaire |
| 2009 | the hurt locker | 0.988 | 1 | the hurt locker |
| 2010 | the king’s speech | 0.730 | 1 | the king’s speech |
| 2011 | the artist | 0.909 | 1 | the artist |
| 2012 | argo | 0.984 | 1 | argo |
| 2013 | 12 years a slave | 0.551 | 1 | 12 years a slave |
| 2014 | birdman | 0.929 | 1 | birdman |
| 2015 | spotlight | 0.502 | 1 | spotlight |
| 2016 | la la land | 0.984 | 0 | moonlight |
| 2017 | the shape of water | 0.783 | 1 | the shape of water |
| 2018 | roma | 0.928 | 0 | green book |
| 2019 | parasite | 0.576 | 1 | parasite |
| 2020 | nomadland | 0.878 | 1 | nomadland |
| 2021 | the power of the dog | 0.981 | 0 | coda |
| 2022 | everything everywhere all at once | 0.959 | 1 | everything everywhere all at once |
Of the last 30 years, 23 predicted winners actually won, while 7
lost—making for an accuracy of about 77%. Not terrible. (And,
paradoxically, many of the misses are predictable ones to those familiar
with Best Picture history.) However, the mean predicted probability of
winning from these 30 cases is about 85%, which means the model is maybe
8 points over-confident. We do see recent years being more prone to
upsets—is that due to a larger pool of nominees? Or something else, like
a change in the Academy’s makeup or voting procedures? At any rate, some
ideas I am going to play with before next year are weighting more
proximate years higher (as rules, voting body, voting trends, etc.,
change over time), finding additional awards, and pulling in other
metadata on films. It might just be, though, that the Academy likes to
swerve away from everyone else sometimes in a way that is not readily
predictable from outside data sources. (Hence the fun of watching and
speculating and modeling in the first place.)
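Those two summary numbers are easy to reproduce from the backtest table above; a quick sketch, assuming a data frame backtest with prob and correct columns (hypothetical names), is:

```r
# Accuracy versus average stated confidence of the predicted winners.
mean(backtest$correct)                        # ~0.77 (23 of 30 correct)
mean(backtest$prob)                           # ~0.85 average predicted probability
mean(backtest$prob) - mean(backtest$correct)  # ~0.08, i.e. roughly 8 points over-confident
```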
This Year
I wanted to include a chart showing probabilities over time, but the
story has largely remained the same. The major inflection point was the
Directors Guild of America (DGA) Awards.
Using the data we had on the day the nominees were
announced (January 23rd), the predictions were:
| Film | Probability |
|------|-------------|
| Killers of the Flower Moon | 0.549 |
| The Zone of Interest | 0.160 |
| Oppenheimer | 0.147 |
| American Fiction | 0.061 |
| Barbie | 0.039 |
| Poor Things | 0.023 |
| The Holdovers | 0.012 |
| Past Lives | 0.005 |
| Anatomy of a Fall | 0.005 |
| Maestro | 0.001 |
I was shocked to see Oppenheimer lagging in third and to see
The Zone of Interest so high. The reason is that, while
backtesting, I saw that the variable importance for winning the DGA
award for Outstanding Directing – Feature Film was the highest by about
a factor of ten. Since XGBoost handles missing values nicely, we can
rely on the sparsity-aware split finding to get a little more
information from these data. If we know the nominees of an award but
not yet the winner, we can still infer something: anyone who was
nominated is left NA, while anyone who was not nominated is set to
zero. That allows us to partially use this DGA variable (and the other
awards where we knew the nominees on January 23rd, but not the
winners). When we do that, the predicted probabilities as of the
announcement of the Best Picture nominees were:
| Film | Probability |
|------|-------------|
| Killers of the Flower Moon | 0.380 |
| Poor Things | 0.313 |
| Oppenheimer | 0.160 |
| The Zone of Interest | 0.116 |
| American Fiction | 0.012 |
| Barbie | 0.007 |
| Past Lives | 0.007 |
| Maestro | 0.003 |
| Anatomy of a Fall | 0.002 |
| The Holdovers | 0.001 |
The Zone of Interest falls in favor of Poor Things,
since the former was not nominated for the DGA award while the latter
was. I was still puzzled, but I knew that the model wouldn’t start being
certain until we knew the DGA award. Those top three films were
nominated for many of the same awards. Then Christopher Nolan won the
DGA award for Oppenheimer, and the film hasn’t been below a 95%
chance for winning Best Picture since.
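A minimal sketch of that NA-versus-zero encoding for an award whose nominees are known but whose winner isn't yet announced (the column names here are hypothetical, not the author's):

```r
# Nominees stay NA so the sparsity-aware splits decide their branch;
# films that weren't even nominated are known losers and get 0.
this_year$dga_win <- ifelse(this_year$dga_nominated == 1, NA_real_, 0)
```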
Final Predictions
The probabilities as they stand today, a week before the ceremony,
have Oppenheimer as the presumptive winner at a 97% chance of
winning.
| Film | Probability |
|------|-------------|
| Oppenheimer | 0.973 |
| Poor Things | 0.010 |
| Killers of the Flower Moon | 0.005 |
| The Zone of Interest | 0.004 |
| Anatomy of a Fall | 0.003 |
| American Fiction | 0.002 |
| Past Lives | 0.001 |
| Barbie | 0.001 |
| The Holdovers | 0.001 |
| Maestro | 0.000 |
There are a few awards being announced tonight (Satellite Awards, the
awards for the cinematographers guild and the editors guild), but they
should not impact the model much. So, we are in for a year of a
predictable winner—or another shocking year where a CODA or a
Moonlight takes home film’s biggest award. (If you’ve read this
far and enjoyed Cillian Murphy in Oppenheimer… go check out his
leading performance in Sunshine,
directed by Danny Boyle and written by Alex Garland.)
Continue reading: Modeling the Oscar for Best Picture (and Some Insights About XGBoost)
Predicting the Oscars with Machine Learning
The original article described the use of machine learning to predict the winner of the Academy Award for Best Picture. The approach combined numerous datasets covering previous industry and critical awards, along with additional factors such as genre or adaptation. This data was then fed into a machine learning model, XGBoost, which can handle missing values and data sparsity, common issues when compiling a dataset this broad.
Implications
The ability to predict Oscar outcomes with an accuracy of about 77% is both intriguing and revealing. The model was also observed to be slightly overconfident, indicating a potential area for future work. On the other hand, the consistent accuracy might signify that the model captured genuine patterns or rules that determine Oscar wins, possibly shedding light on tendencies or biases inherent in the Academy’s decision-making process.
Possible Future Developments
Technology and artificial intelligence are gradually ingraining themselves into the film industry, and this predictive model is a clear example of their potential use. The model could theoretically be extended to predict other outcomes, perhaps even aiding film production companies in designing films to maximize their Oscar potential, though this would require substantially better accuracy and the existence of consistent, predictable patterns in Oscar decision-making.
Actionable Advice
Model Improvement
The first area where action can be taken is model improvement. As mentioned in the original article, there are changes in the rules, voting body, and voting trends over time – weighting more proximate years higher might be a feasible improvement to the model. It may also be worthwhile to consider if any other variables might impact the Academy’s voting behavior and try incorporating them into the current model.
Field Usage
The model could be of interest to film production companies, news agencies, or even betting companies – all of which would profit from accurate predictions about the Oscars. This could create market demand, leading to commercialization opportunities for such a model.
Studying Voting Decisions
If the model continues to predict Oscar outcomes correctly, it might indicate that there are consistent rules behind the voting decisions. Further exploration might reveal tendencies or biases in the Academy’s voting, which would pose interesting questions about the fairness and independence of the voting process.
In Conclusion
Although this is a promising and exciting predictive model, its accuracy and the analysis built on it must be taken with a grain of salt, as the model isn’t perfect. Regardless, this use of machine learning is a fascinating peek into possible applications of AI and data science within the film industry. Keep an eye on future developments in this area – it’s definitely a space worth watching.
Read the original article
by jsendak | Mar 4, 2024 | DS Articles
Transform your understanding of current and future tech with these top 5 AI reads to explore the minds shaping our future.
Expanding Your Technological Insight: Digesting AI’s Top Reads
As technological advancements continue to surge, an ever-evolving landscape of Artificial Intelligence (AI) unfolds. To keep abreast of these advancements, it’s essential to explore the top AI-related reads, as they provide keen insights and projections about the future.
Long-Term Implications and Possible Future Developments
The intricate interplay of AI with numerous sectors signifies its potent influence on the future. Technology’s pulsating momentum creates a future where AI is deeply embedded in our day-to-day tasks, business operations, and societal functions. Here are some long-term implications and possible future developments:
- AI in Everyday Life: Right from autonomous vehicles to personalized recommendations, AI will become more prevalent in our daily routines. This will lead to a surge in dependency on AI-driven systems.
- Business Operations: AI systems will significantly augment decision-making, streamline operations, and deliver profound competitive advantages to businesses across all sectors.
- Societal Impact: AI has the potential to enhance societal functions, from traffic management to predictive healthcare.
Actionable Advice
Recognizing the profound influence of AI, there’s a need to align with this technological wave. Here are some actions that can be taken:
- Self-Educate: Engage with books, articles, and think pieces on AI to broaden your understanding of its capabilities and potential impacts.
- Integrate AI in Business: Businesses should consider how AI can enhance their daily operations – whether streamlining processes, finding efficiencies, or predicting trends.
- Policy And Legislation: With AI poised to become more prevalent, policymakers should work towards developing guidelines and regulations to safeguard societal interests.
“Anyone who has not made his way to the digital age is quickly feeling the effects. Not only is AI becoming a necessity in businesses, but it will also become the core of many societal operations.”
The Future of AI
The future of AI is brimming with possibilities. As a double-edged sword, it presents unmatched opportunities and unprecedented challenges. Ultimately, awareness regarding AI’s potential, its ethical implications, and the measures needed to harness it responsibly will shape our collective future.
Read the original article
by jsendak | Mar 4, 2024 | DS Articles
Explore how NLP revolutionizes business operations, from enhancing customer service with chatbots to extracting market insights and personalizing content.
Exploring the Revolutionary Impact of Natural Language Processing (NLP) on Business Operations
Natural Language Processing (NLP), a sub-branch of artificial intelligence, is revolutionizing business operations across a wide spectrum of sectors. Its capabilities extend from enhancing customer service through chatbots, extracting market insights, to personalizing content. The dynamic application of NLP offers a compelling glimpse into the future of industries and how they may evolve over time.
Long-term Implications and Potential Future Developments
The use of NLP in business operations has vast possibilities and significant long-term implications. NLP lays the groundwork for intelligent automation in several key industries, from finance to health care.
Some key long-term implications of NLP include increased efficiency, accuracy in data processing, and personalized customer experiences. Furthermore, advancements in NLP could redefine how businesses interact with customers, conduct market analysis, and operate internally.
Predicting Future Developments in NLP
As AI continues to evolve, the applications of NLP in the business world will likely expand. Businesses may pivot towards fully automated customer service departments, intelligent business analytics software, and personalized advertising methods based on natural language understanding. The transformative impact of NLP will create new frontiers for technological innovation within the business sector.
Actionable Advice
Invest in NLP-capable Platforms
Businesses should seriously consider investing in NLP-capable platforms or software. The potential for automation, personalized customer service, and insightful data analysis are compelling reasons for this investment. Embracing NLP technology now can ensure a competitive edge in the near future.
Focus on Skilling and Re-skilling
With the advent of new technologies, the workforce will need to adapt quickly. Companies should focus on skilling and re-skilling their employees in order to harness the full potential of NLP. This includes training on managing chatbots, interpreting NLP data, and understanding the nuances of AI interfaces.
Stay Updated and Review Strategy Regularly
Given the rapid pace of advancements in AI, it is crucial for businesses to stay updated with the latest developments in NLP. A regular review of business strategies and operational protocols is vital to ensure that the organization fully leverages the potential of NLP.
Conclusion
The revolutionary potential of Natural Language Processing (NLP) on business operations is significant and wide-reaching. Early adoption, investment in technology, upskilling of employees, and strategic review mechanisms can ensure that businesses stay competitive in the AI-driven era.
Read the original article
by jsendak | Mar 3, 2024 | DS Articles
[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].
Antoine Luciano, Robin Ryder and I posted a revised version of our insufficient Gibbs sampler on arXiv last week (along with three other revisions or new deposits of mine!), following comments and suggestions from referees. Thanks to this revision, we realised that the evidence based on an (insufficient) statistic was also available for approximation by a Monte Carlo estimate attached to the completed sample simulated by the insufficient sampler. Better, a bridge sampling estimator can be used in the same conditions as when the full data is available! In this new version, we thus revisited toy examples first explored in some of my ABC papers on testing (with insufficient statistics), as illustrated by both graphs on this post.
Continue reading: insufficient Gibbs sampling bridges as well!
Understanding the Refinements in Insufficient Gibbs Sampler
Luciano, Ryder, and Xi’an recently released an updated version of their insufficient Gibbs sampler on arXiv, incorporating revisions based on feedback received from referees. A major development in the updated version is that evidence based on an insufficient statistic is now applicable for approximation by a Monte Carlo estimate linked to the sample completed by the insufficient sampler.
Furthermore, the updated Gibbs sampler can implement a bridge sampling estimator under similar conditions as when the full data is available. This means that the revised Gibbs sampler affords more comprehensive and accurate insights from the available data. The authors also revisited toy examples first incorporated in some of Xi’an’s papers on testing with insufficient statistics. These examples were again explored to illustrate the enhancements in the new methodology.
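For readers unfamiliar with bridge sampling, the generic identity the estimator rests on (Meng and Wong, 1996) can be sketched as follows; this is the textbook form, not the authors' specific construction for the insufficient-statistic setting:

```latex
% Ratio of normalising constants Z_1/Z_2 of two unnormalised densities
% \tilde q_1, \tilde q_2 on a common support, for any bridge function \alpha:
\frac{Z_1}{Z_2}
  = \frac{\mathbb{E}_{p_2}\!\left[\tilde q_1(\theta)\,\alpha(\theta)\right]}
         {\mathbb{E}_{p_1}\!\left[\tilde q_2(\theta)\,\alpha(\theta)\right]}
  \approx \frac{N_2^{-1}\sum_{j=1}^{N_2} \tilde q_1\big(\theta_j^{(2)}\big)\,\alpha\big(\theta_j^{(2)}\big)}
               {N_1^{-1}\sum_{i=1}^{N_1} \tilde q_2\big(\theta_i^{(1)}\big)\,\alpha\big(\theta_i^{(1)}\big)},
\qquad \theta_i^{(1)} \sim p_1,\;\; \theta_j^{(2)} \sim p_2 .
```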
Long-term Implications and Future Developments
The improvements in the Gibbs sampler have the potential to significantly enhance the quality of statistical analysis and insights generated therefrom. This development could have far-reaching implications in fields such as data science, economics, and any field that relies on the use of statistics for informed decision-making.
By maximizing the use of available data through even an insufficient Gibbs sampler, analysts are enabled to gain deeper insights and make more informed predictions. The practical implications of this development range from improved business strategy planning, to more accurate economic forecasting, and more targeted marketing strategies.
Potential Future Developments
Despite the progress, there remain several frontiers for exploration. For example, the application of the revised Gibbs sampler to more complex statistical models could yield further insights. Additionally, continued improvements could help refine and enhance the robustness and reliability of results generated using the Gibbs sampler.
Actionable Insights
Organizations and individuals who rely on statistical analysis for informed decision-making should consider integrating this updated version of the Gibbs sampler into their analytical setup. Training and development programs focused on this tool could be beneficial in familiarizing analysts with the workings of the revised Gibbs sampler.
Staying informed about future developments in this field is equally essential, as advancements continue to streamline and enhance statistical analysis techniques and applications. Consequently, keeping a close eye on related academic papers and maintaining an active participation in relevant industry discussions could prove to be a valuable practice.
Evolving statistical methodology, such as the revised Gibbs sampler, offers enriched insights from available data, providing the foundation for improved decision-making across numerous fields.
Read the original article
by jsendak | Mar 3, 2024 | DS Articles
Want to start your data science journey from home, for free, and work at your own pace? Have a dive into this data science roadmap using the YouTube series.
Expanding Your Data Science Skills: Implications and Future Developments
The evolution of technology has broadened the scope of professions worldwide. One such domain that has gained tremendous popularity and importance in recent years is Data Science. The emergence of myriad online resources like YouTube series for honing data science skills has allowed countless individuals to begin or continue their data science journey at their convenience, right from the comfort of their homes.
Implications of Learning Data Science at Home
The joy of self-paced learning is its flexibility and convenience. It empowers working professionals to upskill without giving up their current jobs and allows students to learn at a pace that suits them best. Further, with free resources, one can gain the necessary knowledge without burning a hole in one’s pocket.
Anyone with a determination to learn can now start their data science journey from home, for free, and work at their own pace. With resources like YouTube series, gaining proficiency in data science has become more accessible than ever.
Long Term Implications
Although the immediate benefits of learning data science at home are quite apparent, the long-term implications are even more profound. It can lead to career growth or an entirely new career in data science, irrespective of the person’s educational background. Additionally, the acquired data science skills can be applied to a wide range of industries, offering broad job prospects. Such skills are becoming increasingly valuable in the era of digital transformation and data-driven decision making.
Future Developments
While this learning format has its merits, there is always scope for improvement. Future developments in this learning mode could include more interactive content, personalized learning paths, advancement in projecting complex concepts through visual content etc. AI and ML algorithms can also be deployed to provide customized support and recommendations for users to enhance their learning experiences.
Actionable Steps to Accelerate Your Data Science Journey
- Plan your studies: Outline your learning journey. Break down the learning path into manageable milestones, and celebrate when you achieve them.
- Collaborative learning: Join study groups or forums consisting of people who’re also learning data science. This will foster a sense of community and create a space for mutual aid.
- Hands-on learning: Aim to apply your knowledge through projects or exercises. Hands-on practice will significantly improve your proficiency.
- Stay updated: The field of data science is always evolving. Stay updated with the latest trends and developments.
To conclude, the advancement of technology has made it possible to learn and excel in subjects such as data science. Now, all it requires from the learner’s end is consistency and active participation in the learning process.
Read the original article
by jsendak | Mar 3, 2024 | DS Articles
Generative AI (GenAI) chatbots like Microsoft Copilot (formerly Bing AI), Google’s Gemini (formerly Google Bard), and OpenAI ChatGPT (still OpenAI ChatGPT) are driving extraordinary productivity improvements by assisting knowledge workers in providing highly relevant information, answering questions, and engaging in wide-ranging exploratory conversations. However, the Wall Street Journal article “Microsoft’s most ambitious AI upgrade could…” Read more: GenAI: The User Interface to Artificial Intelligence (AI)?
Analysis of Key Points
The text highlights the significant role of Generative AI (GenAI) Chatbots, including Microsoft Copilot, Google’s Gemini, and OpenAI ChatGPT, in enhancing productivity. These sophisticated tools serve knowledge workers in diverse ways, from providing essential information to facilitating extensive exploratory discussions.
Long-term Implications
The Emergence of GenAI as a Primary User Interface
Increasingly, GenAI technology is becoming a primary user interface for Artificial Intelligence (AI). With tools like Microsoft’s Copilot, OpenAI’s ChatGPT, and Google’s Gemini, data access and manipulation are becoming not only simpler but also more interactive. In the future, this trend is likely to continue, with these chatbots becoming more intuitive and more contextually aware.
Reliance on AI for Knowledge Work
The long-term implications also include the increased dependence of knowledge workers on AI applications for data handling. AI support in sifting through big data sets, answering complex questions, and engaging in sophisticated discussions is expected to increase over time, altering how tasks in knowledge-intensive sectors are performed.
Future Developments
Advancement in Conversational Abilities
In terms of future developments, the conversational capabilities of ChatGPT and other GenAI chatbots are expected to evolve. Their ability to understand and respond to user inputs will improve, allowing them to engage in more intricate, multifaceted conversations.
Integration with Different Tasks and Platforms
Another potential development could be the integration of GenAI chatbots with various tasks and platforms. The future may see these applications embedded in different software, web and mobile apps, offering a more personalized and interactive user experience in various fields—from productivity tools to entertainment platforms.
Actionable Advice
Incorporating GenAI into Business Operations
Given the rapidly increasing role of GenAI chatbots in improving productivity, businesses should explore ways to integrate these technologies into their operations. They can serve a variety of functions, like customer service representatives, data analysts, or even virtual assistants for employees.
Training and Skill Development
Companies should also invest in training and development, helping their workforce to adapt to this changing digital landscape. Understanding how to effectively use these AI interfaces is critical to maximize their potential benefits.
Keeping Abreast with AI Developments
Lastly, keeping up to date with the latest trends and advancements in AI, in particular GenAI, is critical. This knowledge allows businesses to adapt and leverage any new capabilities these chatbots may gain over time, thereby ensuring they continue to reap the benefits of these evolving technologies.
Read the original article