Intro to scikit-learn

In this talk by AWeber's Michael Becker, you will get a brief overview of machine learning and scikit-learn. This is a scaled-down version of this talk from PyCon 2013: http://github.com/jakevdp/sklearn_pycon2013

Intro to scikit-learn
Michael Becker
PyData Boston 2013

Who is this guy?
Software Engineer @ AWeber
Founder of the DataPhilly Meetup group
@beckerfuffle
beckerfuffle.com
These slides and more @ github.com/mdbecker

On the shoulders of giants
• Machine Learning 101 tutorial from scikit-learn.
• IPython notebooks from PyCon 2013.

Data in scikit-learn
• Stored as a 2d-array
• [n_samples, n_features]
• n_samples: items to process
• n_features: distinct traits
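
To make the [n_samples, n_features] layout concrete, here is a minimal sketch using NumPy. The values are a made-up toy dataset (an assumption for illustration, not taken from the talk); each row is one sample and each column is one feature.

```python
import numpy as np

# Hypothetical toy data: 4 samples (rows) described by 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

# The shape of the array is (n_samples, n_features).
n_samples, n_features = X.shape
print("n_samples:", n_samples)
print("n_features:", n_features)
```

scikit-learn estimators generally expect their input in this shape, so getting data into a 2d array like `X` is usually the first step.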

Feature Extraction
Often data is unstructured & non-numerical:
• Text documents
• Images
• Sounds
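
As one possible illustration for the text case, the sketch below turns a few made-up sentences into a numerical [n_samples, n_features] matrix with scikit-learn's `CountVectorizer` (a bag-of-words word count). The documents are hypothetical, and this is only one of several feature extraction approaches; the talk does not say which one it uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example documents (not from the talk).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Learn a vocabulary and count how often each word occurs in each document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(X.shape)                        # (n_samples, n_features)
print(sorted(vectorizer.vocabulary_))  # the learned words, one per feature column
```

Each column of `X` corresponds to one word in the learned vocabulary, so unstructured text ends up in the same 2d layout as any other scikit-learn input.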

Supervised Learning: Classification
• Email classification
• Language identification
• News article categorization
• Sentiment analysis
• Facial recognition
• ...
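
A rough sketch of what one of these tasks, language identification, might look like in scikit-learn. The training sentences, the labels, and the choice of character n-gram counts with a naive Bayes classifier are all assumptions made for illustration; the talk does not prescribe a particular model, and a real system would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, tiny labeled training set: text plus its language label.
train_texts = [
    "the weather is nice today",
    "I would like a cup of coffee",
    "where is the nearest library",
    "el clima es agradable hoy",
    "quisiera una taza de cafe",
    "donde esta la biblioteca",
]
train_labels = ["en", "en", "en", "es", "es", "es"]

# Features: counts of character 1- to 3-grams. Classifier: multinomial naive Bayes.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    MultinomialNB(),
)

# Supervised learning: fit on labeled examples, then predict labels for new text.
model.fit(train_texts, train_labels)
new_texts = ["where is the train station", "donde esta la estacion de tren"]
predicted = model.predict(new_texts)

for text, label in zip(new_texts, predicted):
    print(label, "<-", text)
```

The same fit/predict pattern applies to the other tasks in the list; mainly the feature extraction step and the labels change.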

Additional Resources
• Machine Learning 101 tutorial from scikit-learn.
• IPython notebooks from PyCon 2013.

My info
Tweet me @beckerfuffle
Find me at beckerfuffle.com
These slides and more @ github.com/mdbecker


Powerful Google developer tools for immediate impact! (2023-24 C)

wesley chun

How to Troubleshoot Apps for the Modern Connected Worker

ThousandEyes

Axa Assurance Maroc - Insurer Innovation Award 2024

The Digital Insurer

💉💊+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHABI}}+971581248768 +971581248768 Mtp-Kit (500MG) Prices » Dubai [(+971581248768**)] Abortion Pills For Sale In Dubai, UAE, Mifepristone and Misoprostol Tablets Available In Dubai, UAE CONTACT DR.Maya Whatsapp +971581248768 We Have Abortion Pills / Cytotec Tablets /Mifegest Kit Available in Dubai, Sharjah, Abudhabi, Ajman, Alain, Fujairah, Ras Al Khaimah, Umm Al Quwain, UAE, Buy cytotec in Dubai +971581248768''''Abortion Pills near me DUBAI | ABU DHABI|UAE. Price of Misoprostol, Cytotec” +971581248768' Dr.DEEM ''BUY ABORTION PILLS MIFEGEST KIT, MISOPROTONE, CYTOTEC PILLS IN DUBAI, ABU DHABI,UAE'' Contact me now via What's App…… abortion Pills Cytotec also available Oman Qatar Doha Saudi Arabia Bahrain Above all, Cytotec Abortion Pills are Available In Dubai / UAE, you will be very happy to do abortion in Dubai we are providing cytotec 200mg abortion pill in Dubai, UAE. Medication abortion offers an alternative to Surgical Abortion for women in the early weeks of pregnancy. We only offer abortion pills from 1 week-6 Months. We then advise you to use surgery if its beyond 6 months. Our Abu Dhabi, Ajman, Al Ain, Dubai, Fujairah, Ras Al Khaimah (RAK), Sharjah, Umm Al Quwain (UAQ) United Arab Emirates Abortion Clinic provides the safest and most advanced techniques for providing non-surgical, medical and surgical abortion methods for early through late second trimester, including the Abortion By Pill Procedure (RU 486, Mifeprex, Mifepristone, early options French Abortion Pill), Tamoxifen, Methotrexate and Cytotec (Misoprostol). The Abu Dhabi, United Arab Emirates Abortion Clinic performs Same Day Abortion Procedure using medications that are taken on the first day of the office visit and will cause the abortion to occur generally within 4 to 6 hours (as early as 30 minutes) for patients who are 3 to 12 weeks pregnant. 
When Mifepristone and Misoprostol are used, 50% of patients complete in 4 to 6 hours; 75% to 80% in 12 hours; and 90% in 24 hours. We use a regimen that allows for completion without the need for surgery 99% of the time. All advanced second trimester and late term pregnancies at our Tampa clinic (17 to 24 weeks or greater) can be completed within 24 hours or less 99% of the time without the need surgery. The procedure is completed with minimal to no complications. Our Women's Health Center located in Abu Dhabi, United Arab Emirates, uses the latest medications for medical abortions (RU-486, Mifeprex, Mifegyne, Mifepristone, early options French abortion pill), Methotrexate and Cytotec (Misoprostol). The safety standards of our Abu Dhabi, United Arab Emirates Abortion Doctors remain unparalleled. They consistently maintain the lowest complication rates throughout the nation. Our Physicians and staff are always available to answer questions and care for women in one of the most difficult times in their lives. The decision to have an abortion at the Abortion Cl

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@

Discord is a free app offering voice, video, and text chat functionalities, primarily catering to the gaming community. It serves as a hub for users to create and join servers tailored to their interests. Discord’s ecosystem comprises servers, each functioning as a distinct online community with its own channels dedicated to specific topics or activities. Users can engage in text-based discussions, voice calls, or video chats within these channels. Understanding Discord Servers Discord servers are virtual spaces where users congregate to interact, share content, and build communities. Servers may revolve around gaming, hobbies, interests, or fandoms, providing a platform for like-minded individuals to connect. Communication Features Discord offers a range of communication tools, including text channels for messaging, voice channels for real-time audio conversations, and video channels for face-to-face interactions. These features facilitate seamless communication and collaboration. What Does NSFW Mean? The acronym NSFW stands for “Not Safe For Work,” indicating content that may be inappropriate for professional or public settings. NSFW Content NSFW content encompasses material that is sexually explicit, violent, or otherwise graphic in nature. It often includes nudity, profanity, or depictions of sensitive topics.

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf

UK Journal

Created by Mozilla Research in 2012 and now part of Linux Foundation Europe, the Servo project is an experimental rendering engine written in Rust. It combines memory safety and concurrency to create an independent, modular, and embeddable rendering engine that adheres to web standards. Stewardship of Servo moved from Mozilla Research to the Linux Foundation in 2020, where its mission remains unchanged. After some slow years, in 2023 there has been renewed activity on the project, with a roadmap now focused on improving the engine’s CSS 2 conformance, exploring Android support, and making Servo a practical embeddable rendering engine. In this presentation, Rakhi Sharma reviews the status of the project, our recent developments in 2023, our collaboration with Tauri to make Servo an easy-to-use embeddable rendering engine, and our plans for the future to make Servo an alternative web rendering engine for the embedded devices industry. (c) Embedded Open Source Summit 2024 April 16-18, 2024 Seattle, Washington (US) https://events.linuxfoundation.org/embedded-open-source-summit/ https://ossna2024.sched.com/event/1aBNF/a-year-of-servo-reboot-where-are-we-now-rakhi-sharma-igalia

A Year of the Servo Reboot: Where Are We Now?

Igalia

Effective data discovery is crucial for maintaining compliance and mitigating risks in today's rapidly evolving privacy landscape. However, traditional manual approaches often struggle to keep pace with the growing volume and complexity of data. Join us for an insightful webinar where industry leaders from TrustArc and Privya will share their expertise on leveraging AI-powered solutions to revolutionize data discovery. You'll learn how to: - Effortlessly maintain a comprehensive, up-to-date data inventory - Harness code scanning insights to gain complete visibility into data flows leveraging the advantages of code scanning over DB scanning - Simplify compliance by leveraging Privya's integration with TrustArc - Implement proven strategies to mitigate third-party risks Our panel of experts will discuss real-world case studies and share practical strategies for overcoming common data discovery challenges. They'll also explore the latest trends and innovations in AI-driven data management, and how these technologies can help organizations stay ahead of the curve in an ever-changing privacy landscape.

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc

Scaling API-first – The story of a global engineering organization

Radu Cotescu

Recently uploaded (20)

presentation ICT roal in 21st century education

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Tata AIG General Insurance Company - Insurer Innovation Award 2024

A Domino Admins Adventures (Engage 2024)

GenAI Risks & Security Meetup 01052024.pdf

Top 10 Most Downloaded Games on Play Store in 2024

MINDCTI Revenue Release Quarter One 2024

Data Cloud, More than a CDP by Matt Robison

Artificial Intelligence: Facts and Myths

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

AWS Community Day CPH - Three problems of Terraform

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving

Powerful Google developer tools for immediate impact! (2023-24 C)

How to Troubleshoot Apps for the Modern Connected Worker

Axa Assurance Maroc - Insurer Innovation Award 2024

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf

A Year of the Servo Reboot: Where Are We Now?

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

Scaling API-first – The story of a global engineering organization

Intro to scikit-learn

1. Intro to scikit-learn Michael Becker PyData Boston 2013

2. Who is this guy? Software Engineer @ AWeber Founder of the DataPhilly Meetup group @beckerfuffle beckerfuffle.com These slides and more @ github.com/mdbecker

3. On the shoulders of giants • Machine Learning 101 tutorial from scikit-learn.

4. On the shoulders of giants • Machine Learning 101 tutorial from scikit-learn. • IPython notebooks from pycon 2013.

5. What is Machine Learning?

6. What is Machine Learning?

7. Data in scikit-learn • Stored as a 2d-array • [n_samples, n_features] • n_samples: items to process • n_features: distinct traits

8. The Iris Dataset

9. The Iris Dataset

10. The Iris Dataset

11. The Iris Dataset: Loading

12. The Iris Dataset: Loading

13. The Iris Dataset: Loading

14. The Iris Dataset: Loading

15. The Iris Dataset: Loading

16. The Iris Dataset: Loading

17. The Iris Dataset: Loading

18. Machine Learning: Supervised

19. Machine Learning: Unsupervised

20. Scikit-learn's interface

21. Scikit-learn's interface

22. Feature Extraction Often data is unstructured & non-numerical: •Text documents

23. Feature Extraction Often data is unstructured & non-numerical: •Text documents •Images

24. Feature Extraction Often data is unstructured & non-numerical: •Text documents •Images •Sounds

25. Supervised Learning: Classification

26. Supervised Learning: Classification

27. Supervised Learning: Classification

28. Supervised Learning: Classification

29. Supervised Learning: Classification

30. Supervised Learning: Classification

31. Supervised Learning: Classification

32. Supervised Learning: Classification

33. Supervised Learning: Classification

34. Supervised Learning: Classification • Email classification • Language identification • New article categorization • Sentiment analysis • Facial recognition • ...

35. Unsupervised Learning

36. Dimensionality Reduction

37. Principal Component Analysis

38. Unsupervised Learning: PCA

39. Unsupervised Learning: PCA

40. Unsupervised Learning: PCA

41. Unsupervised Learning: PCA

42. Validation & Testing

43. Validation & Testing

44. Validation & Testing

45. Overfitting

46. Cross-Validation

47. Cross-Validation

48. Cross-Validation

49. Additional Resources • Machine Learning 101 tutorial from scikit-learn.

50. Additional Resources • IPython notebooks from pycon 2013.

51. My info Tweet me @beckerfuffle Find me at beckerfuffle.com These slides and more @ github.com/mdbecker

Editor's Notes

Good morning everyone. My name is Michael Becker. I work in the Data Analysis and Management team at AWeber, an email marketing company in Chalfont, PA. I'm also the founder of the DataPhilly Meetup group. You can find me online @beckerfuffle on Twitter, at beckerfuffle.com, and as mdbecker on GitHub. I'll be posting the materials for this talk on my GitHub.
So I want to start this talk by thanking those who came before me. None of the content from this talk is original. It's been influenced heavily by various other talks and resources around the web. This talk is based primarily on the "Machine Learning 101" tutorial from the scikit-learn documentation.
Additional thanks also to Jake VanderPlas for creating an excellent set of IPython notebooks for PyCon 2013, which I've used for my code samples. This talk will only cover a subset of what's available in these resources. I recommend you have a look at those to learn more about scikit-learn. I’m not currently a contributor to the scikit-learn project or in any way affiliated with it. I’m just a very happy user.
Machine learning algorithms can figure out how to perform important tasks based on previously seen data. To illustrate this point, let's take a look at two simple machine learning tasks. This plot represents data of two types. One is colored red; the other is colored blue. A classification algorithm may be used to draw a dividing line between the two clusters of points. This task may seem simple, but it illustrates an important point. By drawing this separating line, we have created a model which can generalize to new data: if you drop another point onto the plane which is unlabeled, this model can predict whether it's a blue or a red point.
This plot shows a series of values that appear correlated. A plot like this could for example represent the prices of houses on the y axis and the square footage of those houses on the x axis. We can pretty easily fit a line to this set of data. Again, this is an example of fitting a model to data, such that the model can make generalizations about new data. The model has been learned from the training data, and can be used to predict the result of test data: we might be given an x-value (square footage), and the model would allow us to predict the y value (price). Again, this might seem like a trivial task, but it is a basic example of the type of problem you can solve with Machine Learning.
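The line-fitting idea above can be sketched in a few lines. This is only an illustration: the square footages and prices are made-up numbers chosen to lie exactly on a line, not data from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: square footage (x) and price (y).
# The prices are invented so that price = 100 * sqft + 50000 exactly.
sqft = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
price = 100.0 * sqft.ravel() + 50000.0

# Fit a line to the training data.
model = LinearRegression()
model.fit(sqft, price)

# Predict the price of a house that was not in the training data.
predicted = model.predict(np.array([[1800.0]]))
print(predicted)
```

Because the invented data is perfectly linear, the prediction for 1800 square feet lands on the same line as the training points.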
Data in scikit-learn is usually represented as a 2d-array. The size of the array is expected to be [n_samples, n_features]. n_samples refers to the number of samples: each sample is an item to process. A sample can be a document, a picture, a sound, a row in a database, or anything you can describe with a fixed set of quantitative traits. n_features refers to the number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases. You have to choose your features in advance, but you can have as few or as many features as you want. Some of your features can represent traits that are relatively rare in your data set. In this case the feature would be set to zero for samples where it is not found.
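As a minimal sketch of this layout, here is a small array with one row per sample and one column per feature. The values are arbitrary placeholders.

```python
import numpy as np

# Three hypothetical samples (rows), each described by two features (columns).
X = np.array([
    [5.1, 3.5],
    [4.9, 3.0],
    [6.2, 2.9],
])

# The shape of the array is (n_samples, n_features).
n_samples, n_features = X.shape
print(n_samples, n_features)
```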
As an example of a simple dataset, let's look at the iris dataset which comes with scikit-learn. The data consists of measurements of three different species of irises.
The data contains 4 features for each sample. Each sample represents an individual flower. For each flower the features are: sepal length, sepal width, petal length, and petal width.
The iris dataset also contains the species of flower which is one of 3 classes.
scikit-learn embeds a copy of the iris data along with a helper function to load it into numpy arrays.
The resulting dataset is called a “Bunch” object: you can see what's available using the method keys().
The features of the sampled flowers are stored in the data attribute of the dataset. Data is a 2d array of 150 samples by 4 features. Here we can see what an individual sample looks like.
The information about the class of each sample is stored in the target attribute of the dataset. While data is a 2d array...
target is a 1d array with 1 class per sample (150).
The names of the classes are stored in the target_names attribute. This can be used to convert the numerical target values to a human readable format.
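The loading steps described in the last few notes can be sketched as follows. This is a reconstruction using the public load_iris helper, not a copy of the code shown on the slides.

```python
from sklearn.datasets import load_iris

# Load the bundled iris data into a Bunch object.
iris = load_iris()

# See which fields are available.
print(iris.keys())

# data is a 2d array: 150 samples by 4 features.
print(iris.data.shape)
print(iris.data[0])

# target is a 1d array with one class index per sample.
print(iris.target.shape)

# target_names maps class indices to human-readable species names.
print(iris.target_names)
print(iris.target_names[iris.target[0]])
```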
The iris data has 4 features. We can’t easily visualize all 4 features plus the labels in a 2 or 3 dimensional graph. However, one method for visualizing this data could be to plot two of the dimensions using a simple scatter-plot. In this plot we’ve graphed sepal width on the y-axis and sepal length on the x-axis. The blue class seems reasonably distinct in this visualization. Unfortunately, it's hard to visually separate the green and the red classes using this technique.
Now let’s explore the different types of machine learning. Machine learning can be broken into two broad categories: supervised learning and unsupervised learning. In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. Using our iris data as an example, we could try to predict the species of iris given a set of measurements of its flower. Supervised learning can be further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous. Our iris labels are discrete, there are only 3 possible values. Therefore predicting the species based on flower measurements would be a classification task.
Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the samples. You can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction and clustering. For example, in the iris data, we can use unsupervised methods to determine combinations of the features which are good at visualizing the structure of the data in 2 dimensions. We’ll see an example of this later. Sometimes you can even combine supervised and unsupervised learning. For example, unsupervised learning can be used to find useful features in the data, and then these features can be used within a supervised model.
In scikit-learn, almost all operations are done through an estimator object. For example, a linear regression estimator can be instantiated as follows:
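The slide's code is not preserved in these notes. A minimal sketch of instantiating such an estimator might look like this, where passing fit_intercept explicitly is just an illustrative choice.

```python
from sklearn.linear_model import LinearRegression

# Create the estimator; parameters controlling the algorithm go in the constructor.
model = LinearRegression(fit_intercept=True)
print(model)
```

At this point nothing has been learned yet: the object only holds its configuration until fit is called.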
Scikit-learn strives to have a uniform interface across all methods. Given a scikit-learn estimator object (named model), the following methods are available. All estimators have a fit method. The fit method fits the model to a set of training data. Supervised estimators can have a few methods. All supervised estimators have a predict method: given a trained model, this method predicts the label of a new set of data. For classification problems, some estimators also provide the predict_proba method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by the predict method. For classification or regression problems, most estimators implement a score method, where a larger score indicates a better fit. For classifiers, score returns the mean accuracy, which lies between 0 and 1. For regressors, it returns the R² coefficient of determination, which is at most 1 and can be negative for a very poor fit. Unsupervised estimators have a few unique methods. The transform method transforms data into a new format. Some estimators implement the fit_transform method, which more efficiently performs a fit and a transform on the same input data.
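The methods listed above can be exercised on the iris data. The sketch below assumes LogisticRegression as an example of a classifier that provides predict_proba, and PCA as an example of an unsupervised estimator; neither choice comes from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised estimator: fit, predict, predict_proba, score.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
labels = clf.predict(X[:3])        # one predicted label per sample
probs = clf.predict_proba(X[:3])   # one probability per class, per sample
accuracy = clf.score(X, y)         # mean accuracy for a classifier

# Unsupervised estimator: fit_transform does fit and transform in one call.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(labels, probs.shape, accuracy, X_2d.shape)
```

Note that the score here is computed on the training data, purely to show the call.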
Often, data does not come in a nice, structured CSV file where every column measures the same thing. Let’s explore some common methods for extracting features in these cases. Text documents: count the frequency of each word or pair of consecutive words (n-grams) in each document. This approach is called Bag of Words.
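A Bag of Words over single words and pairs of consecutive words can be sketched with scikit-learn's CountVectorizer. The two documents are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
docs = ["the cat sat on the mat", "the dog sat"]

# ngram_range=(1, 2) counts single words and pairs of consecutive words.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(X.shape)
print(sorted(vectorizer.vocabulary_))

# Count of the word "the" in the first document.
print(X[0, vectorizer.vocabulary_["the"]])
```

The result has the [n_samples, n_features] layout described earlier, with one feature per distinct word or word pair.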
Extracting features from images: rescale the picture to a fixed size and take all the raw pixel values (with or without luminosity normalization); take some transformation of the image; or perform local feature extraction: split the image into small regions, perform feature extraction locally in each area, then combine all the features of the individual areas into a single array.
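Two of these strategies can be sketched with plain numpy. The example assumes a random 8x8 grayscale array standing in for an image already rescaled to a fixed size, and uses the mean intensity of each region as a deliberately simple local feature.

```python
import numpy as np

# Hypothetical 8x8 grayscale image, already at a fixed size.
rng = np.random.default_rng(0)
image = rng.random((8, 8))

# Strategy 1: use all raw pixel values as the feature vector.
raw_features = image.ravel()

# Strategy 2: local feature extraction.
# Split the image into four 4x4 regions and take each region's mean intensity,
# then combine the per-region values into a single array.
blocks = image.reshape(2, 4, 2, 4)
local_features = blocks.mean(axis=(1, 3)).ravel()

print(raw_features.shape, local_features.shape)
```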
Extracting features from Sounds: Same type of strategies as for images; the difference is it’s a 1D rather than 2D space.
Now that we’ve covered all the basics, let’s train a classification model using the iris dataset. First let's load the iris data like before.
Let’s say that we were assigned the task of guessing the class of an individual flower given the measurements of petals and sepals. This is a classification task. You’ll note that we’re using the uppercase variable X to represent our data and the lowercase variable y to represent our targets. These two variable names are conventional in the Machine Learning field, so you will likely see this format frequently. Once the data has this format it is trivial to train a classifier...
...for example let’s try out a support vector machine.
The first thing to do is to create an instance of the classifier. This can be done simply by calling the class name with any arguments that the object accepts.
clf is a statistical model that has parameters that control the learning algorithm. Those parameters can be supplied by the user in the constructor of the model. Each estimator has different parameters. There are several methods for choosing good values for each parameter. I won’t cover these methods in this talk, but these are covered in Jake Vanderplas’ ipython notebooks.
By default the model's fit parameters are not initialized. They will be tuned automatically from the data by calling the fit method with the data - X and labels - y
We can now see some of the fit parameters within the classifier object. In scikit-learn, parameters defined during training have a trailing underscore.
Once the model is trained, it can be used to predict the most likely outcome on new data. For instance, let us define a list of simple samples that look like the first sample of the iris dataset. Our model predicts the sample is of class 0.
So now that we’ve trained our first model, let’s revisit the previous diagram and see where our fit and predict calls fit in. We can see that we called fit with our vectorized features. We were able to skip the vectorization step in this example because the iris dataset is already vectorized. We can see that once fit was called, we called predict on a new data point and got as output an expected label.
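The whole sequence (instantiate, fit, inspect the fitted parameters, predict) can be sketched as below. The notes only say "support vector machine", so the choice of SVC with a linear kernel and the exact values of the new sample are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Create the classifier; a linear kernel is assumed here for illustration.
clf = SVC(kernel="linear")

# Fit the model to the data X and the labels y.
clf.fit(X, y)

# Parameters learned during training have a trailing underscore.
print(clf.coef_.shape)

# A hypothetical new sample resembling the first iris sample.
X_new = [[5.0, 3.6, 1.3, 0.25]]
predicted = clf.predict(X_new)
print(predicted, iris.target_names[predicted])
```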
Classification involves predicting an unknown category based on observed features. Let’s go over a few examples of interesting classification tasks. E-mail classification: labeling email as spam or ham. Language identification: labeling documents as English, Spanish, etc. News article categorization: labeling articles as business, technology, and so on. Sentiment analysis: labeling customer feedback as negative, neutral, or positive. Facial recognition: labeling images as matching or not matching a person.
Let’s revisit unsupervised learning. The major difference between supervised and unsupervised learning is that in the case of unsupervised learning, our data is unlabeled. Previously we visualized the iris data by plotting pairs of dimensions. Here we will use an unsupervised dimensionality reduction algorithm to improve on our previous technique.
Dimensionality reduction is the task of deriving a set of new abstract features that is smaller than the original feature set while retaining most of the variance of the original data. Here we'll use a common but powerful dimensionality reduction technique called Principal Component Analysis (PCA). We'll perform PCA on the iris dataset that we saw before: Since this is unsupervised learning, target (y) will be unused, however we'll use it later to visualize our results.
PCA allows you to re-express a set of data points in terms of basic components that explain the most variance in the data. This is accomplished by combining the original features. If the number of retained components is 2 or 3, PCA can be used to visualize the dataset.
We’ve used PCA to transform our original 4d data into 2d data
PCA centers the data, and with whitening enabled it also rescales each component to unit variance. This means that the data is now centered on both components. The mean of both of the artificial components is 0...
... and the standard deviation is 1
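The PCA steps above can be sketched as follows. The parameters n_components=2 and whiten=True are assumptions consistent with the 2d output and unit standard deviation described in the notes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Reduce the 4d data to 2 components; whiten=True rescales each to unit variance.
pca = PCA(n_components=2, whiten=True)
X_pca = pca.fit_transform(X)  # y is not used: this is unsupervised

print(X_pca.shape)

# Each component has mean approximately 0.
print(X_pca.mean(axis=0))

# Each component has standard deviation approximately 1.
# ddof=1 uses the sample estimate, matching how the whitening is scaled.
print(X_pca.std(axis=0, ddof=1))
```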
Now we can visualize the iris dataset along the two new dimensions. Note that this visualization was generated without any information about the labels (y) (represented by the colors): this is the sense in which the learning is unsupervised. Even so, we see that the projection gives us insight into the distribution of the different flowers: notably, the red class is much more distinct than the other two species. And even among the green and blue classes, there is a pretty good division line that can be drawn.
The last thing we’ll cover in this talk is validation and testing. The most common mistake beginners make when training statistical models is to evaluate the quality of the model on the same data used for fitting the model.
Here we're training the classifier with all the data.
We’re getting pretty high accuracy with this model. Question: what might be the problem with this approach?
The problem is that some models can be subject to overfitting: they can learn the training data by heart without generalizing. The symptoms are: the accuracy on the data used for training can be excellent (sometimes 100%), while the models do little better than random predictions when facing new data that was not part of the training set. If you evaluate your model on your training data you won’t be able to tell whether your model is overfitting or not.
Learning the parameters of a prediction function and testing it on the same data is a mistake: a model that would just repeat the labels of the samples that it has seen would have a perfect score but would fail to predict anything useful on new data. To avoid over-fitting, we have to define two different sets: a training set (X_train, y_train), which is used for training the model, and a testing set (X_test, y_test), which is used for evaluating the fitted model. In scikit-learn such a random split can be quickly computed with the train_test_split helper function.
Using train_test_split, we can train on the training data... ...and test on the testing data.
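A sketch of the split-then-evaluate pattern follows. It imports train_test_split from sklearn.model_selection, where it lives in current scikit-learn releases (older releases exposed it under sklearn.cross_validation). The classifier, test_size and random_state are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Randomly hold out a quarter of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train on the training set only.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Evaluate on data the model has never seen.
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(train_accuracy, test_accuracy)
```

A large gap between the two accuracies would be a sign of overfitting.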
There is an issue here, however: by defining these two sets, we significantly reduce the number of samples which can be used for training the model, and the results can depend on a particular random choice for the pair of (train, test) sets. A solution is to split the whole dataset a few times randomly into different training and testing sets, and to calculate the average value of the prediction scores obtained with the different sets. Such a procedure is called cross-validation. This approach can be computationally expensive, but does not waste too much data. Information on cross validation, and a lot of other awesome things which I haven’t covered can be found in the following resources.
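The cross-validation procedure described above (several random splits, then the average score) can be sketched with ShuffleSplit and cross_val_score. The number of splits, the test size and the classifier are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Five independent random train/test splits of the whole dataset.
cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

# Fit and score a fresh copy of the classifier on each split.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)

print(scores)
print(scores.mean())  # average of the per-split test accuracies
```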
Thanks go to Jake VanderPlas for creating an excellent set of IPython notebooks for PyCon 2013, which I've used for my code samples.
You can find me online @beckerfuffle on Twitter, at beckerfuffle.com, and as mdbecker on GitHub. I'll be posting the materials for this talk on my GitHub.
