
Updating Computer
Science Education
Jacques Cohen
Brandeis University
Waltham, MA
USA
January 2007

Topics

- Preliminary remarks
- Present state of affairs and concerns
- Objectives of this talk
- Trends (hardware, software, networks, others)
- Illustrative examples
- Suggestions

Present state of affairs and concerns

- Huge increase in PC and internet usage
- Decreasing enrollment (mainly in the USA)

Possible Reasons

- Previous high school preparation
- Bubble burst (2000) + outsourcing
- Widespread usage of computers by lay persons
- Interest in interdisciplinary topics (e.g., biology, business, economics)
- Public perception about: What is Computer Science?

The Nature of Computer Science

- Two main components: Theoretical and Experimental (Mathematics and Engineering)
- What characterizes CS is the notion of Algorithms
- Emphasis on the discrete and on logic
- An interdisciplinary approach with other sciences may well revive interest in the continuous (or the use of qualitative reasoning)

Related fields

- Sciences in general (scientific computing)
- Management
- Psychology (human interaction)
- Business
- Communications
- Journalism
- Arts, etc.

The role of Computer Science among other sciences
(How we are perceived by the other sciences)

- In physics, chemistry, and biology, nature is the ultimate umpire; discovery is paramount
- In math and engineering: aesthetics, ease of use, acceptance, and permanence play key roles

Uneasy dialogue with biologists

It is not unusual to hear from a physicist, chemist, or biologist:

"If computer scientists do not get involved in our field, we will do it ourselves!!"

It looks very likely that the biological sciences (including, of course, neuroscience) will dominate the 21st century.

Differences in approaches

- Most scientific and creative discoveries proceed in a bottom-up manner
- Computer scientists are taught to emphasize top-down approaches
- Polya's "How to Solve It" often mentions: first specialize, then generalize
- Hacking is beautiful (mostly bottom-up)

Objectives

Provide a bird's-eye view of what is happening in CS education (USA) and attempt to make recommendations about possible directions. Hopefully, some of it will be applicable to European universities.

Premise

Changes ought to be gradual and depend on resources and time constraints.

First we have to observe current trends

- Generality, storage, speed, networks, others
- Trying to make sense of present directions
- Difficult and risky to foresee the future, e.g., PC (windows, mouse), internet, parallelism
- Topics influencing computer science education
- Trends in hardware, software, networks

Huge volume of data
(terabytes and petabytes)

- Statistical nature of data
- Clustering, classification
- Probability and statistics become increasingly important
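The "clustering, classification" point can be made concrete with the simplest clustering algorithm. Below is a minimal one-dimensional k-means sketch in pure Python; the sample data and the deterministic range-based initialization are illustrative assumptions, not from the slides.

```python
def kmeans(points, k, iters=20):
    """Minimal 1-D k-means (k >= 2): assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    lo, hi = min(points), max(points)
    # Deterministic initialization: spread centroids across the data range
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Empty clusters keep their old centroid
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups of measurements
print(kmeans([1.0, 1.2, 0.8, 10.0, 10.3, 9.7], 2))  # centroids near 1.0 and 10.0
```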

Trend towards generality

Need to know more about what is going on in related topics. A few examples:

- Robotics and mechanical engineering
- Hardware, electrical engineering, materials science, nanotechnology
- Multi-field visualization (e.g., medicine)
- Biophysics and bioinformatics

Nature of data structures

- Sequences (strings), streams
- Trees, DAGs, and graphs
- 3D structures
- Emphasis on discrete structures
- Neglect of the continuous should be corrected (e.g., use of MATLAB)

Trends in data growth

How Much Information Is There In the World?

The 20-terabyte size of the Library of Congress is derived by assuming that LC has 20 million books and each requires 1 MB. Of course, LC has much other stuff besides printed text, and this other stuff would take much more space.

From Lesk: http://www.lesk.com/mlesk/ksg97/ksg.html

Library of Congress data
(cont)
1. Thirteen million photographs, even if
compressed to a 1 MB JPG each, would be 13
terabytes.
2. The 4 million maps in the Geography Division
might scan to 200 TB.
3. LC has over five hundred thousand movies; at
1 GB each they would be 500 terabytes (most
are not full-length color features).
4. Bulkiest might be the 3.5 million sound
recordings, which at one audio CD each, would
be almost 2,000 TB.
This makes the total size of the Library perhaps
about 3 petabytes (3,000 terabytes).
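Summing the slide's own per-component estimates reproduces the rounded total:

```python
# All figures in terabytes, taken directly from the list above.
components = {
    "printed text (20 M books at 1 MB each)": 20,
    "photographs (13 M at 1 MB JPG each)": 13,
    "maps (4 M, scanned)": 200,
    "movies (0.5 M at 1 GB each)": 500,
    "sound recordings (3.5 M audio CDs)": 2000,
}
total_tb = sum(components.values())
print(total_tb)  # 2733 TB -- which rounds to "about 3 petabytes"
```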

How Much Information Is There In the
World?

Lesk’s Conclusions


There will be enough disk space
and tape storage in the world to
store everything people write,
say, perform or photograph. For
writing this is true already; for the
others it is only a year or two
away.

Lesk’s Conclusions (cont)


The challenge for librarians and
computer scientists is to let us find
the information we want in other
people's work; and the challenge for
the lawyers and economists is to
arrange the payment structures so
that we are encouraged to use the
work of others rather than re-create it.

The huge volume of data implies:

- Linearity of algorithms is a must
- Emphasis on pattern matching
- Increased preprocessing
- Different levels of memory transfer rates
- Algorithmic incrementality (avoid redoing tasks)
- Need for approximate algorithms (optimization)
- Distributed computing
- Centralized parallelism (Blue Gene, Argonne)

The importance of pattern matching (searches) in large numbers of items

Pattern matching has to be "tolerant" (approximate): find the closest matches (dynamic programming, optimization) in:

- Sequences
- Pictures
- 3D structures (e.g., proteins)
- Sound
- Photos
- Video

Trends in computer cycles
(speed)


Moore’s law appears to be applicable until at
least 2020

Use of supercomputers
(2006)


Researchers at Los Alamos National Laboratory have set a new world record by performing the first million-atom computer simulation in biology. Using the "Q Machine" supercomputer, Los Alamos computer scientists have created a molecular simulation of the cell's protein-making structure, the ribosome. The project, simulating 2.64 million atoms in motion, is more than six times larger than any biological simulation performed to date.

Graphical visualization of the
simulation of a Ribosome at
work

Network transmission
speed (Lambda Rail Net)


USA backbone

Trends in Transmission Speed

The High Energy Physics team's demonstration achieved a peak throughput of 151 Gbps and an official mark of 131.6 Gbps, beating their previous peak-throughput mark of 101 Gbps by 50 percent.

Trends in Transmission
Speed II


The new record data transfer
speed is also equivalent to
serving 10,000 MPEG2 HDTV
movies simultaneously in real
time, or transmitting all of
the printed content of the
Library of Congress in 10
minutes.

Trend in Languages

- Importance of scripting and string processing
- XML, Java, C++; trend towards Python, Matlab, Mathematica
- No ideal languages
- No agreement on what the first language ought to be

A recently proposed language (Fortress, 2006)

From Guy Steele, The Fortress Programming Language, Sun Microsystems:
http://iic.harvard.edu/documents/steeleLecture2006public.pdf

Fortress Language
(Sun, Guy Steele)

Meta-level approach to teaching

- Learn 2 or 3 languages and assume that expertise in other languages can be acquired on the fly.
- Hopefully, the same will occur in learning a topic in depth: once in-depth research is taught using a particular area, it can be extrapolated to other areas.
- Increasing usage of canned programs or data banks. Typical examples: GraphViz, WordNet.

Trends in Algorithmic Complexity

- Overcoming the scare of NP problems (it happened before with undecidability)
- 3-SAT lessons
- Mapping polynomial problems within NP
- Optimization, approximate or random algorithms
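The "random algorithms" point has a classic illustration: a uniformly random assignment satisfies each 3-literal clause with probability 7/8, so in expectation 7/8 of all clauses are satisfied, and repeated sampling often does even better. A sketch (the clause encoding as signed variable indices is an illustrative convention, not from the slides):

```python
import random

def satisfied(clauses, assignment):
    """Count clauses with at least one true literal; a literal +i
    denotes variable i, and -i denotes its negation."""
    return sum(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

def random_max3sat(clauses, nvars, trials=200, seed=1):
    """Try random assignments and keep the best count found."""
    rng = random.Random(seed)
    best = 0
    for _ in range(trials):
        a = {v: rng.random() < 0.5 for v in range(1, nvars + 1)}
        best = max(best, satisfied(clauses, a))
    return best

# 4 clauses over 3 variables; a random assignment satisfies
# 7/8 * 4 = 3.5 clauses in expectation; repetition finds all 4.
clauses = [(1, 2, 3), (-1, 2, 3), (1, -2, 3), (1, 2, -3)]
print(random_max3sat(clauses, 3))
```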

Three Examples

- Example I: The lessons of BLAST (preprocessing, incrementality, approximation)
- Example II: The importance of analyzing very large networks (probability, sensors, sociological implications)
- Example III: Time series (data mining, pattern searches, classification)

Example I
(History of BLAST)
sequence alignment

Biologists matched sequences of nucleotides or amino acids empirically using dot matrices.

Dot matrices

No exact matching

Alignment with Gaps

Dynamic Programming
Approach

Dynamic Programming
complexity: O(n²)
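A minimal sketch of that quadratic table-filling algorithm (the Needleman-Wunsch global-alignment score; the scoring weights here are illustrative assumptions):

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming: fills an
    (len(a)+1) x (len(b)+1) table, hence the O(n^2) time and space."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

print(align_score("ACGT", "AGT"))  # 2: three matches, one gap
```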

Two solutions with gaps

Complexity can be exponential
for determining all solutions

The BLAST approach: complexity is almost linear

The equivalent dot matrix would have 3 billion columns (human genome) and Z rows, where Z is the length of the sequence being matched against the genome (possibly tens of thousands).

BLAST Tricks

- Preprocessing: compile the locations in a genome containing all possible "seeds" (combinations of 6 nucleotides or amino acids)
- Hacking: follow diagonals as much as possible (the BLAST strategy)
- Use dynamic programming as a last resort
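The seed-index preprocessing can be sketched in a few lines. The toy genome string below is invented for illustration, and the extension and dynamic-programming stages of the real algorithm are omitted:

```python
from collections import defaultdict

SEED = 6  # seed length, as on the slide (words of 6 letters)

def build_seed_index(genome):
    """Preprocessing: record every position of every length-6 word
    ("seed") occurring in the genome."""
    index = defaultdict(list)
    for i in range(len(genome) - SEED + 1):
        index[genome[i:i + SEED]].append(i)
    return index

def seed_hits(query, index):
    """Look up each seed of the query; each hit names a diagonal that
    the full algorithm would then try to extend."""
    hits = []
    for j in range(len(query) - SEED + 1):
        for i in index.get(query[j:j + SEED], []):
            hits.append((i, j, i - j))  # genome pos, query pos, diagonal
    return hits

genome = "ACGTACGTTAGCACGTACGA"  # toy "genome"
print(seed_hits("ACGTAC", build_seed_index(genome)))  # [(0, 0, 0), (12, 0, 12)]
```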

Lots of approximations but a very successful outcome

- No multiple solutions
- BLAST may not find the best matches
- The notion of p-values becomes very important (probability of matches in random sequences)
- Tuning of the BLAST algorithm parameters
- Mixture of hacking and theory
- Advantage: satisfies incrementality

Example II
(Networks and Sociology)

Money travels (bills)

Probabilities
P(time,distance)

Money travels

- The entire process could be implemented using sensors
- Mimics the spread of disease
- The impact of computing will go deeper into the sciences and spread more into the social sciences (Jon Kleinberg, 2006)

Example III (Time Series)
Illustrates data mining and how much CS can help other sciences

Slides from Dr. Eamonn Keogh, University of California, Riverside, CA


Examples of time
series

Time Series (cont 1)

Time Series (cont 2)

Time Series (cont 3)

Time Series (cont 4)

Time Series (cont 5)
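The slides above are graphical, but the similarity-search primitive this kind of time-series data mining typically rests on is z-normalized Euclidean matching of a query shape against every subsequence of a series. A minimal sketch (the toy series and query are invented for illustration):

```python
import math

def znorm(xs):
    """Z-normalize a series so comparisons ignore offset and scale."""
    mu = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - mu) / sd for x in xs]

def best_match(series, query):
    """Slide the query over the series; return the offset of the
    subsequence with the smallest Euclidean distance to it, with
    both sides z-normalized first."""
    q, m = znorm(query), len(query)
    best = (float("inf"), -1)
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w)))
        best = min(best, (d, i))
    return best[1]

# The query has the same *shape* as the bump starting at offset 3,
# even though its amplitude is ten times larger.
print(best_match([0, 0, 0, 1, 2, 3, 2, 1, 0, 0], [10, 20, 30, 20, 10]))  # 3
```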

Using Logic Programming in Multivariate Time Series (Sleep Apnea)
from G. Guimarães and L. Moniz Pereira

[Figure: a multivariate time-series recording (airflow, ribcage movements, abdominal movements, snoring) from 04:00:00 to 04:04:00, annotated with detected events and intervals such as "strong airflow with snoring", "no airflow without snoring", "strong ribcage and abdominal movements", and "tacet".]

Back to curricula recommendations

Present status (USA) and suggested changes

Current recommended
curricula ACM, SIGCSE 2001 (USA)
1. Discrete Structures (43 core hours)
2. Programming Fundamentals (54 core hours)
3. Algorithms and Complexity (31 core hours)
4. Programming Languages (6 core hours)
5. Architecture and Organization (36 core hours)
6. Operating Systems (18 core hours)
7. Net-Centric Computing (15 core hours)
8. Human-Computer Interaction (6 core hours)
9. Graphics and Visual Computing (5 core hours)
10. Intelligent Systems (10 core hours)
11. Information Management (10 core hours)
12. Software Engineering (30 core hours)
13. Social and Professional Issues (16 core hours)
14. Computational Science (no core hours)
From Domik, G.: Glimpses into the Future of Computer Science Education, University of Paderborn, Germany

Changing Curricula

Two extremes:

- Increased generality and limited depth
- Limited generality and increased depth

The two extremes in graphical form

[Figure: trade-off between breadth (generality) and depth]

The MIT pilot program for freshmen

At MIT there is a unified EECS department.

Two choices for the first-year course:

- Robotics using probabilistic Bayesian approaches (CS)
- Study of cell phones inside out (EE)

Concrete suggestions I

- Teaching is inextricably linked to research.
- Time and resources govern curriculum changes.
- Gradual changes are essential.
- Avoid overlap of material among different required courses.
- If possible, introduce an elective course on current trends in computer science.
- Deal with massive data even in intro courses.

Concrete suggestions II

When teaching algorithms, stress the potential of:

- Preprocessing
- Incrementality
- Parallelization
- Approximations
- Taking advantage of sparseness

Concrete suggestions III

- Emphasize probability and statistics
- Bayesian approaches
- Hidden Markov Models
- Random algorithms
- Clustering and classification
- Machine learning and data mining

Finally, …

Encourage interdisciplinary work. It will inspire new directions in computer science.

Thank you!!

Future of Computer Intensive
Science in the U.S. (Daniel Reed 2006)




Ten years – a geological epoch on the computing time scale.
Looking back, a decade brought the web and consumer email,
digital cameras and music, broadband networking, multifunction
cell phones, WiFi, HDTV, telematics, multiplayer games,
electronic commerce and computational science.
It also brought spam, phishing, identity theft, software insecurity,
outsourcing and globalization, information warfare and blurred
work-life boundaries. What will a decade of technology advances
bring in communications and collaboration, sensors and
knowledge management, modeling and discovery, electronic
commerce and digital entertainment, critical infrastructure
management and security?



What will it mean for research and education?



Daniel A. Reed is the director of the Renaissance Computing Institute. He also is Chancellor's
Eminent Professor and Vice-Chancellor for Information Technology at the University of North
Carolina at Chapel Hill.

Cyberinfrastructure and Economic Curvature: Creating Curvature in a Flat World (Sangtae Kim, Purdue, 2006)

Cyberinfrastructure is central to scientific advancement in the modern, data-intensive research environment. For example, the recent revolution in the life sciences, including the seminal achievement of sequencing the human genome on an accelerated time frame, was made possible by parallel advances in cyberinfrastructure for research in this data-intensive field.



But beyond the enablement of basic research, cyberinfrastructure is a driver for global economic growth despite the disruptive 'flattening' effect of IT in the developed economies. Even at the regional level, visionary cyber investments to create smart infrastructures will induce 'economic curvature', a gravitational pull to overcome the dispersive effects of the 'flat' world, and a consequential acceleration in economic growth.

Miscellaneous I

- Claytronics
- Game theory (economics, psychology)
- Other examples in bioinformatics
- Beautiful interaction between sequences (strings) and structures
- Reverse engineering
- In biology, geography and phenotype (external structural appearance) are of paramount importance
- Systems biology

Miscellaneous II

- Crossword puzzles using Google
- Skiena and statistical NLP